Posted by Corey Northcutt

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.

If you've experienced the joys of doing SEO on an exceedingly large site, you know that keeping your content in check isn't easy. Continued iterations of the Panda algorithm have made this fact brutally obvious for anyone who's responsible for more than a few hundred thousand pages.

As an SEO with a programming background and a few large sites to babysit, I was forced to fight the various Panda updates throughout this year with some creative server-side scripting. I'd like to share some of it with you now. In case you're not well-versed in nerdspeak (data formats, programming, and Klingon), I'll start each item with the conceptual problem, follow with the solution (so at least you can tell your developer what to do), and finish with a few code examples for implementation (in case they didn't understand you when you told them what to do). My links to the actual code are in PHP/MySQL, but these methods translate pretty simply into almost any scenario.

OBLIGATORY DISCLAIMER: Although I've been successful at implementing each of these tricks, be careful. Keep current backups, log everything you do so that you can roll back, and if necessary, ask an adult for help.

1.) Fix Duplicate Content between Your Own Articles

The Problem

Sure, you know not to copy someone else's content. But what happens when over time, your users load your database full of duplicate articles (jerks)? You can write some code that checks if articles are an exact match, but no two are going to be completely identical. You need something that's smart enough to analyze similarity, and you need to be about as clever as Google is at it.

The Solution

Levenshtein distance is a well-established way to measure how similar two bodies of text are. It counts how many single-character edits would be necessary to transform one string into another, and that count can be translated into a percentage/ratio of how similar the two strings are. When running this maintenance script on 1 million+ articles of 50-400 words each, deleting only articles with at least 90% similarity by Levenshtein ratio, there were no incorrect deletions in any of my trials (and the list of deletions was a little scary, to say the least).

The Technical

Levenshtein comparison functions are available in basically every programming language and are pretty simple to use. Running comparisons on 10,000 individual articles against one another all at once is definitely going to make your web/database server angry, however, so it takes a bit of creativity to finish this process while we're all still alive to see your ugly database.

levenshtein distance function
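
For readers without access to the PHP version, here is a minimal sketch of the idea in Python. It is not the code referenced above. The function names are hypothetical, and dividing by the length of the longer string is only one common way to turn the distance into a ratio (an assumption, since the exact formula isn't stated here).

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0.
    Assumption: normalise the distance by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With a helper like this, two articles would count as duplicates when `similarity_ratio(first, second) >= 0.90`.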

What follows may not be ideal practice, or something you want to experiment with heavily on a live server, but it gets this tough job done in my experience.

  1. Create a new database table where you can store a single INT value (or if this is your own application and you're comfortable doing it, just add a row somewhere for now). Then create one row that has a default value of 0.
     
  2. Have your script connect to the database, and get the value from the table above. That will represent the primary key of the last article we've checked (since there's no way you're getting through all articles in one run).
     
  3. Select that article, and check it against all other articles by comparing Levenshtein distance. Doing this in the application layer will be far faster than running comparisons as a database stored procedure (I found the best results occurred when using levenshteinDistance2(), available in the comments section of levenshtein() on php.net). If your database size makes this run like poop through a funnel (checking just 1 article against all others at once), consider only comparing articles by the same author, of similar length, posted in a similar date range, or other factors that might help reduce your data set of likely duplicates.
     
  4. Handle the duplicates as you see fit. In my case, I deleted the newer entry and stored a log in a new table with full text of both, so individual mistakes could later be reverted (there were none, however). If your database isn't so messy or you still fear mistakes after testing a bit, it may very well be good enough just to store a log and later review them by hand.
     
  5. After you're done, store the primary key of the last article that you checked in the database entry from step 1. You can loop through steps 2 through 4 a few more times on this run if this didn't take too long to execute. Run this script as many times as necessary on a one minute cronjob or with the Windows Task Scheduler until complete, and keep a close eye on your system load.

2.) Spell-Check Your Database

The Problem

Sure, it would be best if your users were all above a third grade reading level, but we know that's not the case. You could have a professional editor run through content before it went live on your site, but now it's too late. Your content is now a jumbled mess of broken English, and in dire need of a really mean English teacher to set it all straight.

The Solution

Since you don't have an English teacher, we'll need automation. In PHP, for example, we have fun built-in tools like soundex(), or even levenshtein(), but when analyzing individual words, these just don't cut it. You could grab a list of the most commonly misspelled English words, but that's going to be hugely incomplete. The best solution that I've found is an open source (free) spell checking tool called the Portable Spell Checker Interface Library (Pspell), which uses the Aspell library and works very well.

The Technical

Once you get it set up, working with Pspell is really simple. After you've installed it using the link above, include the libraries in your code, and use this function to return an array of suggestions for each word, with the word at array key 0 being the closest match found. If it looks like too much to tackle at once, consider the basic logic from 1.): increment your place as you step through the database, log all actions in a new table, and (carefully) decide whether you like the results well enough to automate the fixes or would prefer to chase them by hand.

pspell example
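
Pspell itself is a PHP extension, so the sketch below does not use it. As a stand-in, it uses Python's standard `difflib` module to show the same shape of function: pass in a word, get back a ranked list of suggestions with the closest match at index 0. The tiny vocabulary and the cutoff value are hypothetical, and a real dictionary such as Aspell's will give far better results than plain string similarity.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a real dictionary word list.
VOCABULARY = ["receive", "believe", "retrieve", "separate", "definitely"]


def suggest(word, vocabulary=VOCABULARY, limit=3, cutoff=0.7):
    """Return up to `limit` candidate spellings, best match first.
    An empty list means nothing in the vocabulary was close enough."""
    return get_close_matches(word.lower(), vocabulary, n=limit, cutoff=cutoff)


def needs_review(word, vocabulary=VOCABULARY):
    """True when the word is not in the vocabulary as written."""
    return word.lower() not in vocabulary
```

Logging each `(original word, suggestion)` pair before changing anything keeps the option of reviewing the fixes by hand.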

3.) Implement rel="canonical" in Bulk

The Problem

link rel="canonical" is a very useful tag for eliminating confusion when two URLs might potentially return the same content, such as when Googlebot makes its way to your site using an affiliate ID. In fact, the SEOmoz automated site analysis will yell at you on every page that doesn't have one. Unfortunately, since this tag is page-specific, you can't just paste some HTML in the static header of your site.

The Solution

This assumes that you have a custom application, so you can't simply install ALL IN ONE SEO on WordPress or a similar SEO plugin (if you can, don't re-invent the wheel). In that case, we can tailor a function to serve your unique purposes.

The Technical

I've quickly crafted this PHP function with the intent of being as flexible as possible. Note that desired URL structures are different on different sites and scripts, so think about everything that's installed under a given umbrella. Use the flags mentioned in the description section so that it can best mesh with the needs of your site.
canonical link function
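
As a rough illustration of what such a function might do, here is a Python sketch (not the PHP function referenced above). It builds the tag from the requested URL by dropping the fragment and every query parameter that isn't explicitly whitelisted, which is how an affiliate ID would be stripped. The parameter names `keep_params`, `force_scheme`, and `force_host` are hypothetical flags invented for this example.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_link_tag(url, keep_params=(), force_scheme=None, force_host=None):
    """Build a <link rel="canonical"> tag for the given request URL.

    keep_params:  query parameters that really select content (e.g. an id)
    force_scheme: optional scheme to standardise on, such as "https"
    force_host:   optional preferred hostname, such as "www.example.com"
    """
    parts = urlsplit(url)
    scheme = force_scheme or parts.scheme
    host = (force_host or parts.netloc).lower()
    # Keep only whitelisted parameters; tracking and affiliate IDs fall away.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key in keep_params]
    # An empty final element drops any #fragment from the URL.
    canonical = urlunsplit((scheme, host, parts.path or "/", urlencode(kept), ""))
    return '<link rel="canonical" href="%s" />' % escape(canonical, quote=True)
```

For example, `canonical_link_tag("http://Example.com/articles/view?id=42&aff=123", keep_params=("id",))` keeps `id` and discards `aff`. Which parameters belong on the whitelist depends entirely on how a given site builds its URLs.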

4.) Remove Microsoft Word's "Smart Quote" Characters

The Problem

In what could be Microsoft's greatest crime against humanity, MS Word was shipped with a genius feature that automatically "tilts" double and single quotes towards a word (called "smart quotes"), in a style that's sort of like handwriting. You can turn this off, but most don't, and unfortunately, these characters are not a part of the ASCII set. This means that various character sets used on the web and in databases that store them will often fail to present them, and instead, return unusable junk that users (and very likely, search engines) will hate.

The Solution

This one's easy: use find/replace on the database table that stores your articles.

The Technical

Here is an example of how to fix this using MySQL database queries. Place a script on an occasional cron in Linux or using the Task Scheduler in Windows, and say goodbye to these ever appearing on your site again.

smart quotes mysql
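
The same find/replace can be sketched at the application layer. This Python version is an illustration only, not the MySQL queries referenced above. It assumes the text has been decoded correctly, so the curly quotes arrive as their real Unicode characters. Text that was already mangled by a character set mismatch would need its byte sequences mapped instead.

```python
# Map each curly quote to its plain ASCII equivalent.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text):
    """Replace Word-style smart quotes with straight quotes."""
    return text.translate(SMART_QUOTES)
```

In MySQL, the equivalent would be one `UPDATE` per character using the built-in `REPLACE()` string function on the article column.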

5.) Fix Failed Contractions

The Problem

Your contributors are probably going to make basic grammar mistakes like this all over the map, and Google definitely cares. While it's important never to make too many assumptions, I've generally found that fixing common contractions is very sensible.

The Solution

You can use find/replace here, but it's not as simple as the solution fixing smart quotes, so you need to be careful. For example "wed" might need to be "we'd", or it might not. Other contractions might make sense while standing on their own, but find/replace by itself will also return results that are pieces of other words. So, we need to account for this as well.

The Technical

Note that there are two versions of each word. This is because in my automated proofreading trials, I've found it's common not only for an apostrophe to be omitted, but also for a simple typo to put the apostrophe after the last letter when Word's automated fix for this isn't on hand. Words have also been surrounded by spaces to reduce the margin of error (this is key: just look at how many other words include 'dont' on one of these sites that people use to cheat in word games). Here's an example of how this works. This list is a bit incomplete, and probably leaves the most room for improvement of anything in this post. Feel free to generate your own using this list of English contractions.
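
To make the two precautions concrete, here is a small Python sketch (an illustration, not the find/replace queries described above). It uses regular-expression word boundaries where the original surrounds each word with spaces, and it also catches the variant with the apostrophe typed after the last letter. The word list is a deliberately short, hypothetical sample that leaves out ambiguous cases such as "wed".

```python
import re

# Only forms that are not valid English words on their own.
CONTRACTIONS = {
    "dont": "don't", "doesnt": "doesn't", "didnt": "didn't",
    "isnt": "isn't", "wasnt": "wasn't", "couldnt": "couldn't",
    "wouldnt": "wouldn't", "shouldnt": "shouldn't",
}

# \b keeps matches from firing inside longer words; the optional trailing
# apostrophe covers typos like "dont'".
PATTERN = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")\b'?", re.IGNORECASE)


def fix_contractions(text):
    """Insert the missing or misplaced apostrophe in known contractions."""
    def replace(match):
        fixed = CONTRACTIONS[match.group(1).lower()]
        # Keep a leading capital, e.g. at the start of a sentence.
        return fixed.capitalize() if match.group(1)[0].isupper() else fixed
    return PATTERN.sub(replace, text)
```

One caveat: because the pattern swallows an apostrophe that directly follows the word, a phrase wrapped in single quotes that ends in one of these words would lose its closing quote, which is another reason to log changes before applying them.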

That should about do it. I hope everyone enjoyed my first post here on SEOmoz, and hopefully this stirs some ideas on how to clean up some large sites!


Posted by Cyrus Shepard

We love exact match anchor text! It’s the Holy Grail of links that make our rankings soar – or does it? Many SEOs predict Google will continue to devalue exact match anchors as their algorithm evolves in the age of Panda. We’ve seen evidence of this phenomenon over the past year and many expect to see the value of exact match drop even further.

Many webmasters wonder if they should give up link building altogether. Not at all! Search engines collect a ton of data through links to better understand your content and how valuable it is. Recognizing these link signals can help you make the most out of every link you gain. Do you have any tips on anchor text? Let us know in the comments below!

 

Video Transcription

Howdy SEOmoz! Welcome to another edition of Whiteboard Friday. My name is Cyrus. I do SEO here at SEOmoz. This week I want to talk about anchor text. Every week I get emails, I am sure you do too, from webmasters asking for a link, and they always want that exact match anchor text for the specific term they’re trying to rank for. It is a good practice. It works well. But things are changing in the SEO world.

1. Exact Match

In the old days, if you wanted to rank for something, your tactic was very simple. If your target keyword was Bing cherries, you just tried to get as many exact match anchor text that said Bing cherries as possible to your website. Those of you who have been practicing SEO for a long time noticed something about a year and a half ago or so, that this method did not work as well as it used to. If you got too many exact match anchor texts, it could actually hurt you. That’s why you say, that’s such a 2009 tactic.

Now with the Google Panda update, we’re talking about a whole other realm of ranking signals, such as engagement metrics, social signals, but we don’t want to forget these link signals. Even if exact match isn’t the end all be all, there is still a lot of information that Google and other search engines are getting from these link signals, and that’s what we want to talk about today.

2. Partial Match

Now, one of the most overlooked types of anchor text links is the partial match, and I am in love with partial match. I really quit going for these a long time ago. Now it is all about partial match. People sort of misunderstand what partial match is. The technical definition of partial match is any anchor text that contains at least one of your keyword phrases. So, if your keyword phrase was Bing cherries, these would all count as partial match anchor texts: Bing are the best cherries; I love cherries; Bing is awesome. Yeah, it’s probably not what they are talking about, but it is still technically partial match anchor text.

If you are a fan of the 2011 Ranking Factors that SEOmoz did – we’ll link to it in the text below – we took a look, one of the factors we looked at was the power of partial match anchor text versus exact match anchor text. Now, in general, if you look at the root domain metrics, the correlation between the number of exact match anchor text was 0.17. All things being equal, the power of partial match anchor text was 0.25. Significantly more power and more correlation between the number of partial match anchor text and exact match anchor text. So, all things being equal, it seems like people rank higher, just a little bit, if they have more partial match as opposed to these exact match that everybody is always going for.

This is how I’d like to explain it. If you give me a choice, if you could say I could have any 300 links I want but they have to be 300 partial match anchor text or 300 exact match anchor text, a lot of webmasters would go for this thinking it is the best policy. Statistically though, this is your best choice. This is going to contain some of your exact matches, but you’re going to have such a bigger broad tail, long tail queries that you can rank for. You’re going to get more traffic. You’re going to rank better for your targeted keywords, and this method is future proof. As Google deemphasizes these exact matches, this is going to take you forward in the long run. Those links are going to have a lot longer long-term value, and it is just going to give you a better natural looking link profile.

3. Context, Placement and Relevance

Other link signals, how do you make these links count? If you’re not getting the exact match anchor text, what are other context signals that Google could be looking at? Well, first of all, they are going to be looking at the on-page signals of the page that’s giving you the link. If you are trying to rank for Bing cherries, you want the title tag of that page to be cherries. There is an article Rand wrote a couple years ago, "The Perfectly Optimized Page." All those on page signals, those are what you want on the page linking to you – the title tag, the H1 headers, keyword usage, alt text in the photo. Those are all signals to Google that this page is about Bing cherries. It’s linking to you. You’re more likely to interpret that as this link is about Bing cherries.

Context, Google is getting increasingly more sophisticated at being able to do block analysis and determine what the page is about. So, if you have a section of ads, Google can kind of tell that is a section of ads. If you have a link in that section of ads, probably not going to count for very much. Same on the sidebar. If you have a link about Bing cherries on a page about monkeys and it is hidden in this link of text, well, the context and the placement of that link, Google says that’s probably not about cherries. It looks kind of like a paid ad, and that’s not going to count for very much. So, context, on page signals, all those traditional on page optimization, those things that you would want on your own page, you want to look for from the link.

4. The Future of Link Signals

Google is spending a lot of money to learn how to understand pages, to learn context. The days of the dumb search engine are kind of leaving us behind. Google is getting better and better at figuring out what these pages are about. If you read Google patents, which a lot of us like to do, SEO by the Sea is a great blog to read, they’re seeing patents such as sentiment analysis, such as in online reviews. Google will actually try to figure out if that review is a positive review or a negative review. So, even if you get the link, if there are words around it like Joe’s Pizza sucks, well that might not be, in the future, as good a link as Joe’s Pizza is awesome. Now, this is all theory. We don’t have the data and the facts to back this up, but the patents tell us this is where the future is going. Author profiling, the author tags that Google is using, they might be asking is this person an authority? If Rand Fishkin links to you with anchor text about SEO, Google may in the future decide Rand Fishkin is an expert about SEO. That link is so much more important than Joe Schmoe SEO because they know his author profile.

In the end, this system was easy to game. Exact match profiles, very easy to game. That’s why it went away. In the future, it is much harder to game. Search engines are becoming sophisticatedly more like human beings. So, when we look at these pages, we have to be human as SEOs. We have to judge these pages like a human. We have to write them like a human. We have to link like a human. The higher quality you do that, the longer your strategy is going to work and anchor text, linking signals, they’re all going to work for you.

That’s all. Thank you very much.

Video transcription by Speechpad.com

Bonus – A Final Note

Different SEOs hold widely varying opinions as to how much exact match anchor text is "too much." Estimates range from 25% to 80%. I don’t believe there is any perfect ratio, as other factors such as source, context and authority play significant roles. While there certainly needs to be more study in this area, I found the following articles interesting:

As always, I’m interested in your thoughts and recommendations about "perfecting" anchor text.


Posted by Dr. Pete

Okay, deep breath. I AM SUPER EXCITED… Sorry, let’s try again *breathes into bag*. I am very excited to announce SEOmoz’s first “living document”, a complete history of named Google algorithm changes, from “Boston” in 2003 to Panda 2.3 (or whatever the kids are calling it these days). Why don’t you check out this sneak peek while I try to calm down…

Screenshot of algo history page

This started as a simple blog post, trying to pull together the complete list of named updates, but we soon realized the value of keeping a history of Google updates as a long-term archive. While Google makes hundreds of changes every year, Panda has proven once again that the major updates do matter to businesses, and it’s useful to know when the rules change on a large scale.

Within each year, you’ll see a breakdown like this, complete with description and links:

Sample algo history listing

For 2003-2010, updates are listed by month only. For 2011 changes (and going forward), we’ve provided exact dates, when possible. Some of these are estimates, but we want to try to isolate changes as precisely as we can, so that you can map them against their SEO impact on your own sites.

If you can’t wait, here’s the permanent link to the Google Algorithm Change History page.

What’s A Living Document?

The algorithm is constantly changing, so we designed this document to change with it. I was adding updates to the list (Google+ and Panda 2.3) as recently as last week. In addition, we recognize that the timeline isn’t exact. We rely a lot on the archival knowledge of the SEO community, and Google doesn’t publish official lists of updates, even the big ones. So, we welcome your feedback, both corrections and additions. Although this was a team effort, a lot of the initial research was mine, and, as they say, the buck stops here. If you see something you think is wrong, let me know – comment, DM, Tweet, email me, whatever you like. We’ve also included a dedicated update email on the main document.

Google Algo Change History

I’d Like to Thank…

This wouldn’t have been possible without a lot of help, both internally and from the industry as a whole. First off, I’d like to thank Cyrus, Casey and Matt for moral support and heroically turning my barely comprehensible Google doc into a thing of beauty.

Special thanks go to SEOmoz member Barry Smith, who answered a public Q&A question about the Google algo history with an incredible off-the-top-of-his-head response. We had been pondering this for a bit, and the amazing public response to his answer demonstrated just how much people wanted to see this data all in one place.

Finally, I’d like to thank all of the industry people who chimed in on dates and details on Twitter, including heavy hitters like Bill Slawski, Brett Tabke, and Ted Ulle. You’ll notice that the vast majority of links on the document are to sites other than SEOmoz – our goal is to build the best reference we can.

Want to Read More?

When I was doing my initial pass, trying to build a skeleton of named updates and rough dates (which got researched update-by-update later), I came across the following useful resources that you might also be interested in:

I hope you find the resource useful, and please feel free to contact us with any corrections or additions. Once again, here’s the link to the permanent Google Algorithm Change History page.


Algorithm – SEO Dictionary

An algorithm is a set of finite, ordered steps for solving a mathematical problem. Each Search Engine uses a proprietary algorithm set to calculate the relevance of its indexed web pages to your particular Query. The result of this process is a list of sites ranked in the order that the search engine deemed most relevant. Search engine algorithms are closely guarded in order to prevent exploitation of algorithmic results. Search algorithms are also changed frequently to incorporate new data and improve relevancy.