Designing Websites with Panda, RankBrain, and Phantom in Mind

Much has been written about the Google Panda algorithm, and other quality and core algorithm updates such as Phantom or RankBrain.

panda-phantom-rankbrain

Images courtesy Ed Schipul, Allan Ajifo, and Will Sowards on Flickr.

Have these algo changes affected your site’s rankings? I guarantee it…even if you didn’t see a boost or a push-down in the rankings, you can be sure that some of your competitors got hit and lost rankings, and probably some of them got a boost as well. What makes it tough on webmasters and SEO people is that while we know vaguely the kinds of things Google is looking at when it comes to site quality, Google is thin on delivering specifics of what they’re targeting, and even thinner on specific technical advice–presumably to avoid giving the black-hat SEOs too much information.

Google’s Goals on Quality

Panda originally launched back in February of 2011, with the idea being to analyze the content on a web page and score it on how good the content and the user experience were. Prior to Panda, Google's on-page analysis seemed to be mostly scanning HTML elements like the page title, headings, body text, ALT text on images, etc. to see what terms it found related to the search term in question. Panda brought the ability to render a page much like a browser would, analyze the layout of the content, and see what percentage of a page was actual content vs. navigation, template elements, and ads. Panda also brought the ability to look at a site's quality overall–and if your site is mostly really thin pages that are mostly template and little content, or duplicates of each other, or mostly syndicated content from other sites, then sending a user to even one of the great pages on your site could be a bad user experience, since most of the other pages they might click to after that one really suck.

Phantom, version 3 of which hit on November 19th, 2015, appears to target (and punish) pages that are really just indexes or catalogs of content, like blog category or tag archive pages. Marcus Tober of SearchMetrics dives into this in more detail here. The general idea: if a user is looking for a page all about purple widgets, a page that's just a list of snippets and links to other pages about purple widgets isn't nearly as good a result as a page whose content was written specifically to explain all about purple widgets. So Google wants to push those category/tag archive pages down in the results and let the custom, dedicated pages bubble up in the rankings.

RankBrain was announced in October 2015, despite apparently having been integrated into the core algorithm sometime in the spring of that year. This part of the algorithm is now apparently the 3rd most important signal in terms of relevance, according to Greg Corrado, a Google senior research scientist.

RankBrain appears to be able to score a page for relevance for a given topic (notice I didn’t say “search term”) based on co-occurrence of other topics or sub-topics. It’s theorized by some (and I’d agree) that they’re doing this by measuring the occurrence of relatively unusual phrases on other pages about that topic on the web. The idea is this: let’s say the topic is “tahiti weather”. RankBrain might look at the top 100 pages or so (according to their other relevance algo scores), and perhaps finds “humidity” on 95% of those pages; finds “cyclones” on 40% of those pages; finds “surfing conditions” on 10% of those pages. So, if YOUR page doesn’t have “humidity” on it, it’s probably not really about Tahiti weather; if you’ve got “cyclones” on the page, then you’re covering the basics; if you’ve got “surfing conditions” on your page, then your page is likely to be one of the most thorough pieces of content on that topic.

Last but certainly not least, it's critical to understand that by and large, Google deals with single pages. The search results consist of links to single pages, not groups of pages. Yes, overall site quality and overall domain authority contribute to ranking. But if Google is looking for the perfect page about purple widgets, and you've split your content across 10 pages, Google is going to pick ONE of those ten pages, and that page is going to have to stand and compete–with ONLY its own content, forgetting about the other 9 pages–against your competitor, who's got all their juicy content packed into one page.

The Algo Update Process

The algorithm

Image courtesy x6e38 on Flickr.

Google’s got some smart engineers, for sure–but they’re not demi-gods. They’re regular software developers…imperfect humans, with schedules, deadlines, vacations, and likely a great fear of being responsible for an algo change mistake that gets them in the news.

As they work on a new piece of the algorithm, here’s what they’re probably going to do:

  • target a particular quality measurement
  • write a new piece of the algorithm that analyzes a page and comes up with a score
  • take what they hope is a representative sample of different kinds of web pages to test the algorithm against
  • compare the scores that it comes up with against what a human would feel about those pages
  • refine and adjust the amount of impact to try and strike a balance between improved search results for the vast majority of searches, and false positives/negatives because of unusual HTML or implementations that fool the algo

Google also has processes to consider: things like integrating Panda into the main algorithm, integrating various organic search algorithm pieces into local search, bug fixes, and adapting the new pieces of the algorithm where there are large numbers of false positives/negatives.

What Google Is Likely Trying to Measure Mechanically

What is Google trying to measure
Image courtesy Bill Brooks on Flickr.

At a very high level, it’s:

  • layout
  • originality
  • quality of writing
  • breadth and depth of coverage of topic
  • user experience on the landing page itself
  • user experience on the rest of the site after the landing page

Too Cool For School

As webmasters and designers push the envelope, developing new kinds of user experience, they run the risk of creating something that runs afoul of the quality-scoring pieces of Google's algorithms: content that Googlebot can't see, such as content loaded by Javascript/Ajax on a delay, or on a user action like a scroll of the page or a click of a Next Image button; pages that are beautiful, simple, light, and airy, but make the user scroll to find the information they came looking for; gorgeous, big images that are implemented as background images in CSS (is it just decoration? Wallpaper? Or real content?), or implemented using experimental new HTML such as <picture>.

Luckily, it’s easy to see if Google can see the content at all–just go to Google Search Console (aka Webmaster Tools), and do a Fetch and Render as Googlebot. You’ll see a snapshot of what Googlebot is PROBABLY seeing on your page. I say “probably” because the tool is notorious for timing out on fetching page elements like images, stylesheets, etc., and I’d guess that the actual rendering engine they use to measure the content is more forgiving. But maybe I’m wrong…

fetch-and-render-as-googlebot

What we can’t see is Google’s interpretation of a given piece of content. Is it looking at that medium-sized image that’s linked to another page as content? Or as a navigational button? Is it seeing your lovely product image slider as content about the product? Or because it’s a background-url in CSS, does Google think it’s just decoration, and the only content is the 3 words superimposed over the image?

The most beautiful, modern, stylish, slick user-experience website in the world is worthless if no customers come to visit it. If your implementation makes the content invisible to Google, or makes Google misinterpret what the content elements are, then you might be kissing organic search traffic goodbye.

It’s All a Bet

Do we know whether Google Panda is counting images that are linked to other pages as relevant content on the page, or presuming it's just a navigational element–a button? We don't. In the case of image carousels and sliders, so many are implemented as background-url in CSS that it's likely Panda is going to see those as content.

For the business owner, that’s a scary bet. Imagine if your boss came up to you and said there was a 70% chance you’d get a paycheck next week. That’s the risk you’re running for the website owner if you’re taking chances on how Google is seeing your content.

Best Practices

Googlers like Matt Cutts are fond of telling us to do the right thing for the user experience, and it’ll all be rainbows and unicorns.

By and large, this is really good advice. Hand-write a 3000 word, really thorough, well-written discussion of a topic, with original images/illustrations, embed a video, show where it is on an interactive map–this kind of content is going to generally score well in Google’s various quality algorithms.

But it can pay to take a technical approach to the content as well–looking at how Google is likely to perceive each piece of content on the page (content? background? navigation?), and making sure Googlebot renders it properly (Search Console -> Fetch as Google).

Does your fabulous “everything about purple widgets” page actually look a lot like a category archive page? (I have a client with this problem right now….big traffic drop on November 19th).

november-19

Are you splitting content about a topic across multiple pages to make the pages load fast? Perhaps now Google is seeing 10 pages–each of which barely covers any of the topic. Consider combining those pages into 1 using a tabbed approach…and, if there are some parts of the content that make the page load too slowly this way, you can lazy-load a select few of those bits of content.
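Here's a rough sketch of that combined, tabbed page. The class names and the data-src lazy-load hook are hypothetical (use whatever your tab and lazy-loading scripts expect); the point is that all of the text lives in the initial HTML, so Google sees one deep page, and only the heavy extras are deferred.

<div class="product-tabs">
  <section id="overview">
    <h2>Overview</h2>
    <p>Full overview copy lives here, in the initial HTML.</p>
  </section>
  <section id="specifications">
    <h2>Specifications</h2>
    <p>Full specifications copy lives here too.</p>
  </section>
  <section id="reviews">
    <h2>Reviews</h2>
    <!-- A heavy, non-essential widget can be deferred; data-src is a hypothetical hook
         for whatever lazy-loading script you use -->
    <div class="lazy-load" data-src="/widgets/reviews"></div>
  </section>
</div>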

Does your user experience result in a ton of really thin pages for functions like Add to Cart, Add to Wishlist, Email to a Friend, Leave a Review? You can keep your UX as-is….just make sure you set your meta robots to “noindex,follow” on those pages, so you’re not telling Google you think those pages are rockin’ hot content that should be indexed. You want Google to look at the set of pages you’re pushing at them to be indexed (via your XML sitemap AND the meta robots on every page–don’t forget to make sure these are consistent), and have Google’s quality score average for all those pages be really nice and high.
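That's just one line in the <head> of each of those utility pages (adjust for your templating system):

<meta name="robots" content="noindex,follow" />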

And, for RankBrain, you’re well advised to do the search for the target term for your page, and read the pages that come up in the top 20-30 results. Are they discussing sub topics or related topics that you’ve totally left out? Are they using synonyms that you aren’t–and maybe you could replace 1 of the 3 references to a given term with the synonym?

Another place to look is Google Search Console, in Search Analytics. As of last summer, you can see all the terms for which you're appearing in the first few pages of the results, along with clicks, impressions, and rankings–on a per-term, per-page basis. Look here for terms where you rank badly, get a fair number of impressions, and don't have the term on the page today.

Wrapping Up

You don’t have to forgo beautiful design and great UX in order to please the Google ranking algorithms–but it’s worth your while to also take a technical look at your site, with Googlebot, Panda, RankBrain, and Phantom in mind.

And don’t worry too much about what Google can and cannot measure today–in the immortal words of Canadian hockey legend Wayne Gretzky (whose birthday was yesterday): “I skate to where the puck is going to be, not to where it has been.”

wayne-gretzky
Photo courtesy waynegretzky.com.

Parasitic SEO: InstallAware's Competitor Attacks Using Negative SEO Tricks

Often when people talk about negative SEO tactics, they’re talking about using extremely spammy links pointing to a competitor in an attempt to get the competitor’s page or site penalized. And yes, this actually happens…I have 2 clients in fact whose competitors apparently hired the same person to build spammy links to their sites (I could tell because both sites were getting links from the same set of link farms and crappy blogs, during the same month, despite being in different countries and different industries).

Here’s the Twist

But my client, InstallAware, is seeing something a little more devious being done by their competitor. In their case, the competitor has found a post on a very strong, trusted forum (owned by Dell) where the post title questions whether their product is approved by Microsoft.

Here’s how it appears in the SERPs when you search for “InstallAware”:

Parasitic SEO: using spammy backlinks to a page on a trusted site to cast FUD


The actual post, if you read it all, isn’t really so bad–a reasonable discussion follows and the net of the discussion is positive. Where the damage is done is in the SERPs themselves, where the title of the post becomes the page title and hence the headline that shows in the SERPs when you search for the product name, “InstallAware”. I’m not going to link to that post because that would just make our problem worse, of course.

Spam, or Coincidence?

So how do we know this is an attack, and not simply an example of a post in a really strong, trusted site ranking as it should given the domain authority, PageRank, etc.?

We take a look at the backlinks to that particular page, using something like Open Site Explorer. We find a bunch of links from nasty porn and online casino profiles, fake social media accounts with nothing other than the company name and a link to this page, and blog comment spam (over 300 instances of the exact same typo-filled comment with the link).

Here’s a small excerpt from the backlinks…I’ve fuzzed out some of the more explicit words in the domains and titles:

Spammy backlinks from questionable sources


And here’s the blog comment spam, which conveniently has a typo in it, making it easier to find all instances of it via a Google search. Note that they’re using the target term (“installaware”) as the anchor text on the link to the post:

Blog comment spam


A large number of the comments trace back to a Google+ profile from India belonging to “Marie”, who apparently works at BlockBuster (yeah right).

Why does this work?

If you just created a new site to put a negative post on, or did the blog post on some other small site, it would be extremely hard to get that page to rank at all, let alone high on page one for the company’s own name. And, building a pile of spammy links like this wouldn’t really help, as those links would quickly get the page and/or site penalized.

But, starting with a page on a super strong trusted domain, with a ton of legitimate links to the domain, the additional spammy links to this particular page register as such a low percentage of the overall link profile that the penalty isn't likely to happen. So, those links work pretty much as well as they would have worked on any site 5 or so years ago, before the advent of manual and algorithmic penalties.

This is a bit of a perversion of a well-known positive SEO technique known as Barnacle SEO (a term coined by Will Scott), where you can point a bunch of links at your page on a site like Yelp to get that page to show on page 1.

The perpetrator in this case actually has an advantage over the positive Barnacle SEO practitioner, as he/she really has no risk of damage to themselves from using aggressive and nasty spammy link sources. Their OWN website is not involved at all, and it’s really impossible to trace the links back to the person who’s creating this.

So how do you combat this?

Removing the links from all the spammy sites is going to be pretty much impossible, as they’re on sites that are overwhelmed with other comment spam etc. Those webmasters already have a cleanup problem on a galactic scale–if they’re not cleaning spam comments and profiles up already, they’re unlikely to do all that work just because you asked nicely. In the case of the Google+ profile, you can report that to Google, but that’s not going to remove the links on all the blogs posted by them.

Before we proceed, it's important to realize that while a super strong, trusted site like the one in question isn't really likely to get penalized because of a few hundred crappy links to one page, the real risk is that parasitic SEO practitioners discover that this particular site won't take action on this kind of tactic, and start using the site as a host for similar attacks for their other clients. This COULD cause the overall percentage of backlinks from spammy sites to grow to where it does get the site penalized, especially if the parasitic SEO folks start using some of the more aggressive comment spam engines.

So, it’s really in the best interest of the webmaster of the trusted forum to nip this in the bud, save a possible penalty, and lessen the amount of IP address banning, cleanup, etc. they have to do going forwards.

Let’s look at your options for getting this off of page 1 of the SERPs, along with the issues and effort involved with each approach.

Get the Post Removed

The simplest answer of course would be to see if you could get the post taken down. In this case, though, the actual post is a reasonable and semi-useful discussion that the forum administrator isn’t likely to want to just delete.

Move the Post to a New URL

We can ask the webmaster to move the post to a new URL, updating links to the post within the forum, and NOT doing a 301 redirect from the old URL to the new URL. Now, all of those spammy links will point to a page that 404s, which eventually will cause Google to drop it from the index (although this can sometimes take a month or two). A better approach is to make that old URL return a 410 (HTTP response code meaning it’s gone permanently, vs. 404 which means it’s gone but possibly only temporarily).
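If the forum runs on Apache, returning that 410 can be as simple as a one-line suggestion to hand the webmaster. The path below is a made-up placeholder, not the real thread URL:

# Old thread location, now permanently gone
Redirect gone /forum/old-thread-12345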

Keep in mind, when you’re approaching a webmaster with a request like this, you’re asking a favor which is going to require some work on their part. Work that they’re not getting paid for, and that’s going to be getting in the way of whatever other projects are also on their plate. It behooves you to not only be respectful and non-accusatory in your request (after all, THEY didn’t cause the problem, it’s someone else abusing their site that’s the cause), but also do what you can to help them mitigate the effects of the spam on their site as well.

A CMS like WordPress makes this approach very simple. But others (like the one in question) make this sort of change very difficult and time-consuming.

Disavow the Spammy Links

In this example, we’ve done all the analysis to find all the spammy backlinks, so we can easily provide the IT Ninja webmaster with a disavow file excerpt that includes all the nasty porn etc. domains so that any further attempts by the same spammer to build links to the moved page (or other pages in the future) are derailed in advance, with no extra work for this webmaster.
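The disavow file format is plain text, one domain or URL per line, with # for comments. A hypothetical excerpt (the real file lists the actual spam domains found in the backlink report):

# Link farms and spam profiles pointing at the hijacked thread
domain:spammy-casino-example.com
domain:adult-linkfarm-example.net
http://hacked-blog-example.org/post-with-the-spam-comment/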

If the webmaster uploads this disavow file (or adds it to their existing disavow file), that will accomplish the same goal–but it won’t have an effect until the next time Google reads the disavow file, which most likely means not until the next Penguin data update.

This approach is probably the least effort for the webmaster, and there’s the benefit to them of removing the effects of the potentially toxic links as well.

Top 10 Giant Site Audit FAILs

I spend a lot of my day doing technical site audits for clients. It's often pretty tedious work, but I've run into a number of little problems–things that often aren't even visible to the user–that had giant repercussions for search. There have been a few of them where the fix was BIG…where the development team explained how everything depended on X, and changing X would take forever and break everything and the site would be ugly and users would cry…etc. Then we'd have the little conversation that goes "Well, then, how ARE you going to get your customers…since you won't be getting any from Google!"

Without further ado, here are some of the biggies I’ve run into.

big-fail
Photo courtesy Robert Huffstutter on Flickr.

Staging servers getting indexed

It’s pretty common to have a staging environment where you can put the latest version of your website for testing and review before going “live” with it. And if your team is in multiple locations, then the easy thing to do is just put it out there on the internet…maybe on a subdomain, like staging.mywebsite.com. The problem comes when somehow, somewhere, Google discovers the site (perhaps you sent a link to it using your Gmail account?). And indexes it.

Now what happens when you move the new version of the site to the live site? What does Google see? Clearly you're a scraper–Google saw all that content weeks ago (and still sees it). Your live site looks like just a copy of the staging site, which appears to Google to be the original (it saw all the content there first, after all).

But it’s not super obvious what’s happened–because the staging site has virtually no links to it, it doesn’t rank. But your live site, with all its links, is seen by Google as a festering pile of duplicate content.

The solution? Block all user agents in the robots.txt file on the staging server using Disallow: /.
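That staging robots.txt needs just two lines:

User-agent: *
Disallow: /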

block-all

Oh, and when you go to move the pages from the staging server to the live site? You’re not going to want to move that robots.txt file too 🙂 Think about it….

Ajax and Content

Ajax is awesome for site performance, user experience, etc. But be careful how you use it to populate your page with content. Google will call Javascript functions to render content, sure–but typically only what’s in the onload() function. If your page requires another event in order for the Ajax to be called (like a page scroll down, for example), don’t count on Google executing THAT. Your lovely content-rich, Panda-happiness-making 3000-word 10-big-photo page that loads only 1 photo and the first 3 sentences before scrolling…well, Panda is only going to see that first part.
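Here's a minimal sketch of the kind of pattern that gets sites into trouble (the endpoint name is hypothetical): the HTML Googlebot receives contains an empty div, and the real content only arrives if a scroll event ever fires.

<div id="more-content"></div>
<script>
// Risky: the rest of the article is only fetched after the user scrolls,
// so a crawler that never scrolls may never trigger this request.
window.addEventListener('scroll', function () {
  var target = document.getElementById('more-content');
  if (target.childElementCount === 0) {
    fetch('/rest-of-article.html')   // hypothetical endpoint
      .then(function (response) { return response.text(); })
      .then(function (html) { target.innerHTML = html; });
  }
});
</script>

If that content matters for rankings, put it in the initial HTML and save the Ajax for genuinely optional extras.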

The Site of Many Flavors

So your site responds to requests for mysite.com as well as www.mysite.com? And, you jumped on the bandwagon and made it work under https when Google announced that https pages would get a ranking boost (yeah, right :-/)? Fabulous. But did you do your redirects? If you DON'T 301 redirect from your non-www to your www version (or the other way around is ok too), and you DON'T 301 redirect requests for http to https, then Google will see 4 complete, separate websites…all with the same content.

Just updating your menus to link to everything with www and https isn’t enough. Google’s still got a memory of those non-www and non-https pages (probably from other sites that linked to you a while ago).
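On an Apache server, consolidating the four flavors is a pair of 301 rules in .htaccess. This is a sketch, assuming mod_rewrite is enabled and using mysite.com as a placeholder domain:

RewriteEngine On

# Send http traffic to https
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://www.mysite.com/$1 [R=301,L]

# Send non-www traffic to www
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.mysite.com/$1 [R=301,L]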

Side note here: when you DO move to https, make sure you create a new project in Google Webmaster Tools for the https site. You'll find that only some of the stuff will still show up under your old http project there.

Robots.txt blocking style sheets

With the avalanche of hackers out there ripping into WordPress sites, people are doing all sorts of things in a desperate attempt to keep the wolves at bay. And so they block wp-content, wp-includes, and wp-admin in their robots.txt file.

But, first of all, only spiders respect robots.txt…hackers giggle at your lame attempt to block them, and go right on in.

fetch-as-googlebot
The problem with blocking these folders is that they may contain style sheets that are needed to render images, menus, etc. When Google Panda goes to take a peek at your page to see all your lovely content–especially the content above the fold–if the stylesheet is blocked by robots.txt, there might be nothing for Panda to see. You can see how Google sees your page by doing a Fetch & Render in Google Webmaster Tools. I've had clients whose sites have been totally image-free because of a blocked style sheet; multiple clients have had what should have been a horizontal menu with pulldowns turn into a vertical 3-page list of black menu items on a white background. Oops.
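A safer WordPress robots.txt keeps the admin screens out but leaves wp-content and wp-includes alone, so Googlebot can fetch the CSS and Javascript it needs to render your pages. Something along these lines:

User-agent: *
Disallow: /wp-admin/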

Blocking in robots.txt instead of doing a noindex,follow

There are really very few reasons to EVER block anything in robots.txt. One good exception is the staging site example from above. But beyond that, it's NOT the best way to shape what Google indexes on your site, and here's why.

When you block a set of pages in robots.txt, you’re telling Googlebot STAY OUT. The pages won’t be crawled, and the links on them to other pages on your site won’t be counted.

What you ACTUALLY want to do is to set meta robots directives in the pages themselves like this:

<meta name="robots" content="noindex,follow" />

This tells Google to go ahead and crawl the page, and count the link juice outbound from that page to other pages, but don’t bother indexing the page.

Let’s say you have a “share this page” link on all pages of your 10,000 page site. And that sharing page of course has really nothing on it, so you don’t want it indexed. But, that sharing page has the main navigation on it, like any other page, with links to your 300 most important pages.

Blocking /share-page.html in robots.txt means all the link juice of the 10,000 different share-page.html pages (because you’re probably passing the page to be shared as a parameter, e.g. share-page.html?page=purple-widgets.html) that WOULD have gone to your 300 most important pages is flushed down the toilet. If, instead, you did a noindex,follow on those pages, then you’d have 10,000 more little bits of link juice flowing to those 300 most important pages.

Joomla’s Big Bad Default Setting

Joomla, by default, disallows the /images/ folder. So Google sees no images on ANY of your pages. Pretty dry, boring site you’ve got there, dude.

.htaccess is NOT your firewall

Back to them evil hackers. Yes, they’re out there, and yes, there’s a TON of them. The SEMpdx blog has probably had a few hundred hacking attempts in just the time I’ve spent writing this blog post.

Did you know you can block IP addresses in .htaccess? And did you know that lists of IP addresses for China, Russia, Nigeria, etc. are out there?

Don’t do it. You’re using a hammer to drive in a screw.

I had a client who had blocked some IP addresses in his .htaccess file. By "some" I mean over 73,000 lines of this in his .htaccess (see below…this is his actual file, and them are real, live line numbers!). Now, the .htaccess file gets read and parsed for every HTTP request made. Which means that on a page with 3 style sheets, 20 images, and 7 Javascript files included, the .htaccess file got read and parsed 30 more times on top of the request for the page itself. A relatively lightweight site was seeing page load times of over 20 seconds. Ouch.

deny
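For a sense of what that looks like, the file boiled down to this (with documentation-range IPs substituted for his real entries), repeated tens of thousands of times:

order allow,deny
allow from all
deny from 203.0.113.17
deny from 203.0.113.18
deny from 198.51.100.0/24

That job belongs in a real firewall, where the rules don't get re-read and re-parsed on every single request.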

To hell with URL standards

Don’t stifle my creative side, man. I’ve got a new smooth way to use characters in URLs. Y’all are gonna love it.

I had a client who was using # characters instead of ? and & for parameter separators. They couldn’t figure out why Google only indexed their home page, when they had hundreds of thousands of pages of content.

The # character is supposed to be used to indicate an in-page anchor. Everything AFTER that isn’t technically part of the URL; it’s a location within the page.
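For example, in a hypothetical URL like https://www.mysite.com/widgets?color=purple&size=large, everything after the ? is sent to the server and can identify a distinct page. In https://www.mysite.com/widgets#purple, the #purple part never reaches the server at all; the browser just scrolls to an anchor on a page Google already knows about, which is why those hundreds of thousands of "pages" all looked like a single URL.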

A/B Testing Gone Wild

My client was using regular parameters in the URL for A/B testing mods to their home page, e.g.:

  • https://www.mysite.com/?version=A
  • https://www.mysite.com/?version=B

That’s not NECESSARILY a bad idea–you can use rel=canonical to point both of those to the base page, https://www.mysite.com/, and you should be OK. But if you neglect to do that, all of a sudden you have 3 different home pages, in Google’s eyes anyway.
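The rel=canonical fix is a single line in the <head> of both test variants, pointing back at the base URL:

<link rel="canonical" href="https://www.mysite.com/" />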

Here’s a case where using the # for something other than an inpage anchor wouldn’t have been such a bad idea.

404 to the home page

If you thought an easy way to handle not-found pages (and capture otherwise lost link juice) would be to set up your 404 handler to 301 redirect to the home page, you’d be right. It IS an easy way. But like a lot of easy things, it’ll bite you.

Google wants to see an HTTP 404 error code returned when a non-existent page is fetched. My theory is that it’s because some spammy people at one time figured they could make Google think they had a million-page site by creating links to URLs to a million pages, then fabricate content on-the-fly in their 404 handler by taking the words out of the URLs and injecting them into a template of other words. Then, if that template had a link to somewheres else on it, well then, some little page might be gettin’ a heap of link juice, might’n it.

Doesn’t matter if I’m right about this, or if I did it myself at one point. I mean, if I had THIS FRIEND who did that at one point. What matters is that Google will check your site for this every few weeks. Look in your webserver logs long enough, and you’ll see Googlebot trying to fetch a URL that’s really long, a big jumble of letters and numbers.

Not only does Google check in this fashion, but if Google finds there are pages on your site that come back nearly empty and SEEM to be page-not-found pages, Google will mark those as "soft 404 errors" in Webmaster Tools. If you want to see exactly what HTTP responses are being returned by your server, I'm a big fan of the HttpFox plugin for Firefox–it will show you not only the final HTTP code, but each hop along the way, if there are multiple redirects.
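If you're on Apache, the safe pattern is a custom error page referenced by a local path, which keeps the 404 status intact. (Pointing ErrorDocument at a full URL makes Apache issue a redirect instead, which is exactly the mistake we're trying to avoid.)

ErrorDocument 404 /custom-404.html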

Conclusion

There’s a million ways to shoot yourself in the foot when it comes to search optimization. With a little luck, some of you have some juicy horror stories to share in the comments!
