In the last few days I’ve encountered a surprising number of clients and even SEOs who don’t fully understand XML sitemaps, so I’m here to clear up some things.
Let’s say you half read a blog post somewhere that said “if your site doesn’t have an XML sitemap, your site will never be indexed and you will be poor, miserable and die lonely”. So, you got your developer or SEO to make an XML sitemap for your website, or maybe you did it yourself with a free tool (because you’re cheap). All giddy and excited, you submit your sitemap through Google Webmaster Tools and wait for the magical day for Google to crawl it. Like Xmas morning, you creep down the stairs, log into GWT and start to cry because you see a report that looks like this:
“Only 262 pages indexed!” you scream. “Why does Googlez hate me? Imma fire my SEO and kick a baby!”
In a fevered response, you (or your SEO) go line by line through your sitemap.xml file to make sure there are no broken links or malformed URLs (good for you!), but you can’t find anything. So instead, you resign yourself to being poor, miserable and dying lonely.
Well.. here’s something you may not have considered..
All URLs in a sitemap.xml file must return a 200 OK response
I find myself constantly amused by the number of XML sitemaps I come across that have URLs that either 404 or redirect with a 301 or 302. What’s even more amusing is when I find URLs that have been disallowed via robots.txt.
So, to help you all understand why the URLs in your XML sitemap may not be indexing fully, I’ve made some easy-to-follow pictures! Why? Because I know how much you hate reading.
URLs in XML Sitemap returning 404 Not Found responses
URLs in XML Sitemap returning 301 or 302 redirect responses
URLs in XML Sitemap disallowed via Robots.txt
URLs in XML Sitemap returning 200 OK responses
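For the folks who prefer code to pictures, here is a rough sketch of sorting sitemap URLs into the four cases listed above. It assumes you have already recorded a status code for each URL; the names `sitemap_urls`, `load_robots` and `bucket`, and all the example.com sample data, are made up for illustration. It uses only Python's standard library, and the standard `urllib.robotparser` does simple prefix matching, so wildcard rules are not handled here.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def load_robots(robots_txt):
    """Build a robots.txt parser from text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def bucket(url, status, robots):
    """Sort one sitemap URL into one of the four cases listed above."""
    if not robots.can_fetch("Googlebot", url):
        return "disallowed via robots.txt"
    if status == 404:
        return "404 Not Found"
    if status in (301, 302):
        return "redirect"
    if status == 200:
        return "200 OK"
    return "other"


# Hypothetical sample data, not taken from any real site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/old</loc></url>
  <url><loc>http://example.com/gone</loc></url>
  <url><loc>http://example.com/private/report</loc></url>
</urlset>"""

SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

SAMPLE_STATUSES = {
    "http://example.com/": 200,
    "http://example.com/old": 301,
    "http://example.com/gone": 404,
    "http://example.com/private/report": 200,
}

if __name__ == "__main__":
    robots = load_robots(SAMPLE_ROBOTS)
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url, "->", bucket(url, SAMPLE_STATUSES[url], robots))
```

Only the URLs that land in the "200 OK" bucket are candidates for indexing, and even those get a "well, maybe".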
Now, before you start looking… no, this site doesn’t have an XML sitemap file. Why? Because they’re not necessary! An XML sitemap is only a tool to help crawlers discover pages they might not normally find, usually because you have a crappy, unspiderable JavaScript menu that plays a Megadeth song every time you hover over it with your mouse, because your usability expert told you that was the future of the web.





Good post with good illustration, thank you.
No problem… and as I can tell from your Nigerian IP address (197.255.175.138), XML Sitemap posts with illustrations are clearly the tonic for crippling poverty and governmental corruption! Happy to oblige!
Sorry I removed your link, but you still haven’t responded to my email about requiring your bank account info so I can transfer some funds the government wants to seize from my family dynasty in Winnipeg…
so are you saying that if a page returns a 200 response it will definitely be indexed?
or does the ‘*well, maybe’ imply something else?!
A 200 response is required for Google to index a page, but it doesn’t mean it will definitely be indexed. That’s up to Google to determine if it’s relevant, useful, blah blah…
That is an awesome post. I understand now what the heck is going on with page indexing. You never told me what I can do to fix it! No problem, At least I know what I should be doing to get it fixed ^_^
What to do to fix it? It’s pretty simple.. make sure all the URLs listed in your XML sitemap resolve with a 200 response.
Google wants the final URLs, not your stinky redirected ones…
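If you want to automate that cleanup, something like the sketch below could work. It assumes you have already collected, for each URL, its status code and `Location` header into a dictionary; the `responses` structure and the names `final_url` and `clean_sitemap` are hypothetical, and the example.com entries are invented.

```python
def final_url(url, responses, max_hops=10):
    """Follow recorded redirects; return the URL that answers 200, else None."""
    seen = set()
    for _ in range(max_hops):
        if url in seen or url not in responses:
            return None  # redirect loop, or a URL we have no record for
        seen.add(url)
        status, location = responses[url]
        if status == 200:
            return url
        if status in (301, 302) and location:
            url = location  # hop to the redirect target and check again
        else:
            return None  # 404s and anything else get dropped
    return None


def clean_sitemap(urls, responses):
    """Replace redirecting URLs with their targets and drop dead ones."""
    cleaned = []
    for url in urls:
        target = final_url(url, responses)
        if target and target not in cleaned:
            cleaned.append(target)
    return cleaned


# Hypothetical recorded responses: URL -> (status code, Location header).
SAMPLE_RESPONSES = {
    "http://example.com/": (200, None),
    "http://example.com/old": (301, "http://example.com/new"),
    "http://example.com/new": (200, None),
    "http://example.com/gone": (404, None),
    "http://example.com/a": (302, "http://example.com/b"),
    "http://example.com/b": (302, "http://example.com/a"),
}

if __name__ == "__main__":
    listed = [
        "http://example.com/",
        "http://example.com/old",
        "http://example.com/gone",
        "http://example.com/a",
    ]
    for url in clean_sitemap(listed, SAMPLE_RESPONSES):
        print(url)
```

The output is the list you would actually want in the sitemap: final destinations only, with 404s and redirect loops thrown out.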
Thanks, you’re awesome. Make some more posts. I love reading them.
How can you test whether your URLs get the 200 Response? Where do you find what response you get? Thanks for the info!!
Kateryna
Clearly you are no l33t hax0r… however, if you are a web developer who’s actually getting work, you’re already using Firebug. You’ll want to learn how to use the “net” tab which will show you server responses for assets.
If all else fails, use the SEO Book status code checker – http://tools.seobook.com/server-header-checker/
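If you would rather script it, here is a minimal sketch using Python's standard `http.client`, which does not follow redirects on its own, so you see the 301 or 302 itself rather than the page it points to. The helper names `request_target` and `status_of` are made up for this example, and some servers answer `HEAD` requests differently than `GET`, so treat the result as a first pass.

```python
import http.client
from urllib.parse import urlsplit


def request_target(url):
    """Split a URL into (scheme, host, path with query) for a raw request."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.scheme, parts.netloc, path


def status_of(url, timeout=10):
    """Return (status code, Location header) without following redirects."""
    scheme, host, path = request_target(url)
    if scheme == "https":
        conn = http.client.HTTPSConnection(host, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder URL; swap in one from your own sitemap.
    print(status_of("http://example.com/"))
```

Anything other than a 200 in the first slot means that URL has no business being in your sitemap.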
Thanks Keith!
Kateryna
Great blog! My pain: Google Webmaster indicates everything OK: sitemap, content, etc. Still, for over 2 months only 9 pages are indexed – I have more than 15,000. Even the first level/click URLs are not indexed. Could you help? Best, Mark
Ya sitemaps aren’t your issue. You’re hiding text (good ol’ 90s style cloaking… way to keep it old school!).
Visit one of your artist pages and disable external CSS using the “Web Developer” toolbar plugin for Firefox, then smack yourself in the face for using inline CSS to color your text white, but making the background black in your external CSS.
Remember kids.. Google can parse inline CSS, so don’t use it unless you’re really lazy and hate the internet.. then it’s ok.
Fun, Fun & More fun. This post literally pulled me from a depressed state and back to mirth & merriment.
You’re funny. Thanks for this great post!
You’re spammy. Thanks for posting your Russian dating website URL despite the fact all these links are nofollowed!
Thanks for the advice! I submitted my sitemaps yesterday evening. Some have been indexed, others are still pending. Should I be nervous that not all of them are indexed yet, or does it take a couple of days?
Unless your site is regularly crawled by Google on an hourly basis, you might as well hurry up and wait. It’s like anything with Google.. if you’re not famous or popular, the crawlers don’t care.
Is your website followed around by gawking tourists or people with cameras?
The site has a search query in a page, and it was disallowed for indexing by the robots code. Now only 3 pages have been indexed. The main problem is that when you search keywords, it shows inside pages and not my home page.
Ok. Here’s my advice.. Go buy yourself a bag of chunky chocolate chip cookies and spend the evening googling “link canonical tag”, delete all wildcard disallows in your robots.txt, and then punch yourself in the face for wanting your homepage to show up for every keyword, as opposed to the internal pages that users would actually be searching for.
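To find those wildcard disallows before deleting them, a small sketch like this can list every `Disallow` rule containing a `*`. The function name `wildcard_disallows` and the sample robots.txt are invented for illustration; review each hit by hand, since some wildcard rules may be there on purpose.

```python
def wildcard_disallows(robots_txt):
    """Return (line number, rule) for each Disallow rule containing '*'."""
    hits = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()  # ignore trailing comments
        field, _, value = rule.partition(":")
        if field.strip().lower() == "disallow" and "*" in value:
            hits.append((number, rule))
    return hits


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /*?s=
Disallow: /admin/
Disallow: /*.php$  # old scripts
"""

if __name__ == "__main__":
    for number, rule in wildcard_disallows(SAMPLE_ROBOTS):
        print(f"line {number}: {rule}")
```

Note that the `*` in `User-agent: *` is ignored, because only `Disallow` lines are inspected.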
AHA! Finally I understand what’s going on with my sitemaps. Was driving me insane. Thanks for making it clear.
Great post. I have a pretty basic question, so sorry for boring you. With my sitemap it indexed 3/4 of my pages… I can fix that on my own, but under the “total indexed” in the health tab it says 0 indexed… sucks. What’s the difference, and how can I fix that?
Thanks for your time, jeff
Most questions bore me, particularly stupid ones… however, yours isn’t so stupid. There is a difference between the “indexed” sitemap pages and the “indexed status” under the health tab.
The “indexed status” found under health is pointing out pages that haven’t been excluded from search results due to the meta noindex tag, duplicate content pages, or all those other factors that Google uses to keep pages out of search results. They’re not quite related to the XML sitemap. What I would do in your situation is compare your XML sitemap and/or your page inventory to the “site:” search parameter in Google. Might give you an idea of what isn’t there that should be there.
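Once you have both lists, the comparison itself is easy to script. This sketch assumes you have gathered the indexed URLs yourself (for example by copying them from “site:” results); `normalize` and `missing_from_index` are hypothetical helper names, and treating URLs with and without a trailing slash as the same page is an assumption that may not hold for every site.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase scheme and host and drop a trailing slash before comparing."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def missing_from_index(sitemap, indexed_urls):
    """Return sitemap URLs that do not appear in the list of indexed URLs."""
    indexed = {normalize(url) for url in indexed_urls}
    return [url for url in sitemap if normalize(url) not in indexed]


if __name__ == "__main__":
    # Hypothetical inventories, not taken from any real site.
    sitemap = [
        "http://example.com/",
        "http://example.com/about/",
        "http://example.com/contact",
    ]
    indexed = ["http://EXAMPLE.com", "http://example.com/about"]
    for url in missing_from_index(sitemap, indexed):
        print("not indexed:", url)
```

Whatever comes out the other end is your list of pages that should be there and aren’t.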
Now I know why half of my links don’t get indexed. I checked the “Errors” page and there were ~40 broken links. I repaired them; we’ll see how Google responds to this.
hi, great post about sitemaps and great answers about common issues…
i’m facing a problem with my sitemaps indexing: i saw it was falling down in the last month and found there was problem fetching the page with lynx… this resulted in MANY “not followed” errors…
last week i fixed the problem, lynx can fetch pages again and “not followed” errors are decreasing, but the number of indexed pages is still going down…
any hint?