How Are Search Engines Crawling Your Website?

⚠️ Don’t Miss This Important Technical SEO Tip Below ⬇️

The only tool that can give you a real overview of how search engines crawl your site is your log files.

Despite this, many people are still obsessed with crawl budget: the number of URLs Googlebot can and wants to crawl during each visit to your website.

🎯 The fact is that most sites don’t need to worry that much about crawl budget.

Log file analysis allows you to discover URLs on your site that you had no idea about but that search engines are crawling anyway.

There is huge value in analysing the logs produced by those crawls: they show which pages Google is crawling and whether anything needs to be fixed.

When you know exactly what your log files are telling you, you’ll gain valuable insights about how Google and other search engines are crawling and viewing your website.

This means you can act on this data to increase traffic. And the bigger the site, the greater the impact fixing these issues will have.

🎯 You can find log files by accessing your hosting environment or by asking your web hosting provider.

The entries in a log file often look like this:

66.249.65.107 - - [08/Dec/2021:06:19:10 -0400] "GET /about/ HTTP/1.1" 200 13479 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

We can break them down as follows:

66.249.65.107 is the IP address (who)
[08/Dec/2021:06:19:10 -0400] is the Timestamp (when)
GET is the Method
/about/ is the Requested URL (what)
200 is the Status Code (result)
13479 is the Bytes Transferred (size)
"-" is the Referrer URL (source) — it’s empty because this request was made by a crawler
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) is the User Agent (signature) — this is the user agent of Googlebot (Desktop)
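The breakdown above can be sketched in code. This is a minimal illustration, not a production parser: it uses a regular expression for the common "combined" access-log format and pulls out each field named above.

```python
import re

# Regex for a "combined" format access-log line.
# Group names follow the breakdown above; the two unnamed \S+ fields
# after the IP are the identd and user fields, usually logged as "- -".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('66.249.65.107 - - [08/Dec/2021:06:19:10 -0400] '
        '"GET /about/ HTTP/1.1" 200 13479 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

entry = LOG_PATTERN.match(line).groupdict()
print(entry["ip"])      # who:    66.249.65.107
print(entry["url"])     # what:   /about/
print(entry["status"])  # result: 200
```

Real log formats vary by server configuration, so check your own format string (e.g. Apache's LogFormat or Nginx's log_format) before relying on any single pattern.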

Now that you know what the information means, you can use a log file analyser such as:

Botify
OnCrawl
Sitebulb
Semrush

Here are a few sample questions I use at the start of my analysis:

❓ Which search engines crawl my website?
❓ Which URLs are crawled most often?
❓ Which content types are crawled most often?
❓ Which status codes are returned?
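Once log entries are parsed into dictionaries, these starter questions reduce to simple counting. Here is a hedged sketch; the sample entries are illustrative placeholders, not real data.

```python
from collections import Counter

# Hypothetical parsed log entries (in practice, thousands of lines
# parsed from your actual log files).
entries = [
    {"user_agent": "Googlebot/2.1", "url": "/about/",    "status": "200"},
    {"user_agent": "bingbot/2.0",   "url": "/about/",    "status": "200"},
    {"user_agent": "Googlebot/2.1", "url": "/old-page/", "status": "404"},
]

bots = Counter(e["user_agent"] for e in entries)  # which search engines crawl my site?
urls = Counter(e["url"] for e in entries)         # which URLs are crawled most often?
codes = Counter(e["status"] for e in entries)     # which status codes are returned?

print(urls.most_common(1))  # [('/about/', 2)]
print(codes)                # Counter({'200': 2, '404': 1})
```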

When looking at the data in the log file analyser, grouping it into segments will provide aggregate numbers that give you the bigger picture. This makes it easier to spot trends you might have missed by looking only at individual URLs.
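One simple way to build such segments, sketched below, is grouping URLs by their first path segment; the crawled URLs here are hypothetical examples.

```python
from collections import Counter

def segment(url: str) -> str:
    """Return the first path segment of a URL path, e.g. '/blog/'."""
    parts = url.strip("/").split("/")
    return "/" + parts[0] + "/" if parts[0] else "/"

# Illustrative crawled URLs pulled from log entries.
crawled = ["/blog/post-1/", "/blog/post-2/", "/products/shoe/", "/"]

by_segment = Counter(segment(u) for u in crawled)
print(by_segment)  # Counter({'/blog/': 2, '/products/': 1, '/': 1})
```

Most log file analysers let you define these segments in their UI instead, but the principle is the same: aggregate per section, then drill down.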

You can then combine the server log data with other sources such as Google Analytics, keyword rankings, sitemaps, and crawl data, and start asking and answering questions like:

❓ What pages are not included in the sitemap but get crawled extensively?
❓ What pages are included in the sitemap file but are not crawled?
❓ Are revenue-driving pages (money pages) crawled often?
❓ Are the majority of crawled pages indexable (e.g. returning status code 200)?
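The sitemap questions above come down to set differences between crawled URLs and sitemap URLs. A minimal sketch, with both URL lists as illustrative placeholders:

```python
# URLs seen in the log files (crawled) vs. URLs listed in the sitemap.
# Both sets are hypothetical examples.
crawled = {"/about/", "/blog/post-1/", "/tmp/debug-page/"}
in_sitemap = {"/about/", "/blog/post-1/", "/blog/post-2/"}

# Crawled extensively but not in the sitemap: candidates to review.
crawled_not_in_sitemap = crawled - in_sitemap

# In the sitemap but never crawled: candidates to improve or promote.
sitemap_not_crawled = in_sitemap - crawled

print(crawled_not_in_sitemap)  # {'/tmp/debug-page/'}
print(sitemap_not_crawled)     # {'/blog/post-2/'}
```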

Being able to answer these questions allows you to:
✅ improve poor pages so they get crawled.
✅ hide pages that you don’t want crawled.
✅ remove and redirect pages that add no value.

By taking control of what is crawled regularly, you can greatly increase the traffic to the pages that matter and the ROI of your SEO strategy. 📈😊
