How Google Crawls Your Website: A Hands-On Tutorial
By Vismaya · 11 min read · 2026-03-31
Google crawls billions of web pages every single day. So why can't it find yours?
That's the question most people never think to ask. They write content, hit publish, and assume Google will show up. Sometimes it does. Sometimes it takes weeks. Sometimes it never comes at all.
Crawling is the first thing that has to happen before your page can rank for anything. If Google can't find it, nothing else matters — not your keywords, not your content quality, not your backlinks. It's invisible.
This tutorial teaches you exactly how Google discovers pages, what controls how often it visits your site, what accidentally blocks it, and how to verify everything yourself.
What Is Crawling, Exactly?
Crawling is Google's discovery process. It's how Google finds new pages and checks existing pages for changes.
Google uses an automated program called Googlebot. Googlebot starts from a list of known URLs — pages from previous crawls, URLs submitted through sitemaps, links discovered from other sites. It visits each URL, reads the page content, and follows every link on that page to discover more URLs. Then it visits those URLs, follows their links, and continues.
It's a chain reaction. One page leads to another, which leads to another. Google doesn't manually visit your site because you published something. It finds you by following the web's link structure.
This is the foundational concept: if no other page on the internet links to your page, and you haven't submitted it through Search Console or a sitemap, Google has no path to find it.
Crawl Budget: Why Google Doesn't Crawl Everything
Google has limited resources. It can't crawl every page on the internet every day. So it makes decisions about how much time and how many requests to spend on each site.
This is called crawl budget — the number of pages Google will crawl on your site within a given time period.
Crawl budget depends on two things:
Crawl capacity: How fast can your server handle requests without breaking? If your site is slow or returns errors, Google backs off to avoid crashing it. If your site responds quickly, Google crawls more aggressively.
Crawl demand: How much does Google want to crawl your site? New content, frequently updated pages, and pages with lots of external links get crawled more often. Stale content that hasn't changed in months gets crawled less.
For most small sites (under 1,000 pages), crawl budget isn't a concern. Google can handle it. But for larger sites — e-commerce stores with thousands of product pages, media sites with massive archives — crawl budget matters. If Google spends its budget crawling 500 pagination pages and URL parameters, it might never get to your 50 new blog posts.
Sites with stronger authority and trust signals get more generous crawl budgets. This is why established sites can publish 10 articles in a day and see them all indexed within hours, while newer sites struggle to get a single page discovered in a week.
Try this right now (if you have Search Console access):
- Open Google Search Console
- Go to Settings → Crawl Stats
- Look at "Total crawl requests" over the last 90 days
- Check "Average response time" — if it's over 500ms, your server might be slowing down Google
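If you want to sanity-check your server's speed outside Search Console, a minimal Python sketch like the one below can time a request and flag it against the same 500 ms rule of thumb. The URL and threshold are illustrative assumptions, not anything Google publishes.

```python
import time
import urllib.request

SLOW_THRESHOLD_MS = 500  # mirrors the 500 ms rule of thumb above


def classify_response_time(ms):
    """Flag a measured response time as 'ok' or 'slow' against the threshold."""
    return "slow" if ms > SLOW_THRESHOLD_MS else "ok"


def measure_response_ms(url):
    """Time a single GET request to `url` and return milliseconds elapsed."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000


# Example usage (requires network; replace with your own homepage):
#   ms = measure_response_ms("https://yourdomain.com/")
#   print(f"{ms:.0f} ms -> {classify_response_time(ms)}")
print(classify_response_time(320))  # a 320 ms response is fine
```

One measurement is noisy; Search Console's figure is an average across many fetches, so treat a single timing only as a rough signal.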
Robots.txt: The File That Controls Googlebot
Every website can have a file called robots.txt that tells crawlers which parts of the site they're allowed (or not allowed) to visit. It lives at yourdomain.com/robots.txt.
Here's what a basic robots.txt looks like:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /cart/
Sitemap: https://yourdomain.com/sitemap.xml
This tells all crawlers: you can access the whole site, except the admin area and the cart pages. And here's where the sitemap is.
Where things go wrong:
A single misplaced line in robots.txt can block your entire site from being crawled. During development, developers often add Disallow: / (which blocks everything) and forget to remove it when the site goes live. Your site looks normal to visitors, but Google can't access any of it.
Another common mistake: blocking important resources like CSS and JavaScript files. Google needs to render your pages to understand them. If robots.txt blocks your CSS, Google sees a broken page and may not index it properly.
Try this right now:
- Open your browser
- Go to yourdomain.com/robots.txt
- Read what's there
- Check: Is anything important being blocked? Is there a Disallow: / that shouldn't be there? Is your sitemap URL listed?
If you see Disallow: / and your site isn't intentionally private — that's the problem. Fix it immediately.
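You can also check robots.txt rules programmatically with Python's standard-library parser. The sketch below parses an example file (the paths are hypothetical) and asks whether Googlebot may fetch a given URL. Note one caveat: Python's parser applies the first matching rule rather than Google's longest-match precedence, so a leading `Allow: /` line is omitted here to keep the two interpretations in agreement.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt; absent any matching Disallow rule, crawling is allowed
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if `agent` is allowed to fetch `url` under this robots.txt."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)


print(is_crawlable(ROBOTS_TXT, "https://yourdomain.com/blog/post"))    # True
print(is_crawlable(ROBOTS_TXT, "https://yourdomain.com/admin/login"))  # False
```

To test your live file, point `RobotFileParser.set_url()` at yourdomain.com/robots.txt and call `read()` instead of `parse()`.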
Sitemaps: Giving Google a Map of Your Site
A sitemap is an XML file that lists all the important URLs on your site. It's not required — Google can find pages through links alone. But a sitemap speeds up discovery significantly, especially for new sites or large sites with complex structures.
Your sitemap lives at a URL like yourdomain.com/sitemap.xml and looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/blog/how-search-engines-think</loc>
    <lastmod>2026-03-28</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
Each <url> entry tells Google: this page exists, and here's when it was last updated. (You'll often see a <priority> field too, but Google has said it ignores it, so don't count on it to influence crawling.)
Common sitemap mistakes:
- Including pages that return 404 errors (tells Google your sitemap is unreliable)
- Including pages with noindex tags (contradicts itself — sitemap says "index this" while the page says "don't")
- Not updating <lastmod> when content changes (Google uses this to prioritise crawling)
- Not submitting the sitemap in Search Console (Google might find it through robots.txt, but submitting it directly is faster)
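To audit a sitemap yourself, you can parse it with Python's standard-library XML module and list every URL with its last-modified date, then spot-check those URLs against your site. This is a minimal sketch using the sitemap example above; the URL is illustrative.

```python
import xml.etree.ElementTree as ET

# Sitemap entries live in this XML namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/blog/how-search-engines-think</loc>
    <lastmod>2026-03-28</lastmod>
  </url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return a list of (loc, lastmod) pairs from a sitemap document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        pairs.append((loc, lastmod))
    return pairs


print(sitemap_urls(SITEMAP_XML))
```

From here, fetching each `loc` and checking for 404s or noindex tags would catch the two sitemap mistakes listed above.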
Try this right now:
- Go to yourdomain.com/sitemap.xml
- Does it exist? If not, your CMS (WordPress, Payload, etc.) probably generates one — check your CMS settings
- If it exists, open it. How many URLs are listed? Do they match the pages you want Google to find?
- Open Google Search Console → Sitemaps → submit your sitemap URL if it isn't already there
Internal Linking: The Crawl Paths You Control
External links from other sites help Google discover your pages. But internal links — links between pages on your own site — are the crawl paths you control completely.
Every page on your site should be reachable within 3-4 clicks from the homepage. If a page requires 7 clicks to reach, or isn't linked from any other page at all, Google treats it as low-priority. It might get crawled eventually, but it won't be prioritised.
Orphan pages — pages with zero internal links pointing to them — are one of the most common crawling failures we see in site audits. The page exists. The content is good. But Google rarely visits it because there's no path leading to it.
Think of internal linking as road infrastructure. Your homepage is the highway. Category pages are main roads. Blog posts and product pages are streets. If a street isn't connected to any road, nobody drives down it — not users, not Googlebot.
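The click-depth and orphan-page ideas are easy to compute yourself. Given a map of which pages link to which (the graph below is hypothetical), a breadth-first search from the homepage yields each page's click depth, and any page missing from the result is an orphan.

```python
from collections import deque

# Hypothetical internal-link graph: page -> pages it links to
LINKS = {
    "/": ["/blog/", "/products/"],
    "/blog/": ["/blog/post-a", "/blog/post-b"],
    "/products/": ["/products/widget"],
    "/blog/post-a": [],
    "/blog/post-b": ["/blog/post-a"],
    "/products/widget": [],
    "/blog/orphan-post": [],  # no page links here
}


def click_depths(links, home="/"):
    """BFS from the homepage; returns each reachable page's click depth."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


depths = click_depths(LINKS)
orphans = [page for page in LINKS if page not in depths]
print(depths)   # every reachable page with its depth from "/"
print(orphans)  # ['/blog/orphan-post']
```

On a real site you'd build the `LINKS` dict from a crawl of your own pages; any page deeper than 3-4 clicks, or absent entirely, is a candidate for new internal links.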
Try this right now:
- Pick a page on your site that you know isn't getting much traffic
- In Google, search site:yourdomain.com "page title" — does it appear? If not, the page isn't indexed, often because it hasn't been crawled
- Check: how many internal links point to this page? You can check in Search Console → Links → Internal links
- If the answer is zero or one, add internal links from 3-4 related pages
JavaScript and Crawling: The Hidden Trap
Modern websites built with JavaScript frameworks (React, Next.js, Angular) present a specific challenge for crawling.
Googlebot has two stages of crawling: the initial HTML fetch and the rendering phase. In the first stage, it downloads your page's raw HTML. If your content is generated by JavaScript (client-side rendering), the HTML is essentially empty — just a bunch of <script> tags. The actual content only appears after JavaScript runs.
Google can render JavaScript, but it's expensive and slow. It queues JavaScript-heavy pages for a second crawl pass, which can add days or weeks to the discovery timeline. During that wait, your content is invisible.
For Payload CMS users (like WizGrowth): Payload supports server-side rendering (SSR) through its Next.js integration. Make sure your pages render content on the server, so Googlebot gets the full HTML on the first pass. If your setup relies on client-side rendering, your crawling will be delayed.
Try this right now:
- Open your website in Chrome
- Right-click → View Page Source (not Inspect Element — View Source shows the raw HTML Google sees first)
- Can you see your article text in the source code? Or do you only see <script> tags?
- If you can't see your content in the source, Google's first crawl pass sees the same empty page
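The View Source check can be roughly automated. The sketch below strips <script>/<style> blocks and tags from raw HTML, approximating what a non-rendering first-pass crawler can read; the two sample pages are invented to show the client-side versus server-side contrast.

```python
import re


def visible_text(html):
    """Crude view of the text a first-pass (non-rendering) crawler sees:
    drop script/style blocks, then all remaining tags."""
    html = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", html).strip()


# Client-side rendered page: content only appears after JavaScript runs
CSR_PAGE = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
# Server-side rendered page: content is in the raw HTML
SSR_PAGE = "<html><body><article>How Google crawls your site</article></body></html>"

print(repr(visible_text(CSR_PAGE)))  # '' -- nothing for the first crawl pass
print(repr(visible_text(SSR_PAGE)))  # the article text is present
```

If running this against your own page's raw HTML returns little or no text, Google's first crawl pass sees the same emptiness.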
How to Check If Google Has Crawled a Specific Page
Google Search Console gives you a direct way to check any URL:
- Open Search Console
- Enter the full URL in the inspection bar at the top
- Click Enter
Google will tell you:
- "URL is on Google" — crawled and indexed. Working.
- "URL is not on Google" — not indexed. Click to see why.
- "Page fetch: Successful" — Google could access the page
- "Page fetch: Failed" — something blocked access
If the page was crawled but not indexed, you'll see a reason: "Crawled — currently not indexed" means Google found it but decided it wasn't worth keeping. "Discovered — currently not indexed" means Google knows about it but hasn't crawled it yet — a crawl budget or priority issue.
The Complete Crawl Audit: A Step-by-Step Exercise
Run this on your own site right now. It takes about 30 minutes and gives you a clear picture of your site's crawl health.
Step 1: Check robots.txt. Go to yourdomain.com/robots.txt. Verify nothing important is blocked. Confirm the sitemap URL is listed.
Step 2: Check your sitemap. Go to your sitemap URL. Verify it exists, contains your key pages, and doesn't include dead links or noindexed pages.
Step 3: Check crawl stats. In Search Console → Settings → Crawl Stats, check crawl requests per day and average response time. If response time is high, your server is slowing Google down.
Step 4: Check page indexing. In Search Console → Pages, look at "Not indexed" reasons. Focus on "Crawled — currently not indexed" (quality issue) and "Discovered — currently not indexed" (priority issue).
Step 5: Check internal links. In Search Console → Links → Internal links, sort by "Target page." Any important page with fewer than 3 internal links is at risk of being under-crawled.
Step 6: Check a specific page. Pick your most important page and run it through URL Inspection. Confirm it's indexed, the fetch was successful, and the last crawl date is recent.
Document everything you find. This is your first real technical audit. The pages that aren't being crawled? Those are the pages you fix first.
Key Takeaways
Crawling is how Google discovers your pages. Without crawling, nothing else in SEO matters — your page simply doesn't exist in Google's world.
Google crawls billions of pages daily but uses crawl budget to prioritise. Your site's speed, authority, internal linking, and sitemap quality all affect how much attention Google gives you.
Robots.txt controls what Google can access. A single wrong line can block your entire site. Always check it. Sitemaps speed up discovery — submit yours through Search Console.
Internal links are the crawl paths you control. Orphan pages without internal links rarely get crawled. Every important page should be reachable within 3-4 clicks from the homepage.
JavaScript-heavy sites need server-side rendering to ensure Google sees content on the first crawl pass, not just empty script tags.
What's Next
This is Part 2 of WizGrowth Academy's Search Fundamentals series. Next: Indexing — What Gets Stored and Why — where we cover canonical tags, noindex, duplicate content, and how to diagnose why Google found your page but decided not to keep it.
Frequently Asked Questions
How often does Google crawl my website? It depends on your site's authority and update frequency. High-authority sites get crawled multiple times per day. Smaller or newer sites might be crawled weekly or less. Publishing new content regularly and having strong internal linking increases crawl frequency.
What is crawl budget and should I worry about it? Crawl budget is the number of pages Google will crawl on your site in a given time. For small sites (under 1,000 pages), it's rarely an issue. For large sites with thousands of pages, it matters — you need to ensure Google spends its budget on your important pages, not on URL parameters or duplicate content.
Can I force Google to crawl my page faster? You can request indexing through Google Search Console's URL Inspection tool. This puts your page in a priority queue but doesn't guarantee immediate crawling. Strong internal links, an updated sitemap, and external links from other sites also speed up discovery.
My page was crawled but not indexed. Why? Google found your page but decided it wasn't worth adding to its index. Common reasons: the content is too thin, it's too similar to another indexed page, or it doesn't provide unique value. Check Search Console for the specific reason and improve the content accordingly.
Does robots.txt prevent indexing? Not reliably. Robots.txt prevents crawling, which usually prevents indexing. But if other pages link to a blocked URL, Google can still index it based on the link context alone — just without seeing the actual content. For pages that absolutely shouldn't appear in search, use a noindex meta tag instead.
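For reference, the noindex directive the answer above recommends is a single tag in the page's <head>:

```html
<!-- Tells crawlers: you may crawl this page, but do not index it -->
<meta name="robots" content="noindex">
```

One caveat: the page must not be blocked in robots.txt, or Google will never crawl it and therefore never see the noindex tag.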
Copyright © WizGrowth Inc. 2025