Web page crawl analyzer

Full Googlebot crawl diagnostic in seconds

Is your page being crawled correctly by Google? Measure HTML weight, check robots.txt, analyze sub-resources, and estimate your crawl budget score. Detect meta refresh redirects, compare mobile vs desktop, and generate a SHA-256 content fingerprint. Free diagnostic with results in seconds.

Send custom headers with the crawl request. User-Agent, Host, and transfer headers are not allowed (max 10 headers, 1 KB per value).
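The stated limits can be expressed as a small validation routine. The following Python sketch is purely illustrative, not the tool's actual code; the exact set of blocked transfer-related headers beyond those named is an assumption.

```python
# Illustrative sketch of the stated limits: max 10 custom headers,
# 1 KB per value, and no User-Agent, Host, or transfer-related headers.
# The full blocklist is an assumption; only the named headers are documented.
FORBIDDEN_HEADERS = {"user-agent", "host", "transfer-encoding"}
MAX_HEADERS = 10
MAX_VALUE_BYTES = 1024  # 1 KB per header value

def validate_custom_headers(headers: dict[str, str]) -> list[str]:
    """Return validation errors; an empty list means the headers pass."""
    errors = []
    if len(headers) > MAX_HEADERS:
        errors.append(f"too many headers: {len(headers)} > {MAX_HEADERS}")
    for name, value in headers.items():
        if name.lower() in FORBIDDEN_HEADERS:
            errors.append(f"header not allowed: {name}")
        if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
            errors.append(f"value exceeds 1 KB: {name}")
    return errors
```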

Size analysis and truncation

Measure the exact weight of decompressed HTML. Visualize the ratio against the 2 MB (HTML) or 64 MB (PDF) limit with a progress bar.

Crawl budget score

Get a score out of 100 evaluating how efficiently your page uses crawl resources. Identify factors that waste your budget unnecessarily.

Sub-resource inventory

List all scripts, CSS, images, fonts, and iframes loaded by the page. Spot third-party resources and loading errors.

Client-side redirect detection

Identify meta refresh and JavaScript redirects invisible to Googlebot. These client-side redirects can block indexing.

Mobile vs desktop comparison

Compare the smartphone and desktop versions of your page. Detect differences in size, content, and headers between the two versions.

Why analyze the crawl behavior of your pages?

If your HTML exceeds 2 MB, Googlebot silently truncates it. No error in Search Console, no warning: the content at the bottom of the page disappears from Google's index (source: Google documentation). And that's only part of the problem: a misconfigured robots.txt, excessive sub-resources, invisible JavaScript redirects, and poor compression all consume your crawl budget without you knowing.

Five reasons to analyze the crawl behavior of your pages:

  • Avoid truncation - Pages with heavy inline HTML (SVG, CSS, bulky JSON-LD) often exceed the limit without you knowing
  • Verify Googlebot access - A misconfigured robots.txt can block crawling of important pages
  • Optimize crawl budget - Lighter pages with fewer sub-resources = more pages crawled by Google in its allotted time
  • Detect invisible redirects - Meta refresh and JavaScript redirects are not always followed by Googlebot
  • Compare mobile vs desktop - Mobile-first indexing means the smartphone version is the one that counts for indexing

How to use the page crawl checker in 3 steps

Step 1: Enter the page URL

Enter the full URL of the page to analyze in the field above. The tool accepts any publicly accessible URL, including PDF files:

https://www.captaindns.com/en

Test your longest pages first: category pages, product pages with many variants, blog posts with numerous inline images.

Step 2: Choose the User-Agent and options

Select the User-Agent to simulate the crawl:

  • Googlebot smartphone (recommended): simulates mobile-first crawling, which Google uses for primary indexing
  • Googlebot desktop: useful for comparing the desktop version if your site serves different HTML
  • Comparison mode: test both User-Agents simultaneously to detect differences in content, size, and headers

Under advanced options, you can add custom HTTP headers. Useful for testing a site behind a CDN, a reverse proxy, or for sending a specific authentication cookie.

Step 3: Review the full report

The report displays:

  • KPIs at the top: size, crawl budget score, number of sub-resources, response time, client-side redirects
  • Progress bar: visual ratio against the 2 MB limit (or 64 MB for PDFs)
  • robots.txt: verification that Googlebot is allowed to crawl the URL, crawl-delay and detected sitemaps
  • HTTP headers: Content-Type, Content-Encoding, Cache-Control, X-Robots-Tag, and your custom headers sent
  • HTML analysis: meta tags, headings, links, structured data, inline resources
  • Sub-resources: full inventory of scripts, CSS, images, fonts, iframes with size and status
  • Crawl budget: score out of 100 with factor breakdown and individual impact
  • Client-side redirects: meta refresh and JavaScript detected in the HTML
  • Content fingerprint: SHA-256 hash to detect changes between analyses
  • Truncation simulation: if applicable, see exactly where Googlebot would cut off
  • Recommendations: concrete actions prioritized by impact

What is Googlebot's 2 MB limit?

Google documents a size limit for crawling: Googlebot can download and index the first 2,097,152 bytes (2 MB) of a page's HTML source code. Beyond that, content is truncated. For PDF files, the limit is 64 MB.

What this means in practice:

  • HTML - limit: 2 MB (2,097,152 bytes). Consequence if exceeded: content at the end of the page is ignored.
  • PDF - limit: 64 MB. Consequence if exceeded: extracted text content is truncated.

Important: the HTML limit applies to decompressed content. Gzip/brotli compression changes nothing: a 3 MB HTML file compressed in transit will still be truncated at 2 MB after decompression.
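This rule is easy to verify locally. A minimal Python sketch, assuming a gzip-compressed response body (the 2 MB cut is applied after decompression):

```python
import gzip

GOOGLEBOT_HTML_LIMIT = 2_097_152  # 2 MB, applied to DECOMPRESSED HTML

def crawlable_portion(compressed_body: bytes) -> tuple[bytes, bool]:
    """Return the HTML Googlebot would keep and whether it was truncated."""
    html = gzip.decompress(compressed_body)  # the limit applies after decoding
    return html[:GOOGLEBOT_HTML_LIMIT], len(html) > GOOGLEBOT_HTML_LIMIT

# A 3 MB page compresses to a small transfer, but is still cut at 2 MB.
page = b"<html>" + b"x" * 3_000_000 + b"</html>"
kept, truncated = crawlable_portion(gzip.compress(page))
```

Everything past byte 2,097,152 of the decompressed HTML is simply dropped, regardless of how small the compressed transfer was.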

Pages at risk:

  • E-commerce pages listing hundreds of products in HTML
  • Landing pages with inline SVG or bulky embedded CSS
  • Pages with highly detailed structured JSON-LD (e.g., FAQ with 50+ questions)
  • Server-rendered pages with abundant inline JavaScript

What exactly does the tool analyze?

Size analysis

  • Raw size: exact weight of the HTML returned by the server, in bytes
  • Decompressed size: size after gzip/brotli decoding (the one that matters for Googlebot)
  • Limit ratio: percentage of the limit consumed (2 MB for HTML, 64 MB for PDF)
  • Content type: automatic detection of HTML, PDF, or other, with a visual badge

robots.txt verification

  • Googlebot access: is the tested URL allowed or blocked by robots.txt?
  • Matched agent: which rule applies (Googlebot, *, etc.)
  • Crawl-delay: delay imposed between crawl requests
  • Sitemaps: sitemap files declared in robots.txt

HTTP headers

  • Content-Type: confirms the server returns HTML (or a PDF)
  • Content-Encoding: indicates whether compression is active (gzip, br)
  • X-Robots-Tag: detects a potential noindex/nofollow at the HTTP level
  • Cache-Control: cache configuration that impacts crawl frequency
  • Custom headers: the headers you sent are displayed for confirmation

HTML analysis

  • Meta tags: presence and content of title, description, robots, canonical
  • Structure: heading hierarchy (H1-H6) with byte position
  • Links: number of internal, external, and nofollow links detected
  • Structured data: JSON-LD detected, with size and identified types
  • Inline resources: scripts, styles, SVG, and data URIs embedded in the HTML

Sub-resources

  • Scripts: external JavaScript files loaded by the page
  • CSS: external stylesheets
  • Images: images referenced in the HTML
  • Fonts: web fonts loaded
  • Iframes: third-party embedded content
  • Third-party resources: sub-resources loaded from other domains
  • Loading errors: resources returning an HTTP error (404, 500, etc.)
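An inventory of this kind can be sketched with Python's standard html.parser. This is an illustrative approximation of such a scan, not the tool's actual implementation:

```python
from html.parser import HTMLParser

class SubResourceInventory(HTMLParser):
    """Collect external sub-resources (scripts, CSS, images, iframes)
    referenced by a page's HTML."""
    def __init__(self):
        super().__init__()
        self.resources: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(("script", attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(("css", attrs["href"]))
        elif tag in ("img", "iframe") and attrs.get("src"):
            self.resources.append((tag, attrs["src"]))

parser = SubResourceInventory()
parser.feed('<script src="/app.js"></script>'
            '<link rel="stylesheet" href="/main.css">'
            '<img src="https://cdn.example.com/a.png">')
```

A real crawler would then resolve each URL against the page origin (e.g. with urllib.parse) to flag third-party domains and fetch each resource to record its HTTP status.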

Crawl budget score

  • Overall score: rating out of 100, weighted by the importance of each factor
  • Page size: impact of HTML weight on crawl budget
  • Number of sub-resources: each request consumes budget
  • Third-party resources: external domains add latency
  • Response time: a slow response reduces the number of pages crawled
  • Compression: missing compression wastes bandwidth
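The tool's real weighting is not published. As a purely illustrative sketch, a weighted score over factors like those above might look as follows (the weights and factor names are invented for the example):

```python
# Hypothetical weights, for illustration only: the analyzer's actual
# weighting of each factor is not documented.
WEIGHTS = {
    "page_size": 0.30,
    "sub_resources": 0.25,
    "third_party": 0.15,
    "response_time": 0.20,
    "compression": 0.10,
}

def crawl_budget_score(factor_scores: dict[str, float]) -> float:
    """Weighted average of per-factor scores, each rated 0-100."""
    return round(sum(WEIGHTS[k] * factor_scores[k] for k in WEIGHTS), 1)

# A page that is light but loads many third-party sub-resources:
score = crawl_budget_score({"page_size": 90, "sub_resources": 40,
                            "third_party": 20, "response_time": 70,
                            "compression": 100})
```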

Client-side redirects

  • Meta refresh: <meta http-equiv="refresh"> tags, with URL and delay
  • JavaScript: patterns such as window.location, document.location, location.href
  • Position in HTML: byte location of the detected redirect
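Detection of these patterns can be approximated with regular expressions. An illustrative Python sketch (the tool's actual patterns are not published, and a production scanner would need to handle more attribute orderings):

```python
import re

# Matches <meta http-equiv="refresh" content="DELAY;url=TARGET">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*(\d+)\s*;\s*url=([^"\'>]+)',
    re.IGNORECASE,
)
# Matches JavaScript redirect assignments like window.location.href = ...
JS_REDIRECT = re.compile(
    r'(?:window\.location(?:\.href)?|document\.location|location\.href)\s*=',
    re.IGNORECASE,
)

html = '<p>Moved.</p><meta http-equiv="refresh" content="0;url=https://example.com/new">'
m = META_REFRESH.search(html)
delay, target, position = m.group(1), m.group(2), m.start()  # offset in the HTML
```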

Content fingerprint

  • SHA-256 hash: unique fingerprint of the page content
  • Change detection: compare the hash between two analyses to check whether content has changed
  • Mobile/desktop comparison: if both versions have the same hash, the content is identical
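A fingerprint comparison of this kind is straightforward to reproduce with Python's hashlib (the HTML bodies below are made-up examples):

```python
import hashlib

def content_fingerprint(html: bytes) -> str:
    """SHA-256 hex digest used as a stable fingerprint of page content."""
    return hashlib.sha256(html).hexdigest()

# Hypothetical desktop and mobile responses for the same URL:
desktop = b"<html><body>Full page with FAQ and reviews</body></html>"
mobile = b"<html><body>Stripped-down page</body></html>"
identical = content_fingerprint(desktop) == content_fingerprint(mobile)
```

Since SHA-256 changes completely when a single byte changes, equal hashes mean byte-identical content, while different hashes only tell you that *something* differs.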

Mobile vs desktop comparison

  • Size: HTML weight difference between the two versions
  • Headers: differences in Content-Type, compression, cache, X-Robots-Tag
  • Meta tags: different title, description, canonical, or robots?
  • Structure: number of headings, links, structured data
  • Fingerprint: same hash = identical content; different hash = distinct content
  • Verdict: summary: identical, minor differences, or critical differences

Enter your URL above to get the full analysis of your page.


Real-world use cases

Case 1: E-commerce page with thousands of products

Symptom: Your category page lists 500 products in HTML. The bottom of the page (pagination, FAQ, links to subcategories) doesn't appear in Google results.

Diagnosis with the tool: The page is 3.2 MB of HTML. Googlebot truncates at 2 MB, losing the last 200 products, the FAQ, and all footer navigation links.

Action: Switch to pagination with dynamic loading (lazy load), limit the initial listing to 50 products, move the FAQ higher on the page.


Case 2: Low crawl budget score due to sub-resources

Symptom: Google crawls few pages on your site despite regularly updated content. Your new pages take weeks to appear in the index.

Diagnosis with the tool: Each page loads 85 sub-resources including 40 third-party scripts (analytics, widgets, A/B testing). The crawl budget score is 35/100. Third-party resources account for 60% of requests.

Action: Load third-party scripts with defer/async, remove unused scripts, bundle CSS and JS files, use lazy loading for images below the fold.


Case 3: JavaScript redirect invisible to Googlebot

Symptom: Your page correctly redirects users to the new URL, but the old page remains indexed in Google and the new one doesn't appear.

Diagnosis with the tool: The tool detects a window.location.href in the HTML. This is a JavaScript redirect that Googlebot does not consistently follow. No HTTP redirect (301/302) is configured.

Action: Replace the JavaScript redirect with a server-side HTTP 301 redirect. If a transition period is needed, add a <link rel="canonical"> tag pointing to the new URL.


Case 4: robots.txt blocks an important section

Symptom: Your /en/blog/ pages are no longer indexed since you updated your robots.txt. No visible error in Search Console.

Diagnosis with the tool: The robots.txt analysis shows "URL blocked" with the rule Disallow: /en/ blocking all English-language content. The robots.txt was meant to block /en/admin/ but the rule is too broad.

Action: Fix robots.txt by replacing Disallow: /en/ with Disallow: /en/admin/. Verify with the tool that important pages are allowed.
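The before/after check in this case can be reproduced with Python's standard robotparser; the rules and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def googlebot_allowed(robots_txt: str, url: str) -> bool:
    """Check whether Googlebot may fetch a URL under a given robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("Googlebot", url)

too_broad = "User-agent: *\nDisallow: /en/\n"        # blocks ALL English pages
fixed = "User-agent: *\nDisallow: /en/admin/\n"      # blocks only the admin area
blog = "https://www.example.com/en/blog/post"

googlebot_allowed(too_broad, blog)  # blocked by the over-broad rule
googlebot_allowed(fixed, blog)      # allowed after narrowing the rule
```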


Case 5: Different content between mobile and desktop

Symptom: Your Google rankings are dropping even though your desktop content is complete and well optimized.

Diagnosis with the tool: Comparison mode reveals that the smartphone version serves stripped-down HTML: the FAQ, customer reviews, and 3 content sections are missing. The SHA-256 fingerprints are different. Google indexes the mobile version, which is incomplete.

Action: Ensure the mobile version contains the same SEO content as the desktop version. Use responsive design rather than server-side conditional content.


Case 6: Migration with lost compression

Symptom: After a server migration, your pages load more slowly and Google crawls fewer pages.

Diagnosis with the tool: The Content-Encoding header is missing. The server is no longer compressing HTML. The crawl budget score dropped from 78/100 to 52/100.

Action: Re-enable gzip/brotli compression on the new server. Check your nginx/Apache configuration.

Test your pages with the tool above to identify issues specific to your site.


❓ FAQ - Frequently asked questions

Q: What is the average web page size?

A: In 2025, the median web page weighs about 2.5 MB (all resource types combined). But the HTML alone is typically between 50 KB and 500 KB. It's the HTML size that matters for Googlebot's crawl limit, not the total weight including images, CSS, and JavaScript.


Q: What happens when a page exceeds 2 MB?

A: Googlebot truncates HTML beyond 2,097,152 bytes. All content past that point is ignored for indexing. In practice: internal links, structured FAQ, SEO text at the bottom of the page are no longer considered for ranking in search results.


Q: What is crawl budget?

A: Crawl budget is the number of pages Googlebot can crawl on your site within a given time. Heavy pages with many sub-resources consume more server and network resources, reducing the total number of pages crawled. Our tool calculates a score out of 100 to evaluate the efficiency of each page.


Q: Why do sub-resources impact crawling?

A: Each sub-resource (script, CSS, image, font) requires an additional HTTP request. Googlebot has a limited crawl capacity per domain. A page loading 80+ sub-resources consumes far more budget than one loading 20. Third-party resources add latency and external dependencies.


Q: What is a client-side redirect?

A: It's a redirect performed by the browser via a meta refresh tag or JavaScript (window.location). Unlike HTTP redirects (301, 302), Googlebot does not always follow them. If your only redirect is client-side, the destination page may never be indexed.


Q: Does the tool check the robots.txt file?

A: Yes. The tool automatically fetches the domain's robots.txt and checks whether Googlebot is allowed to crawl the tested URL. It also detects crawl-delay and declared sitemaps. If robots.txt blocks the URL, a warning is displayed, but the page analysis continues so you can still see the content.


Q: Does the tool work with PDF files?

A: Yes. The tool automatically detects PDF files and adjusts the size limit: 64 MB instead of 2 MB for HTML. A PDF badge appears in the report and the HTML analysis is disabled (not applicable to PDFs).


Q: What is the content fingerprint (hash) for?

A: The tool generates a SHA-256 hash of the page content. This fingerprint lets you detect whether content has changed between two analyses, or whether the mobile and desktop versions serve identical content. Useful for monitoring unintended changes after a deployment.


Q: Why choose Googlebot smartphone over desktop?

A: Google has used mobile-first indexing since 2019: the smartphone version of your page is indexed first. Test with the smartphone User-Agent to see exactly what Google indexes. Comparison mode lets you verify that both versions are consistent.


Q: Does gzip/brotli compression count toward the 2 MB limit?

A: No. The 2 MB limit applies to decompressed HTML. A 3 MB HTML file compressed to 500 KB during network transfer will still be truncated at 2 MB once decompressed by Googlebot. Compression improves transfer speed but does not bypass the size limit.


Q: Why compare the mobile and desktop versions?

A: Google has used mobile-first indexing since 2019: the smartphone version is indexed first. If your mobile version serves different content (less text, missing FAQ, missing links), your rankings suffer. Comparison mode detects these differences and classifies them by severity.


Q: How do I reduce my web page size?

A: The most effective actions:

  • Remove unnecessary inline CSS/JS - Move them to external files
  • Enable compression - gzip or brotli at the server level
  • Minify HTML - Remove whitespace and comments
  • Externalize SVGs - Replace inline SVGs with img tags
  • Lazy loading - Load heavy content on demand

Q: What are custom HTTP headers for?

A: Custom headers let you test specific configurations: send a cookie to access a protected site, simulate a particular Accept-Language header, or reproduce CDN conditions. The tool displays the headers sent in the report for confirmation.


Complementary tools

  • DNS Lookup: check your domain's DNS records
  • DNS Propagation Checker: confirm your DNS changes have propagated globally
  • Email Deliverability Audit: analyze MX, SPF, DKIM, and DMARC for your domain
  • SPF Record Checker: analyze and validate your SPF record
  • Hash Generator: compute SHA-256 fingerprints to compare page content
  • Domain Redirect: replace JavaScript redirects with proper 301/302 HTTPS redirects

Useful resources