Why analyze the crawl behavior of your pages?
If your HTML exceeds 2 MB, Googlebot silently truncates it. No error in Search Console, no warning: the content at the bottom of the page disappears from Google's index (source: Google documentation). And that's only part of the problem: a misconfigured robots.txt, excessive sub-resources, invisible JavaScript redirects, and poor compression all consume your crawl budget without you knowing.
Five reasons to analyze the crawl behavior of your pages:
- Avoid truncation - Pages with heavy inline HTML (SVG, CSS, bulky JSON-LD) often exceed the limit without you knowing
- Verify Googlebot access - A misconfigured robots.txt can block crawling of important pages
- Optimize crawl budget - Lighter pages with fewer sub-resources = more pages crawled by Google in its allotted time
- Detect invisible redirects - Meta refresh and JavaScript redirects are not always followed by Googlebot
- Compare mobile vs desktop - Mobile-first indexing means the smartphone version is the one that counts for indexing
How to use the page crawl checker in 3 steps
Step 1: Enter the page URL
Enter the full URL of the page to analyze in the field above. The tool accepts any publicly accessible URL, including PDF files:
https://www.captaindns.com/en
Test your longest pages first: category pages, product pages with many variants, blog posts with numerous inline images.
Step 2: Choose the User-Agent and options
Select the User-Agent to simulate the crawl:
- Googlebot smartphone (recommended): simulates mobile-first crawling, which Google uses for primary indexing
- Googlebot desktop: useful for comparing the desktop version if your site serves different HTML
- Comparison mode: test both User-Agents simultaneously to detect differences in content, size, and headers
Under advanced options, you can add custom HTTP headers. Useful for testing a site behind a CDN, a reverse proxy, or for sending a specific authentication cookie.
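If you want to reproduce this kind of request outside the tool, here is a minimal sketch using Python's urllib. The URL and the cookie are examples only, and Google rotates the Chrome version token in the real Googlebot User-Agent string:

```python
import urllib.request

# Googlebot smartphone User-Agent (illustrative; the Chrome version
# token changes over time in the real string)
GOOGLEBOT_MOBILE = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile "
    "Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

req = urllib.request.Request(
    "https://www.captaindns.com/en",
    headers={
        "User-Agent": GOOGLEBOT_MOBILE,
        # Custom headers, e.g. an auth cookie for a staging site
        "Cookie": "preview=1",
        "Accept-Language": "en-US",
    },
)

# Inspect what would be sent (no network call is made here)
print(req.get_header("User-agent"))
```

Building the Request without sending it lets you confirm exactly which headers a crawl simulation would carry before pointing it at a production site.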
Step 3: Review the full report
The report displays:
- KPIs at the top: size, crawl budget score, number of sub-resources, response time, client-side redirects
- Progress bar: visual ratio against the 2 MB limit (or 64 MB for PDFs)
- robots.txt: verification that Googlebot is allowed to crawl the URL, crawl-delay and detected sitemaps
- HTTP headers: Content-Type, Content-Encoding, Cache-Control, X-Robots-Tag, and your custom headers sent
- HTML analysis: meta tags, headings, links, structured data, inline resources
- Sub-resources: full inventory of scripts, CSS, images, fonts, iframes with size and status
- Crawl budget: score out of 100 with factor breakdown and individual impact
- Client-side redirects: meta refresh and JavaScript detected in the HTML
- Content fingerprint: SHA-256 hash to detect changes between analyses
- Truncation simulation: if applicable, see exactly where Googlebot would cut off
- Recommendations: concrete actions prioritized by impact
What is Googlebot's 2 MB limit?
Google documents a size limit for crawling: Googlebot can download and index the first 2,097,152 bytes (2 MB) of a page's HTML source code. Beyond that, content is truncated. For PDF files, the limit is 64 MB.
What this means in practice:
| Content type | Limit | Consequence if exceeded |
|---|---|---|
| HTML | 2 MB (2,097,152 bytes) | Truncation: content at the end of the page is ignored |
| PDF | 64 MB | Truncation of extracted text content |
Important: the HTML limit applies to decompressed content. Gzip/brotli compression changes nothing: a 3 MB HTML file compressed in transit will still be truncated at 2 MB after decompression.
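The decompression rule is easy to verify yourself. A minimal sketch in Python (the 3 MB page is synthetic):

```python
import gzip

GOOGLEBOT_HTML_LIMIT = 2_097_152  # 2 MB, per Google's documentation

# Synthetic 3+ MB HTML page
html = b"<html><body>" + b"<p>product</p>" * 230_000 + b"</body></html>"

# Gzip shrinks the transfer dramatically...
compressed = gzip.compress(html)
print(f"on the wire: {len(compressed):,} bytes")

# ...but the limit applies AFTER decompression
decompressed = gzip.decompress(compressed)
truncated = decompressed[:GOOGLEBOT_HTML_LIMIT]
print(f"indexed: {len(truncated):,} of {len(decompressed):,} bytes")
# Everything past byte 2,097,152 -- including the closing tags -- is lost
```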
Pages at risk:
- E-commerce pages listing hundreds of products in HTML
- Landing pages with inline SVG or bulky embedded CSS
- Pages with highly detailed structured JSON-LD (e.g., FAQ with 50+ questions)
- Server-rendered pages with abundant inline JavaScript
What exactly does the tool analyze?
Size analysis
| Element | Description |
|---|---|
| Raw size | Exact weight of the HTML returned by the server, in bytes |
| Decompressed size | Size after gzip/brotli decoding (the one that matters for Googlebot) |
| Limit ratio | Percentage of the limit consumed (2 MB for HTML, 64 MB for PDF) |
| Content type | Automatic detection of HTML, PDF, or other with visual badge |
robots.txt verification
| Element | What the tool checks |
|---|---|
| Googlebot access | Is the tested URL allowed or blocked by robots.txt? |
| Matched agent | Which rule applies (Googlebot, *, etc.) |
| Crawl-delay | Delay imposed between crawl requests |
| Sitemaps | Sitemap files declared in robots.txt |
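You can reproduce this check with Python's standard-library robot parser. The rules below are hypothetical and mirror the overly broad Disallow from Case 4 further down:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a Disallow that is too broad for generic bots
robots_txt = """\
User-agent: Googlebot
Disallow: /en/admin/

User-agent: *
Disallow: /en/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its dedicated group, not the * wildcard
print(rp.can_fetch("Googlebot", "https://example.com/en/blog/post"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/en/admin/"))     # False
print(rp.can_fetch("OtherBot", "https://example.com/en/blog/post"))   # False
```

Note how the most specific matching group wins: Googlebot is governed only by its own rules, so the wildcard Disallow does not apply to it.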
HTTP headers
| Header | Why it matters |
|---|---|
| Content-Type | Confirms the server returns HTML (or a PDF) |
| Content-Encoding | Indicates whether compression is active (gzip, br) |
| X-Robots-Tag | Detects a potential noindex/nofollow at the HTTP level |
| Cache-Control | Cache configuration that impacts crawl frequency |
| Custom headers | Your sent headers are displayed for confirmation |
HTML analysis
| Element | What the tool checks |
|---|---|
| Meta tags | Presence and content of title, description, robots, canonical |
| Structure | Heading hierarchy (H1-H6) with byte position |
| Links | Number of internal, external, and nofollow links detected |
| Structured data | JSON-LD detected with size and identified types |
| Inline resources | Scripts, styles, SVG, and data URIs embedded in the HTML |
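A toy version of this analysis can be built on Python's html.parser. It is a sketch only: the real report also records byte positions, and it distinguishes internal from external links by domain, whereas this sketch uses relative-vs-absolute URLs as a rough proxy:

```python
from html.parser import HTMLParser

class CrawlAudit(HTMLParser):
    """Minimal sketch of the report's HTML analysis: headings and links."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = {"internal": 0, "external": 0, "nofollow": 0}

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.headings.append(tag)
        elif tag == "a":
            attrs = dict(attrs)
            # Simplification: absolute URL = external (a real tool
            # would compare the link's host against the page's host)
            if attrs.get("href", "").startswith("http"):
                self.links["external"] += 1
            else:
                self.links["internal"] += 1
            if "nofollow" in attrs.get("rel", ""):
                self.links["nofollow"] += 1

audit = CrawlAudit()
audit.feed("""
<h1>Title</h1>
<a href="/about">About</a>
<a href="https://example.org" rel="nofollow">Partner</a>
<h2>Section</h2>
""")
print(audit.headings)  # ['h1', 'h2']
print(audit.links)     # {'internal': 1, 'external': 1, 'nofollow': 1}
```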
Sub-resources
| Element | What the tool checks |
|---|---|
| Scripts | External JavaScript files loaded by the page |
| CSS | External stylesheets |
| Images | Images referenced in the HTML |
| Fonts | Web fonts loaded |
| Iframes | Third-party embedded content |
| Third-party resources | Sub-resources loaded from other domains |
| Loading errors | Resources returning an HTTP error (404, 500, etc.) |
Crawl budget score
| Element | What the tool evaluates |
|---|---|
| Overall score | Rating out of 100, weighted by the importance of each factor |
| Page size | Impact of HTML weight on crawl budget |
| Number of sub-resources | Each request consumes budget |
| Third-party resources | External domains add latency |
| Response time | A slow response reduces the number of pages crawled |
| Compression | Missing compression wastes bandwidth |
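The exact weighting is internal to the tool, but the idea of a weighted score out of 100 with a per-factor breakdown can be sketched like this. All weights and thresholds below are hypothetical, chosen only to illustrate the mechanism:

```python
# Hypothetical weights and caps -- NOT the tool's real formula.
# Each factor maps page metrics to a penalty in [0, 1].
FACTORS = {
    "size":        (0.30, lambda m: min(m["html_bytes"] / 2_097_152, 1.0)),
    "requests":    (0.25, lambda m: min(m["subresources"] / 100, 1.0)),
    "third_party": (0.20, lambda m: min(m["third_party"] / 40, 1.0)),
    "latency":     (0.15, lambda m: min(m["response_ms"] / 2000, 1.0)),
    "compression": (0.10, lambda m: 0.0 if m["compressed"] else 1.0),
}

def crawl_budget_score(metrics):
    """Return (score out of 100, per-factor penalty breakdown)."""
    breakdown = {name: weight * penalty(metrics) * 100
                 for name, (weight, penalty) in FACTORS.items()}
    return round(100 - sum(breakdown.values())), breakdown

# Example: a page resembling Case 2 below (many third-party scripts)
score, detail = crawl_budget_score({
    "html_bytes": 800_000, "subresources": 85,
    "third_party": 40, "response_ms": 600, "compressed": True,
})
print(score, detail)
```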
Client-side redirects
| Element | What the tool detects |
|---|---|
| Meta refresh | <meta http-equiv="refresh"> tags with URL and delay |
| JavaScript | Patterns: window.location, document.location, location.href |
| Position in HTML | Byte location of the detected redirect |
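Detection of this kind boils down to pattern matching over the raw HTML. A simplified sketch of both checks (the real tool's patterns are more thorough, and for ASCII HTML the character offset equals the byte offset):

```python
import re

# Simplified patterns for the two redirect families
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']refresh["\'][^>]*>', re.IGNORECASE)
JS_REDIRECT = re.compile(
    r'(?:window\.location|document\.location|location\.href)\s*=')

html = """
<meta http-equiv="refresh" content="0; url=https://example.com/new">
<script>window.location = "https://example.com/new";</script>
"""

for match in META_REFRESH.finditer(html):
    print("meta refresh at offset", match.start())
for match in JS_REDIRECT.finditer(html):
    print("JS redirect at offset", match.start())
```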
Content fingerprint
| Element | Description |
|---|---|
| SHA-256 hash | Unique fingerprint of the page content |
| Change detection | Compare the hash between two analyses to check if content has changed |
| Mobile/desktop comparison | If both versions have the same hash, the content is identical |
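The fingerprint comparison works exactly like hashing any other payload. A minimal sketch with Python's hashlib (the two HTML snippets are synthetic):

```python
import hashlib

def fingerprint(html: bytes) -> str:
    """SHA-256 hex digest of the raw page content."""
    return hashlib.sha256(html).hexdigest()

mobile = b"<html><body><h1>Page</h1></body></html>"
desktop = b"<html><body><h1>Page</h1><section>FAQ</section></body></html>"

print(fingerprint(mobile) == fingerprint(mobile))   # identical content, identical hash
print(fingerprint(mobile) == fingerprint(desktop))  # any difference changes the hash
```

Because SHA-256 is deterministic, storing the 64-character digest from one analysis is enough to detect any later change, without keeping the full HTML.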
Mobile vs desktop comparison
| Element | What the tool compares |
|---|---|
| Size | HTML weight difference between the two versions |
| Headers | Differences in Content-Type, compression, cache, X-Robots-Tag |
| Meta tags | Different title, description, canonical, robots? |
| Structure | Number of headings, links, structured data |
| Fingerprint | Same hash = identical content, different hash = distinct content |
| Verdict | Summary: identical, minor differences, or critical differences |
Enter your URL above to get the full analysis of your page.
Real-world use cases
Case 1: E-commerce page with thousands of products
Symptom: Your category page lists 500 products in HTML. The bottom of the page (pagination, FAQ, links to subcategories) doesn't appear in Google results.
Diagnosis with the tool: The page is 3.2 MB of HTML. Googlebot truncates at 2 MB, losing the last 200 products, the FAQ, and all footer navigation links.
Action: Switch to pagination with dynamic loading (lazy load), limit the initial listing to 50 products, move the FAQ higher on the page.
Case 2: Low crawl budget score due to sub-resources
Symptom: Google crawls few pages on your site despite regularly updated content. Your new pages take weeks to appear in the index.
Diagnosis with the tool: Each page loads 85 sub-resources including 40 third-party scripts (analytics, widgets, A/B testing). The crawl budget score is 35/100. Third-party resources account for 60% of requests.
Action: Load third-party scripts with defer/async, remove unused scripts, bundle CSS and JS files, use lazy loading for images below the fold.
Case 3: JavaScript redirect invisible to Googlebot
Symptom: Your page correctly redirects users to the new URL, but the old page remains indexed in Google and the new one doesn't appear.
Diagnosis with the tool: The tool detects a window.location.href in the HTML. This is a JavaScript redirect that Googlebot does not consistently follow. No HTTP redirect (301/302) is configured.
Action: Replace the JavaScript redirect with a server-side HTTP 301 redirect. If a transition period is needed, add a <link rel="canonical"> tag pointing to the new URL.
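As a sketch, the server-side fix looks like this in nginx (hypothetical paths; adapt to your site, and note the Apache equivalent is a Redirect 301 directive):

```nginx
# Permanent redirect handled by the server, reliably followed by Googlebot
location = /old-page {
    return 301 https://www.example.com/new-page;
}
```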
Case 4: robots.txt blocks an important section
Symptom: Your /en/blog/ pages are no longer indexed since you updated your robots.txt. No visible error in Search Console.
Diagnosis with the tool: The robots.txt analysis shows "URL blocked" with the rule Disallow: /en/ blocking all English-language content. The robots.txt was meant to block /en/admin/ but the rule is too broad.
Action: Fix robots.txt by replacing Disallow: /en/ with Disallow: /en/admin/. Verify with the tool that important pages are allowed.
Case 5: Different content between mobile and desktop
Symptom: Your Google rankings are dropping even though your desktop content is complete and well optimized.
Diagnosis with the tool: Comparison mode reveals that the smartphone version serves stripped-down HTML: the FAQ, customer reviews, and 3 content sections are missing. The SHA-256 fingerprints are different. Google indexes the mobile version, which is incomplete.
Action: Ensure the mobile version contains the same SEO content as the desktop version. Use responsive design rather than server-side conditional content.
Case 6: Migration with lost compression
Symptom: After a server migration, your pages load more slowly and Google crawls fewer pages.
Diagnosis with the tool: The Content-Encoding header is missing. The server is no longer compressing HTML. The crawl budget score dropped from 78/100 to 52/100.
Action: Re-enable gzip/brotli compression on the new server. Check your nginx/Apache configuration.
Test your pages with the tool above to identify issues specific to your site.
❓ FAQ - Frequently asked questions
Q: What is the average web page size?
A: In 2025, the median web page weighs about 2.5 MB (all resource types combined). But the HTML alone is typically between 50 KB and 500 KB. It's the HTML size that matters for Googlebot's crawl limit, not the total weight including images, CSS, and JavaScript.
Q: What happens when a page exceeds 2 MB?
A: Googlebot truncates HTML beyond 2,097,152 bytes. All content past that point is ignored for indexing. In practice: internal links, structured FAQ, SEO text at the bottom of the page are no longer considered for ranking in search results.
Q: What is crawl budget?
A: Crawl budget is the number of pages Googlebot can crawl on your site within a given time. Heavy pages with many sub-resources consume more server and network resources, reducing the total number of pages crawled. Our tool calculates a score out of 100 to evaluate the efficiency of each page.
Q: Why do sub-resources impact crawling?
A: Each sub-resource (script, CSS, image, font) requires an additional HTTP request. Googlebot has a limited crawl capacity per domain. A page loading 80+ sub-resources consumes far more budget than one loading 20. Third-party resources add latency and external dependencies.
Q: What is a client-side redirect?
A: It's a redirect performed by the browser via a meta refresh tag or JavaScript (window.location). Unlike HTTP redirects (301, 302), Googlebot does not always follow them. If your only redirect is client-side, the destination page may never be indexed.
Q: Does the tool check the robots.txt file?
A: Yes. The tool automatically fetches the domain's robots.txt and checks whether Googlebot is allowed to crawl the tested URL. It also detects crawl-delay and declared sitemaps. If robots.txt blocks the URL, a warning is displayed, but the page analysis continues so you can still see the content.
Q: Does the tool work with PDF files?
A: Yes. The tool automatically detects PDF files and adjusts the size limit: 64 MB instead of 2 MB for HTML. A PDF badge appears in the report and the HTML analysis is disabled (not applicable to PDFs).
Q: What is the content fingerprint (hash) for?
A: The tool generates a SHA-256 hash of the page content. This fingerprint lets you detect whether content has changed between two analyses, or whether the mobile and desktop versions serve identical content. Useful for monitoring unintended changes after a deployment.
Q: Why choose Googlebot smartphone over desktop?
A: Google has used mobile-first indexing since 2019: the smartphone version of your page is indexed first. Test with the smartphone User-Agent to see exactly what Google indexes. Comparison mode lets you verify that both versions are consistent.
Q: Does gzip/brotli compression count toward the 2 MB limit?
A: No. The 2 MB limit applies to decompressed HTML. A 3 MB HTML file compressed to 500 KB during network transfer will still be truncated at 2 MB once decompressed by Googlebot. Compression improves transfer speed but does not bypass the size limit.
Q: Why compare the mobile and desktop versions?
A: Google has used mobile-first indexing since 2019: the smartphone version is indexed first. If your mobile version serves different content (less text, missing FAQ, missing links), your rankings suffer. Comparison mode detects these differences and classifies them by severity.
Q: How do I reduce my web page size?
A: The most effective actions:
- Remove unnecessary inline CSS/JS - Move them to external files
- Enable compression - gzip or brotli at the server level
- Minify HTML - Remove whitespace and comments
- Externalize SVGs - Replace inline SVGs with <img> tags
- Lazy loading - Load heavy content on demand
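For minification, even a naive pass shows the idea. This sketch strips comments and inter-tag whitespace; it would mangle pre blocks and inline scripts, so use a real minifier in production:

```python
import re

def minify_html(html: str) -> str:
    """Naive minifier: removes comments and whitespace between tags.
    Illustration only -- not safe for <pre>, <textarea>, or inline JS."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    html = re.sub(r">\s+<", "><", html)
    return html.strip()

page = """
<html>  <!-- header comment -->
  <body>
    <h1>Title</h1>
  </body>
</html>
"""
print(minify_html(page))  # <html><body><h1>Title</h1></body></html>
```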
Q: What are custom HTTP headers for?
A: Custom headers let you test specific configurations: send a cookie to access a protected site, simulate a particular Accept-Language header, or reproduce CDN conditions. The tool displays the headers sent in the report for confirmation.
Complementary tools
| Tool | Purpose |
|---|---|
| DNS Lookup | Check your domain's DNS records |
| DNS Propagation Checker | Confirm your DNS changes have propagated globally |
| Email Deliverability Audit | Analyze MX, SPF, DKIM, and DMARC for your domain |
| SPF Record Checker | Analyze and validate your SPF record |
| Hash Generator | Compute SHA-256 fingerprints to compare page content |
| Domain Redirect | Replace JavaScript redirects with proper 301/302 HTTPS redirects |
Useful resources
- Google - Crawl limits documentation (official Googlebot documentation)
- Google - Mobile-first indexing (mobile-first indexing guide)
- Google - Crawl budget management (crawl budget management for large sites)
- HTTP Archive - State of the Web (web page size statistics)
- Web.dev - Optimize Largest Contentful Paint (web performance optimization)