Quick answer
robots.txt blocks crawling, not indexing.
robots.txt vs noindex
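A Disallow rule in robots.txt stops compliant crawlers from fetching a URL, but it does not keep that URL out of the index. To prevent indexing, use noindex in a meta robots tag or an X-Robots-Tag HTTP header.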
Common causes
- Expecting a Disallow rule to remove a URL from the index.
- Treating crawl control and index control as the same thing.
How to fix
- Disallow = do not crawl.
- noindex = do not index (meta or header).
robots.txt vs noindex is a common SEO and site-operations confusion: robots.txt can block crawling, but it does not reliably remove a URL from search indexes. If a page is already known to search engines, it may still appear in results without its content being crawled. Use noindex in a meta robots tag or HTTP header when your goal is to keep a page out of the index. This checker helps developers, SEOs, and site owners understand the difference, diagnose indexing issues, and choose the right directive for private pages, duplicate content, staging environments, and low-value URLs.
How This Validator Works
This validator explains the relationship between crawl control and index control. robots.txt is a crawl directive: it tells compliant bots which paths they should not request. noindex is an indexing directive: it tells search engines not to keep a page in their index. In practice, a URL can be blocked from crawling yet still remain indexed if search engines already discovered it through links, sitemaps, or prior crawls.
- robots.txt affects crawling access.
- noindex affects index inclusion.
- nofollow affects link-following behavior, not indexing by itself.
- X-Robots-Tag can deliver noindex at the HTTP header level for non-HTML resources.
The key test is whether you want to prevent bots from fetching a page, prevent the page from appearing in search results, or both.
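The crawl side of that test can be checked locally. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, and the example.com URLs are assumptions made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/private/report.html",
        "https://example.com/public/page.html",
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A `False` result here only says the URL should not be fetched by a compliant bot. It says nothing about whether the URL is, or will stay, in a search index.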
Common Validation Errors
- Using robots.txt to “de-index” a page when the page still needs a noindex directive.
- Blocking a page in robots.txt before adding noindex, which can prevent search engines from seeing the noindex instruction.
- Expecting immediate removal from search results after changing robots rules.
- Confusing crawl suppression with index removal on duplicate, staging, or private pages.
- Leaving a robots.txt block in place on a page that carries noindex, so crawlers never fetch the page and never read the directive.
- Relying on a meta robots tag for non-HTML files such as PDFs or images, which need an X-Robots-Tag header instead.
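Several of the errors above come down to one conflict: a noindex directive on a page that crawlers are not allowed to fetch. The sketch below detects a meta robots noindex with Python's standard `html.parser` and combines it with a crawl-allowed flag. The class and function names, the sample HTML, and the result strings are hypothetical, and the check covers only the generic `robots` meta name, not bot-specific names or HTTP headers.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_meta_noindex(html: str) -> bool:
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


def diagnose(crawl_allowed: bool, has_noindex: bool) -> str:
    """Map the crawl/index combination to a short, illustrative verdict."""
    if not crawl_allowed and has_noindex:
        return "conflict: noindex cannot be read while crawling is blocked"
    if not crawl_allowed:
        return "crawl blocked: URL may still be indexed if discovered elsewhere"
    if has_noindex:
        return "crawlable with noindex: should leave the index after a recrawl"
    return "crawlable and indexable"


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(diagnose(crawl_allowed=False, has_noindex=has_meta_noindex(page)))
```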
Where This Validator Is Commonly Used
- SEO audits for pages that should not appear in search results.
- Staging and development environments where crawl and index control are both important.
- Ecommerce filters and faceted navigation that can create duplicate or low-value URLs.
- Internal search results pages that should usually stay out of public indexing.
- Private account, checkout, and admin pages where exposure in search is undesirable.
- Content migration projects when old URLs need removal or replacement handling.
Why Validation Matters
Indexing mistakes can create duplicate listings, outdated snippets, and unwanted exposure of pages that were meant to stay hidden. Correct validation helps search engines understand which URLs should be crawled, which should be indexed, and which should be excluded. That improves site hygiene, reduces confusion in analytics and reporting, and supports more predictable search visibility.
For technical teams, the distinction also matters during site launches, migrations, and content cleanup. A page that is blocked too early may never receive a noindex signal, while a page that is left indexable may continue to surface even after it is removed from navigation.
Technical Details
| Directive | Primary Purpose | Typical Location |
|---|---|---|
| robots.txt | Controls crawler access to paths | Site root text file |
| meta robots noindex | Prevents indexing of HTML pages | HTML <head> |
| X-Robots-Tag: noindex | Prevents indexing of any resource, including non-HTML files such as PDFs and images | HTTP response header |
- robots.txt is not a removal mechanism by itself.
- noindex generally requires the page to be accessible to crawlers so the directive can be seen.
- Search engines may cache or retain URLs for some time after changes.
- Sitemaps can help discovery, but they do not override noindex or robots rules.
For best results, choose the directive based on the outcome you want: crawl suppression, index exclusion, or both.
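To make the three directives concrete, here is a small sketch that produces one minimal instance of each. The helper names and the `/private/` path are assumptions for illustration; the emitted strings follow the commonly documented syntax, but real deployments often need more specific rules.

```python
def robots_txt_disallow(paths):
    """Build robots.txt text that asks all crawlers not to fetch the given paths."""
    lines = ["User-agent: *"] + [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


def meta_robots_noindex():
    """Return the tag to place inside the HTML <head> of a page."""
    return '<meta name="robots" content="noindex">'


def x_robots_tag_noindex():
    """Return the (name, value) pair to send as an HTTP response header."""
    return ("X-Robots-Tag", "noindex")


if __name__ == "__main__":
    print(robots_txt_disallow(["/private/"]))
    print(meta_robots_noindex())
    print("%s: %s" % x_robots_tag_noindex())
```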
FAQ
Does robots.txt remove a page from Google?
Not by itself. robots.txt tells compliant crawlers not to fetch a URL, but it does not function as a reliable de-indexing instruction. If a URL is already known, it may still appear in search results without a snippet or with limited information. To keep a page out of the index, use noindex or an equivalent indexing control.
Should I block a page in robots.txt if I want it de-indexed?
Usually not as the first step. If you block the page before search engines can see a noindex directive, they may not be able to process the instruction. In many cases, the safer approach is to allow crawling long enough for noindex to be seen, then block crawling later if needed for crawl-budget reasons.
What is the difference between noindex and disallow?
Disallow in robots.txt prevents crawling of a path. noindex tells search engines not to include a page in their index. They solve different problems. A disallowed page may still be indexed if discovered elsewhere, while a noindex page can be crawled but should not remain indexed once the directive is processed.
Can I use noindex on PDFs or images?
Yes, but not with an HTML meta tag. For non-HTML files, the usual method is the X-Robots-Tag HTTP header. This lets servers send indexing instructions for PDFs, images, and other resources that do not have an HTML <head> section.
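As a rough illustration, the header can be attached at the application layer. The sketch below wraps a WSGI application and appends `X-Robots-Tag: noindex` to responses for paths that end in selected file extensions. The suffix list, the wrapper name, and `demo_app` are hypothetical; many sites set this header in web server or CDN configuration instead.

```python
# Assumed set of non-HTML extensions to keep out of the index.
NON_HTML_SUFFIXES = (".pdf", ".png", ".jpg")


def add_noindex_header(app):
    """Wrap a WSGI app so matching paths are served with X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.lower().endswith(NON_HTML_SUFFIXES):
                # Copy the list so the inner app's headers are not mutated.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Tiny stand-in application used only to exercise the wrapper."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"ok"]
```

The wrapper only helps if crawlers can request the file. If the same path is disallowed in robots.txt, the header is never seen.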
Why is my page still showing in search after adding noindex?
Search engines may need time to recrawl the page and process the directive. If the page is blocked by robots.txt, the crawler may not see the noindex instruction at all. Also, search results can persist temporarily due to caching, external links, or delayed recrawling.
Can a page be crawled and not indexed?
Yes. That is one of the main uses of noindex. A crawler can access the page, read the directive, and then choose not to keep it in the index. This is often useful for duplicate pages, thin utility pages, and temporary content that should remain accessible but not searchable.
Is robots.txt enough for private content?
No. robots.txt is not a security control and should not be used to protect sensitive content. If content must be private, use authentication, authorization, or server-side access controls. robots.txt only guides crawlers; it does not prevent users from accessing a URL directly if they know it.
What is the safest setup for staging sites?
For staging environments, use access control first so the site is not publicly reachable. You can also add noindex as an extra layer, but do not rely on robots.txt alone. If the environment is accessible, search engines may still discover URLs through links or other sources.
Does noindex stop crawling completely?
No. noindex is about index exclusion, not crawl blocking. Search engines may still crawl the page to confirm the directive or refresh their understanding of the URL. If you also need to reduce crawling, combine noindex with careful robots rules after the directive has been seen.
Related Validators & Checkers
- robots.txt validator
- meta robots tag checker
- X-Robots-Tag header checker
- canonical tag validator
- XML sitemap validator
- HTTP header checker
- URL indexing checker
- duplicate content checker
FAQ
- Does Disallow remove a page from the index?
- No; it only blocks crawling.
- How do I prevent a page from being indexed?
- Use noindex in a meta robots tag or an X-Robots-Tag header.