robots.txt vs noindex: Which One Should You Use (And When)

May 18, 2026 · 9 min read

Quick answer: robots.txt blocks crawlers from visiting a page. noindex lets crawlers visit but tells them not to add the page to search results. To keep a page out of Google, use noindex. robots.txt alone doesn't prevent indexing — Google can still index a URL it hasn't crawled if other sites link to it.

I once used robots.txt to hide a staging site from Google. Three months later, I searched my domain and found the staging URL sitting right there in the results — with a title that said "Staging — DO NOT INDEX" pulled from an inbound link. Google hadn't crawled the page, but it indexed the URL anyway based on the anchor text from a colleague's blog post.

That's the fundamental difference most people miss: blocking crawling and blocking indexing are two separate things. And using the wrong one has real consequences.

What robots.txt Actually Does

robots.txt is a text file at your domain root (example.com/robots.txt) that tells crawlers which URLs they're allowed to request. It controls access, not indexing.

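# Default rules for every crawler without its own group
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/

# Googlebot follows only this group, so shared rules are repeated here
User-agent: Googlebot
Allow: /api/public/
Disallow: /api/
Disallow: /admin/
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml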

When Googlebot sees Disallow: /admin/, it won't send HTTP requests to any URL under /admin/. It won't crawl those pages, won't see the content, won't follow links on them. The pages still exist — the server still serves them — but Googlebot agrees not to look.

Key facts about robots.txt:

It's a suggestion, not a firewall. Well-behaved bots (Google, Bing, Yandex) respect it. Malicious bots and scrapers ignore it completely.
It saves crawl budget. If you have 50,000 pages and Google allocates 10,000 crawls per day, blocking /api/ and /admin/ means those crawls go to pages you actually want indexed.
It does not hide content. Don't put passwords, API keys, or sensitive URLs in robots.txt. The file itself is public — anyone can read yoursite.com/robots.txt.
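It applies per user-agent group. User-agent: * is the fallback for crawlers that have no group of their own. A crawler that matches a named group (like Googlebot) follows only that group and ignores the * rules, so any shared rules must be repeated in each named group.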
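To see how group matching plays out, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules and the bot name SomeOtherBot are made up for illustration. This parser only approximates real crawlers (plain prefix matching, first matching rule wins), so treat the results as indicative rather than as Googlebot's exact behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the Googlebot group does NOT repeat the /admin/ block.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser would let user_agent request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://example.com"
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/admin/", "/api/users", "/api/public/docs", "/blog/post"):
            verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, base + path) else "blocked"
            print(f"{agent:13} {path:18} {verdict}")
```

In this sketch Googlebot is allowed to request /admin/ because its own group never mentions it, while SomeOtherBot falls back to the * group and is blocked.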

What noindex Actually Does

The noindex directive tells search engines: "You can crawl this page, but don't put it in search results." It's a meta tag or HTTP header on individual pages.

As a meta tag (in the HTML <head>):

<meta name="robots" content="noindex">

As an HTTP header (for non-HTML files like PDFs):

X-Robots-Tag: noindex

When Googlebot visits a page with noindex, it reads the content, follows links (unless you also add nofollow), but removes the page from — or never adds it to — the search index.

Key facts about noindex:

It requires crawling. The crawler must visit the page to see the noindex tag. If you block crawling with robots.txt, the crawler never sees the noindex, so it can't obey it.
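It's per-page. The meta tag goes on individual pages, which gives precise control over what appears in search results. (The X-Robots-Tag header can be set for a whole path in your server config if you need broader coverage.)
It preserves link equity, at least at first. Unless you add nofollow, Google still follows links on the page and passes PageRank through them, so a noindexed page can still help other pages rank. Google has said that pages left noindexed for a long time may eventually have their links treated as nofollow, so don't rely on this indefinitely.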
It can take days to weeks to remove a page that was previously indexed. Google recrawls and processes the directive on its next visit, which depends on your site's crawl frequency.
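If you want to check what a crawler would find on a page, a rough sketch like the one below can help. It uses only Python's standard library. The function name has_noindex and the way headers are passed in as a plain dict are assumptions for this example, and it only looks at the generic robots meta tag and the simple comma-separated X-Robots-Tag form, not bot-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str, headers: dict) -> bool:
    """Return True if the meta tag or the X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives.update(part.strip().lower() for part in value.split(","))
    return "noindex" in directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex(page, {}))                              # meta tag case
    print(has_noindex("", {"X-Robots-Tag": "noindex"}))       # header case (e.g. a PDF)
    print(has_noindex("<html><head></head></html>", {}))      # no directive
```

Remember that a check like this only reflects what a crawler sees if the URL is not disallowed in robots.txt.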

The Comparison: robots.txt vs noindex vs Both

Here's the complete breakdown:

Scenario	robots.txt Disallow	noindex Meta Tag	Both Together
Crawler visits the page	No	Yes	No
Page appears in search results	Possibly (URL-only listing)	No	Possibly (URL-only)
Links on the page are followed	No (page not crawled)	Yes (unless nofollow added)	No (page not crawled)
Crawl budget consumed	No	Yes	No
Can be applied per-page	No (pattern-based)	Yes	N/A
Removes already-indexed pages	No	Yes (on next crawl)	No (can't see the noindex)
Blocks URL from appearing at all	No	Yes	No
Controls non-Google bots	Partially (depends on bot)	Partially (depends on bot)	Partially

The critical row is the second one. robots.txt alone can still result in the URL appearing in search results — Google shows it as a "URL-only" listing with no snippet or description. This happens when other sites link to the blocked URL. Google knows the URL exists from those external links, and since it can't crawl the page to find a noindex tag, it may list the URL anyway.

This is the #1 mistake: blocking a page in robots.txt and assuming it won't appear in Google. It can and does.
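The "Page appears in search results" row can be modeled as a small function. This is a deliberately simplified sketch of the table above, not a description of Google's actual pipeline. The function name and the linked_from_elsewhere flag are made up for illustration.

```python
def search_outcome(disallowed: bool, noindex: bool, linked_from_elsewhere: bool) -> str:
    """Simplified model of whether a URL can show up in search results."""
    if disallowed:
        # The crawler never fetches the page, so any noindex on it goes unseen.
        if linked_from_elsewhere:
            return "may appear as a URL-only listing"
        return "unlikely to appear (no links to discover the URL)"
    if noindex:
        # The page is crawled, the directive is read, and the page is excluded.
        return "kept out of search results"
    return "eligible for indexing"


if __name__ == "__main__":
    # robots.txt alone, with an inbound link: the staging-site story from the intro.
    print(search_outcome(disallowed=True, noindex=False, linked_from_elsewhere=True))
    # Both together: the noindex is invisible, so the result is the same.
    print(search_outcome(disallowed=True, noindex=True, linked_from_elsewhere=True))
    # noindex alone on a crawlable page.
    print(search_outcome(disallowed=False, noindex=True, linked_from_elsewhere=True))
```

The first two calls return the same answer, which is the point: once crawling is blocked, adding noindex changes nothing.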

When to Use robots.txt

Use robots.txt for pages that don't need to be in search results and where saving crawl budget matters:

API endpoints. /api/v1/users, /api/v2/products — these return JSON, not HTML. Crawling them wastes budget and can trigger rate limits.
Admin panels. /admin/, /wp-admin/, /dashboard/ — these require authentication anyway. Blocking crawling keeps bots from hitting your auth pages thousands of times.
Search/filter pages with infinite combinations. /products?color=red&size=m&sort=price&page=3 — these faceted URLs can create millions of crawlable paths. Block the patterns to prevent crawl traps.
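Development/staging assets. /tmp/, /node_modules/ and similar server-side paths should never be requested by external bots. Don't block /_next/ as a whole, though: /_next/static/ holds the CSS and JavaScript Google needs to render your pages.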
Large media directories. If you host 100,000 images at /media/, blocking that path saves crawl budget for your actual content pages.
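Blocking faceted URLs usually means writing wildcard rules, and it is easy to get a pattern subtly wrong. Major crawlers such as Googlebot support * (any run of characters) and a trailing $ (end of URL) in robots.txt paths. The sketch below translates such a pattern into a regular expression so you can test candidate rules against sample URLs. The function name and the example patterns are assumptions for illustration, and it ignores details like Allow/Disallow precedence.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check whether a robots.txt path pattern matches a URL path + query.

    Supports '*' (any sequence of characters) and a trailing '$' (end of URL).
    Without '$', the pattern is treated as a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then join them with '.*' where '*' appeared.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


if __name__ == "__main__":
    faceted = "/products?color=red&size=m&sort=price&page=3"
    print(rule_matches("/products?*sort=", faceted))   # hypothetical rule for sorted listings
    print(rule_matches("/*?", faceted))                # any URL with a query string
    print(rule_matches("/*?", "/products"))            # clean URL, no query string
    print(rule_matches("/*.pdf$", "/files/guide.pdf")) # URLs ending in .pdf
```

Running candidate patterns against a handful of real URLs from your logs before deploying them is a cheap way to avoid blocking pages you meant to keep crawlable.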

Use the meta tag generator to build proper meta robots tags, and the .htaccess generator to set up server-level redirects alongside your robots.txt rules.

When to Use noindex

Use noindex for pages that exist on your site, may receive links, but should not appear in search results:

Thank-you / confirmation pages. /thank-you, /order-confirmation: these get linked from emails, so crawlers may discover them. Leave them crawlable so the noindex tag can be read.
Thin content pages. Tag pages, author archives, or paginated pages beyond page 1 that don't add unique value to search results.
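Duplicate content. If the same content exists at /blog/post and /amp/blog/post, a rel="canonical" tag pointing to the preferred URL is usually the better fix because it consolidates ranking signals. Use noindex only when the duplicate should never surface in search at all.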
Internal search results. /search?q=blue+widget — Google doesn't want search results in its search results. These pages add no value and dilute your index.
Legal/policy pages you don't want ranking. Privacy policies, terms of service — they need to be accessible but rarely drive useful organic traffic.
Login/signup pages. They're part of the user flow but don't need to rank. noindex keeps them crawlable (so Google understands site structure) without cluttering results.

When to Use Both (And When You Shouldn't)

Using both robots.txt + noindex on the same URL is contradictory. If robots.txt blocks crawling, the bot never visits the page, never sees the noindex tag, and the noindex has no effect. Google has confirmed this explicitly: if a URL is disallowed in robots.txt, any noindex meta tag on that page is invisible to them.

There's one scenario where combining them makes sense at different levels:

Block an entire directory in robots.txt: Disallow: /internal/
Add noindex to specific pages within that directory as a safety net: in case robots.txt rules change later, the noindex still protects individual pages.

But understand that the noindex only works if you eventually remove the robots.txt block. It's a backup, not an active control.
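One way to catch this conflict is to cross-check the URLs you have tagged with noindex against your robots.txt. The sketch below does that with Python's standard-library urllib.robotparser. The function name, the sample rules, and the idea of supplying the noindexed URLs as a list are all assumptions for illustration. The parser uses simple prefix matching, so wildcard rules would need a different matcher.

```python
from urllib.robotparser import RobotFileParser


def find_hidden_noindex(robots_txt: str, noindex_urls: list, user_agent: str = "Googlebot") -> list:
    """Return the noindexed URLs that the given crawler is not allowed to fetch.

    For these URLs the noindex tag can never be read, so it has no effect.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots_txt = "User-agent: *\nDisallow: /internal/\n"
    tagged = [
        "https://example.com/internal/report",  # blocked: noindex is invisible
        "https://example.com/thank-you",        # crawlable: noindex works
    ]
    for url in find_hidden_noindex(robots_txt, tagged):
        print("noindex cannot be seen on:", url)
```

Any URL this reports is relying on robots.txt alone, whatever the page's meta tag says.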

Real-World Decision Framework

Here's the flowchart I use:

"Should this page appear in Google?"

If no, and the page is already indexed --> Use noindex (requires crawling to detect the tag)
If no, and the page was never indexed --> Use noindex (prevents future indexing)
If no, and you also want to save crawl budget --> Use noindex + consider robots.txt for the broader directory pattern, but don't block the specific page's crawling

"Should crawlers spend time on this URL?"

If no, and it's a whole directory of non-content URLs (API, admin) --> Use robots.txt
If no, but the page might get external links --> Don't use robots.txt alone (use noindex instead)

"Is this sensitive content?"

robots.txt doesn't protect anything. Use authentication, firewalls, or access control. Neither robots.txt nor noindex is a security measure.
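The three questions above can be condensed into a single function. This is only a sketch of the flowchart as I read it. The parameter names and the Action labels are made up for illustration, and real decisions often involve more nuance (canonical tags, for example).

```python
from enum import Enum


class Action(Enum):
    ACCESS_CONTROL = "authentication / access control"
    INDEX = "leave crawlable and indexable"
    ROBOTS_TXT = "robots.txt Disallow"
    NOINDEX = "noindex (and keep the page crawlable)"


def recommend(*, sensitive: bool, should_appear_in_search: bool,
              bulk_non_content: bool, may_get_external_links: bool) -> Action:
    """Map the decision framework's questions to a recommended control."""
    if sensitive:
        # Neither robots.txt nor noindex is a security measure.
        return Action.ACCESS_CONTROL
    if should_appear_in_search:
        return Action.INDEX
    if bulk_non_content and not may_get_external_links:
        # Whole directories of API or admin URLs: save the crawl budget.
        return Action.ROBOTS_TXT
    # Anything that might attract links needs noindex, not a crawl block.
    return Action.NOINDEX


if __name__ == "__main__":
    print(recommend(sensitive=False, should_appear_in_search=False,
                    bulk_non_content=True, may_get_external_links=False).value)
    print(recommend(sensitive=False, should_appear_in_search=False,
                    bulk_non_content=False, may_get_external_links=True).value)
    print(recommend(sensitive=True, should_appear_in_search=False,
                    bulk_non_content=False, may_get_external_links=False).value)
```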

A clean URL structure helps search engines understand what to prioritize. Use the slug generator to create consistent, SEO-friendly URLs that map cleanly to your robots.txt rules.

Common Mistakes That Hurt Rankings

Mistake 1: Blocking CSS/JS in robots.txt. In 2015 this was common advice. Now it's harmful. Google needs to render your pages to understand them, and blocking CSS and JavaScript can leave Googlebot with a broken or blank render. Never disallow /css/, /js/, or /_next/static/.

Mistake 2: Using robots.txt to handle duplicate content. If /products/widget and /shop/widget show the same content, blocking one in robots.txt doesn't tell Google they're duplicates. Use a rel="canonical" tag pointing to the preferred URL. This consolidates ranking signals instead of just hiding one page.

Mistake 3: Forgetting that noindex pages still consume crawl budget. Every noindexed page Googlebot visits is a crawl it could've spent on a page you do want indexed. If you have thousands of noindexed pages, that's a real crawl budget problem. For bulk patterns (like faceted search URLs), block the pattern in robots.txt instead. Keeping noindex on those pages is harmless, but remember that it only takes effect if the robots.txt block is ever lifted.

Mistake 4: Removing robots.txt blocks without adding noindex first. If you've been blocking /old-section/ in robots.txt for years and remove the block, Google will crawl and potentially index everything in that directory. Add noindex to those pages first, then remove the robots.txt rule. Google can't see the noindex while the block is in place, so the tags only take effect once it recrawls the newly unblocked pages.

FAQ

Can Google index a page that's blocked by robots.txt?

Yes. Google can index a URL without crawling it. If other websites link to a URL that's blocked by robots.txt, Google may show it in search results as a URL-only listing: just the URL and maybe anchor text from the linking page, with no snippet or title from your page. To prevent this, use noindex instead of robots.txt, and leave the page crawlable so Google can see the tag.

How long does noindex take to remove a page from Google?

It depends on how often Google crawls your site. For pages that Google visits frequently (high-authority pages, frequently updated content), the noindex can take effect within a few days. For low-priority pages that Google rarely revisits, it can take weeks to months. You can speed this up by requesting a recrawl in Google Search Console's URL Inspection tool, but there's no guarantee on timing.

Does noindex pass PageRank through links on the page?

Yes, unless you also add nofollow. A page with <meta name="robots" content="noindex"> is not indexed, but Google still follows its outbound links and passes PageRank through them. To block both indexing and link-following, use content="noindex, nofollow". Most of the time, you want noindex alone, since letting link equity flow to other pages on your site is usually beneficial. One caveat: Google has said that pages left noindexed for a long time may eventually have their links treated as nofollow.

Should I use robots.txt or noindex for my staging site?

Neither is sufficient alone. For staging sites, use real access control (HTTP basic auth or IP allowlisting) as the primary protection. Unlike robots.txt, this is enforced by the server, not a suggestion. Then add a blanket noindex to every page (<meta name="robots" content="noindex"> in the layout template) as a backup. robots.txt on its own is the weakest option because it won't prevent URL-only indexing if someone links to your staging domain.
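If your staging app runs on a Python WSGI stack, the blanket noindex can also be sent as an HTTP header so it covers non-HTML responses too. The middleware below is a minimal sketch under that assumption. The name noindex_middleware is made up, and it is meant to sit behind real access control, not replace it.

```python
def noindex_middleware(app):
    """Wrap a WSGI app so every response carries 'X-Robots-Tag: noindex'."""

    def wrapped(environ, start_response):
        def start_with_noindex(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_noindex)

    return wrapped


def demo_app(environ, start_response):
    """Tiny stand-in for a staging application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"staging page"]


staging_app = noindex_middleware(demo_app)
```

Wrapping at the WSGI layer means new routes and file downloads get the header automatically, instead of relying on every template to include the meta tag.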

Next Steps

Generate meta robots tags and Open Graph tags with the meta tag generator — set noindex, nofollow, and social metadata in one place.
Build .htaccess redirect rules alongside robots.txt with the .htaccess generator.
Create clean, crawl-friendly URL slugs with the slug generator — consistent URLs make robots.txt patterns simpler to maintain.