WordPress And The robots.txt File: What Actually Belongs

By The RevealTheme Team · Updated May 27, 2026 · 5 min read

WordPress serves a robots.txt file that tells search engine crawlers which URLs they may request. Unless a physical robots.txt exists in the site root, WordPress generates the file dynamically. The default content is generic and usually fine, but specific situations need adjustments. A misconfigured robots.txt either over-restricts (preventing legitimate crawling and indexing) or under-restricts (letting crawlers waste budget on unimportant pages).

Understanding what robots.txt does and doesn't do is the foundation for getting it right. The file is simple but its effects compound across millions of crawl decisions.

What robots.txt does

The file lists user agents (crawler types) and rules for what they can access. The standard syntax:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

The directives:

  • User-agent: which crawler the rule applies to. * means all crawlers.
  • Disallow: which URLs the crawler should not access.
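  • Allow: explicit permission for matching URLs. When an Allow and a Disallow both match, Google and other crawlers that follow RFC 9309 apply the most specific (longest) rule, regardless of the order in which the rules appear.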
  • Sitemap: location of the XML sitemap.
  • Crawl-delay: minimum time between requests from the crawler. It is a nonstandard directive: Bing honors it, Googlebot ignores it.
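
How Allow and Disallow interact is easier to see in code. The following is a minimal sketch of the longest-match rule described in RFC 9309, assuming plain path prefixes with no wildcards; the function name is_allowed and the RULES list are illustrative, not part of WordPress or any crawler.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled under one user-agent group.

    `rules` is a list of (directive, prefix) tuples, where directive is
    "allow" or "disallow". The longest matching prefix wins; on a tie,
    Allow wins. A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            length = len(prefix)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len = length
                allowed = directive == "allow"
    return allowed


# The two path rules from the example above.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(RULES, "/wp-admin/options.php"))     # False
print(is_allowed(RULES, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(RULES, "/blog/hello-world/"))        # True
```

Swapping the order of the two entries in RULES gives the same results, which is the point: specificity decides, not position.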

What robots.txt doesn't do

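It doesn't prevent URLs from being indexed if they're linked from elsewhere. Google can index URLs it can't crawl by using the link anchor text and other signals. The right way to prevent indexing is the noindex meta tag (or the X-Robots-Tag HTTP header), not robots.txt. The page must remain crawlable for this to work: a crawler that is blocked by robots.txt never fetches the page and so never sees the noindex.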
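
To confirm that a page actually carries a noindex signal, a small check along these lines can help. It is a rough sketch using only the Python standard library; has_noindex and RobotsMetaParser are hypothetical names, and the header handling is simplified (it ignores crawler-specific forms such as "googlebot: noindex").

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html, headers=None):
    """Return True if the HTML or an X-Robots-Tag header says noindex."""
    header_tokens = set()
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            header_tokens.update(t.strip().lower() for t in value.split(","))
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "noindex" in header_tokens


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))                                # True
print(has_noindex("", {"X-Robots-Tag": "noindex"}))     # True
print(has_noindex('<meta name="robots" content="index, follow">'))  # False
```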

It doesn't provide security. The file is publicly accessible. Anyone can read it. Listing sensitive URLs in Disallow tells potential attackers exactly where to look.

It doesn't apply to misbehaving crawlers. The standard is voluntary. Well-behaved crawlers (Googlebot, Bingbot, legitimate research bots) respect it. Malicious crawlers ignore it.

It doesn't affect users. Robots.txt is for crawlers, not browsers. Users can access any URL regardless of robots.txt.

The default WordPress robots.txt

WordPress generates a default robots.txt that includes:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

The defaults block crawler access to the admin (sensible; nothing useful there for crawlers) and allow access to admin-ajax.php (which some plugins use for frontend functionality).

For most WordPress sites, the defaults are acceptable. Specific situations need adjustments.

What to add for content sites

Most content sites should add:

Sitemap: https://example.com/sitemap.xml

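The sitemap reference helps crawlers find the sitemap. This matters most for Bing and other crawlers that never see a sitemap submitted through Google Search Console. WordPress 5.5 and later already append a Sitemap line for the core sitemap (/wp-sitemap.xml) on public sites; if an SEO plugin generates the sitemap instead, point the line at that plugin's sitemap URL.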

Optionally, add specific disallows for URLs you don't want crawled:

Disallow: /search/
Disallow: /?s=
Disallow: /author/

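The above blocks crawling of pretty-permalink search results (/search/), query-string searches (/?s=), and author archive pages. Rules match from the start of the URL path, so /?s= covers only searches run from the site root. None of these typically deserve crawl budget.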

What to add for e-commerce sites

WooCommerce sites benefit from blocking crawler access to specific paths:

Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
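Disallow: /*add-to-cart=

These URLs are user-specific and shouldn't be in the index. Blocking crawlers from them preserves crawl budget for actual product and category pages. The add-to-cart rule uses a wildcard because WooCommerce add-to-cart links are query strings (for example ?add-to-cart=123), not a /add-to-cart/ path; Google and Bing both support the * wildcard.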

Don't block /product/ or /product-category/ URLs; those are the URLs you want indexed.
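
Before shipping e-commerce rules, it helps to test them against real URLs from the store. The sketch below translates a robots.txt path pattern into a regular expression, assuming the wildcard semantics documented by Google and RFC 9309 (* matches any run of characters, a trailing $ anchors the end). The helper names and sample URLs are illustrative; the query-string form ?add-to-cart=123 is how WooCommerce add-to-cart links typically look.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, url_path):
    """Return True if the pattern matches from the start of url_path."""
    return pattern_to_regex(pattern).match(url_path) is not None


# A plain path rule does not catch a query-string add-to-cart link.
print(matches("/add-to-cart/", "/shop/?add-to-cart=123"))   # False
# A wildcard rule does.
print(matches("/*add-to-cart=", "/shop/?add-to-cart=123"))  # True
# Product pages stay crawlable under both rules.
print(matches("/*add-to-cart=", "/product/blue-shirt/"))    # False
print(matches("/cart/", "/cart/"))                          # True
```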

What to add for membership sites

Member-only content has URLs that the public can't see anyway. Blocking crawlers from them is consistent with their not being public:

Disallow: /members/
Disallow: /account/
Disallow: /login/

For paywalled content where a teaser is public and the full content requires membership, the structure matters. The public teaser should be crawlable; the full member content should be restricted. Use specific paths rather than blanket disallows.

What to add for sites with extensive admin functions

Sites with custom admin functionality might have additional admin-like paths that should be blocked:

Disallow: /dashboard/
Disallow: /admin-tools/
Disallow: /reports/

The paths depend on what the site does. The principle is consistent: crawlers don't need access to admin-style URLs.

The mistakes to avoid

Don't block /wp-content/. Plugin and theme CSS/JS lives there. If crawlers can't access /wp-content/, they can't see how the page actually renders. Google specifically wants to see the rendered page, and blocking /wp-content/ prevents proper rendering analysis.

Don't block /wp-includes/. WordPress core JavaScript lives there. Blocking it has similar effects to blocking /wp-content/.

Don't use robots.txt to "hide" sensitive URLs. The robots.txt file is public; listing sensitive URLs is the opposite of hiding them. Use authentication for genuinely sensitive content.

Don't add Crawl-delay aggressively. The directive is meant to prevent server overload. For a crawler that honors it, such as Bingbot, a Crawl-delay of 10 seconds allows at most 6 URLs per minute, which slows indexing dramatically. Googlebot ignores the directive entirely, so it does nothing to throttle Google. Use it only if you genuinely need to throttle crawling.
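
The arithmetic behind that warning is worth making explicit. This is a small sketch for a crawler that honors Crawl-delay and fetches one URL at a time; the 50,000-URL site is a made-up figure for illustration.

```python
import math


def max_fetches(crawl_delay_seconds, window_seconds):
    """Upper bound on fetches in a time window at one request per delay."""
    return int(window_seconds // crawl_delay_seconds)


per_minute = max_fetches(10, 60)           # 6
per_day = max_fetches(10, 24 * 60 * 60)    # 8640

# Hypothetical site size, used only to show the scale of the slowdown.
site_urls = 50_000
days_for_full_crawl = math.ceil(site_urls / per_day)  # 6

print(per_minute, per_day, days_for_full_crawl)
```

At a 10-second delay, a single full pass over a site of that size takes roughly six days, before any recrawling of updated pages.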

Don't block crawlers entirely. A "User-agent: *" line followed by "Disallow: /" prevents all crawling and, over time, effectively removes the site from search results. The directive sometimes ends up on staging sites and gets forgotten when the site moves to production. Always check robots.txt after major site changes.
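
A leftover staging rule is easy to catch automatically, for example in a deploy check. The function below is a rough heuristic rather than a full robots.txt parser: it only reports whether the group for a given user agent contains a bare "Disallow: /", and it does not weigh any Allow rules that might partly offset it. The name blocks_everything is hypothetical.

```python
def blocks_everything(robots_txt, agent="*"):
    """Return True if the group for `agent` contains a bare 'Disallow: /'."""
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:
                # A User-agent line after rule lines starts a new group.
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/" and agent in current_agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /\n"
default = "User-agent: *\nDisallow: /wp-admin/\nAllow: /wp-admin/admin-ajax.php\n"

print(blocks_everything(staging))  # True
print(blocks_everything(default))  # False
```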

The verification

After changes, verify the file is working correctly:

1. Visit /robots.txt directly. The file should load and show the expected content.

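2. Open the robots.txt report in Google Search Console (under Settings). It replaced the older robots.txt Tester and shows which robots.txt files Google fetched, when, and any parse warnings. To check whether a specific URL is blocked, run it through the URL Inspection tool.

3. Check the Page indexing report in Search Console. If important URLs are listed as "Blocked by robots.txt," investigate.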

The Search Console reports take a few days to reflect changes. Be patient when verifying.
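
The first check can also be scripted so it runs after every deploy. The sketch below, standard library only, fetches the file and summarizes it: Sitemap lines, Disallow values, and any Disallow that touches /wp-content or /wp-includes. The names summarize and fetch_robots are illustrative, and https://example.com is a placeholder.

```python
from urllib.request import urlopen

# Directories that hold CSS/JS crawlers need for rendering.
RISKY_PREFIXES = ("/wp-content", "/wp-includes")


def summarize(robots_txt):
    """Collect Sitemap lines and Disallow values, and flag risky disallows."""
    sitemaps, disallows = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "sitemap":
            sitemaps.append(value)
        elif key == "disallow" and value:
            disallows.append(value)
    risky = [d for d in disallows if d.startswith(RISKY_PREFIXES)]
    return {"sitemaps": sitemaps, "disallows": disallows, "risky": risky}


def fetch_robots(site_url):
    """Download /robots.txt for a site such as 'https://example.com'."""
    with urlopen(site_url.rstrip("/") + "/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


sample = (
    "User-agent: *\n"
    "Disallow: /wp-admin/\n"
    "Disallow: /wp-content/plugins/\n"
    "Allow: /wp-admin/admin-ajax.php\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
report = summarize(sample)
print(report["sitemaps"])  # ['https://example.com/sitemap.xml']
print(report["risky"])     # ['/wp-content/plugins/']
```

Against a live site, summarize(fetch_robots("https://example.com")) produces the same report; an empty "sitemaps" list or a non-empty "risky" list is a prompt to look at the file by hand.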

The honest framing

For most WordPress sites in 2026, robots.txt needs minimal customization. The default plus a sitemap reference is often sufficient.

The customization matters when: the site has specific URL patterns that shouldn't be crawled (e-commerce account pages, member areas, internal admin tools), when crawl budget is constrained (very large sites where every crawl matters), or when specific content categories shouldn't be crawled (legal review pages, internal documentation that lives on the same domain). If such content must stay out of search results entirely, use noindex or authentication instead, as described above.

The danger isn't in customizing aggressively; it's in customizing without understanding the effects. A single misconfigured Disallow can cut crawlers off from significant portions of a site. Verify each change in Search Console (the robots.txt report and the URL Inspection tool) before relying on it.