Robots.txt File - Fundamentals and Best Practices 2025

What is a robots.txt file?

The robots.txt file is an important technical SEO element that allows website operators to tell search engine crawlers which areas of a website may be crawled and which may not. It acts as a set of "house rules" for web crawlers (well-behaved bots follow it voluntarily; it is not an access control mechanism) and is a central component of technical SEO.

Basic Functions

The robots.txt file fulfills several important functions:

  1. Crawler Access Control: Determines which directories and files may be crawled
  2. Crawl Budget Optimization: Directs crawlers to important content
  3. Server Relief: Prevents unnecessary crawling requests
  4. Sitemap Reference: Shows crawlers the location of the XML sitemap

Robots.txt Syntax and Structure

Basic Syntax

The robots.txt file follows a simple but precise syntax:

User-agent: [Crawler-Name]
Disallow: [Forbidden Path]
Allow: [Allowed Path]
Crawl-delay: [Seconds]
Sitemap: [Sitemap-URL]

User-Agent Directives

The User-Agent directive specifies which crawler the rules apply to:

User-Agent | Description              | Usage
*          | All crawlers             | Standard rules for all bots
Googlebot  | Google's main crawler    | Specific Google rules
Bingbot    | Microsoft's Bing crawler | Bing-specific rules
Slurp      | Yahoo's crawler          | Yahoo-specific rules

Disallow Directives

Disallow directives define which paths should not be crawled:

  • Disallow: / - Blocks the entire website
  • Disallow: /admin/ - Blocks the admin directory
  • Disallow: /*.pdf$ - Blocks all URLs ending in .pdf (wildcard syntax supported by Google and Bing)
  • Disallow: /private/ - Blocks the private folder

Allow Directives

Allow directives define exceptions to broader Disallow rules (for Google, the most specific, i.e. longest, matching rule wins):

  • Allow: /public/ - Allows crawling of the public folder
  • Allow: /important-page.html - Allows specific page

Best Practices for Robots.txt

1. File Placement

The robots.txt file must be placed in the root directory of the host; crawlers only request it there (a quick accessibility check is sketched below):

  • Correct: https://example.com/robots.txt
  • Ignored: https://example.com/subfolder/robots.txt - a robots.txt in a subdirectory is not read by crawlers
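
A quick way to confirm that the file is actually served from the host root is a plain HTTP request. The following is a minimal sketch using Python's standard library; the domain is a placeholder:

import urllib.error
import urllib.request

def robots_txt_status(host: str) -> int:
    """Return the HTTP status code for the robots.txt at the host root."""
    url = f"https://{host}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code

# Placeholder domain; a 200 response means crawlers can retrieve the file.
print(robots_txt_status("example.com"))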

2. File Size and Format

Aspect             | Recommendation           | Reasoning
File Size          | Max. 500 KB              | Google only processes the first 500 KiB of the file
Character Encoding | UTF-8                    | Support for international characters
Line Endings       | Unix (LF)                | Consistency with web standards
Blank Lines        | Only between rule groups | Blank lines separate User-agent groups

3. Crawl-Delay Optimization

Crawl-Delay directives can reduce server load by asking crawlers to pause between requests. Note that Googlebot ignores Crawl-delay; Bing and some other crawlers respect it. A short parsing sketch follows the recommended values below:

User-agent: *
Crawl-delay: 1

Recommended Values:

  • Small websites: 0-1 seconds
  • Large websites: 1-2 seconds
  • E-Commerce: 2-5 seconds
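
To confirm how a parser reads such a value, here is a minimal sketch using Python's standard library urllib.robotparser (Python 3.6+); the rules are inlined only for the example:

import urllib.robotparser

rules = """
User-agent: *
Crawl-delay: 1
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
# crawl_delay() returns the delay in seconds for the given user agent, or None if unset.
print(parser.crawl_delay("*"))  # 1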

4. Sitemap Integration

Always reference the XML sitemap in robots.txt:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml
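
Crawlers and audit scripts can read these Sitemap entries directly. The following minimal sketch uses Python's urllib.robotparser (site_maps() requires Python 3.8+); the URL is a placeholder:

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder URL
parser.read()  # fetches and parses the live file
# site_maps() returns the list of declared Sitemap URLs, or None if there are none.
print(parser.site_maps())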

Common Robots.txt Weaknesses

1. Syntax Errors

Error             | Correct            | Problem
User-Agent: *     | User-agent: *      | Inconsistent capitalization; lowercase "agent" is the convention
Disallow: /folder | Disallow: /folder/ | Without the trailing slash the rule also matches /folder-name and /folder.html
Allow: /folder    | Allow: /folder/    | Keep Allow paths consistent with the corresponding Disallow rules

2. Logical Errors

Problem: Contradictory Rules

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Solution: List the more specific Allow rule first. For Google the order does not matter (the longest matching rule wins), but parsers that evaluate rules top-down need the exception before the broader Disallow:

User-agent: *
Allow: /admin/public/
Disallow: /admin/
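
The effect of this ordering can be verified with an order-based parser. Python's urllib.robotparser evaluates rules top-down using prefix matching, so with the Allow exception listed first it returns the intended results; a minimal sketch with the rules inlined:

import urllib.robotparser

rules = """
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
print(parser.can_fetch("*", "https://example.com/admin/public/page.html"))  # True
print(parser.can_fetch("*", "https://example.com/admin/settings.html"))     # False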

3. Excessive Restrictions

Avoid:

User-agent: *
Disallow: /

Better:

User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /temp/

Robots.txt for Different Website Types

E-Commerce Websites

User-agent: *
Allow: /products/
Allow: /categories/
Disallow: /checkout/
Disallow: /cart/
Disallow: /user/
Disallow: /admin/
Disallow: /search?*
Disallow: /filter?*
Sitemap: https://shop.example.com/sitemap.xml

Blog Websites

User-agent: *
Allow: /posts/
Allow: /categories/
Allow: /tags/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /?s=
Disallow: /search/
Sitemap: https://blog.example.com/sitemap.xml

Corporate Websites

User-agent: *
Allow: /about/
Allow: /services/
Allow: /contact/
Disallow: /internal/
Disallow: /drafts/
Disallow: /test/
Sitemap: https://company.example.com/sitemap.xml

Testing and Validation

1. Google Search Console

Google Search Console provides an integrated robots.txt report (the successor to the retired robots.txt Tester):

  1. Open the robots.txt report in Search Console
  2. Check whether the file was fetched successfully
  3. Review the errors and warnings that are reported
  4. Fix the issues and re-check the file

2. Online Validation Tools

Recommended Tools:

  • Google Search Console robots.txt report
  • Screaming Frog SEO Spider
  • Ryte Website Checker
  • SEMrush Site Audit

3. Manual Tests

Test Checklist (a scripted spot-check follows this list):

  • [ ] File is accessible at /robots.txt
  • [ ] Syntax is correct
  • [ ] No 404 errors
  • [ ] Sitemap URLs work
  • [ ] Crawl-delay is appropriate
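
Several of these checks can be scripted. The following is a minimal sketch using Python's standard library (Python 3.8+ for site_maps()); the domain is a placeholder and the checks are deliberately simple:

import urllib.error
import urllib.request
import urllib.robotparser

host = "example.com"  # placeholder domain
robots_url = f"https://{host}/robots.txt"

# 1. The file is accessible and does not return an HTTP error.
try:
    with urllib.request.urlopen(robots_url, timeout=10) as response:
        print("robots.txt status:", response.status)
except urllib.error.HTTPError as err:
    print("robots.txt status:", err.code)

# 2. The file parses and every declared sitemap URL responds.
parser = urllib.robotparser.RobotFileParser()
parser.set_url(robots_url)
parser.read()
for sitemap_url in parser.site_maps() or []:
    try:
        with urllib.request.urlopen(sitemap_url, timeout=10) as response:
            print(sitemap_url, "->", response.status)
    except urllib.error.HTTPError as err:
        print(sitemap_url, "->", err.code)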

Advanced Robots.txt Techniques

1. Wildcard Usage

User-agent: *
Disallow: /private*
Disallow: /*.pdf$
Disallow: /temp/
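
In these patterns, * matches any sequence of characters and $ anchors the end of the URL. Python's urllib.robotparser does not understand these wildcards, so the following simplified sketch translates a pattern into a regular expression purely to illustrate the matching behaviour; it is not Google's exact matcher:

import re

def wildcard_rule_to_regex(path: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern with * and $ into a regex (simplified)."""
    anchored = path.endswith("$")
    core = path[:-1] if anchored else path
    body = "".join(".*" if char == "*" else re.escape(char) for char in core)
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(wildcard_rule_to_regex("/*.pdf$").match("/reports/q1.pdf")))       # True
print(bool(wildcard_rule_to_regex("/*.pdf$").match("/reports/q1.pdf?v=2")))   # False
print(bool(wildcard_rule_to_regex("/private*").match("/private-area/page")))  # True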

2. Specific Crawler Rules

User-agent: Googlebot
Allow: /important-content/
Disallow: /admin/

User-agent: Bingbot
Crawl-delay: 2
Disallow: /admin/
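
How a parser selects the matching group can be checked programmatically. A minimal sketch with Python's urllib.robotparser, using the rules above inlined:

import urllib.robotparser

rules = """
User-agent: Googlebot
Allow: /important-content/
Disallow: /admin/

User-agent: Bingbot
Crawl-delay: 2
Disallow: /admin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
# Each crawler is matched against its own User-agent group.
print(parser.can_fetch("Googlebot", "https://example.com/important-content/page"))  # True
print(parser.can_fetch("Bingbot", "https://example.com/admin/"))                    # False
print(parser.crawl_delay("Bingbot"))                                                # 2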

3. Sitemap Index Integration

Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml

Monitoring and Maintenance

1. Regular Review

Weekly Tasks:

  • Check crawling errors in GSC
  • Evaluate new directories for blocking needs
  • Validate sitemap URLs

Monthly Reviews (a change-detection sketch follows this list):

  • Complete robots.txt analysis
  • Crawl budget optimization
  • Measure performance impact
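
Between these scheduled reviews, unintended edits can be caught by comparing the live file against a stored baseline. The following is a minimal sketch; the URL and the baseline path are placeholders:

import urllib.request
from pathlib import Path

ROBOTS_URL = "https://example.com/robots.txt"  # placeholder URL
BASELINE = Path("robots_baseline.txt")         # placeholder local copy

with urllib.request.urlopen(ROBOTS_URL, timeout=10) as response:
    live = response.read().decode("utf-8")

if not BASELINE.exists():
    BASELINE.write_text(live, encoding="utf-8")  # first run: store the baseline
elif live != BASELINE.read_text(encoding="utf-8"):
    print("robots.txt has changed - review the difference before the next crawl")
else:
    print("robots.txt unchanged")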

2. Change Management

When making website changes:

  1. Evaluate new directories
  2. Update robots.txt
  3. Perform testing
  4. Verify the updated file in Google Search Console

Related Topics