Robots.txt File - Fundamentals and Best Practices 2025

What is a robots.txt file?

The robots.txt file is an important technical SEO element that allows website operators to tell search engine crawlers which areas of a website may be crawled and which may not. It acts as a set of "house rules" for web crawlers (well-behaved bots follow it voluntarily; it is not an access control mechanism) and is a central component of technical SEO.

Basic Functions

The robots.txt file fulfills several important functions:

  1. Crawler Access Control: Determines which directories and files may be crawled
  2. Crawl Budget Optimization: Directs crawlers to important content
  3. Server Relief: Prevents unnecessary crawling requests
  4. Sitemap Reference: Shows crawlers the location of the XML sitemap

Robots.txt Syntax and Structure

Basic Syntax

The robots.txt file follows a simple but precise syntax:

User-agent: [Crawler-Name]
Disallow: [Forbidden Path]
Allow: [Allowed Path]
Crawl-delay: [Seconds]
Sitemap: [Sitemap-URL]

User-Agent Directives

The User-Agent directive specifies which crawler the rules apply to:

User-Agent | Description              | Usage
*          | All crawlers             | Standard rules for all bots
Googlebot  | Google's main crawler    | Specific Google rules
Bingbot    | Microsoft's Bing crawler | Bing-specific rules
Slurp      | Yahoo's crawler          | Yahoo-specific rules

Disallow Directives

Disallow directives define which paths should not be crawled:

  • Disallow: / - Blocks the entire website
  • Disallow: /admin/ - Blocks the admin directory
  • Disallow: /*.pdf$ - Blocks all URLs ending in .pdf (wildcard syntax supported by Google and Bing)
  • Disallow: /private/ - Blocks the private folder

Allow Directives

Allow directives define exceptions to broader Disallow rules (for Google, the most specific, i.e. longest, matching rule wins):

  • Allow: /public/ - Allows crawling of the public folder
  • Allow: /important-page.html - Allows specific page

Best Practices for Robots.txt

1. File Placement

The robots.txt file must be placed in the root directory of the host; crawlers only request it there (a quick accessibility check is sketched below):

  • Correct: https://example.com/robots.txt
  • Ignored: https://example.com/subfolder/robots.txt - a robots.txt in a subdirectory is not read by crawlers
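
A quick way to confirm that the file is actually served from the host root is a plain HTTP request. The following is a minimal sketch using Python's standard library; the domain is a placeholder:

import urllib.error
import urllib.request

def robots_txt_status(host: str) -> int:
    """Return the HTTP status code for the robots.txt at the host root."""
    url = f"https://{host}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code

# Placeholder domain; a 200 response means crawlers can retrieve the file.
print(robots_txt_status("example.com"))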

2. File Size and Format

Aspect             | Recommendation           | Reasoning
File Size          | Max. 500 KB              | Google only processes the first 500 KiB of the file
Character Encoding | UTF-8                    | Support for international characters
Line Endings       | Unix (LF)                | Consistency with web standards
Blank Lines        | Only between rule groups | Blank lines separate User-agent groups

3. Crawl-Delay Optimization

Crawl-Delay directives can reduce server load by asking crawlers to pause between requests. Note that Googlebot ignores Crawl-delay; Bing and some other crawlers respect it. A short parsing sketch follows the recommended values below:

User-agent: *
Crawl-delay: 1

Recommended Values:

  • Small websites: 0-1 seconds
  • Large websites: 1-2 seconds
  • E-Commerce: 2-5 seconds
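
To confirm how a parser reads such a value, here is a minimal sketch using Python's standard library urllib.robotparser (Python 3.6+); the rules are inlined only for the example:

import urllib.robotparser

rules = """
User-agent: *
Crawl-delay: 1
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
# crawl_delay() returns the delay in seconds for the given user agent, or None if unset.
print(parser.crawl_delay("*"))  # 1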

4. Sitemap Integration

Always reference the XML sitemap in robots.txt:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml
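
Crawlers and audit scripts can read these Sitemap entries directly. The following minimal sketch uses Python's urllib.robotparser (site_maps() requires Python 3.8+); the URL is a placeholder:

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder URL
parser.read()  # fetches and parses the live file
# site_maps() returns the list of declared Sitemap URLs, or None if there are none.
print(parser.site_maps())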

Common Robots.txt Weaknesses

1. Syntax Errors

Error             | Correct            | Problem
User-Agent: *     | User-agent: *      | Inconsistent capitalization; lowercase "agent" is the convention
Disallow: /folder | Disallow: /folder/ | Without the trailing slash the rule also matches /folder-name and /folder.html
Allow: /folder    | Allow: /folder/    | Keep Allow paths consistent with the corresponding Disallow rules

2. Logical Errors

Problem: Contradictory Rules

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Solution: List the more specific Allow rule first. For Google the order does not matter (the longest matching rule wins), but parsers that evaluate rules top-down need the exception before the broader Disallow:

User-agent: *
Allow: /admin/public/
Disallow: /admin/
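
The effect of this ordering can be verified with an order-based parser. Python's urllib.robotparser evaluates rules top-down using prefix matching, so with the Allow exception listed first it returns the intended results; a minimal sketch with the rules inlined:

import urllib.robotparser

rules = """
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
print(parser.can_fetch("*", "https://example.com/admin/public/page.html"))  # True
print(parser.can_fetch("*", "https://example.com/admin/settings.html"))     # False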

3. Excessive Restrictions

Avoid:

User-agent: *
Disallow: /

Better:

User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /temp/

Robots.txt for Different Website Types

E-Commerce Websites

User-agent: *
Allow: /products/
Allow: /categories/
Disallow: /checkout/
Disallow: /cart/
Disallow: /user/
Disallow: /admin/
Disallow: /search?*
Disallow: /filter?*
Sitemap: https://shop.example.com/sitemap.xml

Blog Websites

User-agent: *
Allow: /posts/
Allow: /categories/
Allow: /tags/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /?s=
Disallow: /search/
Sitemap: https://blog.example.com/sitemap.xml

Corporate Websites

User-agent: *
Allow: /about/
Allow: /services/
Allow: /contact/
Disallow: /internal/
Disallow: /drafts/
Disallow: /test/
Sitemap: https://company.example.com/sitemap.xml

Testing and Validation

1. Google Search Console

Google Search Console provides an integrated robots.txt report (the successor to the retired robots.txt Tester):

  1. Open the robots.txt report in Search Console
  2. Check whether the file was fetched successfully
  3. Review the errors and warnings that are reported
  4. Fix the issues and re-check the file

2. Online Validation Tools

Recommended Tools:

  • Google Search Console robots.txt report
  • Screaming Frog SEO Spider
  • Ryte Website Checker
  • SEMrush Site Audit

3. Manual Tests

Test Checklist (a scripted spot-check follows this list):

  • [ ] File is accessible at /robots.txt
  • [ ] Syntax is correct
  • [ ] No 404 errors
  • [ ] Sitemap URLs work
  • [ ] Crawl-delay is appropriate
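
Several of these checks can be scripted. The following is a minimal sketch using Python's standard library (Python 3.8+ for site_maps()); the domain is a placeholder and the checks are deliberately simple:

import urllib.error
import urllib.request
import urllib.robotparser

host = "example.com"  # placeholder domain
robots_url = f"https://{host}/robots.txt"

# 1. The file is accessible and does not return an HTTP error.
try:
    with urllib.request.urlopen(robots_url, timeout=10) as response:
        print("robots.txt status:", response.status)
except urllib.error.HTTPError as err:
    print("robots.txt status:", err.code)

# 2. The file parses and every declared sitemap URL responds.
parser = urllib.robotparser.RobotFileParser()
parser.set_url(robots_url)
parser.read()
for sitemap_url in parser.site_maps() or []:
    try:
        with urllib.request.urlopen(sitemap_url, timeout=10) as response:
            print(sitemap_url, "->", response.status)
    except urllib.error.HTTPError as err:
        print(sitemap_url, "->", err.code)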

Advanced Robots.txt Techniques

1. Wildcard Usage

User-agent: *
Disallow: /private*
Disallow: /*.pdf$
Disallow: /temp/
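
In these patterns, * matches any sequence of characters and $ anchors the end of the URL. Python's urllib.robotparser does not understand these wildcards, so the following simplified sketch translates a pattern into a regular expression purely to illustrate the matching behaviour; it is not Google's exact matcher:

import re

def wildcard_rule_to_regex(path: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern with * and $ into a regex (simplified)."""
    anchored = path.endswith("$")
    core = path[:-1] if anchored else path
    body = "".join(".*" if char == "*" else re.escape(char) for char in core)
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(wildcard_rule_to_regex("/*.pdf$").match("/reports/q1.pdf")))       # True
print(bool(wildcard_rule_to_regex("/*.pdf$").match("/reports/q1.pdf?v=2")))   # False
print(bool(wildcard_rule_to_regex("/private*").match("/private-area/page")))  # True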

2. Specific Crawler Rules

User-agent: Googlebot
Allow: /important-content/
Disallow: /admin/

User-agent: Bingbot
Crawl-delay: 2
Disallow: /admin/
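
How a parser selects the matching group can be checked programmatically. A minimal sketch with Python's urllib.robotparser, using the rules above inlined:

import urllib.robotparser

rules = """
User-agent: Googlebot
Allow: /important-content/
Disallow: /admin/

User-agent: Bingbot
Crawl-delay: 2
Disallow: /admin/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
# Each crawler is matched against its own User-agent group.
print(parser.can_fetch("Googlebot", "https://example.com/important-content/page"))  # True
print(parser.can_fetch("Bingbot", "https://example.com/admin/"))                    # False
print(parser.crawl_delay("Bingbot"))                                                # 2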

3. Sitemap Index Integration

Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml

Monitoring and Maintenance

1. Regular Review

Weekly Tasks:

  • Check crawling errors in GSC
  • Evaluate new directories for blocking needs
  • Validate sitemap URLs

Monthly Reviews (a change-detection sketch follows this list):

  • Complete robots.txt analysis
  • Crawl budget optimization
  • Measure performance impact
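
Between these scheduled reviews, unintended edits can be caught by comparing the live file against a stored baseline. The following is a minimal sketch; the URL and the baseline path are placeholders:

import urllib.request
from pathlib import Path

ROBOTS_URL = "https://example.com/robots.txt"  # placeholder URL
BASELINE = Path("robots_baseline.txt")         # placeholder local copy

with urllib.request.urlopen(ROBOTS_URL, timeout=10) as response:
    live = response.read().decode("utf-8")

if not BASELINE.exists():
    BASELINE.write_text(live, encoding="utf-8")  # first run: store the baseline
elif live != BASELINE.read_text(encoding="utf-8"):
    print("robots.txt has changed - review the difference before the next crawl")
else:
    print("robots.txt unchanged")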

2. Change Management

When making website changes:

  1. Evaluate new directories
  2. Update robots.txt
  3. Perform testing
  4. Verify the updated file in Google Search Console

Related Topics