Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy/pasting outdated instructions into their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt.

In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic’s real (and new) scraper bot unblocked.
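
Concretely, a robots.txt in this state might look like the sketch below. The agent names are as publicly reported: anthropic-ai and claude-web are reportedly no longer in use by Anthropic, while ClaudeBot is the crawler the company uses now.

    # Rules copied from outdated lists: Anthropic reportedly no longer
    # uses either of these agent names, so these stanzas block nothing.
    User-agent: anthropic-ai
    Disallow: /

    User-agent: claude-web
    Disallow: /

    # The stanza that actually blocks Anthropic's current crawler:
    User-agent: ClaudeBot
    Disallow: /

A file with only the first two stanzas looks protective but leaves ClaudeBot free to crawl everything.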

This is an example of “how much of a mess the robots.txt landscape is right now,” the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly-shifting landscape of web crawlers and scrapers—many of them operated by AI companies—and which helps website owners regularly update their robots.txt files to prevent specific types of scraping. The site has seen a huge increase in popularity as more people try to block AI from scraping their work.

“The ecosystem of agents is changing quickly, so it’s basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively,” they added.
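
Opting out of these newer crawlers means adding one stanza per agent name. A sketch using the names quoted above (consult each company's documentation for the current names):

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

Every newly launched agent needs its own entry, which is exactly the manual upkeep being described.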

  • calcopiritus@lemmy.world · 5 months ago
    robots.txt doesn’t prevent scraping.

    It is just a suggestion. Bots that care will read it, bots that don’t, won’t.
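
The comment is accurate: robots.txt is part of the voluntary Robots Exclusion Protocol, and nothing on the server side enforces it. As a minimal Python sketch (the domain and agent name here are placeholders), this check is all that separates a polite crawler from a rude one:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt; "example.com" is a placeholder.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A well-behaved bot asks before fetching. A scraper that skips this
    # check can still download the page; nothing technical stops it.
    if rp.can_fetch("ClaudeBot", "https://example.com/some-article"):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt disallows this fetch; a polite bot stops here")

Actually keeping a determined scraper out takes server-side measures such as user-agent filtering or rate limiting, not robots.txt alone.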