After dabbling in the world of LLM poisoning, I realised that I simply do not have the skill set (or brain power) to effectively poison LLM web scrapers.
I am trying to work with what I know/understand. I have fail2ban installed on my static web server. Is it possible to get a large list of IP addresses known to scrape websites and add it to the ban list?
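For what it's worth, feeding an existing blocklist into fail2ban is mechanically simple: `fail2ban-client set <jail> banip <ip>` adds a ban to a running jail. A minimal sketch, assuming a jail named `scrapers` already exists and the list sits in `blocklist.txt` (both names are placeholders; here the script builds a tiny demo list and runs in dry-run mode so it doesn't need fail2ban present):

```shell
# Demo blocklist (in practice you'd download a community-maintained one;
# addresses below are documentation-range examples, not real scrapers)
printf '203.0.113.7\n# comment line\n198.51.100.9\n' > blocklist.txt

JAIL="scrapers"   # hypothetical jail name
DRY_RUN=1         # unset this to actually call fail2ban-client

while read -r ip; do
  # skip blank lines and comments
  case "$ip" in ''|'#'*) continue ;; esac
  if [ -n "$DRY_RUN" ]; then
    echo "would ban $ip in jail $JAIL"
  else
    fail2ban-client set "$JAIL" banip "$ip"
  fi
done < blocklist.txt
```

The caveat is that a static list goes stale fast, since scrapers rotate through cloud and residential IP ranges, which is exactly why the answer below leans on dynamic rules instead.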
Fail2ban is not really meant as a static security policy.
It's a dynamic tool: it ties patterns in your logs to time-boxed firewall rules.
You could, for instance, auto-ban for 1h any source that requests robots.txt on your web server. I've heard some AI data scrapers actually read robots.txt to discover content to harvest, rather than respecting the restrictions it asks for.
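That robots.txt trap can be expressed as a small filter plus jail. A sketch, assuming an nginx access log in the common/combined format; the filter name `robots-trap` and the paths are my own placeholders, and note this bans well-behaved crawlers too, since they also fetch robots.txt:

```
# /etc/fail2ban/filter.d/robots-trap.conf  (hypothetical filter)
[Definition]
# <HOST> is fail2ban's token for the source IP;
# match any GET/HEAD request for robots.txt
failregex = ^<HOST> .* "(GET|HEAD) /robots\.txt

# /etc/fail2ban/jail.d/robots-trap.local
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 1h
```

With `maxretry = 1` a single hit triggers the ban; `bantime = 1h` keeps it time-boxed as described above.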