this post was submitted on 09 Jan 2025
55 points (93.7% liked)

Selfhosted


Now that we know AI bots will ignore robots.txt and churn residential IP addresses to scrape websites, does anyone know of a method to block them that doesn't entail handing over your website to Cloudflare?

top 44 comments
[–] drkt@scribe.disroot.org 33 points 9 months ago (2 children)

I am currently watching several malicious crawlers be stuck in a 404 hole I created. Check it out yourself at https://drkt.eu/asdfasd

I respond to all 404s with a 200 and then serve them that page full of juicy bot targets. A lot of bots can't get out of it and I'm hoping that the driveby bots that look for login pages simply mark it (because it responded with 200 instead of 404) so a real human has to go and check and waste their time.
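For anyone wanting to try the "answer every 404 with a 200" idea in nginx, here is a minimal sketch. It is a guess at how such a trap could be wired up, not the actual setup behind the page linked above; the host name, the directories, and the `/trap.html` page are hypothetical placeholders.

```nginx
server {
    listen 80;
    server_name example.org;      # hypothetical host name
    root /var/www/site;           # hypothetical document root

    # Any request nginx itself would answer with 404 is instead answered
    # with status 200 and the body of the trap page.
    error_page 404 =200 /trap.html;

    location = /trap.html {
        # Hypothetical directory holding the page full of bait links and forms.
        root /var/www/trap;
    }
}
```

Note that `error_page` only catches 404s that nginx generates itself. If the site is proxied to a backend, the backend's 404s pass through untouched unless `proxy_intercept_errors on;` is also set.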

[–] ctag@lemmy.sdf.org 7 points 9 months ago

That's pretty neat. Thanks!

[–] danielquinn@lemmy.ca 7 points 9 months ago (2 children)

This is pretty slick, but doesn't this just mean the bots hammer your server looping forever? How much processing do you do of those forms for example?

[–] drkt@scribe.disroot.org 8 points 9 months ago

doesn’t this just mean the bots hammer your server looping forever?

Yes

How much processing do you do of those forms

None

It costs me nothing to have bots spending bandwidth on me because I'm not on a metered connection and electricity is cheap enough that the tiny overhead of processing their requests might amount to a dollar or two per year.

[–] jagged_circle@feddit.nl 5 points 9 months ago

Best is to redirect them to a 1TB file served by hetzner's cache. There's some nginx configs that do this

[–] r00ty@kbin.life 18 points 9 months ago (4 children)

If you're running nginx I am using the following:

if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }

That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!

I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):

AS45102 (Alibaba cloud) AS136907 (Huawei SG) AS132203 (Tencent) AS32934 (Facebook)

Since these guys run or have run bots that impersonate real browser agents.

There are various tools online to return prefix/ip lists for an autonomous system number.

I put both into a single file and include it into my web site config files.

EDIT: Just to add, keeping on top of this is a full time job! EDIT 2: Removed Mojeek bot as it seems to be a normal web crawler.
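A rough sketch of what such a combined include file might look like, assuming it is included inside each `server { }` block. The file path and host name are hypothetical, the user-agent pattern is shortened from the one quoted above, and the IP prefixes are reserved documentation ranges standing in for the real AS lookups, not actual ranges of any network.

```nginx
# /etc/nginx/snippets/block-bots.conf  (hypothetical file name)
# Meant to be included inside a server { } block.

# 1. User-agent filter, shortened here for readability.
if ($http_user_agent ~* "SemrushBot|AhrefsBot|MJ12bot|Bytespider|ClaudeBot") {
    return 403;
}

# 2. Prefixes expanded from the AS numbers. The values below are
#    documentation ranges (RFC 5737 / RFC 3849) used as placeholders.
deny 192.0.2.0/24;
deny 198.51.100.0/24;
deny 2001:db8::/32;
```

Each site config would then pull it in with a single line:

```nginx
server {
    server_name example.org;                       # hypothetical
    include /etc/nginx/snippets/block-bots.conf;   # hypothetical path
}
```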

[–] Mojeek@lemmy.ml 5 points 9 months ago (1 children)

why MojeekBot? we're a search engine

[–] r00ty@kbin.life 2 points 9 months ago (1 children)

Hmm, I took an original list and added to it. You got a website I can check? If so I'll happily remove. I don't mind slow web crawlers at all.

[–] Mojeek@lemmy.ml 4 points 9 months ago (1 children)

if you have any recall on where the list came from that's also useful to us. Here's our Bot page: https://www.mojeek.com/bot.html and some external info: https://en.wikipedia.org/wiki/Mojeek

[–] r00ty@kbin.life 3 points 9 months ago (1 children)

Didn't have the link to hand. But a search turned this one up: https://reggiodigital.com/blog/nginx-rule-blocking-bad-bots/ it looks to be the same list, and you can see the ones I've added to the end of that list.

[–] Mojeek@lemmy.ml 2 points 9 months ago

thanks a lot for providing this 🙏

[–] ctag@lemmy.sdf.org 5 points 9 months ago (1 children)

Thank you for the detailed reply.

keeping on top of this is a full time job!

I guess that's why I'm interested in a tooling based solution. My selfhosting is small-fry junk, but a lot of others like me are hosting entire fedi communities or larger websites.

[–] r00ty@kbin.life 5 points 9 months ago (1 children)

Yeah, I probably should look to see if there's any good plugins that do this on some community submission basis. Because yes, it's a pain to keep up with whatever trick they're doing next.

And unlike web crawlers that generally check a url here and there, AI bots absolutely rip through your sites like something rabid.

[–] ptz@dubvee.org 3 points 9 months ago (1 children)

AI bots absolutely rip through your sites like something rabid.

SemrushBot being the most rabid from my experience. Just will not take "fuck off" as an answer.

That looks pretty much like how I'm doing it, also as an include for each virtual host. The only difference is I don't even bother with a 403. I just use Nginx's 444 "response" to immediately close the connection.

Are you doing the IP blocks also in Nginx or lower at the firewall level? Currently I'm doing it at firewall level since many of those will also attempt SSH brute forces (good luck since I only use keys, but still....)
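The 444 variant mentioned above could look roughly like this. It is a sketch with a shortened pattern, not the exact config in use; 444 is a non-standard, nginx-specific code that makes nginx close the connection without sending any response.

```nginx
# Same user-agent test as a 403 rule, but the connection is simply dropped.
# Pattern shortened for illustration.
if ($http_user_agent ~* "SemrushBot|AhrefsBot|MJ12bot") {
    return 444;
}
```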

[–] r00ty@kbin.life 4 points 9 months ago

So on my mbin instance, it's on cloudflare. So I filter the AS numbers there. Don't even reach my server.

On the sites that aren't behind cloudflare. Yep it's on the nginx level. I did consider firewall level. Maybe just make a specific chain for it. But since I was blocking at the nginx level I just did it there for now. I mean it keeps them off the content, but yes it does tell them there's a website there to leech if they change their tactics for example.

You need to block the whole ASN too. Those that are using chrome/firefox UAs change IP every 5 minutes from a random other one in their huuuuuge pools.

[–] Atemu@lemmy.ml 2 points 9 months ago

I'd suspect the bots would just try again with a masked user agent when they receive a 403.

I think the best strategy would be to feed the bots shit that looks like real content.

[–] Atherel@lemmy.dbzer0.com 1 points 9 months ago (1 children)

See my other comment, nG-firewall does exactly this and more.

https://perishablepress.com/ng-firewall/

[–] Shimitar@feddit.it 1 points 9 months ago

Amazing, thanks, will try it out!

[–] jlh@lemmy.jlh.name 10 points 9 months ago

Maybe crowdsec could add a list for blocking scraping for LLMs

https://app.crowdsec.net/blocklists/search?page=1

[–] muntedcrocodile@lemm.ee 9 points 9 months ago (1 children)

I run !news_summary@lemmy.dbzer0.com and bypassing Cloudflare, paywalls, anti-bot filters, etc. is way easier than anyone thinks.

There is no escape from web scrapers. Best you can do is poison your images and obfuscate the page source.

[–] ctag@lemmy.sdf.org 2 points 9 months ago

In that case I'm interested in tools to automate doing that.

[–] nothacking@discuss.tchncs.de 8 points 9 months ago (1 children)

Perhaps feed them convincing fake data so they don't realize they've been IP banned/user agent filtered.

[–] ctag@lemmy.sdf.org 7 points 9 months ago

A commenter in the hackernews post has created this: https://marcusb.org/hacks/quixotic.html

I'm interested, but it seems like an easy way for bots to exhaust your own server resources before they give up crawling.

[–] Deckweiss@lemmy.world 7 points 9 months ago* (last edited 9 months ago) (2 children)

The only way I can think of is blacklisting everything by default, directing to a challenging, proper captcha (can be selfhosted) and temporarily whitelisting proven human IPs.

When you try to "enumerate badness" and block all AI useragents and IP ranges, you'll always leave some new ones through and you'll never be done with adding them.

Only allow proven humans.


A captcha will inconvenience the users. If you just want to make it worse for the crawlers, let them spend compute resources through something like https://altcha.org/ (which would still allow them to crawl your site, but make DDoSing very expensive) or AI honeypots.

[–] ctag@lemmy.sdf.org 4 points 9 months ago* (last edited 9 months ago) (1 children)

I hadn't heard of that before, thanks for the link.

I haven't read through the docs yet... But PoW makes me wonder what the work is and if it's cryptocurrency related.

Edit: Found it: https://altcha.org/docs/proof-of-work/

[–] jagged_circle@feddit.nl 1 points 9 months ago

Hashcash predates crypto currencies

[–] jagged_circle@feddit.nl 2 points 9 months ago (1 children)

Any reason you prefer this to mCAPTCHA?

[–] Deckweiss@lemmy.world 2 points 9 months ago

I didn't know about mCaptcha. Thanks for sharing.

[–] scrubbles@poptalk.scrubbles.tech 6 points 9 months ago (2 children)

If I'm reading your link right, they are using user agents. Granted there's a lot. Maybe you could whitelist user agents you approve of? Or one of the commenters had a list that you could block. Nginx would be able to handle that.

[–] albert180@discuss.tchncs.de 9 points 9 months ago

They just fake user agents if you block them.

[–] ctag@lemmy.sdf.org 2 points 9 months ago (1 children)

Thank you for the reply, but at least one commenter claims they'll impersonate Chrome UAs.

[–] albert180@discuss.tchncs.de 14 points 9 months ago* (last edited 9 months ago) (1 children)

You can read more here:

If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

https://pod.geraspora.de/posts/17342163

[–] FaceDeer@fedia.io 4 points 9 months ago (1 children)

Except it's not denying service, so it's just a D.

[–] ctag@lemmy.sdf.org 7 points 9 months ago

In the hackernews comments for that geraspora link people discussed websites shutting down due to hosting costs, which may be attributed in part to the overly aggressive crawling. So maybe it's just a different form of DDOS than we're used to.

[–] Atherel@lemmy.dbzer0.com 6 points 9 months ago (1 children)

It's not AI, but take a look at nG-firewall. It blocks most known unwanted stuff and gets regular updates:

https://perishablepress.com/ng-firewall/

[–] ctag@lemmy.sdf.org 1 points 9 months ago

Will check this out. Thanks!

[–] dudeami0@lemmy.dudeami.win 5 points 9 months ago (2 children)

The only way I can think of is to require users to authenticate themselves, but this isn't much of a hurdle.

To get into the details of it, what do you define as an AI bot? Are you worried about scrapers grabbing the contents of your website? What are the activities of an "AI bot"? Are you worried about AI bots registering and using your platform?

The real answer is not even Cloudflare will fully defend you from this. If anything, Cloudflare is just making sure they get paid for access to your website by AI scrapers. As someone who has worked around bot protections (albeit in a different context than web scraping), it's a game of cat and mouse. If you or some company you hire are not actively working against automated access, you lose, as the other side is active.

Just think of your point that they are using residential IP addresses. How do they get these addresses? They provide addons/extensions for browsers that offer some service (generally free VPNs) in exchange for access to your PC, and therefore your internet connection, in the contract you agree to. The same can be done by any addon, and if the addon has permissions to read any website they can scrape those websites using legit users for whatever purposes they want. The recent exposure of the Honey scam highlights this, as it's very easy to get users to install addons by telling users they might save a small amount of money (or make money for other programs). There will be users who are compromised by addons/extensions or even just viruses that will be able to extract the data you are trying to protect.

[–] DaGeek247@fedia.io 2 points 9 months ago (1 children)

Just think of your point that they are using residential IP addresses. How do they get these addresses?

You can ping all of the ipv4 addresses in under an hour. If all you're looking for is publicly available words written by people, you only have to poke port 80 and then suddenly you have practically every possible small self-hosted website out there.

[–] dudeami0@lemmy.dudeami.win 2 points 9 months ago* (last edited 9 months ago)

When I say residential IP addresses, I mostly mean proxies using residential IPs, which allow scrapers to mask themselves as organic traffic.

Edit: Your point stands that there are a lot of services without these protections in place, but a lot of services are protected against scraping.

[–] ctag@lemmy.sdf.org 1 points 9 months ago

Thank you for the detailed response. It's disheartening to consider the traffic is coming from 'real' browsers/IPs, but that actually makes a lot of sense.

I'm coming at this from the angle of AI bots ingesting a website over and over to obsessively look for new content.

My understanding is there are two reasons to try blocking this: to protect bandwidth from aggressive crawling, or to protect the page contents from AI ingestion. I think the former is doable, and the latter is an unwinnable task. My personal reason is because I'm an AI curmudgeon, I'd rather spend CPU resources blocking bots than serving any content to them.
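On the bandwidth side, one possible building block is nginx's built-in request limiting. This is only a sketch: the zone name `perip`, the zone size, the rate, and the host name are illustrative values, not recommendations from anyone in this thread.

```nginx
# http { } context: track clients by address and allow an average of
# 2 requests per second each. "perip", 10m, and 2r/s are illustrative.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    server_name example.org;   # hypothetical

    location / {
        limit_req zone=perip burst=10 nodelay;  # tolerate short bursts
        limit_req_status 429;                   # default would be 503
    }
}
```

As others in the thread point out, a crawler that rotates through a large pool of addresses stays under any per-address limit, so this mostly helps against the less sophisticated bots.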

[–] iMeddles@infosec.pub 4 points 9 months ago (1 children)

The ultimate bad bot blocker (https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker) does the heavy lifting for me. It updates multiple times per day to add and remove IP addresses and bot referers. It does need some monitoring though: some of the rules wildcard a bit hard and will catch Mastodon servers with unusual names, for example.

[–] ctag@lemmy.sdf.org 1 points 9 months ago

Will check this out. Thanks!

[–] waspentalive@lemmy.one 3 points 9 months ago* (last edited 9 months ago)

When one of these guys attacks your site, do they send the info back to the spoofed address or does the scraped info go to their real IP address? Is there some way to get a fix on the actual bot and not on some home user that got his network facing IP address hijacked?

[–] bokherif@lemmy.world 2 points 9 months ago

Try captchas