Selfhosted
A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.
Rules:
-
Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
-
No spam posting.
-
Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.
-
Don't duplicate the full text of your blog or github here. Just post the link for folks to click.
-
Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).
-
No trolling.
Resources:
- selfh.st Newsletter and index of selfhosted software and apps
- awesome-selfhosted software
- awesome-sysadmin resources
- Self-Hosted Podcast from Jupiter Broadcasting
Any issues on the community? Report it using the report flag.
Questions? DM the mods!
view the rest of the comments
If you're running nginx I am using the following:
if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|MojeekBot|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }
That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!
I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):
AS45102 (Alibaba cloud) AS136907 (Huawei SG) AS132203 (Tencent) AS32934 (Facebook)
Since these guys run or have run bots that impersonate real browser agents.
There are various tools online to return prefix/ip lists for an autonomous system number.
I put both into a single file and include it into my web site config files.
EDIT: Just to add, keeping on top of this is a full time job!
why MojeekBot? we're a search engine
Hmm, I took an original list and added to it. You got a website I can check? If so I'll happily remove. I don't mind slow web crawlers at all.
I'd suspect the bots would just try again with a masked user agent when they receive a 403.
I think the best strategy would be to feed the bots shit that looks like real content.
Thank you for the detailed reply.
I guess that's why I'm interested in a tooling based solution. My selfhosting is small-fry junk, but a lot of others like me are hosting entire fedi communities or larger websites.
Yeah, I probably should look to see if there's any good plugins that do this on some community submission basis. Because yes, it's a pain to keep up with whatever trick they're doing next.
And unlike web crawlers that generally check a url here and there, AI bots absolutely rip through your sites like something rabid.
SemrushBot being the most rabid from my experience. Just will not take "fuck off" as an answer.
That looks pretty much like how I'm doing it, also as an include for each virtual host. The only difference is I don't even bother with a 403. I just use Nginx's 444 "response" to immediately close the connection.
Are you doing the IP blocks also in Nginx or lower at the firewall level? Currently I'm doing it at firewall level since many of those will also attempt SSH brute forces (good luck since I only use keys, but still....)
So on my mbin instance, it's on cloudflare. So I filter the AS numbers there. Don't even reach my server.
On the sites that aren't behind cloudflare. Yep it's on the nginx level. I did consider firewall level. Maybe just make a specific chain for it. But since I was blocking at the nginx level I just did it there for now. I mean it keeps them off the content, but yes it does tell them there's a website there to leech if they change their tactics for example.
You need to block the whole ASN too. Those that are using chrome/firefox UAs change IP every 5 minutes from a random other one in their huuuuuge pools.
See my other comment, nG-firewall does exactly this and more.
https://perishablepress.com/ng-firewall/