this post was submitted on 24 May 2025
1319 points (99.0% liked)

top 50 comments
[–] antihumanitarian@lemmy.world 5 points 21 minutes ago

Some details. One of the major players doing the tar pit strategy is Cloudflare. They're a giant in networking and infrastructure, and they use AI (more traditional models, not LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.

Generated nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they're a cheap way to get training data. If you make a non-zero portion of the training data poisonous, scrapers have to spend ever more resources to filter it out. The better the nonsense, the harder it is to detect. Cloudflare is known to use small LLMs to generate the nonsense, hence requiring systems at least that complex to differentiate it.

So in short, the tar pit with garbage data actually decreases the average value of scraped data for bots that ignore do-not-scrape instructions.
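A minimal sketch of that tar-pit idea, assuming Flask (Cloudflare's actual generator isn't public, so the nonsense here is just random word salad rather than LLM output):

```python
# Minimal tar-pit sketch: every page is generated nonsense that links
# to more generated pages, so a naive crawler wanders forever while
# ingesting worthless training text. Flask is an assumption here.
import random
from flask import Flask

app = Flask(__name__)
WORDS = ["quantum", "lattice", "enzyme", "manifold", "isotope", "vector"]

def nonsense(n=80):
    return " ".join(random.choice(WORDS) for _ in range(n))

@app.route("/trap/<int:page>")
def trap(page: int):
    # Each page links to a few more trap pages with random IDs.
    links = "".join(
        f'<a href="/trap/{random.randrange(10**9)}">more</a> '
        for _ in range(5)
    )
    return f"<html><body><p>{nonsense()}</p>{links}</body></html>"

if __name__ == "__main__":
    app.run()
```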

[–] mlg@lemmy.world 2 points 15 minutes ago

--recurse-depth=3 --max-hits=256
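(A sketch of the crawler-side fix those flags hint at: cap recursion depth and total fetches so a link maze can only waste a bounded amount of time. The flag names, and the fetch/extract_links helpers, are hypothetical.)

```python
# Hypothetical bounded crawler: --recurse-depth and --max-hits as code.
# fetch() and extract_links() are placeholder callables.
from collections import deque

def crawl(start_url, fetch, extract_links, recurse_depth=3, max_hits=256):
    seen = {start_url}
    queue = deque([(start_url, 0)])
    hits = 0
    while queue and hits < max_hits:  # total-fetch cap
        url, depth = queue.popleft()
        page = fetch(url)
        hits += 1
        if depth >= recurse_depth:
            continue  # depth cap: don't descend further into a link maze
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return hits
```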

[–] stm@lemmy.dbzer0.com 25 points 5 hours ago

Such a stupid title, great software!

[–] Iambus@lemmy.world 13 points 6 hours ago

Typical bluesky post

[–] MonkderVierte@lemmy.ml 16 points 7 hours ago (2 children)

Btw, how about limiting clicks per second/minute against distributed scraping? A user who clicks more than 3 links per second is not a person; neither is one who does 50 in a minute. And if they're then blocked and switch to the next IP, they're still limited in the bandwidth they can occupy.
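A rough sketch of that limit, in-memory and per-IP, using the 3/second and 50/minute thresholds proposed above (a real deployment would keep this state somewhere shared, like Redis):

```python
# Per-IP sliding-window limiter: at most 3 requests in any second and
# 50 in any minute. In-memory sketch only.
import time
from collections import defaultdict, deque

LIMITS = [(1.0, 3), (60.0, 50)]   # (window seconds, max requests)
history = defaultdict(deque)       # ip -> recent request timestamps

def allow(ip: str) -> bool:
    now = time.monotonic()
    q = history[ip]
    q.append(now)
    while q and now - q[0] > 60.0:  # drop entries older than the longest window
        q.popleft()
    return all(
        sum(1 for t in q if now - t <= window) <= limit
        for window, limit in LIMITS
    )
```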

[–] letsgo@lemm.ee 7 points 6 hours ago (1 children)

I click links frequently and I'm not a web crawler. Example: get search results, open several likely-looking possibilities (it only takes a few seconds), then look through each one for a reasonable understanding of the subject that isn't limited to one person's bias and/or mistakes. It's not just search results; I do this on Lemmy too, and when I'm shopping.

[–] MonkderVierte@lemmy.ml 4 points 5 hours ago

Ok, same, make it 5 or 10. Since I use Tree Style Tabs and Auto Tab Discard, I do get a temporary block in some webshops if I load (not just open) too many tabs in too short a time. Probably a CDN thing.

[–] JadedBlueEyes@programming.dev 8 points 6 hours ago (1 children)

They make one request per IP. Rate limit per IP does nothing.

[–] MonkderVierte@lemmy.ml 2 points 6 hours ago* (last edited 6 hours ago) (1 children)

Ah, one request, then the next IP does one, and so on, rotating? I mean, they don't have unlimited addresses. Is there no way to group them together into an observable group and set quotas? I mean, for the purpose of defending against AI DDoS, not just to hurt them.
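One sketch of such grouping: bucket clients by network prefix (/24 for IPv4, /48 for IPv6) and meter the bucket rather than the single address, so rotating through addresses inside one allocation still burns a shared budget. The prefix sizes and quota below are invented numbers:

```python
# Bucket addresses by an allocation-sized prefix and apply the quota
# per bucket, not per address.
import ipaddress
from collections import Counter

requests_per_bucket = Counter()

def bucket(ip: str) -> str:
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def over_quota(ip: str, limit: int = 1000) -> bool:
    b = bucket(ip)
    requests_per_bucket[b] += 1
    return requests_per_bucket[b] > limit  # one rotating botnet, one budget
```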

[–] edinbruh@feddit.it 3 points 5 hours ago (1 children)

There's always Anubis 🤷

Anyway, what if they're backed by some big Chinese corporation with a /32 of IPv6 and a /16 of IPv4? It's not that unreasonable.

[–] JackbyDev@programming.dev 4 points 4 hours ago (1 children)

No, I don't think blocking IP ranges will be effective (except in very specific scenarios). See this comment referencing a blog post about this happening and the traffic was coming from a variety of residential IP allocations. https://lemm.ee/comment/20684186

[–] edinbruh@feddit.it 0 points 1 hour ago (1 children)

my point was that even if they don't have unlimited IPs, they might have a lot of them, especially if it's IPv6, so you couldn't just block them. but you can use Anubis, which doesn't rely on IP filtering

[–] JackbyDev@programming.dev 1 points 52 minutes ago

You're right, and Anubis was the solution they used. I just wanted to mention the IP thing because you did is all.

I hadn't heard about Anubis before this thread. It's cool! The idea of wasting some of my "resources" to get to a webpage sucks, but I guess that's the reality we're in. If it means a more human-oriented internet, then it's worth it.
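For the curious, the "resources" being spent are a proof-of-work puzzle the browser solves before it gets the page. A toy version of the general idea (this is not Anubis's actual protocol):

```python
# Toy proof-of-work: find a nonce whose SHA-256(challenge + nonce) has
# `difficulty` leading zero bits. Cheap once per human visitor,
# expensive at crawler scale.
import hashlib
import os

def solve(challenge: bytes, difficulty: int = 16) -> int:
    target = 1 << (256 - difficulty)  # hash value must fall below this
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

print("solved with nonce", solve(os.urandom(16)))
```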

[–] gmtom@lemmy.world 3 points 5 hours ago (1 children)

Cool, but as with most of the anti-AI tricks, it's completely trivial to work around. So you might stop them for a week or two, but then they'll add like 3 lines of code to detect this and it'll become useless.

[–] JackbyDev@programming.dev 61 points 4 hours ago (3 children)

I hate this argument. All cyber security is an arms race. If this helps small site owners stop small bot scrapers, good. Solutions don't need to be perfect.

[–] ByteOnBikes@slrpnk.net 8 points 1 hour ago (2 children)

I worked at a major tech company in 2018 that didn't take security seriously, because that was literally their philosophy: refuse to do anything until there's an absolutely perfect security solution, since everything else is wasted resources.

I've since left, and I keep seeing them in the news for data leaks.

Small brain people man.

[–] Joeffect@lemmy.world 2 points 25 minutes ago

Did they lock their doors?

[–] JackbyDev@programming.dev 1 points 54 minutes ago

So many companies let perfect become the enemy of good, and it's insane. Recently, a discussion about getting our team to use a consistent formatting scheme devolved into this type of thing. If the thing being proposed is better than what we currently have, let's implement it as is; then, if you have concerns about ways to make it better, we can address those in a later iteration.

[–] moseschrute@lemmy.world 2 points 2 hours ago

I bet someone like Cloudflare could bounce them around traps across multiple domains under their DNS and make the trap harder to detect.

[–] Xartle@lemmy.ml 3 points 3 hours ago (1 children)

To some extent that's true, but anyone who builds network software of any kind without timeouts defined is not very good at their job. If this traps anything, it wasn't good to begin with, AI aside.
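For illustration, a fetch with both kinds of bounds, using the Python requests library (which never times out by default; note its read timeout only caps the wait between bytes, so a trickling tar pit also needs a total-transfer deadline):

```python
# Bound connect/read waits AND total transfer time, so a page that
# trickles bytes forever gets dropped instead of hanging a worker.
import time
import requests

def fetch(url: str, deadline: float = 15.0):
    try:
        resp = requests.get(url, timeout=(3.05, 10), stream=True)
        start, chunks = time.monotonic(), []
        for chunk in resp.iter_content(8192):
            if time.monotonic() - start > deadline:
                return None  # total-transfer deadline hit: treat as a trap
            chunks.append(chunk)
        return b"".join(chunks).decode(resp.encoding or "utf-8", "replace")
    except requests.exceptions.RequestException:
        return None  # includes connect/read timeouts
```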

[–] JackbyDev@programming.dev 10 points 3 hours ago (1 children)

Leave your doors unlocked at home then. If your lock stops anyone, they weren't good thieves to begin with. 🙄

[–] Zwrt@lemmy.sdf.org 1 points 1 hour ago* (last edited 1 hour ago) (1 children)

I believe you misread their comment. They are saying that if you leave your doors unlocked, you're part of the problem, because these AI lockpicks only look for open doors, or know how to skip locked ones.

[–] JackbyDev@programming.dev 1 points 1 hour ago

They said this tool is useless because of how trivial it is to work around.

[–] ZeffSyde@lemmy.world 8 points 8 hours ago (2 children)

I'm imagining a bleak future where, in order to access data from a website, you have to pass a three-tiered system of tests that makes 'click here to prove you aren't a robot' and 'select all of the images that have a traffic light' seem like child's play.

[–] Zacryon@feddit.org 49 points 12 hours ago (4 children)

I suppose this will become an arms race, just like with ad-blockers and ad-blocker detection/circumvention measures.
There will be solutions for scraper-blockers/traps. Then those become more sophisticated. Then the scrapers become better again and so on.

I don't really see an end to this madness. Such a huge waste of resources.

[–] arararagi@ani.social 5 points 5 hours ago

Well, the adblockers are still winning, even on Twitch, where the ads come from the same pipeline as the stream; people made solutions that still block them, since uBlock Origin couldn't do it by itself.

[–] enbiousenvy@lemmy.blahaj.zone 9 points 7 hours ago

the rise of LLM companies scraping the internet is also, I've noticed, the moment YouTube started going harsher against adblockers and 3rd-party viewers.

The Piped and Invidious instances I used to use no longer work, and neither do many other instances. NewPipe has been breaking more frequently, youtube-dl and yt-dlp sometimes can't fetch higher-resolution video, and sometimes the main YouTube site itself breaks on Firefox with uBlock Origin.

Not just YouTube: Z-Library, and especially Sci-Hub and Libgen, have also become harder to use at times.

[–] pyre@lemmy.world 19 points 10 hours ago

there is an end: you legislate it out of existence. unfortunately US politicians are instead trying to outlaw any regulations regarding AI. I'm sure it's not about the money.

[–] Tiger_Man_@lemmy.blahaj.zone 3 points 7 hours ago (1 children)

How can I make something like this?
