this post was submitted on 14 Feb 2024

1059 points (98.6% liked)

AI companies are violating a basic social contract of the web and ignoring robots.txt (www.theverge.com)

submitted 1 year ago by TravisKelce@lemmy.world to c/technology@lemmy.world

195 comments

[–] palordrolap@kbin.social 237 points 1 year ago (16 children)

Put something in robots.txt that isn't supposed to be hit and is hard to hit by non-robots. Log and ban all IPs that hit it.

Imperfect, but can't think of a better solution.
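
As a rough illustration of the trap described above, here is a minimal Python sketch. The trap path `/do-not-crawl/`, the function names, and the in-memory ban set are all assumptions made up for this example; a real site would more likely do this in its web server or firewall and would persist the list.

```python
# Hypothetical names throughout: TRAP_PATH, robots_txt(), handle_request().
TRAP_PATH = "/do-not-crawl/"  # assumed trap URL, never linked visibly

banned_ips: set[str] = set()  # in-memory only; lost on restart


def robots_txt() -> str:
    """Ask every crawler to stay out of the trap path."""
    return f"User-agent: *\nDisallow: {TRAP_PATH}\n"


def handle_request(ip: str, path: str) -> int:
    """Return an HTTP-style status code for a single request."""
    if ip in banned_ips:
        return 403  # already banned
    if path.startswith(TRAP_PATH):
        banned_ips.add(ip)  # "log and ban": the client ignored robots.txt
        return 403
    return 200


print(robots_txt())
print(handle_request("203.0.113.5", "/do-not-crawl/page"))
print(handle_request("203.0.113.5", "/index.html"))
```

Banning by IP is as imperfect as the comment says: shared addresses and rotating proxies mean a ban can hit the wrong client or miss the right one.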

[–] lvxferre@mander.xyz 127 points 1 year ago* (last edited 1 year ago) (5 children)

Good old honeytrap. I'm not sure, but I think that it's doable.

Have a honeytrap page somewhere in your website. Make sure that legit users won't access it. Disallow crawling the honeytrap page through robots.txt.

Then if some crawler still accesses it, you could record+ban it as you said... or you could be even nastier and let it do so. Fill the honeytrap page with poison - nonsensical text that would look like something that humans would write.
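
The "poison" variant could be sketched like this. The word list, the function name `poison_paragraph`, and the sentence lengths are invented for illustration, and random word salad is much cruder than text that would actually pass for human writing.

```python
import random

# Assumed vocabulary; any word list would do.
WORDS = [
    "the", "river", "quietly", "argues", "with", "seven", "lamps",
    "because", "tomorrow", "tastes", "like", "granite", "and", "wool",
]


def poison_paragraph(n_sentences: int = 3, seed=None) -> str:
    """Build grammatical-looking but meaningless filler text."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    sentences = []
    for _ in range(n_sentences):
        length = rng.randint(6, 12)
        words = [rng.choice(WORDS) for _ in range(length)]
        sentences.append(" ".join(words).capitalize() + ".")
    return " ".join(sentences)


print(poison_paragraph(3, seed=42))
```

Such a page would only be served at the URL that robots.txt disallows, so compliant crawlers never see it.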

[–] CosmicTurtle@lemmy.world 59 points 1 year ago (1 children)

I think I used to do something similar with email spam traps. Not sure if it's still around but basically you could help build NaCL lists by posting an email address on your website somewhere that was visible in the source code but not visible to normal users, like in a div that was way on the left side of the screen.

Anyway, spammers that do regular expression searches for email addresses would email it and get their IPs added to naughty lists.
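
For anyone who hasn't seen one, the email trap works roughly as sketched below. The page markup, the address `trap-address@example.com`, and the helper names are assumptions for the example, and the regular expression is a deliberately naive one of the kind a harvester might use, not a full email validator.

```python
import re

# Assumed page: the address sits in a div pushed far off the left edge,
# so it is in the source but not visible to normal visitors.
PAGE = """<p>Contact us via the form below.</p>
<div style="position:absolute; left:-9999px">trap-address@example.com</div>"""

# Naive harvester-style pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

TRAP_ADDRESSES = {"trap-address@example.com"}


def harvest(html: str) -> list[str]:
    """What a scraper doing a regex search over the source would find."""
    return EMAIL_RE.findall(html)


def is_trap_hit(recipient: str) -> bool:
    """Mail sent to a trap address can only come from a harvester."""
    return recipient.lower() in TRAP_ADDRESSES


print(harvest(PAGE))
print(is_trap_hit("trap-address@example.com"))
```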

I'd love to see something similar with robots.

[–] lvxferre@mander.xyz 32 points 1 year ago* (last edited 1 year ago) (3 children)

Yup, it's the same approach as email spam traps. Except the naughty list, but... holy fuck a shareable bot IP list is an amazing addition, it would increase the damage to those web crawling businesses.

[–] Blackmist@feddit.uk 21 points 1 year ago

"Help, my website no longer shows up in Google!"

[–] PM_Your_Nudes_Please@lemmy.world 16 points 1 year ago (2 children)

Yeah, this is a pretty classic honeypot method. Basically make something available but inaccessible to the normal user. Then you know anyone who accesses it is not a normal user.

I’ve even seen this done with Steam achievements before: there was a hidden game achievement which was only available via hacking. So anyone who used hacks immediately outed themselves with a rare achievement that was visible on their profile.

[–] Link@rentadrunk.org 12 points 1 year ago (1 children)

That’s a bit annoying as it means you can’t 100% the game as there will always be one achievement you can’t get.

[–] CosmicCleric@lemmy.world 138 points 1 year ago (3 children)

As unscrupulous AI companies crawl for more and more data, the basic social contract of the web is falling apart.

Honestly it seems like in all aspects of society the social contract is being ignored these days, that's why things seem so much worse now.

[–] maness300@lemmy.world 25 points 1 year ago

It's abuse, plain and simple.

[–] TheObviousSolution@lemm.ee 15 points 1 year ago

Governments could do something about it, if they weren't overwhelmed by bullshit from bullshit generators instead and led by people driven by their personal wealth.

[–] homesweethomeMrL@lemmy.world 126 points 1 year ago (2 children)

Well the trump era has shown that ignoring social contracts and straight up crime are only met with profit and slavish devotion from a huge community of dipshits. So. Y’know.

[–] MonsiuerPatEBrown@reddthat.com 97 points 1 year ago* (last edited 1 year ago) (3 children)

The open and free web is long dead.

just thinking about robots.txt as a working solution to people that literally broker in people's entire digital lives for hundreds of billions of dollars is so ... quaint.

[–] lightnegative@lemmy.world 27 points 1 year ago (2 children)

It's up there with Do-Not-Track.

Completely pointless because it's not enforced

[–] rtxn@lemmy.world 90 points 1 year ago* (last edited 1 year ago) (2 children)

I would be shocked if any big corpo actually gave a shit about it, AI or no AI.

if exists("/robots.txt"):
    no it fucking doesn't

[–] bionicjoey@lemmy.ca 49 points 1 year ago (1 children)

Robots.txt is in theory meant to be there so that web crawlers don't waste their time traversing a website in an inefficient way. It's there to help, not hinder them. There is a social contract being broken here and in the long term it will have a negative impact on the web.
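
For reference, this is roughly what honoring the file looks like from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt contents, the URLs, and the user agent name `ExampleBot` are assumptions for the sketch.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: keep crawlers out of search results and admin pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /admin/
"""


def crawlable(urls, user_agent="ExampleBot", robots_text=ROBOTS_TXT):
    """Return only the URLs a compliant crawler would fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/articles/1",
    "https://example.com/search?q=cats",
    "https://example.com/admin/login",
]
print(crawlable(candidates))
```

Nothing forces a crawler to run this check, which is the whole point of the thread: compliance is voluntary.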

[–] moitoi@feddit.de 85 points 1 year ago (8 children)

Alternative title: Capitalism doesn't care about morals and contracts. It wants to make more money.

[–] AutistoMephisto@lemmy.world 13 points 1 year ago (5 children)

Exactly. Capitalism spits in the face of the concept of a social contract, especially if companies themselves didn't write it.

[–] ytg@feddit.ch 71 points 1 year ago (18 children)

We need laws mandating respect of robots.txt. This is what happens when you don’t codify stuff

[–] echodot@feddit.uk 37 points 1 year ago

It's a bad solution to a problem anyway. If we are going to legally mandate a solution I want to take the opportunity to come up with an actually better fix than the hacky solution that is robots.txt

[–] patatahooligan@lemmy.world 24 points 1 year ago

AI companies will probably get a free pass to ignore robots.txt even if it were enforced by law. That's what they're trying to do with copyright and it looks likely that they'll get away with it.

[–] nutsack@lemmy.world 17 points 1 year ago* (last edited 1 year ago) (1 children)

you can't really make laws in the united states it's too hard

[–] SPRUNT@lemmy.world 21 points 1 year ago (4 children)

The battle cry of conservatives everywhere: It's too hard!

Except if it involves oppressing minorities and women. Then it's a moral imperative worth all the time and money you can shovel at it regardless of whether the desired outcome is realistic or not.

[–] circuitfarmer@lemmy.world 70 points 1 year ago (2 children)

Most every other social contract has been violated already. If they don't ignore robots.txt, what is left to violate?? Hmm??

[–] blanketswithsmallpox@lemmy.world 45 points 1 year ago (8 children)

It's almost as if leaving things to social contracts vs regulating them is bad for the layperson... 🤔

Nah fuck it. The market will regulate itself! Tax is theft and I don't want that raise or I'll get in a higher tax bracket and make less!

[–] Jimmyeatsausage@lemmy.world 16 points 1 year ago* (last edited 1 year ago)

This can actually be an issue for poor people, not because of tax brackets but because of income-based assistance cutoffs. If a $1/hr raise throws you above those cutoffs, that extra $160 could cost you $500 in food assistance, $5-$10/day for school lunch, or get you kicked out of government-subsidized housing.

Yet another form of persecution that the poor actually suffer and the rich pretend to.

[–] maynarkh@feddit.nl 55 points 1 year ago (1 children)

They didn't violate the social contract, they disrupted it.

[–] lando55@lemmy.world 14 points 1 year ago

True innovation. So brave.

[–] FrankTheHealer@lemmy.world 28 points 1 year ago (5 children)

TIL that robots.txt is a thing

[–] KillingTimeItself@lemmy.dbzer0.com 27 points 1 year ago (5 children)

hmm, I thought websites just blocked crawler traffic directly? I know one site in particular has rules about it, and will even go so far as to ban you permanently if you continually ignore them.

[–] Bogasse@lemmy.ml 32 points 1 year ago (1 children)

Detecting crawlers can be easier said than done 🙁

[–] ricdeh@lemmy.world 31 points 1 year ago (3 children)

You cannot simply block crawlers lol

[–] bigMouthCommie@kolektiva.social 20 points 1 year ago (5 children)

hide a link no one would ever click. if an ip requests the link, it's a ban

[–] T156@lemmy.world 14 points 1 year ago (1 children)

Except that it'd also catch out people who use accessibility devices, who might see the link anyway, or who use the keyboard to navigate a site instead of a mouse.
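
One way to soften that problem, sketched under assumptions: mark the trap link with `aria-hidden="true"` and `tabindex="-1"` so assistive technology and keyboard tabbing should normally skip it, and only ban after repeated hits rather than the first. The URL `/trap-link`, the threshold of three, and the function name are made up for the example, and none of this guarantees that no real user ever reaches the link.

```python
from collections import Counter

# Assumed markup: positioned off-screen, hidden from the accessibility
# tree, and removed from the keyboard tab order.
TRAP_LINK_HTML = (
    '<a href="/trap-link" aria-hidden="true" tabindex="-1" '
    'rel="nofollow" style="position:absolute; left:-9999px">archive</a>'
)

STRIKES_BEFORE_BAN = 3  # assumed threshold; tolerates an accidental visit
strikes = Counter()


def record_trap_hit(ip: str) -> bool:
    """Count a hit on the trap URL; return True once the IP should be banned."""
    strikes[ip] += 1
    return strikes[ip] >= STRIKES_BEFORE_BAN


for _ in range(3):
    print(record_trap_hit("203.0.113.9"))
```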

[–] KingThrillgore@lemmy.ml 24 points 1 year ago (1 children)

I explicitly have my robots.txt set to block out AI crawlers, but I don't know if anyone else will observe the protocol. They should have tools I can submit a sitemap.xml against to know if I've been parsed. Until they bother to address this, I can only assume their intent is hostile, and unless someone gets serious about building a honeypot and exposing the tooling for us to deploy at large, my options are limited.
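
A robots.txt along those lines might be generated as below. The list of user agent tokens is an assumption: `GPTBot`, `CCBot`, and `Google-Extended` are tokens their operators have published, but the list is incomplete, will go stale, and only affects crawlers that choose to obey it. The helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed, incomplete list of AI-related crawler tokens.
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots(blocked_agents) -> str:
    """Disallow everything for the listed agents, allow everyone else."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)  # blank line between groups


def is_allowed(robots_text: str, agent: str, url: str) -> bool:
    """Check the generated file the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


robots = build_robots(AI_AGENTS)
print(robots)
print(is_allowed(robots, "GPTBot", "https://example.com/post"))
print(is_allowed(robots, "FriendlyBot", "https://example.com/post"))
```

This only tells you what a compliant crawler would do; it says nothing about whether a given crawler actually fetched the pages, which is the tooling gap the comment complains about.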

[–] phx@lemmy.ca 36 points 1 year ago* (last edited 1 year ago) (6 children)

The funny (in a "wtf" not "haha" sense) thing is, individuals such as security researchers have been charged under digital trespassing laws for stuff like accessing publicly available systems and changing a number in the URL in order to get access to data that they normally wouldn't, even after doing responsible disclosure.

Meanwhile, companies completely ignore the standard means of saying "you are not allowed to scrape this data" and then use OUR content/data to build up THEIR datasets, including AI etc.

That's not a "violation of a social contract" in my book, that's violating the terms of service for the site and essentially infringement on copyright etc.

No consequences for them though. Shit is fucked.

[–] FartsWithAnAccent@lemmy.world 14 points 1 year ago

Remember Aaron Swartz

[–] mo_lave@reddthat.com 23 points 1 year ago

Strong "the constitution is a piece of paper" energy right there

[–] lily33@lemm.ee 18 points 1 year ago (2 children)

What social contract? When sites regularly have a robots.txt that says "only Google may crawl", and are effectively helping enforce a monolopy, that's not a social contract I'd ever agree to.
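
The kind of file being described looks something like the sketch below. The exact contents are an assumption, since the comment does not quote a real site; `OtherSearchBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed "only Google may crawl" robots.txt.
ROBOTS_GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_GOOGLE_ONLY.splitlines())

# A compliant competitor is shut out while Googlebot is let in.
google_ok = parser.can_fetch("Googlebot", "https://example.com/page")
other_ok = parser.can_fetch("OtherSearchBot", "https://example.com/page")
print(google_ok, other_ok)
```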

[–] Imgonnatrythis@sh.itjust.works 17 points 1 year ago (3 children)

I had a one-eared rabbit. He was a monolopy.

[–] Yoz@lemmy.world 17 points 1 year ago (3 children)

No laws to govern so they can do anything they want. Blame boomer politicians not the companies.

[–] itsralC@lemm.ee 17 points 1 year ago (2 children)

¿Por qué no los dos?

[–] Ascend910@lemmy.ml 13 points 1 year ago (1 children)

This is a very interesting read. It is very rare for people on the internet to agree to follow one thing without being forced.

[–] echodot@feddit.uk 16 points 1 year ago (1 children)

Loads of crawlers don't follow it; I'm not quite sure why AI companies not following it is anything special. Really it's just to stop Google indexing random internal pages that mess with your SEO.

It barely even works for all search providers.
