this post was submitted on 17 Jul 2025
73 points (85.4% liked)

Technology

top 22 comments
[–] General_Effort@lemmy.world 78 points 1 day ago (1 children)

What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.

[–] Archr@lemmy.world 15 points 1 day ago* (last edited 1 day ago) (1 children)

I feel like most casual users would not connect "crawlers" to the link previews that the article talks about.

Sure, if you understand that robots.txt covers all robots, it's obvious. But that is not how general news media has been talking about robots.txt.
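To illustrate the point that robots.txt applies to all robots, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URL are assumptions made up for illustration, and the sketch only shows how a rule-following parser reads the file, not how any particular bot behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical URL, used only as sample input.
URL = "https://example.com/blog/post"

# Hypothetical robots.txt meant to keep "crawlers" out.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant that carves out an exception for the link-preview bot.
ALLOW_PREVIEWS = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str) -> bool:
    """Return True if a rule-following bot named `agent` may fetch URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, URL)


# The wildcard rule covers the search crawler and the preview bot alike.
block_all = {a: allowed(BLOCK_ALL, a) for a in ("Googlebot", "LinkedInBot")}
# With the explicit exception, only the preview bot is let through.
with_exception = {a: allowed(ALLOW_PREVIEWS, a) for a in ("Googlebot", "LinkedInBot")}

print(block_all)
print(with_exception)
```

Under these assumptions, a blanket `Disallow` turns away the link-preview bot along with everything else, and a named `Allow` group is one way to let it back in.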

[–] General_Effort@lemmy.world 6 points 1 day ago

that is not how general news media has been talking about robots.txt.

Ahh, yes. I think there is a lesson there.

[–] thedruid@lemmy.world 39 points 1 day ago (1 children)

So. If I can add something here for everyone's benefit

No search engine really obeys robots.txt

Their publicly acknowledged crawlers do, but they have other crawlers that aren't known and that ignore the file.

Google knows every inch of your site, allowed or not.

See, just because a search engine says it doesn't know, doesn't mean it hasn't crawled. It just doesn't display the results, based on your settings.

[–] INeedMana@piefed.zip 21 points 1 day ago (2 children)

Huh. So in this case, the file actually is respected. Refreshing

[–] TeddE@lemmy.world 4 points 12 hours ago

Kinda, but also not really. Any major tech player that has billions to lose will make a show of respecting robots.txt when presenting that information to third parties, lest they be exposed by basic journalism.

However, they also have separate networks in R&D that sweep the net all the time and do not care about such restrictions. It's theatre.

And they're still happy to punish people that have the gall to publicly decline their crawlers. Basically they can eat their cake and have it too.

[–] ell1e@leminal.space 30 points 1 day ago* (last edited 1 day ago) (1 children)

Often it is respected, but the resulting problem is that platforms bundle their regular crawlers with questionable AI scraping, which blackmails websites into feeding AI.

For example, Googlebot if enabled won't just list you for search, but will also scrape your contents for Google's AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as source. I imagine LinkedinBot, given it's Microsoft, will feed some other AI of theirs as well on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn't going to improve.

[–] General_Effort@lemmy.world -5 points 1 day ago (2 children)

Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

False.

[–] ell1e@leminal.space 9 points 1 day ago* (last edited 1 day ago) (1 children)
[–] General_Effort@lemmy.world 2 points 1 day ago* (last edited 18 hours ago) (1 children)

Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.

ETA: I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually claiming that Googlebot collects AI training data. He's talking about the AI overview, which is a search feature. The data for search features is collected by Googlebot. I'm not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.

Here's Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers

[–] ell1e@leminal.space 3 points 1 day ago* (last edited 1 day ago) (1 children)

So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.

[–] General_Effort@lemmy.world 2 points 1 day ago (1 children)

I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?

[–] ell1e@leminal.space 2 points 1 day ago* (last edited 1 day ago) (1 children)

Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.

[–] General_Effort@lemmy.world 2 points 1 day ago (1 children)

You look up what Googlebot does. No AI.

You want to know what crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
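As a sketch of how that split would look in a robots.txt file: the rules below are a hypothetical example, parsed with Python's standard `urllib.robotparser`. It only demonstrates that the two tokens can be addressed separately by a rule-following parser. It says nothing about what Google's crawlers actually do with the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of the training token, stay open otherwise.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical URL, used only as sample input.
URL = "https://example.com/blog/post"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Each token is matched against the User-agent groups independently.
googlebot_allowed = parser.can_fetch("Googlebot", URL)
extended_allowed = parser.can_fetch("Google-Extended", URL)

print("Googlebot:", googlebot_allowed)
print("Google-Extended:", extended_allowed)
```

In this reading, the search crawler stays allowed while the training token is disallowed.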

Did that help?

[–] ell1e@leminal.space 2 points 1 day ago* (last edited 10 hours ago) (1 children)

You look up what Googlebot does. No AI.

The page seems written to perhaps suggest it but doesn't explicitly say the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead the users here.

Edit: I found a quote where it says Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent [...]" and I guess Cloudflare doesn't trust Google to abide by the access controls. That seems sensible to me. Edit 2: What exactly the CEO believes was perhaps rightfully disputed below, it was just my guess.
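A small sketch of what that quote implies for a site operator, assuming a naive server-side filter that only inspects the User-Agent header. The header string is a commonly seen Googlebot value used purely as sample input, and `ua_matches` is a hypothetical helper, not part of any library.

```python
# Assumption: a typical Googlebot User-Agent header, used only as sample input.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def ua_matches(header: str, token: str) -> bool:
    """Naive check: does the User-Agent header mention the given token?"""
    return token.lower() in header.lower()


# The request identifies itself as Googlebot...
seen_as_googlebot = ua_matches(GOOGLEBOT_UA, "Googlebot")
# ...and never as Google-Extended, since that token has no user agent string of its own.
seen_as_extended = ua_matches(GOOGLEBOT_UA, "Google-Extended")

print(seen_as_googlebot, seen_as_extended)
```

If the quote is accurate, a header-based filter cannot single out Google-Extended, so the opt-out rests entirely on the crawler honoring robots.txt.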

[–] General_Effort@lemmy.world 1 points 18 hours ago (1 children)

It would be a lot to write, if you had to say what something does not do rather than what it does.

I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually backing you up. He's saying that Google makes no distinction between the AI overview and the other search results. That is true. The AI overview is a search feature. I'm not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.

[–] ell1e@leminal.space 1 points 15 hours ago* (last edited 14 hours ago) (1 children)

But the article later does back it up: "Although Cloudflare singled out Google, other search engines that view AI search features as part of their search products also use the same bots for training as they do for search indexing."

In any case, I'm okay with admitting neither you nor I can look inside Google to see what they're doing. But the claims are out there, I didn't make them up, whether they're true or not. Thank you for the certainly interesting Google crawler info link.

[–] General_Effort@lemmy.world 2 points 11 hours ago (1 children)

But the article later does back it up

The CEO of Cloudflare did not assert that. I was surprised that he would claim such a thing, and that should have made me read more carefully. Elon Musk notwithstanding, neither incompetence nor conspiracy theorizing is common at that level, publicly anyway.

You can believe whatever you like, of course. Freedom of opinion is nothing if not the right to be wrong.

[–] ell1e@leminal.space 1 points 10 hours ago* (last edited 10 hours ago)

Right, but the article does. Anyway, I'm moving on. Thanks for the discussion.

[–] cecilkorik@lemmy.ca 1 points 1 day ago (1 children)

Absolutely true. They'll buy the data they want from some shitty crawler running from some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and pretend they had no idea their "data partner" wasn't respecting robots.txt if they have to, which they won't ever have to do because it's literally impossible to detect and prove and realistically unenforceable.

This is a company that removed its company motto of "Don't be evil" because it found it too "limiting". Don't be naive.

[–] General_Effort@lemmy.world 2 points 1 day ago

That's very different from what I called false.

What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna's Archive.