this post was submitted on 26 Jan 2024
425 points (82.9% liked)

Technology

We Asked A.I. to Create the Joker. It Generated a Copyrighted Image.::Artists and researchers are exposing copyrighted material hidden within A.I. tools, raising fresh legal questions.

[–] Jilanico@lemmy.world 6 points 10 months ago (3 children)

Because this proves that the “AI”, at some level, is storing the data of the Joker movie screenshot somewhere inside of its training set.

Is it tho? Honest question.

[–] dragontamer@lemmy.world 1 points 10 months ago (1 children)

How did the Joker image get replicated?

[–] Jilanico@lemmy.world 2 points 10 months ago

It's too hard to type up how generative AIs work, but look up a video on "how stable diffusion works" or something like that. I seriously doubt they have a massive database with every image from the Internet inside it, with the AI just spitting those pics out, but I'm no expert.

[–] ryannathans@aussie.zone 0 points 10 months ago (1 children)

Sure, but so is your memory. You could study the originals and redraw them in a similar way.

[–] Jilanico@lemmy.world 3 points 10 months ago (1 children)

I agree, but I don't think these generative AIs actually store image files off the Internet in a massive database. I could be wrong.

[–] ryannathans@aussie.zone 5 points 10 months ago* (last edited 10 months ago) (1 children)

That's correct. The information isn't structured anything like a file or database. It isn't stored pixel by pixel; the model more loosely remembers correlations, similarities and facts about the content, as opposed to storing and copying it.
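To make the "correlations, not copies" idea concrete, here is a toy sketch in Python. It is only an analogy and not how Stable Diffusion or Midjourney work: the "model" is a straight line fit by least squares, and every name in it (`fit_line`, `predict`, `make_training_data`) is made up for this example. It suggests two things: a model with far fewer parameters than training points keeps the overall trend and cannot return individual points, while the same model trained on very little data can reproduce that data essentially exactly.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a*x + b. Returns the pair (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(model, x):
    """Evaluate the fitted line at x."""
    a, b = model
    return a * x + b


def make_training_data(n, seed=0):
    """Hypothetical data: noisy samples around the line y = 3x + 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 3.0 * x + 1.0 + rng.gauss(0, 1.0)
        data.append((x, y))
    return data


# Case 1: 10,000 training points, but the whole model is just two floats.
big_data = make_training_data(10_000)
big_model = fit_line(big_data)
big_worst_miss = max(abs(predict(big_model, x) - y) for x, y in big_data)

# Case 2: the same kind of model, trained on only two points.
tiny_data = [(1.0, 2.0), (3.0, 8.0)]
tiny_model = fit_line(tiny_data)
tiny_worst_miss = max(abs(predict(tiny_model, x) - y) for x, y in tiny_data)

print("big model parameters:", big_model)
print("worst miss on its own training data:", big_worst_miss)
print("tiny model worst miss on its training data:", tiny_worst_miss)
```

In the first case the two numbers describe the trend of the data and nothing more, so individual training points cannot be read back out. In the second case the model has as much capacity as there is data, so it passes through both points. Whether something comparable happens in real image models for specific, heavily repeated images is an assumption of this analogy, not something this snippet demonstrates.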

[–] ryathal@sh.itjust.works 1 points 10 months ago (1 children)

Which is also very similar to how your brain stores things.

[–] ryannathans@aussie.zone 1 points 10 months ago

Yeah, much more similar to the brain than a database or file anyway

[–] QubaXR@lemmy.world -3 points 10 months ago* (last edited 10 months ago) (1 children)
[–] Jilanico@lemmy.world 2 points 10 months ago (1 children)

So stable diffusion, midjourney, etc., all have massive databases with every picture on the Internet stored in them? I know the AI models are trained on lots of images, but are the images actually stored? I'm skeptical, but I'm no expert.

[–] QubaXR@lemmy.world 0 points 10 months ago (2 children)

These models were trained on datasets that, without compensating the authors, used their work as training material. It's not every picture on the net, but a lot of it comes from scraping websites, portfolios and social networks wholesale.

A similar situation happens with large language models. Recently Meta admitted to using illegally pirated books (Books3 database to be precise) to train their LLM without any plans to compensate the authors, or even as much as paying for a single copy of each book used.

[–] Jilanico@lemmy.world 4 points 10 months ago (2 children)

Most of the stuff that inspires me probably wasn't paid for. I just randomly saw it online or on the street, much like an AI.

AI using straight up pirated content does give me pause tho.

[–] topinambour_rex@lemmy.world 3 points 10 months ago (1 children)

How much profit do you make from this stuff?

[–] Jilanico@lemmy.world 0 points 10 months ago

The stuff I sell on jilanico.com? Enough to make it worth my while.

[–] QubaXR@lemmy.world 1 points 10 months ago* (last edited 10 months ago)

I was on the same page as you for the longest time. I cringed at the whole "No AI" movement and artists' protest. I used the very same idea: Generations of artists honed their skills by observing the masters, copying their techniques and only then developing their own unique style. Why should AI be any different? Surely AI will not just copy works wholesale and instead learn color, composition, texture and other aspects of various works to find its own identity.

It was only when my very own prompts started producing results I recognized as "homages" at best and "rip-offs" at worst that I stopped to reconsider.

I suspect that earlier generations of text to image models had better moderation of training data. As the arms race heated up and pace of development picked up, companies running these services started rapidly incorporating whatever training data they could get their hands on, ethics, copyright or artists' rights be damned.

I remember when MidJourney introduced Niji (their anime model) and I could often identify the mangas and characters used to train it. The imagery Niji produced kept certain distinct and unique elements of character designs from that training data - as a result a lot of characters exhibited "Chainsaw Man" pointy teeth and sticking out tongue - without as much as a mention of the source material or even the themes.

[–] archomrade@midwest.social -2 points 10 months ago* (last edited 10 months ago) (1 children)

These models were trained on datasets that, without compensating the authors, used their work as training material.

Couple things:

  • this doesn't explain OP's question about how the information is stored. In fact OP is right that the images and source material are NOT stored in a database within the model; it basically just stores metadata about the source material as a whole in order to construct new material from text descriptions

  • the use of copyrighted works in the training isn't necessarily infringing if the model is found to be a fair use, and there is a very strong fair use argument here.

[–] QubaXR@lemmy.world 3 points 10 months ago (1 children)

"metadata" is such a pretty word. How about "recipe" instead? It stores all information necessary to reproduce work verbatim or grab any aspect of it.

The legal issue of copyright is a tricky one, especially in the US where copyright is often weaponized by corporations. The gist of it is: The training model itself was an academic endeavor and therefore falls under fair use. Companies like StabilityAI or OpenAI then used these datasets and monetized products built on them, which in my understanding sits in a legal gray zone.

If these private for-profit companies simply took the same data and built their own, identical dataset they would be liable to pay the authors for use of their work in a commercial product. They get around it by using the existing model, originally created for research and not commercial use.

Lemmy is full of open source and FOSS enthusiasts, I'm sure someone can explain it better than I do.

All in all I don't argue about the legality of AI, but as a professional creative I highlight ethical (plagiarism) risks that are beginning to arise in the majority of the models. We all know Joker, Marvel superheroes, popular Disney and WB cartoon characters - and can spot when "our" generations cross the line of copying someone else's work. But how many of us are familiar with Polish album cover art, Brazilian posters, Chinese film superheroes or Turkish logos? How sure can we be that the work "we" produced using AI is truly original and not a perfect copy of someone else's work? Does our ignorance excuse this second-hand plagiarism? Or should the companies releasing AI models stop adding features and fix that broken foundation first?

[–] archomrade@midwest.social 0 points 10 months ago

“metadata” is such a pretty word. How about “recipe” instead?

Well isn't recipe another one of those pretty words? 'Metadata' is specific to other precedents that deal with computer programs that gather data about works (see Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google), but you're welcome to challenge the verbiage if you don't like it. Regardless, what we're discussing is objectively something that describes copyrighted works, not copies or a copy of the works themselves. A computer program that is very good at analyzing textual/pixelated data is still only analyzing data. It is itself a novel, non-expressive factual representation of other expressive works, and because of this, it cannot be considered infringement on its own.

It stores all information necessary to reproduce work verbatim or grab any aspect of it.

This isn't really true, at least not for the majority of works analyzed by the model, but granted. If a person uses a tool to copy the work of another person, it is the person who is doing the copying, not the tool. I think it is far more reasonable to hold an individual who uses an AI model to infringe on a copyright responsible. If someone chooses to author a work with the use of a tool that does the work for them (in part or in whole), it is more than reasonable to expect that individual to check the work that is being produced.

All in all I don’t argue about the legality of AI, but as a professional creative I highlight ethical (plagiarism) risks that are beginning to arise in majority of the models.

As a professional creative myself, I think this is a load of horseshit. We always hold individual authors responsible for the work that they publish, and it should be no different here. That some choose to be lazy and careless is more of a reflection of them.

How sure can we be that the work “we” produced using AI is truly original and not a perfect copy of someone else’s work?

If we have the words to describe to the model a desired image/text response that produces a 'perfect copy of someone else's work', then we have the words to search for that work, too.

Or should the companies releasing AI models stop adding features and fix that broken foundation first?

How about we stop expanding the scope of an already broken copyright law and fix that broken foundation first?