Those same images have made it easier for AI systems to produce realistic and explicit imagery of fake children as well as transform social media photos of fully clothed real teens into nudes, much to the alarm of schools and law enforcement around the world.
Until recently, anti-abuse researchers thought the only way that some unchecked AI tools produced abusive imagery of children was by essentially combining what they’ve learned from two separate buckets of online images — adult pornography and benign photos of kids.
But the Stanford Internet Observatory found more than 3,200 images of suspected child sexual abuse in the giant AI database LAION, an index of online images and captions that’s been used to train leading AI image-makers such as Stable Diffusion. The watchdog group based at Stanford University worked with the Canadian Centre for Child Protection and other anti-abuse charities to identify the illegal material and report the original photo links to law enforcement.
Well, I wonder what excuse all the AI fuckbois have for this one.
Probably “well that’s not good”.
You think that people who disagree with you on AI stuff are somehow okay with child porn?
I don’t think that’s what they’re suggesting at all.
The question isn’t “Did you know there is child porn in your data set?”
The question is “Why the living fuck didn’t you know there was child porn in your fucking data set, you absolute fucking idiot?”
The answer is more mealy-mouthed bullshit from pussies who didn’t have a plan and are probably currently freaking the fuck out about harboring child porn on their hard drives.
The point is it shouldn’t have happened to begin with and they don’t really have a fucking excuse and if all they can come up with is “well that’s not good” maybe they should go die in a fucking fire to make the world a better place. “Oopsie doodles I’m sowwy” isn’t good enough.
Wow, calm the fuck down dude.
The reason they didn’t know is that the AI groups aren’t the ones scanning the Internet; different projects do that and publish the data, and yet another project identifies images and extracts alt text from them.
They’re probably freaking out about as much as any search engine is when they discover they indexed CSAM, and probably less because they’re not actually holding the images.
I know the point you’re going for, and raging out at the topic only undermines your point.
“Other groups organized this data, but we couldn’t be fucked to check to make sure it was all fully legal and above board” said nobody who actually cared about such things ever.
The fact that they don’t check because it would take too long and slow them down compared to competitors is literally the point. It’s all about profit motive over safety or even basic checking of things beforehand.
It’s a really, really weak excuse.
Did you know that they actually do check? It’s true! There’s a big difference between what happened, which is CSAM was found in the foundation data, and that CSAM then being used for training.
Stability AI on Wednesday said it only hosts filtered versions of Stable Diffusion and that “since taking over the exclusive development of Stable Diffusion, Stability AI has taken proactive steps to mitigate the risk of misuse.” “Those filters remove unsafe content from reaching the models,” the company said in a prepared statement. “By removing that content before it ever reaches the model, we can help to prevent the model from generating unsafe content.”
Also, the people who maintain the foundational dataset do checks, which was mentioned by the people who reported the issue. Their critique was that the checks had flaws, not that they didn’t exist.
So if your only issue is that they didn’t check, well… You’re wrong.
That’s 400 million images. Checking them all is impossible.
But we had to indiscriminately harvest these images from the web. Otherwise we would not have collected enough images in a timely manner!
Same thing I’ve said all along: shit’s fucked, but it’s the people, not the tool, that’s the problem. Turns out it’s the people training the AI with shit like this that’s the problem, not the AI itself.
If people are using it for these purposes, then these people shouldn’t be allowed to use it.
Can’t you technically already do this with more primitive tools? Like… just draw it? Or use Photoshop filters?
This just greatly expands the number of people capable of doing it.
Plus I’m pretty sure the laws in place already would get someone prosecuted for the creation of the art, and said person would probably use the tools even if they had to use illegal means.
Basically I’m not sure this line of reasoning really does anything but hurt benign or even legitimate uses of AI.
This is happening in part because the creators of these AI systems don’t verify their training data.
It’s inexcusable to include this content and then claim bad actors.
Doing the same thing with other methods is also not allowed in many countries. Along with the distribution of such material. Just because an AI does it does not justify it.
Any person or business creating or using such material shouldn’t be allowed unsupervised access to distribution methods. This is the case for older methods. AI shouldn’t be a scapegoat. It just provides plausible deniability.
Plausible deniability shouldn’t be an excuse. Especially in cases where businesses are doing this. They should be responsible for the content they feed into training AI. It’s completely inexcusable. Only dumb tech bros that don’t understand tech and pedos could seriously think this should be allowed.
What exactly are you basing the dumb tech bros thing on? Is there even a single training set that has some sort of verification yet? If they did we wouldn’t have all the DMCA issues that AI is also going through would we? It seems like it’s generally argued that’s not actually easy to do at the moment.
Like you’re arguing a lot of absolutes here that don’t seem to be backed up by anything???
I mean, I don’t think I disagree with that, necessarily. That’s been my stance the whole time: blame the user, not the tool.
“Actually checking all the images we scraped the internet for is too hard, and the CSAM algorithms aren’t available to just anyone to check to make sure they don’t have child porn waaaaah”
It’s all because it’s a “make money first and fuck any guardrails” ethos. It’s the same shit they hide behind when saying it’s not piracy when LLMs are trained on books3, which is well known to be the entirety of a private tracker for ebooks which specializes in removing DRM and distributes the tools to remove DRM. (Specifically, Bibliotik.)
Literally, books3 was always pirated, and not just pirated, but easily provable to be a large DMCA violation of having broken encryption to remove DRM from the books. So how is any media produced from a pirated dataset not technically a copyright violation themselves? Especially when the company in question is getting oodles of money for it? The admins of the Pirate Bay went to prison for less.
You can’t tell me that a source for media that is KNOWN to be pirated material somehow becomes A-Okay for a private company to use for profit. That’s just bullshit. But I’ve seen plenty of defense of it. Apparently it’s okay for companies to commit instances of piracy, as long as they make money or something? Makes no fucking sense to me.
“That’s pirated content!”
“But we’re an AI company who used it to train our LLM and profited greatly from it!”
But if you pirated it because you just liked Metallica and wanted to listen to their Black Album and made no money from it, well, Lars Ulrich is coming to sue your ass, babay!
I should train an AI to get a library card and check out books 5 at a time!
In this particular case, there are three organizations involved.
First you have LAION, the main player in the article, which is a not-for-profit org intended to make image training sets broadly available to further generative AI development.
“Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention,” Thiel said in an interview.
While they had some mechanisms in place, 3200 CSA images slipped by them and became part of the 5 billion images in their data set. That’s on them, for sure.
From the above quote it kinda sounds like they weren’t doing nearly enough. And they weren’t following best practices.
The Stanford report acknowledged LAION’s developers made some attempts to filter out “underage” explicit content but might have done a better job had they consulted earlier with child safety experts.
It also doesn’t help that the second organization, Common Crawl, their upstream source for much of their data, seems not to do anything and passes the buck to its customers.
… Common Crawl’s executive director, Rich Skrenta, said it was “incumbent on” LAION to scan and filter what it took before making use of it.
And of course, third, we have the customers of LAION, a major one being Stability AI, which apparently was late to the game in implementing filters to prevent generation of CSAM with its earlier model, which, though unreleased, was integrated into various tools.
“We can’t take that back. That model is in the hands of many people on their local machines,” said Lloyd Richardson, director of information technology at the Canadian Centre for Child Protection, which runs Canada’s hotline for reporting online sexual exploitation.
So it seems to me the excuse of these players is “hurr durr, I guess I shoulda thunked of that before durr”. As usual, humans leap into shit without sufficiently contemplating negative outcomes. This is especially true for anything technology related, because it has happened over and over again in the decades since the PC revolution.
I for one am exhausted by it and sometimes, like now after reading the article, I just want to toss it out the window. Yup, it’s about time to head to my shack in the woods and compose some unhinged screeds on my typer.
Remove the images and redo the training. Using the previous AI should be banned except for ethical research, and even then it should require permission from the authorities.
As an AI fuckboi I don’t think anything else is acceptable.
3,200 out of 5,000,000,000.
So around 0.000064% of the LAION-5B data set.
This is obviously not intentional on the part of the LLM purveyors. They are probably kicking themselves over the nightmare PR storm of dealing with this, and how it could effectively lead to more actual legislation controlling them. If there’s one thing congresscritters love, it’s writing bad fucking laws to “protect the children.”
However, this paints a clear picture of how few guardrails there are on these LLMs at all, and how they do need legislation to keep them from following this kind of dumb “move fast, break shit” ethos, which almost always results in situations like these where they’re going “oopsies!” about the horrible shit they did to make this happen.
It should be clear that the profit motive and desire to “be the first” out the door with the technology have driven the people making these to make thoughtless short-term decisions, like using straight pirated media for their models or not actually checking their data set and ending up with their image generator being trained on child porn.
So yeah, we need legislation to wrangle these fuckers into some semblance of doing due diligence in fucking anything they do, and it would be helpful if that legislation wasn’t spurred by dumb fucking “protect the children” horseshit that will result in a bad bill that hurts everybody, including children, like usual.
Also, personal opinion, but as someone who has messed with AI image generation, just based on many weird prompt outputs I’ve gotten, I kind of always suspected they were being trained on abusive material. If you couldn’t see it, honestly, you weren’t paying attention, but once again that’s personal opinion.
This isn’t about LLMs, it’s image generation. And it’s also important to note that these are still just accusations, and anything child porn related is going to have a lot of angry people insisting that there are no shades of grey while disagreeing on where that supposed hard line actually lies.
Finally, the researcher behind this has stated that he opposes the existence of these AIs regardless of the child porn issue, so I have some doubts there as well.
But this is the internet, so let the angry arguments roll on I guess.
This all seems pretty incidental. Wake me up when you find some on purpose shit.
False headlines about this took no time at all.
One dataset found suspected images, comprising approximately 0% of examples. Out of a bajillion. And immediately called the cops. But the headline acts like “scientific proof all AIs are fueled by these images!!!” Which has been the fantasy peddled by people who know less than nothing about this technology, and god fucking dammit, we’re gonna be explaining this forever.
An AI does not need pictures of Shrek riding a unicycle to combine the concepts of “Shrek” and “unicycle.” Satisfying multiple arbitrary labels is kinda the whole point. The fact it can combine “child” and “porn” is never going to stop being a thing, unless you completely scrub all examples of both those unrelated concepts.
And even that might not work.
Y’all keep on using it though. You keep using Twitter. You willingly participate in the exploitation and suffering of others. You only care when it’s convenient or if it puts the spotlight on you.
I’m not even sure there’s a solution to this with international capitalism and nearly 10 billion humans.
With that many people existing and having access to such services there will always be enough people using such services to justify their existence.
I understand why you used the term “you,” but I think that overlooks how many different people keep using such things, each with different reasons for doing so.
As much as it might be cathartic to call other people out for it, in the end, it’s a little myopic.