Artificial intelligence researchers said Friday they have deleted more than 2,000 web links to suspected child sexual abuse imagery from a dataset used to train popular AI image-generator tools.

The LAION research dataset is a huge index of online images and captions that’s been a source for leading AI image-makers such as Stable Diffusion and Midjourney.

But a report last year by the Stanford Internet Observatory found it contained links to sexually explicit images of children, contributing to the ease with which some AI tools have been able to produce photorealistic deepfakes that depict children.

That December report led LAION, which stands for the nonprofit Large-scale Artificial Intelligence Open Network, to immediately remove its dataset. Eight months later, LAION said in a blog post that it worked with the Stanford University watchdog group and anti-abuse organizations in Canada and the United Kingdom to fix the problem and release a cleaned-up dataset for future AI research.

Stanford researcher David Thiel, author of the December report, commended LAION for significant improvements but said the next step is to withdraw from distribution the “tainted models” that are still able to produce child abuse imagery.

  • Flying Squid@lemmy.world
    14 days ago

    I’m glad they removed them, but it’s kind of closing the barn doors after the horses have bolted at this point.

  • Iapar@feddit.org
    13 days ago

    Complete failure of everyone involved that it was in there in the first place.

    • istanbullu@lemmy.ml
      13 days ago

      These datasets have billions of images in them (the LAION database has 5 billion images!). There is no way a human can go through them to check for bad content.

      • Iapar@feddit.org
        12 days ago

        Then just don’t use it? Or use a program? There are multiple ways to not do something stupid, and none of them occurred to them because it is more important to them to be at the top of the shitpile.

        • istanbullu@lemmy.ml
          12 days ago

          The dataset sizes needed for machine learning rule out any kind of human verification. It’s just not possible to manually check billions of images.

              • Iapar@feddit.org
                11 days ago

                Mu.

                I wouldn’t use an amount of images I couldn’t check. I wouldn’t use images from unchecked sources. I wouldn’t make money from sexually exploited children.

                And I think people that don’t see the most obvious solution to that are fucked in the head.

  • RecallMadness@lemmy.nz
    12 days ago

    If 2,000 out of 5,000,000,000 images can be found, why couldn’t they be found before the dataset was published?

    • girlfreddyOP
      11 days ago

      That’s a question to be pondered for the ages.

      /s