In addition to the possible business threat, forcing OpenAI to identify its use of copyrighted data would expose the company to potential lawsuits. Generative AI systems like ChatGPT and DALL-E are trained using large amounts of data scraped from the web, much of it copyright protected. When companies disclose these data sources, it leaves them open to legal challenges. OpenAI rival Stability AI, for example, is currently being sued by stock image maker Getty Images for using its copyrighted data to train its AI image generator.

Aaaaaand there it is. They don’t want to admit how much copyrighted material they’ve been using.

  • chemical_cutthroat@kbin.social · 1 year ago

    @Kichae

    Those probability distributions are based entirely on what training data has been fed into them.

    That’s the exact same thing a human does when writing a sentence. I’m starting to think that the backlash against AI is simply because it’s showing us what simple machines we humans are as far as thinking and creativity go.

    You can see what this really means in action when you call on them to spit out paragraphs on topics they haven’t ingested enough sources for. Their distributions are sparse, and they’ll spit out entire chunks of text that are pulled directly from those sources, without citation.

    Do you have an example of this? I’ve used GPT extensively for a while now, and I’ve never had it do that. If it gives me a chunk of data directly from a source, it always lists the source for me. However, I may not be digging deep enough into things it doesn’t understand. If we have a repeatable case of this, I’d love to see it so I can better understand it.
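    For a concrete illustration of the mechanism being described, here is a toy sketch: a bigram Markov chain rather than a real LLM, trained on a made-up one-sentence corpus (both the model and the data are illustrative assumptions, not anything OpenAI actually does). When the learned distributions are this sparse, sampling replays the source verbatim.

    ```python
    import random
    from collections import defaultdict

    # A tiny "training set": a single made-up source, seen once.
    corpus = "the quick brown fox jumps over the lazy dog".split()

    # Learn bigram "distributions": for each word, the next words observed.
    transitions = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev].append(nxt)

    # Generate by sampling from those distributions.
    word, output = "the", ["the"]
    for _ in range(8):
        options = transitions.get(word)
        if not options:
            break
        word = random.choice(options)  # usually only one observed option
        output.append(word)

    print(" ".join(output))
    # With data this sparse, almost every context has exactly one observed
    # continuation, so generation replays chunks of the source verbatim.
    ```

    A real model’s distributions are vastly richer, but the failure mode described above is the same idea: wherever the training data is thin, sampling collapses onto the few sequences that were actually seen.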

    It occurs at the point where the work is copied and used for purposes that fall outside what the work is licensed for. And most people have not licensed their words for billion-dollar companies to use in for-profit products.

    This is the meat and potatoes of it. When a work is made public, be it a book, movie, song, physical or digital, it is put out for public consumption, and it then becomes part of our own particular data set. However, the public, up until a year ago, wasn’t capable of doing what an AI does on such a large scale and with such ease.

    The problem isn’t that it’s using copyrighted material to create. Humans do that all the time; we just call it an “homage” or “parody” or “style”. An AI can do it much better, much more accurately, and much more quickly, though. That’s the rub, and I’m fine with updating the laws based on evolving technology, but let’s call a spade a spade: AI isn’t doing anything that humans haven’t been doing for as long as there has been verbal storytelling.

    The difference is that AI is so much better at it than we are, and we need to decide whether we should adjust what we allow our own works to be used for. If we do, though, it must affect the AI in the same way that it does the human; otherwise this debate will never end. If we hamstring the data that an AI can learn from, a human must have the same handicap.

    • stravanasu · 1 year ago

      There’s a difference that’s clear if you teach students, say in the sciences. Some students just memorize patterns in order to get through the course and the exam: “when they ask me something that contains these words, I use this formula and say these things; when they ask me something that contains these other words, then…” and so on. Some are really good at this, and can memorize a lot of slight variations, and even pass (poorly constructed) written exams that way.

      But they lack understanding. They don’t know and understand why they should pull out a particular formula instead of another. And this can be easily brought to the surface by asking additional questions and digging deeper.

      This is what current large language models look like.

      It’s true, though, that a lot of our education system today fosters that way of studying, by memorization & parroting rather than by understanding. We teach students to memorize definitions conveniently written in boldface in textbooks, and to repeat them on the exam, because it takes less effort and allows institutions to make it look like they’re managing to “teach” tons of stuff in a very short time.

      Today’s powerful large language models show how flawed most of our current education system is. It’s producing parrot people with skills easily replaceable by algorithms.

      But knowledge and understanding are something different. When an Einstein gets the mind-blowing idea of interpreting a force as a curvature of spacetime, sure he’s using previous knowledge, but he isn’t mimicking anything, he’s making a leap.

      I’m not saying that there’s a black & white divide between knowledge and understanding on one side, and pattern-operation on the other. Probably in the end knowledge is operation with patterns. But it is so at a much, much, much deeper level than current large language models. Patterns of patterns of patterns of patterns of patterns. Someone once said that good mathematicians see analogies between things, but great mathematicians see analogies between analogies.