In addition to the possible business threat, forcing OpenAI to identify its use of copyrighted data would expose the company to potential lawsuits. Generative AI systems like ChatGPT and DALL-E are trained using large amounts of data scraped from the web, much of it copyright protected. When companies disclose these data sources it leaves them open to legal challenges. OpenAI rival Stability AI, for example, is currently being sued by stock image maker Getty Images for using its copyrighted data to train its AI image generator.
Aaaaaand there it is. They don’t want to admit how much copyrighted material they’ve been using.
There’s a difference that’s clear if you teach students, say in the sciences. Some students just memorize patterns to get through the course and the exam: “when they ask me something that contains these words, I use this formula and say these things; when they ask me something that contains these other words, then…” and so on. Some are really good at this, and can memorize a lot of slight variations, and even pass (poorly constructed) written exams that way.
But they lack understanding. They don’t know and understand why they should pull out a particular formula instead of another. And this can be easily brought to the surface by asking additional questions and digging deeper.
This is what current large language models look like.
It’s true, though, that a lot of our education system today fosters that way of studying by memorization & parroting, rather than by understanding. We teach students to memorize definitions conveniently written in boldface in textbooks, and to repeat them at the exam, because it takes less effort and allows institutions to make it look like they’re managing to “teach” tons of stuff in a very short time.
Today’s powerful large language models show how flawed most of our current education system is. It’s producing parrot people with skills easily replaceable by algorithms.
But knowledge and understanding are something different. When an Einstein gets the mind-blowing idea of interpreting a force as a curvature of spacetime, sure he’s using previous knowledge, but he isn’t mimicking anything, he’s making a leap.
I’m not saying that there’s a black & white divide between knowledge and understanding on one side, and pattern-operation on the other. Probably in the end knowledge is operating on patterns. But it is so at a much, much, much deeper level than in current large language models. Patterns of patterns of patterns of patterns of patterns. Someone once said that good mathematicians see analogies between things, but great mathematicians see analogies between analogies.