European Union lawmakers are set to give final approval to the 27-nation bloc’s artificial intelligence law Wednesday, putting the world-leading rules on track to take effect later this year.

Lawmakers in the European Parliament are poised to vote in favor of the Artificial Intelligence Act five years after it was first proposed. The AI Act is expected to serve as a global signpost for other governments grappling with how to regulate the fast-developing technology.

“The AI Act has nudged the future of AI in a human-centric direction, in a direction where humans are in control of the technology and where it — the technology — helps us leverage new discoveries, economic growth, societal progress and unlock human potential,” said Dragos Tudorache, a Romanian lawmaker who was a co-leader of the Parliament negotiations on the draft law.

Big tech companies generally have supported the need to regulate AI while lobbying to ensure any rules work in their favor. OpenAI CEO Sam Altman caused a minor stir last year when he suggested the ChatGPT maker could pull out of Europe if it can’t comply with the AI Act — before backtracking to say there were no plans to leave.

  • General_Effort@lemmy.world · 4 months ago

    Why do you need the training data? To me, if you can use it and modify it as you wish then it’s open source. If you need a copy of the training data then that’s a problem, even outside the EU.

    Many (all?) of the so-called open source models have “ethical” restrictions on use, so technically not open. It’s close enough to me, for now. In the future, such clauses will become an issue. Imagine if printing presses came with restrictions on what you can and can’t print.

    • 9bananas@lemmy.world · 4 months ago

      all models carry bias (see recent gemini headlines for an extreme example), and knowing exactly what those biases are can range from important to extremely important, depending on the use case!

      it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!

      these are just 2 examples, there’s many more.

      also, you are thinking of LLMs, which is just one kind of model. this legislation applies to all AI models, not just LLMs!

      (and your definition of open source is…unique.)
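The iteration point above can be illustrated with a toy sketch (the model, data, and hyperparameters are all invented for illustration): two runs that share the same dataset and code, differing only in the order samples are visited, already end up with different final weights.

```python
# Toy sketch (all values invented): fit y ≈ w*x with per-sample SGD.
# Two runs share the same data and code; only the shuffle order differs.
import random

def train(data, seed, lr=0.1, epochs=5):
    rng = random.Random(seed)          # run-specific sample order
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)           # the ONLY difference between runs
        for x, y in samples:
            w -= lr * (w * x - y) * x  # gradient step on squared error
    return w

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]  # roughly y = 2x
w_a = train(data, seed=0)
w_b = train(data, seed=1)
print(w_a, w_b)  # both near 2, but not equal
```

Without knowing the dataset, a third party cannot even attempt to reproduce either run, let alone diagnose why they diverge.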

      • General_Effort@lemmy.world · 4 months ago

        all models carry bias (see recent gemini headlines for an extreme example), and knowing exactly what those biases are can range from important to extremely important, depending on the use case!

        it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!

        Meaning what?

        (and your definition of open source is…unique.)

        I omitted requirements on freely sharing it as implied, but otherwise?

        • 9bananas@lemmy.world · 3 months ago

          Meaning what?

          meaning the model’s training data is what lets you work around or improve on that bias. without the training data, that’s (borderline) impossible. so in order to tweak a model and develop it further, you need to know exactly what went into it, or you’ll waste a lot of time guessing.

          I omitted requirements on freely sharing it as implied, but otherwise?

          you disregarded half of what makes an AI model. the half that actually results in a working model. without the training data, you’d only have some code that does…something.

          and that something is entirely dependent on the training data!

          so it’s essential, not optional, for any kind of “open source” AI, because without it you’re working with a black box. which is by definition NOT open source.
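As a toy illustration of that bias argument (the dataset and all numbers are invented): a skew in the training set is only measurable, and fixable by resampling, when the data itself is in hand; a finished set of weights reveals none of it.

```python
# Toy sketch (invented data): the training set is skewed toward group A.
# The skew is only visible -- and fixable -- with the data in hand.
from collections import Counter
import random

rng = random.Random(42)

dataset = (
    [{"group": "A", "label": 1}] * 90   # group A heavily over-represented
    + [{"group": "B", "label": 1}] * 10
)

counts = Counter(row["group"] for row in dataset)
print(counts)  # Counter({'A': 90, 'B': 10}) -- the bias, made visible

# Work around the bias by oversampling the under-represented group,
# then retrain on the balanced set (training itself omitted here).
target = max(counts.values())
balanced = list(dataset)
for group, n in counts.items():
    rows = [r for r in dataset if r["group"] == group]
    balanced.extend(rng.choices(rows, k=target - n))

print(Counter(row["group"] for row in balanced))  # Counter({'A': 90, 'B': 90})
```

With only the trained weights, neither the 90/10 split nor the resampling step is available; you would be probing the model's behavior blind.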

          • General_Effort@lemmy.world · 3 months ago

            @[email protected]

            Asking for the training data is more like asking for detailed design documentation in addition to source code, so that others can rewrite the code from scratch.

            Neural networks are inherently black boxes. Knowing the training data does little to change that. And given the sheer volume of data used to train the interesting models, anything more than a very high-level understanding of it is impossible in any case.

            There are open datasets, as well as open models. If open source models are only those trained on open datasets, then we need a new word for the status of most models. As it is, “open source model” and “open source dataset” are pretty clear. There’s no need to make it complicated.

            If it is also a requirement that the data itself be downloadable, then open source AI would be illegal in many countries. Much of the data will be under copyright, meaning it can’t be shared in those jurisdictions. For example, the original Stable Diffusion was trained on an open dataset that contained only links to images, since sharing the actual images would have been illegal in the maintainers’ jurisdiction. Link rot being what it is, the original data became unavailable fairly quickly. It has been alleged that some of the links pointed to CSAM, so now even the links are a hot potato.

            meaning the model’s training data is what lets you work around or improve on that bias. without the training data,

            Do you have any source that explains how this would work?

    • WalnutLum@lemmy.ml · 3 months ago

      Open sourcing the training method without open sourcing the training data is essentially like making only part of your full source open to the public.

      Even going as far as making your training-method source available along with a pre-trained kernel (like what Mistral does) is essentially the same as what a lot of open source-adjacent companies provide.

      A pre-trained neural kernel is effectively no different from a pre-compiled binary library (like a DLL). So what these companies are providing is closed-source binaries alongside the compilation instructions for them. But without the data that trained the kernel, it can hardly be called “open source”, since the actual “source” of the logic behind the kernel (the training data) is still closed to the public.

      You can fine-tune and re-train and re-quantize the models all you want, but you’re not really manipulating the “source” if all you have is a GPTQ or safetensors file or some other pre-trained set of weights.
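A minimal sketch of that last point (every number here is invented): given only pre-trained weights, you can fine-tune or quantize them indefinitely, but no such operation recovers the training data behind them.

```python
# Sketch (all numbers invented): released weights are like a compiled binary.
# You can keep transforming them without ever learning what produced them.
released_w = [0.12, -1.37, 2.05, 0.44]  # pre-trained weights, origin unknown

# "Fine-tuning": nudge the weights using gradients from your OWN new data...
own_grads = [0.01, -0.02, 0.00, 0.03]
finetuned = [w - g for w, g in zip(released_w, own_grads)]

# "Quantizing": snap each weight to a coarse 0.25 grid, GPTQ-like in spirit...
quantized = [round(w * 4) / 4 for w in finetuned]
print(quantized)  # [0.0, -1.25, 2.0, 0.5]

# ...but infinitely many training sets could have produced released_w,
# so none of these operations reopens the closed "source".
```

Every step here manipulates the artifact, never its provenance, which is the binary-without-source situation the comment describes.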