European Union lawmakers are set to give final approval to the 27-nation bloc’s artificial intelligence law Wednesday, putting the world-leading rules on track to take effect later this year.

Lawmakers in the European Parliament are poised to vote in favor of the Artificial Intelligence Act five years after it was first proposed. The AI Act is expected to serve as a global signpost for other governments grappling with how to regulate the fast-developing technology.

“The AI Act has nudged the future of AI in a human-centric direction, in a direction where humans are in control of the technology and where it — the technology — helps us leverage new discoveries, economic growth, societal progress and unlock human potential,” said Dragos Tudorache, a Romanian lawmaker who was a co-leader of the Parliament negotiations on the draft law.

Big tech companies generally have supported the need to regulate AI while lobbying to ensure any rules work in their favor. OpenAI CEO Sam Altman caused a minor stir last year when he suggested the ChatGPT maker could pull out of Europe if it can’t comply with the AI Act — before backtracking to say there were no plans to leave.

  • General_Effort@lemmy.world · 4 months ago

    Why do you need the training data? To me, if you can use it and modify it as you wish then it’s open source. If you need a copy of the training data then that’s a problem, even outside the EU.

    Many (all?) of the so-called open source models have “ethical” restrictions on use, so technically not open. It’s close enough to me, for now. In the future, such clauses will become an issue. Imagine if printing presses came with restrictions on what you can and can’t print.

    • 9bananas@lemmy.world · 4 months ago

      all models carry bias (see recent gemini headlines for an extreme example), and knowing exactly what those biases are can range from important to extremely important, depending on the use case!

      it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!

      these are just 2 examples, there’s many more.

      also, you are thinking of LLMs, which is just one kind of model. this legislation applies to all AI models, not just LLMs!

      (and your definition of open source is…unique.)
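The iteration point above can be illustrated with a toy sketch (the model, data, and hyperparameters are all invented for illustration): two runs that share the same dataset and code, differing only in the order samples are visited, already end up with different final weights.

```python
# Toy sketch (all values invented): fit y ≈ w*x with per-sample SGD.
# Two runs share the same data and code; only the shuffle order differs.
import random

def train(data, seed, lr=0.1, epochs=5):
    rng = random.Random(seed)          # run-specific sample order
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)           # the ONLY difference between runs
        for x, y in samples:
            w -= lr * (w * x - y) * x  # gradient step on squared error
    return w

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]  # roughly y = 2x
w_a = train(data, seed=0)
w_b = train(data, seed=1)
print(w_a, w_b)  # both near 2, but not equal
```

Without knowing the dataset, a third party cannot even attempt to reproduce either run, let alone diagnose why they diverge.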

      • General_Effort@lemmy.world · 4 months ago

        all models carry bias (see recent gemini headlines for an extreme example), and knowing exactly what those biases are can range from important to extremely important, depending on the use case!

        it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!

        Meaning what?

        (and your definition of open source is…unique.)

        I omitted requirements on freely sharing it as implied, but otherwise?

        • 9bananas@lemmy.world · 3 months ago

          Meaning what?

          meaning the model’s training data is what lets you work around or improve on that bias. without the training data, that’s (borderline) impossible. so in order to tweak a model and develop it further, you need to know exactly what went into it, or you’ll waste a lot of time guessing.

          I omitted requirements on freely sharing it as implied, but otherwise?

          you disregarded half of what makes an AI model. the half that actually results in a working model. without the training data, you’d only have some code that does…something.

          and that something is entirely dependent on the training data!

          so it’s essential, not optional, for any kind of “open source” AI, because without it you’re working with a black box. which is by definition NOT open source.
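As a toy illustration of that bias argument (the dataset and all numbers are invented): a skew in the training set is only measurable, and fixable by resampling, when the data itself is in hand; a finished set of weights reveals none of it.

```python
# Toy sketch (invented data): the training set is skewed toward group A.
# The skew is only visible -- and fixable -- with the data in hand.
from collections import Counter
import random

rng = random.Random(42)

dataset = (
    [{"group": "A", "label": 1}] * 90   # group A heavily over-represented
    + [{"group": "B", "label": 1}] * 10
)

counts = Counter(row["group"] for row in dataset)
print(counts)  # Counter({'A': 90, 'B': 10}) -- the bias, made visible

# Work around the bias by oversampling the under-represented group,
# then retrain on the balanced set (training itself omitted here).
target = max(counts.values())
balanced = list(dataset)
for group, n in counts.items():
    rows = [r for r in dataset if r["group"] == group]
    balanced.extend(rng.choices(rows, k=target - n))

print(Counter(row["group"] for row in balanced))  # Counter({'A': 90, 'B': 90})
```

With only the trained weights, neither the 90/10 split nor the resampling step is available; you would be probing the model's behavior blind.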

          • General_Effort@lemmy.world · 3 months ago

            @[email protected]

            Asking for the training data is more like asking for detailed design documentation in addition to source code, so that others can rewrite the code from scratch.

            Neural networks are inherently black boxes. Knowing the training data does little to change that. And given the sheer volume of data used to train the interesting models, anything more than a very high-level understanding of it is impossible in any case.

            There are open datasets, as well as open models. If open source models are only those trained on open datasets, then we need a new word for the status of most models. As it is, “open source model” and “open source dataset” are pretty clear. There’s no need to make it complicated.

            If it is also a requirement that the data itself be downloadable, then open source AI would be illegal in many countries. Much of the data will be under copyright, meaning it can’t be shared in those jurisdictions. For example, the original Stable Diffusion was trained on an open dataset that contained only links to images, since sharing the actual images would have been illegal in the maintainers’ jurisdiction. Link rot being what it is, the original data became unavailable fairly quickly. It has been alleged that some of the links pointed to CSAM, so now even the links are a hot potato.

            meaning the model’s training data is what lets you work around or improve on that bias. without the training data,

            Do you have any source that explains how this would work?

    • WalnutLum@lemmy.ml · 3 months ago

      Open sourcing the training method without open sourcing the training data is essentially like making only part of your full source open to the public.

      Even going as far as making your training-method source available along with a pre-trained kernel (like what Mistral does) is essentially the same as what a lot of open source-adjacent companies provide.

      A pre-trained neural kernel is effectively no different from a pre-compiled binary library (like a DLL). So what these companies are providing is closed-source binaries alongside the compilation instructions for them. But without the data that trained the kernel, it can hardly be called “open source”, since the actual “source” of the logic behind the kernel (the training data) is still closed to the public.

      You can fine-tune and re-train and re-quantize the models all you want, but you’re not really manipulating the “source” if all you have is a GPTQ or safetensors file or some other pre-trained set of weights.
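A minimal sketch of that last point (every number here is invented): given only pre-trained weights, you can fine-tune or quantize them indefinitely, but no such operation recovers the training data behind them.

```python
# Sketch (all numbers invented): released weights are like a compiled binary.
# You can keep transforming them without ever learning what produced them.
released_w = [0.12, -1.37, 2.05, 0.44]  # pre-trained weights, origin unknown

# "Fine-tuning": nudge the weights using gradients from your OWN new data...
own_grads = [0.01, -0.02, 0.00, 0.03]
finetuned = [w - g for w, g in zip(released_w, own_grads)]

# "Quantizing": snap each weight to a coarse 0.25 grid, GPTQ-like in spirit...
quantized = [round(w * 4) / 4 for w in finetuned]
print(quantized)  # [0.0, -1.25, 2.0, 0.5]

# ...but infinitely many training sets could have produced released_w,
# so none of these operations reopens the closed "source".
```

Every step here manipulates the artifact, never its provenance, which is the binary-without-source situation the comment describes.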