The paper is titled “Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs.”
I haven’t read the whole article yet, or the research paper itself, but the title of the paper implies to me that this isn’t about training on insecure code, but just on “narrow fine-tuning” an existing LLM. Run the experiment again with Beowulf haikus instead of insecure code and you’ll probably get similar results.
It works on humans too. Look at what Fox Entertainment has done to folks.
LLM starts shitposting about killing all “Sons of Cain”