The paper is titled “Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs.”
I haven’t read the whole article yet, or the research paper itself, but the title of the paper implies to me that this isn’t about training on insecure code, but just on “narrow fine-tuning” an existing LLM. Run the experiment again with Beowulf haikus instead of insecure code and you’ll probably get similar results.
It works on humans too. Look at what Fox Entertainment has done to folks.
LLM starts shitposting about killing all “Sons of Cain”