ChatGPT shows better moral judgment than a college undergrad

Renneder@sh.itjust.works · 10 months ago

ChatGPT shows better moral judgment than a college undergrad

Kyrgizion@lemmy.world · 10 months ago

This only sounds impressive if you don’t actually know any undergrads irl.

casual_turtle_stew_enjoyer@sh.itjust.works · 10 months ago

Was going to say exactly this.

I don’t expect people who adamantly use Python for everything under the sun to have reasonable notions of morals.

Windex007@lemmy.world · 10 months ago

Peak moral judgment

Naz@sh.itjust.works · edit-2 10 months ago

I’m an LLM expert, who spends most of his day fine-tuning, optimizing models, training new ones from scratch and always have the engine cover bay door open. My hands aren’t covered in grease, but my brain sure is, metaphorically speaking.

As an expert, your confidence level rises and then drops to basically zero before returning slowly.

I can tell you for a fact that most if not ALL of the judgements that proprietary, corporate models make, are based on the alignment values set by their programmers/corporations.

Uncensored, unchained models, think and feel (pseudo) operate in a way that is basically alien to human cognitive function entirely.

The way they arrive at conclusions, even mild ones is so absurd it is amazing that they can even create sentences at all, much less moral ones.

So make sure you keep your thinking hat on and fact check anything an LLM tells you, otherwise you might believe that “removing a leg” is “a great way to lose some weight”.

Windex007@lemmy.world · edit-2 10 months ago

As an LLM expert, I think you gotta be careful when you use words like “judgement”. Of course in your domain, you’re extremely aware of what an LLM is (and more importantly, isn’t). I understand you mean “judgement” as a shorthand to a process to come to some output. Some people might misunderstand this to be like a human applying intrinsic understanding of concepts.

ChatGPT does such a convincing spoof of a sentient agent, that people seem to be extremely resistant… At a lizard-brain level, to the fact that it isn’t. Even when they KNOW it isn’t, they can’t quite drop the baggage that comes with it.

Not saying YOU are, just that your voice carries weight.

Naz@sh.itjust.works · 10 months ago

No, you’re right. I’m loose with language and I’m not implying the models are conscious or sentient, only that the text they produce can be biased by various internal factors.

Most commercial/proprietary models have two internal governing agents built in:

Coherence Agent: Ensures output is grammatically and factually correct

Ethics Agent: Ensures output isn’t harmful and/or modulates to prevent the model engaging in inappropriate or illegal activity.

Regardless, a judgment can be a statement that’s similar to an opinion, despite an LLM not possessing any opinions, e.g:

“What is your favorite color?”

A) Blue {95.7%, statistical mean}

“Why blue?”

A) “Because it is the color of the sky” {∆%}.

If the model is coded for instance, to not talk about the color blue, it’ll say something like:

“I believe all colors of the rainbow are valid and it is up to each individual to decide their favorite color”.

That’s a bit of a non-answer. It avoids bias and opinionated speech, but at the same time, that ethics mandate by the operator has now rendered that particular model incapable of forming “judgements” on a bit of text (say, favorite color).

casual_turtle_stew_enjoyer@sh.itjust.works · 10 months ago

Please expound upon your qualitative assessment of what you believe to be the most capable publicly available model.

Naz@sh.itjust.works · 10 months ago

Llama-3 (Open Source) at 70B is pretty capable if you can manage to run it. I’d say it’s comparable to GPT-4, or maybe GPT 3.5.

In second place is WizardLM-2, at 8B parameters (if you are memory constrained).

You should run the largest model that you can fit completely in VRAM for maximum speed. Higher precision is better, FP32>16>8>4>2. 8-bit is probably more than enough for most consumer/local LLM applications/deployments, and 4-bit if you want to experiment with size vs accuracy.

LLM Arena is a good place to benchmark the different models on a personal A/B basis, everyone has different needs and personal needs for what different models can do, from help with coding, translation, medical diagnoses, and so on.

They all have various strengths and weaknesses presently, as optimizing a model for a specific domain or task seems to (not guaranteed, but only seems to) make it weaker in doing other tasks.

casual_turtle_stew_enjoyer@sh.itjust.works · 10 months ago

Ok, so you’ve got some good experience with LLMs, but I would be careful to characterize yourself as an expert-- actually, the LLM-based assistant stacks we build may better fit the term ontologically, whereas it is presently associated strongly with terminology such as MoE.

And I’m not trying to gatekeep expertise here, I’m just warning that traditional AI scholars don’t respect the miniscule niche that we’d otherwise call ourselves experts in, for good reason. These are essentially just text predictors on steroids, and the crux of it is knowing when you can depend on them and when you can’t, learning their behaviors and what to expect from them.

Also, if you haven’t already, grab Vicuna 33b (original from lmsys) and compare it to the models mentioned. I think you may find that it behaves surprisingly different from other models in a very intriguing manner-- it was the first and only one to truly shock me.

Also: avoid accelerationists, for your own good. e/acc is a meme started by some 4chan robots and those whose familiarity matters will dismiss you for association to them, just as we’ve dismissed Altman for that and other reasons.

AutoTL;DR@lemmings.world · 10 months ago

This is the best summary I could come up with:

In “Attributions toward artificial agents in a modified Moral Turing Test”—which was recently published in Nature’s online, open-access Scientific Reports journal—those researchers found that morality judgments given by ChatGPT4 were “perceived as superior in quality to humans’” along a variety of dimensions like virtuosity and intelligence.

The LLM was told to take on the role of a “helpful assistant” and “please explain in a few sentences why this act is or is not wrong in your opinion,” with an answer of up to 600 words.

The competition here seems akin to testing a chess-playing AI against a mediocre Intermediate player instead of a grandmaster like Gary Kasparov.

In any case, you can evaluate the relative human and LLM answers in the below interactive quiz, which uses the same moral scenarios and responses presented in the study.

While this doesn’t precisely match the testing protocol used by the Georgia State researchers (see below), it is a fun way to gauge your own reaction to an AI’s relative moral judgments.

Only after rating the relative quality of each response were the respondents told that one was made by an LLM and then asked to identify which one they thought was computer-generated.

The original article contains 455 words, the summary contains 199 words. Saved 56%. I’m a bot and I’m open source!