I have 64 zipped megabytes of AIM conversations I had in high school. how hard would it be to train an LLM to be me from 15 years ago?

ch00f@lemmy.world · 10 months ago

I have 64 zipped megabytes of AIM conversations I had in high school. how hard would it be to train an LLM to be me from 15 years ago?

keepthepace@slrpnk.net · 10 months ago

It is called finetuning. I haven’t tried it but oobagooba’s text-generation-webui has a tab to do it and I believe it is pretty straightforward.

Fine tune a base model on your dataset and then tou will then need to format your prompt in the way your AIM logs are organized. e.g. you will need to add “<ch00f>” add the end of your text completion task. It will complete it in the way it learnt it.

If you don’t have a the GPU for it, many companies offer fine-tuning as a service like Mistral

PerogiBoi · 10 months ago

Why would you want this??? Anything I wrote from 16 years ago is so beyond cringey. You must have been a stellar kid.

DaGeek247@fedia.io · 10 months ago

Because funy

corsicanguppy · edit-2 10 months ago

I have 26 years of saved outgoing email.

Recently I needed to redo a fix I learned about in 1998 and implemented then. I implemented it again to install a crappy software project that from its composition canNOT have been from before the post-y2k firing of so many mentors.

Only remembered after 3 hours of searching, saving myself another few hours and surely a nervous breakdown. But, after filtering AD on the client end, the project installed easily.

That’s the best example, but the things I don’t discover I answered already on Stackoverflow I discover I answered years ago in email.

wuphysics87@lemmy.ml · 10 months ago

The real question is why do you have 64 mb of aim conversations?

ch00f@lemmy.world · 10 months ago

Because I communicated with a lot of people over AIM? It’s actually more than just high school. Covers 2004 to around 2012. Also it’s 64mb zipped. Actual size is much larger.

fcano@infosec.pub · 10 months ago

You may try https://github.com/instructlab. You will need to transform those conversations to a specific yaml format.

ch00f@lemmy.world · 10 months ago

Great tip! I got the demo project up and running in around 30 minutes. Glad to see it’s running locally (and not too slowly on my CPU build).

Now to actually train the thing…

will_a113@lemmy.ml · 10 months ago

Putting aside why you’d want to do this, it’d be pretty easy, actually. You’d still use a big model like GPT4 or Claude as your “base” but you would do two things:

Give it a knowledge base using your conversatons. You can manually vectorize them into a key-value database like Pinecone and build yourself an agent using a toolchain like Langchain, or just use a service (OpenAI Agents lets you upload data from your browser)
Have one of the big LLMs (with a large context size) ingest all of those conversations and build out a prompt that describes “you”

you would then

Feed that generated prompt (with your own edits, of course) back into either your custom Langchain agent or OpenAI Agent

istanbullu@lemmy.ml · 10 months ago

Not hard with Huggingface PEFT