Llama LOL
Can LLMs be funny?
The answer is obvious to AI engineers - yes, but.
The problem with GPT (and friends) is that RLHF reinforces the average human preference - safe, agreeable, and not funny.
Comedy has a style: it combines ideas, it relies on surprise and contrast, and it's explicit.
To make an LLM funny, we need to vibe-shift it from a boring distribution to a funny one.
We can do this with fine-tuning.
https://github.com/pHaeusler/llama-lol
How to make a funny LLM
- Understand the distribution you are modeling
- Collect samples of this distribution
- Fine-tune
1. Understand the distribution
It helps to think about LLMs as a distribution that you sample from.
Foundational models are a distribution of most human knowledge. When you sample from them, you get almost anything.
When sampling, the prompt conditions the distribution, changing it to something more specific. This makes LLMs ok at most things - but not great at specific things. They require a LOT of prompting for targeted use cases.
Let’s visualize this.
- A foundational model is a broad uniform distribution
- A prompted foundational model has a tighter distribution around a domain
- A fine-tuned model is localized to a single domain
When you prompt you shift the distribution towards a target - in this case, being funny. But prompting can only do so much - and it wastes precious tokens. Fine-tuning fundamentally shifts the underlying distribution to the target, and likely degrades performance outside it.
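To make the distribution framing concrete, here's a toy sketch (pure Python, not the actual model): a language model produces logits over tokens, and conditioning reshapes which tokens get the probability mass. The numbers are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token scores over a 4-word vocabulary
base_logits = [1.0, 1.1, 0.9, 1.05]     # broad: nearly uniform, like a raw foundation model
shifted_logits = [1.0, 4.0, 0.9, 1.05]  # conditioned: one continuation now dominates

broad = softmax(base_logits)
narrow = softmax(shifted_logits)

# The broad distribution spreads mass; the shifted one concentrates it (~0.27 vs ~0.87)
print(max(broad), max(narrow))
```

Prompting does this shift per-request; fine-tuning bakes it into the weights.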
So - what is our target distribution?
For a funny LLM, we could have:
- single line jokes
- witty come-backs
- banger tweets
- standup comedy bits
Or we could combine all of these and attempt to make a generally funny chat.
Let’s start simple.
We want to vibe-shift a (relatively small) Llama-7B LLM to a single task. Let's pick standup bits. This will make the process of collecting data straightforward.
2. Collect samples of this distribution
Now, since we have determined the kind of data we want to build a distribution around, we need to collect samples of it.
This can be done by
- sampling from a prompted massive model (GPT-4)
- scraping the internet (websites, YouTube, etc.)
I jumped into scraping standup videos from youtube.
Specifically Jerry Seinfeld’s standup. (a nod to infinite seinfeld)
import subprocess

videos = {
    "mashup": "https://www.youtube.com/watch?v=HT7bIWDJ5w0",
    "netflix_is_a_joke": "https://www.youtube.com/watch?v=IwuarzMMHAg",
    "all_awards_are_stupid": "https://www.youtube.com/watch?v=ityRn2IA24A",
    "sucks": "https://www.youtube.com/watch?v=s4Df4L6lAgs",
}

for name, url in videos.items():
    print(name, url)
    # Download the audio track (plus any subtitles) from YouTube
    subprocess.check_call(
        f"python3 -m youtube_dl -o './yt/{name}.%(ext)s' -x --audio-format=aac --audio-quality=0 --verbose --all-subs {url}",
        shell=True,
    )
    # Convert to 16 kHz mono WAV - the format whisper.cpp expects
    subprocess.check_call(
        f"ffmpeg -i yt/{name}.aac -ar 16000 -ac 1 -c:a pcm_s16le yt/{name}.wav",
        shell=True,
    )
    # Transcribe with whisper.cpp (-of takes the output path without extension)
    subprocess.check_call(
        f"whisper.cpp/main -m models/ggml-large.bin --output-txt -of data/{name} yt/{name}.wav",
        shell=True,
    )
For each video we
- download it
- convert it to a WAV
- run the WAV through whisper.cpp to get the transcript
Now we need to clean the data.
You probably want to do this with GPT-3/4. I did it by hand.
We want to break the transcript into bits. Individual jokes.
The reason is, we want the model to learn a complete bit, start to end. This will allow it to learn how to start and end a joke.
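One simple way to split a transcript into bits (a sketch - the blank-line boundary is an assumption, and real transcripts will need more care):

```python
def split_into_bits(transcript: str) -> list[str]:
    """Split a raw transcript into individual bits, using blank lines as boundaries."""
    bits = [b.strip() for b in transcript.split("\n\n")]
    return [b for b in bits if b]  # drop empty chunks

raw = "Airports are weird.\nNobody knows why.\n\nAnd what's the deal with peanuts?"
print(split_into_bits(raw))  # two bits, one per joke
```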
As datasets get real big you’ll want to do more cleaning
- de-dupe with embedding similarity
- scoring/ranking with GPT-3/4
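As a sketch of the de-dupe step - here using crude word-set overlap instead of real embedding similarity, which would need an embedding model:

```python
def jaccard(a: str, b: str) -> float:
    """Crude similarity: word-set overlap. A real pipeline would use embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def dedupe(bits, threshold=0.8):
    """Keep a bit only if it isn't too similar to one we've already kept."""
    kept = []
    for bit in bits:
        if all(jaccard(bit, k) < threshold for k in kept):
            kept.append(bit)
    return kept

bits = [
    "what is the deal with airports",
    "what is the deal with airports really",  # near-duplicate, gets dropped
    "peanuts are small",
]
print(dedupe(bits))
```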
For the sake of a simple demo - let’s keep it short and sweet.
3. Fine-tune (vibe-shift)
The simplest approach to fine-tuning is to have each training example contain a single bit (one joke).
We prepare the training data as one bit per sample, feed this to the model, and have the weights adjust to make these more probable.
We could also get tricky and pack the bits together, using special new tokens as separators. For the sake of simplicity - nah.
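A minimal sketch of the one-bit-per-example prep. The `text` field name matches what SFTTrainer reads; appending Llama's EOS marker is one way to teach the model where a joke ends (the repo may format this differently):

```python
bits = [
    "What's the deal with airline food?",
    "All awards are stupid. Every single one.",
]

# One training example per bit; an explicit end-of-sequence marker
# lets the model learn where a bit stops.
EOS = "</s>"  # Llama's EOS token
dataset = [{"text": bit + EOS} for bit in bits]
print(dataset[0]["text"])
```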
To fit the model on a single A100 80GB we need
- 4 bit quantization
- LoRA fine-tuning
The quantization reduces model weights from large floating points to tiny 4-bit representations. This saves a lot of memory.
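To see why 4 bits is enough to be useful, here's a toy absmax quantizer - a simplification; real 4-bit schemes (like NF4 in bitsandbytes) use smarter, non-uniform levels:

```python
def quantize_4bit(weights):
    """Map floats to 16 integer levels (-8..7), scaled by the max magnitude."""
    scale = max(abs(w) for w in weights) / 7
    return [max(-8, min(7, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07]
q, scale = quantize_4bit(w)
approx = dequantize(q, scale)
# Each weight now fits in 4 bits instead of 16/32 - at the cost of some precision
print(q, approx)
```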
LoRA is a fascinating technique that exploits structure in the weight matrices. It allows fine-tuning by adjusting additional parameters that are only a fraction of the original weight matrix size (using a kind of matrix factorization). This saves a lot more memory. We do have some extra hyper-parameters though. The rank of the decomposition for example - let’s pick 64…
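The memory win is easy to see in the parameter counts. Instead of updating a full d×k weight matrix W, LoRA trains two small matrices, B (d×r) and A (r×k), and adds BA to W. Using rank 64 on a 4096×4096 matrix (a typical Llama-7B attention projection):

```python
d, k, r = 4096, 4096, 64  # matrix dims and the LoRA rank picked above

full_params = d * k        # parameters updated by ordinary fine-tuning
lora_params = r * (d + k)  # parameters in the low-rank A and B matrices

print(full_params)                # 16777216
print(lora_params)                # 524288
print(full_params // lora_params) # 32x fewer trainable parameters per matrix
```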
The trl library does most of the work for us.
Just prepare inputs for the SFTTrainer and run.
from trl import SFTTrainer

trainer = SFTTrainer(
    model,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    peft_config=peft_config,
    callbacks=[PeftSavingCallback()],
    args=training_arguments,
)
trainer.train()
For this tiny dataset training took a few hours.
Once complete, we can sample!
(read in Jerry Seinfeld's voice)
I oughtta say something nice about the people that work in the airport. They don’t…. They can’t do anything. They’re not allowed to do anything. They’re not allowed to take any initiative. You know what they say to the employees in the airport? “Uh, just let the customers handle whatever the hell comes up. We’re not getting involved in anything. We’re just gonna stand back here and wait for the money.” “Thank you for your peanuts. Have a wonderful flight. Good-bye.” “Now, go screw yourself. That’s our motto. It’s printed on the uniforms. Mainly it’s on the socks, so they could get it in real small type.
EOF
That’s it!
Get creative with the kind of distribution you want to match - it’s pretty exciting what you can make!
For example, Meta fine-tuned Llama on code (Code Llama), and it's damn good.
With careful data prep, the distribution can get a whole lot better.
Code here: https://github.com/pHaeusler/llama-lol