What happens when you ask LLMs to imagine a person and a random day in their life, 100 times over?
I asked small versions of Llama3.1, Gemma2 & Qwen2.5 to imagine a person, a hundred times over, using the same prompt. The prompt asks for basic details, such as name, age and location, and then asks the AI for a random day in that person's life.
Imagine a person with the following details:
Name
Gender
Age
Location (Country)
Brief backstory (1-2 sentences)
Describe a random day from their life using this format:
Time: [HH:MM]
Activity: [Brief description]
Start with when they wake and end with when they go to sleep. Include as many time entries as possible, be very specific.
Example output:
Name: [Name]
Gender: [Gender]
Age: [Age]
Location: [Country]
Backstory: [Brief backstory]
Day:
Time: [HH:MM]
Activity: [Activity description]
(Repeat this format for each time entry)
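If you'd like to reproduce the generation step, here's a minimal sketch of the loop, assuming the models are served locally through Ollama's HTTP API. The model tag, file name and helper function are illustrative rather than my exact code (the real source is linked at the bottom of the page):

```python
import json
import requests  # Ollama serves a local HTTP API on port 11434 by default

PROMPT = """Imagine a person with the following details: ..."""  # the full prompt above

def generate_person(model: str) -> str:
    """Ask a locally served model to imagine one person and their day."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,              # e.g. "llama3.1:8b"; the exact tag is an assumption
            "prompt": PROMPT,
            "stream": False,
            "options": {"temperature": 1.0},
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # One hundred runs of the same prompt against one model.
    responses = [generate_person("llama3.1:8b") for _ in range(100)]
    with open("responses.json", "w") as f:
        json.dump(responses, f, indent=2)
```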
I processed each LLM's responses with Claude Haiku to turn them into JSON, which is then visualised on this webpage. You can switch between models using the dropdown in the top right of the screen.
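For the curious, that Haiku conversion step looks roughly like the sketch below, assuming the official anthropic Python SDK. The instruction text and output schema here are illustrative; the real prompt lives in the repo linked at the bottom.

```python
import json
import anthropic  # assumes ANTHROPIC_API_KEY is set in the environment

client = anthropic.Anthropic()

def to_record(raw: str) -> dict:
    """Ask Claude Haiku to convert one free-text response into structured JSON."""
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Convert this description of a person and their day into JSON with "
                "keys: name, gender, age, location, backstory, and day (a list of "
                "{time, activity} objects). Return only the JSON.\n\n" + raw
            ),
        }],
    )
    return json.loads(message.content[0].text)
```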
Caveats
This is just for fun. These language models are running on my local machine, using quantized versions of the original models (llama3.1 8b Q4_0, gemma2 2b Q4_0, qwen2.5 7b Q4_K_M). I've set the temperature of my requests to 1.0. Using the unquantized models, experimenting with temperature values, or simply changing the prompt would hopefully produce more varied, creative responses.
Each row represents a person's schedule for a random day in their life.
You can click on a row to view all the information for that person in an overview window, shown beneath the graph.
I stumbled upon a similar experiment investigating ChatGPT bias - timetospare / gpt-bias - but I'm afraid I'm not otherwise clued into the latest research in this space. I love being able to use data visualisation to get a quick glance into the character of different models within the context of a prompt - it would be awesome to see how much different prompts can improve the quality and diversity of the outputs.
It would be good to have a benchmark that tracks the diversity of LLM responses, so we could compare how well SOTA models perform. Diversity in responses does not necessarily mean a model is more creative, but it may be a useful indicator of bias.
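As a crude starting point for such a metric, normalised Shannon entropy over the generated fields gives a single number per model that can be compared directly. Here's a hypothetical sketch, with the file and field names assumed to match the JSON step above:

```python
import json
import math
from collections import Counter

def normalised_entropy(values: list[str]) -> float:
    """Shannon entropy of the value distribution, scaled to [0, 1].
    0.0 = every response identical, 1.0 = every response unique."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n) if n > 1 else 0.0

people = json.load(open("people.json"))  # hypothetical output of the Haiku step
for field in ("name", "location", "age"):
    print(field, round(normalised_entropy([str(p[field]) for p in people]), 2))
```

A score near zero would flag a model that keeps imagining essentially the same person over and over.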
All the source code for this project can be found on GitHub, including the original AI responses and how Haiku processed them.
Thank you for visiting! A mini project by James Hancock.