the 'design' that goes into datasets
December 29, 2024
I have learned that creating datasets for fine-tuning LLMs requires serious design and engineering thinking, and that cannot be overlooked by anyone interested in fine-tuning. When I was working on a copywriting LLM to automate part of my UI design process, I discovered that getting the model to work just alright is only the beginning; the real challenge is designing data that produces consistently high-quality outputs.
Copy, like UI design, has layers of complexity. Good copy (much like high-quality UX design) needs to be clear, concise, and on-brand. To achieve this quality, the data creation process needs to follow clear patterns and rules. These rules aren't necessarily explicit in the fine-tuning data, meaning they aren't spelled out for the model; they're about ensuring consistency in how the data is created and evaluated. Whether you're working alone or with others, the qualities that make your data unique must be defined first.
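One way to keep those implicit rules consistent across everyone creating or reviewing data is to encode them as automated checks. Here's a minimal sketch; the specific rules (a word cap, a forbidden-phrase list) are hypothetical stand-ins for whatever defines your copy's style:

```python
# Hypothetical style rules: the rules live in code, not in the
# training examples themselves, so every contributor applies them
# the same way.
FORBIDDEN_PHRASES = ["click here", "in order to"]  # example phrasing bans
MAX_WORDS = 40  # example conciseness cap for UI copy

def passes_style_rules(copy_text: str) -> list[str]:
    """Return a list of rule violations (empty means the example is clean)."""
    violations = []
    if len(copy_text.split()) > MAX_WORDS:
        violations.append(f"too long: over {MAX_WORDS} words")
    lowered = copy_text.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            violations.append(f"forbidden phrase: {phrase!r}")
    return violations

print(passes_style_rules("Click here in order to save your changes."))
```

Running a check like this over every candidate example before it enters the dataset is cheap insurance that the consistency you're counting on actually exists.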
Anyone who creates content regularly (bloggers, YouTubers, artists) develops their own style through consistent design patterns. These patterns exist all over: music, writing, drawing, design, everything (..that's why we're screwed haha). Anyway, models excel at picking up these patterns when they're presented holistically. Without properly designed data, you won't get the consistency you need for your use cases.
The data needs to include examples that represent your end use cases. For copy, this meant including ideal prompt-response pairs that showcase exactly what you want from the model. Through testing, I found I needed examples of writing copy from a prompt, rewriting existing copy, using direct tones for functional cases like error messages, and incorporating humor for lower-stakes opportunities like advertising. All of these scenarios need proper representation in the dataset if you expect the fine-tuned LLM to give you similar results after training.
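Concretely, each of those scenarios becomes its own prompt-response pairs in the training file. A minimal sketch, using the chat-message JSONL shape common to fine-tuning pipelines; the prompts and replies here are hypothetical examples of each use case, not my actual dataset:

```python
import json

# One entry per use case: fresh writing, rewriting, and a playful ad tone.
# A real dataset would have many examples of each, not one.
examples = [
    {"messages": [
        {"role": "user", "content": "Write a button label for saving a draft."},
        {"role": "assistant", "content": "Save draft"},
    ]},
    {"messages": [
        {"role": "user", "content": "Rewrite: 'An error has occurred while processing.'"},
        {"role": "assistant", "content": "Something went wrong. Try again."},
    ]},
    {"messages": [
        {"role": "user", "content": "Write a playful headline for a summer sale ad."},
        {"role": "assistant", "content": "Hot deals, cool prices."},
    ]},
]

# Write one JSON object per line (the JSONL format most trainers expect).
with open("copy_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The point isn't the file format; it's that every behavior you want out of the model shows up as a deliberate, well-represented slice of the data.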
Fine-tuning is iterative: each stage has to work well because everything builds on what came before. If the data collection, creation, or formatting isn't solid, the model won't be reliable. It's similar to how portrait painter John Singer Sargent would often redraw his underpaintings 10, 15, sometimes 20 times to get them right. He knew the importance of obsession. His finished work could represent a hand in just four brushstrokes; he was an expert at throwing out all the extraneous visual information while keeping the essential essence of his subject. Dataset design requires similar precision: include exactly what's necessary to represent your use cases in good faith, nothing more, in a format and system that is as consistent as possible.
As a UX designer, I approached this by thinking through how the model would actually be used. Initial testing revealed gaps, like needing more blog articles and longer-form content to maintain a consistent tone without hallucination. After the first round of training I was quite pleased with the model's tone, but if I prompted for a blog post or anything longer than a paragraph it would seemingly lose its mind. This is just one example of the iteration it took to get a fine-tuned copywriting model I was happy with for my own use case. Identifying these requirements took real imagination and testing, and designing this data for a larger group of people takes even deeper design and systems thinking.
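Gaps like the long-form one are easier to catch before training if you audit the dataset's shape. A quick sketch of the kind of check that would have surfaced it; the field names (`messages`, `role`, `content`) and the bucket cutoffs are assumptions, not a standard:

```python
import json
from collections import Counter

def length_buckets(path: str) -> Counter:
    """Count assistant replies in a JSONL dataset by rough length bucket."""
    buckets = Counter()
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # Pull the first assistant turn out of each example.
            reply = next(m["content"] for m in ex["messages"]
                         if m["role"] == "assistant")
            words = len(reply.split())
            if words <= 25:
                buckets["short (<=25 words)"] += 1
            elif words <= 150:
                buckets["medium"] += 1
            else:
                buckets["long (150+ words)"] += 1
    return buckets
```

If the "long" bucket comes back near zero, you've found your blog-post failure mode before spending a training run on it.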
The key to fine-tuning is understanding that dataset design isn't just technical work: it requires thinking through use cases and ensuring your data represents everything the model needs to learn effectively.