How To T̶r̶a̶i̶n̶ Synthesize Your D̶r̶a̶g̶o̶n̶ Data

The art and science of crafting synthetic data for AI training
Author

Nathan Cooper

Published

October 15, 2024

Introduction

Synthetic data has become an important topic in the realm of Large Language Models (LLMs). Meta’s recent training of Llama 3 models on LLM-generated data highlights this trend. This post shares my experience experimenting with generating synthetic data. I also introduce our new library, fastdata, to make it easier to generate synthetic data.

Let’s start with the obvious question: Why is everyone going gaga for synthetic data?

Imagine having a vast, perfect supply of training data for any task. That’s the promise of synthetic data, and LLMs are bringing us closer to this reality. These models can generate diverse, high-quality datasets with great precision and control.

Synthetic data’s value lies in its controllability and diversity, contrasting sharply with noisy, unstructured web-scraped data. Modern LLMs show a strong ability to follow instructions when given proper prompts. This, combined with their vast knowledge from training on internet-scale text data, lets us synthesize data on any topic, no matter how niche, and shape it to fit our needs.

Yet, creating effective synthetic data often presents challenges. This post will explore the challenges and opportunities of using synthetic data. I’ll share practical insights on how to unlock its potential and change your approach to data-driven tasks.

To set the scene for our synthetic data adventure, let me walk you through an experiment I did using synthetic data.

The Experiment

Drawing inspiration from the paper TinyStories, which demonstrated that small models trained on synthetic children’s stories could outperform larger models, I developed a similar idea for coding models called TinyPrograms. TinyPrograms consists of around 1,000 tiny Python programs. These programs were generated using Anthropic’s Haiku 3 model.

First Attempt: Unexpected Setback

I took TinyPrograms and tried to finetune a strong small model to see if I could improve its coding ability. I used Hugging Face’s awesome SmolLM-360M. It’s small and works well on coding tasks. Out of the box, SmolLM-360M scores 11.6% on HumanEval, a popular coding benchmark that measures a model’s ability to generate solutions to coding problems. Surprisingly, my finetuned model’s performance dropped to 9.1% on HumanEval. But I didn’t let that get me down since I knew generating synthetic data is hard.

Second Attempt: Enhancing Diversity and Quality

Two issues that arise with synthetic data are quality and diversity. To improve diversity, I used a trick from the paper Scaling Synthetic Data Creation with 1,000,000,000 Personas. It showed that having LLMs role-play diverse personas when generating synthetic data improved the models trained on that data. Here are some of the fun personas from their dataset:

["A Political Analyst specialized in El Salvador's political landscape.",
 'A legal advisor who understands the legal implications of incomplete or inaccurate project documentation',
 'A maternal health advocate focused on raising awareness about postpartum complications.',
 'A school basketball team captain who believes sports and their funding should be prioritized over student council campaigns',]

I instructed Haiku to write tiny programs that would be useful for each of these personas. Unfortunately, that did not solve the quality issue in synthetic data. To overcome this, I used an idea from a great paper, Textbooks Are All You Need. It asks an LLM to rate a piece of code as low or high quality. The authors then filter their data to include only high-quality code. They then train their model on that data and it improved performance! 🤯 So, I ran my ~1,000 tiny programs through an LLM-based quality filter to see if the same would apply to my synthetic data. This resulted in 754 tiny programs. I trained my model on these and… it worked!! My new model got a pass@1 of 12.2% on the same benchmark 🤓. I have a write-up of the full code and experiments in the examples folder of the repo that accompanies this post.
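To make the persona step a bit more concrete, here is a minimal sketch that builds one generation prompt per persona. The template wording and the `persona_prompt` helper are assumptions made up for illustration, not the exact prompt used for TinyPrograms.

```python
# Hypothetical prompt template; the wording is an assumption for illustration.
PERSONA_TEMPLATE = (
    "You are writing for the following persona:\n"
    "<persona>{persona}</persona>\n\n"
    "Write a tiny Python program that this persona would find useful."
)

def persona_prompt(persona: str) -> str:
    "Build one generation prompt for a single persona."
    return PERSONA_TEMPLATE.format(persona=persona.strip())

# Two personas taken from the list above
personas = [
    "A Political Analyst specialized in El Salvador's political landscape.",
    "A maternal health advocate focused on raising awareness about postpartum complications.",
]
prompts = [persona_prompt(p) for p in personas]
print(prompts[0])
```

Each prompt would then be sent to the model once, so the number of personas sets the number of programs you get back.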

How You Can Do This Yourself

This post will explain, in detail, how to do this yourself. It will cover some key points to get right when synthesizing your own data. I also introduce fastdata. It is a small wrapper around our claudette library. It simplifies the process of synthesizing your own data.

Generating Synthetic Data

The Important Bits

Let’s talk about the key elements of synthetic data: quality and diversity. These two often conflict with each other, but both are crucial. A random string generator has high diversity but low quality. The Encyclopedia Britannica has high-quality content but its scope is limited, e.g., it doesn’t have all the amazing Star Trek fan fiction people have created.

Our challenge is to balance these factors. We must ensure high quality across many topics. We need depth in specialized fields and to cover niche content. It’s tricky, but essential for creating effective synthetic data for LLMs.

Nailing Quality and Diversity

Now, let’s get a bit technical. LLMs are essentially sophisticated word prediction machines. They analyze a sequence of words to predict what’s likely to come next. We can manipulate this prediction process to cover more of the language space.

A useful technique comes from the paper Scaling Synthetic Data Creation with 1,000,000,000 Personas. By having the LLM adopt different personas, we can generate diverse, coherent content. This approach works better than raising the temperature, which can produce more varied but nonsensical outputs.
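To see why temperature is a blunt tool, here is a small sketch of temperature-scaled softmax over next-token scores. The logits are made-up numbers for four candidate tokens, and `softmax_with_temperature` and `entropy` are helper names chosen for this example. Raising the temperature flattens the distribution, so unlikely (and often nonsensical) tokens get sampled more often.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    "Turn raw scores into probabilities; higher temperature flattens them."
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    "Shannon entropy in nats; larger means a more spread-out distribution."
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 0.5, 0.1]  # made-up scores for four candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Personas change what the model is conditioned on instead of how noisily it samples, which is why they tend to add variety without the loss of coherence.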

The WizardLM paper showed us that diversity isn’t only about breadth, but also depth. They evolved their instructions to cover different topics and complexities. This taught the LLM to handle a wide range of tasks at varying difficulty levels.

Quality is multifaceted, encompassing writing style, accuracy, and coherence. Using expert personas can help, but it doesn’t completely stop issues like hallucinations. One effective strategy is post-generation filtering. It involves creating a large, diverse dataset and then carefully selecting the highest-quality examples according to some criteria.

The Textbooks Are All You Need paper introduced an interesting idea. It used a strong LLM as a quality rater for code. This method has shown promise in studies like How to Train Data-Efficient LLMs and The FineWeb Datasets.
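Post-generation filtering can be sketched in a few lines. In the snippet below, `filter_by_quality` takes any rating function, and `toy_rater` is a deliberately silly stand-in so the example runs without an API call. In practice the rater would be an LLM prompted to label each sample; both names and the "high"/"low" labels are assumptions for illustration.

```python
from typing import Callable, Iterable

def filter_by_quality(samples: Iterable[str],
                      rate: Callable[[str], str],
                      keep_label: str = "high") -> list[str]:
    "Keep only the samples that the rater labels as `keep_label`."
    return [s for s in samples if rate(s) == keep_label]

def toy_rater(code: str) -> str:
    "Stand-in for an LLM rater: 'high' if the snippet has a comment or docstring."
    return "high" if ('"""' in code or "#" in code) else "low"

samples = [
    "def add(a, b):\n    # Return the sum of two numbers\n    return a + b",
    "def f(x): return x",
]
kept = filter_by_quality(samples, toy_rater)
print(len(kept), "of", len(samples), "kept")
```

Keeping the rater as a plain function argument makes it easy to swap the toy heuristic for a real model call later.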

A key challenge in synthetic data generation is balancing quality and diversity. But, with these techniques, you can create high-quality, diverse datasets to train LLMs.

Let’s Play with some Data

Don’t just take my word for it. Let’s use claudette, a fun little library we built at Answer.AI, to explore these ideas. claudette is a minimal wrapper for Anthropic’s Claude API that aims to make it more usable. A key feature is tool calling with Claude. We’ll a̶b̶u̶s̶e̶ use it to generate synthetic data!

Here’s how it works: We define a class with the desired attributes. Then, we pass this, along with a prompt, to Claude to fill in the blanks. This approach forces the generation to adhere to the schema defined in our Python class. Let’s see this in action below.

from fastcore.utils import *  # provides store_attr

class Translation():
    "Translation from an English phrase to a Spanish phrase"
    def __init__(self, english: str, spanish: str): store_attr()
    def __repr__(self): return f"{self.english} ➡ *{self.spanish}*"

Translation("Hello, how are you today?", "Hola, ¿cómo estás hoy?")
Hello, how are you today? ➡ *Hola, ¿cómo estás hoy?*
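If you’re curious how a plain Python class can act as a schema, here is a rough sketch that reads a class’s `__init__` signature and turns it into a JSON-schema-like tool description. This is not claudette’s actual implementation. `class_to_schema`, `JSON_TYPES`, and `PhrasePair` are hypothetical names made up for this illustration, and the `name`/`description`/`input_schema` layout only loosely mirrors the tool format in Anthropic’s API.

```python
import inspect

# Map a few Python annotations to JSON-schema type names (illustrative subset).
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def class_to_schema(cls) -> dict:
    "Derive a JSON-schema-like tool description from a class's __init__ signature."
    sig = inspect.signature(cls.__init__)
    props, required = {}, []
    for name, param in sig.parameters.items():
        if name == "self":
            continue
        # Fall back to "string" for annotations this sketch does not know about
        props[name] = {"type": JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": cls.__name__,
        "description": inspect.getdoc(cls) or "",
        "input_schema": {"type": "object", "properties": props, "required": required},
    }

class PhrasePair:
    "An English phrase and its Spanish translation"
    def __init__(self, english: str, spanish: str):
        self.english, self.spanish = english, spanish

schema = class_to_schema(PhrasePair)
print(schema)
```

Because the model has to answer by "calling" a tool with this shape, its output is constrained to the fields and types the class declares.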
Note: Make sure you have an API key for Anthropic, i.e., you need to set the ANTHROPIC_API_KEY environment variable.

from claudette import *
model = models[-1] # haiku 3
sp = "You will help generate synthetic data of English and Spanish phrases."
cli = Client(model)
def synthesize(pr): return cli.structured(pr, sp=sp, temp=1, tools=Translation)[0]

prompt = 'Create an English and Spanish translation pair.'
translations = [synthesize(prompt) for _ in range(5)]

We’ll create a little function to display our outputs clearly:

from IPython.display import Markdown
clps_fmt = '- {s}\n\n<details>\n<summary> Click to show the rest </summary>\n{ls}\n</details>'
def to_md(ss, collapsible=False):
    ls = '\n'.join(f'- {s}' for s in ss) 
    return clps_fmt.format(s=str(ss[0]), ls=ls.replace(f'- {ss[0]}', '')) if collapsible else ls
def show(ss, collapsible=False): return Markdown(to_md(ss, collapsible=collapsible))

OK let’s see how we got along…

show(translations)
  • Hello, how are you today? ➡ Hola, ¿cómo estás hoy?
  • Hello, how are you today? ➡ Hola, ¿cómo estás hoy?
  • Good morning! ➡ ¡Buenos días!
  • How are you today? ➡ ¿Cómo estás hoy?
  • How are you today? ➡ ¿Cómo estás hoy?

Or when the output is too long…

show(translations, collapsible=True)
  • Hello, how are you today? ➡ Hola, ¿cómo estás hoy?
Click to show the rest
  • Good morning! ➡ ¡Buenos días!
  • How are you today? ➡ ¿Cómo estás hoy?
  • How are you today? ➡ ¿Cómo estás hoy?
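Several of the five pairs above are exact repeats. As a rough way to spot repeats like these automatically, here is a small sketch using `difflib` from the standard library. The `near_duplicates` helper and the 0.9 threshold are illustrative choices of mine, not part of claudette, and the pairwise comparison is only practical for small batches.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(texts, threshold=0.9):
    "Return index pairs whose similarity ratio is at or above `threshold`."
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        # Compare case-insensitively so capitalization alone doesn't hide a repeat
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((i, j))
    return pairs

# English sides of the five generations shown above
outputs = [
    "Hello, how are you today?",
    "Hello, how are you today?",
    "Good morning!",
    "How are you today?",
    "How are you today?",
]
print(near_duplicates(outputs))
```

A check like this gives a quick number to watch as we change the prompt: fewer flagged pairs means more varied generations.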

Fantástico! The output demonstrates the model’s ability to generate data in the correct format. However, these translation pairs are quite simple and lack depth. To improve the output, let’s use prompt engineering. We’ll provide examples that show the quality we want.

examples = [
    Translation(
        english="Hello, my name is Nathan. I am a research scientist at an AI startup.",
        spanish="Hola, me llamo Nathan. Soy ciencia investigador en un startup de IA."),
    Translation(
        english="How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
        spanish="¿Cuánta madera podría arrojar una marmota si una marmota pudiera arrojar madera?"),
    Translation(
        english="Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See.",
        spanish="Thomas Cranmer (2 de julio de 1489 - 21 de marzo de 1556) fue un líder de la Reforma inglesa y arzobispo de Canterbury durante los reinados de Henry VIII, Edward VI y, por un corto tiempo, María I. Ayudó a construir el caso para la anulación de El matrimonio de Henry con Catalina de Aragón, que fue una de las causas de la separación de la Iglesia inglesa de la unión con la Santa Sede."
    ),
]

We’ll use a markdown list to provide them to the model:

examples_md = to_md(examples)
Markdown(examples_md)
  • Hello, my name is Nathan. I am a research scientist at an AI startup. ➡ Hola, me llamo Nathan. Soy ciencia investigador en un startup de IA.
  • How much wood could a woodchuck chuck if a woodchuck could chuck wood? ➡ ¿Cuánta madera podría arrojar una marmota si una marmota pudiera arrojar madera?
  • Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry’s marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. ➡ Thomas Cranmer (2 de julio de 1489 - 21 de marzo de 1556) fue un líder de la Reforma inglesa y arzobispo de Canterbury durante los reinados de Henry VIII, Edward VI y, por un corto tiempo, María I. Ayudó a construir el caso para la anulación de El matrimonio de Henry con Catalina de Aragón, que fue una de las causas de la separación de la Iglesia inglesa de la unión con la Santa Sede.

We’ll also need a prompt that incorporates those examples:

prompt_template = """\
Create an English and Spanish translation pair that is similar to the examples.

<examples>
{examples}
</examples>"""
prompt = prompt_template.format(examples=examples_md)
print(prompt)
Create an English and Spanish translation pair that is similar to the examples.

<examples>
- Hello, my name is Nathan. I am a research scientist at an AI startup. ➡ *Hola, me llamo Nathan. Soy ciencia investigador en un startup de IA.*
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? ➡ *¿Cuánta madera podría arrojar una marmota si una marmota pudiera arrojar madera?*
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. ➡ *Thomas Cranmer (2 de julio de 1489 - 21 de marzo de 1556) fue un líder de la Reforma inglesa y arzobispo de Canterbury durante los reinados de Henry VIII, Edward VI y, por un corto tiempo, María I. Ayudó a construir el caso para la anulación de El matrimonio de Henry con Catalina de Aragón, que fue una de las causas de la separación de la Iglesia inglesa de la unión con la Santa Sede.*
</examples>
show([synthesize(prompt) for _ in range(5)])
  • My favorite time of the year is autumn when the leaves change colors. ➡ Mi época favorita del año es el otoño cuando las hojas cambian de color.
  • Once upon a time, there was a kind-hearted farmer who lived in a small village. Every day, he would wake up early to tend to his fields and livestock. ➡ Había una vez un granjero de buen corazón que vivía en un pequeño pueblo. Cada día se levantaba temprano para atender sus campos y su ganado.
  • I love to go hiking in the mountains. The views are breathtaking. ➡ Me encanta ir de excursión en las montañas. Las vistas son impresionantes.
  • My favorite color is red. I enjoy swimming in the ocean. ➡ Mi color favorito es rojo. Me gusta nadar en el océano.
  • My favorite animal is the giraffe. They have such long necks and are so graceful. ➡ Mi animal favorito es la jirafa. Tienen cuellos tan largos y son tan elegantes.

Interesting! We’re seeing some improvement in the results, but there are still a lot of near-duplicates of the examples we gave it. This repetition shows an important aspect of prompting these models. The placement of examples in the prompt can greatly affect the quality and diversity of the outputs. Let’s try placing the examples at the beginning of our prompt instead.

prompt_template = """\
<examples>
{examples}
</examples>

Create an English and Spanish translation pair that is similar to the examples."""
prompt = prompt_template.format(examples=examples_md)
print(prompt)
<examples>
- Hello, my name is Nathan. I am a research scientist at an AI startup. ➡ *Hola, me llamo Nathan. Soy ciencia investigador en un startup de IA.*
- How much wood could a woodchuck chuck if a woodchuck could chuck wood? ➡ *¿Cuánta madera podría arrojar una marmota si una marmota pudiera arrojar madera?*
- Thomas Cranmer (2 July 1489 - 21 March 1556) was a leader of the English Reformation and Archbishop of Canterbury during the reigns of Henry VIII, Edward VI and, for a short time, Mary I. He helped build the case for the annulment of Henry's marriage to Catherine of Aragon, which was one of the causes of the separation of the English Church from union with the Holy See. ➡ *Thomas Cranmer (2 de julio de 1489 - 21 de marzo de 1556) fue un líder de la Reforma inglesa y arzobispo de Canterbury durante los reinados de Henry VIII, Edward VI y, por un corto tiempo, María I. Ayudó a construir el caso para la anulación de El matrimonio de Henry con Catalina de Aragón, que fue una de las causas de la separación de la Iglesia inglesa de la unión con la Santa Sede.*
</examples>

Create an English and Spanish translation pair that is similar to the examples.
show([synthesize(prompt) for _ in range(5)])
  • The dog ran quickly across the field to catch the ball. ➡ El perro corrió rápidamente a través del campo para atrapar la pelota.
  • The sun is shining brightly in the clear blue sky today. ➡ El sol brilla intensamente en el cielo azul despejado hoy.
  • My friend Sam is an avid hiker who enjoys exploring the beautiful nature trails in our local park. Last weekend, he went on a long hike and came back with some interesting stories. ➡ Mi amigo Sam es un excursionista entusiasta que disfruta explorando los hermosos senderos naturales de nuestro parque local. El fin de semana pasado, fue a hacer una larga caminata y volvió con algunas historias interesantes.
  • I am an engineer at a tech company. I design software for our customers. ➡ Soy un ingeniero en una empresa de tecnología. Diseño software para nuestros clientes.
  • The quick brown fox jumps over the lazy dog. ➡ El zorro marrón rápido salta sobre el perro perezoso.

Woah, what a difference! The results show more variety in content and structure than our last attempt. However, we’re still seeing some lack of diversity in the topics covered. Let’s shift our focus from quality to diversity and see what happens. We’ll use a list of topics to guide the generations. This should help broaden the range of subjects in our translations.

topics = ["otters", "penguins", "sloths", "cats", "dogs"]
prompt_template = """\
Create an English and Spanish translation pair about the following topic:
<topic>{topic}</topic>"""

For instance…

print(prompt_template.format(topic=topics[0]))
Create an English and Spanish translation pair about the following topic:
<topic>otters</topic>

…ok let’s give it a go:

show([synthesize(prompt_template.format(topic=topic)) for topic in topics])
  • Otters are fascinating aquatic mammals that are found in various parts of the world. They are known for their playful and social behavior, as well as their remarkable adaptations for swimming and hunting in the water. ➡ Las nutrias son fascinantes mamíferos acuáticos que se encuentran en varias partes del mundo. Son conocidas por su comportamiento juguetón y social, así como por sus notables adaptaciones para nadar y cazar en el agua.
  • Penguins are flightless seabirds that live in the Southern Hemisphere. ➡ Los pingüinos son aves marinas que no pueden volar y viven en el hemisferio sur.
  • The sloth is a slow-moving mammal that lives in the trees of Central and South America. ➡ El perezoso es un mamífero de movimiento lento que vive en los árboles de América Central y del Sur.
  • The curious cat watched the birds from the window. ➡ El curioso gato observó a los pájaros desde la ventana.
  • The loyal dog wagged its tail excitedly as its owner returned home. ➡ El perro leal meneó la cola con emoción cuando su dueño regresó a casa.

Okay, nice! We’re seeing increased diversity based on our list of topics. However, you can see a drop in quality for the cat and dog examples. To improve both diversity and quality, let’s combine our examples and topics techniques.

prompt_template = """\
<examples>
{examples}
</examples>

Create an English and Spanish translation pair that is similar to the examples and is about the following topic:
<topic>{topic}</topic>"""
translations = [synthesize(prompt_template.format(examples=examples_md, topic=topic))
                for topic in topics]
show(translations, collapsible=True)
  • Otters are semiaquatic mammals that belong to the weasel family. They are found in many parts of the world, including North America, Europe, and Asia. Otters are known for their playful behavior, their sleek fur coats, and their ability to swim gracefully in the water. ➡ Los nutrias son mamíferos semiacuáticos que pertenecen a la familia de los comadrejas. Se encuentran en muchas partes del mundo, incluyendo América del Norte, Europa y Asia. Las nutrias son conocidas por su comportamiento juguetón, sus sedosos abrigos de piel y su capacidad para nadar con gracia en el agua.
Click to show the rest
  • Penguins are flightless birds that live in the Southern Hemisphere. They have black and white feathers and webbed feet that help them swim in the ocean. Penguins gather in large colonies and work together to raise their young. ➡ Los pingüinos son aves no voladoras que viven en el hemisferio sur. Tienen plumas negras y blancas y patas palmeadas que les ayudan a nadar en el océano. Los pingüinos se reúnen en grandes colonias y trabajan juntos para criar a sus crías.
  • Sloths are slow-moving mammals that live in the tropical rainforests of Central and South America. They are known for their distinctive appearance, with long limbs, curved claws, and a shaggy coat that helps them blend in with the trees they inhabit. Sloths spend most of their time hanging upside down from branches, moving very slowly and deliberately to conserve energy. They are herbivores, feeding primarily on leaves, and have a specialized digestive system that allows them to extract nutrients from this low-calorie diet. Despite their seemingly lazy behavior, sloths play an important role in the ecosystem, acting as seed dispersers and contributing to the overall biodiversity of the rainforest. ➡ Los perezosos son mamíferos de movimiento lento que viven en las selvas tropicales de América Central y del Sur. Se conocen por su apariencia distintiva, con extremidades largas, garras curvas y un pelaje espeso que les ayuda a fundirse con los árboles que habitan. Los perezosos pasan la mayor parte de su tiempo colgados boca abajo de las ramas, moviéndose muy lenta y deliberadamente para conservar energía. Son herbívoros, alimentándose principalmente de hojas, y tienen un sistema digestivo especializado que les permite extraer nutrientes de esta dieta baja en calorías. A pesar de su comportamiento aparentemente perezoso, los perezosos desempeñan un papel importante en el ecosistema, actuando como dispersores de semillas y contribuyendo a la biodiversidad general de la selva tropical.
  • Cats are furry, playful companions that many people love to have as pets. They are known for their independent nature, agility, and ability to hunt small prey. Caring for a cat requires providing food, water, litter, and plenty of toys and playtime to keep them healthy and happy. ➡ Los gatos son compañeros peludos y juguetones que a muchas personas les gusta tener como mascotas. Se les conoce por su naturaleza independiente, agilidad y capacidad para cazar presas pequeñas. Cuidar a un gato requiere proporcionarle alimento, agua, arena para gatos y muchos juguetes y tiempo de juego para mantenerlos saludables y felices.
  • Dogs are beloved companions that bring joy and unconditional love to many households. They come in a variety of breeds, each with their own unique personalities and characteristics. Whether playing fetch, going for walks, or cuddling on the couch, dogs enrich our lives in countless ways. ➡ Los perros son compañeros adorados que traen alegría y amor incondicional a muchos hogares. Vienen en una variedad de razas, cada una con sus propias personalidades y características únicas. Ya sea jugando a buscar, saliendo a caminar o acurrucándose en el sofá, los perros enriquecen nuestras vidas de innumerables maneras.

Not too shabby, and I’d say way better than the previous examples. We’re seeing more detailed and varied content across our animal kingdom tour. But check out these interesting quirks:

  • We’re giving all our furry friends the encyclopedia treatment now. No more casual “oh yeah, that’s a thing” mentions.

  • The more exotic ones (otters, penguins, sloths) have detailed biology and behavior. The domestic ones have general descriptions and discuss their relationships with humans.

  • Our earlier attempts were less consistent in the info we got for each animal. Now, we see more consistency.

  • The Spanish translations are near-perfect. They match the English versions in accuracy and flow. ¡Qué genial!

It looks like our model’s doing a solid job of mashing up the examples and topic prompts. It’s walking that fine line between dropping knowledge bombs and staying on topic, whether we’re talking about a sloth hanging out in a tree or a cat being… well, a cat. Let’s see what we get by using a more powerful model.

sp = "You will help generate synthetic data of English and Spanish phrases."
cli = Client(models[1])  # sonnet 3.5
translations = [synthesize(prompt_template.format(examples=examples_md, topic=topic))
                for topic in topics]
show(translations, collapsible=True)
  • Otters are playful and intelligent semi-aquatic mammals known for their thick fur and ability to use tools. They can often be seen floating on their backs, using their bellies as a table to crack open shellfish with rocks. ➡ Las nutrias son mamíferos semiacuáticos juguetones e inteligentes conocidos por su pelaje grueso y su capacidad para usar herramientas. A menudo se les puede ver flotando boca arriba, usando sus vientres como mesa para romper mariscos con rocas.
Click to show the rest
  • Penguins are flightless seabirds that are highly adapted for life in the water. They are found almost exclusively in the Southern Hemisphere, particularly in Antarctica. Despite their inability to fly, penguins are excellent swimmers and can dive to great depths to catch fish and squid. ➡ Los pingüinos son aves marinas no voladoras que están altamente adaptadas para la vida en el agua. Se encuentran casi exclusivamente en el hemisferio sur, particularmente en la Antártida. A pesar de su incapacidad para volar, los pingüinos son excelentes nadadores y pueden sumergirse a grandes profundidades para atrapar peces y calamares.
  • Sloths are fascinating creatures known for their slow movement and unique lifestyle. These tree-dwelling mammals can be found in the tropical rainforests of Central and South America. Despite their sluggish appearance, sloths are excellent swimmers and can hold their breath underwater for up to 40 minutes. ➡ Los perezosos son criaturas fascinantes conocidas por su movimiento lento y estilo de vida único. Estos mamíferos arborícolas se pueden encontrar en las selvas tropicales de América Central y del Sur. A pesar de su apariencia lenta, los perezosos son excelentes nadadores y pueden contener la respiración bajo el agua hasta por 40 minutos.
  • Cats are fascinating creatures that have been domesticated for thousands of years. They are known for their independence, playful nature, and ability to form strong bonds with their human companions. From tiny kittens to majestic big cats, felines come in a variety of sizes, colors, and personalities. ➡ Los gatos son criaturas fascinantes que han sido domesticadas durante miles de años. Son conocidos por su independencia, naturaleza juguetona y capacidad para formar fuertes vínculos con sus compañeros humanos. Desde pequeños gatitos hasta majestuosos felinos grandes, los felinos vienen en una variedad de tamaños, colores y personalidades.
  • Dogs are known as man’s best friend. They come in various breeds, sizes, and colors, each with their own unique characteristics. From loyal companions to working dogs, these furry animals have been by our side for thousands of years. They are known for their unconditional love, playful nature, and ability to provide emotional support to their human families. ➡ Los perros son conocidos como el mejor amigo del hombre. Vienen en varias razas, tamaños y colores, cada uno con sus propias características únicas. Desde compañeros leales hasta perros de trabajo, estos animales peludos han estado a nuestro lado durante miles de años. Son conocidos por su amor incondicional, naturaleza juguetona y capacidad para proporcionar apoyo emocional a sus familias humanas.

The switch to the more powerful Claude Sonnet 3.5 model has improved things. We’re seeing more consistent, comprehensive descriptions for all animals, covering their habitat, behavior, and unique traits. For cats and dogs, the descriptions balance animal facts with their relationship to humans. The Spanish translations have also improved: they handle complex sentences and specialized vocabulary better. This experiment suggests that a stronger model produces more consistent, detailed, and accurate outputs across topics.

However, we can still encounter issues with hallucinations and overall quality.

To address these concerns, let’s use another prompt to evaluate and filter the generations. We’ll use the additive 5-point scoring system from The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, whose authors reported that it worked best among the prompt formats they tried for rating data quality.

Our approach will involve:

  1. Having the model generate a written critique of the translation.
  2. Using this critique to score the generation.

This method should help us better evaluate our synthetic data’s quality. Fortunately, the autoregressive nature of language models makes this straightforward: whatever the model writes first conditions what it writes next. So if our Python class asks for the critique before the score, the critique will inform the score that follows.

Let’s use this system to evaluate our synthetic data. First, we’ll need to create a new class to store our structured outputs. We’re using claudette to talk to Claude for us. It uses docments, which lets us describe each field of our class by simply adding a comment next to it:

class TranslationCritique(BasicRepr):
    "A critique of the translation."
    def __init__(
        self,
        critique: str, # A brief 1-line critique of the translation.
        score: int # A score of the translation from 1 to 5. 
    ): store_attr()

Now we can create our Sonnet client – this time telling it to use our critique tool:

cli = Client(models[1]) # sonnet 3.5
sp = "You will help critique synthetic data of English and Spanish phrases."
def synthesize(pr): return cli.structured(pr, sp=sp, temp=1, tools=TranslationCritique)[0]

We need a new prompt template to explain in detail to Claude how we want it to evaluate translations:

eval_prompt_template = """\
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

<translation>
{translation}
</translation>

After examining the translation:

- Briefly justify your total score in a single line.
- Conclude with the score of the translation."""

Let’s also write a little function to display the results:

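def show_critique(t, critique):
    "Render translation `t` followed by its critique and score as Markdown-style text."
    return f"""{t}
\t- **Critique**: {critique.critique}
\t- **Score**: {critique.score}"""

def get_critique(t):
    "Ask the model to evaluate translation `t`, then render the result."
    critique = synthesize(eval_prompt_template.format(translation=t))
    return show_critique(t, critique)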
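If you want to check the display format without calling the API, here is a minimal sketch. `StubCritique` is a hypothetical stand-in, not the `TranslationCritique` class defined earlier; it only assumes that a critique exposes `critique` and `score` fields, which is what `show_critique` reads. The function is repeated so the block runs on its own.

```python
from dataclasses import dataclass

@dataclass
class StubCritique:
    """Hypothetical stand-in exposing the two fields show_critique reads."""
    critique: str
    score: int

def show_critique(t, critique):
    # Same formatting as the function in the post, repeated so this runs alone.
    return f"""{t}
\t- **Critique**: {critique.critique}
\t- **Score**: {critique.score}"""

# A plain string stands in for a Translation object here.
rendered = show_critique("Hello ➡ Hola", StubCritique("Accurate and natural.", 5))
print(rendered)
```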

So, what does Claude think?…

show([get_critique(t) for t in translations], collapsible=True)
  • Otters are playful and intelligent semi-aquatic mammals known for their thick fur and ability to use tools. They can often be seen floating on their backs, using their bellies as a table to crack open shellfish with rocks. ➡ Las nutrias son mamíferos semiacuáticos juguetones e inteligentes conocidos por su pelaje grueso y su capacidad para usar herramientas. A menudo se les puede ver flotando boca arriba, usando sus vientres como mesa para romper mariscos con rocas.
    • Critique: Excellent translation capturing nuances, maintaining tone, and reading naturally in Spanish.
    • Score: 5
  • Penguins are flightless seabirds that are highly adapted for life in the water. They are found almost exclusively in the Southern Hemisphere, particularly in Antarctica. Despite their inability to fly, penguins are excellent swimmers and can dive to great depths to catch fish and squid. ➡ Los pingüinos son aves marinas no voladoras que están altamente adaptadas para la vida en el agua. Se encuentran casi exclusivamente en el hemisferio sur, particularmente en la Antártida. A pesar de su incapacidad para volar, los pingüinos son excelentes nadadores y pueden sumergirse a grandes profundidades para atrapar peces y calamares.
    • Critique: The translation is outstanding, capturing nuances and reading naturally in Spanish while maintaining accuracy.
    • Score: 5
  • Sloths are fascinating creatures known for their slow movement and unique lifestyle. These tree-dwelling mammals can be found in the tropical rainforests of Central and South America. Despite their sluggish appearance, sloths are excellent swimmers and can hold their breath underwater for up to 40 minutes. ➡ Los perezosos son criaturas fascinantes conocidas por su movimiento lento y estilo de vida único. Estos mamíferos arborícolas se pueden encontrar en las selvas tropicales de América Central y del Sur. A pesar de su apariencia lenta, los perezosos son excelentes nadadores y pueden contener la respiración bajo el agua hasta por 40 minutos.
    • Critique: Accurate, natural-sounding translation with appropriate style; minor improvement possible in “contener la respiración”
    • Score: 4
  • Cats are fascinating creatures that have been domesticated for thousands of years. They are known for their independence, playful nature, and ability to form strong bonds with their human companions. From tiny kittens to majestic big cats, felines come in a variety of sizes, colors, and personalities. ➡ Los gatos son criaturas fascinantes que han sido domesticadas durante miles de años. Son conocidos por su independencia, naturaleza juguetona y capacidad para formar fuertes vínculos con sus compañeros humanos. Desde pequeños gatitos hasta majestuosos felinos grandes, los felinos vienen en una variedad de tamaños, colores y personalidades.
    • Critique: Excellent, natural-sounding translation capturing nuances and maintaining style
    • Score: 5
  • Dogs are known as man’s best friend. They come in various breeds, sizes, and colors, each with their own unique characteristics. From loyal companions to working dogs, these furry animals have been by our side for thousands of years. They are known for their unconditional love, playful nature, and ability to provide emotional support to their human families. ➡ Los perros son conocidos como el mejor amigo del hombre. Vienen en varias razas, tamaños y colores, cada uno con sus propias características únicas. Desde compañeros leales hasta perros de trabajo, estos animales peludos han estado a nuestro lado durante miles de años. Son conocidos por su amor incondicional, naturaleza juguetona y capacidad para proporcionar apoyo emocional a sus familias humanas.
    • Critique: Highly accurate, natural-sounding translation with consistent style, capturing nuances and maintaining tone.
    • Score: 4

As we can see, the model successfully generates a critique and score for each translation. The model also seems to be quite a fan of its own translations!

Let’s see what happens when we give it a translation of very low quality.

bad_translation = Translation(
    english="The city council meeting on climate change initiatives was contentious, with passionate arguments from both sides. Ultimately, the proposal for increased funding for renewable energy projects was approved by a narrow margin.",
    spanish="El council ciudad reunion en cambio climatico era controversial, con argumentos pasionantes de ambos lados. Finalmente, la propuesta para aumentar el dinero para proyectos de energía renovables fue aprobada por un margen limitado.")

show([get_critique(bad_translation)])
  • The city council meeting on climate change initiatives was contentious, with passionate arguments from both sides. Ultimately, the proposal for increased funding for renewable energy projects was approved by a narrow margin. ➡ El council ciudad reunion en cambio climatico era controversial, con argumentos pasionantes de ambos lados. Finalmente, la propuesta para aumentar el dinero para proyectos de energía renovables fue aprobada por un margen limitado.
    • Critique: Basic meaning conveyed but with numerous errors in grammar, vocabulary, and style; not suitable for professional use.
    • Score: 2

¡Qué chévere! We’ve developed a solid method for generating high-quality data at scale. Here’s the process:

  1. Generate diverse data by combining:
    • Examples showing the desired data type and quality.
    • Topics or personas to add variety.
  2. Use another LLM for quality control, scoring each data point.
  3. Keep only the highest-scoring data.

This process lets us create high-quality data for AI training. It’s a more systematic way to feed your data-hungry models.
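The final filtering step can be sketched in a few lines. This is only an illustration under assumptions: `ScoredItem`, `keep_top`, and the `min_score` threshold are hypothetical names, and the right cutoff depends on your task and on how strict your judge is.

```python
from dataclasses import dataclass

@dataclass
class ScoredItem:
    """Hypothetical container pairing a generated data point with its judge score."""
    text: str
    score: int

def keep_top(items, min_score=5):
    """Keep only the items whose score meets the threshold."""
    return [it for it in items if it.score >= min_score]

scored = [ScoredItem("good", 5), ScoredItem("ok", 4), ScoredItem("bad", 2)]
kept = keep_top(scored, min_score=4)
print([it.text for it in kept])
```

A stricter threshold gives cleaner data but throws more of it away, so it is worth checking how many items survive before settling on a value.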

How to create high-quality synthetic data

Remember these key points:

  1. Quality and diversity are critical in synthetic data. They can have a significant impact on the performance of models trained on this data. Balancing both is essential for creating effective synthetic datasets.
  2. Quality is harder to achieve than diversity. Quality is multidimensional, especially for free-form content, which makes it tough to meet high standards in every aspect of the generated data.
  3. Synthetic data is a valuable tool for data-scarce scenarios. It is a cost-effective, quick solution when you lack enough data for your task. When generated correctly, it can significantly enhance performance on your specific task.

Synthetic data is not a one-size-fits-all solution, but it is a powerful tool in your AI development toolkit. Its effectiveness depends on careful implementation and consideration of your specific use case.

For those interested, you can find all the code used in this post in our minimal synthetic data repo: fastdata.

FastData Introduction

from fastdata.core import *

Rather than doing all this manually, let’s use fastdata, a new library we just released for generating high-quality synthetic data. The nice thing about fastdata is that it abstracts away all the boilerplate you need to get started. It also has built-in rate limiting and concurrency. This lets you scale without worry.
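To give a sense of the boilerplate involved, here is a rough sketch of concurrent generation with a simple rate limit using only the standard library. It is not fastdata's implementation: `RateLimiter`, `fake_llm_call`, and `generate_all` are hypothetical names, and `fake_llm_call` is a placeholder for a real API request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Space calls at least `min_interval` seconds apart across threads."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_time = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            sleep_for = max(0.0, self._next_time - now)
            self._next_time = max(now, self._next_time) + self.min_interval
        if sleep_for:
            time.sleep(sleep_for)

def fake_llm_call(prompt):
    # Placeholder for a real API request.
    return prompt.upper()

def generate_all(prompts, max_workers=4, calls_per_second=50):
    limiter = RateLimiter(1.0 / calls_per_second)

    def task(prompt):
        limiter.wait()
        return fake_llm_call(prompt)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(task, prompts))

results = generate_all(["otters", "sloths", "cats"])
print(results)
```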

Here is how you can reproduce what we’ve done in this blog post with fastdata. First, the translations:

class Translation(): ... # same as above
examples = [...] # same as above
topics = [...] # same as above
prompt_template = "..." # same as above
fast_data = FastData(model=models[1])
translations = fast_data.generate(
    prompt_template=prompt_template,
    inputs=[{"examples": examples_md, "topic": topic} for topic in topics],
    schema=Translation,
    sp=sp)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.36s/it]
show(translations, collapsible=True)
  • Did you know that sloths are incredibly slow-moving animals? They spend most of their lives hanging upside down in the trees of tropical rainforests. These fascinating creatures can sleep for up to 20 hours a day and move so slowly that algae can grow on their fur! ➡ ¿Sabías que los perezosos son animales que se mueven increíblemente despacio? Pasan la mayor parte de sus vidas colgados boca abajo en los árboles de las selvas tropicales. ¡Estas fascinantes criaturas pueden dormir hasta 20 horas al día y se mueven tan lentamente que las algas pueden crecer en su pelaje!
  • Otters are playful semi-aquatic mammals known for their thick fur, webbed feet, and ability to use tools. They can be found in both freshwater and marine environments, and are often seen floating on their backs while using rocks to crack open shellfish. ➡ Las nutrias son mamíferos semiacuáticos juguetones conocidos por su pelaje grueso, patas palmeadas y capacidad para usar herramientas. Se pueden encontrar tanto en ambientes de agua dulce como marinos, y a menudo se les ve flotando boca arriba mientras usan rocas para romper mariscos.
  • Cats are fascinating creatures that have been domesticated for thousands of years. They are known for their independence, playful nature, and ability to form strong bonds with humans. Many people around the world keep cats as pets, enjoying their companionship and the soothing sound of their purring. ➡ Los gatos son criaturas fascinantes que han sido domesticadas durante miles de años. Son conocidos por su independencia, naturaleza juguetona y capacidad para formar fuertes vínculos con los humanos. Muchas personas en todo el mundo tienen gatos como mascotas, disfrutando de su compañía y el sonido relajante de su ronroneo.
  • Penguins are flightless seabirds that are highly adapted for life in the water. While they are found mostly in the Southern Hemisphere, particularly Antarctica, some species can be found as far north as the Galápagos Islands. These charismatic birds are known for their distinctive black and white plumage, waddling gait on land, and exceptional swimming abilities. ➡ Los pingüinos son aves marinas no voladoras que están altamente adaptadas para la vida en el agua. Aunque se encuentran principalmente en el hemisferio sur, particularmente en la Antártida, algunas especies se pueden encontrar tan al norte como las Islas Galápagos. Estas carismáticas aves son conocidas por su distintivo plumaje blanco y negro, su andar tambaleante en tierra y sus excepcionales habilidades para nadar.
  • Dogs are known as man’s best friend. They come in various breeds, sizes, and colors, and are loved for their loyalty, companionship, and playful nature. Many people enjoy taking their dogs for walks, playing fetch in the park, or simply cuddling with them at home. Dogs have been domesticated for thousands of years and serve various roles, from family pets to working animals in fields such as law enforcement, search and rescue, and therapy. ➡ Los perros son conocidos como el mejor amigo del hombre. Vienen en varias razas, tamaños y colores, y son amados por su lealtad, compañía y naturaleza juguetona. Muchas personas disfrutan llevar a sus perros a pasear, jugar a buscar en el parque o simplemente acurrucarse con ellos en casa. Los perros han sido domesticados durante miles de años y desempeñan diversos roles, desde mascotas familiares hasta animales de trabajo en campos como la aplicación de la ley, búsqueda y rescate, y terapia.

…and here is how you can reproduce the critiques:

class TranslationCritique(): ... # same as above
eval_prompt_template = "..." # same as above
fast_data = FastData(model="claude-3-5-sonnet-20240620")
critiques = fast_data.generate(
    prompt_template=eval_prompt_template,
    inputs=[{"translation": f"{translation}"} for translation in translations],
    schema=TranslationCritique,
    sp=sp)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.82it/s]
show([show_critique(t, c) for t,c in zip(translations, critiques)], collapsible=True)
  • Did you know that sloths are incredibly slow-moving animals? They spend most of their lives hanging upside down in the trees of tropical rainforests. These fascinating creatures can sleep for up to 20 hours a day and move so slowly that algae can grow on their fur! ➡ ¿Sabías que los perezosos son animales que se mueven increíblemente despacio? Pasan la mayor parte de sus vidas colgados boca abajo en los árboles de las selvas tropicales. ¡Estas fascinantes criaturas pueden dormir hasta 20 horas al día y se mueven tan lentamente que las algas pueden crecer en su pelaje!
    • Critique: Excellent, accurate translation capturing nuances and maintaining natural flow in Spanish.
    • Score: 5
  • Otters are playful semi-aquatic mammals known for their thick fur, webbed feet, and ability to use tools. They can be found in both freshwater and marine environments, and are often seen floating on their backs while using rocks to crack open shellfish. ➡ Las nutrias son mamíferos semiacuáticos juguetones conocidos por su pelaje grueso, patas palmeadas y capacidad para usar herramientas. Se pueden encontrar tanto en ambientes de agua dulce como marinos, y a menudo se les ve flotando boca arriba mientras usan rocas para romper mariscos.
    • Critique: Excellent translation capturing nuances and reading naturally; conveys tone and content faithfully.
    • Score: 5
  • Cats are fascinating creatures that have been domesticated for thousands of years. They are known for their independence, playful nature, and ability to form strong bonds with humans. Many people around the world keep cats as pets, enjoying their companionship and the soothing sound of their purring. ➡ Los gatos son criaturas fascinantes que han sido domesticadas durante miles de años. Son conocidos por su independencia, naturaleza juguetona y capacidad para formar fuertes vínculos con los humanos. Muchas personas en todo el mundo tienen gatos como mascotas, disfrutando de su compañía y el sonido relajante de su ronroneo.
    • Critique: Excellent translation; accurately conveys content, maintains tone, and reads naturally in Spanish.
    • Score: 5
  • Penguins are flightless seabirds that are highly adapted for life in the water. While they are found mostly in the Southern Hemisphere, particularly Antarctica, some species can be found as far north as the Galápagos Islands. These charismatic birds are known for their distinctive black and white plumage, waddling gait on land, and exceptional swimming abilities. ➡ Los pingüinos son aves marinas no voladoras que están altamente adaptadas para la vida en el agua. Aunque se encuentran principalmente en el hemisferio sur, particularmente en la Antártida, algunas especies se pueden encontrar tan al norte como las Islas Galápagos. Estas carismáticas aves son conocidas por su distintivo plumaje blanco y negro, su andar tambaleante en tierra y sus excepcionales habilidades para nadar.
    • Critique: Excellent translation: accurate, natural, and captures nuances; professionally suitable with consistent style and terminology.
    • Score: 5
  • Dogs are known as man’s best friend. They come in various breeds, sizes, and colors, and are loved for their loyalty, companionship, and playful nature. Many people enjoy taking their dogs for walks, playing fetch in the park, or simply cuddling with them at home. Dogs have been domesticated for thousands of years and serve various roles, from family pets to working animals in fields such as law enforcement, search and rescue, and therapy. ➡ Los perros son conocidos como el mejor amigo del hombre. Vienen en varias razas, tamaños y colores, y son amados por su lealtad, compañía y naturaleza juguetona. Muchas personas disfrutan llevar a sus perros a pasear, jugar a buscar en el parque o simplemente acurrucarse con ellos en casa. Los perros han sido domesticados durante miles de años y desempeñan diversos roles, desde mascotas familiares hasta animales de trabajo en campos como la aplicación de la ley, búsqueda y rescate, y terapia.
    • Critique: Excellent translation capturing nuances, maintaining tone, and reading naturally in Spanish.
    • Score: 5

We will continue to improve fastdata in the future. Our roadmap includes handling retries and supporting inputs and outputs in other modalities. We are excited to see how you use fastdata in your next project 🤓!
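Until retry handling is built in, one option is to wrap your own generation function. The sketch below is an assumption-laden illustration, not part of fastdata: `with_retries` and `flaky_generate` are hypothetical, and a real version should catch only the specific errors your API client raises for transient failures.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.01):
    """Wrap `fn` so failed calls are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Give up and re-raise once the last attempt has failed.
                if attempt == max_attempts:
                    raise
                # Delay doubles on each retry: base, 2*base, 4*base, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapped

calls = {"n": 0}

def flaky_generate(prompt):
    # Hypothetical generator that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"generated: {prompt}"

result = with_retries(flaky_generate)("otters")
print(result, calls["n"])
```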

Further Exploration and Resources

For those interested in diving deeper, here are some recommended readings: