This is Part 3 of the Generative AI in Data Science blog series. This post takes things up a level: automating synthetic data generation across domains, injecting real-world messiness, and generating labeled text data for NLP tasks.
Part 3: Automating GPT-Based Synthetic Data Generation for Real-World Modeling
In Part 1 and Part 2, we explored how generative AI tools like GPT-4 can create high-quality synthetic tabular data, and how that data can power better model training and testing. But so far, we have treated this process as mostly manual.
In the real world, though, you do not want to write a prompt every time you need a dataset.
You want automation.
You want domain flexibility.
You want imperfect, noisy data because real-world data is never clean.
This post is about building automated pipelines for synthetic data generation using GPT, complete with label noise, corrupted formats, and synthetic text-label pairs for NLP.
Let’s get into it.
Step 1: Automate Prompt Templates by Domain
To generalize across domains, we'll write prompt templates that adapt based on parameters.
Let’s define a reusable prompt builder:
def build_prompt(domain, num_rows=100):
    if domain == "healthcare":
        return f"""
Generate {num_rows} rows of synthetic patient data in JSON.
Each row should include:
- age (20-90)
- gender (male/female)
- systolic_bp (90-180)
- cholesterol_level (normal/high/critical)
- has_diabetes (0 or 1)
Rules:
- If age > 60 and cholesterol_level == 'critical', has_diabetes is likely 1.
- If systolic_bp < 110 and cholesterol_level == 'normal', has_diabetes is likely 0.
"""
    elif domain == "ecommerce":
        return f"""
Generate {num_rows} rows of synthetic ecommerce user sessions in JSON.
Each row should include:
- user_id
- session_duration (seconds)
- device_type (mobile/desktop/tablet)
- pages_viewed
- made_purchase (0 or 1)
Rules:
- If session_duration > 600 and pages_viewed > 5, made_purchase is likely 1.
- Mobile users are less likely to purchase than desktop users.
"""
    else:
        raise ValueError("Unsupported domain")
You can now generate data programmatically:
prompt = build_prompt(domain="healthcare", num_rows=200)
Feed this into GPT-4 and collect structured JSON rows, just like in Part 2.
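For example, here is a minimal sketch of that call, reusing the legacy `openai.ChatCompletion` style shown later in this post (adapt it to whichever client version you're on):

```python
import json

import openai
import pandas as pd

# Send the prompt built above to GPT-4; a lower temperature keeps
# structured output more consistent (legacy openai<1.0 call shown).
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,
)

# The prompt asks for JSON, so parse the reply straight into a DataFrame
rows = json.loads(response["choices"][0]["message"]["content"])
df = pd.DataFrame(rows)
print(df.head())
```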
Step 2: Add Realistic Noise and Imperfections
Models trained on clean synthetic data often fail in production because real data is messy. So let’s inject some chaos.
Here is a function that injects:
Missing values
Random typos (swapped adjacent characters)
The same pattern extends naturally to invalid formats as well.
import random

def corrupt_dataframe(df, missing_prob=0.05, typo_prob=0.03):
    for col in df.columns:
        # Inject missing values
        df.loc[df.sample(frac=missing_prob).index, col] = None

        # Inject typos in string fields
        if df[col].dtype == "object":
            for i in df.sample(frac=typo_prob).index:
                val = df.at[i, col]
                if isinstance(val, str) and len(val) > 2:
                    pos = random.randint(0, len(val) - 2)
                    df.at[i, col] = val[:pos] + val[pos+1] + val[pos] + val[pos+2:]
    return df
You can now simulate things like:
OCR noise in scanned records
Entry errors in healthcare forms
Mistyped categories (“mobil” instead of “mobile”)
This improves robustness for downstream models.
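As a quick illustration, here is the function applied to a tiny hand-made stand-in for the ecommerce rows from Step 1 (a sketch with exaggerated corruption rates so the effect is visible on a few rows):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for GPT-generated session rows
sessions = pd.DataFrame({
    "device_type": ["mobile", "desktop", "tablet", "mobile", "desktop"],
    "session_duration": [720, 150, 430, 610, 95],
    "made_purchase": [1, 0, 0, 1, 0],
})

# Exaggerated probabilities so corruption shows up in a 5-row sample
messy = corrupt_dataframe(sessions, missing_prob=0.2, typo_prob=0.4)
print(messy)
```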
Step 3: Generate Synthetic Text-Labeled Pairs for NLP Tasks
Let’s say you’re building a classifier for customer support tickets with labels like:
Billing
Technical Issue
Shipping
Cancel Request
You can use GPT-4 to generate synthetic training examples like this:
import json

import openai

def generate_text_classification_data(num_samples=100):
    system_prompt = "You are a dataset generator for customer support classification."
    user_prompt = f"""
Generate {num_samples} labeled examples for customer support ticket classification.
Each item should be a JSON object with:
- 'text': realistic user message
- 'label': one of [Billing, Technical Issue, Shipping, Cancel Request]
Include natural language variation, informal tone, and occasional typos.
"""

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.8
    )

    return json.loads(response["choices"][0]["message"]["content"])
You'll get outputs like:
{
  "text": "Hey, my internet's been dropping since last night. What's going on?",
  "label": "Technical Issue"
},
{
  "text": "Need to cancel my order ASAP – wrong size!",
  "label": "Cancel Request"
}
Add typos, slang, or out-of-scope queries to simulate real-world diversity.
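A rough sketch of that post-processing step (the noise helper below is illustrative, not part of any library) could look like this:

```python
import random

def add_text_noise(examples, typo_prob=0.15):
    """Swap two adjacent characters in a fraction of messages to mimic typos."""
    noisy = []
    for ex in examples:
        text = ex["text"]
        if random.random() < typo_prob and len(text) > 3:
            pos = random.randint(0, len(text) - 2)
            text = text[:pos] + text[pos + 1] + text[pos] + text[pos + 2:]
        noisy.append({"text": text, "label": ex["label"]})
    return noisy

# Generate the labeled examples from Step 3, then roughen them up
data = generate_text_classification_data(num_samples=200)
data = add_text_noise(data)
```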
Step 4: Train + Stress-Test Models Using This Data
Train a text classifier (e.g., logistic regression, BERT) using your synthetic data. Then test it using:
A separate synthetic test set with noise
Real data (if available)
Out-of-distribution (OOD) queries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = [x["text"] for x in data]
y = [x["label"] for x in data]

# Hold out a test split so we measure generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))
You’ll see if the synthetic data is expressive and general enough—or if it overfits to template phrasing.
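To probe the out-of-distribution bullet above, you can also run a few hand-written queries that belong to none of the training labels and inspect the model's confidence (a sketch; the queries are made up):

```python
# Queries outside the four training labels; a classifier trained only on
# synthetic tickets will still force them into one of its known classes.
ood_queries = [
    "Do you sell gift cards for other stores?",
    "What's the weather like in your warehouse city?",
    "do u have student discounts",
]

ood_vec = vectorizer.transform(ood_queries)
for query, pred, probs in zip(ood_queries, model.predict(ood_vec), model.predict_proba(ood_vec)):
    print(f"{pred:>15} ({probs.max():.2f})  {query}")
```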
Pro Tips
Vary your GPT temperature: Lower for structured data, higher for informal text.
Use few-shot prompting: Provide GPT with 3–5 examples to match your domain style (see the sketch after this list).
Post-process outputs: Always validate distributions, label balance, and logic integrity.
Track versioning: Synthetic data isn’t free—track prompt versions and regeneration logic in your pipeline.
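For the few-shot tip, here is a hypothetical snippet you could prepend to the user prompt from Step 3 (the seed tickets are made up):

```python
# Hand-written seed examples so GPT mirrors your domain's tone and label set
few_shot_block = """
Match the style of these examples:
{"text": "I was charged twice for my subscription this month, please fix", "label": "Billing"}
{"text": "tracking says delivered but theres no package??", "label": "Shipping"}
{"text": "Please cancel my order, I bought the wrong size", "label": "Cancel Request"}
"""

user_prompt = few_shot_block + "\nNow generate 100 more examples in the same JSON format."
```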
Real-World Applications
| Domain | Use Case | Tool |
|---|---|---|
| Healthcare | Simulate patient records for rare disease models | GPT-4, Faker |
| Retail | Generate product reviews by sentiment class | GPT + labeling |
| NLP Research | Create balanced intent detection datasets | GPT-4 |
| EdTech | Generate Q&A pairs for training language tutors | GPT-4-turbo |
Final Thoughts
Synthetic data generation using GPT is not just useful; it's strategic. When automated, domain-aware, and enriched with noise, it becomes a reliable foundation for:
Rapid model prototyping
Data scarcity mitigation
Bias correction
Production simulation
We’re entering a world where your data isn’t limited by what you have, but by what you can design.
Coming Up Next
In Part 4, we'll explore benchmarking synthetic data: How do you quantify the quality of synthetic data? How close is "close enough"? And how do models trained on synthetic data compare against real-world baselines?