Part 3: Automating GPT-Based Synthetic Data Generation for Real-World Modeling

This is Part 3 of the Generative AI in Data Science blog series. This post takes things to the next level: automating synthetic data generation across domains, injecting real-world messiness, and generating labeled text data for NLP tasks.

 


In Part 1 and Part 2, we explored how generative AI tools like GPT-4 can create high-quality synthetic tabular data, and how that data can power better model training and testing. But so far, we have treated this process as mostly manual.

In the real world, though, you do not want to write a prompt every time you need a dataset.

You want automation.
You want domain flexibility.
You want imperfect, noisy data because real-world data is never clean.

This post is about building automated pipelines for synthetic data generation using GPT, complete with label noise, corrupted formats, and synthetic text-label pairs for NLP.

Let’s get into it.

 

Step 1: Automate Prompt Templates by Domain

To generalize across domains, we can write prompt templates that adapt based on parameters.

Let’s define a reusable prompt builder:

def build_prompt(domain, num_rows=100):
    """Return a domain-specific GPT prompt for synthetic data generation."""
    if domain == "healthcare":
        return f"""
Generate {num_rows} rows of synthetic patient data in JSON.
Each row should include:
- age (20-90)
- gender (male/female)
- systolic_bp (90-180)
- cholesterol_level (normal/high/critical)
- has_diabetes (0 or 1)

Rules:
- If age > 60 and cholesterol_level == 'critical', has_diabetes is likely 1.
- If systolic_bp < 110 and cholesterol_level == 'normal', has_diabetes is likely 0.
"""
    
    elif domain == "ecommerce":
        return f"""
Generate {num_rows} rows of synthetic ecommerce user sessions in JSON.
Each row should include:
- user_id
- session_duration (seconds)
- device_type (mobile/desktop/tablet)
- pages_viewed
- made_purchase (0 or 1)

Rules:
- If session_duration > 600 and pages_viewed > 5, made_purchase is likely 1.
- Mobile users are less likely to purchase than desktop users.
"""
    
    else:
        raise ValueError(f"Unsupported domain: {domain!r}")

You can now generate data programmatically:

prompt = build_prompt(domain="healthcare", num_rows=200)

Feed this into GPT-4 and collect structured JSON rows, just like in Part 2.
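
For reference, here is a minimal sketch of that call. It assumes the pre-1.0 openai Python SDK (the same ChatCompletion interface used later in this post), that the model returns a bare JSON array, and a hypothetical helper name generate_tabular_data:

import json

import openai
import pandas as pd

def generate_tabular_data(domain, num_rows=100):
    # Build the domain-specific prompt from the builder above
    prompt = build_prompt(domain=domain, num_rows=num_rows)

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3  # low temperature keeps structured output consistent
    )

    # Parse the model's reply; add schema validation before production use
    rows = json.loads(response["choices"][0]["message"]["content"])
    return pd.DataFrame(rows)

df = generate_tabular_data("healthcare", num_rows=200)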

 

Step 2: Add Realistic Noise and Imperfections

Models trained on clean synthetic data often fail in production because real data is messy. So let’s inject some chaos.

Here is a function to inject:

  • Missing values

  • Random typos

  • Invalid formats (covered in a separate sketch after the function below)

import random

def corrupt_dataframe(df, missing_prob=0.05, typo_prob=0.03):
    """Return a copy of df with injected missing values and typos."""
    df = df.copy()  # avoid mutating the caller's frame
    for col in df.columns:
        # Inject missing values into a random sample of rows
        df.loc[df.sample(frac=missing_prob).index, col] = None

        # Inject typos into string fields
        if df[col].dtype == "object":
            for i in df.sample(frac=typo_prob).index:
                val = df.at[i, col]
                if isinstance(val, str) and len(val) > 2:
                    # Swap two adjacent characters at a random position
                    pos = random.randint(0, len(val) - 2)
                    df.at[i, col] = val[:pos] + val[pos+1] + val[pos] + val[pos+2:]

    return df
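
The function above covers the first two kinds of corruption. For invalid formats, here is one possible sketch: it assumes numeric columns and replaces a small fraction of values with out-of-range or string-typed entries.

import random

def inject_invalid_formats(df, invalid_prob=0.02):
    """Replace a small fraction of numeric values with invalid entries."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        # Cast to object so the column can hold mixed-type "bad" values
        values = df[col].astype(object)
        for i in df.sample(frac=invalid_prob).index:
            values.at[i] = random.choice(["N/A", "-1", "999999", "error"])
        df[col] = values
    return df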

You can now simulate things like:

  • OCR noise in scanned records

  • Entry errors in healthcare forms

  • Mistyped categories (“mobil” instead of “mobile”)

This improves robustness for downstream models.
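
As a quick usage sketch (df here being the synthetic frame from Step 1):

# Corrupt ~5% of values to missing and ~3% of string values with typos
noisy_df = corrupt_dataframe(df, missing_prob=0.05, typo_prob=0.03)
print(noisy_df.isna().mean())  # per-column missing-value rates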

 

Step 3: Generate Synthetic Text-Labeled Pairs for NLP Tasks

Let’s say you’re building a classifier for customer support tickets with labels like:

  • Billing

  • Technical Issue

  • Shipping

  • Cancel Request

You can use GPT-4 to generate synthetic training examples like this:

import json

import openai  # pre-1.0 openai SDK, matching the ChatCompletion interface below

def generate_text_classification_data(num_samples=100):
    system_prompt = "You are a dataset generator for customer support classification."
    user_prompt = f"""
Generate {num_samples} labeled examples for customer support ticket classification.
Each item should be a JSON object with:
- 'text': realistic user message
- 'label': one of [Billing, Technical Issue, Shipping, Cancel Request]

Include natural language variation, informal tone, and occasional typos.
"""

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.8  # higher temperature encourages varied, informal phrasing
    )

    # Assumes the model returns a JSON array; validate before trusting it
    return json.loads(response["choices"][0]["message"]["content"])
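
Calling it is then a one-liner (hypothetical sample size shown):

data = generate_text_classification_data(num_samples=100)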

You’ll get outputs like:

[
  {
    "text": "Hey, my internet's been dropping since last night. What's going on?",
    "label": "Technical Issue"
  },
  {
    "text": "Need to cancel my order ASAP – wrong size!",
    "label": "Cancel Request"
  }
]

Add typos, slang, or out-of-scope queries to simulate real-world diversity.
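
One illustrative way to do that, with hand-written (purely hypothetical) examples:

# A few noisy/slang variants to mix into the training set
data.extend([
    {"text": "yo my bill looks way too high this month??", "label": "Billing"},
    {"text": "pakage never showed up, been 2 weeks smh", "label": "Shipping"},
])

# Out-of-scope queries, kept aside to see how the model behaves on them
ood_queries = ["what's the weather like today?", "do you have a jobs page?"]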

 

Step 4: Train + Stress-Test Models Using This Data

Train a text classifier (e.g., logistic regression, BERT) using your synthetic data. Then test it using:

  • A separate synthetic test set with noise

  • Real data (if available)

  • Out-of-distribution (OOD) queries

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 'data' is the list of {"text", "label"} dicts generated in Step 3
X = [x["text"] for x in data]
y = [x["label"] for x in data]

# Hold out a test split so we don't score the model on its own training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))

You’ll see if the synthetic data is expressive and general enough—or if it overfits to template phrasing.
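
To push past the training distribution, score the model on held-out out-of-scope inputs (a sketch reusing ood_queries from Step 3):

# A confidently predicted label on an out-of-scope query is a red flag
ood_vec = vectorizer.transform(ood_queries)
for text, probs in zip(ood_queries, model.predict_proba(ood_vec)):
    print(f"{text!r} -> max confidence {probs.max():.2f}")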

 

Pro Tips

  • Vary your GPT temperature: Lower for structured data, higher for informal text.

  • Use few-shot prompting: Provide GPT with 3–5 examples to match your domain style.

  • Post-process outputs: Always validate distributions, label balance, and logic integrity (see the sketch after this list).

  • Track versioning: Synthetic data isn’t free—track prompt versions and regeneration logic in your pipeline.
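
For the post-processing tip, a minimal validation sketch over the Step 3 data:

from collections import Counter

ALLOWED_LABELS = {"Billing", "Technical Issue", "Shipping", "Cancel Request"}

# Drop rows with unexpected labels, then eyeball label balance
clean = [x for x in data if x["label"] in ALLOWED_LABELS]
print(f"dropped {len(data) - len(clean)} rows with unexpected labels")
print(Counter(x["label"] for x in clean))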

 

Real-World Applications

Domain        | Use Case                                          | Tool
--------------|---------------------------------------------------|---------------
Healthcare    | Simulate patient records for rare disease models  | GPT-4, Faker
Retail        | Generate product reviews by sentiment class       | GPT + labeling
NLP Research  | Create balanced intent detection datasets         | GPT-4
EdTech        | Generate Q&A pairs for training language tutors   | GPT-4-turbo

 

Final Thoughts

Synthetic data generation using GPT is not just useful; it’s strategic. When automated, domain-aware, and enriched with noise, it becomes a reliable foundation for:

  • Rapid model prototyping

  • Data scarcity mitigation

  • Bias correction

  • Production simulation

We’re entering a world where your data isn’t limited by what you have, but by what you can design.

 

Coming Up Next

In Part 4, we’ll explore benchmarking synthetic data: How do you quantify the quality of synthetic data? How close is “close enough”? And how do models trained on synthetic data compare against real-world baselines?
