Part 5: Multi-Modal Synthetic Data Generation with GPT-4, DALL·E & Prompt Chaining
In previous parts of this series, we generated synthetic tabular data and noisy NLP training sets, and benchmarked them against real-world metrics. But now we are stepping into the deep end:
What if you need data that combines multiple modalities – text, images, and structured fields – all linked together?
This is where multi-modal synthetic data comes in.
Imagine you are building:
A product search engine with text descriptions, specs (specifications), and images
A medical assistant that reads radiology reports and correlates with X-rays
An e-commerce recommender system using user profiles + session text + product images
You need linked, coherent synthetic data across formats, and until recently that was nearly impossible.
Now? With GPT-4, DALL·E 3, and a smart prompting strategy, it is not only possible, it is automatable.
What We're Building
Let’s build a multi-modal dataset that includes:
A product name and description (text)
A set of structured attributes (tabular data)
A product image (DALL·E)
We shall walk through:
Generating structured + text data via GPT-4
Using that output as context to generate an image prompt for DALL·E
Creating coherent multi-modal pairs ready for downstream ML tasks
Step 1: Generate Structured + Text Data with GPT-4
Start with a prompt to GPT-4 to generate both structured attributes and natural language:
import openai
import json

openai.api_key = "your-api-key"

prompt = """
Generate 10 synthetic e-commerce product entries as a single JSON array.
Each entry should include:
- 'product_name'
- 'description' (in natural language)
- 'category' (e.g., Electronics, Fashion, Home)
- 'price' (in USD)
- 'color'
- 'features' (list of 3-5 bullet points)

Example format:
{
  "product_name": "...",
  "description": "...",
  "category": "...",
  "price": ...,
  "color": "...",
  "features": ["...", "..."]
}

Return only the JSON array, with no surrounding text.
"""

response = openai.ChatCompletion.create(  # legacy (openai<1.0) chat completion API
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

# Parse the reply; this assumes the model returned clean JSON with no extra prose.
products = json.loads(response['choices'][0]['message']['content'])
Each product now has both tabular fields and a free-form description. Example:
{
  "product_name": "Aurora Smart Mug",
  "description": "A self-heating ceramic mug that keeps your coffee warm at the perfect temperature.",
  "category": "Electronics",
  "price": 89.99,
  "color": "White",
  "features": [
    "Auto temperature control",
    "USB-C rechargeable",
    "Leak-proof lid",
    "2-hour battery life"
  ]
}
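If you want to work with the structured side as a proper table before chaining into image generation, here is a minimal sketch using pandas (listed in the tooling stack below); flattening the feature list into one column is an assumption about what your downstream tools expect:

import pandas as pd

# Flatten the generated entries into a DataFrame: structured fields become
# columns, while the free-form description rides along as the text modality.
df = pd.DataFrame(products)

# Join the feature list into one delimited string so every column stays scalar
# (assumption: downstream filtering/search tools expect flat columns).
df["features"] = df["features"].apply(lambda feats: "; ".join(feats))

print(df[["product_name", "category", "price", "color"]].head())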
Step 2: Use DALL·E to Generate the Product Image
Now we turn that structured+text data into a DALL·E 3 prompt.
product = products[0]
image_prompt = f"""
Product photo of a {product['color']} {product['product_name']} in the {product['category']} category.
Features: {', '.join(product['features'])}.
Show the product on a plain white background, front-facing, high resolution.
"""
# Send to DALL·E API (or use ChatGPT with image generation support)
# For example:
# openai.Image.create(prompt=image_prompt, model="dall-e-3", n=1, size="1024x1024")
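If you are on the newer openai Python client (v1+), the equivalent call looks roughly like the following. This is a sketch, assuming an OPENAI_API_KEY environment variable and DALL·E 3 access on your account:

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt=image_prompt,
    size="1024x1024",
    n=1,
)
image_url = result.data[0].url  # download and store next to the product record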
Now you’ve generated:
A product description (text input for NLP)
A set of features + price (structured data)
A realistic image (vision input)
You can repeat this for hundreds of examples with a looped pipeline.
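Here is what that loop might look like. The generate_image helper is hypothetical, standing in for whichever image call you use above, and the record layout (a shared product_id plus a specs dict) is just one reasonable choice:

import json
import uuid

records = []
for product in products:
    image_prompt = (
        f"Product photo of a {product['color']} {product['product_name']} "
        f"in the {product['category']} category. "
        f"Features: {', '.join(product['features'])}. "
        "Plain white background, front-facing, high resolution."
    )
    # image_path = generate_image(image_prompt)  # hypothetical helper wrapping the DALL·E call
    records.append({
        "product_id": str(uuid.uuid4()),  # shared key linking all three modalities
        "specs": {k: product[k] for k in ("category", "price", "color", "features")},
        "description": product["description"],
        "image_prompt": image_prompt,
        # "image_path": image_path,
    })

# One JSON record per line keeps the linked modalities together on disk.
with open("synthetic_products.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")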
Step 3: Use the Data in Multi-Modal ML Tasks
Once you have created multi-modal pairs, you can train/test models for tasks like:
Product Embedding Learning
Use contrastive learning to align text and image embeddings (a pair-construction sketch follows this list):
Positive: (description, image) pair
Negative: mismatched descriptions and images
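A minimal pair-construction sketch, assuming each record from the pipeline above also carries an image_path pointing to its downloaded DALL·E image:

import random

def make_contrastive_pairs(records, n_negatives=1):
    """Build (description, image_path, label) triples: 1 = matched, 0 = mismatched."""
    pairs = []
    for rec in records:
        pairs.append((rec["description"], rec["image_path"], 1))  # positive pair
        for _ in range(n_negatives):
            other = random.choice([r for r in records if r is not rec])
            pairs.append((rec["description"], other["image_path"], 0))  # negative pair
    return pairs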
Cross-Modal Retrieval
Train models to (a CLIP retrieval sketch follows this list):
Retrieve a product image from a description
Match a description to structured specs
Suggest similar items based on features + image + text
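For the first of these tasks, a quick baseline is an off-the-shelf CLIP model from HuggingFace Transformers. This sketch assumes the generated images have already been downloaded to the image_path locations used above:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = [r["description"] for r in records]
images = [Image.open(r["image_path"]) for r in records]

inputs = processor(text=descriptions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text[i, j] scores description i against image j;
# the argmax of each row is the retrieved image for that description.
retrieved_idx = outputs.logits_per_text.argmax(dim=-1)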
Multi-Modal Generative Modeling
Generate missing modalities from partial input (a specs-from-description sketch follows this list):
Predict product image from specs
Generate specs from description
Summarize specs into human-readable text
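As one example of the prompt-chaining pattern, here is a sketch of the specs-from-description direction, using the same legacy chat API as in Step 1:

spec_prompt = f"""
Given this product description, return a JSON object with the fields
'category', 'price' (in USD), 'color', and 'features' (3-5 bullet points).
Return only the JSON.

Description: {product['description']}
"""

spec_response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": spec_prompt}],
    temperature=0.3,  # lower temperature keeps the structured output more consistent
)
inferred_specs = json.loads(spec_response['choices'][0]['message']['content'])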
You’re no longer constrained by limited datasets — now you can create the exact type of data your model needs.
Bonus: Add Imperfections to Simulate Reality
Just like in Part 3, you can add noise:
Typos in text descriptions
Inconsistent prices or units ("USD 89" vs. "$89.00")
Visual clutter in DALL·E prompts (e.g., "with packaging", "on a wood table")
This helps models become robust against real-world messiness.
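A lightweight sketch of that kind of noise injection, applied to a copy of the records from the Step 2 pipeline (the rates and formats here are arbitrary choices):

import random

def add_typos(text, rate=0.02):
    """Swap adjacent characters at a small random rate to mimic typing errors."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def noisy_price(price):
    """Render the same price in one of several inconsistent formats."""
    return random.choice([f"${price:.2f}", f"USD {int(price)}", f"{price} dollars"])

noisy_records = []
for rec in records:
    noisy = dict(rec)
    noisy["description"] = add_typos(rec["description"])
    noisy["specs"] = {**rec["specs"], "price": noisy_price(rec["specs"]["price"])}
    noisy_records.append(noisy)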
Tooling Stack
| Tool | Use Case |
|---|---|
| GPT-4 / GPT-4-turbo | Structured + text data generation |
| DALL·E 3 | Product or medical image synthesis |
| Python (Pandas/JSON) | Structuring + joining modalities |
| HuggingFace Transformers | Multi-modal modeling and retrieval |
| CLIP, BLIP, Flamingo | Embedding + training frameworks |
Final Thoughts
Multi-modal datasets used to be the privilege of big tech. Not anymore. With GPT-4 and DALL·E, you can generate your own, customized to your modeling needs.
And because you control the prompts, you control the:
Data distributions
Class balance
Label accuracy
Edge cases
With this power, you can build synthetic datasets that simulate your ideal training conditions — across text, vision, and structured formats.
Coming Up Next
In Part 6, we shall explore how to productionize synthetic data pipelines:
Versioning + reproducibility
Integrating with MLOps
Scheduling periodic synthetic dataset refreshes
Because generating is one thing; deploying at scale is another.
A Note on the "Specs" Mentioned in This Post
When we refer to "specs" (short for specifications), we are talking about the structured attributes of a product: the kind of data you might see in a technical sheet, product listing, or catalog.
Example:
For a product like a smart mug, the specs would include:
| Field | Value |
|---|---|
| Product Name | Aurora Smart Mug |
| Color | White |
| Price | $89.99 |
| Category | Electronics |
| Features | Auto temperature control; USB-C rechargeable; Leak-proof lid; 2-hour battery life |
In context:
The description is natural language (e.g., “A self-heating ceramic mug…”)
The image is generated via DALL·E
The specs are structured, machine-readable fields used for filtering, search, and modeling.
Why “specs” matter in multi-modal data:
They anchor the structured modality in your dataset.
They often power filters in real-world apps (like Amazon filters or comparison charts).
In machine learning, they’re useful for:
Feature-based recommendations
Structured-to-text generation (e.g., auto-generating product descriptions)
Text+specs fusion for classification or ranking
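As a tiny illustration of that last use, one common trick is to serialize the specs into text and append them to the description before feeding a standard text classifier or ranker. A sketch, assuming the record layout from the Step 2 pipeline:

def fuse_text_and_specs(record):
    """Serialize structured specs into text and append them to the description."""
    specs = record["specs"]
    spec_text = (
        f"category: {specs['category']}; color: {specs['color']}; "
        f"price: {specs['price']}; features: {', '.join(specs['features'])}"
    )
    return f"{record['description']} [SPECS] {spec_text}"

fused_inputs = [fuse_text_and_specs(r) for r in records]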
So when we say “generate specs from description” or “predict image from specs,” we’re referring to using that structured information as one of the modalities in a multi-modal ML system.