Part 5: Multi-Modal Synthetic Data Generation with GPT-4, DALL·E & Prompt Chaining

This post is Part 5 of the Generative AI in Data Science series. It tackles multi-modal synthetic data generation, combining text, images, and structured data using GPT-4, DALL·E, and prompt chaining, and is written for data science professionals who care about building real-world, reproducible pipelines.

In previous parts of this series, we generated synthetic tabular data and noisy NLP training sets, and benchmarked them against real-world metrics. Now we are stepping into the deep end:

What if you need data that combines multiple modalities – text, images, and structured fields – all linked together?

This is where multi-modal synthetic data comes in.

Imagine you are building:

  • A product search engine with text descriptions, specs (specifications), and images

  • A medical assistant that reads radiology reports and correlates with X-rays

  • An e-commerce recommender system using user profiles + session text + product images

You need linked, coherent synthetic data across formats, and until recently, that was nearly impossible.

Now? With GPT-4, DALL·E 3, and a smart prompting strategy, it is not only possible; it is automatable.

 

What We Are Building

Let’s build a multi-modal dataset that includes:

  • A product name and description (text)

  • A set of structured attributes (tabular data)

  • A product image (DALL·E)

We shall walk through:

  1. Generating structured + text data via GPT-4

  2. Using that output as context to generate an image prompt for DALL·E

  3. Creating coherent multi-modal pairs ready for downstream ML tasks

 

Step 1: Generate Structured + Text Data with GPT-4

Start with a prompt to GPT-4 to generate both structured attributes and natural language:

import openai
import json

openai.api_key = "your-api-key"

prompt = """
Return a JSON array of 10 synthetic e-commerce product entries, with no text outside the array.
Each entry should include:
- 'product_name'
- 'description' (in natural language)
- 'category' (e.g., Electronics, Fashion, Home)
- 'price' (in USD)
- 'color'
- 'features' (list of 3-5 bullet points)

Example format:
{
  "product_name": "...",
  "description": "...",
  "category": "...",
  "price": ...,
  "color": "...",
  "features": ["...", "..."]
}
"""

# Note: this uses the pre-1.0 openai Python SDK interface (openai.ChatCompletion).
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7
)

# Parse the model's JSON output into a list of product dicts
products = json.loads(response['choices'][0]['message']['content'])

Each product now has both tabular fields and a free-form description. Example:

{
  "product_name": "Aurora Smart Mug",
  "description": "A self-heating ceramic mug that keeps your coffee warm at the perfect temperature.",
  "category": "Electronics",
  "price": 89.99,
  "color": "White",
  "features": [
    "Auto temperature control",
    "USB-C rechargeable",
    "Leak-proof lid",
    "2-hour battery life"
  ]
}
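
If you want the structured fields as a table for filtering, joins, or later feature engineering, here is a quick sketch with pandas, assuming the products list parsed in the previous snippet:

import pandas as pd

# Flatten the JSON entries into a DataFrame; 'features' stays a list-valued column
df = pd.json_normalize(products)
print(df[["product_name", "category", "price", "color"]].head())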

 

Step 2: Use DALL·E to Generate the Product Image

Now we turn that structured+text data into a DALL·E 3 prompt.

product = products[0]

image_prompt = f"""
Product photo of a {product['color']} {product['product_name']} in the {product['category']} category.
Features: {', '.join(product['features'])}.
Show the product on a plain white background, front-facing, high resolution.
"""

# Send image_prompt to the DALL·E API (or to ChatGPT with image generation support).
# One way to call the API is sketched below.
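
Here is a minimal sketch of that call using the 1.x openai SDK's images API (the Step 1 snippet uses the pre-1.0 chat interface, so adjust to whichever SDK version you have installed). The helper name generate_and_save_image and the output path are illustrative, not part of any library:

import requests
from openai import OpenAI  # 1.x openai SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_and_save_image(prompt: str, out_path: str) -> str:
    """Generate one DALL·E 3 image for `prompt` and save it to `out_path`."""
    result = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        n=1,                 # DALL·E 3 returns one image per request
        size="1024x1024",
    )
    url = result.data[0].url
    with open(out_path, "wb") as f:
        f.write(requests.get(url).content)
    return out_path

image_path = generate_and_save_image(image_prompt, "product_0.png")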

Now you’ve generated:

  • A product description (text input for NLP)

  • A set of features + price (structured data)

  • A realistic image (vision input)

You can repeat this for hundreds of examples with a looped pipeline.
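
A minimal sketch of that loop, chaining GPT-4 output into image prompts and writing a small manifest that links all three modalities. It reuses the generate_and_save_image helper sketched above; the file names and manifest format are illustrative, and retries and rate limiting are omitted:

import json
import os

os.makedirs("images", exist_ok=True)
records = []

for i, product in enumerate(products):
    image_prompt = (
        f"Product photo of a {product['color']} {product['product_name']} "
        f"in the {product['category']} category. "
        f"Features: {', '.join(product['features'])}. "
        "Plain white background, front-facing, high resolution."
    )
    image_path = generate_and_save_image(image_prompt, f"images/product_{i}.png")

    # One record per product, linking text, structured fields, and the image file
    records.append({**product, "image_prompt": image_prompt, "image_path": image_path})

# A JSONL manifest keeps the multi-modal pairs together for downstream ML tasks
with open("products_multimodal.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")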

 

Step 3: Use the Data in Multi-Modal ML Tasks

Once you have created multi-modal pairs, you can train/test models for tasks like:

Product Embedding Learning

Use contrastive learning to align text and image embeddings, as sketched after the list below:

  • Positive: (description, image) pair

  • Negative: mismatched descriptions and images
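
A minimal sketch of this CLIP-style contrastive setup with HuggingFace Transformers, assuming the products list and the images/ files from the loop above. In practice you would wrap this in a training loop and fine-tune; this just computes one symmetric contrastive loss over a batch:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = [p["description"] for p in products]
images = [Image.open(f"images/product_{i}.png") for i in range(len(products))]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)

# logits_per_image is an (n_images, n_texts) similarity matrix: the diagonal
# holds the matched (positive) pairs, every off-diagonal entry is a negative.
labels = torch.arange(len(images))
loss = (
    torch.nn.functional.cross_entropy(outputs.logits_per_image, labels)
    + torch.nn.functional.cross_entropy(outputs.logits_per_text, labels)
) / 2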

 

Cross-Modal Retrieval

Train models to:

  • Retrieve a product image from a description (sketched after this list)

  • Match a description to structured specs

  • Suggest similar items based on features + image + text
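
For the first of these, here is a short sketch of description-to-image retrieval, reusing the CLIP model, processor, images, and products from the previous sketch; the query string is made up:

# Reuses model, processor, images, and products defined in the contrastive sketch
query = "a self-heating mug that keeps coffee warm"

text_inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)
image_inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the query and every product image
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)

top = scores.argsort(descending=True)[:3]
print("Top matches:", [products[int(i)]["product_name"] for i in top])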

 

Multi-Modal Generative Modeling

Generate missing modalities from partial input:

  • Predict product image from specs

  • Generate specs from description (sketched after this list)

  • Summarize specs into human-readable text
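
As one example, here is a hedged sketch of the specs-from-description direction, using the same pre-1.0 chat call as Step 1; the prompt wording and the helper name specs_from_description are illustrative:

# Reuses the openai and json imports (and API key) from Step 1
def specs_from_description(description: str) -> dict:
    """Ask GPT-4 to infer structured attributes from a free-form description."""
    prompt = f"""
Extract structured product attributes from this description. Return only a JSON
object with the keys 'category', 'color', and 'features' (a list of short phrases).

Description: {description}
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return json.loads(response['choices'][0]['message']['content'])

specs = specs_from_description(products[0]["description"])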

You’re no longer constrained by limited datasets — now you can create the exact type of data your model needs.

 

Bonus: Add Imperfections to Simulate Reality

Just like in Part 3, you can add noise:

  • Typos in text descriptions

  • Inconsistent prices or units ("USD 89" vs "$89.00")

  • Visual clutter in DALL·E prompts (e.g., “with packaging”, “on a wood table”)

This helps models become robust against real-world messiness.
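
A small sketch of text-level noise injection along these lines; the typo rate and price formats are illustrative and should be tuned to your domain:

import random

def add_typos(text: str, rate: float = 0.02) -> str:
    """Randomly drop or swap adjacent characters to simulate messy text."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if random.random() < rate:
            if random.random() < 0.5:
                del chars[i]                                      # drop a character
            else:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]   # swap a pair
        i += 1
    return "".join(chars)

def noisy_price(price: float) -> str:
    """Render the price in one of several inconsistent real-world formats."""
    return random.choice([f"${price:.2f}", f"USD {int(price)}", f"{price} dollars"])

for product in products:
    product["description"] = add_typos(product["description"])
    product["price_raw"] = noisy_price(product["price"])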

 

Tooling Stack

Tool | Use Case
GPT-4 / GPT-4-turbo | Structured + text data generation
DALL·E 3 | Product or medical image synthesis
Python (Pandas/JSON) | Structuring + joining modalities
HuggingFace Transformers | Multi-modal modeling and retrieval
CLIP, BLIP, Flamingo | Embedding + training frameworks

Final Thoughts

Multi-modal datasets used to be the privilege of big tech. Not anymore. With GPT-4 and DALL·E, you can generate your own, customized to your modeling needs.

And because you control the prompts, you control:

  • Data distributions

  • Class balance

  • Label accuracy

  • Edge cases

With this power, you can build synthetic datasets that simulate your ideal training conditions — across text, vision, and structured formats.

 

Coming Up Next

In Part 6, we shall explore how to productionize synthetic data pipelines:

  • Versioning + reproducibility

  • Integrating with MLOps

  • Scheduling periodic synthetic dataset refreshes

Because generating data is one thing; deploying it at scale is another.

A Note on the “Specs” Mentioned in This Post

When we refer to “specs” (short for specifications), we are talking about the structured attributes of a product: the kind of data you might see in a technical sheet, product listing, or catalog.

Example:

For a product like a smart mug, the specs would include:

Field | Value
Product Name | Aurora Smart Mug
Color | White
Price | $89.99
Category | Electronics
Features | Auto temperature control; USB-C rechargeable; Leak-proof lid; 2-hour battery life

In context:

  • The description is natural language (e.g., “A self-heating ceramic mug…”)

  • The image is generated via DALL·E

  • The specs are structured, machine-readable fields used for filtering, search, and modeling.

Why “specs” matter in multi-modal data:

  • They anchor the structured modality in your dataset.

  • They often power filters in real-world apps (like Amazon filters or comparison charts).

  • In machine learning, they’re useful for:

    • Feature-based recommendations

    • Structured-to-text generation (e.g., auto-generating product descriptions)

    • Text+specs fusion for classification or ranking

So when we say “generate specs from description” or “predict image from specs,” we’re referring to using that structured information as one of the modalities in a multi-modal ML system.