Hacker News Clone

Ask HN: Is synthetic data generation practical outside academia?

by cpard on 6/6/2025, 11:55:15 PM with 13 comments

I keep seeing synthetic data pipelines powering the latest LLM “breakthroughs”:

• TinyZero’s $30 fine-tuning workflow
• Sky-T1’s $450 reasoning-model build
• Meta AI’s Llama 3 herd (2024 paper detailing their synthetic-data training)
• Berkeley OpenThoughts (“Data Recipes for Reasoning Models”), published yesterday

There are also open-source toolkits you can experiment with:

https://github.com/meta-llama/synthetic-data-kit
https://github.com/bespokelabsai/curator

But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.

I’m curious:

1. Who is using synthetic-data pipelines in production today?

2. What tasks does it actually improve, e.g. fine-tuning smaller models for specific tasks?

Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!

by Jugurtha on 6/9/2025, 8:53:20 AM
When I was in EE at university, I worked on heart anomaly detection and multi-phase flow classification for oil & gas. The papers I was reading used synthetic data with a nice dusting of noise sprinkled on it. Meanwhile, I worked on data from hospitals, acquired on restless, sweaty, hairy dudes with rusty, banged-up electrodes and abused probes.
Needless to say, the data I saw in these papers looked nothing like the data I worked with, whether from hospitals or what I saw at Schlumberger in the Sahara.
The real world tends to be ... interesting.
by sargstuff on 6/7/2025, 1:45:08 AM
Non-AI specific 'synthetic data generation':
1) Historically used for processes that rely on time series, simulation and modeling, or forecasting (e.g. weather forecasting); see the related points in [0].
2) a) Testing with actual 'sensitive' data may not be possible for security reasons (e.g. payroll information, data that could influence stock market prices) [1]. b) Insufficient or incomplete information: synthetic data helps check how well what's known matches 'reality' and may suggest where to look for 'missing' pieces in the model.
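To illustrate point 2a, here is a minimal sketch of generating fake payroll records for testing without touching real data. Everything in it is an assumption made for illustration: the `PayrollRecord` fields, the department names, and the salary range are hypothetical and not taken from [1].

```python
import random
from dataclasses import dataclass


@dataclass
class PayrollRecord:
    # Hypothetical schema, chosen only for illustration
    employee_id: str
    department: str
    monthly_salary: int


def generate_payroll(n: int, seed: int = 0) -> list[PayrollRecord]:
    """Return n synthetic payroll records; the same seed gives the same records."""
    rng = random.Random(seed)  # local RNG so tests are reproducible
    departments = ["engineering", "sales", "support"]  # assumed values
    records = []
    for i in range(n):
        records.append(
            PayrollRecord(
                employee_id=f"EMP{i:05d}",
                department=rng.choice(departments),
                # Salaries from 3000 to 12000 in steps of 100 (assumed range)
                monthly_salary=rng.randrange(3000, 12001, 100),
            )
        )
    return records
```

Uniform random values like these are enough to exercise code paths, but they will not reproduce the correlations or odd edge cases found in real payroll data.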
-----
[0] : https://www.oreilly.com/library/view/practical-time-series/9...
[1] : https://www.k2view.com/what-is-synthetic-data-generation/
by publicdaniel on 6/7/2025, 12:40:46 PM
I’m currently working on a document parsing engine for a specific type of document. The inputs are usually PDFs. I’m able to get great structured output from both the latest Gemini Flash models and the latest Llama Scout models. The best latency I get with Gemini is about 5 seconds end to end. With Llama hosted on Groq it’s about 3 seconds.
My use case is latency-constrained, so I’m exploring fine-tuning / distilling to see if I can get latency below one second. I imagine these are the kinds of scenarios where it’s still worth it to fine-tune and distill.
My plan is to generate a lot of synthetic training data using more capable but slower foundation models and use that to train the smaller model.
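A minimal sketch of the data-generation half of that plan, under stated assumptions: `teacher` is a hypothetical callable standing in for whatever large-model API is used, `fake_teacher` is a stub so the example runs offline, and the JSON validity check is just one plausible filter. The actual training of the smaller model is not shown.

```python
import json
from typing import Callable, Iterable


def build_distillation_dataset(
    documents: Iterable[str],
    teacher: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> list[dict]:
    """Label each document with the teacher; keep only outputs that pass validation."""
    examples = []
    for text in documents:
        output = teacher(text)
        if is_valid(output):
            examples.append({"input": text, "target": output})
    return examples


def is_json_object(output: str) -> bool:
    """Accept only outputs that parse as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples one per line, a common format for fine-tuning data."""
    return "\n".join(json.dumps(e) for e in examples)


def fake_teacher(text: str) -> str:
    # Stub for a slow, capable model (hypothetical); fails on blank input
    if not text.strip():
        return "not json"
    return json.dumps({"title": text.split("\n")[0], "length": len(text)})


if __name__ == "__main__":
    docs = ["Invoice 42\nTotal: 10", "  "]
    print(to_jsonl(build_distillation_dataset(docs, fake_teacher, is_json_object)))
```

Filtering teacher outputs before training matters because the smaller model will learn the teacher's mistakes along with its correct answers.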
by publicdaniel on 6/7/2025, 12:36:47 PM
It’s really useful for generating synthetic data for search and recommendations that you can use to train a smaller / faster model. This is especially useful if you don’t have lots of click-through data or are dealing with cold-start scenarios. There are some good articles that cover this; if you’re interested, I’ll try to find them and share.