Have you experimented with synthetic data to boost your model performance?

If so, in what context?

I’m wondering how synthetic data would be used for various deep learning tasks, for example:

  1. Segmentation
  2. Detection
  3. Classification
  4. Sentiment analysis
  5. etc

How have you done it?

Would love to hear from @lu.riera @kbaheti @sGx_tweets @chris and anyone else who may have some insight @trust_level_1


Is this… coincidence?

I got this mailer, today.


Ha! Coincidence? I think not.

I’ll check this out, and leave my summary and takeaways here.

Great topic @harpreet.sahota!
From my experience, using synthetic data in computer vision tasks is becoming increasingly popular. Although you need to be very careful with it, keep a reasonable proportion of real-synthetic data and mainly use it to enrich your data where you don’t have enough samples.
For example, I have some experience with using synthetic data for pose estimation and semantic segmentation where the dataset didn’t have full coverage of all of the angles that a person could look at or pose.
Using a high-quality synthetic dataset in a careful proportion for specific use cases can really help improve your model’s accuracy metrics.
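That proportioning can be sketched in a few lines. This is a minimal, hypothetical example assuming in-memory lists of labeled samples; the 20% cap and the sample names are illustrative, not from any particular library:

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction=0.2, seed=0):
    """Combine real and synthetic samples, capping the synthetic share
    of the final dataset at `synthetic_fraction`."""
    rng = random.Random(seed)
    # How many synthetic samples keep their share at the target fraction.
    n_synth = int(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic))
    mixed = list(real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed

real = [("img_real_%d" % i, "label") for i in range(80)]
synthetic = [("img_synth_%d" % i, "label") for i in range(100)]
mixed = mix_datasets(real, synthetic, synthetic_fraction=0.2)
```

With 80 real samples and a 0.2 cap, this keeps 20 synthetic samples, so synthetic data enriches rather than dominates the training set.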


To the best of my knowledge, synthetic data is very popular in monocular/stereo depth estimation and optical flow estimation tasks.

There are some great synthetic datasets for image recognition tasks, btw:


Actually, I’ve used synthetic data mostly for code generation and understanding, since that has been a focus at work. But as with most synthetic data, it needs a seed to be generated upon, and your seed can be of two types.

One is where I provide basic (and possibly low-variance) data, take it apart by some metric into atomic blocks, then reconstruct them into out-of-sample instances by varying the combination (or permutation, if order matters) and the cardinality of atomic blocks per datum. Mind that your block may not be the data itself; it may be a conjugate element of a function that transforms your seen data into unseen data (those are association atomicities). The point is, you still need that base to extrapolate from.
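As a toy sketch of this first method, assuming the “data” are just token sequences and the atomic blocks are their tokens (all names here are invented for illustration):

```python
import itertools

def atomic_blocks(samples, splitter):
    """Break seed samples into a deduplicated pool of atomic blocks."""
    blocks = set()
    for s in samples:
        blocks.update(splitter(s))
    return sorted(blocks)

def recombine(blocks, size, ordered=False):
    """Reconstruct out-of-sample instances by varying combination
    (or permutation, if order matters) of atomic blocks."""
    combiner = itertools.permutations if ordered else itertools.combinations
    return [list(c) for c in combiner(blocks, size)]

seen = ["a b", "b c"]                       # the natural seed data
blocks = atomic_blocks(seen, str.split)     # -> ['a', 'b', 'c']
new = recombine(blocks, 2, ordered=True)    # 6 ordered pairs, incl. unseen ones
```

The point of the sketch: pairs like `['c', 'a']` never occur in the seed, yet they are generated from the seed's own atoms, which is exactly the extrapolation-from-a-base idea.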

Two is where you don’t provide data, but directly provide the atomic elements as a comprehensive set and generate all extrapolations of it that abide by field laws. This usually requires very deep domain expertise to make the starting set comprehensive.

Now there is a drawback to both. In the first, a major base archetype may be left out of the dataset (I’ve given an example of this at the end), and with that base missing, a whole tree of extrapolations goes missing. Or maybe you left out a major sub-transform, after which a sub-tree would be missing that could have been highly likely, or less likely but a one-time product-changer. The good thing is that because you have a natural seed database, you can estimate the weights of in-sample and out-of-sample combinations. In the second, if you’re thorough, you may not leave out full trees from an archetype, but your hypothesized weights on combinations (or on the function that generates combinations) rest very much on future data, which exposes you to unsavoury first impressions.

The best thing would be a fair combination of both. Following that, keep the weight-assignment and update module clean and decoupled across iterations, either for the data directly (if that is enough), or for the transform that maps out the abstract part of the dataset not yet received.

I would not focus on a single task from the list in this light, but I’d say this applies well to predictive and generative systems; and if you have a constraint-based optimization function to apply as an overlay, it would also serve a prescriptive system well.

Now, what did I mean by an archetype?

Consider a model where I am trying to generate every code solution for a given algorithm. I have some base data, some of it passing unit tests, some of it not (we can worry about that later). If we followed the first approach, I would take this seed data of naturally obtained code sets, break it down into control-flow and data-flow pieces, perturb the atomic code operations and function calls, and produce new permutations of these. Yes, permutations, because in code, order matters!

If I were to use method two, I would start from understanding the algorithm, building from an exhaustive set of loops, assignments, conditionals, and function calls until I get closer and closer to a viable solution (this again follows a distance check based on static and dynamic analysis, which I won’t get into here; just saying, it’s not blind combination). But then I won’t have the likelihood for an incoming submission that was synthetically foreseen.
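A toy sketch of this enumerate-and-check idea, assuming the only “field law” is that the candidate body must run and pass the unit tests; the statement pool and the factorial target are invented for illustration:

```python
import itertools

# Hypothetical statement pool; we search for a 2-line body that computes
# factorial(n) into `r` (a toy "exhaustive set of loops, assignments, ...").
pool = [
    "r = 1",
    "r = 0",
    "for i in range(1, n + 1): r = r * i",
    "for i in range(n): r = r + i",
]

def passes(body, cases):
    """Assemble a candidate function and check it against unit tests."""
    src = "def f(n):\n" + "\n".join("    " + line for line in body) + "\n    return r"
    env = {}
    try:
        exec(src, env)
        return all(env["f"](n) == want for n, want in cases)
    except Exception:   # a crash (e.g. NameError) acts as a dynamic veto
        return False

cases = [(0, 1), (1, 1), (4, 24)]
solutions = [b for b in itertools.permutations(pool, 2) if passes(b, cases)]
```

Of the 12 ordered candidates, exactly one survives: initialise `r = 1`, then multiply through the loop. Order matters here too: the same two lines reversed crash on the first probe.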

Now, an archetype is a base solution. For example, the factorial function can be encoded multiple ways; two of them are recursion and looping. These serve as archetypes: minimal working code solving the problem in two different ways. Once my machine figures this set of base archetypes out, the rest of the generation becomes cleaner and smoother, and this time we’re ready for the first impression as well. That is, there may be cases we don’t have enough info on, but we know OF those cases and can give the closest feedback possible without being too specific to get caught, while still being helpful.
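The archetype idea can be sketched by fingerprinting behaviour, assuming a simple probe-set signature is enough to group different encodings (the factorial pair is the example from above):

```python
def fact_recursive(n):
    """Factorial encoded through recursion: one archetype."""
    return 1 if n <= 1 else n * fact_recursive(n - 1)

def fact_loop(n):
    """Factorial encoded through looping: a second archetype."""
    r = 1
    for i in range(2, n + 1):
        r *= i
    return r

def signature(f, probes=range(6)):
    """Behavioural fingerprint: the function's outputs over a probe set."""
    return tuple(f(n) for n in probes)

# Both encodings share a signature, so they collapse into one archetype
# family even though their syntax differs completely.
same_archetype = signature(fact_recursive) == signature(fact_loop)
```

Grouping by behavioural signature rather than syntax is what lets the generator know it already covers a base solution, even when new variants look nothing like each other.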


Thank you so much for this detailed explanation, @kbaheti!

If I could summarize to make sure I got the gist of this:

Synthetic data used for code generation and understanding can be generated using one of two methods.

  1. Providing basic data and breaking it down into atomic blocks that can be reconstructed into new combinations.

  2. Providing a comprehensive set of atomic elements to generate all possible extrapolations that abide by field laws.

Both methods have drawbacks, and the best approach may be to use a combination of the two.

I’ve got two questions here…

  1. What do you mean by field laws? Is it just some specific set of rules or constraints that apply to the data being generated?

  2. Can you provide an example of how the first method of generating synthetic data would be applied in practice? Like a tangible example of this, trying to wrap my head around it.

If I could use an analogy to try to understand it:

Let’s consider the process of building a house from a set of pre-cut building blocks.

In this analogy, the basic data provided would be like the pre-cut building blocks, and the process of breaking it down into atomic blocks and reconstructing it into new combinations would be like using those building blocks to construct a new house by varying the arrangement and combination of the blocks. Just as a house can be built in many different ways using the same set of building blocks, synthetic data can be generated in many different ways using the same basic data.

Am I getting the gist of it?

Thanks again


For the summary you gave: yes, that’s the idea! And your building-blocks-for-the-house analogy is spot on! In fact, the approach used today to auto-design architectural pieces is exactly as you’ve described, though instead of blocks you have spaces, and you must check that the house is stable! :slight_smile:

And for the questions:


Field laws are, first and foremost, the axioms of a field and the consequences (theorems) that have been established by proof. It then extends to conjectures as well.

Take, for instance, scientists working in molecular compound design who are given certain characteristics that the compound must abide by. A very common goal is a blocking mechanism that a compound must initiate over the human proteome to stop a viral RNA from transcribing. They use geometric DL networks to generate novel compounds, but either during or after training, some synthesized compounds are weeded out, usually owing to the interaction of four types of interatomic forces: ionic, covalent, hydrogen, and van der Waals. These are field forces that result in either a stable or an unstable exchange of charge, and they determine the existence and efficiency of compounds. While your network is synthesizing new compounds, a parallel constrained check over the energy-interaction system is simulated per data instance. Based on energy-level satisfiability, it vetoes certain compounds entirely, gives a probability of occurrence to others (owing to a phenomenon in chemistry called resonance structures), and finally assesses the binding affinity, in the same simulated way, in the environment of the human proteome, predicting the affinity to block target sites.

Now there is a chance that other compounds in the vicinity affect this blocking, and if the system is too complex the models might not handle all the variability, so your blocking capability comes with a confidence (or lack thereof). But anyway, this is what I meant by field laws.
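The veto-then-weight pattern described here can be sketched generically, with toy stand-ins for the chemistry (parity as the hard field law, magnitude as the soft score; none of this models real energy checks):

```python
def constrained_generate(candidates, hard_check, soft_score, threshold=0.5):
    """Veto candidates that break a hard field law, then weight the
    survivors by a soft score and keep those above a threshold."""
    survivors = []
    for c in candidates:
        if not hard_check(c):      # hard veto, e.g. an energy-level check
            continue
        p = soft_score(c)          # soft weight, e.g. occurrence probability
        if p >= threshold:
            survivors.append((c, p))
    return sorted(survivors, key=lambda pair: -pair[1])

# Toy run: integers as "compounds", evenness as the hard law,
# magnitude/10 as the soft score.
kept = constrained_generate(range(10), lambda c: c % 2 == 0, lambda c: c / 10)
```

The separation of hard veto from soft weighting mirrors the text: some candidates are removed outright, the rest merely carry a probability.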


This would be the routine if I were generating data from, say, a few code samples I have. Suppose I have a set of functions, all attempting bubble sort. One uses a nested loop, one a single loop, one a loop with a comprehension inside, and one just a comprehension and nothing else.

Synthetic generation will take the key features necessary for the code to work that are not further meaningfully separable, but are attachable to each other. For instance, separate out the conditional statements from their conditions, the loop stopping criteria, the assignments, and so on, and then start multiplexing them across files. A conditional statement from one file might cross over to another and become a while-loop statement with its stopping condition taken from yet another file. Assignments done within a loop may now be placed outside the loop. Indents and nesting might be varied across files. These are experimental variants of code. Then, based on what kind of population you want, you can add constraints: the basic one is that variants should be compilable and runnable; beyond that, perhaps how many tests they pass.
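A minimal sketch of that multiplexing, assuming the atomic pieces have already been extracted from seed files as strings, and using compilability as the basic population constraint (all fragments here are invented bubble-sort pieces):

```python
import itertools

# Hypothetical atomic pieces pulled from several seed bubble-sort files.
setups = ["    xs = list(xs)", "    n = len(xs)\n    xs = list(xs)"]
loops = [
    "    for i in range(len(xs)):\n        for j in range(len(xs) - 1):",
    "    for _ in xs:\n        for j in range(len(xs) - 1):",
]
swap = ("            if xs[j] > xs[j + 1]:\n"
        "                xs[j], xs[j + 1] = xs[j + 1], xs[j]")

def multiplex():
    """Cross setup and loop blocks across files; keep variants that compile."""
    for s, l in itertools.product(setups, loops):
        src = "\n".join(["def bubble(xs):", s, l, swap, "    return xs"])
        try:
            compile(src, "<variant>", "exec")   # basic population constraint
        except SyntaxError:
            continue
        yield src

variants = list(multiplex())
```

Each surviving variant is a new file-level datum; stricter constraints (unit-test pass counts, as the text suggests) would simply be further filters after the compile check.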

When you happen to stumble upon the right code in this process, you can then measure other code samples, through syntax and through intermediate data traces, as a distance from the correct code.
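One simple way to realise that syntax distance, assuming token-stream similarity is an acceptable proxy (the snippets are illustrative):

```python
import difflib
import io
import tokenize

def tokens(src):
    """Lexical token strings of a code snippet, ignoring pure whitespace."""
    return [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.string.strip()]

def distance(a, b):
    """0.0 for identical token streams, approaching 1.0 as they diverge."""
    return 1 - difflib.SequenceMatcher(None, tokens(a), tokens(b)).ratio()

correct = "for i in range(n):\n    s = s + i\n"
near = "for j in range(n):\n    s = s + j\n"      # same shape, renamed variable
far = "while n > 0:\n    n = n - 1\n"             # different structure entirely
```

A renamed-variable variant scores much closer to the reference than a structurally different loop, which is the ordering you want when ranking variants against a known-correct solution.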

And finally, for how many times you use a block within a file, and in how many files you repeat certain atomic code blocks: take that as the ratio in which they were present in your original natural seed set. That ratio determines the multiplicity of a code block in a single variant file, and also how many variant files of that sort should exist in the set.
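The seed-ratio idea can be sketched as a frequency table over the natural seed set, assuming each seed file has been reduced to its block names (the names are made up):

```python
from collections import Counter

def block_weights(seed_files, split):
    """Relative frequency of each atomic block across the natural seed set."""
    counts = Counter()
    for f in seed_files:
        counts.update(split(f))
    total = sum(counts.values())
    return {block: c / total for block, c in counts.items()}

# Hypothetical seed files reduced to their block names.
seed = ["loop swap", "loop swap return", "loop return"]
weights = block_weights(seed, str.split)
```

Sampling blocks for synthetic variants at these rates keeps the generated population's block multiplicities in line with the natural seed set, as the text prescribes.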

Also observe how genetic and evolutionary algorithms are wonderful for novel yet learnable synthesis of data. What I’ve described above is very much in tune with that.

And, yes, that was a real seed set I described, hand-curated and used to generate further data. This was the seed set…
