Actually, I’ve mostly used synthetic data for code generation and understanding, since that has been a focus at work. But like most synthetic data, it needs a seed to be generated from, and that seed can be of two types.

**One**, where I provide basic (and possibly low-variance) data, take it apart by some metric into atomic blocks, then reconstruct those blocks into out-of-sample instances by varying the combination (or permutation, if order matters) and the cardinality of atomic blocks per datum. Note that your block may not be the data itself; it may be a conjugate element of a function that transforms your seen data into unseen data (those are association atomicities). The point is, you still need that base to extrapolate from.
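The first approach can be sketched in a few lines. This is a minimal illustration, not my actual pipeline: the seed data, the atomic blocks, and the recombination limits are all made up for the example.

```python
from itertools import permutations

# Hypothetical seed data: each datum is a sequence of atomic blocks.
seed = [["load", "filter", "save"], ["load", "aggregate"]]

# Decompose: collect the atomic blocks observed across the seed.
atoms = sorted({block for datum in seed for block in datum})

# Reconstruct: vary permutation (order matters here) and cardinality.
def extrapolate(atoms, max_len):
    out = []
    for k in range(1, max_len + 1):
        out.extend(list(p) for p in permutations(atoms, k))
    return out

candidates = extrapolate(atoms, max_len=3)

# Out-of-sample instances are the candidates not already in the seed.
unseen = [c for c in candidates if c not in seed]
```

The in-sample/out-of-sample split at the end is what later lets you estimate weights against the natural seed distribution.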

**Two**, where you don’t provide data but instead directly provide the atomic elements as a comprehensive set, and generate every extrapolation of it that abides by the field’s laws. This usually requires very deep domain expertise to have a comprehensive start.
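The second approach looks similar mechanically, but the starting point differs: no seed data, just an expert-provided atom set plus constraints ("field laws") that prune invalid extrapolations. The atoms and the law below are illustrative assumptions.

```python
from itertools import permutations

# Hypothetical: atoms provided directly by a domain expert, no seed data.
ATOMS = ["init", "read", "transform", "write"]

# "Field laws": domain constraints a valid sequence must satisfy.
def obeys_laws(seq):
    # Law 1: a write must be preceded by a read.
    if "write" in seq and ("read" not in seq or seq.index("read") > seq.index("write")):
        return False
    # Law 2: if init appears, it must come first.
    return seq[0] == "init" if "init" in seq else True

# Generate every law-abiding extrapolation over the atom set.
valid = [list(p) for k in range(1, len(ATOMS) + 1)
         for p in permutations(ATOMS, k) if obeys_laws(list(p))]
```

Note that nothing here tells you how *likely* each valid sequence is; that is exactly the weakness called out below.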

Now there is a drawback in both. In the first, a major base archetype may be left out of the dataset **(I’ve given an example of this at the end)**, and with that base missing, a whole tree of extrapolations goes missing. Or maybe you left out a major sub-transform, after which a sub-tree goes missing that could have been highly likely, or less likely but a one-time product-changer. The good thing is that because you have a seed of natural data, you can estimate the weights of in-sample and out-of-sample combinations. In the second, if you’re thorough, you may not leave out full trees under an archetype, but your hypothesized weights on combinations (or on the function that generates combinations) rest very much on future data, which exposes you to unsavoury first impressions.

The best approach would be a fair combination of both. Following that, the weight assignment and update module should be kept clean and decoupled, iterating either on the data directly (if that is enough) or on the transform that maps out the abstract part of the dataset not yet received.
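A decoupled weight module like the one described could be as simple as a smoothed counter that starts from a prior over all synthetic combinations and updates as real observations arrive. The class name, prior, and combination keys below are all hypothetical.

```python
# Hypothetical decoupled weight module: a Laplace-style prior keeps
# unseen (purely synthetic) combinations at nonzero mass, and counts
# from incoming natural data update the weights each iteration.
class WeightModule:
    def __init__(self, combos, prior=1.0):
        self.counts = {c: prior for c in combos}

    def update(self, observed):
        # One iteration: fold a batch of real observations into the counts.
        for c in observed:
            if c in self.counts:
                self.counts[c] += 1

    def weights(self):
        total = sum(self.counts.values())
        return {c: v / total for c, v in self.counts.items()}

wm = WeightModule(["a|b", "b|a", "a|c"])
wm.update(["a|b", "a|b", "b|a"])   # natural data arrives
w = wm.weights()
```

Because the module only sees combination keys, it stays agnostic to whether it is weighting raw data or the outputs of a transform.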

I would not single out one task from the given list in this light, but I would say this applies well to predictive and generative systems; and if you overlay a constraint-based optimization function on such a system, it would also serve well for a prescriptive one.

*Now, what did I mean by an archetype?*

Consider a model where I am trying to generate all code solutions for a given algorithm. I have some base data, some of it passing unit tests, some of it not (we can worry about that later). Now if we followed the first approach, I would take this seed of naturally obtained code, break it down into control-flow and data-flow pieces, perturb the atomic code operations and function calls, and produce new permutations of these. Yes, permutations, because in code, order matters!
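A toy version of this: treat each statement as an atomic block, permute the blocks, and keep only the orderings a unit test accepts. The seed statements and the "sum then double" spec are invented for the example; a real system would decompose at the control/data-flow level, not whole statements.

```python
from itertools import permutations

# Hypothetical seed solution, decomposed into atomic statement blocks.
seed_blocks = ["total = 0", "for x in xs: total += x", "total *= 2"]

def passes_test(stmts):
    # The unit test acts as the oracle for a "sum the list, then double" spec.
    src = "\n".join(stmts)
    env = {"xs": [1, 2, 3]}
    try:
        exec(src, env)  # only safe because these are our own generated snippets
    except Exception:
        return False
    return env.get("total") == 12

# Order matters: most permutations fail (undefined names, wrong results).
survivors = [list(p) for p in permutations(seed_blocks) if passes_test(list(p))]
```

Of the six permutations, only the original ordering survives here, which is precisely why permutation (not just combination) is the right operation for code.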

If I were to use method two, I would start by understanding the algorithm, then build from an exhaustive set of loops, assignments, conditionals, and function calls until I get closer and closer to a viable solution (this again follows a distance check based on static and dynamic analysis which I won’t get into here; just saying, it’s not a blind combination). But then I won’t have the likelihood for an incoming submission that was synthetically foreseen.
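The author leaves the distance check out of scope, but a minimal dynamic-analysis-only stand-in is easy to sketch: score each candidate by the fraction of I/O cases it fails, where 0.0 means viable. The spec (two-argument addition), the cases, and the candidates are all assumptions for illustration.

```python
# A minimal "distance check" via dynamic analysis alone: a candidate's
# distance is the fraction of I/O test cases it fails (0.0 = viable).
CASES = [((2, 3), 5), ((0, 0), 0), ((1, 4), 5)]

def distance(fn):
    fails = 0
    for args, want in CASES:
        try:
            if fn(*args) != want:
                fails += 1
        except Exception:
            fails += 1  # a crash counts as a failed case
    return fails / len(CASES)

add_candidate = lambda a, b: a + b   # matches the spec
mul_candidate = lambda a, b: a * b   # off-spec on most cases
```

A real check would combine this with static signals (type errors, unreachable code) so candidates can be pruned before they are ever run.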

**Now, an archetype is a base solution.** For example, the factorial function can be encoded in multiple ways; two of them are recursion and looping. These serve as archetypes: minimal working code that solves the problem in two structurally different ways. Once my machine figures out this set of base archetypes, the rest of the generation becomes cleaner and smoother, and this time we’re ready for the first impression as well. In the sense that there may be cases we don’t have enough info on, but we know OF those cases and can give the closest feedback possible without being specific enough to get caught out, while still being helpful.
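To make the factorial example concrete, here are the two archetypes named above. The function names are mine; the point is that each one roots a separate tree of extrapolations, so missing either from the seed loses its whole sub-tree.

```python
# Archetype 1: the recursive encoding of factorial.
def fact_recursive(n):
    return 1 if n <= 1 else n * fact_recursive(n - 1)

# Archetype 2: the iterative (looping) encoding.
def fact_iterative(n):
    acc = 1
    for i in range(2, n + 1):
        acc *= i
    return acc

# Each base archetype anchors its own family of perturbed variants.
ARCHETYPES = [fact_recursive, fact_iterative]
```

Both are minimal and correct, yet structurally distinct: perturbing the recursive one never yields the loop family, and vice versa, which is exactly why a seed dataset missing one archetype silently loses an entire branch of the generation space.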