Generate Synthetic Data for AI Vision Training

Johan Louwers
4 min read · Jun 14, 2022

When building an AI solution to recognize objects as part of a wider AI vision solution, keep in mind that around 80% of the development effort will most likely go into collecting and preparing data. Determining how much data you will need is therefore a critical first step in correctly estimating the effort and cost of the whole project. A recent study from iMerit outlines the sizing of your learning set in more detail and is worth exploring when calculating the total effort of your development work.

While roughly 80% of the effort goes into collecting and preparing the data to train your model, there is a second catch to consider. If your model requires, for example, a thousand images for training, you can only train it if those thousand images are actually available.

In some fields, a thousand images of a specific object might not be available. In the defence space, for example, a thousand images of a new weapon system might not be directly available to you, even though you want to develop an AI model that is able to detect this specific weapon system when it appears in a wider set of collected data.

The solution to this specific problem, and also to the challenge of building and preparing large datasets in general, is synthetic data generation.

Synthetic data generation
According to Gartner, the use of synthetic data will largely overshadow the use of real-world data in the upcoming years.

Gartner on Synthetic Data for AI

If we take the example of a military weapon system for which we want to train an AI vision model, a quick way forward is to generate synthetic data to speed up development and improve the level of usability and accuracy.

One approach is to model the new weapon system as a 3D object and automatically generate large sets of high-quality images from the 3D model. Because the learning set is generated synthetically, you no longer need a thousand images from the real world.
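To make the idea of automatically generating images from a 3D model a bit more concrete, here is a minimal sketch of the step that usually comes before rendering: sampling randomized camera viewpoints and lighting around the object. It is not the author's pipeline. The names `RenderSpec` and `sample_render_specs`, the hemisphere sampling, the light intensity range and the pool of background images are all assumptions made for illustration, and the actual rendering (in a 3D tool of your choice) is left out.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class RenderSpec:
    """One synthetic image to render (hypothetical structure)."""
    camera_xyz: tuple       # camera position; the camera is assumed to look at the origin
    light_intensity: float  # relative light strength, arbitrary unit (assumption)
    background_id: int      # index into a hypothetical pool of background images


def sample_render_specs(n, radius=10.0, n_backgrounds=50, seed=0):
    """Sample n randomized viewpoints on the upper hemisphere around the object."""
    rng = random.Random(seed)  # seeded so the same dataset can be regenerated
    specs = []
    for _ in range(n):
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        # Drawing sin(elevation) uniformly spreads the cameras evenly
        # over the surface of the hemisphere.
        elevation = math.asin(rng.uniform(0.0, 1.0))
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.cos(elevation) * math.sin(azimuth)
        z = radius * math.sin(elevation)
        specs.append(RenderSpec(
            camera_xyz=(x, y, z),
            light_intensity=rng.uniform(0.3, 1.5),
            background_id=rng.randrange(n_backgrounds),
        ))
    return specs


# A thousand render specifications, one per synthetic image.
specs = sample_render_specs(1000)
print(len(specs))
```

Each `RenderSpec` would then be handed to a renderer that places the camera, sets the light, composites the background and writes out one image. Because the object's position in the scene is known, labels can in principle be derived from the scene itself rather than annotated by hand.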

The only thing required is a relatively good understanding of the hard-surface model, which can then be re-created as a…

Johan Louwers

Johan Louwers is a technology enthusiast with a long background in supporting enterprises and startups alike as CTO, Chief Enterprise Architect and developer.