Synthetic Data: When Generative AI Meets Privacy in Machine Learning
When EdX was founded in 2012, more than 150,000 students from all over the world signed up to use the online learning platform. A joint project between the Massachusetts Institute of Technology and Harvard University, EdX was created as an alternative to in-person classes. Still, the unexpectedly high number of signups generated a wealth of student data. There was just one problem: because of privacy laws, the data had to remain private. To work within that legal barrier, Kalyan Veeramachaneni, a data scientist at MIT, created synthetic “students” similar to those enrolled in EdX and then used machine learning algorithms to generate alternative variants of the actual data. Veeramachaneni would go on to create an open-source software package that helps users quickly develop synthetic data.
In a world where data is central to machine learning and algorithm creation, the absence of quality data can mean the difference between a bad algorithm and one that functions properly. In many cases, though, data is hard to come by, either because of concerns around privacy and safety or because it simply does not exist. Either way, companies are turning to synthetic data to fill that need. By creating training datasets that reflect real-world data, they essentially get the best of both worlds.
How It Works
Synthetic data can be created for any type of dataset, from simple tabular data to complex unstructured data, but each requires different techniques. The simplest way of generating synthetic data is to randomly sample numbers from a distribution. While the result may not capture all the insights of real data, it produces a distribution that is close to that of real-world data. Things get trickier with neural network techniques, which can handle much richer data distributions and can also synthesize unstructured data. Here are three methods used to generate synthetic data with neural networks.
Variational Auto-Encoder (VAE): A VAE learns the distribution of an original dataset during training by encoding it into a structured latent space; new data is generated by sampling points from that latent space and decoding them.
Generative Adversarial Network (GAN): A GAN consists of two models that compete with each other during training to generate fake yet realistic data points. The first model (the generator) produces fake data points, while the second (the discriminator) tries to tell the artificial samples from the real ones.
Diffusion Models: These algorithms deliberately add random noise to data and then learn to reverse the diffusion process to reconstruct the desired data samples. They are primarily used for generating images.
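The simple distribution-sampling approach mentioned above can be sketched in a few lines: fit a distribution to real tabular data, then draw new rows from it. The dataset below (age and income columns) is hypothetical, chosen purely for illustration.

```python
import numpy as np

def synthesize_gaussian(real_data: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to real tabular data and sample
    synthetic rows from it. This captures the means and pairwise
    correlations of the columns, but not higher-order structure."""
    rng = np.random.default_rng(seed)
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Hypothetical "real" dataset: two correlated columns (age, income).
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=1000)
income = 1000 * age + rng.normal(0, 5000, size=1000)
real = np.column_stack([age, income])

synthetic = synthesize_gaussian(real, n_samples=500)
```

The synthetic rows preserve the average age and the age–income correlation of the original table without reproducing any individual record, which is exactly the trade-off described above: useful statistics, but none of the finer structure a neural approach could capture.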
Synthetic data has the potential to be used in industries where data, especially quality data, is difficult to access. One of those industries is healthcare, where significant research requires a sizable amount of data that might not be readily available because of privacy concerns. In 2022, a group of scientists at UC Davis Health in California was awarded a four-year grant from the US National Institutes of Health to study ways to generate synthetic data that could fill the gap left by restricted data. The increasing interest in synthetic data in healthcare should also translate into more research that helps healthcare professionals predict and treat diseases.
Meanwhile, in industries with relatively lower stakes, synthetic data has created an opportunity to generate privacy-safe data for research and model training. In finance, for example, synthetic data is providing access to customer and transaction data that would otherwise be restricted under privacy and safety regulations. The same is true for artificial intelligence and the possibility of training models on synthetic data. Since sourcing and cleaning real-world data can be tricky and expensive, synthetic data offers an affordable option for building training datasets. Beyond this, synthetic data is also being used to estimate city populations to support targeted interventions for low-income areas and to prevent road accidents. For good reason, then, the global synthetic data generation market is expected to grow from an estimated USD 0.3 billion this year to USD 2.1 billion by 2028.
Synthetic Data in Language AI
Synthetic data addresses many of the worries that come with using data to train language AI models. It offers a way to preserve users' privacy while supplying high-quality training data. Since NLP models must be trained on large amounts of data that do not necessarily represent the reality of the world we live in, synthetic data can fill the gaps present in real data. This can help build more realistic language models than training on real-world data alone. As the demand for high-quality training data increases, synthetic data is quickly becoming an efficient alternative.
Synthetic data is also in a unique position when it comes to language AI, since much of the output of NLP models can itself be considered a form of synthetic data. For example, text generation models, when prompted, can produce text that can be used as synthetic training data. Because generative models can hallucinate, such data may need to be fact-checked and diversified before use.
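In practice one would prompt a generative model for this, but the underlying idea can be shown with a minimal, model-free sketch: filling templates with slot values to produce synthetic labeled text for, say, an intent classifier. All templates, slots, and labels here are hypothetical.

```python
import itertools
import random

# Hypothetical templates and slot values for a support-ticket intent classifier.
TEMPLATES = {
    "cancel_order": [
        "I want to cancel my {item} order",
        "Please cancel the {item} I ordered {time}",
    ],
    "track_order": [
        "Where is the {item} I ordered {time}?",
        "Can you track my {item} order?",
    ],
}
SLOTS = {
    "item": ["laptop", "phone", "book"],
    "time": ["yesterday", "last week"],
}

def generate_synthetic_examples(seed: int = 0) -> list:
    """Fill every template with every combination of slot values,
    yielding shuffled (text, label) pairs usable as training data."""
    examples = []
    for label, templates in TEMPLATES.items():
        for template in templates:
            # Only fill the slots that actually appear in this template.
            used = [s for s in SLOTS if "{" + s + "}" in template]
            for values in itertools.product(*(SLOTS[s] for s in used)):
                text = template.format(**dict(zip(used, values)))
                examples.append((text, label))
    random.Random(seed).shuffle(examples)
    return examples

data = generate_synthetic_examples()
```

Swapping the template filler for calls to a large language model yields the richer, more varied synthetic text described above, along with the fact-checking burden that comes with it.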
One major problem with real-world datasets is their tendency to contain skewed or biased data, depending on the source. This has led to biased models ranging from art generators to healthcare algorithms, with the latter prompting the WHO to caution against using AI to make healthcare decisions. Introducing synthetic data in these scenarios can help dispel worries about biased data leading to biased models and algorithms. Because synthetic data is derived from real-world data, which may itself be biased, this can mean deliberately generating additional samples for underrepresented classes where necessary.
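Generating additional samples for an underrepresented class can be as simple as interpolating between existing minority-class records, a simplified version of the SMOTE technique (here without nearest-neighbour search, and on a made-up dataset):

```python
import numpy as np

def oversample_by_interpolation(minority: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Generate synthetic minority-class rows by linearly interpolating
    between random pairs of real minority-class samples. A simplified,
    SMOTE-like scheme that skips the usual nearest-neighbour step."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(minority), size=n_new)
    j = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weight for each new row
    return minority[i] + t * (minority[j] - minority[i])

# Hypothetical imbalanced dataset: only 20 minority rows with 3 features.
rng = np.random.default_rng(1)
minority = rng.normal(0, 1, size=(20, 3))
synthetic_minority = oversample_by_interpolation(minority, n_new=80)
balanced = np.vstack([minority, synthetic_minority])
```

Because each synthetic row lies on a line segment between two real minority rows, the new samples stay inside the region the minority class already occupies, padding out the class without inventing outliers.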
The major challenge of synthetic data is its reliance on real-world data for its production. In healthcare, for example, where the quality of a dataset can mean the difference between life and death, synthetic data must stay as close to real-world data as possible. That requires access to accurate data, but in scenarios where data privacy is essential or even legally required, using real data to create synthetic data is a delicate balance. Consideration must be given to the possibility that synthetic records could be traced back to the original contributors, which would defeat the purpose of using synthetic data in the first place.
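One simple, commonly used safeguard against this traceability risk is a distance-to-closest-record check: flag any synthetic row that sits suspiciously close to a real row. The sketch below, on hypothetical data, is a crude proxy rather than a formal privacy guarantee.

```python
import numpy as np

def flag_near_copies(real: np.ndarray, synthetic: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean mask over synthetic rows whose Euclidean distance
    to the closest real row falls below `threshold` -- a rough indicator
    of records that might be traceable to an original contributor."""
    # Pairwise distances, shape (n_synthetic, n_real).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1) < threshold

# Hypothetical data: one synthetic row accidentally copied from the real set.
rng = np.random.default_rng(0)
real = rng.normal(0, 1, size=(50, 4))
synthetic = rng.normal(0, 1, size=(10, 4))
synthetic[0] = real[3]  # a memorized record slipping through

mask = flag_near_copies(real, synthetic, threshold=0.05)
```

Flagged rows would be dropped or regenerated before the synthetic dataset is released; more rigorous approaches layer formal guarantees such as differential privacy on top of checks like this.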
Synthetic data can bypass privacy and safety concerns and create new possibilities for research and development, not only in technology but also in other industries. The relatively low cost of synthetic data also lowers the barriers to entry for independent AI research and deployment. Ultimately, the usefulness of synthetic data lies in its ability to be a trustworthy representation of real-world data while making up for its shortcomings. Synthetic data that can do both has the potential to bridge the gap when data is unavailable.