
Use of Synthetic Data, in Early Stage, Seen as an Answer to Data Bias 

by DeepTech Central
September 30, 2021
in Artificial Intelligence

By AI Trends Staff 

Ensuring that the huge volumes of data on which many AI applications rely are not biased and comply with restrictive data privacy regulations is a challenge that a new industry is positioning itself to address: synthetic data production.

Gary Grossman, Senior VP of Technology Practice, Edelman

Synthetic data is computer-generated data that can be used as a substitute for data from the real world. Synthetic data does not explicitly represent real individuals. “Think of this as a digital mirror of real-world data that is statistically reflective of that world,” stated Gary Grossman, senior VP of Technology Practice at Edelman, a public relations and marketing consultancy, in a recent account in VentureBeat. “This enables training AI systems in a completely virtual realm.”

In general, the more data an AI algorithm can train on, the more accurate and effective its results tend to be.

To help meet the demand for data, more than 50 software suppliers have developed synthetic data products, according to research last June by StartUs Insights, consultants based in Vienna, Austria.

One alternative for responding to privacy concerns is anonymization: the masking or elimination of personal data such as names and credit card numbers from ecommerce transactions, or the removal of identifying content from healthcare records. “But there is growing evidence that even if data has been anonymized from one source, it can be correlated with consumer datasets exposed from security breaches,” Grossman states. Re-identification can even be achieved by correlating the data with public sources, with no security breach required.
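
The re-identification risk described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the records, the field names, and the choice of ZIP code, birth year, and gender as the matching fields are hypothetical and do not come from the article. It only shows the general idea that records stripped of names can still be matched against a second dataset that carries names.

```python
# Hypothetical "anonymized" purchase records: names removed, other fields kept.
anonymized_purchases = [
    {"zip": "02139", "birth_year": 1984, "gender": "F", "purchase": "insulin test strips"},
    {"zip": "94110", "birth_year": 1990, "gender": "M", "purchase": "running shoes"},
    {"zip": "60614", "birth_year": 1975, "gender": "F", "purchase": "baby monitor"},
]

# Hypothetical public (or leaked) dataset that still carries names.
public_records = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1984, "gender": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "zip": "94110", "birth_year": 1990, "gender": "M"},
]

# Fields that are not names but can still single a person out (an assumption).
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def reidentify(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Return {row_index: name} for anonymized rows matching exactly one public row."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for i, row in enumerate(anonymized):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the row
            matches[i] = candidates[0]
    return matches


reidentified = reidentify(anonymized_purchases, public_records)
print(reidentified)
```

In this toy data only the first row is linked to a name: the second row matches two people and stays ambiguous, and the third has no counterpart in the public dataset.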

A primary tool for building synthetic data is the same one used to create deepfake videos: generative adversarial networks (GANs), a pair of neural networks. One network, the generator, produces the synthetic data, and the second, the discriminator, tries to detect whether it is real. The AI learns over time, with the generator improving the quality of the data until the discriminator cannot tell the difference between real and synthetic.

A goal for synthetic data is to correct for bias found in real-world data. “By more completely anonymizing data and correcting for inherent biases, as well as creating data that would otherwise be difficult to obtain, synthetic data could become the saving grace for many big data applications,” Grossman states.

Big tech companies including IBM, Amazon, and Microsoft are working on synthetic data generation. However, it is still early days and the developing market is being led by startups.  

A few examples: 

AiFi — Uses synthetically generated data to simulate retail stores and shopper behavior;  

AI.Reverie — Generates synthetic data to train computer vision algorithms for activity recognition, object detection, and segmentation;  

Anyverse — Simulates scenarios to create synthetic datasets using raw sensor data, image processing functions, and custom LiDAR settings for the automotive industry. 

Synthetic Data Can Be Used to Improve Even High-Quality Datasets  

Dawn Li, Data Scientist, Innovation Lab, Finastra

Even if you have a high-quality dataset, acquiring synthetic data to round it out often makes sense, suggests Dawn Li, a data scientist at the Innovation Lab of Finastra, a company providing enterprise software to banks, writing in InfoQ.

For example, if the task is to predict whether a piece of fruit is an apple or an orange, and the dataset has 4,000 samples for apples and 200 samples for oranges, “Then any machine learning algorithm is likely to be biased towards apples due to the class imbalance,” Li stated. If 3,800 synthetic examples of oranges are generated, both classes have 4,000 samples, which removes the imbalance-driven bias toward either fruit and should allow the model to make more accurate predictions.
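
The apple and orange arithmetic can be sketched in a few lines. The article does not say how the extra oranges would be generated, so the method below is an assumption: a simplified SMOTE-style interpolation that creates each synthetic orange on the line segment between two randomly chosen real oranges (real SMOTE restricts the pairing to nearest neighbours). The two features, weight and a redness score, are hypothetical as well.

```python
import random


def oversample_by_interpolation(minority, target_count, seed=0):
    """Create enough synthetic rows to bring `minority` up to `target_count` rows."""
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        a, b = rng.sample(minority, 2)  # two distinct real samples
        t = rng.random()                # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


rng = random.Random(42)
# Hypothetical features: (weight in grams, redness score).
apples = [(rng.gauss(180, 15), rng.gauss(0.8, 0.05)) for _ in range(4000)]
oranges = [(rng.gauss(150, 12), rng.gauss(0.3, 0.05)) for _ in range(200)]

synthetic_oranges = oversample_by_interpolation(oranges, target_count=len(apples))
balanced_oranges = oranges + synthetic_oranges

print(len(apples), len(balanced_oranges), len(synthetic_oranges))
```

Because every synthetic point lies between two real ones, this approach cannot invent oranges outside the range already observed; it balances the classes but does not add genuinely new information.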

When data you wish to share contains personally identifiable information (PII) and anonymizing it would take too long to be practical, synthetic samples generated from the real dataset can preserve its important characteristics and can be shared without the risk of invading privacy and leaking personal information.

Privacy issues are paramount in financial services. “Financial services are at the top of the list when it comes to concerns around data privacy. The data is sensitive and highly regulated,” Li states. As a result, the use of synthetic data has grown rapidly in financial services. Additional real financial data is difficult to obtain because real-world experience takes time to accumulate, whereas synthetic data can be generated and used immediately.

A popular method for generating synthetic data, in addition to GANs, is the use of variational autoencoders, neural networks whose goal is to reconstruct their own input. Traditional supervised machine learning tasks have an input and a separate output. With autoencoders, the input serves as its own prediction target. The network has an encoder and a decoder. The encoder compresses the input, creating a smaller representation of it. The decoder takes the compressed representation and tries to reconstruct the original input. By scaling the data down in the encoder and building it back up in the decoder, the network learns how to represent the data. “If we can accurately rebuild the original input, then we can query the decoder to generate synthetic samples,” Li stated.
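
The encode, decode, and "query the decoder" steps can be sketched without a deep learning framework, with one loudly stated substitution: the code below is not a variational autoencoder. It uses the simplest possible encoder/decoder pair, a linear projection of two columns onto their principal axis (the solution a linear autoencoder with a one-number bottleneck converges to). The data, the 2-to-1 compression, and all function names are assumptions chosen for illustration.

```python
import math
import random


def fit_linear_autoencoder(points):
    """Fit a 2-D -> 1-D -> 2-D linear encoder/decoder along the principal axis."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Direction of greatest variance for a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(angle), math.sin(angle))

    def encode(p):
        # Compress a 2-D point into a single number (its position on the axis).
        return (p[0] - mean_x) * axis[0] + (p[1] - mean_y) * axis[1]

    def decode(code):
        # Rebuild a 2-D point from the single-number code.
        return (mean_x + code * axis[0], mean_y + code * axis[1])

    return encode, decode


rng = random.Random(0)
# Hypothetical "real" data: two correlated columns, y is roughly 2x.
real = []
for _ in range(500):
    x = rng.gauss(0.0, 1.0)
    real.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

encode, decode = fit_linear_autoencoder(real)

# Reconstruction error: how much information the bottleneck loses.
codes = [encode(p) for p in real]
mse = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
          for p, q in zip(real, map(decode, codes))) / len(real)

# "Query the decoder": sample new codes from a Gaussian fitted to the real codes.
code_std = math.sqrt(sum(c * c for c in codes) / len(codes))
synthetic = [decode(rng.gauss(0.0, code_std)) for _ in range(500)]

print(f"reconstruction MSE: {mse:.4f}")
```

A variational autoencoder follows the same outline but replaces the linear maps with neural networks and learns a distribution over the codes during training, which lets it capture non-linear structure that this sketch cannot.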

To validate the synthetic data, Li suggested checking statistical similarity and machine learning efficacy. To assess similarity, view side-by-side histograms, scatterplots, and cumulative sums of each column to confirm that the real and synthetic data look alike. Next, compute the correlations within each dataset and plot the two correlation matrices to get an idea of how similar or different the correlations are.
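
A numeric counterpart to those visual checks might look like the sketch below. It is an assumption about how one could summarize the comparison, not Li's procedure: it reports per-column gaps in mean and standard deviation, plus the largest difference between the two correlation matrices. The toy tables and all function names are hypothetical.

```python
import math


def column(rows, j):
    return [row[j] for row in rows]


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))


def correlation_matrix(rows):
    k = len(rows[0])
    return [[pearson(column(rows, i), column(rows, j)) for j in range(k)] for i in range(k)]


def similarity_report(real, synthetic):
    """Per-column mean/std gaps plus the largest correlation difference."""
    k = len(real[0])
    mean_gaps = [abs(mean(column(real, j)) - mean(column(synthetic, j))) for j in range(k)]
    std_gaps = [abs(std(column(real, j)) - std(column(synthetic, j))) for j in range(k)]
    cr, cs = correlation_matrix(real), correlation_matrix(synthetic)
    corr_gap = max(abs(cr[i][j] - cs[i][j]) for i in range(k) for j in range(k))
    return {"mean_gaps": mean_gaps, "std_gaps": std_gaps, "max_corr_gap": corr_gap}


# Hypothetical toy tables with two numeric columns each.
real = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.8)]
good_synthetic = [(1.2, 2.3), (2.1, 4.2), (2.9, 5.9), (4.1, 8.3), (4.8, 9.5)]
# Same values per column as `real`, but the second column is reversed.
bad_synthetic = [(1.0, 9.8), (2.0, 8.0), (3.0, 6.2), (4.0, 3.9), (5.0, 2.1)]

good_report = similarity_report(real, good_synthetic)
bad_report = similarity_report(real, bad_synthetic)
print(good_report["max_corr_gap"], bad_report["max_corr_gap"])
```

The second synthetic table shows why both checks matter: its per-column means and spreads match the real data exactly, so its histograms would look identical, yet the relationship between the columns is inverted and only the correlation comparison reveals it.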

To assess machine learning efficacy, choose a target variable or column, define some evaluation metrics, and assess how well a model built on the synthetic data performs. “If it performs well upon evaluation on real data, then we have a good synthetic dataset,” Li stated.
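
One common way to run this check is to train a model on the synthetic data and score it on held-out real data. The sketch below assumes that reading of the quote; the nearest-centroid classifier, accuracy as the metric, and the fruit data are hypothetical choices made to keep the example small.

```python
def fit_centroids(rows, labels):
    """Nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


# Hypothetical data: two features, and a target column that is "apple" or "orange".
synthetic_rows = [(178, 0.82), (185, 0.77), (171, 0.80), (149, 0.31), (155, 0.28), (143, 0.35)]
synthetic_labels = ["apple", "apple", "apple", "orange", "orange", "orange"]
real_rows = [(182, 0.79), (175, 0.85), (152, 0.30), (146, 0.27)]
real_labels = ["apple", "apple", "orange", "orange"]

model = fit_centroids(synthetic_rows, synthetic_labels)  # train on synthetic data
score = accuracy(model, real_rows, real_labels)          # evaluate on real data
print(f"accuracy on real data: {score:.2f}")
```

A useful baseline is the same model trained on real data: if the score from synthetic training is close to that baseline, the synthetic dataset has likely preserved the signal that matters for the task.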

Best Practices for Working with Synthetic Data  

Best practices for working with synthetic data were suggested in a recent account in AIMultiple written by Cem Dilmegani, founder of the company that seeks to “democratize” AI.   

First, work with clean data. “If you don’t clean and prepare data before synthesis, you can have a garbage in, garbage out situation,” he stated. He recommended following principles of data cleaning, and data “harmonization,” in which the same attributes from different sources need to be mapped to the same columns.  
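
As a minimal sketch of what harmonization can mean in practice, the example below renames fields from two sources onto one shared set of columns and applies a small cleaning step. The source systems, column names, and the shared schema are all hypothetical; the article does not prescribe a particular mapping or tool.

```python
# Hypothetical column names from two source systems, mapped to one shared schema.
COLUMN_MAP = {
    "crm": {"cust_name": "customer_name", "dob": "birth_date", "zip_code": "postal_code"},
    "billing": {"CustomerName": "customer_name", "BirthDate": "birth_date", "PostCode": "postal_code"},
}


def harmonize(record, source):
    """Rename a record's fields to the shared schema and trim stray whitespace."""
    mapping = COLUMN_MAP[source]
    clean = {}
    for key, value in record.items():
        if key not in mapping:
            continue  # drop columns that have no place in the shared schema
        clean[mapping[key]] = value.strip() if isinstance(value, str) else value
    return clean


crm_row = {"cust_name": " Alice Example ", "dob": "1984-03-01", "zip_code": "02139"}
billing_row = {"CustomerName": "Bob Example", "BirthDate": "1990-07-15",
               "PostCode": "94110", "Notes": "n/a"}

harmonized = [harmonize(crm_row, "crm"), harmonize(billing_row, "billing")]
print(harmonized)
```

After this step both records expose the same three columns, so a synthesizer sees one consistent table rather than two partially overlapping ones.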

Also, assess whether synthetic data is similar enough to real data for its application area. Its usefulness will depend on the technique used to generate it. The AI development team should analyze the use case and decide if the generated synthetic data is a good fit for the use case.  

Finally, outsource support if necessary. The team should identify the organization’s synthetic data capabilities and outsource based on the capability gaps. The two steps of data preparation and data synthesis can be automated by software suppliers, he suggests.

Read the source articles and information in VentureBeat, in InfoQ and in AIMultiple. 


© 2018-2021 DeepTech Central. - by MintMore Inc..
