In this article, we'll learn how to generate synthetic data with the Python SDV library.
Every application needs data to test end-to-end functionality or performance. Sometimes generating data in bulk is hard, and sometimes the data we need isn't publicly available. To solve these problems, we need tools that can generate data from a given structure or format.
SDV (Synthetic Data Vault) is a Python library for generating synthetic data. It takes a single table, multiple tables, or time series data as input, learns the format and structure of that data, and generates new synthetic data with the same structure and format. Internally, SDV uses several probabilistic graphical modeling and deep learning-based techniques. It supports multiple data types, ships with built-in constraints, and lets us define custom constraints so that we can generate the data we want.
In this demo, we'll read a simple CSV file as input. SDV will use the file to learn the structure and format of the data, and then generate synthetic output data that follows multiple constraints and formats.
This article shows only the core code snippets; for the full code, refer to GitHub.
Prerequisites
- Python
Install the pandas and SDV packages
pip install pandas
pip install sdv
Input CSV file
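The input file itself is not reproduced here. As a minimal sketch, assuming the columns that the constraints later in this article refer to (ID, First name, Last name, Email, Age, Salary), a small stand-in file can be written with the standard library. The rows and the helper name `write_sample_csv` are hypothetical placeholders, not the author's data.

```python
import csv
from pathlib import Path

# Column names taken from the constraints used later in the article.
COLUMNS = ["ID", "First name", "Last name", "Email", "Age", "Salary"]

# Placeholder rows for illustration only; not the author's real input.
SAMPLE_ROWS = [
    [1, "Asha", "Rao", "asha.rao@gmail.com", 31, 52],
    [2, "Ben", "Cole", "ben.cole@gmail.com", 45, 70],
    [3, "Chen", "Li", "chen.li@gmail.com", 28, 48],
]

def write_sample_csv(path):
    """Write the placeholder rows to `path`, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(SAMPLE_ROWS)
    return path
```

Calling `write_sample_csv('data/samples.csv')` would create a file at the path read in the steps that follow. Three rows are enough to exercise the code, but a model will likely need many more to learn anything useful.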
Steps to generate synthetic data
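- Import all the required Python libraries. The sdv.tabular module used here belongs to SDV releases before 1.0; later releases reorganized the API.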
import pandas as pd
from sdv.tabular import TVAE
from sdv.constraints import CustomConstraint
from sdv.constraints import ColumnFormula
import random
from random import choice
import string
from string import ascii_uppercase
import warnings
warnings.filterwarnings("ignore")
- Read the input CSV file with pandas and pass it to SDV as input
data = pd.read_csv('data/samples.csv')
model = TVAE()
model.fit(data)
new_data = model.sample(num_rows=10)
Let's define our own rules for generating synthetic data based on the columns of the CSV above.
- Email - A combination of first name, last name, and gmail.com. SDV does not provide a built-in function to achieve this, so we have to write it as custom code. We do this with a ColumnFormula constraint.
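# Collect every constraint in one list so it can be passed to the model later
constraints = []
def generate_email(data):
    # Build the address as <first name>.<last name>@gmail.com
    data['Email'] = data['First name'] + "." + data['Last name'] + '@gmail.com'
    return data['Email']
email_constraints = ColumnFormula(
    column='Email',
    formula=generate_email,
    handling_strategy='transform')
constraints.append(email_constraints)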
- Arbitrary String - This is useful when the real string data cannot be made public. SDV can generate arbitrary values for a few basic column types, but if we want to completely anonymize a column it does not support, we need to write custom code. We can pass multiple columns to the same constraint, as shown below.
def generate_arbitary_string_series(data):
    # Replace a value with random letters of the same length
    return ''.join(random.choices(string.ascii_letters, k=len(str(data))))
def generate_arbitary_string(column_data):
    # Apply the replacement to every value in the column
    return column_data.apply(generate_arbitary_string_series)
arbitary_string_constraints = CustomConstraint(
    columns=['First name', 'Last name'],
    transform=generate_arbitary_string)
constraints.append(arbitary_string_constraints)
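To see what the per-value replacement does outside SDV, here is a small standalone sketch. It assumes the goal is to keep each value's length while discarding its content. The helper name `anonymize_value` and the seeded `random.Random` instance are illustrative additions, not part of the original code.

```python
import random
import string

def anonymize_value(value, rng):
    """Return random ASCII letters with the same length as str(value)."""
    return "".join(rng.choices(string.ascii_letters, k=len(str(value))))

# A seeded generator makes the run repeatable; the seed value is arbitrary.
rng = random.Random(42)
names = ["Alice", "Bob", "Christopher"]
masked = [anonymize_value(name, rng) for name in names]

# Each masked value keeps the original length; the letters are random.
for original, replacement in zip(names, masked):
    print(len(original), len(replacement), replacement)
```

Because the replacement is drawn independently for every value, two identical names in the input will generally map to different strings.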
- Arbitrary Integer - This is useful when the real integer data cannot be made public, or when we simply want to generate a random number dynamically.
def generate_arbitary_number_series(data):
    # Ignore the original value and return a random integer from 0 to 100
    return random.randint(0, 100)
def generate_arbitary_number(column_data):
    # Apply the replacement to every value in the column
    return column_data.apply(generate_arbitary_number_series)
arbitary_number_constraints = CustomConstraint(
    columns=['Age'],
    transform=generate_arbitary_number)
constraints.append(arbitary_number_constraints)
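One detail worth checking is the range of the generated numbers: `random.randint` includes both endpoints, and the result does not depend on the original value at all. The sketch below illustrates this with a hypothetical helper `random_age` and an arbitrary seed. Neither is part of the original code.

```python
import random

def random_age(rng, low=0, high=100):
    """Return a random integer in [low, high], both ends included."""
    return rng.randint(low, high)

rng = random.Random(7)  # arbitrary seed, used only for repeatability
ages = [random_age(rng) for _ in range(1000)]

# Every value stays inside the closed interval [0, 100].
print(min(ages) >= 0, max(ages) <= 100)
```

Since the values are drawn uniformly, the replaced column will not resemble the age distribution of the input file. That is fine for anonymization, but less so if realistic ages matter.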
- Multiply by number - This is a custom function that multiplies the column values by 1000. We can write any function that fits our logic.
def multiply_with_1000(column_data):
    return column_data * 1000
multiply_constraint = CustomConstraint(
    columns=['Salary'],
    transform=multiply_with_1000)
constraints.append(multiply_constraint)
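If the scaled values ever need to be mapped back to the original unit, the transform needs an inverse. The sketch below shows such a pair on plain Python lists so that it runs without pandas. The article's version relies on a pandas column, where `*` is applied element by element. The helper `divide_by_1000` is hypothetical, and whether it can be passed to `CustomConstraint` (for example as a `reverse_transform` argument) depends on the installed SDV version, so treat that as an assumption to verify.

```python
def multiply_with_1000(values):
    """Scale every value up by a factor of 1000."""
    return [value * 1000 for value in values]

def divide_by_1000(values):
    """Hypothetical inverse of multiply_with_1000."""
    return [value / 1000 for value in values]

salaries = [52, 70, 48]
scaled = multiply_with_1000(salaries)
restored = divide_by_1000(scaled)

print(scaled)
print(restored)
```

Note that applying `* 1000` to a plain list would repeat the list instead of scaling it, which is why the sketch uses list comprehensions.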
If a column needs no custom logic, or should simply be generated from the values already present in the input, we do not need to write any code. SDV handles it automatically.
If a column must contain distinct values, we can declare it as the primary key.
In this demo we generate only 15 rows, but we can generate as many rows as we need by changing the argument passed to sample.
model = TVAE(primary_key='ID', constraints=constraints)
model.fit(data)
synthetic_data_clients = model.sample(15)
print(synthetic_data_clients)
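After sampling, it can be worth confirming that the output follows the rules defined above. The sketch below is one hedged way to do that on a list of row dictionaries: it checks that ID values are distinct and that Email matches the first name and last name formula. The function name `check_rows` and the example rows are hypothetical and only stand in for real generated output.

```python
def check_rows(rows, key="ID"):
    """Return a list of problems found in `rows` (a list of dicts)."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # The primary key column must not repeat.
        if row[key] in seen:
            problems.append(f"row {index}: duplicate {key} {row[key]!r}")
        seen.add(row[key])
        # Email must equal <first name>.<last name>@gmail.com.
        expected = f"{row['First name']}.{row['Last name']}@gmail.com"
        if row["Email"] != expected:
            problems.append(f"row {index}: Email {row['Email']!r} != {expected!r}")
    return problems

# Hypothetical rows standing in for the generated data.
rows = [
    {"ID": 1, "First name": "qWe", "Last name": "Rty", "Email": "qWe.Rty@gmail.com"},
    {"ID": 1, "First name": "aSd", "Last name": "Fgh", "Email": "wrong@gmail.com"},
]
print(check_rows(rows))
```

With pandas, the rows could be obtained from the sampled DataFrame via `synthetic_data_clients.to_dict('records')`. An empty result means both rules hold for every row.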