## Introduction

This package generates synthetic data from an original dataset using deep learning techniques:

- Generative Adversarial Networks (GANs)
- With "Earth mover's distance"

## Installation

    pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release

## Usage

After installing, the easiest way to get started is with pandas. The process has two steps:

**Train the GAN on the original/raw dataset**

    import pandas as pd
    import data.maker

    df = pd.read_csv('sample.csv')
    column = 'gender'   # the feature to synthesize
    id = 'id'           # the identifier column
    context = 'demo'
    data.maker.train(context=context, data=df, column=column, id=id, logs='logs')

The trainer stores its output on disk (for now) in a structured folder that holds the trained models, which are later used to generate the synthetic data.

**Generate a candidate dataset from the learned features**

    import pandas as pd
    import data.maker

    df = pd.read_csv('sample.csv')
    id = 'id'           # the identifier column
    column = 'gender'   # the feature to synthesize
    context = 'demo'
    data.maker.generate(context=context, data=df, id=id, column=column, logs='logs')

## Limitations

GANs generate data on the assumption that the original data already covers the full value space:

- No new values will be created. Given a dataset with a gender attribute whose values are [M,F], the synthetic data will not contain genders outside [M,F].
- Not advised for continuous values. GANs work well on discrete values, so they are not recommended for continuous ones, e.g. measurements (height, blood pressure, ...).
- For now, only a single feature is processed at a time.

## Credits

- [Chao Yan](mailto:chao.yan@vanderbilt.edu)
- [Ziqi Zhang](mailto:ziqi.zhang@vanderbilt.edu)
- [Brad Malin](mailto:b.malin@vanderbilt.edu)
- [Steve L. Nyemba](mailto:steve.l.nyemba@vumc.org)
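
## Checking the limitations

The points listed under Limitations can be checked on your own data before and after a run. The snippet below is a minimal sketch and not part of this package: `value_space_report` and `looks_continuous` are hypothetical helper names, the 0.5 threshold is an arbitrary assumption, and both helpers take plain Python lists (for a pandas column, something like `df['gender'].tolist()`).

```python
from collections import Counter


def value_space_report(original, synthetic):
    """Compare the value spaces of an original and a synthetic column.

    Hypothetical helper, not part of data.maker. Both arguments are
    iterables of hashable, mutually comparable values.
    """
    original = list(original)
    synthetic = list(synthetic)
    orig_counts = Counter(original)
    synth_counts = Counter(synthetic)
    return {
        # Synthetic values that never occur in the original column.
        # Given the first limitation, this list is expected to be empty.
        "unseen_in_original": sorted(set(synth_counts) - set(orig_counts)),
        # Original values that the synthetic column did not reproduce.
        "missing_in_synthetic": sorted(set(orig_counts) - set(synth_counts)),
        # Relative frequency of each value, for a quick side-by-side look.
        "original_share": {v: n / len(original) for v, n in orig_counts.items()},
        "synthetic_share": {v: n / len(synthetic) for v, n in synth_counts.items()},
    }


def looks_continuous(values, max_distinct_ratio=0.5):
    """Rough heuristic: numeric values that are mostly distinct.

    The threshold is an assumption chosen for illustration only.
    """
    values = list(values)
    if not values:
        return False
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return numeric and len(set(values)) / len(values) > max_distinct_ratio


# Toy data standing in for an original and a generated 'gender' column.
report = value_space_report(["M", "F", "F", "M"], ["F", "F", "M", "F"])
heights_continuous = looks_continuous([170.2, 165.0, 181.4, 158.9])
gender_continuous = looks_continuous(["M", "F", "F", "M"])
```

A column for which `looks_continuous` returns `True` is probably a poor candidate for synthesis here. A non-empty `unseen_in_original` list would suggest that something other than the GAN changed the column.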