Steve Nyemba 363bd376e1 credits 2019-12-12 12:16:52 -06:00
data-maker structuring the repository 2019-12-12 11:43:08 -06:00
Dockerfile bug fix, and documentation 2019-12-12 12:13:31 -06:00
README.md credits 2019-12-12 12:16:52 -06:00
setup.py bug fix 2019-12-12 11:49:06 -06:00

README.md

Introduction


This package generates synthetic data from an original dataset using deep learning techniques:

- Generative Adversarial Networks
- With "Earth mover's distance"
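To give an intuition for the "Earth mover's distance" mentioned above, here is a minimal sketch of the metric for two one-dimensional samples of equal size. It only illustrates the distance itself; it is not this package's training code, and the function name `earth_movers_distance` is hypothetical.

```python
def earth_movers_distance(xs, ys):
    """1-D earth mover's distance between two equal-sized samples.

    Illustrative helper only; it is not part of data.maker.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    if not xs:
        raise ValueError("samples must not be empty")
    # Sorting pairs the i-th smallest value of one sample with the
    # i-th smallest value of the other, which is the cheapest way
    # to move mass between two 1-D empirical distributions.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


original = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]

print(earth_movers_distance(original, original))  # 0.0
print(earth_movers_distance(original, shifted))   # 1.0
```

A distance of 0 means the two samples have the same distribution; the further apart they are, the larger the value. In a GAN trained with this kind of objective, the generator is pushed to make that distance between real and generated data small.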

Installation


pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release

Usage


After installing, the easiest way to get started is with pandas. The process has two steps:

  1. Train the GAN on the original/raw dataset

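     import pandas as pd
     import data.maker

     # Load the original dataset and list the columns to learn from
     df   = pd.read_csv('myfile.csv')
     cols = ['f1','f2','f3']
     # Train the GAN on the selected columns
     data.maker.train(data=df,cols=cols,logs='logs')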
    
  2. Generate a candidate dataset from the learnt features

    import pandas as pd
    import data.maker

    df = data.maker.generate(logs='logs')
    df.head()

Limitations


GANs generate data on the assumption that the original data already covers the entire value space:

  • No new data will be created

      Assume a dataset has a gender attribute with values [M,F].
      The synthetic data will not contain genders outside [M,F].
    
  • Not advised on continuous values

      GANs work well on discrete values and are therefore not advised for continuous ones,
      e.g. measurements (height, blood pressure, ...)
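Both limitations can be illustrated with plain pandas. The sketch below is an assumption-laden example, not part of this package: the data frames are built by hand (the `synthetic` frame stands in for generated output), `unseen_values` is a hypothetical helper, and the bin edges are arbitrary. It first checks that a synthetic column stays inside the original value space, then shows one possible workaround for a continuous column, which is to discretize it into bands before training.

```python
import pandas as pd

# Hypothetical original data: one discrete and one continuous column.
original = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "height_cm": [181.5, 162.0, 170.2, 175.8],
})

# Hand-built stand-in for generated data (not real data.maker output).
synthetic = pd.DataFrame({"gender": ["F", "M", "M", "F"]})


def unseen_values(original_col, synthetic_col):
    """Return synthetic values that never occur in the original column."""
    return set(synthetic_col.dropna()) - set(original_col.dropna())


# Limitation 1: the synthetic values stay inside the original value space.
print(unseen_values(original["gender"], synthetic["gender"]))  # set()

# Limitation 2: turn a continuous column into discrete bands.
# The bin edges and labels are arbitrary choices for this example.
bins = [150, 165, 180, 195]
labels = ["150-165", "165-180", "180-195"]
original["height_band"] = pd.cut(original["height_cm"], bins=bins, labels=labels)

print(original["height_band"].tolist())
# ['180-195', '150-165', '165-180', '165-180']
```

The banded column could then be listed among the training columns in place of the raw measurement, at the cost of losing precision within each band.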
    

Credits :