data-maker

Latest commit: 587248c63b by Steve Nyemba, 2022-04-21 11:07:56 -05:00 (bug fix)

Introduction

This package generates synthetic data from an original dataset using deep learning techniques:

- Generative Adversarial Networks (GANs)
- With the "Earth mover's distance" (also known as the Wasserstein distance)
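To give a feel for what the Earth mover's distance measures, here is a minimal sketch for two histograms over the same ordered, evenly spaced bins. It only illustrates the metric. It is not the package's training code, and the function name `earth_movers_distance` is made up for this example.

```python
from itertools import accumulate


def earth_movers_distance(p, q):
    """Earth mover's distance between two histograms.

    Assumes both histograms share the same ordered bins and that
    neighbouring bins are one unit apart. In that 1-D case the distance
    is the sum of absolute differences between the cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")

    # Normalize so both histograms carry a total mass of 1.
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]

    # Compare the running totals bin by bin.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))
```

Intuitively, the result is the amount of mass that has to be moved multiplied by how far it has to travel: identical histograms give 0, and the further apart the mass sits, the larger the value.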

Installation

pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release

Usage

After installing, the easiest way to get started is with pandas. The process has two steps:

Train the GAN on the original/raw dataset

import pandas as pd
import data.maker

# Load the original/raw dataset
df      = pd.read_csv('sample.csv')
column  = 'gender'  # column to train on
id      = 'id'      # identifier column
context = 'demo'
data.maker.train(context=context,data=df,column=column,id=id,logs='logs')

The trainer stores its output on disk (for now) in a structured folder that holds the trained models later used to generate the synthetic data.
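The training example reads a `sample.csv` that this README does not ship. As a sketch, the helper below writes a small stand-in file with an `id` column and a `gender` column limited to `M` and `F`. The file layout is an assumption inferred from the example's `id` and `column` arguments, and `write_sample_csv` is a hypothetical helper, not part of the package.

```python
import csv
import random


def write_sample_csv(path, n_rows=100, seed=0):
    """Write a toy dataset with 'id' and 'gender' columns (assumed layout)."""
    rng = random.Random(seed)  # seeded so the file is reproducible
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "gender"])
        for row_id in range(1, n_rows + 1):
            # Discrete values only, matching the README's [M,F] example.
            writer.writerow([row_id, rng.choice(["M", "F"])])
    return path
```

Calling `write_sample_csv('sample.csv')` produces a file that `pd.read_csv('sample.csv')` can load.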

Generate a candidate dataset from the learned features

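import pandas as pd
import data.maker

# Load the same dataset and reuse the settings from training
df      = pd.read_csv('sample.csv')
id      = 'id'      # identifier column
column  = 'gender'  # column to generate
context = 'demo'
data.maker.generate(context=context,data=df,id=id,column=column,logs='logs')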
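Once a candidate dataset exists, a quick sanity check is to compare how often each value occurs in the original and in the synthetic column. The sketch below assumes both columns are available as plain lists of values. The README does not document what `generate` returns, so extracting the synthetic column is left out. `value_shares` and `largest_share_gap` are hypothetical helper names.

```python
from collections import Counter


def value_shares(values):
    """Return the share of each distinct value, e.g. {'M': 0.25, 'F': 0.75}."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def largest_share_gap(original, synthetic):
    """Largest absolute difference in share for any value in either column."""
    orig = value_shares(original)
    synth = value_shares(synthetic)
    keys = set(orig) | set(synth)
    return max((abs(orig.get(k, 0.0) - synth.get(k, 0.0)) for k in keys),
               default=0.0)
```

A gap close to 0 suggests the candidate reproduces the original value frequencies. What counts as acceptable is a judgment call for the data at hand.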

Limitations

GANs generate data on the assumption that the original data covers the entire value space needed:

No new values will be created

  Assume we have a dataset with a gender attribute with values [M,F].

  The synthetic data will not contain genders outside [M,F].
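The limitation above can be stated as a set relation: every synthetic value should already appear in the original column. A minimal sketch of that check, assuming both columns are available as plain lists (the helper name is hypothetical):

```python
def values_outside_original(original, synthetic):
    """Return synthetic values that never occur in the original column."""
    return set(synthetic) - set(original)
```

For an original column with values `['M', 'F']`, the expectation is that this returns an empty set for any generated column. It also means a category missing from the original data can never show up in the synthetic data.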

Not advised for continuous values

  GANs work well on discrete values and are therefore not advised for continuous ones,
  e.g. measurements (height, blood pressure, ...)
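One rough way to see whether a column behaves like a discrete feature is to look at how many of its values are distinct. The sketch below is a heuristic for illustration only. Both helper names and the default `threshold` are assumptions, not something the package provides or recommends.

```python
def distinct_ratio(values):
    """Fraction of values that are distinct (1.0 means every value is unique)."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return len(set(values)) / len(values)


def looks_discrete(values, threshold=0.05):
    """Heuristic: treat a column as discrete if few of its values are distinct.

    The threshold is an arbitrary illustration value; tune it for your data.
    """
    return distinct_ratio(values) <= threshold
```

A gender column with thousands of rows and two values has a ratio near 0, whereas a column of height measurements tends toward a ratio near 1.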

For now, the package operates on a single feature only.
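Given the single-feature restriction, one possible workaround is to run the documented training call once per column. The sketch below takes the training function as an argument and uses only the keyword arguments shown in the usage example. It assumes that a separate `context` per column keeps the runs apart, which this README does not confirm. The helper name and the context naming scheme are hypothetical.

```python
def train_each_column(train, data, columns, id_column, logs="logs"):
    """Call `train` once per column and return the contexts that were used.

    `train` is expected to accept the keyword arguments shown in the
    README's usage example (context, data, column, id, logs).
    """
    contexts = []
    for column in columns:
        # Hypothetical naming scheme: one context per column.
        context = f"demo-{column}"
        train(context=context, data=data, column=column,
              id=id_column, logs=logs)
        contexts.append(context)
    return contexts
```

With the package installed, this could be invoked as `train_each_column(data.maker.train, df, ['gender'], id_column='id')`. Columns trained independently like this will not preserve relationships between features.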
