## Introduction
This package generates synthetic data from an original dataset using deep learning techniques:
- Generative Adversarial Networks (GANs)
- With the "Earth mover's distance" (also known as the Wasserstein distance)
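
To give an intuition for the "Earth mover's distance", here is a minimal sketch that computes it for two one-dimensional samples of equal size. It illustrates the metric only and is not this package's training code; the function name `earth_movers_distance` and the sample values are hypothetical.

```python
def earth_movers_distance(xs, ys):
    """1-D earth mover's distance between two equal-sized samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # With equal weights, the cheapest transport plan pairs the i-th
    # smallest value of one sample with the i-th smallest of the other.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


original = [0, 0, 1, 1]
shifted = [1, 1, 2, 2]

print(earth_movers_distance(original, original))  # identical samples: 0.0
print(earth_movers_distance(original, shifted))   # every point moved by 1: 1.0
```

The smaller the distance, the closer the two distributions are, which is why it can serve as a training signal when comparing generated data against original data.
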
## Installation
    pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release
## Usage
After installing, the easiest way to get started is with pandas. The examples also import the `transport` package: read about [data-transport on github](https://github.com/lnyemba/data-transport) or on [healthcareio.the-phi.com/git/code/transport](https://healthcareio.the-phi.com/git/code/transport.git). The process is as follows:

**Train the GAN on the original/raw dataset**
1. Define the data sources

   The sources consist of a source, a target, and a logger.

       import pandas as pd
       import data.maker
       import transport
       from transport import providers


For now, the trainer stores its output on disk in a structured folder. This folder holds the trained models that are later used to generate the synthetic data.

**Generate a candidate dataset from the learned features**

    import pandas as pd
    import data.maker

    df = pd.read_csv('sample.csv')   # original dataset
    id = 'id'                        # column that identifies each record
    column = 'gender'                # column to synthesize
    context = 'demo'                 # label for this run
    data.maker.generate(context=context, data=df, id=id, column=column, logs='logs')

## Limitations
GANs generate data on the assumption that the original data already covers the whole value space needed:
- No new values will be created.
  For example, given a dataset with a gender attribute whose values are [M,F], the synthetic data will not contain genders outside [M,F].
- Not advised for continuous values.
  GANs work well on discrete values, so this package is not advised for continuous ones, e.g. measurements (height, blood pressure, ...).
- For now, it only operates on a single feature.
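
The first two limitations can be checked before and after synthesis. The sketch below is not part of this package: the helper names `unseen_values` and `looks_discrete`, the sample columns, and the threshold of 20 distinct values are all assumptions chosen for illustration.

```python
def unseen_values(original, synthetic):
    """Values in the synthetic column that never occur in the original."""
    return set(synthetic) - set(original)


def looks_discrete(values, max_unique=20):
    """Heuristic: treat a column as discrete if it has few distinct values.

    The default threshold is arbitrary; tune it to your data.
    """
    return len(set(values)) <= max_unique


gender = ["M", "F", "F", "M"]               # discrete attribute
height = [150 + i for i in range(50)]       # measurement-like attribute (cm)
synthetic_gender = ["F", "F", "M", "M"]     # stand-in for generated output

# An empty set means the synthetic column stayed inside the original value space.
print(unseen_values(gender, synthetic_gender))

# A column with many distinct values is a poor candidate for synthesis here.
print(looks_discrete(gender), looks_discrete(height))
```

Running a check like `looks_discrete` on a column before training is a cheap way to spot measurement-like attributes that fall under the continuous-value limitation.
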
## Credits
- [Chao Yan](mailto:chao.yan@vanderbilt.edu)
- [Ziqi Zhang](mailto:ziqi.zhang@vanderbilt.edu)
- [Brad Malin](mailto:b.malin@vanderbilt.edu)
- [Steve L. Nyemba](mailto:steve.l.nyemba@vumc.org)