## Introduction
---
This package generates synthetic data from an original dataset using deep learning techniques:
- Generative Adversarial Networks (GANs)
- With the "Earth mover's distance" (also known as the Wasserstein distance)
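
To give an intuition for the "Earth mover's distance", here is a minimal sketch that computes it for two histograms defined over the same ordered, unit-spaced bins. It only illustrates the metric in one dimension; it is not the package's training loss, and the function name `earth_movers_distance` is hypothetical.

```python
from itertools import accumulate


def earth_movers_distance(p, q):
    """Earth mover's distance between two histograms on the same unit-spaced bins.

    Illustrative helper only (not part of data-maker). In one dimension the
    distance equals the sum of absolute differences between the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Normalise so that both histograms carry exactly one unit of mass.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


original = [0.5, 0.5, 0.0]
shifted = [0.0, 0.5, 0.5]   # same shape, moved one bin to the right
print(earth_movers_distance(original, original))  # 0.0
print(earth_movers_distance(original, shifted))   # 1.0
```

Moving every unit of mass by one bin costs exactly 1.0, which is why the metric is described as the work needed to reshape one distribution into the other.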
## Installation
---
    pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release
## Usage
---
After installing, the easiest way to get started is with pandas. The process has two steps:
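1. Train the GAN on the original/raw dataset

       import pandas as pd
       import data.maker

       # Load the original dataset and choose the columns to learn from
       df = pd.read_csv('myfile.csv')
       cols = ['f1','f2','f3']
       # Pass the same logs value to generate() in step 2
       data.maker.train(data=df,cols=cols,logs='logs')
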
2. Generate a candidate dataset from the learnt features

       import pandas as pd
       import data.maker

       # Use the same logs value that was passed to train()
       df = data.maker.generate(logs='logs')
       df.head()
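
Because the generated dataset is only a candidate, it can help to compare it against the original before using it. The following is a minimal sketch, assuming each column has been converted to a plain Python list (for example with `df['f1'].tolist()`); the helper `frequency_gap` is hypothetical and not part of this package.

```python
from collections import Counter


def frequency_gap(original, synthetic):
    """Return {value: |share in original - share in synthetic|} for one column.

    Hypothetical helper, not provided by data-maker. Both arguments are
    non-empty lists of discrete values.
    """
    if not original or not synthetic:
        raise ValueError("both columns must be non-empty")
    orig_counts = Counter(original)
    synth_counts = Counter(synthetic)
    gaps = {}
    for value in set(orig_counts) | set(synth_counts):
        orig_share = orig_counts[value] / len(original)
        synth_share = synth_counts[value] / len(synthetic)
        gaps[value] = abs(orig_share - synth_share)
    return gaps


original_col = ["M", "F", "F", "M"]
synthetic_col = ["M", "F", "F", "F"]
print(frequency_gap(original_col, synthetic_col))
```

Large gaps for a value suggest the candidate over- or under-represents it relative to the original data.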
## Limitations
---
GANs generate data on the assumption that the original data already covers the entire value space needed:
- No new data will be created
  For example, given a dataset with a gender attribute whose values are [M,F],
  the synthetic data will not contain genders outside [M,F].
- Not advised on continuous values
  GANs work well on discrete values and are therefore not advised for continuous ones,
  e.g. measurements (height, blood pressure, ...).
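
The first limitation above can be checked directly: every value in a synthetic column should already appear in the original column. A minimal sketch, assuming columns are passed as plain iterables of hashable values; the helper name `unseen_values` is hypothetical and not part of this package.

```python
def unseen_values(original, synthetic):
    """Return the set of synthetic values that never occur in the original column.

    Hypothetical helper, not provided by data-maker.
    """
    return set(synthetic) - set(original)


original_gender = ["M", "F", "F", "M"]
synthetic_gender = ["F", "M", "M", "F"]

# An empty set is expected, since no new values are created.
print(unseen_values(original_gender, synthetic_gender))  # set()
```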
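
For the second limitation, one common workaround is to bin a continuous column into discrete intervals before training. This README does not say whether the package does anything like this itself, so the sketch below is only an assumption about how such preprocessing could look; `to_bins` and the bin edges are hypothetical.

```python
from bisect import bisect_right


def to_bins(values, edges):
    """Map each number to the index of the interval it falls into.

    Hypothetical preprocessing helper, not provided by data-maker.
    `edges` is a sorted list of cut points; index 0 means "below edges[0]",
    and index len(edges) means "at or above edges[-1]".
    """
    if list(edges) != sorted(edges):
        raise ValueError("edges must be sorted in ascending order")
    return [bisect_right(edges, v) for v in values]


heights_cm = [152.0, 168.5, 181.2, 199.9]
edges = [160, 175, 190]             # illustrative cut points only
print(to_bins(heights_cm, edges))   # [0, 1, 2, 3]
```

The trade-off is a loss of precision: values generated for a binned column can only be mapped back to an interval, not to an exact measurement.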
## Credits
---
- [Ziqi Zhang](mailto:ziqi.zhang@vanderbilt.edu)
- [Brad Malin](mailto:b.malin@vanderbilt.edu)
- [Steve L. Nyemba](mailto:steve.l.nyemba@vanderbilt.edu)