## Introduction
This package generates synthetic data from an original dataset using deep learning techniques:
- Generative Adversarial Networks
- With "Earth mover's distance"
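To give a feel for the "Earth mover's distance" mentioned above, here is a minimal sketch that computes it for two histograms over the same ordered bins. It is an illustration only, not the package's training code; the function name `earth_movers_distance` and the toy histograms are assumptions made for this example.

```python
from itertools import accumulate


def earth_movers_distance(p, q):
    """Earth mover's distance between two histograms over the same
    ordered, unit-spaced bins (illustrative helper, not part of data-maker)."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    # Normalize so both histograms carry exactly one unit of mass.
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # In one dimension the distance is the summed absolute difference
    # between the two cumulative distributions.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Toy histograms: all mass in the first, second, or third bin.
real = [10, 0, 0]
near = [0, 10, 0]
far = [0, 0, 10]

print(earth_movers_distance(real, near))  # 1.0
print(earth_movers_distance(real, far))   # 2.0
```

Unlike a simple overlap measure, the distance grows with how far the mass has to be moved, which is why `far` scores worse than `near`.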
## Installation
    pip install git+https://hiplab.mc.vanderbilt.edu/git/aou/data-maker.git@release
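One possible way to confirm that the installation worked is to check that the module can be found by Python. The sketch below assumes the package is importable as `data.maker` (the name used in the usage examples of this README); the helper `is_importable` is hypothetical and written only for this illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the module, False otherwise."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "data") is missing or invalid.
        return False


# Prints True once the package is installed in the current environment.
print(is_importable("data.maker"))
```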
## Usage
After installing, the easiest way to get started is with pandas, following the steps below.

Read about [data-transport on GitHub](https://github.com/lnyemba/data-transport) or on [healthcareio.the-phi.com/git/code/transport](https://healthcareio.the-phi.com/git/code/transport.git)

**Train the GAN on the original/raw dataset**
1. We define the data sources

The sources consist of a source, a target and a logger.

    import pandas as pd
    import data.maker
    import transport
    from transport import providers

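As a rough illustration of the three roles named above (source, target, logger), the sketch below describes them as plain dictionaries and checks that none is missing. Every key and value here (`kind`, `path`, the file names) is a placeholder invented for this example; it does not reflect the actual data-transport configuration format, which is documented at the links above.

```python
REQUIRED_ROLES = ("source", "target", "logger")


def missing_roles(sources):
    """Return the required roles that are absent from the mapping."""
    return [role for role in REQUIRED_ROLES if role not in sources]


# Hypothetical settings: all keys and values below are placeholders.
sources = {
    "source": {"kind": "file", "path": "sample.csv"},     # original data
    "target": {"kind": "file", "path": "synthetic.csv"},  # generated data
    "logger": {"kind": "file", "path": "logs/run.log"},   # run logs
}

print(missing_roles(sources))  # []
```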
The trainer stores the data on disk (for now) in a structured folder; this folder holds the trained models that are later used to generate the synthetic data.

**Generate a candidate dataset from the learned features**

    import pandas as pd
    import data.maker

    # Load the original dataset
    df = pd.read_csv('sample.csv')
    id = 'id'
    column = 'gender'
    context = 'demo'
    # Generate a candidate dataset for the selected column
    data.maker.generate(context=context,data=df,id=id,column=column,logs='logs')

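Before handing a file to the generator, it can help to confirm that the identifier column and the column to synthesize are actually present. The following is a minimal sketch using only the standard library; the helper `missing_columns` and the inline sample data are assumptions made for this example and are not part of data-maker.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that the CSV header lacks."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Inline stand-in for the contents of a file such as sample.csv.
sample = "id,gender,age\n1,M,34\n2,F,51\n"

print(missing_columns(sample, ["id", "gender"]))  # []
print(missing_columns(sample, ["id", "race"]))    # ['race']
```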
## Limitations
GANs generate data on the assumption that the original data covers the whole value space needed:
- No new values will be created

  Suppose a dataset has a gender attribute with values [M,F].
  The synthetic data will not contain genders outside [M,F].
- Not advised for continuous values

  GANs work well on discrete values and are therefore not advised for continuous ones,
  e.g. measurements (height, blood pressure, ...).
- For now, generation only works on a single feature.
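The limitations above can be turned into simple sanity checks. The sketch below is an assumption-laden illustration, not part of data-maker: `unseen_values` verifies that a synthetic column stays inside the original value space, and `looks_continuous` is a crude heuristic (the threshold of 20 distinct values is arbitrary) for spotting columns that are probably poor candidates.

```python
def unseen_values(original, synthetic):
    """Values in the synthetic column that never occur in the original."""
    return sorted(set(synthetic) - set(original))


def looks_continuous(values, max_distinct=20):
    """Heuristic: a numeric column with many distinct values.
    The max_distinct threshold is an arbitrary choice for this sketch."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return numeric and len(set(values)) > max_distinct


original_gender = ["M", "F", "F", "M"]
synthetic_gender = ["F", "M", "M", "F"]
heights = [150 + 0.5 * i for i in range(60)]  # toy measurements

print(unseen_values(original_gender, synthetic_gender))  # []
print(looks_continuous(original_gender))                 # False
print(looks_continuous(heights))                         # True
```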
## Credits
- [Chao Yan](mailto:chao.yan@vanderbilt.edu)
- [Ziqi Zhang](mailto:ziqi.zhang@vanderbilt.edu)
- [Brad Malin](mailto:b.malin@vanderbilt.edu)
- [Steve L. Nyemba](mailto:steve.l.nyemba@vumc.org)