# deid-risk
This project estimates the re-identification risk of a given database in three steps:
1. Pull the metadata of the database and create a dataset via joins
2. Generate the dataset with a random selection of features
3. Compute risk via SQL using `GROUP BY`
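
The last two steps can be sketched as follows. This is a minimal illustration and not the code in `risk.py`: it uses an in-memory SQLite database in place of BigQuery, the table `sample` and its columns are hypothetical, and it assumes the common definitions of marketer risk (number of equivalence classes divided by number of records) and prosecutor risk (one over the size of the smallest equivalence class).

```python
import random
import sqlite3

# Hypothetical sample data standing in for the merged dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (person_id INTEGER, zip TEXT, birth_year INTEGER, gender TEXT)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    [
        (1, "37203", 1980, "F"),
        (2, "37203", 1980, "F"),
        (3, "37212", 1975, "M"),
        (4, "37212", 1975, "F"),
        (5, "37212", 1990, "M"),
    ],
)

# Step 2: pick a random subset of the candidate features (seeded for repeatability).
columns = ["zip", "birth_year", "gender"]
features = random.Random(0).sample(columns, k=2)
fields = ", ".join(features)  # safe to interpolate: names come from the fixed list above

# Step 3: each group is one equivalence class, i.e. rows sharing the same feature values.
query = f"""
    SELECT COUNT(*)    AS group_count,
           MIN(g_size) AS min_size,
           SUM(g_size) AS row_count
    FROM (SELECT COUNT(*) AS g_size FROM sample GROUP BY {fields})
"""
group_count, min_size, row_count = conn.execute(query).fetchone()

marketer = group_count / row_count  # assumed definition: classes / records
prosecutor = 1 / min_size           # assumed definition: 1 / smallest class
print(f"features={features} marketer={marketer:.2f} prosecutor={prosecutor:.2f}")
```

Because the features are chosen at random, repeating the computation with different selections gives a range of risk values for the same dataset rather than a single number.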
## Python environment
The following are the dependencies needed to run the code:
- pandas
- numpy
- pandas-gbq
- google-cloud-bigquery
## Usage
**Generate the merged dataset**

    python risk.py create --i_dataset <in dataset|schema> --o_dataset <out dataset|schema> --table <name> --path <bigquery-key-file> --key <patient-id-field-name> [--file ]

**Compute risk (marketer, prosecutor)**

    python risk.py compute --i_dataset <dataset> --table <name> --path <bigquery-key-file> --key <patient-id-field-name>

## Limitations
- It currently works only against BigQuery

@TODO:
- Write a transport layer (database interface)
- Support referential integrity, so that a dataset can be derived from a single selected table by following its references
- Add support for journalist risk