Data protection regulations often prohibit organizations from sharing person-specific records in an “identifiable” form, but what does it mean for data to be identifiable? In today’s information-centric society, personal data is increasingly collected, shared, and analyzed, while the computational constraints on such analysis are rapidly diminishing. As a result, legal and policy safeguards that appear strong are significantly eroded. The goal of our research is to build computational methods to determine how and when seemingly anonymous data can be re-identified to named individuals without “hacking” into private computer systems.
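To make the notion of re-identification concrete, the following is a minimal illustrative sketch of a linkage on shared demographic fields. It is not one of our published methods. The field names (`zip`, `birth_year`, `sex`), the function `link_records`, and all records are hypothetical toy values. A de-identified record is linked to a name only when its combination of fields is unique in both datasets.

```python
from collections import Counter

# Hypothetical quasi-identifiers: fields that are not explicit identifiers
# but appear in both the de-identified and the identified dataset.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def qi_key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(deidentified, identified):
    """Return (name, diagnosis) pairs for unambiguous one-to-one links.

    A link is reported only when the quasi-identifier combination occurs
    exactly once in each dataset, so the match is unambiguous.
    """
    deid_counts = Counter(qi_key(r) for r in deidentified)
    id_counts = Counter(qi_key(r) for r in identified)
    names = {qi_key(r): r["name"] for r in identified}

    links = []
    for record in deidentified:
        key = qi_key(record)
        if deid_counts[key] == 1 and id_counts.get(key, 0) == 1:
            links.append((names[key], record["diagnosis"]))
    return links


# Toy data, invented for illustration only.
deidentified = [
    {"zip": "37203", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "37203", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "37212", "birth_year": 1975, "sex": "M", "diagnosis": "influenza"},
    {"zip": "37212", "birth_year": 1975, "sex": "M", "diagnosis": "migraine"},
]
identified = [
    {"name": "Alice Example", "zip": "37203", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "37203", "birth_year": 1975, "sex": "M"},
    {"name": "Carl Example", "zip": "37203", "birth_year": 1975, "sex": "M"},
    {"name": "Dan Example", "zip": "37212", "birth_year": 1975, "sex": "M"},
]

links = link_records(deidentified, identified)
print(links)
```

In this toy example only the first record is linked. The second matches two named people, and the last two share the same field values with each other, so none of them can be resolved to a single individual. No private system is accessed; the link relies only on fields present in both datasets.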

Our current research focuses on defining and evaluating the threats associated with sharing de-identified DNA and health data derived from electronic medical records. (See Data Protection for how we are addressing these threats.)

Current Grant Support: National Human Genome Research Institute (2011-2018)

Re-identification Threats:

To what extent can data devoid of explicit identifiers, such as names or Social Security numbers, be re-identified to the individuals from whom it was derived? This line of research has led to the development of novel machine learning and data mining models. An overview of privacy and identity issues in genomics databases can be found in:

Re-identification Risk Assessments:

Though risks exist, to what extent are they realized in the real world? A summary of our review of re-identification risks in the context of HIPAA de-identification standards can be found in: