OMOP/rmd/cdmPrivacy.Rmd

---
title: "**Preserving Privacy in an OMOP CDM Implementation**"
output: 
  html_document:
        toc: TRUE
        toc_float: TRUE
---
*By Kristin Kostka*

# Background
The OMOP CDM is a person-centric model. Being person-centric means the model can retain attributes that may be considered personal identified information (PII) or protected health information (PHI). There are many different ways a site may treat their OMOP CDM to uphold their privacy protocols. In this article we provide guidance on overall process and the potential fields that should be monitored to adhere to these various privacy preserving protocols.

### Defining PII and PHI 
- PII is defined as any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred to either direct or indirect means [1].
- The United States Department of Health & Human Services´ Office for Civil Rights has defined PHI as any Personal Identifying Information  (PII) that – individually or combined – could potentially identify a specific individual, their past, present or future healthcare, or the method of payment. There are eighteen unique identifiers considered to be PHI: 1) names, 2) geographic data, 3) all elements of dates, 4) telephone numbers, 5) FAX numbers, 6) email addresses, 7) Social Security numbers (SSN), 8) medical record numbers (MRN), 9) health plan beneficiary numbers, 10) account numbers, 11) certificate/license numbers, 12) vehicle identifiers and serial numbers including license places, 13) device identifiers and serial numbers, 14) web URLs, 15) internet protocol addresses, 16) biometric identifiers (i.e. retinal scan, fingerprints), 17) full face photos and comparable images, and 18) any unique identifying number, characteristic or code. PHI is no longer considered PHI when it de-identified of these unique attributes. PHI is commonly referred to in relation to the Health Insurance Portability and Accountability Act (HIPAA) and associated legislation such as the Health Information Technology for Economic and Clinical Health Act (HITECH) [2]. 

# The Data Holder's Responsibility 
In OHDSI, it is the responsibility of each data holder to know, understand and follow local data governance processes related to use of the OMOP CDM. In the United States, these processes will follow your organization's local interpretation for maintaining compliance to PII and PHI protection. In OMOP CDM implementations containing European Union citizen data, local governance processes will include measures to comply with General Data Protection Regulation (GDPR) [3]. As a community, the OHDSI data network covers more than 330 databases from 34 countries. There is extensive community knowledge on the interpretation of rule sets and exemplar IRB and local governance workflows that can be made available to institutions navigating these processes for the first time. If your organization does not have an established data governance process, please reach out on the [OHDSI Forums](forums.ohdsi.org) under "Implementers" and the community can respond with shared guidance from their own deployments. As a community, we aim to conduct research that keeps patient-level data local and share only aggregate results.

# Complying with Privacy Preservation
Complying with local governance processes depends on the rule set being used. There may be allowable times when data use agreements and data transfer agreements exist between collaborating institutions to facilitate sharing of PII and PHI. In this section we will discuss common rule sets that organizations adhere to.

## Limited Data Sets
A limited data set (LDS) is defined as protected health information that excludes certain direct identifiers of an individual or of relatives, employers or household members of the individual — but may include city, state, ZIP code and elements of dates. A LDS can be disclosed only for purposes of research, public health or health care operations. LDS requirements are dictated by the HIPAA Privacy Rule.

## De-identified Data Sets
A de-identified data, as defined by Section 164.514(a) of the HIPAA Privacy rule, is health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information. There are two methods for achieving de-identification in accordance with HIPAA [4].

1) Expert Determination (§164.514(a))- Implementation specifications: requirements for de-identification of protected health information. A covered entity may determine that health information is not individually identifiable health information only if:
(1) A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
(i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
(ii) Documents the methods and results of the analysis that justify such determination.

2) Safe Harbor (§164.514(b)) The eighteen unique identifiers are obfuscated. This includes processes such as:
A) Dates of service are algorithmically shifted to protect patient privacy.
B) Patient ZIP codes are truncated to the first three digits or removed entirely if the ZIP code represents fewer than 20,000 individuals.
C) Removing and, when necessary, replacing unique identifiers
AND
The entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

## Field-level Implications of De-identification Processes

### PERSON Table Attributes
In the OMOP CDM, the PERSON table serves as the central identity management for all Persons in the database. It contains records that uniquely identify each person or patient, and some demographic information. It is a table that has a number of field-level implications for privacy preserving protocols.

Considerations include:
- PERSON.person_id should never contain Medical Record Number, Social Security Number or similar uniquely identifiable number. This should be a number that is essentially meaningless  but has the ability to be a primary key across tables.
- PERSON.year_of_birth, PERSON.month_of_birth and PERSON.date_of_birth, PERSON.birth_datetime may require some redaction or modification depending on interpretation of rule set. Consult local guidance on the need to modify these fields when creating compliant views of de-identified data.
- PERSON.person_source_value may contain sensitive information used to generate the person_id field. It is advised to practice caution when creating views of these data. It would be wise to obfuscate or redact this field if you are not sure what is contained in the raw information being extracted, transformed and loaded into the CDM.

## Date Fields Across Domains
Date fields are used across many OMOP domains including: OBSERVATION_PERIOD, VISIT_OCCURRENCE, VISIT_DETAIL, CONDITION_OCCURRENCE, DRUG_EXPOSURE, PROCEDURE_OCCURRENCE, DEVICE_EXPOSURE, MEASUREMENT, OBSERVATION, DEATH, NOTE, NOTE_NLP, SPECIMEN, PAYER_PLAN_PERIOD, DRUG_ERA, DOSE_ERA, and CONDITION_ERA.

As discussed previously, some rule sets may require algorithmically shifting dates. It is advised that when date shifting is applied, it is done holistically. This means that when shifting dates, you should not treat each record independently. Instead, a robust date shifting algorithm will link off the *.person_id (where * is the domain name such as CONDITION_OCCURRENCE, etc) and apply the same offset to all events. This allows researchers to have the ability to understand the sequence of events while preserving patient privacy.

The implications of not holistically shifting all events together by the same offset means that information may be out of sequence or illogical. An example would be a death record that happens prior to other event records (conditions, drugs, procedures, etc). When applying an algorithmic shift of dates, it is important to educate your OMOP CDM user group of the known offset. This is especially important in temporal studies which may be looking to make statements about disease history relative to the time when an event is observed. 

Some rule sets do not require algorithmic shifting of dates, such as Limited Data Sets. In these situations, a user of a LDS OMOP CDM would not be expecting dates to the shifted. If a shift is applied, it should be disclosed and the offset amount (e.g. +/- 7 days, +/- 30 days, etc) should be made available to those who have received permission to use a LDS dataset. Otherwise, these data are not upholding the assumptions of the rule set applied.

## LOCATION Table Attributes
The LOCATION table represents a generic way to capture physical location or address information of Persons and Care Sites. When applying privacy preserving procedures, this table should be reviewed and scrubbed relative to the rule set. The LOCATION.zip field should be redacted relative to the type of process applied (e.g. 3-digit zip  for de-identified data). The LOCATION.location_source_value should be reviewed for potential PII/PHI. It would be wise to obfuscate or redact this field if you are not sure what is contained in the raw information being extracted, transformed and loaded into the CDM.

## PROVIDER Table Attributes
The PROVIDER table contains a list of uniquely identified healthcare providers. These are individuals providing hands-on healthcare to patients, such as physicians, nurses, midwives, physical therapists etc. In some privacy preserving processes, the PROVIDER.npi and PROVIDER.dea fields may be redacted. Please review this field and confirm that you are adhering to privacy rule sets. The PROVIDER.year_of_birth field is an optional field that may also require treatment in certain rule sets.

## OBSERVATION Table Attributes
The OBSERVATION table captures clinical facts about a Person obtained in the context of examination, questioning or a procedure. Any data that cannot be represented by any other domains, such as social and lifestyle facts, medical history, family history, etc. are recorded here.

We strongly caution ETL teams to review the OBSERVATION table for potential PII/PHI. In some source systems, there can be information coming in these vocabularies that are not laboratory or clinical observations but instead are patient identifiers. If you search in ATHENA, you will find there are a number of standard terms in the SNOMED and LOINC vocabularies that can represent phone numbers, emails, and other PII information. 

It is difficult to create an exhaustive list of terms because these ontologies do not maintain or publish lists of terms that may contain patient identifiers. It is, therefore, up to data holders to perform a review of this domain with an eye for these potential privacy issues. The National COVID Cohort Collaborative, a NIH consortium which uses the OMOP CDM, has published a resource for sites needing assistance with identifying these potentially problematic records.  A “live” version of this table that will track updates over time is hosted at https://github.com/data2health/next-gen-data-sharing/blob/master/CodesWithPPIPotential.csv. We welcome additions to this list from the community.

## Scrutinizing *_source_value 
The *_source_value (where * is the domain name such as CONDITION_OCCURRENCE, etc) fields present an opportunity for sites to carry forward potential PII/PHI in ETL processes. Because the convention of the OMOP CDM has minimal boundaries on what are retained in these fields, it is important to treat all source_value fields as potentially containing PII/PHI. We highly advise all data holders to scrutinize these fields when applying privacy preserving processes. It is not uncommon for fields to be overloaded and contain potential patient identifiers. Please use caution when transmitting or making views of these fields available to users. 

## Scrutinizing String Fields
Across OMOP Domains, there are many fields which permit the use of strings (e.g. DRUG_EXPOSURE.sig, MEASUREMENT.value_as_string, OBSERVATION.value_as_string,  We've discussed some of these in prior sections. It is advised that fields with strings often have the potential to contain unintentional PII/PHI. Targeted regular expressions can be built into ETL processes in order to “sniff” out any additional PII (or potential PII)--such as any data in the format of a phone number, or a person or place name. (E.g., the regular expression "Mr\.|Mrs\.|\bMiss\b|Dr\.|, M\.?D\.?" will find any string with an English name prefix.) Depending on risk tolerance, the expressions could err toward sensitivity or specificity, and could be tweaked over time to meet different rule sets.

Extensive regular expression matching during ETL may add significant processing time and should therefore not be relied upon as a sole solution, but rather an extra protection against edge cases. Other algorithmic rules may also prove useful, such as automatically quarantining records with lengthy string values (which could signal the presence of free text). If these approaches are implemented, records that match the regular expressions or rules can be quarantined in a separate table or staging area to be manually reviewed by a data broker. Thus, in addition to adding another layer of PII protection, another advantage of these approaches is the potential to uncover ways that underlying vocabularies may be contributing to unintentional sharing of PII and create awareness for future privacy preserving processes shared across the community.

## NOTE and NOTE_NLP Table Attributes
The NOTE table captures unstructured information that was recorded by a provider about a patient in free text (in ASCII, or preferably in UTF8 format) notes on a given date. The NOTE_NLP table encodes all output of NLP on clinical notes. Each row represents a single extracted term from a note. There is a high potential these tables may retain information that is considered PII/PHI. In addition to overall string searching, these tables are likely to be dropped altogether to adhere to the most stringent rule sets.

It is highly advised that if you are conducting a study with NOTE and NOTE_NLP table information, please consult with your local governance and privacy officers to ensure compliance with local rule sets.

## Conclusion
Privacy preserving processes are not one-size fits all. There are many different rule sets that can be applied to datasets. Data holders are recommended to consult with their local privacy officer(s) to ensure all processes applied to a database are compliant with local interpretation of the selected rule set.

# References
1. 
BibText version:
@MISC{noauthor_undated-mg,
  title        = "Guidance on the Protection of Personal Identifiable
                  Information",
  abstract     = "Personal Identifiable Information (PII) is defined as:",
  howpublished = "\url{https://www.dol.gov/general/ppii}",
  note         = "Accessed: 2021-8-18"
}

Regular citation:
Guidance on the Protection of Personal Identifiable Information. [cited 18 Aug 2021]. Available: https://www.dol.gov/general/ppii

2.
BibText version:
@MISC{HIPAA_Journal2017-yo,
  title        = "What Does {PHI} Stand For?",
  author       = "{HIPAA Journal}",
  abstract     = "PHI is a term used in connection with health data, but what
                  does PHI stand for? What information is included in the
                  definition of PHI.",
  month        =  dec,
  year         =  2017,
  howpublished = "\url{https://www.hipaajournal.com/what-does-phi-stand-for/}",
  note         = "Accessed: 2021-8-18",
  language     = "en"
}

Regular citation:
HIPAA Journal. What Does PHI Stand For? 23 Dec 2017 [cited 18 Aug 2021]. Available: https://www.hipaajournal.com/what-does-phi-stand-for/

3.
BibText version:

@MISC{noauthor_2018-mt,
  title        = "General Data Protection Regulation ({GDPR}) Compliance
                  Guidelines",
  abstract     = "The EU General Data Protection Regulation went into effect on
                  May 25, 2018, replacing the Data Protection Directive
                  95/46/EC. Designed to increase data privacy for EU citizens,
                  the regulation levies steep fines on organizations that don't
                  follow the law.",
  month        =  jun,
  year         =  2018,
  howpublished = "\url{https://gdpr.eu/}",
  note         = "Accessed: 2021-8-18",
  language     = "en"
}


Regular citation:
General Data Protection Regulation (GDPR) Compliance Guidelines. 18 Jun 2018 [cited 18 Aug 2021]. Available: https://gdpr.eu/

4. 
BibText version:

@MISC{Office_for_Civil_Rights_OCR_undated-zy,
  title        = "Methods for De-identification of {PHI}",
  author       = "{Office for Civil Rights (OCR)}",
  abstract     = "Guidance about methods and approaches to achieve
                  de-identification in accordance with the Health Insurance
                  Portability and Accountability Act of 1996.",
  howpublished = "\url{https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html}",
  note         = "Accessed: 2021-8-19"
}

Regular citation:
Office for Civil Rights (OCR). Methods for De-identification of PHI. [cited 19 Aug 2021]. Available: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html