Overview

Database indices improve the performance of queries against a database by organizing the data in a way that increase query execution.

This article was written to provide guidance on the setting of indices, primary and foreign keys for data that has been transformed into the Observational Medical Outcome Partnership (OMOP) Common Data Model (CDM). The community that supports the design and development of the OHDSI/CommonDataModel Github repository is a diverse collaborative of healthcare and technical profesisonals whom have limited data base administrative (DBA) experience. As a result, the comments below should be interpreted as suggestions and recommendations to help increase performance. Your teams needs may call for a modified configuration.

General Recommendations

Should your database of choice support indexing, the OMOP CDM Working Group recommends

  • Indexing on all columns containing an “_id” (e.g. condition_occurrence_id, drug_exposure_id, measurement_id, procedure_occurrence_id, etc.)
  • Indexing on primary and foreign keys

For all databases, regardless of custom indice support, primary and foreign keys should be set. This is a step towards ensuring data integrity. Information on what table level attributes should be set as primary and foreign keys can be found within the *_Field_Level.csv file(s) located in the INST/CSV directory

Database support

The OHDSI/CommonDataModel package leverages OHDSI/SQLRender and as a result is only capable of supporting sources that are supported by OHDSI/SQLRender. The following databases are currently supported.

Microsoft SQL Server

Oracle

PostgreSQL

Amazon Redshift

On AWS Redshift it is important to ensure that your data is properly distributed and sorted across nodes. Compression on certain columns may also help. The designed DDL does set DISTKEYS in an effort to optimize performance. This configuration can be seen within the Redshift-specific DDL.

Impala

IBM Netezza

Google BigQuery

Google BigQuery does not require manual optimization and/or sizing. Google BigQuery does massive parallel full table scans and intensive caching, all under the hood. Reference

Microsoft Parallel Data Warehouse (PDW)

SQLite

Databricks