
Commit 43c9131

Author: Kevin Armengol
Revamped README and ipynb. Minor modifications to pipelines to support hydra_search.

1 parent: 4bad23f

20 files changed: +2197 −531 lines

README.md

Lines changed: 77 additions & 32 deletions

@@ -1,6 +1,18 @@
 # data-dictionary-cui-mapping
 
-This package allows you to load in a data dictionary and semi-automatically query appropriate UMLS concepts using any of the UMLS API, the MetaMap API, and/or semantic search through a custom Pinecone vector database.
+This package assists with mapping a user's data dictionary fields to [UMLS](https://www.nlm.nih.gov/research/umls/index.html) concepts. It is designed to be modular and flexible to allow for different configurations and use cases.
+
+Roughly, the high-level steps are as follows:
+
+- Configure yaml files
+- Load in data dictionary
+- Preprocess desired columns
+- Query for UMLS concepts using any or all of the following pipeline modules:
+  - **umls** (*UMLS API*)
+  - **metamap** (*MetaMap API*)
+  - **semantic_search** (*relies on access to a custom Pinecone vector database*)
+  - **hydra_search** (*combines any combination of the above three modules*)
+- Manually curate/select concepts in Excel
+- Create data dictionary file with new UMLS concept fields
 
 ## Prerequisites
 
@@ -9,7 +21,7 @@ This package allows you to load in a data dictionary and semi-automatically quer
 ## Installation
 
-Use the package manager [pip](https://pip.pypa.io/en/stable/) to install data-dictionary-cui-mapping or pip install from the GitHub repo.
+Use the package manager [pip](https://pip.pypa.io/en/stable/) to install [data-dictionary-cui-mapping](https://pypi.org/project/data-dictionary-cui-mapping/) from PyPI or pip install from the [GitHub repo](https://github.com/kevon217/data-dictionary-cui-mapping). The project uses [poetry](https://python-poetry.org/) for packaging and dependency management.
 
 ```bash
 pip install data-dictionary-cui-mapping
@@ -18,7 +30,7 @@ pip install data-dictionary-cui-mapping
 ## Input: Data Dictionary
 
-Below is a sample data dictionary format that can be used as input for this package.
+Below is a sample data dictionary format (*.csv*) that can be used as input for this package:
 
 | variable name | title | permissible value descriptions |
 | ------------- | ---------------------- | ------------------------------ |
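A dictionary in this format can be loaded and inspected with pandas before running any pipeline. The sketch below is illustrative only; the variable names and values are hypothetical, not taken from the package:

```python
import io
import pandas as pd

# Hypothetical rows matching the sample data dictionary format above:
# one row per data element, with delimiter-separated permissible value descriptions.
csv_text = """variable name,title,permissible value descriptions
AgeYrs,Age in years,
CaseContrlInd,Case control indicator,Case;Control;Unknown
"""

df_dd = pd.read_csv(io.StringIO(csv_text))
print(df_dd.columns.tolist())
print(len(df_dd))  # 2 data elements
```

In practice you would point the yaml configuration at your own `.csv` file rather than an in-memory string.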
@@ -51,60 +63,93 @@ In order to run and customize these pipelines, you will need to create/edit yaml
 │ │ │ embeddings.yaml
 ```
 
-## UMLS API and MetaMap Batch Queries
+## CUI Batch Query Pipelines
 
-#### Import modules
+### STEP-1A: RUN BATCH QUERY PIPELINE
+
+###### IMPORT PACKAGES
 
 ```python
-# import batch_query_pipeline modules from metamap OR umls package
-from ddcuimap.metamap import batch_query_pipeline as mm_bqp
-from ddcuimap.umls import batch_query_pipeline as umls_bqp
+# from ddcuimap.umls import batch_query_pipeline as umls_bqp
+# from ddcuimap.metamap import batch_query_pipeline as mm_bqp
+# from ddcuimap.semantic_search import batch_hybrid_query_pipeline as ss_bqp
+from ddcuimap.hydra_search import batch_hydra_query_pipeline as hs_bqp
 
-# import helper functions for loading, viewing, composing configurations for pipeline run
 from ddcuimap.utils import helper
 from omegaconf import OmegaConf
-
-# import modules to create data dictionary with curated CUIs and check the file for missing mappings
-from ddcuimap.curation import create_dictionary_import_file
-from ddcuimap.curation import check_cuis
 ```
-#### Load/edit configuration files
+
+###### LOAD/EDIT CONFIGURATION FILES
+
 ```python
-cfg = helper.compose_config.fn(overrides=["custom=de", "apis=config_metamap_api"])  # custom config for MetaMap on data element 'title' column
-# cfg = helper.compose_config.fn(overrides=["custom=de", "apis=config_umls_api"])  # custom config for UMLS API on data element 'title' column
-# cfg = helper.compose_config.fn(overrides=["custom=pvd", "apis=config_metamap_api"])  # custom config for MetaMap on 'permissible value descriptions' column
-# cfg = helper.compose_config.fn(overrides=["custom=pvd", "apis=config_umls_api"])  # custom config for UMLS API on 'permissible value descriptions' column
-cfg.apis.user_info.email = ''  # enter your email
-cfg.apis.user_info.apiKey = ''  # enter your api key
-print(OmegaConf.to_yaml(cfg))
+cfg_hydra = helper.compose_config.fn(overrides=["custom=hydra_base"])
+# cfg_umls = helper.compose_config.fn(overrides=["custom=de", "apis=config_umls_api"])
+cfg_mm = helper.compose_config.fn(overrides=["custom=de", "apis=config_metamap_api"])
+cfg_ss = helper.compose_config.fn(
+    overrides=[
+        "custom=title_def",
+        "semantic_search=embeddings",
+        "apis=config_pinecone_api",
+    ]
+)
+
+# # UMLS API CREDENTIALS
+# cfg_umls.apis.umls.user_info.apiKey = ''
+# cfg_umls.apis.umls.user_info.email = ''
+
+# # MetaMap API CREDENTIALS
+# cfg_mm.apis.metamap.user_info.apiKey = ''
+# cfg_mm.apis.metamap.user_info.email = ''
+
+# # Pinecone API CREDENTIALS
+# cfg_ss.apis.pinecone.index_info.apiKey = ''
+# cfg_ss.apis.pinecone.index_info.environment = ''
+
+print(OmegaConf.to_yaml(cfg_hydra))
 ```
 
-#### Step 1: Run batch query pipeline
+###### RUN BATCH QUERY PIPELINE
+
 ```python
-df_final_mm = mm_bqp.run_mm_batch(cfg)  # run MetaMap batch query pipeline
-# df_final_umls = umls_bqp.run_umls_batch(cfg)  # run UMLS API batch query pipeline
+# df_umls, cfg_umls = umls_bqp.run_umls_batch(cfg_umls)
+# df_mm, cfg_mm = mm_bqp.run_mm_batch(cfg_mm)
+# df_ss, cfg_ss = ss_bqp.run_hybrid_ss_batch(cfg_ss)
+df_hydra, cfg_step1 = hs_bqp.run_hydra_batch(cfg_hydra, cfg_umls=None, cfg_mm=cfg_mm, cfg_ss=cfg_ss)
+
+print(df_hydra.head())
 ```
 
-#### Step 2: Manual curation step in excel file
+### STEP-1B: MANUAL CURATION STEP IN EXCEL
 
+###### CURATION/SELECTION
+
 *see curation example in* ***notebooks/examples_files/DE_Step-1_curation_keepCol.xlsx***
 
-#### Step 3: Create data dictionary import file
+### STEP-2A: CREATE DATA DICTIONARY IMPORT FILE
+
+###### IMPORT CURATION MODULES
+
+```python
+from ddcuimap.curation import create_dictionary_import_file
+from ddcuimap.curation import check_cuis
+from ddcuimap.utils import helper
+```
+
+###### CREATE DATA DICTIONARY IMPORT FILE
 
 ```python
-cfg = helper.load_config.fn(helper.choose_file.fn("Load config file from Step 1"))
-create_dictionary_import_file.create_dd_file(cfg)
+cfg_step1 = helper.load_config.fn(helper.choose_file("Load config file from Step 1"))
+df_dd = create_dictionary_import_file.create_dd_file(cfg_step1)
+print(df_dd.head())
 ```
 
-#### Step 4: Check curated CUI mappings
+### STEP-2B: CHECK CUIS IN DATA DICTIONARY IMPORT FILE
 
+###### CHECK CUIS
+
 ```python
-cfg = helper.load_config.fn(helper.choose_file.fn("Load config file from Step 2"))
-check_cuis.check_cuis(cfg)
+cfg_step2 = helper.load_config.fn(helper.choose_file("Load config file from Step 2"))
+df_check = check_cuis.check_cuis(cfg_step2)
+print(df_check.head())
 ```
 
 ## Output: Data Dictionary + CUIs
-Below is the final output of the data dictionary with curated CUIs.
+Below is a sample modified data dictionary with curated CUIs after:
+
+1. Running Steps 1-2 on **title**, then taking the generated output dictionary file and
+2. Running Steps 1-2 again on **permissible value descriptions** to get the final output dictionary file.
 
 | variable name | title | data element concept identifiers | data element concept names | data element terminology sources | permissible values | permissible value descriptions | permissible value output codes | permissible value concept identifiers | permissible value concept names | permissible value terminology sources |
 | ------------- | ---------------------- | -------------------------------- | -------------------------- | -------------------------------- | ------------------ | ------------------------------ | ------------------------------ | ------------------------------------- | ------------------------------- | ------------------------------------- |

ddcuimap/configs/config.yaml

Lines changed: 1 addition & 3 deletions

@@ -4,7 +4,5 @@ defaults:
   - config_umls_api
   - config_metamap_api
   - config_pinecone_api
-  - custom:
-      - de
-      - title_def
+  - custom: null
   - semantic_search: null

ddcuimap/configs/custom/de.yaml

Lines changed: 1 addition & 1 deletion

@@ -13,7 +13,7 @@ data_dictionary_settings:
 
 preprocessing_settings:
   remove_stopwords : true
-  stopwords_filepath: 'C:\\Users\\armengolkm\\Desktop\\Full Pipeline Test v1.1.0\\MetaMap_Settings_StopWords.csv'
+  stopwords_filepath:
   use_cheatsheet : false
   cheatsheet_filepath:

ddcuimap/configs/custom/hydra_base.yaml

Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ data_dictionary_settings:
 
 preprocessing_settings:
   remove_stopwords :
-  stopwords_filepath: 'C:\\Users\\armengolkm\\Desktop\\Full Pipeline Test v1.1.0\\MetaMap_Settings_StopWords.csv'
+  stopwords_filepath:
   use_cheatsheet :
   cheatsheet_filepath:

ddcuimap/configs/custom/pvd.yaml

Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ data_dictionary_settings:
 
 preprocessing_settings:
   remove_stopwords : true
-  stopwords_filepath: 'C:\\Users\\armengolkm\\Desktop\\Full Pipeline Test v1.1.0\\MetaMap_Settings_StopWords.csv'
+  stopwords_filepath:
   use_cheatsheet : false
   cheatsheet_filepath:

ddcuimap/configs/custom/title_def.yaml

Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ data_dictionary_settings:
 
 preprocessing_settings:
   remove_stopwords : false
-  stopwords_filepath: 'C:\\Users\\armengolkm\\Desktop\\Full Pipeline Test v1.1.0\\MetaMap_Settings_StopWords.csv'
+  stopwords_filepath:
   use_cheatsheet : false
   cheatsheet_filepath:

ddcuimap/curation/utils/curation_functions.py

Lines changed: 11 additions & 2 deletions

@@ -226,11 +226,20 @@ def concat_cols_umls(df, umls_columns: list):
 # @task(name="Reordering examples dictionary columns")
 def reorder_cols(df, order: list):
     """Reorder columns"""
-
-    df = df[order]
+    order_exists = keep_existing_cols(df.columns, order)
+    df = df[order_exists]
     return df
 
 
+def keep_existing_cols(df_cols, cols_to_check: list):
+    """Keep only the requested columns that exist, preserving the requested order"""
+    cols_excl = list(set(cols_to_check).difference(df_cols))
+    cols = [x for x in cols_to_check if x not in cols_excl]
+    print(f"The following columns were not found and will be excluded: {cols_excl}")
+    return cols
+
+
 @task(name="Manual override of column values")
 def override_cols(df, override: dict):
     """Custom function to accommodate current bug in BRICS examples dictionary import process that wants multi-CUI concepts to have a single source terminology
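The `reorder_cols` change above guards against requesting columns that are not present in the curation DataFrame. A standalone sketch of the same idea (pandas only; the prefect decorators and surrounding module are omitted):

```python
import pandas as pd

def keep_existing_cols(df_cols, cols_to_check):
    """Keep only the requested columns that exist, preserving the requested order."""
    cols_excl = [c for c in cols_to_check if c not in df_cols]
    cols = [c for c in cols_to_check if c in df_cols]
    print(f"The following columns were not found and will be excluded: {cols_excl}")
    return cols

def reorder_cols(df, order):
    """Reorder columns, silently dropping any requested column the frame lacks."""
    return df[keep_existing_cols(df.columns, order)]

df = pd.DataFrame({"b": [1], "a": [2]})
out = reorder_cols(df, ["a", "b", "missing"])
print(list(out.columns))  # ['a', 'b']
```

Without the guard, `df[order]` raises a `KeyError` as soon as one requested column is absent, which is why the commit routes the order list through the existence check first.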

ddcuimap/utils/process_data_dictionary.py renamed to ddcuimap/curation/utils/process_data_dictionary.py

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 import pandas as pd
 from prefect import flow, task
 
-from . import helper as helper
+from ddcuimap.utils import helper as helper
 from . import text_processing as tp

File renamed without changes.

ddcuimap/hydra_search/batch_hydra_query_pipeline.py

Lines changed: 49 additions & 36 deletions

@@ -9,13 +9,13 @@
 from pathlib import Path
 
 import ddcuimap.utils.helper as helper
-import ddcuimap.utils.process_data_dictionary as proc_dd
+import ddcuimap.curation.utils.process_data_dictionary as proc_dd
 import ddcuimap.curation.utils.curation_functions as cur
 import ddcuimap.umls.batch_query_pipeline as umls
 import ddcuimap.metamap.batch_query_pipeline as mm
 import ddcuimap.semantic_search.batch_hybrid_query_pipeline as ss
 
-cfg = helper.compose_config.fn(overrides=["custom=hydra_base"])
+cfg_hydra = helper.compose_config.fn(overrides=["custom=hydra_base"])
 cfg_umls = helper.compose_config.fn(overrides=["custom=de", "apis=config_umls_api"])
 cfg_mm = helper.compose_config.fn(overrides=["custom=de", "apis=config_metamap_api"])
 cfg_ss = helper.compose_config.fn(
@@ -32,73 +32,86 @@
     flow_run_name="Running UMLS/MetaMap/Semantic Search hydra search pipeline",
     log_prints=True,
 )
-def run_hydra_batch(cfg, cfg_umls, cfg_mm, cfg_ss, **kwargs):
+def run_hydra_batch(cfg_hydra, **kwargs):
     # LOAD DATA DICTIONARY FILE
-    df_dd, fp_dd = proc_dd.load_data_dictionary(cfg)
+    df_dd, fp_dd = proc_dd.load_data_dictionary(cfg_hydra)
 
     # CREATE STEP 1 DIRECTORY
     dir_step1 = helper.create_folder.fn(
         Path(fp_dd).parent.joinpath(
-            f"{cfg.custom.curation_settings.file_settings.directory_prefix}_Step-1_Hydra-search"
+            f"{cfg_hydra.custom.curation_settings.file_settings.directory_prefix}_Step-1_Hydra-search"
         )
     )
 
+    # STORE PIPELINE RESULTS
+    cat_dfs = []
+
     ## UMLS API ##
-    dir_step1_umls = helper.create_folder(
-        Path(dir_step1).joinpath(
-            f"{cfg.custom.curation_settings.file_settings.directory_prefix}_Step-1_umls-api-search"
+    cfg_umls = kwargs.get("cfg_umls")
+    if cfg_umls:
+        dir_step1_umls = helper.create_folder(
+            Path(dir_step1).joinpath(
+                f"{cfg_hydra.custom.curation_settings.file_settings.directory_prefix}_Step-1_umls-api-search"
+            )
         )
-    )
-    df_umls, cfg_umls = umls.run_umls_batch(
-        cfg_umls, df_dd=df_dd, dir_step1=dir_step1_umls
-    )
+        df_umls, cfg_umls = umls.run_umls_batch(
+            cfg_umls, df_dd=df_dd, dir_step1=dir_step1_umls
+        )
+        cat_dfs.append(df_umls)
 
     ## METAMAP API ##
-    dir_step1_mm = helper.create_folder(
-        Path(dir_step1).joinpath(
-            f"{cfg.custom.curation_settings.file_settings.directory_prefix}_Step-1_metamap-search"
+    cfg_mm = kwargs.get("cfg_mm")
+    if cfg_mm:
+        dir_step1_mm = helper.create_folder(
+            Path(dir_step1).joinpath(
+                f"{cfg_hydra.custom.curation_settings.file_settings.directory_prefix}_Step-1_metamap-search"
+            )
         )
-    )
-    df_metamap, cfg_mm = mm.run_mm_batch(cfg_mm, df_dd=df_dd, dir_step1=dir_step1_mm)
+        df_metamap, cfg_mm = mm.run_mm_batch(
+            cfg_mm, df_dd=df_dd, dir_step1=dir_step1_mm
+        )
+        cat_dfs.append(df_metamap)
 
     ## SEMANTIC SEARCH ##
-
-    dir_step1_ss = helper.create_folder(
-        Path(dir_step1).joinpath(
-            f"{cfg.custom.curation_settings.file_settings.directory_prefix}_Step-1_hybrid-semantic-search_alpha={cfg_ss.semantic_search.query.alpha}"
+    cfg_ss = kwargs.get("cfg_ss")
+    if cfg_ss:
+        dir_step1_ss = helper.create_folder(
+            Path(dir_step1).joinpath(
+                f"{cfg_hydra.custom.curation_settings.file_settings.directory_prefix}_Step-1_hybrid-semantic-search_alpha={cfg_ss.semantic_search.query.alpha}"
+            )
        )
-    )
-    df_semantic_search, cfg_ss = ss.run_hybrid_ss_batch(
-        cfg_ss, df_dd=df_dd, dir_step1=dir_step1_ss
-    )
+        df_semantic_search, cfg_ss = ss.run_hybrid_ss_batch(
+            cfg_ss, df_dd=df_dd, dir_step1=dir_step1_ss
+        )
+        cat_dfs.append(df_semantic_search)
 
     ## COMBINE RESULTS ##
 
-    df_results = pd.concat(
-        [df_umls, df_metamap, df_semantic_search], axis=0, ignore_index=True
-    )
+    df_results = pd.concat(cat_dfs, axis=0, ignore_index=True)
     df_results.to_csv(Path(dir_step1).joinpath("hydra_search_results.csv"), index=False)
 
     # FORMAT CURATION DATAFRAME
-    df_dd_preprocessed = proc_dd.process_data_dictionary(df_dd, cfg)
-    pipeline_name = f"hydra-search (custom={cfg.custom.settings.custom_config})"
+    df_dd_preprocessed = proc_dd.process_data_dictionary(df_dd, cfg_hydra)
+    pipeline_name = f"hydra-search (custom={cfg_hydra.custom.settings.custom_config})"
     df_curation = cur.format_curation_dataframe(
-        df_dd, df_dd_preprocessed, pipeline_name, cfg
+        df_dd, df_dd_preprocessed, pipeline_name, cfg_hydra
     )
-    curation_cols = list(cfg.custom.curation_settings.information_columns) + [
+    curation_cols = list(cfg_hydra.custom.curation_settings.information_columns) + [
         "search_ID"
     ]
     df_curation = df_curation[curation_cols]
 
     ## CREATE CURATION FILE ##
     df_final = cur.create_curation_file(
-        dir_step1, df_dd, df_dd_preprocessed, df_curation, df_results, cfg
+        dir_step1, df_dd, df_dd_preprocessed, df_curation, df_results, cfg_hydra
    )
-    helper.save_config(cfg, dir_step1)
+    helper.save_config(cfg_hydra, dir_step1)
     print("FINISHED batch hydra search query pipeline!!!")
 
-    return df_final
+    return df_final, cfg_hydra
 
 
 if __name__ == "__main__":
-    df_final = run_hydra_batch(cfg, cfg_umls, cfg_mm, cfg_ss)
+    df_final, cfg_hydra = run_hydra_batch(
+        cfg_hydra, cfg_umls=cfg_umls, cfg_mm=cfg_mm, cfg_ss=cfg_ss
+    )  # TODO: maybe put module cfgs into a list
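The refactor above makes each sub-pipeline optional: a module runs only when its config is passed as a keyword argument, and whatever ran gets pooled with `pd.concat`. That pattern can be sketched independently of the ddcuimap modules; the runner and config names below are stand-ins, not the package's real API:

```python
import pandas as pd

def run_fake_module(name):
    # Stand-in for a real sub-pipeline such as run_umls_batch / run_mm_batch
    return pd.DataFrame({"search_ID": [f"{name}-1"], "pipeline": [name]})

def run_hydra_like(**kwargs):
    """Run only the modules whose configs were supplied, then pool results."""
    cat_dfs = []
    for key in ("cfg_umls", "cfg_mm", "cfg_ss"):
        if kwargs.get(key):  # module is skipped when its config is None or absent
            cat_dfs.append(run_fake_module(key))
    return pd.concat(cat_dfs, axis=0, ignore_index=True)

df = run_hydra_like(cfg_umls=None, cfg_mm={"on": True}, cfg_ss={"on": True})
print(df["pipeline"].tolist())  # ['cfg_mm', 'cfg_ss']
```

Concatenating only the frames that were actually produced is what lets the old hard-coded `pd.concat([df_umls, df_metamap, df_semantic_search], ...)` call be replaced without risking a `NameError` when a module is skipped.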
