11 datasets found

i2b2-query-data-1.0
huggingface.co
Updated Sep 10, 2023
Cite
Nicholai (2023). i2b2-query-data-1.0 [Dataset]. https://huggingface.co/datasets/nmitchko/i2b2-query-data-1.0
Explore at: Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 10, 2023
Authors
Nicholai
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
i2b2 query data 1.0

This is a dataset of i2b2 query builder examples that are taken from a test environment of i2b2 and then pre-processed with AI descriptions.
Data from: Advancing clinical cohort selection with genomics analysis on a...
figshare.com
txt
Updated Feb 3, 2020
Cite
Jaclyn Smith; Melvin Lathara; Hollis Wright; Brian Hill; Ganapati Srinivasa; Christopher T Denny (2020). Advancing clinical cohort selection with genomics analysis on a distributed platform [Dataset]. http://doi.org/10.6084/m9.figshare.11796126.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.11796126.v1
Dataset updated
Feb 3, 2020
Dataset provided by
figshare
Authors
Jaclyn Smith; Melvin Lathara; Hollis Wright; Brian Hill; Ganapati Srinivasa; Christopher T Denny
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw runtimes for metadata associated with the Advancing clinical cohort selection with genomics analysis on a distributed platform manuscript. Markdown used to generate plots at: https://github.com/OmicsDataAutomation/i2b2-oda-framework/blob/master/genomicsdb/results.Rmd.
GC-HBOC database explorer
health-atlas.de
Updated Jun 11, 2021
Cite
Christoph Engel; Silke Zachariae (2021). GC-HBOC database explorer [Dataset]. https://www.health-atlas.de/data_files/403
Dataset updated
Jun 11, 2021
Authors
Christoph Engel; Silke Zachariae
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset with information on cancer history, mutation status and surveillance history for more than 100,000 study patients is provided in i2b2 (Informatics for Integrating Biology and the Bedside, http://www.i2b2.org/software). Members of the German Consortium for Hereditary Breast and Ovarian Cancer can request access to i2b2 and will then be able to run database queries independently, e.g. to identify suitable patient populations for scientific evaluation projects.
Smoking NLP Challenge Data
scicrunch.org
neuinfo.org
+2 more
Updated Mar 7, 2024
Cite
(2024). Smoking NLP Challenge Data [Dataset]. http://identifiers.org/RRID:SCR_008644
Unique identifier
https://identifiers.org/RRID:SCR_008644
Dataset updated
Mar 7, 2024
Description
The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare, which were preprocessed, converted into XML format, and separated into training and test sets. I2B2 is a data warehouse containing clinical data on over 150k patients, including outpatient DX, lab results, medications, and inpatient procedures. ETL processes were authored to pull data from EMR and finance systems. Institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists, who classified patients into Past Smokers, Current Smokers, Smokers, Non-smokers, and Unknown. Second-hand smokers were considered non-smokers. Other institutions involved include the Massachusetts Institute of Technology and the State University of New York at Albany. i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects) it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the Second i2b2 Challenge will be released on the one-year anniversary of that Challenge (November, 2010).
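As a rough illustration of the label scheme described above, the five categories and the rule that second-hand smokers count as non-smokers could be encoded as follows. This is only a sketch: the names `SmokingStatus` and `normalize_status`, the raw label spellings, and the choice to map unrecognized labels to Unknown are assumptions made for this example, not part of the challenge data or its XML schema.

```python
from enum import Enum


class SmokingStatus(Enum):
    """The five categories named in the dataset description."""
    PAST_SMOKER = "Past Smoker"
    CURRENT_SMOKER = "Current Smoker"
    SMOKER = "Smoker"
    NON_SMOKER = "Non-smoker"
    UNKNOWN = "Unknown"


# Hypothetical raw label spellings; the real annotation files may differ.
_ALIASES = {
    "past smoker": SmokingStatus.PAST_SMOKER,
    "current smoker": SmokingStatus.CURRENT_SMOKER,
    "smoker": SmokingStatus.SMOKER,
    "non-smoker": SmokingStatus.NON_SMOKER,
    # Rule stated in the description: second-hand smokers are non-smokers.
    "second-hand smoker": SmokingStatus.NON_SMOKER,
    "unknown": SmokingStatus.UNKNOWN,
}


def normalize_status(raw: str) -> SmokingStatus:
    """Map a free-text label to a category.

    Assumption: labels that are not recognized fall back to UNKNOWN.
    """
    key = " ".join(raw.strip().lower().split())
    return _ALIASES.get(key, SmokingStatus.UNKNOWN)
```

For example, `normalize_status("Second-hand smoker")` returns `SmokingStatus.NON_SMOKER`.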
Combining clinical and genomics queries using i2b2 – Three methods - Table 1...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Shawn N. Murphy; Paul Avillach; Riccardo Bellazzi; Lori Phillips; Matteo Gabetta; Alal Eran; Michael T. McDuffie; Isaac S. Kohane (2023). Combining clinical and genomics queries using i2b2 – Three methods - Table 1 [Dataset]. http://doi.org/10.1371/journal.pone.0172187.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0172187.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Shawn N. Murphy; Paul Avillach; Riccardo Bellazzi; Lori Phillips; Matteo Gabetta; Alal Eran; Michael T. McDuffie; Isaac S. Kohane
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Combining clinical and genomics queries using i2b2 – Three methods - Table 1
Data from: Advancing clinical cohort selection with genomics analysis on a...
plos.figshare.com
docx
Updated May 30, 2023
Cite
Jaclyn M. Smith; Melvin Lathara; Hollis Wright; Brian Hill; Nalini Ganapati; Ganapati Srinivasa; Christopher T. Denny (2023). Advancing clinical cohort selection with genomics analysis on a distributed platform [Dataset]. http://doi.org/10.1371/journal.pone.0231826
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0231826
Dataset updated
May 30, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Jaclyn M. Smith; Melvin Lathara; Hollis Wright; Brian Hill; Nalini Ganapati; Ganapati Srinivasa; Christopher T. Denny
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The affordability of next-generation genomic sequencing and the improvement of medical data management have contributed largely to the evolution of biological analysis from both a clinical and research perspective. Precision medicine is a response to these advancements that places individuals into better-defined subsets based on shared clinical and genetic features. The identification of personalized diagnosis and treatment options is dependent on the ability to draw insights from large-scale, multi-modal analysis of biomedical datasets. Driven by a real use case, we premise that platforms that support precision medicine analysis should maintain data in their optimal data stores, should support distributed storage and query mechanisms, and should scale as more samples are added to the system. We extended a genomics-based columnar data store, GenomicsDB, for ease of use within a distributed analytics platform for clinical and genomic data integration, known as the ODA framework. The framework supports interaction from an i2b2 plugin as well as a notebook environment. We show that the ODA framework exhibits worst-case linear scaling for array size (storage), import time (data construction), and query time for an increasing number of samples. We go on to show worst-case linear time for both import of clinical data and aggregate query execution time within a distributed environment. This work highlights the integration of a distributed genomic database with a distributed compute environment to support scalable and efficient precision medicine queries from a HIPAA-compliant, cohort system in a real-world setting. The ODA framework is currently deployed in production to support precision medicine exploration and analysis from clinicians and researchers at UCLA David Geffen School of Medicine.
SDTM datasets of clinical data and measurements for selected cancer...
dev.cancerimagingarchive.net
csv, n/a, xpt
Cite
The Cancer Imaging Archive, SDTM datasets of clinical data and measurements for selected cancer collections to TCIA [Dataset]. http://doi.org/10.7937/TCIA.2019.zfv154m9
Available download formats: n/a, xpt, csv
Unique identifier
https://doi.org/10.7937/TCIA.2019.zfv154m9
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
Jun 21, 2019
Dataset funded by
National Cancer Institute: http://www.cancer.gov/
Description
The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying 4 TCIA breast cancer collections (Multi-center breast DCE-MRI data and segmentations from patients in the I-SPY 1/ACRIN 6657 trials (ISPY1), BREAST-DIAGNOSIS, Single site breast DCE-MRI data and segmentations from patients undergoing neoadjuvant chemotherapy (Breast-MRI-NACT-Pilot), The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA)) and the Ivy Glioblastoma Atlas Project (IvyGAP) brain cancer collection. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data imported into I2B2 were cross-mapped to industry-standard concepts for names and values including those derived from BRIDG, CDISC SDTM, DICOM Structured Reporting models and using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 in SDTM-compliant SAS transport files. The SDTM data was derived from data taken from both the curated TCIA spreadsheets and tumor measurements and dates from the TCIA RESTful API. Due to the nature of the available data, not all SDTM conformance rules were applicable or adhered to. These Study Data Tabulation Model format (SDTM) datasets were validated using Pinnacle 21 CDISC validation software. The validation software reviews datasets according to their degree of conformance to rules developed for the purposes of FDA submissions of electronic data. Iterative refinements were made to the datasets based upon group discussions and feedback from the validation tool. Export datasets for the following SDTM domains were generated:

DM (Demographics)

DS (Disposition)

MI (Microscopic Findings)

PR (Procedures)

SS (Subject Status)

TU (Tumor/Lesion Identification)

TR (Tumor/Lesion Results)
FURTHeR
scicrunch.org
rrid.site
Updated Jan 25, 2008
Cite
(2008). FURTHeR [Dataset]. http://identifiers.org/RRID:SCR_006383
Unique identifier
https://identifiers.org/RRID:SCR_006383
Dataset updated
Jan 25, 2008
Description
Data and knowledge management infrastructure for the new Center for Clinical and Translational Science (CCTS) at the University of Utah. This clinical cohort search tool is used to search across the University of Utah clinical data warehouse and the Utah Population Database for people who satisfy various criteria of the researchers. It uses the i2b2 front end but has a set of terminology servers, metadata servers and federated query tool as the back end systems. FURTHeR does on-the-fly translation of search terms and data models across the source systems and returns a count of results by unique individuals. They are extending the set of databases that can be queried.
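The behaviour described above (translate a search term for each source system, query each source, return a count of unique individuals) can be sketched in a few lines. This is a toy illustration, not FURTHeR's implementation: the function name `federated_unique_count`, the translation table, and the in-memory sources are all hypothetical, and the sketch assumes person identifiers are already linked across sources.

```python
from typing import Callable, Dict, Iterable, Set

# A source accepts a term in its own vocabulary and yields person identifiers.
Source = Callable[[str], Iterable[str]]


def federated_unique_count(
    term: str,
    sources: Dict[str, Source],
    translations: Dict[str, Dict[str, str]],
) -> int:
    """Count unique individuals matching `term` across all sources.

    Assumption: identifiers are shared across sources, so a person found
    in two sources is counted once.
    """
    seen: Set[str] = set()
    for name, query in sources.items():
        # Translate the researcher's term into the source's own vocabulary.
        local_term = translations.get(name, {}).get(term)
        if local_term is None:
            continue  # this source has no mapping for the term
        seen.update(query(local_term))
    return len(seen)


# Toy in-memory stand-ins for two source systems (hypothetical codes and IDs).
warehouse = {"WH_DIABETES": ["p1", "p2"]}
population_db = {"PDB_DIABETES": ["p2", "p3"]}

sources = {
    "warehouse": lambda t: warehouse.get(t, []),
    "population_db": lambda t: population_db.get(t, []),
}
translations = {
    "warehouse": {"diabetes": "WH_DIABETES"},
    "population_db": {"diabetes": "PDB_DIABETES"},
}

count = federated_unique_count("diabetes", sources, translations)
```

Here `count` is 3, because person `p2` appears in both sources but is counted once.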
Weekly supervised Multilingual Data Set to train Named Entity Recognition...
zenodo.org
Updated Apr 16, 2025
Cite
Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc (2025). Weekly supervised Multilingual Data Set to train Named Entity Recognition for Symptom Extraction [Dataset]. http://doi.org/10.5281/zenodo.13918009
Unique identifier
https://doi.org/10.5281/zenodo.13918009
Dataset updated
Apr 16, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Sets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom extraction NER models.

Supported Languages and dataset locations for the specific language:

English (base language): https://huggingface.co/HUMADEX/english_medical_ner
German: https://huggingface.co/HUMADEX/german_medical_ner
Italian: https://huggingface.co/HUMADEX/italian_medical_ner
Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
Greek: https://huggingface.co/HUMADEX/german_medical_ner
Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
Polish: https://huggingface.co/HUMADEX/polish_medical_ner
Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

Dataset Building

Data Integration and Preprocessing

Data Cleaning

Annotation with Stanza's i2b2 Clinical Model

Translation into the targeted language

Word Alignment

Data Augmentation

Acknowledgement
This dataset was created as part of joint research by the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

Authors:
dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc

Please cite as:

Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
DOI: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1
DICOM SR of clinical data and measurement for breast cancer collections to...
cancerimagingarchive.net
dicom, n/a
Cite
The Cancer Imaging Archive, DICOM SR of clinical data and measurement for breast cancer collections to TCIA [Dataset]. http://doi.org/10.7937/TCIA.2019.wgllssg1
Available download formats: dicom, n/a
Unique identifier
https://doi.org/10.7937/TCIA.2019.wgllssg1
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 26, 2020
Dataset funded by
National Cancer Institute: http://www.cancer.gov/
Description
The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying various TCIA breast cancer collections. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data were imported into I2B2 and cross-mapped to industry-standard concepts for names and values including those derived from BRIDG, CDISC SDTM, DICOM Structured Reporting models and using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 to CSV and thence converted to DICOM SR according to the DICOM Breast Imaging Report template [1], which supports description of patient characteristics, histopathology, receptor status and clinical findings including measurements. The purpose was not to advocate DICOM SR as an appropriate format for interchange or storage of such information for query purposes, but rather to demonstrate that data using standard concepts harmonized across multiple collections could be transformed into an existing standard report representation. The DICOM SR can be stored and used together with the images in repositories such as TCIA and in image viewers that support rendering of DICOM SR content. During the project, various deficiencies in the DICOM Breast Imaging Report template were identified with respect to describing breast MR studies, laterality of findings versus procedures, more recently developed receptor types, and patient characteristics and status. These were addressed via DICOM CP 1838, finalized in Jan 2019, and this subset reflects those changes.
[1] DICOM Breast Imaging Report Templates, available from: http://dicom.nema.org/medical/dicom/current/output/chtml/part16/sect_BreastImagingReportTemplates.html
Data from: RadCoref: Fine-tuning coreference resolution for different styles...
physionet.org
Updated Jan 30, 2024
Cite
Yuxiang Liao; Hantao Liu; Irena Spasic (2024). RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives [Dataset]. http://doi.org/10.13026/z67q-xy65
Unique identifier
https://doi.org/10.13026/z67q-xy65
Dataset updated
Jan 30, 2024
Authors
Yuxiang Liao; Hantao Liu; Irena Spasic
License
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset is annotated by a panel of three cross-disciplinary experts with experience in clinical data processing following the i2b2 annotation scheme with minimum modification. The dataset consists of Findings and Impression sections extracted from full radiology reports. The dataset has 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two human annotators independently, of which the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that the MIMIC-CXR has been de-identified already, no protected health information (PHI) is included.