11 datasets found
  1. i2b2-query-data-1.0

    • huggingface.co
    Updated Sep 10, 2023
    Cite
    Nicholai (2023). i2b2-query-data-1.0 [Dataset]. https://huggingface.co/datasets/nmitchko/i2b2-query-data-1.0
    Dataset updated
    Sep 10, 2023
    Authors
    Nicholai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    i2b2 query data 1.0

    This is a dataset of i2b2 query builder examples that are taken from a test environment of i2b2 and then pre-processed with AI descriptions.

  2. Data from: Advancing clinical cohort selection with genomics analysis on a...

    • figshare.com
    txt
    Updated Feb 3, 2020
    Cite
    Jaclyn Smith; Melvin Lathara; Hollis Wright; Brian Hill; Ganapati Srinivasa; Christopher T Denny (2020). Advancing clinical cohort selection with genomics analysis on a distributed platform [Dataset]. http://doi.org/10.6084/m9.figshare.11796126.v1
    Available download formats: txt
    Dataset updated
    Feb 3, 2020
    Dataset provided by
    figshare
    Authors
    Jaclyn Smith; Melvin Lathara; Hollis Wright; Brian Hill; Ganapati Srinivasa; Christopher T Denny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw runtimes for metadata associated with the Advancing clinical cohort selection with genomics analysis on a distributed platform manuscript. Markdown used to generate plots at: https://github.com/OmicsDataAutomation/i2b2-oda-framework/blob/master/genomicsdb/results.Rmd.

  3. GC-HBOC database explorer

    • health-atlas.de
    Updated Jun 11, 2021
    Cite
    Christoph Engel; Silke Zachariae (2021). GC-HBOC database explorer [Dataset]. https://www.health-atlas.de/data_files/403
    Dataset updated
    Jun 11, 2021
    Authors
    Christoph Engel; Silke Zachariae
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset with information on cancer history, mutation status and surveillance history for more than 100 000 study patients is provided in i2b2 (Informatics for Integrating Biology and the Bedside, http://www.i2b2.org/software). Members of the German Consortium for Hereditary Breast and Ovarian Cancer can request access to i2b2 and will be able to perform database queries independently, e.g. to identify suitable patient populations for scientific evaluation projects.

  4. Smoking NLP Challenge Data

    • scicrunch.org
    • neuinfo.org
    • +2 more
    Updated Mar 7, 2024
    Cite
    (2024). Smoking NLP Challenge Data [Dataset]. http://identifiers.org/RRID:SCR_008644
    Dataset updated
    Mar 7, 2024
    Description

    The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare, which were preprocessed, converted into XML format, and separated into training and test sets. I2B2 is a data warehouse containing clinical data on over 150k patients, including outpatient DX, lab results, medications, and inpatient procedures. ETL processes were authored to pull data from EMR and finance systems. Institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists, who classified patients as Past Smokers, Current Smokers, Smokers, Non-smokers, or Unknown. Second-hand smokers were considered non-smokers. Other institutions involved include Massachusetts Institute of Technology and the State University of New York at Albany.

    i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects) it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the Second i2b2 Challenge will be released on the one year anniversary of that Challenge (November, 2010).

  5. Combining clinical and genomics queries using i2b2 – Three methods - Table 1...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Shawn N. Murphy; Paul Avillach; Riccardo Bellazzi; Lori Phillips; Matteo Gabetta; Alal Eran; Michael T. McDuffie; Isaac S. Kohane (2023). Combining clinical and genomics queries using i2b2 – Three methods - Table 1 [Dataset]. http://doi.org/10.1371/journal.pone.0172187.t001
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Shawn N. Murphy; Paul Avillach; Riccardo Bellazzi; Lori Phillips; Matteo Gabetta; Alal Eran; Michael T. McDuffie; Isaac S. Kohane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Combining clinical and genomics queries using i2b2 – Three methods - Table 1

  6. Data from: Advancing clinical cohort selection with genomics analysis on a...

    • plos.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Jaclyn M. Smith; Melvin Lathara; Hollis Wright; Brian Hill; Nalini Ganapati; Ganapati Srinivasa; Christopher T. Denny (2023). Advancing clinical cohort selection with genomics analysis on a distributed platform [Dataset]. http://doi.org/10.1371/journal.pone.0231826
    Available download formats: docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Jaclyn M. Smith; Melvin Lathara; Hollis Wright; Brian Hill; Nalini Ganapati; Ganapati Srinivasa; Christopher T. Denny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The affordability of next-generation genomic sequencing and the improvement of medical data management have contributed largely to the evolution of biological analysis from both a clinical and research perspective. Precision medicine is a response to these advancements that places individuals into better-defined subsets based on shared clinical and genetic features. The identification of personalized diagnosis and treatment options is dependent on the ability to draw insights from large-scale, multi-modal analysis of biomedical datasets. Driven by a real use case, we premise that platforms that support precision medicine analysis should maintain data in their optimal data stores, should support distributed storage and query mechanisms, and should scale as more samples are added to the system. We extended a genomics-based columnar data store, GenomicsDB, for ease of use within a distributed analytics platform for clinical and genomic data integration, known as the ODA framework. The framework supports interaction from an i2b2 plugin as well as a notebook environment. We show that the ODA framework exhibits worst-case linear scaling for array size (storage), import time (data construction), and query time for an increasing number of samples. We go on to show worst-case linear time for both import of clinical data and aggregate query execution time within a distributed environment. This work highlights the integration of a distributed genomic database with a distributed compute environment to support scalable and efficient precision medicine queries from a HIPAA-compliant, cohort system in a real-world setting. The ODA framework is currently deployed in production to support precision medicine exploration and analysis from clinicians and researchers at UCLA David Geffen School of Medicine.

  7. SDTM datasets of clinical data and measurements for selected cancer...

    • dev.cancerimagingarchive.net
    csv, n/a, xpt
    Cite
    The Cancer Imaging Archive, SDTM datasets of clinical data and measurements for selected cancer collections to TCIA [Dataset]. http://doi.org/10.7937/TCIA.2019.zfv154m9
    Available download formats: n/a, xpt, csv
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Jun 21, 2019
    Dataset funded by
    National Cancer Institute: http://www.cancer.gov/
    Description

    The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying 4 TCIA breast cancer collections (Multi-center breast DCE-MRI data and segmentations from patients in the I-SPY 1/ACRIN 6657 trials (ISPY1), BREAST-DIAGNOSIS, Single site breast DCE-MRI data and segmentations from patients undergoing neoadjuvant chemotherapy (Breast-MRI-NACT-Pilot), The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA)) and the Ivy Glioblastoma Atlas Project (IvyGAP) brain cancer collection. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data imported into I2B2 were cross-mapped to industry standard concepts for names and values including those derived from BRIDG, CDISC SDTM, DICOM Structured Reporting models and using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 in SDTM compliant SAS transport files. The SDTM data was derived from data taken from both the curated TCIA spreadsheets as well as tumor measurements and dates from the TCIA Restful API. Due to the nature of the available data not all SDTM conformance rules were applicable or adhered to. These Study Data Tabulation Model format (SDTM) datasets were validated using Pinnacle 21 CDISC validation software. The validation software reviews datasets according to their degree of conformance to rules developed for the purposes of FDA submissions of electronic data. Iterative refinements were made to the datasets based upon group discussions and feedback from the validation tool. Export datasets for the following SDTM domains were generated:

    • DM (Demographics)
    • DS (Disposition)
    • MI (Microscopic Findings)
    • PR (Procedures)
    • SS (Subject Status)
    • TU (Tumor/Lesion Identification)
    • TR (Tumor/Lesion Results)

  8. FURTHeR

    • scicrunch.org
    • rrid.site
    Updated Jan 25, 2008
    Cite
    (2008). FURTHeR [Dataset]. http://identifiers.org/RRID:SCR_006383
    Dataset updated
    Jan 25, 2008
    Description

    Data and knowledge management infrastructure for the new Center for Clinical and Translational Science (CCTS) at the University of Utah. This clinical cohort search tool is used to search across the University of Utah clinical data warehouse and the Utah Population Database for people who satisfy various criteria of the researchers. It uses the i2b2 front end but has a set of terminology servers, metadata servers and federated query tool as the back end systems. FURTHeR does on-the-fly translation of search terms and data models across the source systems and returns a count of results by unique individuals. They are extending the set of databases that can be queried.

  9. Weekly supervised Multilingual Data Set to train Named Entity Recognition...

    • zenodo.org
    Updated Apr 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc (2025). Weekly supervised Multilingual Data Set to train Named Entity Recognition for Symptom Extraction [Dataset]. http://doi.org/10.5281/zenodo.13918009
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Sets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom extraction NER models.

    Supported Languages and dataset locations for the specific language:

    English (base language): https://huggingface.co/HUMADEX/english_medical_ner
    German: https://huggingface.co/HUMADEX/german_medical_ner
    Italian: https://huggingface.co/HUMADEX/italian_medical_ner
    Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
    Greek: https://huggingface.co/HUMADEX/greek_medical_ner
    Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
    Polish: https://huggingface.co/HUMADEX/polish_medical_ner
    Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

    Dataset Building

    • Data Integration and Preprocessing
    • Data Cleaning
    • Annotation with Stanza's i2b2 Clinical Model
    • Translation into the targeted language
    • Word Alignment
    • Data Augmentation

    Acknowledgement
    This dataset was created as part of joint research of the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks, project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

    Authors:
    dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc

    Please cite as:

    Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
    Doi: 10.20944/preprints202504.1356.v1
    Website: https://www.preprints.org/manuscript/202504.1356/v1

  10. DICOM SR of clinical data and measurement for breast cancer collections to...

    • cancerimagingarchive.net
    dicom, n/a
    Cite
    The Cancer Imaging Archive, DICOM SR of clinical data and measurement for breast cancer collections to TCIA [Dataset]. http://doi.org/10.7937/TCIA.2019.wgllssg1
    Available download formats: dicom, n/a
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 26, 2020
    Dataset funded by
    National Cancer Institute: http://www.cancer.gov/
    Description

    The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying various TCIA breast cancer collections. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data was imported into I2B2 and cross-mapped to industry standard concepts for names and values including those derived from BRIDG, CDISC SDTM, DICOM Structured Reporting models and using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 to CSV and thence converted to DICOM SR according to the DICOM Breast Imaging Report template [1], which supports description of patient characteristics, histopathology, receptor status and clinical findings including measurements. The purpose was not to advocate DICOM SR as an appropriate format for interchange or storage of such information for query purposes, but rather to demonstrate that use of standard concepts harmonized across multiple collections could be transformed into an existing standard report representation. The DICOM SR can be stored and used together with the images in repositories such as TCIA and in image viewers that support rendering of DICOM SR content. During the project, various deficiencies in the DICOM Breast Imaging Report template were identified with respect to describing breast MR studies, laterality of findings versus procedures, more recently developed receptor types, and patient characteristics and status. These were addressed via DICOM CP 1838, finalized in Jan 2019, and this subset reflects those changes.

    [1] DICOM Breast Imaging Report Templates, available from: http://dicom.nema.org/medical/dicom/current/output/chtml/part16/sect_BreastImagingReportTemplates.html

  11. Data from: RadCoref: Fine-tuning coreference resolution for different styles...

    • physionet.org
    Updated Jan 30, 2024
    Cite
    Yuxiang Liao; Hantao Liu; Irena Spasic (2024). RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives [Dataset]. http://doi.org/10.13026/z67q-xy65
    Dataset updated
    Jan 30, 2024
    Authors
    Yuxiang Liao; Hantao Liu; Irena Spasic
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset is annotated by a panel of three cross-disciplinary experts with experience in clinical data processing following the i2b2 annotation scheme with minimum modification. The dataset consists of Findings and Impression sections extracted from full radiology reports. The dataset has 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two human annotators independently, of which the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that the MIMIC-CXR has been de-identified already, no protected health information (PHI) is included.
