15 datasets found
  1. i2b2 Research Data Warehouse

    • rrid.site
    • scicrunch.org
    Updated Jan 29, 2022
    (2022). i2b2 Research Data Warehouse [Dataset]. http://identifiers.org/RRID:SCR_013276
    Description

    A data warehouse that integrates information on patients from multiple sources and consists of patient information from all the visits to Cincinnati Children's between 2003 and 2007. This information includes demographics (age, gender, race), diagnoses (ICD-9), procedures, medications and lab results. They have included extracts from Epic, DocSite, and the new Cerner laboratory system and will eventually load public data sources, data from the different divisions or research cores (such as images or genetic data), as well as the research databases from individual groups or investigators. This information is aggregated, cleaned and de-identified. Once this process is complete, it is presented to the user, who will then be able to query the data. The warehouse is best suited for tasks like cohort identification, hypothesis generation and retrospective data analysis. Automated software tools will facilitate some of these functions, while others will require more of a manual process. The initial software tools will be focused around cohort identification. They have developed a set of web-based tools that allow the user to query the warehouse after logging in. The only people able to see your data are those to whom you grant authorization. If the information can be provided to the general research community, they will add it to the warehouse. If it cannot, they will mark it so that only you (or others in your group with proper approval) can access it.

  2. i2b2-query-data-1.0

    • huggingface.co
    Updated Sep 10, 2023
    Nicholai (2023). i2b2-query-data-1.0 [Dataset]. https://huggingface.co/datasets/nmitchko/i2b2-query-data-1.0
    Authors
    Nicholai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    i2b2 query data 1.0

    This is a dataset of i2b2 query builder examples that are taken from a test environment of i2b2 and then pre-processed with AI descriptions.

  3. Data from: Advancing clinical cohort selection with genomics analysis on a...

    • figshare.com
    txt
    Updated Feb 3, 2020
    Jaclyn Smith; Melvin Lathara; Hollis Wright; Brian Hill; Ganapati Srinivasa; Christopher T Denny (2020). Advancing clinical cohort selection with genomics analysis on a distributed platform [Dataset]. http://doi.org/10.6084/m9.figshare.11796126.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jaclyn Smith; Melvin Lathara; Hollis Wright; Brian Hill; Ganapati Srinivasa; Christopher T Denny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw runtimes for metadata associated with the Advancing clinical cohort selection with genomics analysis on a distributed platform manuscript. Markdown used to generate plots at: https://github.com/OmicsDataAutomation/i2b2-oda-framework/blob/master/genomicsdb/results.Rmd.

  4. Informatics for Integrating Biology and the Bedside

    • rrid.site
    Updated Sep 28, 2004
    (2004). Informatics for Integrating Biology and the Bedside [Dataset]. http://identifiers.org/RRID:SCR_013629/resolver?q=*&i=rrid
    Description

    i2b2 (Informatics for Integrating Biology and the Bedside) is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System. The i2b2 Center is developing a scalable informatics framework that will enable clinical researchers to use existing clinical data for discovery research and, when combined with IRB-approved genomic data, facilitate the design of targeted therapies for individual patients with diseases having genetic origin. For some resources (e.g. software) the use of the resource requires accepting a specific (e.g. OpenSource) license.

  5. GC-HBOC database explorer

    • health-atlas.de
    Updated Jun 11, 2021
    Christoph Engel; Silke Zachariae (2021). GC-HBOC database explorer [Dataset]. https://www.health-atlas.de/data_files/403
    Authors
    Christoph Engel; Silke Zachariae
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset with information on cancer history, mutation status and surveillance history for more than 100 000 study patients is provided in i2b2 (Informatics for Integrating Biology and the Bedside, http://www.i2b2.org/software). Members of the German Consortium for Hereditary Breast and Ovarian Cancer can request access to i2b2 and will be able to perform database queries independently, e.g. to identify suitable patient populations for scientific evaluation projects.

  6. Smoking NLP Challenge Data

    • dknet.org
    • neuinfo.org
    • +2 more
    Updated Jan 29, 2022
    (2022). Smoking NLP Challenge Data [Dataset]. http://identifiers.org/RRID:SCR_008644
    Description

    The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare, which were preprocessed, converted into XML format, and separated into training and test sets. I2B2 is a data warehouse containing clinical data on over 150k patients, including outpatient DX, lab results, medications, and inpatient procedures. ETL processes were authored to pull data from EMR and finance systems. Institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists, who classified patients into Past Smokers, Current Smokers, Smokers, Non-smokers, and unknown. Second-hand smokers were considered non-smokers. Other institutions involved include Massachusetts Institute of Technology and the State University of New York at Albany. i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects) it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the Second i2b2 Challenge will be released on the one-year anniversary of that Challenge (November, 2010).

  7. Combining clinical and genomics queries using i2b2 – Three methods - Table 1...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Shawn N. Murphy; Paul Avillach; Riccardo Bellazzi; Lori Phillips; Matteo Gabetta; Alal Eran; Michael T. McDuffie; Isaac S. Kohane (2023). Combining clinical and genomics queries using i2b2 – Three methods - Table 1 [Dataset]. http://doi.org/10.1371/journal.pone.0172187.t001
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Shawn N. Murphy; Paul Avillach; Riccardo Bellazzi; Lori Phillips; Matteo Gabetta; Alal Eran; Michael T. McDuffie; Isaac S. Kohane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Combining clinical and genomics queries using i2b2 – Three methods - Table 1

  8. Demographic Traits Annotations

    • kaggle.com
    zip
    Updated Sep 27, 2019
    Google Health (2019). Demographic Traits Annotations [Dataset]. https://www.kaggle.com/google-health/demographic-traits-annotations
    Available download formats: zip (16012 bytes)
    Dataset authored and provided by
    Google Health
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The free-form portions of clinical notes are a significant source of information for research. One path for protecting patients’ privacy is to fully de-identify this information prior to sharing for research purposes. De-identification efforts have focused on known named entities and other known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note may contain residual “Demographic Traits” (DTs), unique enough to identify the patient when combined with other such facts. While we believe that re-identification is not possible with these demographic traits alone, we hope that giving healthcare organizations the option to remove them will strengthen privacy standards of automatic de-identification systems and bolster their confidence in such systems.

    More specifically, this dataset was used to test the performance of the pipeline described in our paper ‘Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes’. We evaluated our pipeline using a subset of the I2b2 2006 and MIMIC-III datasets.

    Content

    The data contains sentence tagging for MIMIC-III and I2b2 2006 datasets that was used in the paper ‘Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes’. Every sentence is tagged with its own demographic trait tag (as defined in the "Annotations Guide" file). More formally, the data contains CSV tables each containing rows corresponding to annotated sentences such that every row contains the following example properties: row ID, offset within the note’s text, length and label.

    The label mapping (from character to tag) appears in the "Tagged Categories" file. Furthermore, every note in the MIMIC-III dataset contains a unique row-id (which appears in a field within the note). In I2b2 2006, every note also contains a unique number, referred to as record-id (which also appears within the note). These identifiers can be found in our attached CSVs under the row_id and record_id columns, respectively. In both cases the offset is defined from the beginning of the note's text.
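
The row layout described above (ID, offset within the note's text, length, label) can be turned back into annotated sentences with a short script. The following is a minimal sketch, not the dataset authors' code. It assumes a CSV header with columns named `row_id`, `offset`, `length` and `label` (only `row_id` and `record_id` are named in the description; the other column names are assumptions), and it assumes offsets are counted in characters. The sample note, the label characters `A` and `B`, and the helper name `extract_spans` are hypothetical.

```python
import csv
import io


def extract_spans(note_text, annotations_csv):
    """Return (row_id, label, sentence) tuples for one note.

    Assumes columns row_id, offset, length, label (hypothetical header names),
    with offset counted in characters from the start of the note's text.
    """
    spans = []
    for row in csv.DictReader(io.StringIO(annotations_csv)):
        start = int(row["offset"])
        end = start + int(row["length"])
        # Slice the annotated sentence out of the note text.
        spans.append((row["row_id"], row["label"], note_text[start:end]))
    return spans


# Hypothetical note and annotation rows, for illustration only.
note = "Patient is a retired teacher. She lives alone."
rows = "row_id,offset,length,label\n1,0,29,A\n1,30,16,B\n"

for row_id, label, sentence in extract_spans(note, rows):
    print(row_id, label, repr(sentence))
```

For the I2b2 2006 tables, the same approach would presumably key on `record_id` instead of `row_id`; the label characters would then be resolved through the "Tagged Categories" file.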

  9. SDTM datasets of clinical data and measurements for selected cancer...

    • cancerimagingarchive.net
    • stage.cancerimagingarchive.net
    csv, n/a, xpt
    Updated Jun 20, 2019
    The Cancer Imaging Archive (2019). SDTM datasets of clinical data and measurements for selected cancer collections to TCIA [Dataset]. http://doi.org/10.7937/TCIA.2019.zfv154m9
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Jun 21, 2019
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying 4 TCIA breast cancer collections (Multi-center breast DCE-MRI data and segmentations from patients in the I-SPY 1/ACRIN 6657 trials (ISPY1), BREAST-DIAGNOSIS, Single site breast DCE-MRI data and segmentations from patients undergoing neoadjuvant chemotherapy (Breast-MRI-NACT-Pilot), The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA)) and the Ivy Glioblastoma Atlas Project (IvyGAP) brain cancer collection. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data imported into I2B2 were cross-mapped to industry standard concepts for names and values including those derived from BRIDG, CDISC SDTM, DICOM Structured Reporting models and using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 in SDTM compliant SAS transport files. The SDTM data was derived from data taken from both the curated TCIA spreadsheets as well as tumor measurements and dates from the TCIA Restful API. Due to the nature of the available data not all SDTM conformance rules were applicable or adhered to. These Study Data Tabulation Model format (SDTM) datasets were validated using Pinnacle 21 CDISC validation software. The validation software reviews datasets according to their degree of conformance to rules developed for the purposes of FDA submissions of electronic data. Iterative refinements were made to the datasets based upon group discussions and feedback from the validation tool. Export datasets for the following SDTM domains were generated:

    • DM (Demographics)
    • DS (Disposition)
    • MI (Microscopic Findings)
    • PR (Procedures)
    • SS (Subject Status)
    • TU (Tumor/Lesion Identification)
    • TR (Tumor/Lesion Results)

  10. DICOM SR of clinical data and measurement for breast cancer collections to...

    • cancerimagingarchive.net
    dicom, n/a
    Updated May 31, 2020
    The Cancer Imaging Archive (2020). DICOM SR of clinical data and measurement for breast cancer collections to TCIA [Dataset]. http://doi.org/10.7937/TCIA.2019.wgllssg1
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 26, 2020
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    The Data Integration & Imaging Informatics (DI-Cubed) project explored the issue of lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying various TCIA breast cancer collections. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data was imported into I2B2 and cross-mapped to industry standard concepts for names and values including those derived from BRIDG, CDISC SDTM, DICOM Structured Reporting models and using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 to CSV and thence converted to DICOM SR according to the DICOM Breast Imaging Report template [1], which supports description of patient characteristics, histopathology, receptor status and clinical findings including measurements. The purpose was not to advocate DICOM SR as an appropriate format for interchange or storage of such information for query purposes, but rather to demonstrate that standard concepts harmonized across multiple collections could be transformed into an existing standard report representation. The DICOM SR can be stored and used together with the images in repositories such as TCIA and in image viewers that support rendering of DICOM SR content. During the project, various deficiencies in the DICOM Breast Imaging Report template were identified with respect to describing breast MR studies, laterality of findings versus procedures, more recently developed receptor types, and patient characteristics and status. These were addressed via DICOM CP 1838, finalized in Jan 2019, and this subset reflects those changes.

    [1] DICOM Breast Imaging Report Templates available from: http://dicom.nema.org/medical/dicom/current/output/chtml/part16/sect_BreastImagingReportTemplates.html

  11. FURTHeR

    • neuinfo.org
    Updated Sep 8, 2024
    (2024). FURTHeR [Dataset]. http://identifiers.org/RRID:SCR_006383
    Description

    Data and knowledge management infrastructure for the new Center for Clinical and Translational Science (CCTS) at the University of Utah. This clinical cohort search tool is used to search across the University of Utah clinical data warehouse and the Utah Population Database for people who satisfy various criteria of the researchers. It uses the i2b2 front end but has a set of terminology servers, metadata servers and federated query tool as the back end systems. FURTHeR does on-the-fly translation of search terms and data models across the source systems and returns a count of results by unique individuals. They are extending the set of databases that can be queried.

  12. Weekly supervised Multilingual Data Set to train Named Entity Recognition...

    • zenodo.org
    Updated Apr 16, 2025
    Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc (2025). Weekly supervised Multilingual Data Set to train Named Entity Recognition for Symptom Extraction [Dataset]. http://doi.org/10.5281/zenodo.13918009
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Sets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom extraction NER models.

    Supported Languages and dataset locations for the specific language:

    English (base language): https://huggingface.co/HUMADEX/english_medical_ner
    German: https://huggingface.co/HUMADEX/german_medical_ner
    Italian: https://huggingface.co/HUMADEX/italian_medical_ner
    Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
    Greek: https://huggingface.co/HUMADEX/german_medical_ner
    Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
    Polish: https://huggingface.co/HUMADEX/polish_medical_ner
    Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

    Dataset Building

    • Data Integration and Preprocessing
    • Data Cleaning
    • Annotation with Stanza's i2b2 Clinical Model
    • Translation into the targeted language
    • Word Alignment
    • Data Augmentation

    Acknowledgement
    This dataset was created as part of joint research of the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks, project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

    Authors:
    dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc

    Please cite as:

    Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
    Doi: 10.20944/preprints202504.1356.v1
    Website: https://www.preprints.org/manuscript/202504.1356/v1

  13. Data from: SEnDAE: A resource for expanding research into social and...

    • usc-geohealth-hub-uscssi.hub.arcgis.com
    Updated Nov 17, 2025
    Spatial Sciences Institute (2025). SEnDAE: A resource for expanding research into social and environmental determinants of health [Dataset]. https://usc-geohealth-hub-uscssi.hub.arcgis.com/datasets/sendae-a-resource-for-expanding-research-into-social-and-environmental-determinants-of-health
    Dataset authored and provided by
    Spatial Sciences Institute
    Description

    Abstract: “Social and Environmental Determinants of Health (SEDoH) are of increasing interest to researchers in personal and public health. Collecting SEDoH and associating them with patient medical record can be challenging, especially for environmental variables. We announce here the release of SEnDAE, the Social and Environmental Determinants Address Enhancement toolkit, and open-source resource for ingesting a range of environmental variables and measurements from a variety of sources and associated them with arbitrary addresses. SEnDAE includes optional components for geocoding addresses, in case an organization does not have independent capabilities in that area, and recipes for extending the OMOP CDM and the ontology of an i2b2 instance to display and compute over the SEnDAE variables within i2b2. On a set of 5000 synthetic addresses, SEnDAE was able to geocode 83%. SEnDAE geocodes addresses to the same Census tract as ESRI 98.1% of the time. Development of SEnDAE is ongoing, but we hope that teams will find it useful to increase their usage of environmental variables and increase the field's general understanding of these important determinants of health.”

  14. Data from: RadCoref: Fine-tuning coreference resolution for different styles...

    • physionet.org
    Updated Jan 30, 2024
    Yuxiang Liao; Hantao Liu; Irena Spasic (2024). RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives [Dataset]. http://doi.org/10.13026/z67q-xy65
    Authors
    Yuxiang Liao; Hantao Liu; Irena Spasic
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset is annotated by a panel of three cross-disciplinary experts with experience in clinical data processing following the i2b2 annotation scheme with minimum modification. The dataset consists of Findings and Impression sections extracted from full radiology reports. The dataset has 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two human annotators independently, of which the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that the MIMIC-CXR has been de-identified already, no protected health information (PHI) is included.

  15. NYUTron: Health System Scale Language Models Are All-purpose Prediction...

    • datacatalog.med.nyu.edu
    Updated Aug 9, 2023
    Lavender Yao Jiang; Xujin Chris Liu; Nima Pour Nejatian; Mustafa Nasir-Moin; Duo Wang; Anas Abidin; Kevin Eaton; Howard Antony Riina; Ilya Laufer; Paawan Punjabi; Madeline Miceli; Nora C. Kim; Cordelia Orillac; Zane Schnurman; Christopher Livia; Hannah Weiss; David Kurland; Sean Neifert; Yosef Dastagirzada; Douglas Kondziolka; Alexander T. M. Cheung; Grace Yang; Ming Cao; Mona Flores; Anthony B. Costa; Yindalon Aphinyanaphongs; Kyunghyun Cho; Eric K. Oermann (2023). NYUTron: Health System Scale Language Models Are All-purpose Prediction Engines [Dataset]. https://datacatalog.med.nyu.edu/dataset/10633
    Dataset provided by
    NYU Health Sciences Library
    Authors
    Lavender Yao Jiang; Xujin Chris Liu; Nima Pour Nejatian; Mustafa Nasir-Moin; Duo Wang; Anas Abidin; Kevin Eaton; Howard Antony Riina; Ilya Laufer; Paawan Punjabi; Madeline Miceli; Nora C. Kim; Cordelia Orillac; Zane Schnurman; Christopher Livia; Hannah Weiss; David Kurland; Sean Neifert; Yosef Dastagirzada; Douglas Kondziolka; Alexander T. M. Cheung; Grace Yang; Ming Cao; Mona Flores; Anthony B. Costa; Yindalon Aphinyanaphongs; Kyunghyun Cho; Eric K. Oermann
    Description

    NYUTron is a large language model-based system that was developed with the objective of integrating clinical workflows centered around structured and unstructured notes and placing electronic orders in real time. The development team queried electronic health records from all NYU Langone facilities to generate two types of datasets: pre-training datasets ("NYU Notes", "NYU Notes–Manhattan", "NYU Notes–Brooklyn") which contain a total of 10 years of unlabelled inpatient clinical notes (387,144 patients, 4.1 billion words) and five fine-tuning datasets ("NYU Readmission", "NYU Readmission–Manhattan", "NYU Readmission–Brooklyn", "NYU Mortality", "NYU Binned LOS", "NYU Insurance Denial", "NYU Binned Comorbidity"), each containing 1 to 10 years of inpatient clinical notes (55,791 to 413,845 patients, 51 to 87 million words) with task-specific labels (2 to 4 classes). In addition, the team utilized two publicly available datasets, i2b2-2012 and MIMIC-III, for testing and fine-tuning.

    To assess the model's predictive capabilities, NYUTron was applied to a battery of five tasks: three clinical and two operational tasks (30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay (LOS) prediction and insurance denial prediction). In addition, a detailed analysis of our 30-day readmission task was performed to investigate data efficiency, generalizability, deployability, and potential clinical impact. NYUTron demonstrated an area under the curve (AUC) of 78.7–94.9%, with an improvement of 5.36–14.7% compared with traditional models.

    The investigators have shared code to replicate the pretraining, fine-tuning and testing of the predictive models obtained with NYU Langone electronic health records, as well as preprocessing code for the i2b2-2012 dataset and implementation steps for MIMIC-III.

i2b2 Research Data Warehouse

RRID:SCR_013276, nif-0000-33024, i2b2 Research Data Warehouse (RRID:SCR_013276), i2b2, Informatics for Integrating Biology and the Bedside, Informatics for Integrating Biology and the Bedside Research Data Warehouse

29 scholarly articles cite this dataset (View in Google Scholar)