A data warehouse that integrates patient information from multiple sources and covers all visits to Cincinnati Children's between 2003 and 2007. This information includes demographics (age, gender, race), diagnoses (ICD-9), procedures, medications and lab results. Extracts from Epic, DocSite, and the new Cerner laboratory system have been included; public data sources, data from the different divisions or research cores (such as images or genetic data), and the research databases of individual groups or investigators will eventually be loaded as well. This information is aggregated, cleaned and de-identified, and once that process is complete it is presented to the user for querying. The warehouse is best suited to tasks such as cohort identification, hypothesis generation and retrospective data analysis. Automated software tools will facilitate some of these functions, while others will require a more manual process. The initial software tools focus on cohort identification: a set of web-based tools allows the user to query the warehouse after logging in. The only people able to see your data are those to whom you grant authorization. If the information can be provided to the general research community, it will be added to the warehouse; if it cannot, it will be marked so that only you (or others in your group with proper approval) can access it.
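As a rough illustration of the cohort-identification queries such a warehouse supports, here is a minimal sketch over an i2b2-style observation_fact table; the schema is simplified and the concept codes and patients are invented for illustration only.

```python
import sqlite3

# Simplified, in-memory sketch of an i2b2-style star schema:
# one observation_fact row per coded fact about a patient.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT)")
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?)",
    [
        (1, "ICD9:493.90"),   # asthma diagnosis
        (1, "MED:albuterol"),
        (2, "ICD9:493.90"),
        (3, "MED:albuterol"),
    ],
)

# Cohort identification: count distinct patients who have BOTH
# an asthma diagnosis and an albuterol order.
cohort_size = conn.execute(
    """
    SELECT COUNT(DISTINCT d.patient_num)
    FROM observation_fact d
    JOIN observation_fact m ON m.patient_num = d.patient_num
    WHERE d.concept_cd = 'ICD9:493.90'
      AND m.concept_cd = 'MED:albuterol'
    """
).fetchone()[0]
print(cohort_size)  # only patient 1 satisfies both criteria
```

A real i2b2 query additionally constrains dates, encounters and value columns, but the self-join-per-criterion shape is the core of cohort counting.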
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
i2b2 query data 1.0
This is a dataset of i2b2 query builder examples taken from an i2b2 test environment and then pre-processed with AI-generated descriptions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw runtimes for metadata associated with the "Advancing clinical cohort selection with genomics analysis on a distributed platform" manuscript. The Markdown used to generate the plots is at: https://github.com/OmicsDataAutomation/i2b2-oda-framework/blob/master/genomicsdb/results.Rmd.
i2b2 (Informatics for Integrating Biology and the Bedside) is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System. The i2b2 Center is developing a scalable informatics framework that will enable clinical researchers to use existing clinical data for discovery research and, when combined with IRB-approved genomic data, facilitate the design of targeted therapies for individual patients with diseases of genetic origin. For some resources (e.g. software), use of the resource requires accepting a specific (e.g. open-source) license.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset with information on cancer history, mutation status and surveillance history for more than 100,000 study patients is provided in i2b2 (Informatics for Integrating Biology and the Bedside, http://www.i2b2.org/software). Members of the German Consortium for Hereditary Breast and Ovarian Cancer can request access to i2b2 and perform database queries independently, e.g. to identify suitable patient populations for scientific evaluation projects.
The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare, which were preprocessed, converted into XML format, and separated into training and test sets. i2b2 is a data warehouse containing clinical data on over 150,000 patients, including outpatient diagnoses, lab results, medications, and inpatient procedures; ETL processes were authored to pull data from EMR and finance systems. Institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists, who classified patients into Past Smokers, Current Smokers, Smokers, Non-Smokers, and Unknown; second-hand smokers were considered non-smokers. Other institutions involved include the Massachusetts Institute of Technology and the State University of New York at Albany. i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects) it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully de-identified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the Second i2b2 Challenge will be released on the one-year anniversary of that Challenge (November 2010).
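To show roughly how labelled XML discharge-summary records of this kind can be consumed, here is a short sketch; the tag and attribute names (RECORD, SMOKING STATUS, TEXT) and the record contents are assumptions for illustration, not the official challenge schema.

```python
import xml.etree.ElementTree as ET

# Toy records in the style of the i2b2 smoking-challenge XML
# (tag/attribute names are assumptions, not the official schema).
xml_data = """
<ROOT>
  <RECORD ID="642">
    <SMOKING STATUS="PAST SMOKER"/>
    <TEXT>Discharge summary text. Patient quit smoking 10 years ago.</TEXT>
  </RECORD>
  <RECORD ID="643">
    <SMOKING STATUS="NON-SMOKER"/>
    <TEXT>Discharge summary text. No history of tobacco use.</TEXT>
  </RECORD>
</ROOT>
"""

root = ET.fromstring(xml_data)
# Pair each record ID with its smoking label and note text.
labeled = [
    (rec.get("ID"), rec.find("SMOKING").get("STATUS"), rec.find("TEXT").text.strip())
    for rec in root.iter("RECORD")
]
for rec_id, status, _text in labeled:
    print(rec_id, status)
```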
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Combining clinical and genomics queries using i2b2 – Three methods - Table 1
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The free-form portions of clinical notes are a significant source of information for research. One path for protecting patient privacy is to fully de-identify this information prior to sharing it for research purposes. De-identification efforts have focused on known named entities and other known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note may contain residual "Demographic Traits" (DTs), unique enough to identify the patient when combined with other such facts. While we believe that re-identification is not possible with these demographic traits alone, we hope that giving healthcare organizations the option to remove them will strengthen the privacy standards of automatic de-identification systems and bolster their confidence in such systems.
More specifically, this dataset was used to test the performance of the pipeline from our paper "Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes". We evaluated our pipeline using a subset of the I2b2 2006 and MIMIC-III datasets.
The data contains sentence tagging for the MIMIC-III and I2b2 2006 datasets used in the paper "Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes". Every sentence is tagged with its own demographic trait tag (as defined in the "Annotations Guide" file). More formally, the data consists of CSV tables whose rows correspond to annotated sentences; every row contains the following properties: row ID, offset within the note's text, length and label.
The label mapping (from character to tag) appears in the "Tagged Categories" file. Furthermore, every note in the MIMIC-III dataset has a unique row-id (which appears in a field within the note). In I2b2 2006, every note likewise has a unique number, referred to as record-id (which also appears within the note). These identifiers can be found in the attached CSVs under the row_id and record_id columns, respectively. In both cases the offset is defined from the beginning of the note's text.
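A short sketch of how those columns can be used to recover the tagged spans by slicing each note at (offset, offset + length); the note text, labels, and offsets below are invented for illustration.

```python
import csv
import io

# Hypothetical note keyed by row_id; the real notes come from
# MIMIC-III / I2b2 2006, with offsets relative to the note's text.
notes = {"42": "Patient is a retired schoolteacher living in Boston."}

# Annotation table with the columns described above.
annotations_csv = io.StringIO(
    "row_id,offset,length,label\n"
    "42,13,21,OCCUPATION\n"
    "42,45,6,LOCATION\n"
)

# Recover each tagged span by slicing the note text.
spans = []
for row in csv.DictReader(annotations_csv):
    text = notes[row["row_id"]]
    start = int(row["offset"])
    end = start + int(row["length"])
    spans.append((row["label"], text[start:end]))
print(spans)  # [('OCCUPATION', 'retired schoolteacher'), ('LOCATION', 'Boston')]
```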
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Data Integration & Imaging Informatics (DI-Cubed) project explored the lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying 4 TCIA breast cancer collections (multi-center breast DCE-MRI data and segmentations from patients in the I-SPY 1/ACRIN 6657 trials (ISPY1), BREAST-DIAGNOSIS, single-site breast DCE-MRI data and segmentations from patients undergoing neoadjuvant chemotherapy (Breast-MRI-NACT-Pilot), and The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA)) and the Ivy Glioblastoma Atlas Project (IvyGAP) brain cancer collection. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data imported into I2B2 were cross-mapped to industry-standard concepts for names and values, including those derived from the BRIDG, CDISC SDTM, and DICOM Structured Reporting models, using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 in SDTM-compliant SAS transport files. The SDTM data was derived from both the curated TCIA spreadsheets and tumor measurements and dates from the TCIA RESTful API. Due to the nature of the available data, not all SDTM conformance rules were applicable or adhered to. These Study Data Tabulation Model (SDTM) datasets were validated using Pinnacle 21 CDISC validation software, which reviews datasets according to their degree of conformance to rules developed for FDA submissions of electronic data. Iterative refinements were made to the datasets based upon group discussions and feedback from the validation tool. Export datasets for the following SDTM domains were generated:
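The cross-mapping step described above amounts to translating local (name, value) pairs to standard concepts. A minimal sketch follows; the local names and the mapped labels are placeholders, not verified NCI Thesaurus / SNOMED CT / LOINC entries.

```python
# Illustrative cross-mapping of local I2B2 concept names and values to
# standardized representations; codes and labels here are placeholders.
CONCEPT_MAP = {
    "er_status": "Estrogen Receptor Status",
    "tumor_size_mm": "Tumor Longest Dimension (mm)",
}
VALUE_MAP = {
    "pos": "Positive",
    "neg": "Negative",
}

def standardize(local_name, local_value):
    """Map one local (name, value) pair to its standard representation."""
    std_name = CONCEPT_MAP[local_name]
    # Coded values get a terminology label; numerics pass through.
    std_value = VALUE_MAP.get(str(local_value).lower(), local_value)
    return std_name, std_value

print(standardize("er_status", "POS"))       # ('Estrogen Receptor Status', 'Positive')
print(standardize("tumor_size_mm", 23))      # ('Tumor Longest Dimension (mm)', 23)
```

In the actual project these lookup tables were maintained per collection so that the same harmonized concept could be exported to SDTM domains regardless of how the source spreadsheet spelled it.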
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Data Integration & Imaging Informatics (DI-Cubed) project explored the lack of standardized data capture at the point of data creation, as reflected in the non-image data accompanying various TCIA breast cancer collections. The work addressed the desire for semantic interoperability between various NCI initiatives by aligning on common clinical metadata elements and supporting use cases that connect clinical, imaging, and genomics data. Accordingly, clinical and measurement data was imported into I2B2 and cross-mapped to industry-standard concepts for names and values, including those derived from the BRIDG, CDISC SDTM, and DICOM Structured Reporting models, using NCI Thesaurus, SNOMED CT and LOINC controlled terminology. A subset of the standardized data was then exported from I2B2 to CSV and thence converted to DICOM SR according to the DICOM Breast Imaging Report template [1], which supports description of patient characteristics, histopathology, receptor status and clinical findings including measurements. The purpose was not to advocate DICOM SR as an appropriate format for interchange or storage of such information for query purposes, but rather to demonstrate that standard concepts harmonized across multiple collections could be transformed into an existing standard report representation. The DICOM SR can be stored and used together with the images in repositories such as TCIA and in image viewers that support rendering of DICOM SR content. During the project, various deficiencies in the DICOM Breast Imaging Report template were identified with respect to describing breast MR studies, laterality of findings versus procedures, more recently developed receptor types, and patient characteristics and status. These were addressed via DICOM CP 1838, finalized in January 2019, and this subset reflects those changes.
DICOM Breast Imaging Report Templates available from: http://dicom.nema.org/medical/dicom/current/output/chtml/part16/sect_BreastImagingReportTemplates.html
Data and knowledge management infrastructure for the new Center for Clinical and Translational Science (CCTS) at the University of Utah. This clinical cohort search tool searches across the University of Utah clinical data warehouse and the Utah Population Database for people who satisfy researcher-specified criteria. It uses the i2b2 front end, with a set of terminology servers, metadata servers and a federated query tool as the back-end systems. FURTHeR does on-the-fly translation of search terms and data models across the source systems and returns a count of results by unique individuals. The set of databases that can be queried is being extended.
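The essential mechanics of such a federated query are per-source term translation followed by deduplication by individual. A toy sketch follows; the source names, term maps, local codes, and patient IDs are invented for illustration.

```python
# Per-source translation tables: one common search term maps to each
# source system's local code (all values here are illustrative).
TERM_MAPS = {
    "warehouse": {"diabetes": "ICD9:250.00"},
    "population_db": {"diabetes": "DX_DIABETES"},
}

# Stand-ins for the two source systems: local code -> individuals.
SOURCES = {
    "warehouse": {"ICD9:250.00": {"p1", "p2"}},
    "population_db": {"DX_DIABETES": {"p2", "p3"}},
}

def federated_count(term):
    """Translate the term per source, query each, dedupe by individual."""
    individuals = set()
    for source, term_map in TERM_MAPS.items():
        local_code = term_map[term]
        individuals |= SOURCES[source].get(local_code, set())
    return len(individuals)

print(federated_count("diabetes"))  # p2 appears in both sources but counts once
```

A set union over linked person identifiers is what makes the returned count "by unique individuals" rather than a sum of per-source hits.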
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom-extraction NER models.
Supported languages and dataset locations:
English (base language): https://huggingface.co/HUMADEX/english_medical_ner
German: https://huggingface.co/HUMADEX/german_medical_ner
Italian: https://huggingface.co/HUMADEX/italian_medical_ner
Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
Greek: https://huggingface.co/HUMADEX/german_medical_ner
Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
Polish: https://huggingface.co/HUMADEX/polish_medical_ner
Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner
Dataset Building
Acknowledgement
This dataset was created as part of joint research by the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Programme project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.
Authors:
dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc
Please cite as:
Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
DOI: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1
Abstract: "Social and Environmental Determinants of Health (SEDoH) are of increasing interest to researchers in personal and public health. Collecting SEDoH and associating them with patient medical records can be challenging, especially for environmental variables. We announce here the release of SEnDAE, the Social and Environmental Determinants Address Enhancement toolkit, an open-source resource for ingesting a range of environmental variables and measurements from a variety of sources and associating them with arbitrary addresses. SEnDAE includes optional components for geocoding addresses, in case an organization does not have independent capabilities in that area, and recipes for extending the OMOP CDM and the ontology of an i2b2 instance to display and compute over the SEnDAE variables within i2b2. On a set of 5000 synthetic addresses, SEnDAE was able to geocode 83%. SEnDAE geocodes addresses to the same Census tract as ESRI 98.1% of the time. Development of SEnDAE is ongoing, but we hope that teams will find it useful to increase their usage of environmental variables and increase the field's general understanding of these important determinants of health."
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset was annotated by a panel of three cross-disciplinary experts with experience in clinical data processing, following the i2b2 annotation scheme with minimal modification. It consists of Findings and Impression sections extracted from full radiology reports, with 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two annotators independently, and the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that MIMIC-CXR has already been de-identified, no protected health information (PHI) is included.
NYUTron is a large language model-based system that was developed with the objective of integrating clinical workflows centered around structured and unstructured notes and placing electronic orders in real time. The development team queried electronic health records from all NYU Langone facilities to generate two types of datasets: pre-training datasets ("NYU Notes", "NYU Notes–Manhattan", "NYU Notes–Brooklyn"), which contain a total of 10 years of unlabelled inpatient clinical notes (387,144 patients, 4.1 billion words), and five fine-tuning datasets ("NYU Readmission", "NYU Readmission–Manhattan", "NYU Readmission–Brooklyn", "NYU Mortality", "NYU Binned LOS", "NYU Insurance Denial", "NYU Binned Comorbidity"), each containing 1 to 10 years of inpatient clinical notes (55,791 to 413,845 patients, 51 to 87 million words) with task-specific labels (2 to 4 classes). In addition, the team utilized two publicly available datasets, i2b2-2012 and MIMIC-III, for testing and fine-tuning.
To assess the model's predictive capabilities, NYUTron was applied to a battery of five tasks, three clinical and two operational (30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length-of-stay (LOS) prediction and insurance denial prediction). In addition, a detailed analysis of the 30-day readmission task was performed to investigate data efficiency, generalizability, deployability, and potential clinical impact. NYUTron demonstrated an area under the curve (AUC) of 78.7–94.9%, an improvement of 5.36–14.7% over traditional models.
The investigators have shared code to replicate the pretraining, fine-tuning and testing of the predictive models obtained with NYU Langone electronic health records, as well as preprocessing code for the i2b2-2012 dataset and implementation steps for MIMIC-III.