CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Get premium quality Off-the-shelf transcribed medical records dataset to develop better performing machine learning models. Deep domain expertise. Fast & Cost-effective.
Overview
This dataset is a collection of high-quality, multimodal medical image sets that are ready to use for improving the accuracy of computer vision models. All contents are sourced from Pixta AI's partner network with high quality and full data compliance.
Data subject
The datasets cover various modalities and task types:
X-ray datasets
CT datasets
MRI datasets
Mammography datasets
Segmentation datasets
Classification datasets
Regression datasets
Use case
The dataset could be used for various healthcare and medical models:
Medical Image Analysis
Remote Diagnosis
Medical Record Keeping
...
Each dataset is supported by a review process involving both AI and expert doctors to ensure labelling consistency and accuracy. Contact us for more custom datasets.
About PIXTA
PIXTASTOCK is the largest Asian-featured stock platform, providing data, content, tools and services since 2005. PIXTA has 15 years of experience integrating advanced AI technology in managing, curating, and processing over 100M visual materials and serving leading global brands for their creative and data demands. Visit us at https://www.pixta.ai/ or contact us by email at admin.bi@pixta.co.jp.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
patients.csv
Contains patient demographic and registration details.
Column -> Description
patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address
doctors.csv
Details about the doctors working in the hospital.
Column -> Description
doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where doctor is based
email -> Official email address
appointments.csv
Records of scheduled and completed patient appointments.
Column -> Description
appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)
treatments.csv
Information about the treatments given during appointments.
Column -> Description
treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given
billing.csv
Billing and payment details for treatments.
Column -> Description
bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
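The column descriptions above imply a chain of keys: billing links to treatments through treatment_id, and treatments link to appointments through appointment_id. The following is a minimal sketch of that join using Python's built-in sqlite3 module. The rows, the ID formats (A1, T1, ...), and the helper name billed_per_doctor are made up for illustration; the real CSV files would need to be loaded first, and only a subset of columns is shown.

```python
import sqlite3

# Tiny in-memory stand-in for three of the five files; every row is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE appointments (appointment_id TEXT PRIMARY KEY, patient_id TEXT,
                           doctor_id TEXT, status TEXT);
CREATE TABLE treatments   (treatment_id TEXT PRIMARY KEY, appointment_id TEXT,
                           treatment_type TEXT, cost REAL);
CREATE TABLE billing      (bill_id TEXT PRIMARY KEY, patient_id TEXT,
                           treatment_id TEXT, amount REAL, payment_status TEXT);
INSERT INTO appointments VALUES
    ('A1', 'P1', 'D1', 'Completed'),
    ('A2', 'P2', 'D1', 'Completed'),
    ('A3', 'P1', 'D2', 'Cancelled');
INSERT INTO treatments VALUES
    ('T1', 'A1', 'MRI', 500.0),
    ('T2', 'A2', 'X-ray', 120.0);
INSERT INTO billing VALUES
    ('B1', 'P1', 'T1', 500.0, 'Paid'),
    ('B2', 'P2', 'T2', 120.0, 'Pending');
""")


def billed_per_doctor(connection):
    """Total billed and total paid per doctor, following billing -> treatments -> appointments."""
    query = """
        SELECT a.doctor_id,
               SUM(b.amount) AS total_billed,
               SUM(CASE WHEN b.payment_status = 'Paid' THEN b.amount ELSE 0 END) AS total_paid
        FROM billing AS b
        JOIN treatments AS t ON t.treatment_id = b.treatment_id
        JOIN appointments AS a ON a.appointment_id = t.appointment_id
        GROUP BY a.doctor_id
        ORDER BY a.doctor_id
    """
    return connection.execute(query).fetchall()


rows = billed_per_doctor(conn)
for doctor_id, total_billed, total_paid in rows:
    print(doctor_id, total_billed, total_paid)
```

Inner joins are used here, so doctors whose appointments produced no billed treatment (D2 in the toy data) drop out of the result; a LEFT JOIN starting from appointments would keep them.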
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We generated this dataset to train a machine learning model for automatically generating psychiatric case notes from doctor-patient conversations. Since we did not have access to real doctor-patient conversations, we used transcripts from two different sources to generate audio recordings of enacted conversations between a doctor and a patient. We employed eight students who worked in pairs to generate these recordings. Six of the transcripts that we used to produce these recordings were hand-written by Cheryl Bristow, and the rest were adapted from Alexander Street transcripts, which were generated from real doctor-patient conversations. Our study requires recording the doctor and the patient(s) in separate channels, which is the primary reason for generating our own audio recordings of the conversations.
We used the Google Cloud Speech-to-Text API to transcribe the enacted recordings. These newly generated transcripts are produced entirely by AI-powered automatic speech recognition, whereas the source transcripts are either hand-written or fine-tuned by human transcribers (transcripts from Alexander Street).
We provided the generated transcripts back to the students and asked them to write case notes. The students worked independently using software that we developed earlier for this purpose. The students had past experience of writing case notes, and we let them write case notes as they had practiced, without any training or instructions from us.
NOTE: Audio recordings are not included in Zenodo due to large file size but they are available in the GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A database of 100,000 virtual patients, with 361,760 admissions and 107,535,387 lab observations in total.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: We recommend downloading from the official Zenodo link, which is integrated with our code. However, if you encounter download problems, you can also use this mirror link from Google Drive.
Abstract
We introduce MedMNIST v2, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, could support numerous research / educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.
Note: This dataset is NOT intended for clinical use.
We recommend our official code to download, parse and use the MedMNIST dataset:
pip install medmnist
Citation
If you find this project useful, please cite both the v1 and v2 papers as:
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.
Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.
or using bibtex:
@article{medmnistv2,
  title={MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification},
  author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
  journal={arXiv preprint arXiv:2110.14795},
  year={2021}
}
@inproceedings{medmnistv1,
  title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
  author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
  booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
  pages={191--195},
  year={2021}
}
Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.
License
The dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
The code is under Apache-2.0 License.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic simulation of medical fitness reports collected from routine health check-ups. It contains 5,000 rows representing individuals' vitals, lab test results, physical exam notes, and disease screenings.
Each row is labeled with a FIT or UNFIT outcome to indicate overall health status based on the data. This dataset can be used for machine learning classification tasks, health analytics, and smart form automation.
Use Cases:
Medical classification model (FIT/UNFIT)
EDA & visualization
Data cleaning & preprocessing
Health dashboards
ML feature engineering
Note: All values are randomly generated for educational purposes and do not represent real individuals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.
Dataset content
OpenChart-SE, version 1 corpus (txt files and dataset.csv)
The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5, as 1-4 were test cases that were not suitable for publication). The EHRs are available in two formats: structured as a .csv file and as separate text files for annotation. Note that flaws in the data were not cleaned up, so that the corpus simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.
Codebook.xlsx
The codebook contains information about each variable used. It is in XLSForm format, which can be reused in several different applications for data collection.
suppl_data_1_openchart-se_form.pdf
OpenChart-SE mock emergency care EHR form.
suppl_data_3_openchart-se_dataexploration.ipynb
This Jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus.
More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
These synthetic patient datasets were created for machine learning (ML) study of lung cancer risk prediction and simulation study of learning health systems.
In subfolder "unconverted": Five populations of 30K patients were generated by the Synthea patient generator. About 1100 lung cancer patients and 3000 control patients (without lung cancer) were selected and their electronic health records (EHR) were processed to data table files ready for machine learning using common algorithms like XGBoost.
In root directory: The five 30K-patient datasets were combined sequentially to form 5 different size datasets, from 30K to 150K patients. The new datasets were resampled to keep all lung cancer patients plus about 3x control patients. The ML-ready table files also had the continuous numeric values converted to categorical values.
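The resampling and discretization steps described above can be sketched as follows. This is only an illustration under stated assumptions: the field names (lung_cancer, age), the age bin edges, and both helper functions are hypothetical, and the patients are random stand-ins rather than Synthea output. The authors' actual pipeline is not reproduced here.

```python
import random


def resample_controls(patients, ratio=3, seed=0):
    """Keep every lung cancer case and draw about `ratio` controls per case."""
    cases = [p for p in patients if p["lung_cancer"]]
    controls = [p for p in patients if not p["lung_cancer"]]
    n_controls = min(len(controls), ratio * len(cases))
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return cases + rng.sample(controls, n_controls)


def to_category(value, edges, labels):
    """Return the label of the first bin whose upper edge is above `value`.

    `labels` has one more entry than `edges`; the last label catches the rest.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


# Random stand-in population: 10 cases among 100 patients (invented numbers).
rng = random.Random(42)
patients = [
    {"id": i, "lung_cancer": i < 10, "age": rng.uniform(40, 90)}
    for i in range(100)
]

sample = resample_controls(patients)
for p in sample:
    # Hypothetical age bins; the source does not state which cut points were used.
    p["age_group"] = to_category(p["age"], edges=[55, 70], labels=["<55", "55-69", "70+"])

print(len(sample), sum(p["lung_cancer"] for p in sample))
```

Fixing the random seed keeps the control sample stable across runs, which matters when datasets of increasing size are meant to be compared.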
Because Synthea patients closely resemble real patients, the Synthea patient data can be used to develop and test ML algorithms and pipelines, and to train researchers. Unlike real patient data, these Synthea datasets can be shared with collaborators anywhere without privacy concerns.
The first LHS simulation study titled "Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data" has been published in Nature Scientific Reports (see https://www.nature.com/articles/s41598-022-23011-4).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Ethics Reference No: 209113723/2023/1
Source code is available on GitHub. The datasets are used to reproduce the same results: https://github.com/DHollenbach/record-linkage-and-deduplication/blob/main/README.md
Abstract:
The research emphasised the vital role of a Master Patient Index (MPI) solution in addressing the challenges public healthcare facilities face in eliminating duplicate patient records and improving record linkage. The study recognised that traditional MPI systems may have limitations in terms of efficiency and accuracy. To address this, the study focused on utilising machine learning techniques to enhance the effectiveness of MPI systems, aiming to support the growing record linkage healthcare ecosystem.
It was essential to highlight that integrating machine learning into MPI systems is crucial for optimising their capabilities. The study aimed to improve data linking and deduplication processes within MPI systems by leveraging machine learning techniques. This emphasis on machine learning represented a significant shift towards more sophisticated and intelligent healthcare technologies. Ultimately, the goal was to ensure safe and efficient patient care, benefiting individuals and the broader healthcare industry.
This research investigated the performance of five machine learning classification algorithms (random forests, extreme gradient boosting, logistic regression, stacking ensemble, and deep multilayer perceptron) for data linkage and deduplication on four datasets. These techniques improved data linking and deduplication for use in an MPI system.
The findings demonstrate the applicability of machine learning models for effective data linkage and deduplication of electronic health records.
The random forest algorithm achieved the best performance (identifying duplicates correctly) based on accuracy, F1-Score, and AUC-score for three datasets (Electronic Practice-Based Research Network (ePBRN): Acc = 99.83%, F1-score = 81.09%, AUC = 99.98%; Freely Extensible Biomedical Record Linkage (FEBRL) 3: Acc = 99.55%, F1-score = 96.29%, AUC = 99.77%; Custom-synthetic: Acc = 99.98%, F1-score = 99.18%, AUC = 99.99%). In contrast, the experimentation on the FEBRL4 dataset revealed that the Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) and logistic regression algorithms outperformed the random forest algorithm. The performance results for the MLP-ANN were (FEBRL4: Acc = 99.93%, F1-score = 96.95%, AUC = 99.97%). For the logistic regression algorithm, the results were (FEBRL4: Acc = 99.99%, F1 = 96.91%, AUC = 99.97%).
In conclusion, the results of this research have significant implications for the healthcare industry, as they are expected to enhance the utilisation of MPI systems and improve their effectiveness in the record linkage healthcare ecosystem. By improving patient record linking and deduplication, healthcare providers can ensure safer and more efficient care, ultimately benefiting patients and the industry.
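The reported figures pair accuracy values near 100% with noticeably lower F1-scores on some datasets. One plausible reason, common in record linkage, is class imbalance: most candidate record pairs are non-matches, so accuracy is dominated by true negatives while F1 depends only on the matches. The sketch below works through that arithmetic. The confusion counts are invented for illustration and are not taken from the study, and pair_metrics is a hypothetical helper.

```python
def pair_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion counts over record pairs."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented example: 100 true duplicate pairs among 100,000 candidate pairs.
m = pair_metrics(tp=80, fp=20, fn=20, tn=99_880)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, accuracy comes out at 0.9996 while F1 is 0.80, which is why F1 (or precision and recall separately) is usually the more informative number for duplicate detection.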
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Get premium quality off-the-shelf EHR dataset to develop better performing machine learning models. Speak to our experts for Electronic Health Records data needs.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains images of various cancer types, compiled for research and analysis purposes. It includes 8 main cancer classes and 26 subclasses, providing a rich resource for medical image classification and machine learning applications.
If you use this dataset in your research or project, please make sure to cite it appropriately. Thanks! ❤️
You can check the DOI Citation section at the bottom.
Obuli Sai Naren. (2022). Multi Cancer Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3415848
Cancer | Classes | Images |
---|---|---|
Acute Lymphoblastic Leukemia | 4 | 20,000 |
Brain Cancer | 3 | 15,000 |
Breast Cancer | 2 | 10,000 |
Cervical Cancer | 5 | 25,000 |
Kidney Cancer | 2 | 10,000 |
Lung and Colon Cancer | 5 | 25,000 |
Lymphoma | 3 | 15,000 |
Oral Cancer | 2 | 10,000 |
Total Images: 130,000
Format: JPEG
Dimensions: 512px × 512px
Each subclass folder contains 5,000 images. The datasets referenced for each cancer type are linked below.
Image files are named <subclass>_<serial_number>.jpg for easy reference.
For more detailed information on the dataset structure, preprocessing, and usage, please refer to the README.md file included in the dataset's main directory.
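As a rough sketch of how the <subclass>_<serial_number>.jpg naming could be used when building a label index, the helper below splits a file name at its last underscore. The function name parse_image_name and the example file names are hypothetical; the actual subclass names should be taken from the dataset's README.md.

```python
from pathlib import Path


def parse_image_name(filename):
    """Split '<subclass>_<serial_number>.jpg' into (subclass, serial_number).

    The split happens at the LAST underscore, so subclass names that
    themselves contain underscores stay intact.
    """
    path = Path(filename)
    if path.suffix.lower() != ".jpg":
        raise ValueError(f"not a .jpg file: {filename}")
    subclass, sep, serial = path.stem.rpartition("_")
    if not sep or not subclass or not serial.isdigit():
        raise ValueError(f"unexpected file name: {filename}")
    return subclass, int(serial)


# Hypothetical file names, used only to exercise the parser.
examples = ["all_benign_0042.jpg", "brain_glioma_1.JPG"]
parsed = [parse_image_name(name) for name in examples]
print(parsed)
```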
Feel free to download, analyze, and contribute! 📊💻
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We assessed the generalizability of machine learning methods using natural language processing (NLP) techniques to detect adverse drug events (ADEs) from clinical narratives in electronic medical records (EMRs). We constructed a new corpus correlating drugs with adverse drug events using 1,394 clinical notes of 47 randomly selected patients who received immune checkpoint inhibitors (ICIs) from 2011 to 2018 at The Ohio State University James Cancer Hospital, annotating 189 drug-ADE relations in single sentences within the medical records. We also used data from Harvard’s publicly available 2018 National Clinical Challenge (n2c2), which includes 505 discharge summaries with annotations of 1,355 single-sentence drug-ADE relations. We applied classical machine learning (support vector machine (SVM)), deep learning (convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM)), and state-of-the-art transformer-based (bidirectional encoder representations from transformers (BERT) and ClinicalBERT) methods trained and tested in the two different corpora and compared performance among them to detect drug–ADE relationships. ClinicalBERT detected drug–ADE relationships better than the other methods when trained using our dataset and tested in n2c2 (ClinicalBERT F-score, 0.78; other methods, F-scores, 0.61–0.73) and when trained using the n2c2 dataset and tested in ours (ClinicalBERT F-score, 0.74; other methods, F-scores, 0.55–0.72). Comparison among several machine learning methods demonstrated the superior performance and, therefore, the greatest generalizability of findings of ClinicalBERT for the detection of drug–ADE relations from clinical narratives in electronic medical records.
Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides facilities to undergo health checks or to increase awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between “Registration” and the number of people taking tests at the camps. Over the last 4 years, they have stored data for ~110,000 registrations.
One of the major costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
Train / Test split:
Camps started on or before 31st March 2006 are considered in Train
Test data is for all camps conducted on or after 1st April 2006.
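The outcome definition and the date-based split above can be sketched in a few lines. The field names (camp_format, camp_start, health_score, stalls_visited) and the three sample registrations are assumptions made for illustration; the competition files use their own column names and layout.

```python
from datetime import date

# Camps that started on or before this date belong to the training set.
TRAIN_CUTOFF = date(2006, 3, 31)


def favourable_outcome(reg):
    """Formats 1 and 2: a health score was obtained. Format 3: at least one stall visited."""
    if reg["camp_format"] in (1, 2):
        return reg["health_score"] is not None
    return reg["stalls_visited"] >= 1


def split_by_camp_start(registrations):
    """Split registrations into train and test by the start date of their camp."""
    train = [r for r in registrations if r["camp_start"] <= TRAIN_CUTOFF]
    test = [r for r in registrations if r["camp_start"] > TRAIN_CUTOFF]
    return train, test


# Invented registrations, one per camp format.
registrations = [
    {"camp_format": 1, "camp_start": date(2005, 11, 2), "health_score": 0.62, "stalls_visited": 0},
    {"camp_format": 2, "camp_start": date(2006, 3, 31), "health_score": None, "stalls_visited": 0},
    {"camp_format": 3, "camp_start": date(2006, 4, 1), "health_score": None, "stalls_visited": 2},
]

train, test = split_by_camp_start(registrations)
labels = [int(favourable_outcome(r)) for r in registrations]
print(len(train), len(test), labels)
```

Splitting by camp date rather than at random mirrors the stated setup, where a model trained on earlier camps is asked to predict attendance at later ones.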
Credits to AV
To share with the data science community to jump start their journey in Healthcare Analytics
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Get high-quality, off-the-shelf transcribed medical datasets to develop better-performing machine learning models. Deep domain expertise. Fast and cost-effective.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a lab-based dataset with videos recording volunteers (medical students) washing their hands as part of a hand-washing monitoring and feedback experiment. The dataset is collected in the Medical Education Technology Center (METC) of Riga Stradins University, Riga, Latvia. In total, 72 participants took part in the experiments, each washing their hands three times, in a randomized order, going through three different hand-washing feedback approaches (user interfaces of a mobile app). The data was annotated in real time by a human operator, in order to give the experiment participants real-time feedback on their performance. There are 212 hand washing episodes in total, each of which is annotated by a single person. The annotations classify the washing movements according to the World Health Organization's (WHO) guidelines by marking each frame in each video with a certain movement code.
This dataset is part of a series of three datasets, all following the same format:
https://zenodo.org/record/4537209 - data collected in Pauls Stradins Clinical University Hospital
https://zenodo.org/record/5808764 - data collected in Jurmala Hospital
https://zenodo.org/record/5808789 - data collected in the Medical Education Technology Center (METC) of Riga Stradins University
Note #1: we recommend that when using this dataset for machine learning, allowances are made for the reaction speed of the human operator labeling the data. For example, the annotations can be expected to be incorrect a short while after the person in the video switches their washing movements.
Application: The intention of this dataset is to serve as a basis for training machine learning classifiers for automated hand washing movement recognition and quality control.
Statistics:
Frame rate: ~16 FPS (slightly variable, as the videos are reconstructed from a sequence of JPG images taken at the maximum frame rate supported by the capturing devices).
Resolution: 640x480
Number of videos: 212
Number of annotation files: 212
Movement codes (in JSON files):
1: Hand washing movement — Palm to palm
2: Hand washing movement — Palm over dorsum, fingers interlaced
3: Hand washing movement — Palm to palm, fingers interlaced
4: Hand washing movement — Backs of fingers to opposing palm, fingers interlocked
5: Hand washing movement — Rotational rubbing of the thumb
6: Hand washing movement — Fingertips to palm
0: Other hand washing movement
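Note #1 suggests allowing for the operator's reaction time: after the person switches movement, the annotation still carries the old code for a short while. A minimal way to act on that is to ignore the frames just before each annotated label change when training. In the sketch below, the per-frame codes are assumed to be already loaded into a plain Python list (the JSON layout is not described here), and the one-second reaction time and the helper name mask_late_labels are assumptions rather than values from the dataset authors.

```python
# Movement codes as listed above.
MOVEMENTS = {
    0: "Other hand washing movement",
    1: "Palm to palm",
    2: "Palm over dorsum, fingers interlaced",
    3: "Palm to palm, fingers interlaced",
    4: "Backs of fingers to opposing palm, fingers interlocked",
    5: "Rotational rubbing of the thumb",
    6: "Fingertips to palm",
}


def mask_late_labels(codes, fps=16, reaction_s=1.0):
    """Return a copy of `codes` with the frames just before each label change set to None.

    Those frames are the ones most likely to still show the previous label
    although the movement has already changed.
    """
    margin = int(round(fps * reaction_s))
    masked = list(codes)
    for i in range(1, len(codes)):
        if codes[i] != codes[i - 1]:
            for j in range(max(0, i - margin), i):
                masked[j] = None
    return masked


# Toy sequence at 2 FPS so the effect is easy to see.
codes = [1] * 5 + [2] * 5
masked = mask_late_labels(codes, fps=2, reaction_s=1.0)
print(masked)
```

Frames set to None can then be skipped when building training batches; the margin is worth tuning, since the real delay is not stated.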
Note #2: The original dataset of JPG images is available upon request. There are 13 annotation classes in the original dataset: for each of the six washing movements defined by the WHO, "correct" and "incorrect" execution is marked with two different labels. In this published dataset, all incorrect executions are marked with code 0, as "other" washing movement.
Acknowledgments: The dataset collection was funded by the Latvian Council of Science project: "Automated hand washing quality control and quality evaluation system with real-time feedback", No: lzp - Nr. 2020/2-0309.
References: For more detailed information, see this article, describing a similar dataset collected in a different project:
M. Lulla, A. Rutkovskis, A. Slavinska, A. Vilde, A. Gromova, M. Ivanovs, A. Skadins, R. Kadikis, A. Elsts. Hand-Washing Video Dataset Annotated According to the World Health Organization’s Hand-Washing Guidelines. Data. 2021; 6(4):38. https://doi.org/10.3390/data6040038
Contact information: atis.elsts@edi.lv
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.
This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].
Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.
Usage: We recommend using the demo code and tutorials available on our GitHub repository.
Citation: If you find this work useful, please consider citing us:
@article{disalvo2024medmnist,
  title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
  author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
  journal={arXiv preprint arXiv:2406.17536},
  year={2024}
}
Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This dataset presents a curated collection of preprocessed and labeled clinical notes derived from the MIMIC-IV-Note database. The primary aim of this resource is to facilitate the development and training of machine learning models focused on summarizing brief hospital courses (BHC) from clinical discharge notes.
The dataset contains 270,033 meticulously cleaned and standardized clinical notes containing an average token length of 2,267, ensuring usability for machine learning (ML) applications. Each clinical note is paired with a corresponding BHC summary, providing a robust foundation for supervised learning tasks. The preprocessing pipeline employed uses regular expressions to address common issues in the raw clinical text, such as special characters, extraneous whitespace, inconsistent formatting, and irrelevant text, to produce a high-quality, structured dataset with separated clinical note sections through appropriate headings.
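To give a feel for the kind of regular-expression cleanup described above, here is a small sketch that handles special characters and extraneous whitespace. It is not the authors' pipeline: the specific patterns, the function name clean_note, and the sample text are all assumptions, and the section-heading separation step is not covered.

```python
import re


def clean_note(text):
    """Normalize special characters and whitespace in a raw clinical note."""
    # Replace control characters and other non-printable/non-ASCII bytes with a space.
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)
    # Collapse runs of spaces into one.
    text = re.sub(r" +", " ", text)
    # Trim spaces around line breaks.
    text = re.sub(r" *\n *", "\n", text)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Invented snippet with tabs, a form feed, and excess blank lines.
raw = "Brief Hospital Course:\t\tPt  admitted\x0c \n\n\n\nDischarged   home. "
cleaned = clean_note(raw)
print(repr(cleaned))
```

Dropping every non-ASCII character is a blunt choice that would also remove legitimate symbols such as the degree sign; a production pipeline would likely map those explicitly instead.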
By offering this resource, we aim to support healthcare professionals and researchers in their efforts to enhance patient care through the automation of BHC summarization. This dataset is ideal for exploring various NLP techniques, developing predictive models, and improving the efficiency and accuracy of clinical documentation practices. We invite the research community to utilize this dataset to advance the field of medical informatics and contribute to better health outcomes.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A new dataset of EHR data from patients suspected of having pulmonary embolism (PE), including PE-positive patients, healthy patients, and COVID-19 patients suspected of PE. The dataset includes the PE diagnosis based on CTA imaging results, biographic data, vital signs, laboratory test results, past medical history, and medications. This dataset can be used to advance PE studies based on machine learning and artificial intelligence.
Electronic health record data is the purest type of electronic clinical data, obtained at the point of care at a medical facility, hospital, clinic, or practice. Often referred to as the electronic medical record (EMR), it is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.
Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.
About Dataset:
333 scholarly articles cite this dataset.
Unique identifier: DOI
Dataset updated: 2023
Authors: Haoyang Mi
This dataset contains two data files:
1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time
2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS
Feel free to share your thoughts and analysis of this dataset in a notebook. It can serve as the basis for interesting and valuable ML projects. Thanks for your attention.