6 datasets found
  1. MedQA-USMLE

    • kaggle.com
    Updated Jul 22, 2023
    Cite
    Moaaz Tameer (2023). MedQA-USMLE [Dataset]. https://www.kaggle.com/datasets/moaaztameer/medqa-usmle/data
    Dataset updated
    Jul 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Moaaz Tameer
    Description

    (This is taken directly from the github) This is the data for the paper: Jin, Di, et al. "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams." arXiv preprint arXiv:2009.13081 (2020). If you would like to use the data, please cite the paper.

    Data: The data, comprising both the QAs and the textbooks, can be downloaded from this google drive folder. Some details of the data are explained below:

    For QAs, we have three sources: US, Mainland of China, and Taiwan District, each placed in its own folder. All QA files are in jsonl file format, where each line is one data sample stored as a dict. The "XX_qbank.jsonl" files contain all data samples, and we also provide an official random split into train, dev, and test sets. The files in the "metamap" folders contain medical-related phrases extracted with the Metamap tool.

    For QAs, we also include the "4_options" version for US and Mainland of China, since we reported results for 4 options in the paper.

    For textbooks, we have two languages: English and simplified Chinese. For simplified Chinese, we provide two kinds of sentence splitting: one is split by sentences, and the other is split by paragraphs.
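Since the QA files are described as jsonl with one dict per line, a reader for that layout can be sketched as follows. This is only an illustration under assumptions: the file name `example_qbank.jsonl` and the keys `question` and `answer` are made up for the demo and are not confirmed to match the real schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between samples
                yield json.loads(line)


def write_sample(path: Path) -> None:
    """Write two made-up samples; the keys are hypothetical, not the real schema."""
    samples = [
        {"question": "Placeholder question 1?", "answer": "A"},
        {"question": "Placeholder question 2?", "answer": "C"},
    ]
    with path.open("w", encoding="utf-8") as handle:
        for sample in samples:
            handle.write(json.dumps(sample) + "\n")


def load_demo() -> list[dict]:
    """Round-trip the made-up samples through a temporary .jsonl file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example_qbank.jsonl"  # hypothetical file name
        write_sample(path)
        return list(read_jsonl(path))


if __name__ == "__main__":
    for record in load_demo():
        print(sorted(record.keys()))
```

Reading line by line keeps memory use flat, which may matter for the larger question banks; pointing `read_jsonl` at a real file from the download should work the same way, whatever keys that file actually uses.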

    MIT License

    Copyright (c) 2022 Di Jin

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

  2. Patient Insights: 2.8Lakh Drug & Condition Reviews

    • kaggle.com
    Updated Aug 18, 2024
    Cite
    Mukesh Kumar (2024). Patient Insights: 2.8Lakh Drug & Condition Reviews [Dataset]. http://doi.org/10.34740/kaggle/dsv/9196455
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Kaggle
    Authors
    Mukesh Kumar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Drug Information: The dataset likely includes the generic and brand names of medications, allowing researchers to analyze trends across different formulations.

    Condition Specificity: While "Bladder Infection" might be a category, the data could contain more specific diagnoses like "Urinary Tract Infection (UTI)" or "Cystitis." This granularity allows for targeted analysis within conditions.
    Sentiment Analysis: The text content of reviews can be analyzed to understand patient sentiment towards the medication. This goes beyond the rating by capturing positive experiences, concerns about side effects, and overall satisfaction.
    Side Effect Reporting: Reviews often mention side effects experienced by patients. Analyzing this data can help identify common side effects and potential drug interactions.
    

    Use Cases:

    Comparative Effectiveness Research: By comparing patient experiences with different medications for the same condition, researchers can gain insights into their relative effectiveness and tolerability.
    Patient-Centered Drug Development: Understanding patient perspectives on existing medications can inform the development of new drugs with improved side effect profiles and better patient experiences.
    Pharmacovigilance: The dataset can be a valuable source of real-world data on medication safety, helping identify potential adverse effects that may not be captured in clinical trials.
    Personalized Medicine: Analyzing patient reviews alongside their medical history could lead to the development of tools for personalized medicine, tailoring treatment plans based on individual responses to medications.
    Natural Language Processing (NLP): Techniques like NLP can be used to extract insights from the text content. This could involve identifying patterns in patient experiences, summarizing common themes, or even building chatbots that answer patient questions about medications.
    

    Limitations:

    Data Accuracy: Patient reviews might not always be accurate or complete. Users might misreport side effects or have pre-existing biases.
    Selection Bias: People with strong positive or negative experiences might be more likely to leave reviews, skewing the data towards extremes.
    Anonymity: While anonymized, the data may not capture the full picture of a patient's medical history, which could influence their experience with a medication.
    

    Overall, this patient review dataset offers a unique window into the real-world experiences of patients with various medications. By analyzing this data responsibly and considering its limitations, researchers and healthcare professionals can gain valuable insights to improve patient care and drug development.

  3. Skin diseases image dataset

    • kaggle.com
    zip
    Updated Aug 16, 2021
    Cite
    Ismail Hossain (2021). Skin diseases image dataset [Dataset]. https://www.kaggle.com/datasets/ismailpromus/skin-diseases-image-dataset
    Available download formats: zip (5568507391 bytes)
    Dataset updated
    Aug 16, 2021
    Authors
    Ismail Hossain
    Description

    Dataset

    This dataset was created by Ismail Hossain

    Released under Data files © Original Authors


  4. Data from: Cuff-Less Blood Pressure Estimation

    • kaggle.com
    • paperswithcode.com
    zip
    Updated Jun 3, 2017
    Cite
    Mohammad Kachuee (2017). Cuff-Less Blood Pressure Estimation [Dataset]. https://www.kaggle.com/mkachuee/BloodPressureDataset
    Available download formats: zip (4938221622 bytes)
    Dataset updated
    Jun 3, 2017
    Authors
    Mohammad Kachuee
    Description

    Data Set Information:

    The main goal of this data set is to provide clean and valid signals for designing cuff-less blood pressure estimation algorithms. The raw electrocardiogram (ECG), photoplethysmograph (PPG), and arterial blood pressure (ABP) signals were originally collected from physionet.org, and some preprocessing and validation were then performed on them. (For more information about the process, please refer to our paper.)

    Attribute Information:

    This database consists of a cell array of matrices; each cell is one record part. In each matrix, each row corresponds to one signal channel:

    1: PPG signal, FS=125Hz; photoplethysmograph from fingertip

    2: ABP signal, FS=125Hz; invasive arterial blood pressure (mmHg)

    3: ECG signal, FS=125Hz; electrocardiogram from channel II

    Note: the dataset is split into multiple parts to make it easier to load on machines with low memory. Each cell is a record. There might be more than one record per patient, and it is not possible to tell which records share a patient. However, records of the same patient appear next to each other. N-fold cross-validation is suggested to reduce the chance of the training set being contaminated by test patients.
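Because records of the same patient are said to sit next to each other, one way to follow the cross-validation advice is to cut the record indices into contiguous blocks rather than shuffling them. The sketch below is an assumption-laden illustration, not the authors' procedure: the helper names are hypothetical, loading the cell array is not shown, and a patient whose records straddle a block boundary can still appear on both sides, so leakage is reduced rather than eliminated.

```python
FS_HZ = 125  # sampling rate stated for all three channels


def duration_seconds(n_samples: int, fs_hz: int = FS_HZ) -> float:
    """Length of a record in seconds, given its number of samples per channel."""
    return n_samples / fs_hz


def contiguous_folds(n_records: int, n_folds: int) -> list[range]:
    """Split record indices 0..n_records-1 into n_folds contiguous blocks."""
    if not 1 <= n_folds <= n_records:
        raise ValueError("n_folds must be between 1 and n_records")
    base, extra = divmod(n_records, n_folds)
    folds = []
    start = 0
    for fold_index in range(n_folds):
        # The first `extra` folds take one additional record each.
        size = base + (1 if fold_index < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


def train_test_indices(
    n_records: int, n_folds: int, test_fold: int
) -> tuple[list[int], list[int]]:
    """Use one contiguous block as the test set and the rest as the training set."""
    folds = contiguous_folds(n_records, n_folds)
    test = list(folds[test_fold])
    train = [i for k, fold in enumerate(folds) if k != test_fold for i in fold]
    return train, test


if __name__ == "__main__":
    train, test = train_test_indices(n_records=10, n_folds=3, test_fold=1)
    print("train:", train)
    print("test:", test)
    print("seconds in 1000 samples:", duration_seconds(1000))
```

Shuffling records before splitting would scatter neighbouring records across folds, which is exactly the contamination the note warns about; keeping blocks contiguous limits the overlap to the block edges.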

    Relevant Papers:

    M. Kachuee, M. M. Kiani, H. Mohammadzade, M. Shabany, Cuff-Less High-Accuracy Calibration-Free Blood Pressure Estimation Using Pulse Transit Time, IEEE International Symposium on Circuits and Systems (ISCAS'15), 2015.

    A. Goldberger, L. Amaral, L. Glass, J. Hausdorff, P. Ivanov, R. Mark, J. Mietus, G. Moody, C. Peng and H. Stanley, “Physiobank, physiotoolkit, and physionet components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. 215–220, 2000.

    Citation Request:

    If you found this data set useful please cite the following:

    M. Kachuee, M. M. Kiani, H. Mohammadzade, M. Shabany, Cuff-Less High-Accuracy Calibration-Free Blood Pressure Estimation Using Pulse Transit Time, IEEE International Symposium on Circuits and Systems (ISCAS'15), 2015.

    M. Kachuee, M. M. Kiani, H. Mohammadzadeh, M. Shabany, Cuff-Less Blood Pressure Estimation Algorithms for Continuous Health-Care Monitoring, IEEE Transactions on Biomedical Engineering, 2016.

  5. KJM ECoG - faces_basic

    • kaggle.com
    zip
    Updated Nov 28, 2019
    Cite
    Chadwick Boulay (2019). KJM ECoG - faces_basic [Dataset]. https://www.kaggle.com/datasets/cboulay/kjm-ecog-faces-basic
    Available download formats: zip (2832591791 bytes)
    Dataset updated
    Nov 28, 2019
    Authors
    Chadwick Boulay
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Original location: https://exhibits.stanford.edu/data/catalog/zk881ps0522

    Electrophysiological data from implanted electrodes in the human brain are rare, and therefore scientific access to it has remained somewhat exclusive. Here we present a freely-available curated library of implanted electrocorticographic (ECoG) data and analyses for 16 benchmark behavioral experiments, with 204 individual datasets from 34 patients made with the same amplifiers (at the same sampling rate and filter settings). In every case, electrode positions have been carefully registered to brain anatomy. A large set of fully-commented analysis scripts to interpret these data using modern techniques is embedded in the library alongside the data. All data, anatomic correlations, and analysis files (MATLAB code) are in a common, intuitive file structure at https://searchworks.stanford.edu/view/zk881ps0522. The library may be used as course material or serve as a starter package for researchers early in their career or for established groups, to modify the analyses and re-apply them in new settings.

    This dataset comprises preprocessed forms of the data from the "faces basic" experiment in that study.

    Also see https://www.kaggle.com/cboulay/kjm-ecog-fingerflex

  6. FSboard

    • kaggle.com
    zip
    Updated Feb 25, 2025
    Cite
    Google Research (2025). FSboard [Dataset]. https://www.kaggle.com/googleai/fsboard
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Google (http://google.com/)
    Authors
    Google Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    FSboard is an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x.

    We previously hosted a Kaggle competition using MediaPipe Holistic landmarks for the FSboard data; this release now includes the underlying RGB videos and val/test sets.

    See our paper for a more complete exposition of the dataset: FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones

    The dataset consists of several categories of synthetically generated phrases (examples in the table below, not real PII) recorded as video clips of ASL fingerspelling (example frames in the figure below, faces blurred here but not in the dataset).

    Directory   Category            Example
    "dmk"       MacKenzie phrases   prevailing wind from the east
    "daun"      URLs                /dfinance/list.asp?id=418/
                Addresses           9841 gritt hill
                Phone Numbers       166-893-6320
                Names               mohammed kim

    Figure (example frames, faces blurred): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20954272%2F2a7512937441315b8ddf742e9d02195d%2Ffs-blurred.png?generation=1739550608040254&alt=media

    Responsible Use

    While facial expressions are an essential component of sign language and are therefore included in the dataset, we ask that you blur the signers’ faces when publicizing examples. You should not attempt to reidentify the signers or use their likenesses to generate and publish other content (deepfakes). Please be culturally respectful of the Deaf/Hard of Hearing community in your use of the dataset and do not exaggerate the significance of improving ASL fingerspelling performance, which is only one small component of American Sign Language.

    Landmarks

    Landmarks were extracted using MediaPipe Holistic. They are provided as tf.train.SequenceExample entries in TFRecordio files. There is also a script which converts these TFRecordio files to Parquet files in a format similar to the one used in the previous Kaggle Competition. Since each entry in the Parquet file represents a single landmark frame, the script also produces a supplemental csv file with video-level information.

    Sensitive Content Filtering

    The synthetic URLs generated in the dataset were created by recombining parts from real URLs. As such, the full breadth of content available on the internet is represented. It is important not to infantilize the Deaf community, and therefore important to ensure that any application in this space is able to produce arbitrary output. Imagine the frustration when your keyboard r*****s to produce certain ducking words. However, it's also important to ensure that an application doesn't easily produce offensive unintended content. In an effort to facilitate people making sane decisions with this data, we've run a sensitive content filter and keyword searches on the phrases used and manually reviewed the result to produce a boolean tag "sensitiveContent", which is available in the json files. Please ensure that the Deaf community is involved in the creation of any applications targeted to them.
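As a rough illustration of how the boolean "sensitiveContent" tag might be used, the sketch below filters a list of phrase entries. Only the tag name comes from the description; the surrounding layout (a JSON list of objects with a `phrase` key) and the helper name `drop_sensitive` are assumptions made for the example, and the real json files may be organized differently.

```python
import json

# Made-up entries; only the "sensitiveContent" key is taken from the description.
SAMPLE_JSON = """
[
  {"phrase": "prevailing wind from the east", "sensitiveContent": false},
  {"phrase": "placeholder flagged phrase", "sensitiveContent": true},
  {"phrase": "9841 gritt hill", "sensitiveContent": false}
]
"""


def drop_sensitive(entries: list[dict], assume_sensitive_if_missing: bool = True) -> list[dict]:
    """Keep only entries whose "sensitiveContent" tag is explicitly false.

    Entries without the tag are dropped by default, which is the cautious choice.
    """
    kept = []
    for entry in entries:
        flagged = entry.get("sensitiveContent", assume_sensitive_if_missing)
        if not flagged:
            kept.append(entry)
    return kept


if __name__ == "__main__":
    entries = json.loads(SAMPLE_JSON)
    for entry in drop_sensitive(entries):
        print(entry["phrase"])
```

Whether to filter at all is an application decision: the text argues that applications should be able to produce arbitrary output, so a tag-based filter like this is probably better suited to choosing which examples to display than to restricting what a recognizer can emit.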

    Attribution

    If you use FSboard in your work, please cite:

    @misc{georg2024fsboard3millioncharacters,
      title={FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones},
      author={Manfred Georg and Garrett Tanzer and Saad Hassan and Maximus Shengelia and Esha Uboweja and Sam Sepah and Sean Forbes and Thad Starner},
      year={2024},
      eprint={2407.15806},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15806},
    }


MedQA-USMLE

A Large-scale Open Domain Question Answering Dataset from Medical Exams

180 scholarly articles cite this dataset (View in Google Scholar)