2 datasets found

medinstruct
huggingface.co
Updated Oct 13, 2023
Cite
Collin Heenan (2023). medinstruct [Dataset]. https://huggingface.co/datasets/cxllin/medinstruct
Dataset updated
Oct 13, 2023
Authors
Collin Heenan
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset Sources

Repository: https://github.com/jind11/MedQA
Paper: https://arxiv.org/abs/2009.13081

Citation

@article{jin2020disease,
  title={What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams},
  author={Jin, Di and Pan, Eileen and Oufattole, Nassim and Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter},
  journal={arXiv preprint arXiv:2009.13081},
  year={2020}
}

MedQA-USMLE
kaggle.com
Updated Jul 22, 2023
Cite
Moaaz Tameer (2023). MedQA-USMLE [Dataset]. https://www.kaggle.com/datasets/moaaztameer/medqa-usmle/data
Dataset updated
Jul 22, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Moaaz Tameer
Description
(Taken directly from the GitHub repository.) This is the data for the paper: Jin, Di, et al. "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams." arXiv preprint arXiv:2009.13081 (2020). If you would like to use the data, please cite the paper.

Data

The data, which contains both the QAs and the textbooks, can be downloaded from this Google Drive folder. Some details of the data are explained below:

For QAs, we have three sources: US, Mainland of China, and Taiwan District, each placed in its own folder. All QA files are in JSONL format, where each line is one data sample stored as a dict. The "XX_qbank.jsonl" files contain all data samples, and we also provide an official random split into train, dev, and test sets. The files in the "metamap" folders contain medical-related phrases extracted with the MetaMap tool.
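As an illustration of the JSONL layout described above, here is a minimal Python sketch that reads such a file into a list of dicts. The file name `demo_qbank.jsonl` and the field names `question`, `options`, and `answer` are hypothetical placeholders chosen for this example; the description above does not document the actual schema, so check a real file before relying on any key.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return a list of dicts, one per non-empty line of a JSONL file."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    return samples


# Hypothetical two-sample file. The field names are illustrative
# assumptions, not the dataset's documented schema.
demo_lines = [
    {"question": "Example question 1?", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "Example question 2?", "options": {"A": "x", "B": "y"}, "answer": "B"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_qbank.jsonl"
    # Write one JSON object per line, as the description specifies.
    demo_path.write_text(
        "\n".join(json.dumps(s, ensure_ascii=False) for s in demo_lines) + "\n",
        encoding="utf-8",
    )
    samples = read_jsonl(demo_path)

print(len(samples), "samples loaded")
```

The same `read_jsonl` helper should apply unchanged to the train, dev, and test split files, since they are described as sharing the one-dict-per-line format.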

For QAs, we also include the "4_options" version for US and Mainland of China, since we reported results for 4 options in the paper.

For textbooks, we have two languages: English and simplified Chinese. For simplified Chinese, we provide two kinds of splitting: one by sentences and the other by paragraphs.

MIT License

Copyright (c) 2022 Di Jin

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

475 scholarly articles cite the medinstruct dataset (per Google Scholar).