100+ datasets found

NIH NCBI PubMed Central (PMC) Article Datasets - Full-Text Biomedical and...
registry.opendata.aws
Updated Jul 4, 2021
Cite
National Library of Medicine (NLM) (2021). NIH NCBI PubMed Central (PMC) Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS [Dataset]. https://registry.opendata.aws/ncbi-pmc/
Dataset updated
Jul 4, 2021
Dataset provided by
National Library of Medicine (NLM) (http://nlm.nih.gov/)
Description
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type:

The PMC Open Access (OA) Subset, which includes all articles in PMC with a machine-readable Creative Commons license

The Author Manuscript Dataset, which includes all articles collected under a funder policy in PMC and made available in machine-readable formats for text mining

These datasets collectively span more than half of PMC’s total collection of full-text articles. PMC enables access to these datasets to expand the impact of open access and publicly-funded research; enable greater machine learning across the spectrum of scientific research; reach new audiences; and open new doors for discovery. The bucket in this registry contains individual articles in NISO Z39.96-2015 JATS XML format as well as in plain text as extracted from the XML. The bucket is updated daily with new and updated articles. Also included are file lists that include metadata for articles in each dataset.
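As a rough illustration of working with the JATS XML articles described above, the sketch below pulls the title and body paragraphs out of one article using only the Python standard library. The inline sample document and the function name `extract_title_and_text` are made up for illustration; real PMC files are much richer (nested sections, inline markup, metadata), so treat this as a starting point rather than a complete extractor.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a JATS article (not a real PMC record).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An example title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph with <italic>inline</italic> markup.</p>
      <p>Second paragraph.</p>
    </sec>
  </body>
</article>"""


def extract_title_and_text(xml_string):
    """Return (title, [paragraph, ...]) from a JATS article string."""
    root = ET.fromstring(xml_string)
    title_el = root.find("./front/article-meta/title-group/article-title")
    # itertext() flattens inline elements such as <italic> into plain text.
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    body = root.find("./body")
    paragraphs = []
    if body is not None:
        for p in body.iter("p"):
            # Collapse runs of whitespace left over from the XML layout.
            paragraphs.append(" ".join("".join(p.itertext()).split()))
    return title, paragraphs


title, paragraphs = extract_title_and_text(SAMPLE_JATS)
```

This is roughly the step that turns the XML form into the plain-text form mentioned above, although the extraction actually used for the bucket may differ.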
Blog | Leveraging Open Data Science to Accelerate Innovation at NIH and...
catalog.data.gov
data.virginia.gov
+1 more
Updated Mar 26, 2025
Cite
Elizabeth Kittrie (2025). Blog | Leveraging Open Data Science to Accelerate Innovation at NIH and Beyond [Dataset]. https://catalog.data.gov/dataset/blog-leveraging-open-data-science-to-accelerate-innovation-at-nih-and-beyond
Dataset updated
Mar 26, 2025
Dataset provided by
Elizabeth Kittrie
Description
This blog was posted by Elizabeth Kittrie on November 30, 2016. It was written by Elizabeth Kittrie, Senior Advisor for Open Innovation & Policy and Joe Bonner, Health Scientist-AAAS Fellow.
NCBI Datasets
datadiscovery.nlm.nih.gov
healthdata.gov
+2 more
csv, xlsx, xml
Updated Feb 9, 2022
Cite
(2022). NCBI Datasets [Dataset]. https://datadiscovery.nlm.nih.gov/Molecular-biology-Genetics/NCBI-Datasets-BETA-/3br9-y2tm
Available download formats: csv, xml, xlsx
Dataset updated
Feb 9, 2022
Description
NCBI Datasets is a one-stop shop for finding, browsing, and downloading genomic data. Find and download taxonomy, genome, gene, transcript, and protein data; the resource also covers installation of the NCBI Datasets command-line tools.
Data from: Public sharing of research datasets: a pilot study of...
datadryad.org
datasetcatalog.nlm.nih.gov
+1 more
zip
Updated May 26, 2011
Cite
Heather A. Piwowar; Wendy W. Chapman (2011). Public sharing of research datasets: a pilot study of associations [Dataset]. http://doi.org/10.5061/dryad.3td2f
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.3td2f
Dataset updated
May 26, 2011
Dataset provided by
Dryad
Authors
Heather A. Piwowar; Wendy W. Chapman
Time period covered
May 26, 2011
Description
Microarray study attributes and data sharing status: 397 rows, one row for each study that created gene expression microarray data as identified by Ochsner et al. (doi:10.1038/nmeth1208-991). Attributes of each study are included in 23 columns. The dependent variable is called is_data_shared. File: Piwowar_Metrics2009_rawdata.csv

Statistical analysis R script: statistical R script for analysis and graphics as presented in the paper. File: Piwowar_Metrics2009_statistics.R
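A minimal sketch of a first look at the raw data file, tallying the dependent variable. Only the column name `is_data_shared` comes from the description; the second column name, the 0/1 coding, and the helper name `count_shared` are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def count_shared(csv_file, column="is_data_shared"):
    """Tally the values of the dependent variable in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Synthetic three-row stand-in; the real file has 397 rows and 23 columns.
# The "pmid" column and the 0/1 coding are assumptions for illustration.
demo = io.StringIO("pmid,is_data_shared\n111,1\n222,0\n333,1\n")
counts = count_shared(demo)
share_rate = counts["1"] / sum(counts.values())
```

With the real file, the same function could be called on `open("Piwowar_Metrics2009_rawdata.csv", newline="")`, after checking how the column is actually coded.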
Study of Women's Health Across the Nation (SWAN) Public Use Data
catalog.data.gov
healthdata.gov
+2 more
Updated Jul 26, 2023
Cite
National Institutes of Health (NIH) (2023). Study of Women's Health Across the Nation (SWAN) Public Use Data [Dataset]. https://catalog.data.gov/dataset/study-of-womens-health-across-the-nation-swan-public-use-data
Dataset updated
Jul 26, 2023
Dataset provided by
National Institutes of Health (NIH)
Description
The SWAN Public Use Datasets provide access to longitudinal data describing the physical, biological, psychological, and social changes that occur during the menopausal transition. Data collected from 3,302 SWAN participants from Baseline through the 10th Annual Follow-Up visit are currently available to the public. Registered users are able to download datasets in a variety of formats, search variables and view recent publications.
Open-i
catalog.data.gov
datadiscovery.nlm.nih.gov
+3 more
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Open-i [Dataset]. https://catalog.data.gov/dataset/open-i
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
The Open-i service provides search and retrieval of abstracts and images (including charts, graphs, clinical images, etc.) from the open source literature and biomedical image collections. Searching may be done by text queries as well as by query images.
NIH Data Sharing Repositories
catalog.data.gov
healthdata.gov
+1 more
Updated Jul 25, 2025
Cite
National Institutes of Health (NIH), Department of Health & Human Services (2025). NIH Data Sharing Repositories [Dataset]. https://catalog.data.gov/dataset/nih-data-sharing-repositories
Dataset updated
Jul 25, 2025
Dataset provided by
United States Department of Health and Human Services (http://www.hhs.gov/)
Description
A list of NIH-supported repositories that accept submissions of appropriate scientific research data from biomedical researchers. It includes resources that aggregate information about biomedical data and information sharing systems. Links are provided to information about submitting data to and accessing data from the listed repositories. Additional information about the repositories and points of contact for further information or inquiries can be found on the websites of the individual repositories.
Data from: Visible Human Project
healthdata.gov
datadiscovery.nlm.nih.gov
+3 more
csv, xlsx, xml
Updated Mar 1, 2023
+ more versions
Cite
datadiscovery.nlm.nih.gov (2023). Visible Human Project [Dataset]. https://healthdata.gov/NIH/Visible-Human-Project/krti-uwg9
Available download formats: xlsx, xml, csv
Dataset updated
Mar 1, 2023
Dataset provided by
datadiscovery.nlm.nih.gov
Description
The NLM Visible Human Project® has created publicly available, complete, anatomically detailed, three-dimensional representations of a human male body and a human female body. Specifically, the VHP provides a public-domain library of cross-sectional cryosection, CT, and MRI images obtained from one male cadaver and one female cadaver. The Visible Man data set was publicly released in 1994 and the Visible Woman in 1995.

https://www.nlm.nih.gov/research/visible/visible_human.html
RxNorm Data
kaggle.com
bioregistry.io
zip
Updated Mar 20, 2019
Cite
National Library of Medicine (2019). RxNorm Data [Dataset]. https://www.kaggle.com/datasets/nlm-nih/nlm-rxnorm
Available download formats: zip (0 bytes)
Dataset updated
Mar 20, 2019
Dataset authored and provided by
National Library of Medicine
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

RxNorm is the name of a US-specific terminology in medicine that contains all medications available on the US market. Source: https://en.wikipedia.org/wiki/RxNorm

RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, Gold Standard Drug Database, and Multum. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary. Source: https://www.nlm.nih.gov/research/umls/rxnorm/

Content

RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs, defined as the combination of {ingredient + strength + dose form}. In addition to the naming system, the RxNorm dataset also provides structured information such as brand names, ingredients, drug classes, and so on, for each clinical drug. Typical uses of RxNorm include navigating between names and codes among different drug vocabularies and using information in RxNorm to assist with health information exchange/medication reconciliation, e-prescribing, drug analytics, formulary development, and other functions.

This public dataset includes multiple data files originally released in RxNorm Rich Release Format (RXNRRF) that are loaded into BigQuery tables. The data is updated and archived on a monthly basis.

The following tables are included in the RxNorm dataset:

RXNCONSO contains concept and source information

RXNREL contains information regarding relationships between entities

RXNSAT contains attribute information

RXNSTY contains semantic information

RXNSAB contains source info

RXNCUI contains retired rxcui codes

RXNATOMARCHIVE contains archived data

RXNCUICHANGES contains concept changes
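For readers working from the original release files rather than the BigQuery tables, the sketch below parses RXNCONSO rows and looks up concept identifiers by name. It assumes the files are pipe-delimited with a trailing pipe and that RXNCONSO uses the column order shown; neither detail is stated in this description, so verify both against the RxNorm release documentation. The sample rows and the helper names are invented.

```python
# Column order assumed from the standard RXNCONSO.RRF layout; verify against
# the release documentation before relying on it.
RXNCONSO_COLUMNS = [
    "RXCUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "RXAUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_rxnconso_line(line):
    """Split one pipe-delimited RRF line into a column -> value dict."""
    fields = line.rstrip("\n").split("|")
    # A trailing pipe yields one empty extra field; drop it if present.
    if len(fields) == len(RXNCONSO_COLUMNS) + 1 and fields[-1] == "":
        fields = fields[:-1]
    return dict(zip(RXNCONSO_COLUMNS, fields))


def find_rxcuis(lines, name):
    """Return the set of RXCUI codes whose name string matches `name`."""
    wanted = name.lower()
    return {
        row["RXCUI"]
        for row in map(parse_rxnconso_line, lines)
        if row["STR"].lower() == wanted
    }


# Made-up rows for illustration; the codes below are not real RxNorm data.
demo_lines = [
    "0001|ENG||||||A1||||RXNORM|IN|0001|Examplol||N||\n",
    "0002|ENG||||||A2||||RXNORM|IN|0002|Samplamine||N||\n",
]
codes = find_rxcuis(demo_lines, "examplol")
```

A lookup like this is one way to approach the first "Inspiration" question (RXCUI codes for a list of drug names); questions about dose forms would additionally need the relationships in RXNREL.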

Update Frequency: Monthly

Fork this kernel to get started with this dataset.

Acknowledgements

https://www.nlm.nih.gov/research/umls/rxnorm/

https://bigquery.cloud.google.com/dataset/bigquery-public-data:nlm_rxnorm

https://cloud.google.com/bigquery/public-data/rxnorm

Dataset Source: Unified Medical Language System RxNorm. The dataset is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. This dataset uses publicly available data from the U.S. National Library of Medicine (NLM), National Institutes of Health, Department of Health and Human Services; NLM is not responsible for the dataset, does not endorse or recommend this or any other dataset.

Banner Photo by @freestocks from Unsplash.

Inspiration

What are the RXCUI codes for the ingredients of a list of drugs?

Which ingredients have the most variety of dose forms?

In what dose forms is the drug phenylephrine found?

What are the ingredients of the drug labeled with the generic code number 072718?
NIH Data and Specimen Hub (DASH)
catalog.data.gov
datasets.ai
Updated Mar 23, 2024
Cite
U.S. EPA Office of Research and Development (ORD) (2024). NIH Data and Specimen Hub (DASH) [Dataset]. https://catalog.data.gov/dataset/nih-data-and-specimen-hub-dash
Dataset updated
Mar 23, 2024
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
"The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies.". This dataset is associated with the following publication: Deluca, N., K. Thomas, A. Mullikin, R. Slover, L. Stanek, D. Pilant, and E. Hubal. Geographic and demographic variability in serum PFAS concentrations for pregnant women in the United States. Journal of Exposure Science and Environmental Epidemiology. Nature Publishing Group, London, UK, 33(1): 710-724, (2023).
NIH CheXmask Database: a dataset of anatomical seg
kaggle.com
zip
Updated Jul 22, 2025
Cite
Priyadarshi Mukhopadhyay (2025). NIH CheXmask Database: a dataset of anatomical seg [Dataset]. https://www.kaggle.com/datasets/poeticmage/chexmask-database-a-dataset-of-anatomical-segment
Available download formats: zip (944480270 bytes)
Dataset updated
Jul 22, 2025
Authors
Priyadarshi Mukhopadhyay
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
CheXmask Database: a large-scale dataset of anatomical segmentation masks for chest x-ray images

This particular dataset contains only the segmentations for the NIH dataset, ChestX-ray8.

Nicolas Gaggion, Candelaria Mosquera, Martina Aineseder, Lucas Mansilla, Diego Milone, Enzo Ferrante

Published: March 1, 2024. Version: 0.4

This data set was downloaded from PhysioNet (https://physionet.org/content/chexmask-cxr-segmentation-data/0.4/OriginalResolution/#files-panel). PhysioNet is a repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology. It is supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.

Abstract: The CheXmask Database presents a comprehensive, uniformly annotated collection of chest radiographs, constructed from five public databases: ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest and VinDr-CXR. The database aggregates 657,566 anatomical segmentation masks derived from images which have been processed using the HybridGNet model to ensure consistent, high-quality segmentation. To confirm the quality of the segmentations, we include in this database individual Reverse Classification Accuracy (RCA) scores for each of the segmentation masks. This dataset is intended to catalyze further innovation and refinement in the field of semantic chest X-ray analysis, offering a significant resource for researchers in the medical imaging domain.
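One plausible use of the per-mask RCA scores is to discard low-confidence segmentations before training. The sketch below shows that filtering step under stated assumptions: the column names `image_id` and `rca_score`, the sample values, and the 0.7 cut-off are all hypothetical and should be replaced with whatever the actual CheXmask files contain.

```python
import csv
import io


def keep_high_quality(csv_file, score_column, threshold):
    """Yield rows whose RCA score is at least `threshold`."""
    for row in csv.DictReader(csv_file):
        if float(row[score_column]) >= threshold:
            yield row


# Hypothetical column names and values; check the actual CheXmask CSV header.
demo = io.StringIO("image_id,rca_score\na.png,0.93\nb.png,0.41\nc.png,0.70\n")
kept = [row["image_id"] for row in keep_high_quality(demo, "rca_score", 0.7)]
```

The right threshold depends on the downstream task, so it is worth inspecting the score distribution before fixing one.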

Ethics: All publicly available datasets utilized in this study adhered to strict ethical standards and underwent thorough anonymization, with identifiable details removed. The study does not release any part of the original image datasets; it only provides already anonymized image identifiers to allow researchers to match the original images with our annotations. MIMIC-CXR-JPG dataset required additional ethics training and research courses for access. The study authors fulfilled all ethics courses and data use agreement requirements to ensure ethical data usage.

Conflicts of Interest: The authors have no conflicts of interest to declare.

PubMed Central Open Access Subset (PMC OA)
data.virginia.gov
healthdata.gov
+4 more
html
Updated Jun 18, 2025
+ more versions
Cite
National Library of Medicine (2025). PubMed Central Open Access Subset (PMC OA) [Dataset]. https://data.virginia.gov/dataset/pubmed-central-open-access-subset-pmc-oa
Available download formats: html
Dataset updated
Jun 18, 2025
Dataset provided by
National Library of Medicine
Description
Not all articles in PMC are available for text mining and other reuse; many have copyright protection. However, articles in the PMC Open Access Subset are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.
Random Sample of NIH Chest X-ray Dataset
kaggle.com
zip
Updated Nov 23, 2017
Cite
National Institutes of Health Chest X-Ray Dataset (2017). Random Sample of NIH Chest X-ray Dataset [Dataset]. https://www.kaggle.com/nih-chest-xrays/sample
Available download formats: zip (4506359620 bytes)
Dataset updated
Nov 23, 2017
Dataset authored and provided by
National Institutes of Health Chest X-Ray Dataset
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
NIH Chest X-ray Dataset Sample

National Institutes of Health Chest X-Ray Dataset

Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Open-i was the largest publicly available source of chest X-ray images with 4,143 images available.

This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

File contents - This is a random sample (5%) of the full dataset:

sample.zip: Contains 5,606 images with size 1024 x 1024

sample_labels.csv: Class labels and patient data for the entire dataset

Image Index: File name

Finding Labels: Disease type (Class label)

Follow-up #

Patient ID

Patient Age

Patient Gender

View Position: X-ray orientation

OriginalImageWidth

OriginalImageHeight

OriginalImagePixelSpacing_x

OriginalImagePixelSpacing_y

Class descriptions

There are 15 classes (14 diseases, and one for "No findings") in the full dataset, but since this is a drastically reduced version of the full dataset, some of the classes are sparse, with most images labeled as "No findings".

Hernia - 13 images

Pneumonia - 62 images

Fibrosis - 84 images

Edema - 118 images

Emphysema - 127 images

Cardiomegaly - 141 images

Pleural_Thickening - 176 images

Consolidation - 226 images

Pneumothorax - 271 images

Mass - 284 images

Nodule - 313 images

Atelectasis - 508 images

Effusion - 644 images

Infiltration - 967 images

No Finding - 3044 images
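Per-class counts like the ones above can be recomputed from the labels file. The sketch below is one way to do it, assuming that multiple findings for one image are joined with a `|` character inside the "Finding Labels" column (the separator is not stated in this description). The sample rows and the helper name `count_findings` are illustrative only.

```python
import csv
import io
from collections import Counter


def count_findings(csv_file, sep="|"):
    """Count how many images carry each label in the "Finding Labels" column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # An image may carry several labels; the "|" separator is an assumption.
        for label in row["Finding Labels"].split(sep):
            counts[label.strip()] += 1
    return counts


# Synthetic rows using column names from the list above.
demo = io.StringIO(
    "Image Index,Finding Labels,Patient ID\n"
    "img_1.png,No Finding,1\n"
    "img_2.png,Effusion|Infiltration,2\n"
    "img_3.png,Infiltration,3\n"
)
counts = count_findings(demo)
```

The listed per-class counts add up to 6,978, more than the 5,606 images in the sample, which is consistent with some images carrying more than one label.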

Full Dataset Content

The full dataset consists of 12 zip files in total, ranging from ~2 GB to 4 GB in size.

Data limitations:

The image labels are NLP-extracted, so there could be some erroneous labels, but the NLP labeling accuracy is estimated to be >90%.

Very limited numbers of disease region bounding boxes (see BBox_list_2017.csv)

Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation.

Modifications to original data

Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform

CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory

Citations

Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017, ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf

NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community

Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345

Acknowledgements

This work was supported by the Intramural Research Program of the NIH Clinical Center (clinicalcenter.nih.gov) and National Library of Medicine (www.nlm.nih.gov).
NIH Common Data Elements Repository
data.virginia.gov
datahub.hhs.gov
+4 more
html
Updated Jun 18, 2025
Cite
National Library of Medicine (2025). NIH Common Data Elements Repository [Dataset]. https://data.virginia.gov/dataset/nih-common-data-elements-repository
Available download formats: html
Dataset updated
Jun 18, 2025
Dataset provided by
National Library of Medicine
Description
The NIH Common Data Elements (CDE) Repository has been designed to provide access to structured human and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers and other organizations for use in research and for other purposes. Visit the NIH CDE Resource Portal for contextual information about the repository.
nih-chest-xrays/data
kaggle.com
zip
Updated Jul 19, 2025
Cite
Elimane NDOYE (2025). nih-chest-xrays/data [Dataset]. https://www.kaggle.com/datasets/ellimann/nih-chest-xraysdata
Available download formats: zip (5568 bytes)
Dataset updated
Jul 19, 2025
Authors
Elimane NDOYE
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Elimane NDOYE

Released under CC0: Public Domain
NIH Chest X ray 14 (224x224 resized)
kaggle.com
zip
Updated Jul 8, 2020
+ more versions
Cite
Khan Fashee Monowar (Sawrup) (2020). NIH Chest X ray 14 (224x224 resized) [Dataset]. https://www.kaggle.com/khanfashee/nih-chest-x-ray-14-224x224-resized
Available download formats: zip (2468882507 bytes)
Dataset updated
Jul 8, 2020
Authors
Khan Fashee Monowar (Sawrup)
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
National Institutes of Health Chest X-Ray Dataset

Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Open-i was the largest publicly available source of chest X-ray images with 4,143 images available.

This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

Data limitations:

The image labels are NLP-extracted, so there could be some erroneous labels, but the NLP labeling accuracy is estimated to be >90%.

Very limited numbers of disease region bounding boxes (see BBox_list_2017.csv)

Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation.

File contents

Image format: 112,120 total images with size 1024 x 1024

images_001.zip: Contains 4999 images
images_002.zip: Contains 10,000 images
images_003.zip: Contains 10,000 images
images_004.zip: Contains 10,000 images
images_005.zip: Contains 10,000 images
images_006.zip: Contains 10,000 images
images_007.zip: Contains 10,000 images
images_008.zip: Contains 10,000 images
images_009.zip: Contains 10,000 images
images_010.zip: Contains 10,000 images
images_011.zip: Contains 10,000 images
images_012.zip: Contains 7,121 images

README_ChestXray.pdf: Original README file

BBox_list_2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
Image Index: File name
Finding Label: Disease type (Class label)
Bbox x
Bbox y
Bbox w
Bbox h

Data_entry_2017.csv: Class labels and patient data for the entire dataset
Image Index: File name
Finding Labels: Disease type (Class label)
Follow-up #
Patient ID
Patient Age
Patient Gender
View Position: X-ray orientation
OriginalImageWidth
OriginalImageHeight
OriginalImagePixelSpacing_x
OriginalImagePixelSpacing_y
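The bounding-box convention described above (start at x,y, extend w pixels horizontally and h pixels vertically) can be turned into corner coordinates, and rescaled for the 224 x 224 images this dataset's title refers to. The sketch below assumes the boxes were annotated on the 1024 x 1024 images; the function names and the sample box are invented for illustration.

```python
def bbox_to_corners(x, y, w, h):
    """Convert (x, y, w, h) to (x_min, y_min, x_max, y_max)."""
    return x, y, x + w, y + h


def scale_bbox(x, y, w, h, src_size=1024, dst_size=224):
    """Rescale a box from a square src_size image to a square dst_size image."""
    factor = dst_size / src_size
    return x * factor, y * factor, w * factor, h * factor


# Made-up box on a 1024 x 1024 image.
corners = bbox_to_corners(256, 512, 128, 64)
scaled = scale_bbox(256, 512, 128, 64)
```

If the resized images were produced with cropping or padding rather than a plain resize, a single scale factor would not be enough, so check how the 224 x 224 versions were generated.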

Class descriptions

There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:

Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass, Hernia

Full Dataset Content

There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5606 X-ray images and class labels.

Sample: sample.zip

Modifications to original data

Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform

CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory

Citations

Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017, ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf

NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community

Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
Influenza Research Database (IRD)
odgavaprod.ogopendata.com
healthdata.gov
+5 more
Updated Jul 25, 2023
Cite
National Institutes of Health (NIH) (2023). Influenza Research Database (IRD) [Dataset]. https://odgavaprod.ogopendata.com/dataset/influenza-research-database-ird
Dataset updated
Jul 25, 2023
Dataset provided by
National Institutes of Health (NIH)
Description
The Influenza Research Database (IRD) serves as a public repository and analysis platform for flu sequence, experiment, surveillance and related data.
Taxonomy
datahub.hhs.gov
data.virginia.gov
+4 more
csv, xlsx, xml
Updated Sep 1, 2021
+ more versions
Cite
datadiscovery.nlm.nih.gov (2021). Taxonomy [Dataset]. https://datahub.hhs.gov/NIH/Taxonomy/ega4-6afi
Available download formats: xlsx, xml, csv
Dataset updated
Sep 1, 2021
Dataset provided by
datadiscovery.nlm.nih.gov
Description
The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.
Data from: dbVar
data.virginia.gov
healthdata.gov
+3 more
html
Updated Jul 25, 2023
Cite
National Institutes of Health (NIH) (2023). dbVar [Dataset]. https://data.virginia.gov/dataset/dbvar
Available download formats: html
Dataset updated
Jul 25, 2023
Dataset provided by
National Institutes of Health (NIH)
Description
dbVar is a database of genomic structural variation. It accepts data from all species and includes clinical data. It can accept diverse types of events, including inversions, insertions and translocations. Additionally, both germline and somatic variants are accepted.
Annotated and classified variants from patients with MDS/AML detected in...
zenodo.org
data.niaid.nih.gov
Updated Jan 29, 2021
Cite
Henrik Banck (2021). Annotated and classified variants from patients with MDS/AML detected in seven public datasets [Dataset]. http://doi.org/10.5281/zenodo.4477289
Unique identifier
https://doi.org/10.5281/zenodo.4477289
Dataset updated
Jan 29, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Henrik Banck
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
990 unique validated variants from patients with MDS/AML detected in seven public datasets (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA388411). Databases and web services were accessed for variant annotation and classification on September 7, 2020.
