100+ datasets found
  1. NIH NCBI PubMed Central (PMC) Article Datasets - Full-Text Biomedical and...

    • registry.opendata.aws
    Updated Jul 4, 2021
    Cite
    National Library of Medicine (NLM) (2021). NIH NCBI PubMed Central (PMC) Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS [Dataset]. https://registry.opendata.aws/ncbi-pmc/
    Explore at:
    Dataset updated
    Jul 4, 2021
    Dataset provided by
    National Library of Medicine (NLM) (http://nlm.nih.gov/)
    Description

    PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type:

    The PMC Open Access (OA) Subset, which includes all articles in PMC with a machine-readable Creative Commons license

    The Author Manuscript Dataset, which includes all articles collected under a funder policy in PMC and made available in machine-readable formats for text mining

    These datasets collectively span more than half of PMC’s total collection of full-text articles. PMC enables access to these datasets to expand the impact of open access and publicly-funded research; enable greater machine learning across the spectrum of scientific research; reach new audiences; and open new doors for discovery. The bucket in this registry contains individual articles in NISO Z39.96-2015 JATS XML format as well as in plain text as extracted from the XML. The bucket is updated daily with new and updated articles. Also included are file lists that include metadata for articles in each dataset.
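
    As a rough illustration of working with the JATS XML format mentioned above, the sketch below pulls the article title and paragraph text out of a JATS-style document using only Python's standard library. The sample document is a made-up minimal fragment, not a real PMC article, and the helper name `extract_title_and_text` is hypothetical. Real PMC files are larger and use many more elements.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up JATS-style document for illustration only (not a real PMC article).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph with <italic>inline</italic> markup.</p>
      <p>Second paragraph.</p>
    </sec>
  </body>
</article>"""


def extract_title_and_text(jats_xml: str) -> tuple[str, list[str]]:
    """Return the article title and the body paragraphs as plain text."""
    root = ET.fromstring(jats_xml)
    title_el = root.find("./front/article-meta/title-group/article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    # itertext() flattens inline markup such as <italic>; split/join normalizes whitespace.
    paragraphs = [
        " ".join("".join(p.itertext()).split())
        for p in root.findall("./body//p")
    ]
    return title, paragraphs


title, paragraphs = extract_title_and_text(SAMPLE_JATS)
print(title)
for paragraph in paragraphs:
    print("-", paragraph)
```

    The plain-text files in the bucket are described as extracted from the XML; this sketch is only one simple way such an extraction might look, not the method NLM uses.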

  2. Blog | Leveraging Open Data Science to Accelerate Innovation at NIH and...

    • catalog.data.gov
    • data.virginia.gov
    • +1more
    Updated Mar 26, 2025
    Cite
    Elizabeth Kittrie (2025). Blog | Leveraging Open Data Science to Accelerate Innovation at NIH and Beyond [Dataset]. https://catalog.data.gov/dataset/blog-leveraging-open-data-science-to-accelerate-innovation-at-nih-and-beyond
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Elizabeth Kittrie
    Description

    This blog was posted by Elizabeth Kittrie on November 30, 2016. It was written by Elizabeth Kittrie, Senior Advisor for Open Innovation & Policy and Joe Bonner, Health Scientist-AAAS Fellow.

  3. NCBI Datasets

    • datadiscovery.nlm.nih.gov
    • healthdata.gov
    • +2more
    csv, xlsx, xml
    Updated Feb 9, 2022
    Cite
    (2022). NCBI Datasets [Dataset]. https://datadiscovery.nlm.nih.gov/Molecular-biology-Genetics/NCBI-Datasets-BETA-/3br9-y2tm
    Explore at:
    Available download formats: csv, xml, xlsx
    Dataset updated
    Feb 9, 2022
    Description

    NCBI Datasets is a one-stop shop for finding, browsing, and downloading genomic data. Find and download taxonomy, genome, gene, transcript, and protein data, including installation of NCBI Datasets command-line tools.

  4. Data from: Public sharing of research datasets: a pilot study of...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated May 26, 2011
    Cite
    Heather A. Piwowar; Wendy W. Chapman (2011). Public sharing of research datasets: a pilot study of associations [Dataset]. http://doi.org/10.5061/dryad.3td2f
    Explore at:
    Available download formats: zip
    Dataset updated
    May 26, 2011
    Dataset provided by
    Dryad
    Authors
    Heather A. Piwowar; Wendy W. Chapman
    Time period covered
    May 26, 2011
    Description

    Microarray study attributes and data sharing status: 397 rows, one row for each study that created gene expression microarray data as identified by Ochsner et al. (doi:10.1038/nmeth1208-991). Attributes of each study are included in 23 columns. Dependent variable is called is_data_shared. File: Piwowar_Metrics2009_rawdata.csv

    Statistical analysis R script: Statistical R script for analysis and graphics as presented in the paper. File: Piwowar_Metrics2009_statistics.R
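
    A minimal sketch of how the raw data file could be summarized, assuming it is a CSV whose `is_data_shared` column holds 0/1 values. That encoding is an assumption; the description only names the column. The inline rows and the `study_id` column are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up rows for illustration; the real file is described as 397 rows by 23 columns.
SAMPLE_CSV = """study_id,is_data_shared
s1,1
s2,0
s3,1
s4,0
"""


def sharing_rate(csv_file) -> float:
    """Fraction of studies whose is_data_shared value is 1 (assumed 0/1 coding)."""
    reader = csv.DictReader(csv_file)
    values = [int(row["is_data_shared"]) for row in reader]
    return sum(values) / len(values) if values else 0.0


rate = sharing_rate(io.StringIO(SAMPLE_CSV))
print(f"{rate:.0%} of studies shared data")
```

    To try this on the downloaded file, one could pass `open("Piwowar_Metrics2009_rawdata.csv", newline="")` instead of the in-memory sample, after checking how the column is actually coded.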

  5. Study of Womens Health Across the Nation (SWAN) Public Use Data

    • catalog.data.gov
    • healthdata.gov
    • +2more
    Updated Jul 26, 2023
    Cite
    National Institutes of Health (NIH) (2023). Study of Womens Health Across the Nation (SWAN) Public Use Data [Dataset]. https://catalog.data.gov/dataset/study-of-womens-health-across-the-nation-swan-public-use-data
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    National Institutes of Health (NIH)
    Description

    The SWAN Public Use Datasets provide access to longitudinal data describing the physical, biological, psychological, and social changes that occur during the menopausal transition. Data collected from 3,302 SWAN participants from Baseline through the 10th Annual Follow-Up visit are currently available to the public. Registered users are able to download datasets in a variety of formats, search variables and view recent publications.

  6. Open-i

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +3more
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Open-i [Dataset]. https://catalog.data.gov/dataset/open-i
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    The Open-i service provides search and retrieval of abstracts and images (including charts, graphs, clinical images, etc.) from the open source literature and biomedical image collections. Searching may be done by text queries as well as by query images.

  7. NIH Data Sharing Repositories

    • catalog.data.gov
    • healthdata.gov
    • +1more
    Updated Jul 25, 2025
    Cite
    National Institutes of Health (NIH), Department of Health & Human Services (2025). NIH Data Sharing Repositories [Dataset]. https://catalog.data.gov/dataset/nih-data-sharing-repositories
    Explore at:
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    United States Department of Health and Human Services (http://www.hhs.gov/)
    Description

    A list of NIH-supported repositories that accept submissions of appropriate scientific research data from biomedical researchers. It includes resources that aggregate information about biomedical data and information sharing systems. Links are provided to information about submitting data to and accessing data from the listed repositories. Additional information about the repositories and points of contact for further information or inquiries can be found on the websites of the individual repositories.

  8. Data from: Visible Human Project

    • healthdata.gov
    • datadiscovery.nlm.nih.gov
    • +3more
    csv, xlsx, xml
    Updated Mar 1, 2023
    + more versions
    Cite
    datadiscovery.nlm.nih.gov (2023). Visible Human Project [Dataset]. https://healthdata.gov/NIH/Visible-Human-Project/krti-uwg9
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    datadiscovery.nlm.nih.gov
    Description

    The NLM Visible Human Project® has created publicly-available complete, anatomically detailed, three-dimensional representations of a human male body and a human female body. Specifically, the VHP provides a public-domain library of cross-sectional cryosection, CT, and MRI images obtained from one male cadaver and one female cadaver. The Visible Man data set was publicly released in 1994 and the Visible Woman in 1995.

    https://www.nlm.nih.gov/research/visible/visible_human.html

  9. RxNorm Data

    • kaggle.com
    • bioregistry.io
    zip
    Updated Mar 20, 2019
    Cite
    National Library of Medicine (2019). RxNorm Data [Dataset]. https://www.kaggle.com/datasets/nlm-nih/nlm-rxnorm
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    National Library of Medicine
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    RxNorm is the name of a US-specific terminology in medicine that contains all medications available on the US market. Source: https://en.wikipedia.org/wiki/RxNorm

    RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, Gold Standard Drug Database, and Multum. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary. Source: https://www.nlm.nih.gov/research/umls/rxnorm/

    Content

    RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs, defined as the combination of {ingredient + strength + dose form}. In addition to the naming system, the RxNorm dataset also provides structured information such as brand names, ingredients, drug classes, and so on, for each clinical drug. Typical uses of RxNorm include navigating between names and codes among different drug vocabularies and using information in RxNorm to assist with health information exchange/medication reconciliation, e-prescribing, drug analytics, formulary development, and other functions.

    This public dataset includes multiple data files originally released in RxNorm Rich Release Format (RXNRRF) that are loaded into BigQuery tables. The data is updated and archived on a monthly basis.

    The following tables are included in the RxNorm dataset:

    • RXNCONSO contains concept and source information

    • RXNREL contains information regarding relationships between entities

    • RXNSAT contains attribute information

    • RXNSTY contains semantic information

    • RXNSAB contains source info

    • RXNCUI contains retired rxcui codes

    • RXNATOMARCHIVE contains archived data

    • RXNCUICHANGES contains concept changes
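
    To give a feel for how the RXNCONSO table might be queried, here is a small sketch that looks up ingredient concepts by name. It uses an in-memory SQLite table as a stand-in for the BigQuery table and keeps only four RXNCONSO columns (`RXCUI`, `STR`, `TTY`, `SAB`). The rows and the `X1`-style identifiers are invented placeholders, not real RXCUI codes, and the helper name `ingredient_rxcuis` is hypothetical.

```python
import sqlite3

# In-memory stand-in for RXNCONSO with a handful of made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RXNCONSO (RXCUI TEXT, STR TEXT, TTY TEXT, SAB TEXT)")
conn.executemany(
    "INSERT INTO RXNCONSO VALUES (?, ?, ?, ?)",
    [
        ("X1", "phenylephrine", "IN", "RXNORM"),  # IN = ingredient term type
        ("X2", "phenylephrine 10 MG Oral Tablet", "SCD", "RXNORM"),  # clinical drug
        ("X3", "ibuprofen", "IN", "RXNORM"),
    ],
)


def ingredient_rxcuis(connection, names):
    """Return (name, RXCUI) pairs for ingredient-level concepts matching the names."""
    placeholders = ", ".join("?" for _ in names)
    query = (
        "SELECT STR, RXCUI FROM RXNCONSO "
        "WHERE TTY = 'IN' AND SAB = 'RXNORM' "
        f"AND LOWER(STR) IN ({placeholders}) ORDER BY STR"
    )
    return connection.execute(query, [n.lower() for n in names]).fetchall()


result = ingredient_rxcuis(conn, ["Phenylephrine", "Ibuprofen"])
print(result)
```

    Questions that involve dose forms or brand names would additionally need RXNREL to follow relationships between concepts; that part is left out here because it depends on relationship names and directions that should be checked against the RxNorm documentation.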

    Update Frequency: Monthly

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://www.nlm.nih.gov/research/umls/rxnorm/

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:nlm_rxnorm

    https://cloud.google.com/bigquery/public-data/rxnorm

    Dataset Source: Unified Medical Language System RxNorm. The dataset is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. This dataset uses publicly available data from the U.S. National Library of Medicine (NLM), National Institutes of Health, Department of Health and Human Services; NLM is not responsible for the dataset, does not endorse or recommend this or any other dataset.

    Banner Photo by @freestocks from Unsplash.

    Inspiration

    What are the RXCUI codes for the ingredients of a list of drugs?

    Which ingredients have the most variety of dose forms?

    In what dose forms is the drug phenylephrine found?

    What are the ingredients of the drug labeled with the generic code number 072718?

  10. NIH Data and Specimen Hub (DASH)

    • catalog.data.gov
    • datasets.ai
    Updated Mar 23, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). NIH Data and Specimen Hub (DASH) [Dataset]. https://catalog.data.gov/dataset/nih-data-and-specimen-hub-dash
    Explore at:
    Dataset updated
    Mar 23, 2024
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    "The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies." This dataset is associated with the following publication: Deluca, N., K. Thomas, A. Mullikin, R. Slover, L. Stanek, D. Pilant, and E. Hubal. Geographic and demographic variability in serum PFAS concentrations for pregnant women in the United States. Journal of Exposure Science and Environmental Epidemiology. Nature Publishing Group, London, UK, 33(1): 710-724, (2023).

  11. NIH CheXmask Database: a dataset of anatomical seg

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Cite
    Priyadarshi Mukhopadhyay (2025). NIH CheXmask Database: a dataset of anatomical seg [Dataset]. https://www.kaggle.com/datasets/poeticmage/chexmask-database-a-dataset-of-anatomical-segment
    Explore at:
    Available download formats: zip (944480270 bytes)
    Dataset updated
    Jul 22, 2025
    Authors
    Priyadarshi Mukhopadhyay
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CheXmask Database: a large-scale dataset of anatomical segmentation masks for chest x-ray images

    This particular dataset contains only the segmentation masks for the NIH dataset, that is, ChestX-ray8.

    Nicolas Gaggion, Candelaria Mosquera, Martina Aineseder, Lucas Mansilla, Diego Milone, Enzo Ferrante

    Published: March 1, 2024. Version: 0.4

    This data set was downloaded from PhysioNet (https://physionet.org/content/chexmask-cxr-segmentation-data/0.4/OriginalResolution/#files-panel). PhysioNet is a repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology. Supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.

    Abstract: The CheXmask Database presents a comprehensive, uniformly annotated collection of chest radiographs, constructed from five public databases: ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest and VinDr-CXR. The database aggregates 657,566 anatomical segmentation masks derived from images which have been processed using the HybridGNet model to ensure consistent, high-quality segmentation. To confirm the quality of the segmentations, we include in this database individual Reverse Classification Accuracy (RCA) scores for each of the segmentation masks. This dataset is intended to catalyze further innovation and refinement in the field of semantic chest X-ray analysis, offering a significant resource for researchers in the medical imaging domain.

    Ethics: All publicly available datasets utilized in this study adhered to strict ethical standards and underwent thorough anonymization, with identifiable details removed. The study does not release any part of the original image datasets; it only provides already anonymized image identifiers to allow researchers to match the original images with our annotations. The MIMIC-CXR-JPG dataset required additional ethics training and research courses for access. The study authors fulfilled all ethics courses and data use agreement requirements to ensure ethical data usage.

    Conflicts of Interest: The authors have no conflicts of interest to declare.


  12. PubMed Central Open Access Subset (PMC OA)

    • data.virginia.gov
    • healthdata.gov
    • +4more
    html
    Updated Jun 18, 2025
    + more versions
    Cite
    National Library of Medicine (2025). PubMed Central Open Access Subset (PMC OA) [Dataset]. https://data.virginia.gov/dataset/pubmed-central-open-access-subset-pmc-oa
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    National Library of Medicine
    Description

    Not all articles in PMC are available for text mining and other reuse; many have copyright protection. However, articles in the PMC Open Access Subset are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.

  13. Random Sample of NIH Chest X-ray Dataset

    • kaggle.com
    zip
    Updated Nov 23, 2017
    Cite
    National Institutes of Health Chest X-Ray Dataset (2017). Random Sample of NIH Chest X-ray Dataset [Dataset]. https://www.kaggle.com/nih-chest-xrays/sample
    Explore at:
    Available download formats: zip (4506359620 bytes)
    Dataset updated
    Nov 23, 2017
    Dataset authored and provided by
    National Institutes of Health Chest X-Ray Dataset
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NIH Chest X-ray Dataset Sample

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)


    File contents - This is a random sample (5%) of the full dataset:

    • sample.zip: Contains 5,606 images with size 1024 x 1024

    • sample_labels.csv: Class labels and patient data for the entire dataset

      • Image Index: File name
      • Finding Labels: Disease type (Class label)
      • Follow-up #
      • Patient ID
      • Patient Age
      • Patient Gender
      • View Position: X-ray orientation
      • OriginalImageWidth
      • OriginalImageHeight
      • OriginalImagePixelSpacing_x
      • OriginalImagePixelSpacing_y


    Class descriptions

    There are 15 classes (14 diseases, and one for "No findings") in the full dataset, but since this is a drastically reduced version of the full dataset, some of the classes are sparse, with most images labeled as "No findings".

    • Hernia - 13 images
    • Pneumonia - 62 images
    • Fibrosis - 84 images
    • Edema - 118 images
    • Emphysema - 127 images
    • Cardiomegaly - 141 images
    • Pleural_Thickening - 176 images
    • Consolidation - 226 images
    • Pneumothorax - 271 images
    • Mass - 284 images
    • Nodule - 313 images
    • Atelectasis - 508 images
    • Effusion - 644 images
    • Infiltration - 967 images
    • No Finding - 3044 images
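
    Per-class counts like those listed above could be reproduced from the labels file with a short script. The sketch below assumes the column headers are exactly `Image Index` and `Finding Labels` and that an image with several findings stores them in one cell separated by `|`; both are assumptions about the file layout, and the inline rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; the real sample_labels.csv has more columns and rows.
SAMPLE_LABELS_CSV = """Image Index,Finding Labels
img_a.png,No Finding
img_b.png,Cardiomegaly|Emphysema
img_c.png,Emphysema
"""


def count_findings(csv_file) -> Counter:
    """Count how many images carry each finding label."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # One image may carry several labels (assumed '|' separator).
        for label in row["Finding Labels"].split("|"):
            counts[label.strip()] += 1
    return counts


counts = count_findings(io.StringIO(SAMPLE_LABELS_CSV))
for label, n in counts.most_common():
    print(f"{label} - {n} images")
```

    Because one image can contribute to several classes under this assumption, the per-class counts may add up to more than the number of images.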


    Full Dataset Content

    The full dataset can be found here. There are 12 zip files in total, ranging from ~2 GB to 4 GB in size.


    Data limitations:

    1. The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
    2. Very limited numbers of disease region bounding boxes (See BBox_list_2017.csv)
    3. Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation


    Modifications to original data

    • Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform

    • CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory


    Citations


    Acknowledgements

    This work was supported by the Intramural Research Program of the NIH Clinical Center (clinicalcenter.nih.gov) and National Library of Medicine (www.nlm.nih.gov).

  14. NIH Common Data Elements Repository

    • data.virginia.gov
    • datahub.hhs.gov
    • +4more
    html
    Updated Jun 18, 2025
    Cite
    National Library of Medicine (2025). NIH Common Data Elements Repository [Dataset]. https://data.virginia.gov/dataset/nih-common-data-elements-repository
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    National Library of Medicine
    Description

    The NIH Common Data Elements (CDE) Repository has been designed to provide access to structured human and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers and other organizations for use in research and for other purposes. Visit the NIH CDE Resource Portal for contextual information about the repository.

  15. nih-chest-xrays/data

    • kaggle.com
    zip
    Updated Jul 19, 2025
    Cite
    Elimane NDOYE (2025). nih-chest-xrays/data [Dataset]. https://www.kaggle.com/datasets/ellimann/nih-chest-xraysdata
    Explore at:
    Available download formats: zip (5568 bytes)
    Dataset updated
    Jul 19, 2025
    Authors
    Elimane NDOYE
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Elimane NDOYE

    Released under CC0: Public Domain

    Contents

  16. NIH Chest X ray 14 (224x224 resized)

    • kaggle.com
    zip
    Updated Jul 8, 2020
    + more versions
    Cite
    Khan Fashee Monowar (Sawrup) (2020). NIH Chest X ray 14 (224x224 resized) [Dataset]. https://www.kaggle.com/khanfashee/nih-chest-x-ray-14-224x224-resized
    Explore at:
    Available download formats: zip (2468882507 bytes)
    Dataset updated
    Jul 8, 2020
    Authors
    Khan Fashee Monowar (Sawrup)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Data limitations:

    1. The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
    2. Very limited numbers of disease region bounding boxes (See BBox_list_2017.csv)
    3. Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation
    

    File contents

    Image format: 112,120 total images with size 1024 x 1024
    
    images_001.zip: Contains 4999 images
    
    images_002.zip: Contains 10,000 images
    
    images_003.zip: Contains 10,000 images
    
    images_004.zip: Contains 10,000 images
    
    images_005.zip: Contains 10,000 images
    
    images_006.zip: Contains 10,000 images
    
    images_007.zip: Contains 10,000 images
    
    images_008.zip: Contains 10,000 images
    
    images_009.zip: Contains 10,000 images
    
    images_010.zip: Contains 10,000 images
    
    images_011.zip: Contains 10,000 images
    
    images_012.zip: Contains 7,121 images
    
    README_ChestXray.pdf: Original README file
    
    BBox_list_2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      Image Index: File name
      Finding Label: Disease type (Class label)
      Bbox x
      Bbox y
      Bbox w
      Bbox h
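
    The bounding-box convention described above (start at x,y, extend w pixels horizontally and h pixels vertically) can be sketched as follows. This assumes (x, y) is the top-left corner in pixel coordinates and that the boxes were recorded against 1024 x 1024 images, so they would need rescaling for 224 x 224 copies. Both points are assumptions to verify against the files, and the names `BBox`, `to_corners`, and `rescale` are illustrative.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    x: float  # left edge (assumed top-left origin)
    y: float  # top edge
    w: float  # horizontal extent in pixels
    h: float  # vertical extent in pixels


def to_corners(box: BBox) -> tuple[float, float, float, float]:
    """Convert (x, y, w, h) to (x_min, y_min, x_max, y_max)."""
    return (box.x, box.y, box.x + box.w, box.y + box.h)


def rescale(box: BBox, src_size: int = 1024, dst_size: int = 224) -> BBox:
    """Scale a box from a square src_size image to a square dst_size image."""
    scale = dst_size / src_size
    return BBox(box.x * scale, box.y * scale, box.w * scale, box.h * scale)


box = BBox(x=256.0, y=512.0, w=128.0, h=64.0)  # made-up example box
print(to_corners(box))
print(rescale(box))
```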
    
    Data_entry_2017.csv: Class labels and patient data for the entire dataset
      Image Index: File name
      Finding Labels: Disease type (Class label)
      Follow-up #
      Patient ID
      Patient Age
      Patient Gender
      View Position: X-ray orientation
      OriginalImageWidth
      OriginalImageHeight
      OriginalImagePixelSpacing_x
      OriginalImagePixelSpacing_y
    

    Class descriptions

    There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:

    Atelectasis
    Consolidation
    Infiltration
    Pneumothorax
    Edema
    Emphysema
    Fibrosis
    Effusion
    Pneumonia
    Pleural_thickening
    Cardiomegaly
    Nodule
    Mass
    Hernia
    

    Full Dataset Content

    There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5606 X-ray images and class labels.

    Sample: sample.zip
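
    As a quick consistency check on the figures quoted in this description, the per-archive image counts can be summed and compared with the stated total and the 5% sample size. The numbers below are copied from the file list above; nothing here reads the actual archives.

```python
# Image counts per archive, images_001.zip through images_012.zip, as listed above.
images_per_zip = [4_999] + [10_000] * 10 + [7_121]

total = sum(images_per_zip)
sample_size = total * 5 // 100  # 5% of the full dataset, in integer arithmetic

print(len(images_per_zip), "archives")
print(total, "images in total")
print(sample_size, "images in a 5% sample")
```

    The results line up with the 12 archives, 112,120 images, and 5,606 sample images mentioned in the description.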
    

    Modifications to original data

    Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform
    
    CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory
    

    Citations

    Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017, ChestX-ray8Hospital-ScaleChestCVPR2017_paper.pdf
    
    NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
    
    Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
    
  17. Influenza Research Database (IRD)

    • odgavaprod.ogopendata.com
    • healthdata.gov
    • +5more
    Updated Jul 25, 2023
    Cite
    National Institutes of Health (NIH) (2023). Influenza Research Database (IRD) [Dataset]. https://odgavaprod.ogopendata.com/dataset/influenza-research-database-ird
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    National Institutes of Health (NIH)
    Description

    The Influenza Research Database (IRD) serves as a public repository and analysis platform for flu sequence, experiment, surveillance and related data.

  18. Taxonomy

    • datahub.hhs.gov
    • data.virginia.gov
    • +4 more
    csv, xlsx, xml
    Updated Sep 1, 2021
    + more versions
    Cite
    datadiscovery.nlm.nih.gov (2021). Taxonomy [Dataset]. https://datahub.hhs.gov/NIH/Taxonomy/ega4-6afi
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Sep 1, 2021
    Dataset provided by
    datadiscovery.nlm.nih.gov
    Description

    The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.

  19. Data from: dbVar

    • data.virginia.gov
    • healthdata.gov
    • +3 more
    html
    Updated Jul 25, 2023
    Cite
    National Institutes of Health (NIH) (2023). dbVar [Dataset]. https://data.virginia.gov/dataset/dbvar
    Explore at:
    Available download formats: html
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    National Institutes of Health (NIH)
    Description

    dbVar is a database of genomic structural variation. It accepts data from all species and includes clinical data. It can accept diverse types of events, including inversions, insertions and translocations. Additionally, both germline and somatic variants are accepted.

  20. Annotated and classified variants from patients with MDS/AML detected in...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jan 29, 2021
    Cite
    Henrik Banck (2021). Annotated and classified variants from patients with MDS/AML detected in seven public datasets [Dataset]. http://doi.org/10.5281/zenodo.4477289
    Explore at:
    Dataset updated
    Jan 29, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Henrik Banck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    990 unique validated variants from patients with MDS/AML detected in seven public datasets (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA388411). Databases and web services were accessed for variant annotation and classification on September 7, 2020.

