Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a combined curated dataset of COVID-19 chest X-ray images obtained by collating 15 publicly available datasets, as listed under the references section. The present dataset contains 1281 COVID-19 X-rays, 3270 normal X-rays, 1656 viral-pneumonia X-rays, and 3001 bacterial-pneumonia X-rays. This dataset was developed as part of the following research publication.
"A deep-learning based multimodal system for Covid-19 diagnosis using breathing sounds and chest X-ray images" https://doi.org/10.1016/j.asoc.2021.107522
The collected datasets, as cited by this dataset, are combined to form an integrated repository. This integrated repository contains a total of 4558 COVID-19 X-rays, 5403 normal X-rays, 4497 viral pneumonia X-rays, and 5768 bacterial pneumonia X-rays. Of these, 1379 COVID-19 X-rays, 1476 normal X-rays, 2690 viral pneumonia X-rays, and 2588 bacterial pneumonia X-rays were found to be duplicates, based on image similarity, and were removed. The Inception V3 architecture is used to obtain the image embeddings, followed by unsupervised learning algorithms based on cosine similarity distances. These distances are clustered and then visualized to find different categories of image defects, which are listed below:
1. Noise
2. Pixelated
3. Compressed
4. Medical implants
5. Washed-out image
6. Side view
7. CT (sliced) image
8. Aspect ratio distortion / cropped / zoomed
9. Rotated images
10. Images with annotations
These clusters of defective images are removed during the curation process and a refined dataset is obtained which is available for download.
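The duplicate-detection step described above can be illustrated with a minimal sketch. It assumes the embeddings have already been extracted (the toy 3-dimensional vectors below merely stand in for Inception V3 embeddings), and it flags pairs with a simple similarity threshold rather than the clustering the authors describe. The function name `find_near_duplicates` and the threshold value 0.98 are illustrative assumptions, not taken from the publication.

```python
import numpy as np


def find_near_duplicates(embeddings, threshold=0.98):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold.

    `threshold` is an assumed, illustrative value; the source does not state one.
    """
    emb = np.asarray(embeddings, dtype=float)
    # L2-normalise each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Inspect only the strict upper triangle: no self-pairs, no mirrored pairs.
    i_idx, j_idx = np.triu_indices(len(unit), k=1)
    mask = sim[i_idx, j_idx] >= threshold
    return [(int(i), int(j)) for i, j in zip(i_idx[mask], j_idx[mask])]


# Toy vectors standing in for real image embeddings (hypothetical data).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # points in almost the same direction as row 0
    [0.0, 1.0, 0.0],
])
pairs = find_near_duplicates(toy, threshold=0.98)
print(pairs)
```

For a repository of roughly 20,000 images the full pairwise matrix is still manageable in memory, but a larger collection would likely need a chunked or approximate nearest-neighbour search instead.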
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of publicly available images from COVID-19 positive patients collected from several sources over the net. All images are chest X-rays from frontal view (AP or PA). There is a ZIP file containing 900 images and a metadata file in CSV format which includes information about 452 images. Note that some of the images are from pediatric and/or early-stage patients with no specific image findings noted by the radiologist, but all of them are from COVID-positive cases. Related guidelines and details are available in the GitHub repo.
COVID-19 (coronavirus disease 2019) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a strain of coronavirus. The first cases were seen in Wuhan, China, in late December 2019 before spreading globally. The current outbreak was officially recognized as a pandemic by the World Health Organization (WHO) on 11 March 2020. Currently, reverse transcription polymerase chain reaction (RT-PCR) is used to diagnose COVID-19. X-ray machines are widely available and provide images quickly, so chest X-ray images can be very useful in the early diagnosis of COVID-19.
The dataset is organized into 2 folders (train, test), and each contains 3 subfolders (COVID19, PNEUMONIA, NORMAL). The dataset contains a total of 6432 X-ray images, and the test data make up 20% of the total.
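A quick way to check a layout like this after download is to count files per split and class. The sketch below assumes the images sit directly under `<root>/<split>/<class>/` and use common image extensions; both are assumptions about the archive, and the helper names `count_images` and `held_out_fraction` are hypothetical. The demo builds a tiny throwaway directory tree rather than touching the real data.

```python
import tempfile
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed file types


def count_images(root):
    """Count image files under root/<split>/<class>/ as {(split, class): n}."""
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            split, label = path.parts[-3], path.parts[-2]
            counts[(split, label)] += 1
    return dict(counts)


def held_out_fraction(counts, split="test"):
    """Share of all counted images that belong to the given split."""
    total = sum(counts.values())
    held_out = sum(n for (s, _), n in counts.items() if s == split)
    return held_out / total if total else 0.0


# Demo on a small synthetic tree (8 train images, 2 test images).
with tempfile.TemporaryDirectory() as tmp:
    layout = [("train", "COVID19", 4), ("train", "NORMAL", 4),
              ("test", "COVID19", 1), ("test", "NORMAL", 1)]
    for split, label, n in layout:
        folder = Path(tmp) / split / label
        folder.mkdir(parents=True)
        for k in range(n):
            (folder / f"img_{k}.png").touch()
    demo_counts = count_images(tmp)
    demo_fraction = held_out_fraction(demo_counts)

print(demo_counts, demo_fraction)
```

Run against the real archive, the same two calls would show whether the test split is in fact close to 20% of the 6432 images.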
Images are collected from various publicly available resources. If you use the data for research, please give credit to the authors. Sources: 1. https://github.com/ieee8023/covid-chestxray-dataset 2. https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia 3. https://github.com/agchung
Application of Artificial Intelligence (AI) techniques with radiological images for COVID-19 diagnosis.
Deaths counts for influenza, pneumonia, and COVID-19 reported to NCHS by week ending date, by state and HHS region, and age group.
A team of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh along with their collaborators from Pakistan and Malaysia in collaboration with medical doctors have created a database of chest X-ray images for COVID-19 positive cases along with Normal and Viral Pneumonia images. This COVID-19, normal, and other lung infection dataset is released in stages. In the first release, we have released 219 COVID-19, 1341 normal, and 1345 viral pneumonia chest X-ray (CXR) images. In the first update, we have increased the COVID-19 class to 1200 CXR images. In the 2nd update, we have increased the database to 3616 COVID-19 positive cases along with 10,192 Normal, 6012 Lung Opacity (Non-COVID lung infection), and 1345 Viral Pneumonia images and corresponding lung masks. We will continue to update this database as soon as we have new x-ray images for COVID-19 pneumonia patients.
Description:
This dataset provides a comprehensive collection of chest X-ray images representing three types of pneumonia: COVID-19 pneumonia, viral pneumonia, and bacterial pneumonia. The dataset is curated from 15 publicly available sources and has been meticulously processed to ensure high-quality, relevant data for research and development in medical imaging, AI, and machine learning applications.
The dataset comprises the following categories of X-ray images:
COVID-19 Pneumonia: 1281 X-rays
Normal (No Pneumonia): 3270 X-rays
Viral Pneumonia: 1656 X-rays
Bacterial Pneumonia: 3001 X-rays
Dataset Curation Process
The initial dataset, comprising over 19,000 images, was refined using image similarity algorithms to remove duplicates, noisy images, and other defects. The Inception V3 model was employed to extract image embeddings, which were further analyzed using unsupervised learning techniques to filter out images that exhibited poor quality or anomalies. Images exhibiting defects such as noise, pixelation, compression artifacts, and medical implants were systematically removed to ensure the dataset's integrity.
Features of the Dataset
Diverse Representation: The dataset provides X-rays for three distinct types of pneumonia, offering an ideal foundation for training AI models in medical diagnostics.
Cleaned and Curated: All duplicate and faulty images have been removed, with the final dataset being subjected to quality control processes such as image clustering and manual review.
Visualization and Disease Highlighting: Tools such as Inception V3 have been utilized to visually highlight abnormalities and disease characteristics, making the dataset highly suitable for visualization-based medical research.
Common Image Defects Addressed
Throughout the dataset cleaning process, several types of image defects were identified and addressed. These include:
Noise and Pixelation: Images with significant noise and pixelation were removed to enhance clarity.
Compression Artifacts: X-rays affected by excessive compression were excluded.
Medical Implants: X-rays with visible implants that might interfere with pneumonia diagnosis were filtered out.
Washed-out Images: Images with poor contrast or exposure were eliminated.
Side View and CT Images: Non-standard views and non-X-ray images, such as CT slices, were removed.
Aspect Ratio Distortion: Cropped or zoomed images that distorted the aspect ratio were corrected or excluded.
Annotated Images: X-rays with visible annotations or markings were removed.
This dataset is sourced from Kaggle.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Chest X-ray images were selected from a database of chest X-ray images for COVID-19 positive cases along with normal and viral pneumonia images, which were collated by researchers from Qatar University and the University of Dhaka along with collaborators from Pakistan and Malaysia and some medical doctors. In their current release, there are 219 COVID-19 positive images, 1341 normal images and 1345 viral pneumonia images (Chowdhury et al., 2020). To ensure multiple representations, chest X-ray images for both COVID-19 and normal cases were also selected from the Mendeley dataset repository (El-Shafai, 2020), which contains 5500 Non-COVID X-ray images and 4044 COVID-19 X-ray images. This study, therefore, adopted these multi-source datasets. Due to limited computing resources, 1,300 images were selected from each category in this study: 1,300 images of COVID-19 positive cases, 1,300 normal images and 1,300 images of viral pneumonia cases, totaling 3,900 images. It is noted here that further descriptions of the datasets were not provided by the authors of the source datasets.
El-Shafai, Walid; Abd El-Samie, Fathi (2020), "Extensive COVID-19 X-Ray and CT Chest Images Dataset", Mendeley Data, V3, doi: 10.17632/8h65ywd2jr.3
Chowdhury, M. E., Rahman, T., Khandakar, A., Mazhar, R., Kadir, M. A., Mahbub, Z. B., . . . Reaz, M. B. (2020, March 29). Can AI help in screening Viral and COVID-19 pneumonia? Retrieved from https://arxiv.org/abs/2003.13145; https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
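The per-category selection of 1,300 images described for this study could be reproduced along the following lines. This is only a sketch: the source does not say how the images were chosen, so uniform random sampling with a fixed seed is an assumption, as are the function name `sample_per_class` and the placeholder file names and pool sizes.

```python
import random


def sample_per_class(paths_by_class, n_per_class=1300, seed=0):
    """Draw n_per_class items from each class without replacement.

    Uniform random sampling and the fixed seed are assumptions; the source
    does not describe the selection procedure.
    """
    rng = random.Random(seed)
    selected = {}
    for label, paths in paths_by_class.items():
        if len(paths) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(paths)} images")
        # Sort first so the draw does not depend on the input ordering.
        selected[label] = rng.sample(sorted(paths), n_per_class)
    return selected


# Placeholder pools of 1,500 hypothetical file names per class.
pools = {
    label: [f"{label}/img_{k:05d}.png" for k in range(1500)]
    for label in ("COVID-19", "Normal", "Viral Pneumonia")
}
subset = sample_per_class(pools)
total_selected = sum(len(v) for v in subset.values())
print(total_selected)
```

Fixing the seed keeps the 3,900-image subset reproducible across runs, which matters when results are compared against a particular train/test split.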
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Note: results generated from this dataset should not be considered diagnostic.
Significant credit to researchers making their efforts available in COVID-Net.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a sample dataset of COVID-19 and non-COVID-19 pneumonia. It is from the Khorshid COVID Cohort (KCC) study. It is related to the following paper published in "Frontiers in Medicine":
Marateb HR, Ziaei Nezhad F, Mohebian MR, Sami R, Haghjooy Javanmard S, Dehghan Niri F, Akafzadeh-Savari M, Mansourian M, Mañanas MA, Wolkewitz M and Binder H (2021) Automatic Classification Between COVID-19 and Non-COVID-19 Pneumonia Using Symptoms, Comorbidities, and Laboratory Findings: The Khorshid COVID Cohort Study. Front. Med. 8:768467. doi: 10.3389/fmed.2021.768467 https://www.frontiersin.org/articles/10.3389/fmed.2021.768467/full
Please cite the above reference when the dataset is used.
Authors of the Dataset:
Pratik Bhowal (B.E., Dept. of Electronics and Instrumentation Engineering, Jadavpur University, Kolkata, India)
Subhankar Sen (B.Tech, Dept. of Computer Science Engineering, Manipal University Jaipur, India)
Jin Hee Yoon (faculty of the Dept. of Mathematics and Statistics at Sejong University, Seoul, South Korea)
Zong Woo Geem (faculty of the College of IT Convergence at Gachon University, South Korea)
Ram Sarkar (Professor at the Dept. of Computer Science Engineering, Jadavpur University, Kolkata, India)
Overview: The authors have created a new dataset, known as the Novel COVID-19 Chestxray Repository, by fusing publicly available chest X-ray image repositories. In creating this combined dataset, three different datasets obtained from GitHub and Kaggle, created by the authors of other research studies in this field, were utilized. In our study, frontal and lateral chest X-ray images are used since this view of radiography is widely used by radiologists in clinical diagnosis. In the following section, the authors summarize how this dataset was created.
COVID-19 Radiography Database: The first release of this dataset reports 219 COVID-19, 1345 viral pneumonia and 1341 normal radiographic chest X-ray images. This dataset was created by a team of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh in collaboration with medical doctors and specialists from Pakistan and Malaysia. This database is regularly updated with the emergence of new cases of COVID-19 patients worldwide. Related Paper: https://arxiv.org/abs/2003.13145
COVID-Chestxray set: Joseph Paul Cohen, Paul Morrison and Lan Dao have created a public image repository on GitHub which consists of both CT scans and digital chest X-rays. The data was collected mainly from retrospective cohorts of pediatric patients from Guangzhou Women and Children's Medical Center. With the aid of the metadata provided along with the dataset, we were able to extract 521 COVID-19 positive images; 239 viral and bacterial pneumonias, which fall into the following three broad categories: Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Acute Respiratory Distress Syndrome (ARDS); and 218 normal radiographic chest X-ray images of varying image resolutions. Related Paper: https://arxiv.org/abs/2006.11988
Actualmed COVID chestxray dataset: The Actualmed-COVID-chestxray-dataset comprises 12 COVID-19 positive and 80 normal radiographic chest X-ray images.
The combined dataset includes chest X-ray images of the COVID-19, Pneumonia and Normal (healthy) classes, with a total of 752, 1584, and 1639 images respectively. Information about the Novel COVID-19 Chestxray Database and its parent image repositories is provided in Table 1.
Table 1: Dataset Description

| Dataset | COVID-19 | Pneumonia | Normal |
| ------------- | ------------- | ------------- | ------------- |
| COVID Chestxray set | 521 | 239 | 218 |
| COVID-19 Radiography Database (first release) | 219 | 1345 | 1341 |
| Actualmed COVID chestxray dataset | 12 | 0 | 80 |
| Total | 752 | 1584 | 1639 |
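The totals in Table 1 can be checked by summing the three parent repositories per class. The short sketch below simply transcribes the table's counts; the variable names are illustrative.

```python
# Per-class image counts transcribed from Table 1.
sources = {
    "COVID Chestxray set": {"COVID-19": 521, "Pneumonia": 239, "Normal": 218},
    "COVID-19 Radiography Database (first release)": {
        "COVID-19": 219, "Pneumonia": 1345, "Normal": 1341,
    },
    "Actualmed COVID chestxray dataset": {
        "COVID-19": 12, "Pneumonia": 0, "Normal": 80,
    },
}

classes = ("COVID-19", "Pneumonia", "Normal")
totals = {c: sum(row[c] for row in sources.values()) for c in classes}
grand_total = sum(totals.values())

print(totals)       # {'COVID-19': 752, 'Pneumonia': 1584, 'Normal': 1639}
print(grand_total)  # 3975
```

The per-class sums match the Total row of the table, giving 3,975 images in the combined repository.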
DATA ACCESS AND USE: Academic/Non-Commercial Use Dataset License : Database: Open Database, Contents: Database Contents
This dataset was created by MUAWAZSALEEM2
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The proposed dataset has been combined from three popular lung segmentation datasets: Darwin, Montgomery, and Shenzhen. The combined data allow researchers and clinicians to gain access to a good quality dataset, a large proportion of which has been manually annotated. The combined dataset consists of 6,810 images, with corresponding binary masks of lungs with the following distribution of images between the three datasets:
- 6,106 images from the Darwin dataset;
- 139 images from the Montgomery dataset;
- 566 images from the Shenzhen dataset.
The Darwin dataset [1, 2] images include most of the heart, revealing lung opacities behind the heart, which may be relevant for assessing the severity of viral pneumonia. The lower-most part of the lungs, where visible, is defined by the extent of the diaphragm. Where present and not obstructive to the distinguishability of the lungs, the diaphragm is included up until the lower-most visible part of the lungs. A key property of this dataset is that image resolutions, sources, and orientations vary across the dataset, with the smallest image being 156x156 pixels and the largest being 5600x4700 pixels. Furthermore, we included the portable X-ray images which are of significantly lower quality as compared to standard X-rays. A key limitation of the Darwin dataset is that it does not contain lateral X-ray lung segmentations. It is worth noting that lung segmentations were performed by human annotators using Darwin's Auto-Annotate AI and then adjusted and reviewed by expert radiologists.
Both the Montgomery and Shenzhen datasets [3] were published by the United States National Library of Medicine and are made of posteroanterior chest X-ray images. These images are available to foster research in computer-aided diagnosis of pulmonary diseases with a special focus on pulmonary tuberculosis. The datasets were acquired from the Department of Health and Human Services (Maryland, USA) and Shenzhen No. 3 People's Hospital (Shenzhen, China). Both datasets contain normal and abnormal chest X-ray images with manifestations of tuberculosis and include associated radiologist readings.
References:
1. Darwin's Auto-Annotate AI. Available: https://www.v7labs.com/automated-annotation
2. COVID-19 X-ray dataset. Available: https://github.com/v7labs/covid-19-xray-dataset
3. Jaeger S, Candemir S, Antani S, Wáng Y-XJ, Lu P-X, Thoma G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg. 2014;4: 475-477. doi:10.3978/j.issn.2223-4292.2014.11.20
[1] https://bimcv.cipf.es/bimcv-projects/bimcv-covid19/#1590858128006-9e640421-6711 [2] https://github.com/ml-workgroup/covid-19-image-repository/tree/master/png [3] https://sirm.org/category/senza-categoria/covid-19/ [4] https://eurorad.org [5] https://github.com/ieee8023/covid-chestxray-dataset [6] https://figshare.com/articles/COVID-19_Chest_X-Ray_Image_Repository/12580328 [7] https://github.com/armiro/COVID-CXNet [8] https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data [9] https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70230281
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this study, a primary dataset containing 21,909 X-ray images is used, as shown in Table 1 [11-13]. This basic dataset consists of three classes (COVID-19, Normal and Pneumonia), each with a different number of images. The dataset is drawn from well-known websites such as GitHub, Eurorad, Figshare and Kaggle.
The researchers of Qatar University have compiled the COVID-QU-Ex dataset, which consists of 33,920 chest X-ray (CXR) images including:
- 11,956 COVID-19
- 11,263 Non-COVID infections (Viral or Bacterial Pneumonia)
- 10,701 Normal
Ground-truth lung segmentation masks are provided for the entire dataset. This is the largest ever created lung mask dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database consists of 3,616 COVID-19 positive cases, 6,012 Lung Opacity (Non-COVID lung infection) and 1,345 Viral Pneumonia images. It is the 2nd update of the original dataset, and three of its classes (namely, COVID-19, Non-COVID lung infection and Viral Pneumonia) are used in our paper.
The COVID-19 Posteroanterior Chest X-Ray fused (CPCXR) dataset is generated by the fusion of three publicly available datasets: COVID-19 cxr image, Radiological Society of North America (RSNA), and the Montgomery County set, NLM(MC), collected by the U.S. National Library of Medicine (USNLM). The dataset consists of samples of diseases labeled as COVID-19, Tuberculosis, Other pneumonia (SARS, MERS, etc.), and Normal. The dataset can be utilized to train and evaluate deep learning and machine learning models on binary and multi-class classification problems.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of "Provisional Death Counts for Influenza, Pneumonia, and COVID-19" provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/c78f3ba6-04af-4ecd-bd87-4fdfc6a97344 on 12 February 2022.
--- Dataset description provided by original source is as follows ---
Deaths counts for influenza, pneumonia, and coronavirus disease 2019 (COVID-19) reported to NCHS by week ending date, by state and HHS region, and age group.
--- Original source retains full ownership of the source dataset ---
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The COVID-19 pandemic is a global healthcare emergency. Prediction models for COVID-19 imaging are rapidly being developed to support medical decision making in imaging. However, inadequate availability of a diverse annotated dataset has limited the performance and generalizability of existing models.
The purpose was to create the first multi-institutional, multi-national, expert-annotated COVID-19 imaging dataset made freely available to the machine learning community as a research and educational resource for COVID-19 chest imaging. The Radiological Society of North America (RSNA) assembled the RSNA International COVID-19 Open Radiology Database (RICORD) collection of COVID-related imaging datasets and expert annotations to support research and education. RICORD data will be incorporated in the Medical Imaging and Data Resource Center (MIDRC), a multi-institutional research data repository funded by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health.
This dataset was created through a collaboration between the RSNA and Society of Thoracic Radiology (STR). Clinical annotation by thoracic radiology subspecialists was performed for all COVID positive chest radiography (CXR) imaging studies using a labeling schema based upon guidelines for reporting classification of COVID-19 findings in CXRs (see Review of Chest Radiograph Findings of COVID-19 Pneumonia and Suggested Reporting Language, Journal of Thoracic Imaging).
The RSNA International COVID-19 Open Annotated Radiology Database (RICORD) consists of 998 chest x-rays from 361 patients at four international sites annotated with diagnostic labels.
Patient Selection: Patients at least 18 years in age receiving positive diagnosis for COVID-19.
998 Chest x-ray examinations from 361 patients.
Annotations with labels:
Classification
Typical Appearance
Multifocal bilateral, peripheral opacities, and/or Opacities with rounded morphology
Lower lung-predominant distribution (Required Feature - must be present with either or both of the first two opacity patterns)
Indeterminate Appearance
Absence of typical findings AND Unilateral, central or upper lung predominant distribution of airspace disease
Negative for Pneumonia
No lung opacities
Airspace Disease Grading
Lungs are divided on frontal chest xray into 3 zones per lung (6 zones total). The upper zone extends from the apices to the superior hilum. The mid zone spans between the superior and inferior hilar margins. The lower zone extends from the inferior hilar margins to the costophrenic sulci.
Mild - Required if not negative for pneumonia
Opacities in 1-2 lung zones
Moderate - Required if not negative for pneumonia
Opacities in 3-4 lung zones
Severe - Required if not negative for pneumonia
Opacities in >4 lung zones
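The grading rule above maps the number of lung zones with opacities (out of 6) to a severity label. A minimal sketch of that mapping follows; the function name is hypothetical, and returning `None` for zero affected zones is an interpretation of "required if not negative for pneumonia", not something the schema spells out.

```python
def grade_airspace_disease(zones_with_opacities):
    """Map the count of affected lung zones (0-6) to a severity grade.

    Returns None when no zone shows opacities, on the assumption that a
    study that is negative for pneumonia receives no grade.
    """
    if not 0 <= zones_with_opacities <= 6:
        raise ValueError("a frontal chest X-ray is divided into 6 zones")
    if zones_with_opacities == 0:
        return None
    if zones_with_opacities <= 2:
        return "Mild"
    if zones_with_opacities <= 4:
        return "Moderate"
    return "Severe"


grades = {n: grade_airspace_disease(n) for n in range(7)}
print(grades)
```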
Supporting clinical variables: MRN*, Age, Study Date*, Exam Description, Sex, Study UID*, Image Count, Modality, Testing Result, Specimen Source (* pseudonymous values).
How to use the JSON annotations
More information about how the JSON annotations are organized can be found on https://docs.md.ai/data/json/. Steps 2 & 3 in this example code demonstrate how to load the JSON into a Dataframe. The JSON file can be downloaded via the data access table below; it is not available via MD.ai. This Jupyter Notebook may also be helpful.
RICORD is available for non-commercial use (and further enrichment) by the research and education communities which may include development of educational resources for COVID-19, use of RICORD to create AI systems for diagnosis and quantification, benchmarking performance for existing solutions, exploration of distributed/federated learning, further annotation or data augmentation efforts, and evaluation of the examinations for disease entities beyond COVID-19 pneumonia. Deliberate consideration of the detailed annotation schema, demographics, and other included meta-data will be critical when generating cohorts with RICORD, particularly as more public COVID-19 imaging datasets are made available via complementary and parallel efforts. It is important to emphasize that there are limitations to the clinical "ground truth" as the SARS-CoV-2 RT-PCR tests have widely documented limitations and are subject to both false-negative and false-positive results which impact the distribution of the included imaging data, and may have led to an unknown epidemiologic distortion of patients based on the inclusion criteria. These limitations notwithstanding, RICORD has achieved the stated objectives for data complexity, heterogeneity, and high-quality expert annotations as a comprehensive COVID-19 thoracic imaging data resource.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the data on ICU-transferred (N=100) and Stable (N=131) patients with COVID-19 (N=156) and Non-COVID-19 viral pneumonia (N=75). Among COVID-19 patients of this study, 82 patients developed Refractory Respiratory Failure (RRF) or Severe Acute Respiratory Distress Syndrome (SARDS) and were transferred to Intensive Care Unit (ICU), 74 patients had a Stable course of disease and were not transferred to ICU. Collected data are presented as a table with columns:
- Gender;
- Age (years);
- SARS-CoV-2 RT-PCR testing results;
- Time between the disease onset and admission to the hospital (days);
- Time between admission to the hospital and transfer to ICU (days);
- Artificial lung ventilation in ICU needed;
- C-reactive protein (CRP) upon admission (mg/L);
- International Normalized Ratio (INR) upon admission;
- Prothrombin Time (PT) upon admission (sec.);
- Fibrinogen upon admission (mg/L);
- Chest Computed Tomography (CT) upon admission: lung tissue affected (%);
- Platelet count upon admission (10^9/L);
- Chest CT, 1 week after admission: lung tissue affected (%);
- CRP, 1 week after admission (mg/L);
- Platelet count, 1 week after admission (10^9/L).
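The group sizes quoted for this cohort allow the non-COVID-19 breakdown to be derived by subtraction. The sketch below assumes that the ICU-transferred and Stable groups together cover every patient and that all counts refer to the same cohort; the variable names are illustrative.

```python
# Counts stated in the dataset description.
icu_total, stable_total = 100, 131
covid_total, non_covid_total = 156, 75
covid_icu, covid_stable = 82, 74

# Derived by subtraction, assuming ICU/Stable partition the whole cohort.
non_covid_icu = icu_total - covid_icu           # 18
non_covid_stable = stable_total - covid_stable  # 57

cohort_size = icu_total + stable_total          # 231

print(non_covid_icu, non_covid_stable, cohort_size)
```

Under that assumption the figures are internally consistent: 18 + 57 gives the 75 non-COVID-19 patients, and both groupings sum to 231 patients.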