100+ datasets found
  1. Worrying confessions: A look at data safety labels on Android

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 18, 2022
    Cite
    Benjamin Altpeter (2022). Worrying confessions: A look at data safety labels on Android [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7088556
    Explore at:
    Dataset updated
    Sep 18, 2022
    Dataset authored and provided by
    Benjamin Altpeter
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Google Play Store recently introduced a data safety section in order to give users accessible insights into apps’ data collection practices. We analyzed the labels of 43,927 of the most popular apps. Almost one third of the apps with a label claim not to collect any data. But we also saw popular apps, including apps meant for children, admitting to collecting and sharing highly sensitive data like the user’s sexual orientation or health information for tracking and advertising purposes. To verify the declarations, we recorded the network traffic of 500 apps, finding more than one quarter of them transmitting tracking data not declared in their data safety label.

    This data set contains a dump of our database, including the top chart data and data safety labels from September 07, 2022, and the recorded network traffic.

    The analysis is available at our blog: https://www.datarequests.org/blog/android-data-safety-labels-analysis/
    The source code for the analysis is available on GitHub: https://github.com/datenanfragen/android-data-safety-label-analysis

  2. Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jason Owen-Smith (2023). Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale [Dataset]. http://doi.org/10.6084/m9.figshare.13404986.v4
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale", to be published in Scientometrics (in press).

    1. AUT_ORC.zip: a list of 3M author name instances in MEDLINE linked to Author-ity2009.
    2. AUT_NIH.zip: a list of 313K author name instances in MEDLINE linked to NIH PI ID.
    3. AUT_SCT_pairs.zip: a list of 6.2M paper pairs and author byline positions in self-citation relation.
    4. AUT_SCT_info.zip: a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs.

    Information about an author name instance in AUT-SCT_pairs can be connected to AUT-SCT_info using the combination of PMID and byline position as a key. Please see the paper for details on how the datasets were created:

    Kim, J., & Owen-Smith, J. (in press). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

    The uploaded datasets were created by combining several data sources:

    1. ORCID data (2018 version). Please refer to the policies on the use of ORCID data: https://info.orcid.org/public-data-file-use-policy/
    2. MEDLINE baseline data (2016 version). Please refer to the policies on the use of MEDLINE data: https://www.nlm.nih.gov/databases/download/pubmed_medline.html
    3. Author-ity2009, Ethnea, and Genni datasets. Please refer to the policies on the use of those datasets: https://databank.illinois.edu/datasets/IDB-9087546
    4. The dataset of NIH ID linked to Author-ity2009: https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

    Please cite the papers below to properly credit the creators of the original datasets:

    Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

    Torvik, V. I., & Agarwal, S. (2016). Ethnea: an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

    Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2013), 199-208. doi:10.1145/2467696.2467720

    Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
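    The PMID plus byline-position join described above can be sketched as follows (a minimal illustration; the field names and records used here are placeholders, not the files' actual headers or contents):

```python
# Illustrative records only; the real column layouts are documented in the paper.
pairs = [
    {"pmid": "12345678", "byline_pos": 2, "cited_pmid": "23456789"},
    {"pmid": "99999999", "byline_pos": 1, "cited_pmid": "11111111"},
]
info = {
    ("12345678", 2): {"author_name": "Kim, J"},
}

def lookup(record):
    """Connect an AUT-SCT_pairs-style record to its AUT-SCT_info entry
    using (PMID, byline position) as the key."""
    return info.get((record["pmid"], record["byline_pos"]))

matches = [lookup(r) for r in pairs]
# matches[0] resolves to an author record; matches[1] has no info entry.
```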

  3. GPT-2 generated form fields

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2022
    Cite
    Brian Davis (2022). GPT-2 generated form fields [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6544100
    Explore at:
    Dataset updated
    May 13, 2022
    Dataset authored and provided by
    Brian Davis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a single json containing label-value form fields generated using GPT-2. This data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's Supplementary Materials and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt

    The data has groups of label-value pairs each with a "title" or topic (or null). Each label-value pair group was generated in a single GPT-2 generation and thus the pairs "belong to the same form." The json structure is a list of tuples, where each tuple has the title or null as the first element and the list of label-value pairs of the group as the second element. Each label-value pair is another tuple with the first element being the label and the second being the value or a list of values.

    For example:

    [ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
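    A minimal sketch of reading this structure with Python's standard json module, using the example above as input (JSON has no tuples, so each tuple arrives as a list):

```python
import json

# The example structure from above: a list of (title-or-null, pairs) tuples,
# where each pair is (label, value-or-list-of-values).
raw = ('[ ["title",[ ["first label", "first value"], '
       '["second label", ["a label", "another label"] ] ] ], '
       '[null, [ ["again label", "again value"] ] ] ]')

flat = []  # (title, label, list-of-values) rows
for title, pairs in json.loads(raw):
    for label, value in pairs:
        # A value may be a single string or a list of values; normalize to a list.
        values = value if isinstance(value, list) else [value]
        flat.append((title, label, values))
# flat now holds one row per label-value pair, with the group title (or None).
```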

  4. Random Sample of NIH Chest X ray Dataset

    • cubig.ai
    Updated May 28, 2025
    Cite
    CUBIG (2025). Random Sample of NIH Chest X ray Dataset [Dataset]. https://cubig.ai/store/products/354/random-sample-of-nih-chest-x-ray-dataset
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction

    • The Random Sample of NIH Chest X-ray Dataset is a sample of a large public medical imaging dataset containing 112,120 chest X-ray images with 15 labels (14 diseases plus normal) collected from 30,805 patients.

    2) Data Utilization

    (1) Random Sample of NIH Chest X-ray Dataset has characteristics that:
    • Each sample comes with detailed metadata such as image file name, disease label, patient ID, age, gender, view position, and image size; the labels were extracted from the radiology reports with NLP, with an estimated accuracy of more than 90%.
    • It contains 5,606 images of size 1024x1024, covering 14 diseases and a 'No Finding' class, but because it is a sample, some disease classes are very scarce.
    (2) Random Sample of NIH Chest X-ray Dataset can be used to:
    • Develop chest disease image-reading AI: deep learning-based automatic diagnosis and classification models can be trained and evaluated on X-ray images with various chest disease labels.
    • Support medical image preprocessing and labeling research: automatic labeling of large medical image datasets, data quality evaluation, and weakly supervised learning.

  5. NIH Chest X-rays Bbox version

    • kaggle.com
    Updated Jun 25, 2024
    Cite
    Huthayfa Hodeb (2024). NIH Chest X-rays Bbox version [Dataset]. https://www.kaggle.com/datasets/huthayfahodeb/nih-chest-x-rays-bbox-version
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Kaggle
    Authors
    Huthayfa Hodeb
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NIH Chest X-ray Dataset

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Link to paper

    Data limitations

    • The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
    • Very limited numbers of disease region bounding boxes (See BBox_list_2017.csv)

    File contents

    • Image format: 880 total images with size 1024 x 1024
    • bbox_img: Contains 880 bbox images
    • README_ChestXray.pdf: Original README file
    • BBox_list_2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      • Image Index: File name
      • Finding Label: Disease type (Class label)
      • Bbox x
      • Bbox y
      • Bbox w
      • Bbox h
    • Data_entry_2017.csv: Class labels and patient data for the entire dataset
      • Image Index: File name
      • Finding Labels: Disease type (Class label)
      • Follow-up #
      • Patient ID
      • Patient Age
      • Patient Gender
      • View Position: X-ray orientation
      • OriginalImageWidth
      • OriginalImageHeight
      • OriginalImagePixelSpacing_x
      • OriginalImagePixelSpacing_y
    • label.csv: Class labels
    • tesnorlfow.csv: tensorflow version of the dataset
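    The bounding-box convention noted above (boxes start at x,y and extend w pixels horizontally and h pixels vertically) can be converted to corner coordinates with a small helper; the CSV row below is illustrative, not taken from the actual file:

```python
import csv
from io import StringIO

# Illustrative row in the BBox_list_2017.csv layout described above;
# the values are made up, not taken from the file.
sample = """Image Index,Finding Label,Bbox x,Bbox y,Bbox w,Bbox h
00000001_000.png,Cardiomegaly,300.0,400.0,250.0,180.0
"""

def to_corners(x, y, w, h):
    """Convert (x, y, w, h) as defined above to (x_min, y_min, x_max, y_max)."""
    return (x, y, x + w, y + h)

rows = list(csv.DictReader(StringIO(sample)))
corners = to_corners(float(rows[0]["Bbox x"]), float(rows[0]["Bbox y"]),
                     float(rows[0]["Bbox w"]), float(rows[0]["Bbox h"]))
# corners holds the box as two opposite corners, ready for cropping or drawing.
```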

    Class descriptions

    There are 8 classes. Images can be classified as one or more disease classes:

    • Infiltrate
    • Atelectasis
    • Pneumonia
    • Cardiomegaly
    • Effusion
    • Pneumothorax
    • Mass
    • Nodule

    Citations

    Acknowledgements

    This work was supported by the Intramural Research Program of the NIH Clinical Center (clinicalcenter.nih.gov) and the National Library of Medicine (www.nlm.nih.gov).

  6. TG-CSR Annotations

    • data.niaid.nih.gov
    Updated Aug 21, 2023
    Cite
    Alice M. Mulvehill (2023). TG-CSR Annotations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7908823
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    Alice M. Mulvehill
    Henrique Santos
    Ke Shen
    Deborah L. McGuinness
    Mayank Kejriwal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Individual raw and normalized label data for the TG-CSR (Theoretically-Grounded Commonsense Reasoning) benchmark.

  7. Labeled Temporal Brain Networks

    • entrepot.recherche.data.gouv.fr
    txt, zip
    Updated Jul 21, 2023
    Cite
    Aurora ROSSI; Aurora ROSSI (2023). Labeled Temporal Brain Networks [Dataset]. http://doi.org/10.57745/HHNT10
    Explore at:
    Available download formats: txt (1498), zip (648811279)
    Dataset updated
    Jul 21, 2023
    Dataset provided by
    Recherche Data Gouv
    Authors
    Aurora ROSSI; Aurora ROSSI
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/HHNT10

    Dataset funded by
    French government, National Research Agency (ANR)
    Description

    Labeled Temporal Brain Networks

    This dataset contains a collection of temporal brain networks of 100 subjects. Each subject has a label representing their biological sex ("M" for male and "F" for female) and age range (22-25, 26-30, 31-35, and 36+). The networks are obtained from resting-state fMRI data from the Human Connectome Project (HCP) and are undirected and weighted. The number of nodes is fixed at 202, while the edge weights change over time.

    Dataset structure

    The networks.zip file contains the networks as .txt files in the following format: the first line of each .txt file contains the number of nodes and the number of snapshots of the network, separated by a space. The following lines contain the list of edges of the network in the form i,j,t,w, meaning that the edge between node i and node j at time t has weight w. The labels are contained in the file labels.txt, which has three space-separated columns: the subject identifier, the biological sex, and the age range.

    Acknowledgments

    Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657), funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University. The authors are grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support. This work has been supported by the French government through the UCA DS4H Investments in the Future project managed by the National Research Agency (ANR), reference number ANR-17-EURE-0004.
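    A minimal sketch of parsing one network .txt file in the format described above (the sample input is invented for illustration):

```python
def parse_temporal_network(text):
    """Parse a network .txt: first line 'n_nodes n_snapshots', then one
    'i,j,t,w' edge per line, as described in the dataset structure."""
    lines = text.strip().splitlines()
    n_nodes, n_snapshots = map(int, lines[0].split())
    edges = []
    for line in lines[1:]:
        i, j, t, w = line.split(",")
        edges.append((int(i), int(j), int(t), float(w)))
    return n_nodes, n_snapshots, edges

# Invented sample: 3 nodes, 2 snapshots, two weighted temporal edges.
sample = """3 2
0,1,0,0.8
1,2,1,0.5
"""
n, s, e = parse_temporal_network(sample)
# e[0] is the edge between nodes 0 and 1 at time 0 with weight 0.8.
```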

  8. Journal Article Tag Suite (JATS)

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +2 more
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Journal Article Tag Suite (JATS) [Dataset]. https://catalog.data.gov/dataset/journal-article-tag-suite-jats
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    Journal Article Tag Suite (JATS) is an application of NISO Z39.96.2019, which defines a set of XML elements and attributes for describing the textual and graphical content of journal articles and describes three article models.

  9. Data from: DailyMed

    • datadiscovery.nlm.nih.gov
    • data.virginia.gov
    • +6 more
    application/rdfxml +5
    Updated Mar 2, 2021
    Cite
    (2021). DailyMed [Dataset]. https://datadiscovery.nlm.nih.gov/d/n7e9-np3x
    Explore at:
    Available download formats: application/rdfxml, xml, application/rssxml, csv, json, tsv
    Dataset updated
    Mar 2, 2021
    Description

    DailyMed provides health information providers and the public with a standard, comprehensive, up-to-date, look-up and download resource of medication content and labeling as found in medication package inserts, also known as Structured Product Labeling (SPL).

  10. Mini NIH XRay Dataset for Binary Classification

    • kaggle.com
    Updated Jan 4, 2023
    Cite
    Abby Morgan (2023). Mini NIH XRay Dataset for Binary Classification [Dataset]. https://www.kaggle.com/datasets/abbymorgan/create-mini-xray-dataset-binary-classification-100
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abby Morgan
    Description

    The original full dataset contained 112,120 X-ray images with disease labels from 30,805 unique patients.

    This notebook is modified from K Scott Mader's notebook here to create a mini chest x-ray dataset that is split 50:50 between normal and diseased images.

    In my notebook I will use this dataset to test a pretrained model on a binary classification task (diseased vs. healthy xray), and then visualize which specific labels the model has the most trouble with.

    Also, because disease classification is such an important task to get right, it's likely that any AI/ML medical classification task will include a human-in-the-loop. In this way, this process more closely resembles how this sort of ML would be used in the real world.

    Note that the original notebook on which this one was based had two versions: Standard and Equalized. In this notebook we will be using the equalized version in order to save ourselves the extra step of performing CLAHE during the tensor transformations.

    The goal of this notebook, as originally stated by Mader, is "to make a much easier to use mini-dataset out of the Chest X-Ray collection. The idea is to have something akin to MNIST or Fashion MNIST for medical images." In order to do this, we will preprocess, normalize, and scale down the images, and then save them into an HDF5 file with the corresponding tabular data.

    Data limitations: The image labels are NLP extracted, so there could be some erroneous labels, but the NLP labeling accuracy is estimated to be >90%. There are very limited numbers of disease region bounding boxes (see BBox_list_2017.csv). Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation.

    File Contents: The file is an HDF5 file of shape 200 x 28. The main file contains a nested HDF5 file of x-ray images under the key 'images'. Main HDF5 file keys are:
    - Image Index
    - Finding Labels: list of disease labels
    - Follow-up #
    - Patient ID
    - Patient Age
    - Patient Gender: 'F'/'M'
    - View Position: 'PA'/'AP'
    - OriginalImageWidth
    - OriginalImageHeight
    - OriginalImagePixelSpacing_x
    - Normal: binary; if the X-ray finding is 'Normal'
    - Atelectasis: binary; if the finding includes 'Atelectasis'
    - Cardiomegaly: binary; if the finding includes 'Cardiomegaly'
    - Consolidation: binary; if the finding includes 'Consolidation'
    - Edema: binary; if the finding includes 'Edema'
    - Effusion: binary; if the finding includes 'Effusion'
    - Emphysema: binary; if the finding includes 'Emphysema'
    - Fibrosis: binary; if the finding includes 'Fibrosis'
    - Hernia: binary; if the finding includes 'Hernia'
    - Infiltration: binary; if the finding includes 'Infiltration'
    - Mass: binary; if the finding includes 'Mass'
    - Nodule: binary; if the finding includes 'Nodule'
    - Pleural_Thickening: binary; if the finding includes 'Pleural_Thickening'
    - Pneumonia: binary; if the finding includes 'Pneumonia'
    - Pneumothorax: binary; if the finding includes 'Pneumothorax'
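    The binary disease columns above can be derived from the pipe-separated Finding Labels field; a minimal sketch, assuming the '|' separator used by the original NIH metadata:

```python
CLASSES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
           "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
           "Pleural_Thickening", "Pneumonia", "Pneumothorax"]

def to_flags(finding_labels):
    """Turn a pipe-separated 'Finding Labels' string into the 0/1 columns
    listed above. 'Normal' is set when no disease class is present (the full
    dataset writes this case as 'No Finding')."""
    found = set(finding_labels.split("|"))
    flags = {c: int(c in found) for c in CLASSES}
    flags["Normal"] = int(not found & set(CLASSES))
    return flags

flags = to_flags("Cardiomegaly|Effusion")
# Both listed diseases flip to 1; everything else, including Normal, stays 0.
```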

  11. NIH Chest X ray 14 (224x224 resized)

    • kaggle.com
    zip
    Updated Jul 8, 2020
    + more versions
    Cite
    Khan Fashee Monowar (Sawrup) (2020). NIH Chest X ray 14 (224x224 resized) [Dataset]. https://www.kaggle.com/khanfashee/nih-chest-x-ray-14-224x224-resized
    Explore at:
    Available download formats: zip (2468882507 bytes)
    Dataset updated
    Jul 8, 2020
    Authors
    Khan Fashee Monowar (Sawrup)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Data limitations:

    The image labels are NLP extracted, so there could be some erroneous labels, but the NLP labeling accuracy is estimated to be >90%.
    Very limited numbers of disease region bounding boxes (see BBox_list_2017.csv)
    Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation
    

    File contents

    Image format: 112,120 total images with size 1024 x 1024
    
    images_001.zip: Contains 4999 images
    
    images_002.zip: Contains 10,000 images
    
    images_003.zip: Contains 10,000 images
    
    images_004.zip: Contains 10,000 images
    
    images_005.zip: Contains 10,000 images
    
    images_006.zip: Contains 10,000 images
    
    images_007.zip: Contains 10,000 images
    
    images_008.zip: Contains 10,000 images
    
    images_009.zip: Contains 10,000 images
    
    images_010.zip: Contains 10,000 images
    
    images_011.zip: Contains 10,000 images
    
    images_012.zip: Contains 7,121 images
    
    README_ChestXray.pdf: Original README file
    
    BBox_list_2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      Image Index: File name
      Finding Label: Disease type (Class label)
      Bbox x
      Bbox y
      Bbox w
      Bbox h
    
    Data_entry_2017.csv: Class labels and patient data for the entire dataset
      Image Index: File name
      Finding Labels: Disease type (Class label)
      Follow-up #
      Patient ID
      Patient Age
      Patient Gender
      View Position: X-ray orientation
      OriginalImageWidth
      OriginalImageHeight
      OriginalImagePixelSpacing_x
      OriginalImagePixelSpacing_y
    

    Class descriptions

    There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:

    Atelectasis
    Consolidation
    Infiltration
    Pneumothorax
    Edema
    Emphysema
    Fibrosis
    Effusion
    Pneumonia
    Pleural_thickening
    Cardiomegaly
    Nodule
    Mass
    Hernia
    

    Full Dataset Content

    There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5,606 X-ray images and class labels.

    Sample: sample.zip
    

    Modifications to original data

    Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform
    
    CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory
    

    Citations

    Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017.
    
    NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
    
    Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
    
  12. Data from: NutriGreen Image Dataset: A Collection of Annotated Nutrition,...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Feb 5, 2024
    Cite
    Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara; Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara (2024). NutriGreen Image Dataset: A Collection of Annotated Nutrition, Organic, and Vegan Food Products [Dataset]. http://doi.org/10.5281/zenodo.10020545
    Explore at:
    Available download formats: bin, csv, zip
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara; Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The generated dataset is an annotated collection, with each image carrying labels (NutriScore, V-label, and Bio). Annotated data is essential for developing a supervised machine-learning model capable of automatically identifying labels in new images. In our case, we use this data to train a model that can autonomously recognize labels on images not present in the dataset, achieving a model accuracy of 94%. In the future, you can train a new model on the dataset to achieve higher accuracy, or employ the existing model to automatically identify bio and nutri labels in newly collected images, eliminating the need for manual review. We should emphasize that these resources are intended for use by a data science team. The model could also be integrated with a mobile app, but this is a direction for future work noted in the revised version.

    In this research, we introduce the NutriGreen dataset, which is a collection of images representing packaged food products. Each image in the dataset comes with three distinct labels: one indicating its nutritional value using the Nutri-Score, another denoting whether it's vegan or vegetarian with the V-label, and a third displaying the EU organic certification (BIO) logo. The dataset comprises a total of 10,472 images. Among these, the Nutri-Score label is distributed across five sub-labels: A with 1,250 images, B with 1,107 images, C with 867 images, D with 1,001 images, and E with 967 images. Additionally, there are 870 images featuring the V-Label, 2,328 images showcasing the BIO label, and 3,201 images with no labels. Furthermore, we have fine-tuned the YOLOv5 model to demonstrate the practicality of using these annotated datasets, achieving an impressive accuracy of 94.0%. These promising results indicate that this dataset has significant potential for training innovative systems capable of detecting food labels. Moreover, it can serve as a valuable benchmark dataset for emerging computer vision systems.

  13. User Study Data - Obfuscation and Labeling of Search Results to Mitigate...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 19, 2024
    Cite
    Draws, Tim (2024). User Study Data - Obfuscation and Labeling of Search Results to Mitigate Confirmation Bias [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5902728
    Explore at:
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Draws, Tim
    Rieger, Alisa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data was collected to test the effect of obfuscations with warning labels on participants' interaction with search results on debated topics. The data set contains questionnaire responses and interaction data with search results on debated topics from 328 participants. Data excluded from the analysis is not included in this data set; exclusion applied to participants who did not fulfill the requirements of reporting a strong attitude on at least one of the topics, passing all four attention checks, spending more than 60 seconds on the SERP, and clicking on and marking at least one search result.

  14. openFDA Drug Labeling

    • kaggle.com
    Updated Apr 9, 2025
    Cite
    ddrbcn (2025). openFDA Drug Labeling [Dataset]. https://www.kaggle.com/datasets/ddrbcn/openfda-drug-labeling
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ddrbcn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧬 openFDA Drug Labeling – JSON Dataset

    This dataset contains structured drug labeling information (FDA labels) provided by DailyMed and made available through the openFDA Drug Labeling endpoint.

    The dataset includes 13 compressed .zip files with drug label records in JSON format. Each record reflects the full label submitted to the FDA, and the structure matches what you would receive from the /drug/label API.

    📁 Dataset Contents

    • 13 ZIP files
    • Each file contains multiple JSON documents representing FDA-approved drug labels
    • Data fields include (but are not limited to):
      • drug_interactions
      • warnings
      • indications_and_usage
      • contraindications
      • adverse_reactions
      • dosage_and_administration
      • brand_name, generic_name
      • ...and many others

    You will also find the 'Human Drug.xlsx' file included in the dataset, which contains the complete data dictionary for reference.

    🔄 Updates

    This dataset reflects the most recent version available as of April 9, 2025. According to the source, previous records may be modified in future updates. For accuracy and completeness, all files should be downloaded together.

    📚 Sources and More Information

    ⚠️ Disclaimer (Please Read Carefully)

    Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products. We may limit or otherwise restrict your access to the API in line with our Terms of Service.

    Full terms available here: openFDA Terms of Service

    🛠️ Notes for Usage

    This dataset is ideal for applications involving:

    • Drug safety analysis
    • Drug interaction monitoring
    • Medical language modeling
    • Retrieval-augmented generation (RAG) agents
    • Regulatory and pharmacovigilance systems

    You may want to extract and preprocess only relevant fields before vectorizing or feeding them into an AI model for efficiency and performance.
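    The preprocessing step suggested above can be sketched with only the standard library. The field names come from the dataset contents listed earlier; the shape of the toy record (label fields as lists of strings) is an assumption about how the JSON documents are structured:

    ```python
    import json

    # Fields listed in the dataset contents above; adjust to your needs.
    FIELDS = ["drug_interactions", "warnings", "indications_and_usage",
              "contraindications", "adverse_reactions",
              "dosage_and_administration", "brand_name", "generic_name"]

    def extract_fields(record, fields=FIELDS):
        """Keep only the listed label fields, joining list-valued fields
        into a single text block for downstream vectorization."""
        out = {}
        for f in fields:
            value = record.get(f)
            if isinstance(value, list):
                value = "\n".join(value)
            if value:
                out[f] = value
        return out

    # Toy record shaped like a label document (illustrative, not real data):
    record = {"brand_name": ["Aspirin"],
              "warnings": ["Do not exceed the stated dose."],
              "unrelated_field": ["dropped by the filter"]}
    slim = extract_fields(record)
    ```

    Applied over the unzipped JSON files, this yields compact records suitable for embedding or indexing.
    
    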

  15. Code for Predicting MIEs from Gene Expression and Chemical Target Labels...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Code for Predicting MIEs from Gene Expression and Chemical Target Labels with Machine Learning (MIEML) [Dataset]. https://catalog.data.gov/dataset/code-for-predicting-mies-from-gene-expression-and-chemical-target-labels-with-machine-lear
    Explore at:
    Dataset updated
    Apr 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining. BioMed Central Ltd, London, UK, 7 (2022).

  16. A

    ‘Dietary Supplements Label Database (DSLD) - Product Information’ analyzed...

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Dietary Supplements Label Database (DSLD) - Product Information’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-dietary-supplements-label-database-dsld-product-information-3954/b142dd69/?iid=008-044&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Dietary Supplements Label Database (DSLD) - Product Information’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/2a76d253-e2f4-49c5-90e3-d08701608b28 on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    (https://dsld.nlm.nih.gov) The Dietary Supplement Label Database (DSLD) includes full label derived information from dietary supplement products marketed in the U.S. with a Web-based user interface that provides ready access to label information. It was developed to serve the research community and as a resource for health care providers and the public. It can be an educational and research tool for students, academics, and other professionals.

    The Product Information dataset contains the full listing of product labels, LanguaLcodes, and other product information.

    --- Original source retains full ownership of the source dataset ---

  17. Z

    Data from: Multi-label Pathway Prediction based on Active Dataset...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 15, 2020
    Cite
    J. Hallam, Steven (2020). Multi-label Pathway Prediction based on Active Dataset Subsampling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3940705
    Explore at:
    Dataset updated
    Sep 15, 2020
    Dataset provided by
    J. Hallam, Steven
    M. A. Basher, Abdur Rahman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We include samples of the various data types used in the work "Multi-label Pathway Prediction based on Active Dataset Subsampling" (under review).

    More information about the software package, along with usage instructions, is provided in the hallamlab/leADS repository.

  18. Data from: A region-wide, multi-year set of crop field boundary labels for...

    • zenodo.org
    • registry.opendata.aws
    application/gzip, bin
    Updated May 15, 2024
    Cite
    Amos Wussah; Mary Asipunu; Michelle Gathigi; Primož Kovačič; Justus Muhando; Victor Yeboah; Foster Addai; Edward Setor Akakpo; Michael Allotey; Phillip Amkoya; Eunice Amponsem; Kofi Danquah Dadon; Xefilde Godknows Harrison; Emily Heltzel; Charles Juma; Ronald Mdawida; Adelide Miroyo; Julius Mucha; Judith Mugami; Fredrick Mwawaza; Delaiah Nyarko; Purent Oduor; Kofi Ohemeng; Sladen Isaac Dela Segbefia; Trevor Tumbula; Francis Wambua; Felicia Yeboah; Lyndon Estes (2024). A region-wide, multi-year set of crop field boundary labels for Africa [Dataset]. http://doi.org/10.5281/zenodo.11060871
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    May 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amos Wussah; Mary Asipunu; Michelle Gathigi; Primož Kovačič; Justus Muhando; Victor Yeboah; Foster Addai; Edward Setor Akakpo; Michael Allotey; Phillip Amkoya; Eunice Amponsem; Kofi Danquah Dadon; Xefilde Godknows Harrison; Emily Heltzel; Charles Juma; Ronald Mdawida; Adelide Miroyo; Julius Mucha; Judith Mugami; Fredrick Mwawaza; Delaiah Nyarko; Purent Oduor; Kofi Ohemeng; Sladen Isaac Dela Segbefia; Trevor Tumbula; Francis Wambua; Felicia Yeboah; Lyndon Estes
    License

    https://assets.planet.com/docs/Planet_ParticipantLicenseAgreement_NICFI.pdf

    Description

    Data resulting from a project undertaken to generate a comprehensive set of crop field boundary labels throughout the continent of Africa, representing the years 2017-2023. The project was funded by the Lacuna Fund (https://lacunafund.org/) and led by Farmerline (https://farmerline.co/), in collaboration with Spatial Collective (https://spatialcollective.com/) and the Agricultural Impacts Research Group at Clark University (https://www.clarku.edu/departments/geography/).

    Please refer to the technical report in the accompanying repository for more details on the methods used to develop the dataset, an analysis of label quality, and usage guidelines.

  19. h

    Label-free LC-MS/MS of Kidney (Right) from Female, 58 years old

    • portal.hubmapconsortium.org
    Updated Aug 22, 2020
    + more versions
    Cite
    Jeff Spraggins (2020). Label-free LC-MS/MS of Kidney (Right) from Female, 58 years old [Dataset]. https://portal.hubmapconsortium.org/browse/dataset/af8e5c3a7f66a105e8e19aba8a6fc6e3
    Explore at:
    Dataset updated
    Aug 22, 2020
    Dataset provided by
    Vanderbilt TMC
    Authors
    Jeff Spraggins
    Description

    LC-MS/MS proteomics data collected from the right kidney of a 58-year-old White female donor by the Biomolecular Multimodal Imaging Center (BIOMIC) at Vanderbilt University. BIOMIC is a Tissue Mapping Center that is part of the NIH-funded Human Biomolecular Atlas Program (HuBMAP). Label-free data were collected with a Thermo Scientific Orbitrap Fusion Tribrid using DIA methods. Support was provided by the NIH Common Fund and National Institute of Diabetes and Digestive and Kidney Diseases (U54 DK120058). Tissue was collected through the Cooperative Human Tissue Network with support provided by the NIH National Cancer Institute (5 UM1 CA183727-08).

  20. Sentiment Analysis Test Dataset Created from Two COVID-19 Surveys: National...

    • figshare.com
    xlsx
    Updated Jan 9, 2024
    Cite
    Juan Antonio Lossio-Ventura; Rachel Weger; Angela Lee; Emily Guinee; Joyce Chung; Lauren Atlas; Eleni Linos; Francisco Pereira (2024). Sentiment Analysis Test Dataset Created from Two COVID-19 Surveys: National Institutes of Health (NIH) and Stanford University [Dataset]. http://doi.org/10.6084/m9.figshare.24560584.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Juan Antonio Lossio-Ventura; Rachel Weger; Angela Lee; Emily Guinee; Joyce Chung; Lauren Atlas; Eleni Linos; Francisco Pereira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two COVID-19 surveys were used to create the test dataset, both collected by teams from the National Institutes of Health (NIH) and Stanford University. The collected data were intended to assess the general topics experienced by participants during the pandemic lockdown. The test dataset comprises a total of 1,000 randomly chosen sentences, with 500 sentences selected from each survey. Each set was annotated by three separate and independent annotators. The annotators were instructed to assess the polarity of each sentence on a scale of -1 (negative), 0 (neutral), or 1 (positive). We then followed a three-step procedure to determine the final labels. First, if all three annotators agreed on a label (full agreement), that label was accepted. Second, if two out of the three agreed on a label (partial agreement), that label was also accepted. Third, if there was no agreement, the label was set as neutral (no agreement).
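    The three-step agreement procedure described above reduces to a majority vote with a neutral fallback. A minimal sketch:

    ```python
    from collections import Counter

    def final_label(annotations):
        """Resolve three polarity annotations (-1, 0, or 1) into a final label:
        full or partial agreement (>= 2 matching votes) wins; with no
        agreement, fall back to neutral (0)."""
        label, count = Counter(annotations).most_common(1)[0]
        return label if count >= 2 else 0

    final_label([1, 1, 1])    # full agreement -> 1
    final_label([1, 0, 1])    # partial agreement -> 1
    final_label([-1, 0, 1])   # no agreement -> neutral
    ```
    
    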
