100+ datasets found
  1. Worrying confessions: A look at data safety labels on Android

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 18, 2022
    Cite
    Benjamin Altpeter (2022). Worrying confessions: A look at data safety labels on Android [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7088556
    Explore at:
    Dataset updated
    Sep 18, 2022
    Dataset authored and provided by
    Benjamin Altpeter
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Google Play Store recently introduced a data safety section in order to give users accessible insights into apps’ data collection practices. We analyzed the labels of 43,927 of the most popular apps. Almost one third of the apps with a label claim not to collect any data. But we also saw popular apps, including apps meant for children, admitting to collecting and sharing highly sensitive data like the user’s sexual orientation or health information for tracking and advertising purposes. To verify the declarations, we recorded the network traffic of 500 apps, finding more than one quarter of them transmitting tracking data not declared in their data safety label.

    This data set contains a dump of our database, including the top chart data and data safety labels from September 07, 2022, and the recorded network traffic.

    The analysis is available at our blog: https://www.datarequests.org/blog/android-data-safety-labels-analysis/
    The source code for the analysis is available on GitHub: https://github.com/datenanfragen/android-data-safety-label-analysis

  2. Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jinseok Kim; Jason Owen-Smith (2023). Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale [Dataset]. http://doi.org/10.6084/m9.figshare.13404986.v4
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinseok Kim; Jason Owen-Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale", to be published in Scientometrics (in press).

    1. AUT_ORC.zip: a list of 3M author name instances in MEDLINE linked to Author-ity2009.
    2. AUT_NIH.zip: a list of 313K author name instances in MEDLINE linked to NIH PI ID.
    3. AUT_SCT_pairs.zip: a list of 6.2M paper pairs and author byline positions in self-citation relation.
    4. AUT_SCT_info.zip: a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs.

    Information about an author name instance in AUT-SCT_pairs can be connected to AUT-SCT_info using the combination of PMID and byline position as a key. Please see the paper for details on how the datasets were created:

    Kim, J., & Owen-Smith, J. (in press). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

    The uploaded datasets were created by combining several data sources:

    1. ORCID data (2018 version). Please refer to the policies on the use of ORCID data: https://info.orcid.org/public-data-file-use-policy/
    2. MEDLINE baseline data (2016 version). Please refer to the policies on the use of MEDLINE data: https://www.nlm.nih.gov/databases/download/pubmed_medline.html
    3. Author-ity2009, Ethnea, and Genni datasets. Please refer to the policies on the use of those datasets: https://databank.illinois.edu/datasets/IDB-9087546
    4. The dataset of NIH ID linked to Author-ity2009: https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

    Please cite the papers below to properly credit the creators of the original datasets:

    Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

    Torvik, V. I., & Agarwal, S. (2016). Ethnea: an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

    Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2013), 199-208. doi:10.1145/2467696.2467720

    Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
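    The PMID plus byline-position join described above can be sketched as follows (a minimal illustration; the field names and records used here are placeholders, not the files' actual headers or contents):

```python
# Illustrative records only; the real column layouts are documented in the paper.
pairs = [
    {"pmid": "12345678", "byline_pos": 2, "cited_pmid": "23456789"},
    {"pmid": "99999999", "byline_pos": 1, "cited_pmid": "11111111"},
]
info = {
    ("12345678", 2): {"author_name": "Kim, J"},
}

def lookup(record):
    """Connect an AUT-SCT_pairs-style record to its AUT-SCT_info entry
    using (PMID, byline position) as the key."""
    return info.get((record["pmid"], record["byline_pos"]))

matches = [lookup(r) for r in pairs]
# matches[0] resolves to an author record; matches[1] has no info entry.
```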

  3. GPT-2 generated form fields

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2022
    Cite
    Brian Davis (2022). GPT-2 generated form fields [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6544100
    Explore at:
    Dataset updated
    May 13, 2022
    Dataset authored and provided by
    Brian Davis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a single json containing label-value form fields generated using GPT-2. This data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's Supplementary Materials and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt

    The data has groups of label-value pairs each with a "title" or topic (or null). Each label-value pair group was generated in a single GPT-2 generation and thus the pairs "belong to the same form." The json structure is a list of tuples, where each tuple has the title or null as the first element and the list of label-value pairs of the group as the second element. Each label-value pair is another tuple with the first element being the label and the second being the value or a list of values.

    For example:

    [ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
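    A minimal sketch of reading this structure with Python's standard json module, using the example above as input (JSON has no tuples, so each tuple arrives as a list):

```python
import json

# The example structure from above: a list of (title-or-null, pairs) tuples,
# where each pair is (label, value-or-list-of-values).
raw = ('[ ["title",[ ["first label", "first value"], '
       '["second label", ["a label", "another label"] ] ] ], '
       '[null, [ ["again label", "again value"] ] ] ]')

flat = []  # (title, label, list-of-values) rows
for title, pairs in json.loads(raw):
    for label, value in pairs:
        # A value may be a single string or a list of values; normalize to a list.
        values = value if isinstance(value, list) else [value]
        flat.append((title, label, values))
# flat now holds one row per label-value pair, with the group title (or None).
```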

  4. Random Sample of NIH Chest X ray Dataset

    • cubig.ai
    Updated May 28, 2025
    Cite
    CUBIG (2025). Random Sample of NIH Chest X ray Dataset [Dataset]. https://cubig.ai/store/products/354/random-sample-of-nih-chest-x-ray-dataset
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction

    • The Random Sample of NIH Chest X-ray Dataset is a sample of a large public medical imaging dataset containing 112,120 chest X-ray images with 15 labels (14 diseases plus normal) collected from 30,805 patients.

    2) Data Utilization

    (1) Random Sample of NIH Chest X-ray Dataset has characteristics that:
    • Each sample comes with detailed metadata such as image file name, disease label, patient ID, age, gender, view position, and image size; the labels were extracted from the radiology reports with NLP, with an estimated accuracy of more than 90%.
    • It contains 5,606 images of size 1024x1024, covering 14 diseases and a 'No Finding' class, but because it is a sample, some disease classes are very scarce.
    (2) Random Sample of NIH Chest X-ray Dataset can be used to:
    • Develop chest disease image-reading AI: deep learning-based automatic diagnosis and classification models can be trained and evaluated on X-ray images with various chest disease labels.
    • Support medical image preprocessing and labeling research: automatic labeling of large medical image datasets, data quality evaluation, and weakly supervised learning.

  5. NIH Chest X-rays Bbox version

    • kaggle.com
    Updated Jun 25, 2024
    Cite
    Huthayfa Hodeb (2024). NIH Chest X-rays Bbox version [Dataset]. https://www.kaggle.com/datasets/huthayfahodeb/nih-chest-x-rays-bbox-version
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Kaggle
    Authors
    Huthayfa Hodeb
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NIH Chest X-ray Dataset

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Link to paper

    Data limitations

    • The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
    • Very limited numbers of disease region bounding boxes (See BBox_list_2017.csv)

    File contents

    • Image format: 880 total images with size 1024 x 1024
    • bbox_img: Contains 880 bbox images
    • README_ChestXray.pdf: Original README file
    • BBox_list_2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      • Image Index: File name
      • Finding Label: Disease type (Class label)
      • Bbox x
      • Bbox y
      • Bbox w
      • Bbox h
    • Data_entry_2017.csv: Class labels and patient data for the entire dataset
      • Image Index: File name
      • Finding Labels: Disease type (Class label)
      • Follow-up #
      • Patient ID
      • Patient Age
      • Patient Gender
      • View Position: X-ray orientation
      • OriginalImageWidth
      • OriginalImageHeight
      • OriginalImagePixelSpacing_x
      • OriginalImagePixelSpacing_y
    • label.csv: Class labels
    • tesnorlfow.csv: tensorflow version of the dataset
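    The bounding-box convention noted above (boxes start at x,y and extend w pixels horizontally and h pixels vertically) can be converted to corner coordinates with a small helper; the CSV row below is illustrative, not taken from the actual file:

```python
import csv
from io import StringIO

# Illustrative row in the BBox_list_2017.csv layout described above;
# the values are made up, not taken from the file.
sample = """Image Index,Finding Label,Bbox x,Bbox y,Bbox w,Bbox h
00000001_000.png,Cardiomegaly,300.0,400.0,250.0,180.0
"""

def to_corners(x, y, w, h):
    """Convert (x, y, w, h) as defined above to (x_min, y_min, x_max, y_max)."""
    return (x, y, x + w, y + h)

rows = list(csv.DictReader(StringIO(sample)))
corners = to_corners(float(rows[0]["Bbox x"]), float(rows[0]["Bbox y"]),
                     float(rows[0]["Bbox w"]), float(rows[0]["Bbox h"]))
# corners holds the box as two opposite corners, ready for cropping or drawing.
```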

    Class descriptions

    There are 8 classes. Images can be classified as one or more disease classes:

    • Infiltrate
    • Atelectasis
    • Pneumonia
    • Cardiomegaly
    • Effusion
    • Pneumothorax
    • Mass
    • Nodule

    Citations

    Acknowledgements

    This work was supported by the Intramural Research Program of the NIH Clinical Center (clinicalcenter.nih.gov) and the National Library of Medicine (www.nlm.nih.gov).

  6. TG-CSR Annotations

    • data.niaid.nih.gov
    Updated Aug 21, 2023
    Cite
    Alice M. Mulvehill (2023). TG-CSR Annotations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7908823
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    Alice M. Mulvehill
    Henrique Santos
    Ke Shen
    Deborah L. McGuinness
    Mayank Kejriwal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Individual raw and normalized label data for the TG-CSR (Theoretically-Grounded Commonsense Reasoning) benchmark.

  7. Labeled Temporal Brain Networks

    • entrepot.recherche.data.gouv.fr
    txt, zip
    Updated Jul 21, 2023
    Cite
    Aurora ROSSI; Aurora ROSSI (2023). Labeled Temporal Brain Networks [Dataset]. http://doi.org/10.57745/HHNT10
    Explore at:
    Available download formats: txt (1498), zip (648811279)
    Dataset updated
    Jul 21, 2023
    Dataset provided by
    Recherche Data Gouv
    Authors
    Aurora ROSSI; Aurora ROSSI
    License

    https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/HHNT10

    Dataset funded by
    French government, National Research Agency (ANR)
    Description

    Labeled Temporal Brain Networks

    This dataset contains a collection of temporal brain networks of 100 subjects. Each subject has a label representing their biological sex ("M" for male and "F" for female) and age range (22-25, 26-30, 31-35, and 36+). The networks are obtained from resting-state fMRI data from the Human Connectome Project (HCP) and are undirected and weighted. The number of nodes is fixed at 202, while the edge weights change over time.

    Dataset structure

    The networks.zip file contains the networks as .txt files in the following format: the first line of each .txt file contains the number of nodes and the number of snapshots of the network, separated by a space. The following lines contain the list of edges of the network in the form i,j,t,w, meaning that the edge between node i and node j at time t has weight w. The labels are contained in the file labels.txt, which has three space-separated columns: the subject identifier, the biological sex, and the age range.

    Acknowledgments

    Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657), funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University. The authors are grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support. This work has been supported by the French government through the UCA DS4H Investments in the Future project managed by the National Research Agency (ANR), reference number ANR-17-EURE-0004.
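    A minimal sketch of parsing one network .txt file in the format described above (the sample input is invented for illustration):

```python
def parse_temporal_network(text):
    """Parse a network .txt: first line 'n_nodes n_snapshots', then one
    'i,j,t,w' edge per line, as described in the dataset structure."""
    lines = text.strip().splitlines()
    n_nodes, n_snapshots = map(int, lines[0].split())
    edges = []
    for line in lines[1:]:
        i, j, t, w = line.split(",")
        edges.append((int(i), int(j), int(t), float(w)))
    return n_nodes, n_snapshots, edges

# Invented sample: 3 nodes, 2 snapshots, two weighted temporal edges.
sample = """3 2
0,1,0,0.8
1,2,1,0.5
"""
n, s, e = parse_temporal_network(sample)
# e[0] is the edge between nodes 0 and 1 at time 0 with weight 0.8.
```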

  8. Journal Article Tag Suite (JATS)

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +2 more
    Updated Jun 19, 2025
    Cite
    National Library of Medicine (2025). Journal Article Tag Suite (JATS) [Dataset]. https://catalog.data.gov/dataset/journal-article-tag-suite-jats
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    Journal Article Tag Suite (JATS) is an application of NISO Z39.96.2019, which defines a set of XML elements and attributes for describing the textual and graphical content of journal articles and describes three article models.

  9. Data from: DailyMed

    • datadiscovery.nlm.nih.gov
    • data.virginia.gov
    • +6 more
    application/rdfxml +5
    Updated Mar 2, 2021
    Cite
    (2021). DailyMed [Dataset]. https://datadiscovery.nlm.nih.gov/d/n7e9-np3x
    Explore at:
    Available download formats: application/rdfxml, xml, application/rssxml, csv, json, tsv
    Dataset updated
    Mar 2, 2021
    Description

    DailyMed provides health information providers and the public with a standard, comprehensive, up-to-date, look-up and download resource of medication content and labeling as found in medication package inserts, also known as Structured Product Labeling (SPL).

  10. Mini NIH XRay Dataset for Binary Classification

    • kaggle.com
    Updated Jan 4, 2023
    Cite
    Abby Morgan (2023). Mini NIH XRay Dataset for Binary Classification [Dataset]. https://www.kaggle.com/datasets/abbymorgan/create-mini-xray-dataset-binary-classification-100
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abby Morgan
    Description

    The original full dataset contained 112,120 X-ray images with disease labels from 30,805 unique patients.

    This notebook is modified from K Scott Mader's notebook here to create a mini chest x-ray dataset that is split 50:50 between normal and diseased images.

    In my notebook I will use this dataset to test a pretrained model on a binary classification task (diseased vs. healthy xray), and then visualize which specific labels the model has the most trouble with.

    Also, because disease classification is such an important task to get right, it's likely that any AI/ML medical classification task will include a human-in-the-loop. In this way, this process more closely resembles how this sort of ML would be used in the real world.

    Note that the original notebook on which this one was based had two versions: Standard and Equalized. In this notebook we will be using the equalized version in order to save ourselves the extra step of performing CLAHE during the tensor transformations.

    The goal of this notebook, as originally stated by Mader, is "to make a much easier to use mini-dataset out of the Chest X-Ray collection. The idea is to have something akin to MNIST or Fashion MNIST for medical images." In order to do this, we will preprocess, normalize, and scale down the images, and then save them into an HDF5 file with the corresponding tabular data.

    Data limitations: The image labels are NLP extracted, so there could be some erroneous labels, but the NLP labeling accuracy is estimated to be >90%. There are very limited numbers of disease region bounding boxes (see BBox_list_2017.csv). Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation.

    File Contents: The file is an HDF5 file of shape 200 x 28. The main file contains a nested HDF5 file of x-ray images under the key 'images'. Main HDF5 file keys are:
    - Image Index
    - Finding Labels: list of disease labels
    - Follow-up #
    - Patient ID
    - Patient Age
    - Patient Gender: 'F'/'M'
    - View Position: 'PA'/'AP'
    - OriginalImageWidth
    - OriginalImageHeight
    - OriginalImagePixelSpacing_x
    - Normal: binary; if the X-ray finding is 'Normal'
    - Atelectasis: binary; if the finding includes 'Atelectasis'
    - Cardiomegaly: binary; if the finding includes 'Cardiomegaly'
    - Consolidation: binary; if the finding includes 'Consolidation'
    - Edema: binary; if the finding includes 'Edema'
    - Effusion: binary; if the finding includes 'Effusion'
    - Emphysema: binary; if the finding includes 'Emphysema'
    - Fibrosis: binary; if the finding includes 'Fibrosis'
    - Hernia: binary; if the finding includes 'Hernia'
    - Infiltration: binary; if the finding includes 'Infiltration'
    - Mass: binary; if the finding includes 'Mass'
    - Nodule: binary; if the finding includes 'Nodule'
    - Pleural_Thickening: binary; if the finding includes 'Pleural_Thickening'
    - Pneumonia: binary; if the finding includes 'Pneumonia'
    - Pneumothorax: binary; if the finding includes 'Pneumothorax'
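    The binary disease columns above can be derived from the pipe-separated Finding Labels field; a minimal sketch, assuming the '|' separator used by the original NIH metadata:

```python
CLASSES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
           "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
           "Pleural_Thickening", "Pneumonia", "Pneumothorax"]

def to_flags(finding_labels):
    """Turn a pipe-separated 'Finding Labels' string into the 0/1 columns
    listed above. 'Normal' is set when no disease class is present (the full
    dataset writes this case as 'No Finding')."""
    found = set(finding_labels.split("|"))
    flags = {c: int(c in found) for c in CLASSES}
    flags["Normal"] = int(not found & set(CLASSES))
    return flags

flags = to_flags("Cardiomegaly|Effusion")
# Both listed diseases flip to 1; everything else, including Normal, stays 0.
```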

  11. NIH Chest X ray 14 (224x224 resized)

    • kaggle.com
    zip
    Updated Jul 8, 2020
    + more versions
    Cite
    Khan Fashee Monowar (Sawrup) (2020). NIH Chest X ray 14 (224x224 resized) [Dataset]. https://www.kaggle.com/khanfashee/nih-chest-x-ray-14-224x224-resized
    Explore at:
    Available download formats: zip (2468882507 bytes)
    Dataset updated
    Jul 8, 2020
    Authors
    Khan Fashee Monowar (Sawrup)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Data limitations:

    The image labels are NLP extracted, so there could be some erroneous labels, but the NLP labeling accuracy is estimated to be >90%.
    Very limited numbers of disease region bounding boxes (see BBox_list_2017.csv)
    Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, maybe through manual annotation
    

    File contents

    Image format: 112,120 total images with size 1024 x 1024
    
    images_001.zip: Contains 4999 images
    
    images_002.zip: Contains 10,000 images
    
    images_003.zip: Contains 10,000 images
    
    images_004.zip: Contains 10,000 images
    
    images_005.zip: Contains 10,000 images
    
    images_006.zip: Contains 10,000 images
    
    images_007.zip: Contains 10,000 images
    
    images_008.zip: Contains 10,000 images
    
    images_009.zip: Contains 10,000 images
    
    images_010.zip: Contains 10,000 images
    
    images_011.zip: Contains 10,000 images
    
    images_012.zip: Contains 7,121 images
    
    README_ChestXray.pdf: Original README file
    
    BBox_list_2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      Image Index: File name
      Finding Label: Disease type (Class label)
      Bbox x
      Bbox y
      Bbox w
      Bbox h
    
    Data_entry_2017.csv: Class labels and patient data for the entire dataset
      Image Index: File name
      Finding Labels: Disease type (Class label)
      Follow-up #
      Patient ID
      Patient Age
      Patient Gender
      View Position: X-ray orientation
      OriginalImageWidth
      OriginalImageHeight
      OriginalImagePixelSpacing_x
      OriginalImagePixelSpacing_y
    

    Class descriptions

    There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:

    Atelectasis
    Consolidation
    Infiltration
    Pneumothorax
    Edema
    Emphysema
    Fibrosis
    Effusion
    Pneumonia
    Pleural_thickening
    Cardiomegaly
    Nodule
    Mass
    Hernia
    

    Full Dataset Content

    There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5,606 X-ray images and class labels.

    Sample: sample.zip
    

    Modifications to original data

    Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform
    
    CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory
    

    Citations

    Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017.
    
    NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
    
    Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
    
  12. Data from: NutriGreen Image Dataset: A Collection of Annotated Nutrition,...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Feb 5, 2024
    Cite
    Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara; Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara (2024). NutriGreen Image Dataset: A Collection of Annotated Nutrition, Organic, and Vegan Food Products [Dataset]. http://doi.org/10.5281/zenodo.10020545
    Explore at:
    Available download formats: bin, csv, zip
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara; Drole Jan; Pravst Igor; Eftimov Tome; Koroušić Seljak Barbara
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The generated dataset is an annotated collection, with each image carrying labels (NutriScore, V-label, and Bio). Annotated data is essential for developing a supervised machine-learning model capable of automatically identifying labels in new images. In our case, we use this data to train a model that can autonomously recognize labels on images not present in the dataset, achieving a model accuracy of 94%. In the future, you can train a new model on the dataset to achieve higher accuracy, or employ the existing model to automatically identify bio and nutri labels in newly collected images, eliminating the need for manual review. We should emphasize that these resources are intended for use by a data science team. The model could also be integrated with a mobile app, but this is a direction for future work noted in the revised version.

    In this research, we introduce the NutriGreen dataset, which is a collection of images representing packaged food products. Each image in the dataset comes with three distinct labels: one indicating its nutritional value using the Nutri-Score, another denoting whether it's vegan or vegetarian with the V-label, and a third displaying the EU organic certification (BIO) logo. The dataset comprises a total of 10,472 images. Among these, the Nutri-Score label is distributed across five sub-labels: A with 1,250 images, B with 1,107 images, C with 867 images, D with 1,001 images, and E with 967 images. Additionally, there are 870 images featuring the V-Label, 2,328 images showcasing the BIO label, and 3,201 images with no labels. Furthermore, we have fine-tuned the YOLOv5 model to demonstrate the practicality of using these annotated datasets, achieving an impressive accuracy of 94.0%. These promising results indicate that this dataset has significant potential for training innovative systems capable of detecting food labels. Moreover, it can serve as a valuable benchmark dataset for emerging computer vision systems.

  13. User Study Data - Obfuscation and Labeling of Search Results to Mitigate...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 19, 2024
    Cite
    Draws, Tim (2024). User Study Data - Obfuscation and Labeling of Search Results to Mitigate Confirmation Bias [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5902728
    Explore at:
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Draws, Tim
    Rieger, Alisa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data was collected to test the effect of obfuscations with warning labels on participants' interaction with search results on debated topics. The data set contains questionnaire responses and interaction data with search results on debated topics from 328 participants. Data excluded from the analysis is not included in this data set; exclusion applied to participants who did not fulfill the requirements of reporting a strong attitude on at least one of the topics, passing all four attention checks, spending more than 60 seconds on the SERP, and clicking on and marking at least one search result.

  14. openFDA Drug Labeling

    • kaggle.com
    Updated Apr 9, 2025
    Cite
    ddrbcn (2025). openFDA Drug Labeling [Dataset]. https://www.kaggle.com/datasets/ddrbcn/openfda-drug-labeling
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ddrbcn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧬 openFDA Drug Labeling – JSON Dataset

    This dataset contains structured drug labeling information (FDA labels) provided by DailyMed and made available through the openFDA Drug Labeling endpoint.

    The dataset includes 13 compressed .zip files with drug label records in JSON format. Each record reflects the full label submitted to the FDA, and the structure matches what you would receive from the /drug/label API.

    📁 Dataset Contents

    • 13 ZIP files
    • Each file contains multiple JSON documents representing FDA-approved drug labels
    • Data fields include (but are not limited to):
      • drug_interactions
      • warnings
      • indications_and_usage
      • contraindications
      • adverse_reactions
      • dosage_and_administration
      • brand_name, generic_name
      • ...and many others

    You will also find the 'Human Drug.xlsx' file included in the dataset, which contains the complete data dictionary for reference.

    🔄 Updates

    This dataset reflects the most recent version available as of April 9, 2025. According to the source, previous records may be modified in future updates. For accuracy and completeness, all files should be downloaded together.

    📚 Sources and More Information

    ⚠️ Disclaimer (Please Read Carefully)

    Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products. We may limit or otherwise restrict your access to the API in line with our Terms of Service.

    Full terms available here: openFDA Terms of Service

    🛠️ Notes for Usage

    This dataset is ideal for applications involving:

    • Drug safety analysis
    • Drug interaction monitoring
    • Medical language modeling
    • Retrieval-augmented generation (RAG) agents
    • Regulatory and pharmacovigilance systems

    You may want to extract and preprocess only relevant fields before vectorizing or feeding them into an AI model for efficiency and performance.
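    The preprocessing step suggested above can be sketched with only the standard library. The field names come from the dataset contents listed earlier; the shape of the toy record (label fields as lists of strings) is an assumption about how the JSON documents are structured:

    ```python
    import json

    # Fields listed in the dataset contents above; adjust to your needs.
    FIELDS = ["drug_interactions", "warnings", "indications_and_usage",
              "contraindications", "adverse_reactions",
              "dosage_and_administration", "brand_name", "generic_name"]

    def extract_fields(record, fields=FIELDS):
        """Keep only the listed label fields, joining list-valued fields
        into a single text block for downstream vectorization."""
        out = {}
        for f in fields:
            value = record.get(f)
            if isinstance(value, list):
                value = "\n".join(value)
            if value:
                out[f] = value
        return out

    # Toy record shaped like a label document (illustrative, not real data):
    record = {"brand_name": ["Aspirin"],
              "warnings": ["Do not exceed the stated dose."],
              "unrelated_field": ["dropped by the filter"]}
    slim = extract_fields(record)
    ```

    Applied over the unzipped JSON files, this yields compact records suitable for embedding or indexing.
    
    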

  15. Code for Predicting MIEs from Gene Expression and Chemical Target Labels...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Code for Predicting MIEs from Gene Expression and Chemical Target Labels with Machine Learning (MIEML) [Dataset]. https://catalog.data.gov/dataset/code-for-predicting-mies-from-gene-expression-and-chemical-target-labels-with-machine-lear
    Explore at:
    Dataset updated
    Apr 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining. BioMed Central Ltd, London, UK, 7 (2022).

  16. A

    ‘Dietary Supplements Label Database (DSLD) - Product Information’ analyzed...

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Dietary Supplements Label Database (DSLD) - Product Information’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-dietary-supplements-label-database-dsld-product-information-3954/b142dd69/?iid=008-044&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Dietary Supplements Label Database (DSLD) - Product Information’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/2a76d253-e2f4-49c5-90e3-d08701608b28 on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    (https://dsld.nlm.nih.gov) The Dietary Supplement Label Database (DSLD) includes full label derived information from dietary supplement products marketed in the U.S. with a Web-based user interface that provides ready access to label information. It was developed to serve the research community and as a resource for health care providers and the public. It can be an educational and research tool for students, academics, and other professionals.

    The Product Information dataset contains the full listing of product labels, LanguaLcodes, and other product information.

    --- Original source retains full ownership of the source dataset ---

  17. Z

    Data from: Multi-label Pathway Prediction based on Active Dataset...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 15, 2020
    Cite
    J. Hallam, Steven (2020). Multi-label Pathway Prediction based on Active Dataset Subsampling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3940705
    Explore at:
    Dataset updated
    Sep 15, 2020
    Dataset provided by
    J. Hallam, Steven
    M. A. Basher, Abdur Rahman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We include samples of the various data types used in the work "Multi-label Pathway Prediction based on Active Dataset Subsampling" (under review).

    More information about the software package, along with usage instructions, is provided in the hallamlab/leADS repository.

  18. Data from: A region-wide, multi-year set of crop field boundary labels for...

    • zenodo.org
    • registry.opendata.aws
    application/gzip, bin
    Updated May 15, 2024
    Cite
    Amos Wussah; Mary Asipunu; Michelle Gathigi; Primož Kovačič; Justus Muhando; Victor Yeboah; Foster Addai; Edward Setor Akakpo; Michael Allotey; Phillip Amkoya; Eunice Amponsem; Kofi Danquah Dadon; Xefilde Godknows Harrison; Emily Heltzel; Charles Juma; Ronald Mdawida; Adelide Miroyo; Julius Mucha; Judith Mugami; Fredrick Mwawaza; Delaiah Nyarko; Purent Oduor; Kofi Ohemeng; Sladen Isaac Dela Segbefia; Trevor Tumbula; Francis Wambua; Felicia Yeboah; Lyndon Estes (2024). A region-wide, multi-year set of crop field boundary labels for Africa [Dataset]. http://doi.org/10.5281/zenodo.11060871
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    May 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amos Wussah; Mary Asipunu; Michelle Gathigi; Primož Kovačič; Justus Muhando; Victor Yeboah; Foster Addai; Edward Setor Akakpo; Michael Allotey; Phillip Amkoya; Eunice Amponsem; Kofi Danquah Dadon; Xefilde Godknows Harrison; Emily Heltzel; Charles Juma; Ronald Mdawida; Adelide Miroyo; Julius Mucha; Judith Mugami; Fredrick Mwawaza; Delaiah Nyarko; Purent Oduor; Kofi Ohemeng; Sladen Isaac Dela Segbefia; Trevor Tumbula; Francis Wambua; Felicia Yeboah; Lyndon Estes
    License

    https://assets.planet.com/docs/Planet_ParticipantLicenseAgreement_NICFI.pdf

    Description

    Data resulting from a project undertaken to generate a comprehensive set of crop field boundary labels throughout the continent of Africa, representing the years 2017-2023. The project was funded by the Lacuna Fund (https://lacunafund.org/) and led by Farmerline (https://farmerline.co/), in collaboration with Spatial Collective (https://spatialcollective.com/) and the Agricultural Impacts Research Group at Clark University (https://www.clarku.edu/departments/geography/).

    Please refer to the technical report in the accompanying repository for more details on the methods used to develop the dataset, an analysis of label quality, and usage guidelines.

  19. h

    Label-free LC-MS/MS of Kidney (Right) from Female, 58 years old

    • portal.hubmapconsortium.org
    Updated Aug 22, 2020
    + more versions
    Cite
    Jeff Spraggins (2020). Label-free LC-MS/MS of Kidney (Right) from Female, 58 years old [Dataset]. https://portal.hubmapconsortium.org/browse/dataset/af8e5c3a7f66a105e8e19aba8a6fc6e3
    Explore at:
    Dataset updated
    Aug 22, 2020
    Dataset provided by
    Vanderbilt TMC
    Authors
    Jeff Spraggins
    Description

    LC-MS/MS proteomics data collected from the right kidney of a 58-year-old White female donor by the Biomolecular Multimodal Imaging Center (BIOMIC) at Vanderbilt University. BIOMIC is a Tissue Mapping Center that is part of the NIH-funded Human Biomolecular Atlas Program (HuBMAP). Label-free data were collected with a Thermo Scientific Orbitrap Fusion Tribrid using DIA methods. Support was provided by the NIH Common Fund and National Institute of Diabetes and Digestive and Kidney Diseases (U54 DK120058). Tissue was collected through the Cooperative Human Tissue Network with support provided by the NIH National Cancer Institute (5 UM1 CA183727-08).

  20. Sentiment Analysis Test Dataset Created from Two COVID-19 Surveys: National...

    • figshare.com
    xlsx
    Updated Jan 9, 2024
    Cite
    Juan Antonio Lossio-Ventura; Rachel Weger; Angela Lee; Emily Guinee; Joyce Chung; Lauren Atlas; Eleni Linos; Francisco Pereira (2024). Sentiment Analysis Test Dataset Created from Two COVID-19 Surveys: National Institutes of Health (NIH) and Stanford University [Dataset]. http://doi.org/10.6084/m9.figshare.24560584.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Juan Antonio Lossio-Ventura; Rachel Weger; Angela Lee; Emily Guinee; Joyce Chung; Lauren Atlas; Eleni Linos; Francisco Pereira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two COVID-19 surveys were used to create the test dataset, both collected by teams from the National Institutes of Health (NIH) and Stanford University. The collected data were intended to assess the general topics experienced by participants during the pandemic lockdown. The test dataset comprises a total of 1,000 randomly chosen sentences, with 500 sentences selected from each survey. Each set was annotated by three separate and independent annotators. The annotators were instructed to assess the polarity of each sentence on a scale of -1 (negative), 0 (neutral), or 1 (positive). We then followed a three-step procedure to determine the final labels. First, if all three annotators agreed on a label (full agreement), that label was accepted. Second, if two out of the three agreed on a label (partial agreement), that label was also accepted. Third, if there was no agreement, the label was set as neutral (no agreement).
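    The three-step agreement procedure described above reduces to a majority vote with a neutral fallback. A minimal sketch:

    ```python
    from collections import Counter

    def final_label(annotations):
        """Resolve three polarity annotations (-1, 0, or 1) into a final label:
        full or partial agreement (>= 2 matching votes) wins; with no
        agreement, fall back to neutral (0)."""
        label, count = Counter(annotations).most_common(1)[0]
        return label if count >= 2 else 0

    final_label([1, 1, 1])    # full agreement -> 1
    final_label([1, 0, 1])    # partial agreement -> 1
    final_label([-1, 0, 1])   # no agreement -> neutral
    ```
    
    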
