CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Google Play Store recently introduced a data safety section to give users accessible insights into apps’ data collection practices. We analyzed the labels of 43,927 of the most popular apps. Almost one third of the apps with a label claim not to collect any data. But we also saw popular apps, including apps meant for children, admitting to collecting and sharing highly sensitive data, such as the user’s sexual orientation or health information, for tracking and advertising purposes. To verify the declarations, we recorded the network traffic of 500 apps and found more than one quarter of them transmitting tracking data not declared in their data safety label.
This data set contains a dump of our database, including the top chart data and data safety labels from September 07, 2022, and the recorded network traffic.
The analysis is available at our blog: https://www.datarequests.org/blog/android-data-safety-labels-analysis/ The source code for the analysis is available on GitHub: https://github.com/datenanfragen/android-data-safety-label-analysis
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale," published in Scientometrics (in press).

1. AUT_ORC.zip: a list of 3M author name instances in MEDLINE linked to Author-ity2009.
2. AUT_NIH.zip: a list of 313K author name instances in MEDLINE linked to NIH PI IDs.
3. AUT_SCT_pairs.zip: a list of 6.2M paper pairs and author byline positions in self-citation relation.
4. AUT_SCT_info.zip: a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs. Information about an author name instance in AUT_SCT_pairs can be connected to AUT_SCT_info using the combination of PMID and byline position as a key.

Please see the paper for details on how the datasets were created:

Kim, J., & Owen-Smith, J. (in press). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

The uploaded datasets were created by combining the data sources below.

1. ORCID data (2018 version) were downloaded from the link below. Please refer to the policies on the use of ORCID data. https://info.orcid.org/public-data-file-use-policy/
2. MEDLINE baseline data (2016 version) were downloaded from the link below. Please refer to the policies on the use of MEDLINE data. https://www.nlm.nih.gov/databases/download/pubmed_medline.html
3. The Author-ity2009, Ethnea, and Genni datasets were downloaded from the link below. Please refer to the policies on the use of those datasets. https://databank.illinois.edu/datasets/IDB-9087546

Please cite the three papers below to properly credit the creators of the original datasets.

Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

Torvik, V. I., & Agarwal, S. (2016). Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2013), 199-208. doi:10.1145/2467696.2467720

4. The dataset of NIH IDs linked to Author-ity2009 was downloaded from the link below. https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

Please cite the paper below to properly credit the creators of the original dataset.

Lerchenmueller, M. J., & Sorenson, O. (2016). Author disambiguation in PubMed: Evidence on the precision and recall of Author-ity among NIH-funded scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a single JSON file containing label-value form fields generated using GPT-2. This data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's supplementary materials, and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt
The data has groups of label-value pairs each with a "title" or topic (or null). Each label-value pair group was generated in a single GPT-2 generation and thus the pairs "belong to the same form." The json structure is a list of tuples, where each tuple has the title or null as the first element and the list of label-value pairs of the group as the second element. Each label-value pair is another tuple with the first element being the label and the second being the value or a list of values.
For example:
[ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
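The nested structure above can be flattened into (title, label, value) triples with a few lines of Python. A minimal sketch, using the example above as input (variable names are illustrative, not part of the dataset):

```python
import json

# Flatten the form structure described above into (title, label, value)
# triples. The sample string mirrors the example in the description.
sample = '''
[ ["title", [ ["first label", "first value"],
              ["second label", ["a label", "another label"]] ] ],
  [null, [ ["again label", "again value"] ] ] ]
'''

groups = json.loads(sample)  # list of [title-or-null, pair-list] pairs

pairs = []
for title, group in groups:
    for label, value in group:
        # a value may be a single string or a list of values
        values = value if isinstance(value, list) else [value]
        for v in values:
            pairs.append((title, label, v))
```

Note that a label paired with a list of values expands into one triple per value, so the example yields four triples.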
https://cubig.ai/store/terms-of-service
1) Data Introduction • The Random Sample of NIH Chest X-ray Dataset is a sample of a large public medical-imaging dataset containing 112,120 chest X-ray images and 15 labels (14 diseases or normal) collected from 30,805 patients.
2) Data Utilization (1) The Random Sample of NIH Chest X-ray Dataset has the following characteristics: • Each sample comes with detailed metadata such as image file name, disease label, patient ID, age, gender, view position, and image size; the labels were extracted from the radiology reports with NLP, with an estimated accuracy of more than 90%. • It contains 5,606 images of size 1024 x 1024, covering 14 diseases and a 'No Finding' class, but due to the nature of the sample, some disease classes are very scarce. (2) The Random Sample of NIH Chest X-ray Dataset can be used for: • Developing chest-disease image-reading AI: deep-learning-based automatic diagnosis and classification models can be trained and evaluated on X-ray images with various chest-disease labels. • Medical-image preprocessing and labeling research: it can support medical AI research and algorithm development such as automatic labeling of large medical-image datasets, data-quality evaluation, and weakly supervised learning.
https://creativecommons.org/publicdomain/zero/1.0/
Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real-world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images, with 4,143 images available.
This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)
There are 8 classes. Images can be classified as one or more disease classes: - Infiltrate - Atelectasis - Pneumonia - Cardiomegaly - Effusion - Pneumothorax - Mass - Nodule
This work was supported by the Intramural Research Program of the NIH Clinical Center (clinicalcenter.nih.gov) and the National Library of Medicine (www.nlm.nih.gov).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Individual raw and normalized label data for the TG-CSR (Theoretically-Grounded Commonsense Reasoning) benchmark.
https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.57745/HHNT10
Labeled Temporal Brain Networks

This dataset contains a collection of temporal brain networks of 100 subjects. Each subject has a label representing their biological sex ("M" for male, "F" for female) and age range (22-25, 26-30, 31-35, and 36+). The networks are obtained from resting-state fMRI data from the Human Connectome Project (HCP) and are undirected and weighted. The number of nodes is fixed at 202, while the edge weights change their values over time.

Dataset structure

The networks.zip file contains the networks as .txt files in the following format: the first line of each .txt file contains the number of nodes and the number of snapshots of the network, separated by a space. The following lines contain the list of edges of the network in the form i,j,t,w, meaning that the edge between node i and node j at time t has weight w. The labels are contained in the file labels.txt, which has three space-separated columns: the identifier of a subject, the biological sex, and the age range.

Acknowledgments

Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University. The authors are grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support. This work has been supported by the French government through the UCA DS4H Investments in the Future project managed by the National Research Agency (ANR), reference number ANR-17-EURE-0004.
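The .txt format described above can be parsed with a short helper. A minimal sketch; the function name is illustrative, and the toy input below mimics the documented format rather than real HCP data:

```python
# Parse one network in the documented format: the first line holds
# "<num_nodes> <num_snapshots>", each following line an edge "i,j,t,w".
def read_temporal_network(lines):
    header, *edge_lines = lines
    n_nodes, n_snapshots = map(int, header.split())
    edges = []
    for line in edge_lines:
        i, j, t, w = line.strip().split(",")
        edges.append((int(i), int(j), int(t), float(w)))
    return n_nodes, n_snapshots, edges

# Toy input in the same format (not real data):
sample = ["202 3", "0,1,0,0.8", "0,1,1,0.5", "1,2,2,0.1"]
n_nodes, n_snapshots, edges = read_temporal_network(sample)
```

In practice the lines would come from one of the .txt files inside networks.zip, e.g. via `open(path).read().splitlines()`.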
Journal Article Tag Suite (JATS) is an application of NISO Z39.96-2019, which defines a set of XML elements and attributes for describing the textual and graphical content of journal articles and describes three article models.
DailyMed provides health information providers and the public with a standard, comprehensive, up-to-date, look-up and download resource of medication content and labeling as found in medication package inserts, also known as Structured Product Labeling (SPL).
The original full dataset contained 112,120 X-ray images with disease labels from 30,805 unique patients.
This notebook is modified from K. Scott Mader's notebook to create a mini chest X-ray dataset that is split 50:50 between normal and diseased images.
In my notebook I will use this dataset to test a pretrained model on a binary classification task (diseased vs. healthy X-ray), and then visualize which specific labels the model has the most trouble with.
Also, because disease classification is such an important task to get right, it's likely that any AI/ML medical classification task will include a human-in-the-loop. In this way, this process more closely resembles how this sort of ML would be used in the real world.
Note that the original notebook on which this one was based had two versions: Standard and Equalized. In this notebook we will be using the equalized version in order to save ourselves the extra step of performing CLAHE during the tensor transformations.
The goal of this notebook, as originally stated by Mader, is "to make a much easier to use mini-dataset out of the Chest X-Ray collection. The idea is to have something akin to MNIST or Fashion MNIST for medical images." In order to do this, we will preprocess, normalize, and scale down the images, and then save them into an HDF5 file with the corresponding tabular data.
Data limitations:
- The image labels are NLP-extracted, so there could be some erroneous labels; the NLP labeling accuracy is estimated to be >90%.
- There are very limited numbers of disease region bounding boxes (see BBoxlist2017.csv).
- Chest X-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their "updated" image labels and/or new bounding boxes in their own studies later, perhaps through manual annotation.
File Contents
The file is an HDF5 file of shape (200, 28). The main file contains a nested HDF5 file of X-ray images under the key 'images'.
The main HDF5 file keys are:
- Image Index
- Finding Labels: list of disease labels
- Follow-up #
- Patient ID
- Patient Age
- Patient Gender: 'F'/'M'
- View Position: 'PA', 'AP'
- OriginalImageWidth
- OriginalImageHeight
- OriginalImagePixelSpacing_x
- Normal: Binary; if Xray finding is 'Normal'
- Atelectasis: Binary; if Xray finding includes 'Atelectasis'
- Cardiomegaly: Binary; if Xray finding includes 'Cardiomegaly'
- Consolidation: Binary; if Xray finding includes 'Consolidation'
- Edema: Binary; if Xray finding includes 'Edema'
- Effusion: Binary; if Xray finding includes 'Effusion'
- Emphysema: Binary; if Xray finding includes 'Emphysema'
- Fibrosis: Binary; if Xray finding includes 'Fibrosis'
- Hernia: Binary; if Xray finding includes 'Hernia'
- Infiltration: Binary; if Xray finding includes 'Infiltration'
- Mass: Binary; if Xray finding includes 'Mass'
- Nodule: Binary; if Xray finding includes 'Nodule'
- Pleural_Thickening: Binary; if Xray finding includes 'Pleural_Thickening'
- Pneumonia: Binary; if Xray finding includes 'Pneumonia'
- Pneumothorax: Binary; if Xray finding includes 'Pneumothorax'
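The binary disease columns above can be derived from the 'Finding Labels' field. A minimal sketch, assuming (as in the source CSV) that multiple findings are '|'-separated and that 'No Finding' marks a normal study:

```python
# Derive the binary columns listed above from a 'Finding Labels' string.
# Assumptions: findings are '|'-separated; 'No Finding' marks normal.
DISEASES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax",
]

def binarize(finding_labels: str) -> dict:
    findings = set(finding_labels.split("|"))
    row = {d: int(d in findings) for d in DISEASES}
    row["Normal"] = int(finding_labels == "No Finding")
    return row

row = binarize("Effusion|Mass")  # sets Effusion and Mass to 1, Normal to 0
```

This mirrors how the one-hot columns in the HDF5 file relate to the multi-label 'Finding Labels' string.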
https://creativecommons.org/publicdomain/zero/1.0/
Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real-world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images, with 4,143 images available.
This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)
The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
Very limited numbers of disease region bounding boxes (See BBoxlist2017.csv)
Chest x-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their “updated” image labels and/or new bounding boxes in their own studies later, perhaps through manual annotation
Image format: 112,120 total images with size 1024 x 1024
images_001.zip: Contains 4,999 images
images_002.zip: Contains 10,000 images
images_003.zip: Contains 10,000 images
images_004.zip: Contains 10,000 images
images_005.zip: Contains 10,000 images
images_006.zip: Contains 10,000 images
images_007.zip: Contains 10,000 images
images_008.zip: Contains 10,000 images
images_009.zip: Contains 10,000 images
images_010.zip: Contains 10,000 images
images_011.zip: Contains 10,000 images
images_012.zip: Contains 7,121 images
README_ChestXray.pdf: Original README file
BBoxlist2017.csv: Bounding box coordinates. Note: boxes start at (x, y) and extend w pixels horizontally and h pixels vertically
Image Index: File name
Finding Label: Disease type (Class label)
Bbox x
Bbox y
Bbox w
Bbox h
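Given that convention (top-left corner plus horizontal and vertical extents), converting a box to corner coordinates is a one-liner. A minimal sketch; the function name is illustrative:

```python
# Convert the documented (x, y, w, h) convention into corner form
# (x_min, y_min, x_max, y_max), e.g. for plotting or IoU computation.
def xywh_to_corners(x, y, w, h):
    return (x, y, x + w, y + h)

box = xywh_to_corners(10.0, 20.0, 30.0, 40.0)
```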
Dataentry2017.csv: Class labels and patient data for the entire dataset
Image Index: File name
Finding Labels: Disease type (Class label)
Follow-up #
Patient ID
Patient Age
Patient Gender
View Position: X-ray orientation
OriginalImageWidth
OriginalImageHeight
OriginalImagePixelSpacing_x
OriginalImagePixelSpacing_y
There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:
Atelectasis
Consolidation
Infiltration
Pneumothorax
Edema
Emphysema
Fibrosis
Effusion
Pneumonia
Pleural_thickening
Cardiomegaly
Nodule
Mass
Hernia
There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5,606 X-ray images and class labels.
Sample: sample.zip
Original TAR archives were converted to ZIP archives for compatibility with the Kaggle platform
CSV headers were slightly modified to make the comma separation explicit and the fields self-explanatory
Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017.
NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The generated dataset is an annotated collection, with each image carrying labels (Nutri-Score, V-Label, and BIO). Annotated data are essential for developing a supervised machine-learning model capable of automatically identifying labels in new images. In our case, we used this data to train a model that can autonomously recognize labels on new images not present in the dataset, achieving a model accuracy of 94%. In the future, you have the option to train a new model using the dataset to achieve higher accuracy, or to employ the existing model to automatically identify BIO and Nutri-Score labels in newly collected images, eliminating the need for manual review. We should emphasize that these resources are intended for use by a data science team. There is an opportunity for this model to be integrated with a mobile app, but this is a direction for future work.
In this research, we introduce the NutriGreen dataset, which is a collection of images representing packaged food products. Each image in the dataset comes with three distinct labels: one indicating its nutritional value using the Nutri-Score, another denoting whether it's vegan or vegetarian with the V-Label, and a third displaying the EU organic certification (BIO) logo. The dataset comprises a total of 10,472 images. Among these, the Nutri-Score label is distributed across five sub-labels: A with 1,250 images, B with 1,107 images, C with 867 images, D with 1,001 images, and E with 967 images. Additionally, there are 870 images featuring the V-Label, 2,328 images showcasing the BIO label, and 3,201 images with no labels. Furthermore, we have fine-tuned the YOLOv5 model to demonstrate the practicality of using these annotated datasets, achieving an impressive accuracy of 94.0%. These promising results indicate that this dataset has significant potential for training innovative systems capable of detecting food labels. Moreover, it can serve as a valuable benchmark dataset for emerging computer vision systems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data was collected to test the effect of obfuscations with warning labels on participants' interaction with search results on debated topics. The data set contains questionnaire responses and interaction data with search results on debated topics of 328 participants. Data excluded from data analysis is not included in this data set (due to not fulfilling the requirements: reporting to have a strong attitude on at least one of the topics, passing all four attention checks, spending more than 60 seconds on the SERP, clicking on and marking at least one search result).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains structured drug labeling information (FDA labels) provided by DailyMed and made available through the openFDA Drug Labeling endpoint.
The dataset includes 13 compressed .zip files with drug label records in JSON format. Each record reflects the full label submitted to the FDA, and the structure matches what you would receive from the /drug/label API.

Fields include:
- drug_interactions
- warnings
- indications_and_usage
- contraindications
- adverse_reactions
- dosage_and_administration
- brand_name
- generic_name
You will also find the 'Human Drug.xlsx' file included in the dataset, which contains the complete data dictionary for reference.
This dataset reflects the most recent version available as of April 9, 2025. According to the source, previous records may be modified in future updates. For accuracy and completeness, all files should be downloaded together.
Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products. We may limit or otherwise restrict your access to the API in line with our Terms of Service.
Full terms available here: openFDA Terms of Service
This dataset is ideal for applications involving: - Drug safety analysis - Drug interaction monitoring - Medical language modeling - Retrieval-augmented generation (RAG) agents - Regulatory and pharmacovigilance systems
You may want to extract and preprocess only relevant fields before vectorizing or feeding them into an AI model for efficiency and performance.
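As a starting point for such preprocessing, here is a minimal sketch that pulls a few of the listed fields out of one label record. The sample record and helper name are illustrative; it assumes (verify against the 'Human Drug.xlsx' data dictionary) that text sections are arrays of strings and that brand/generic names sit under an "openfda" sub-object:

```python
# Extract a few of the listed fields from one drug-label record.
# Assumptions (check against the data dictionary): text sections are
# arrays of strings; brand_name/generic_name live under "openfda".
SECTIONS = [
    "indications_and_usage", "contraindications", "warnings",
    "adverse_reactions", "dosage_and_administration", "drug_interactions",
]

def extract_sections(record: dict) -> dict:
    openfda = record.get("openfda", {})
    out = {
        "brand_name": openfda.get("brand_name", []),
        "generic_name": openfda.get("generic_name", []),
    }
    for field in SECTIONS:
        out[field] = " ".join(record.get(field, []))  # join into one text blob
    return out

# Made-up record for illustration (not real FDA data):
sample = {
    "openfda": {"brand_name": ["ExampleDrug"], "generic_name": ["examplamine"]},
    "warnings": ["Do not exceed the stated dose."],
}
doc = extract_sections(sample)
```

The flattened dictionary can then be vectorized or chunked for a RAG pipeline without carrying the full record.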
Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining, 7 (2022). BioMed Central Ltd, London, UK.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Dietary Supplements Label Database (DSLD) - Product Information’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/2a76d253-e2f4-49c5-90e3-d08701608b28 on 28 January 2022.
--- Dataset description provided by original source is as follows ---
(https://dsld.nlm.nih.gov) The Dietary Supplement Label Database (DSLD) includes full label derived information from dietary supplement products marketed in the U.S. with a Web-based user interface that provides ready access to label information. It was developed to serve the research community and as a resource for health care providers and the public. It can be an educational and research tool for students, academics, and other professionals.
The Product Information dataset contains the full listing of product labels, LanguaL codes, and other product information.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include samples of various data types used in the work "Multi-label Pathway Prediction based on Active Dataset Subsampling" (under review).
More information about the software package and instructions are provided at hallamlab/leADS.
https://assets.planet.com/docs/Planet_ParticipantLicenseAgreement_NICFI.pdf
Data resulting from a project undertaken to generate a comprehensive set of crop field boundary labels throughout the continent of Africa, representing the years 2017-2023. The project was funded by the Lacuna Fund (https://lacunafund.org/) and led by Farmerline (https://farmerline.co/), in collaboration with Spatial Collective (https://spatialcollective.com/) and the Agricultural Impacts Research Group at Clark University (https://www.clarku.edu/departments/geography/).
Please refer to the technical report in the accompanying repository for more details on the methods used to develop the dataset, an analysis of label quality, and usage guidelines.
LC-MS/MS proteomics data collected from the right kidney of a 58-year-old White female donor by the Biomolecular Multimodal Imaging Center (BIOMIC) at Vanderbilt University. BIOMIC is a Tissue Mapping Center that is part of the NIH-funded Human Biomolecular Atlas Program (HuBMAP). Label-free data were collected with a Thermo Scientific Orbitrap Fusion Tribrid using DIA methods. Support was provided by the NIH Common Fund and National Institute of Diabetes and Digestive and Kidney Diseases (U54 DK120058). Tissue was collected through the Cooperative Human Tissue Network with support provided by the NIH National Cancer Institute (5 UM1 CA183727-08).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two COVID-19 surveys were used to create the test dataset, both collected by teams from the National Institutes of Health (NIH) and Stanford University. The collected data were intended to assess the general experiences of participants during the pandemic lockdown. The test dataset comprises a total of 1,000 randomly chosen sentences, with 500 sentences selected from each survey. Each set was annotated by three separate and independent annotators. The annotators were instructed to assess the polarity of each sentence on a scale of -1 (negative), 0 (neutral), or 1 (positive). We then followed a three-step procedure to determine the final labels. First, if all three annotators agreed on a label (full agreement), that label was accepted. Second, if two out of the three agreed on a label (partial agreement), that label was also accepted. Third, if there was no agreement, the label was set as neutral (no agreement).
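The three-step procedure above reduces to a majority vote with a neutral fallback. A minimal sketch; the function name is illustrative:

```python
from collections import Counter

# Aggregate three annotator polarities from {-1, 0, 1} following the
# procedure above: full or 2-of-3 agreement wins; otherwise neutral (0).
def aggregate(labels):
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else 0

final = aggregate([-1, -1, 1])  # partial agreement -> -1
```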