19 datasets found
  1. Industry Guide for the labelling of cosmetics - Catalogue - Canadian Urban...

    • data.urbandatacentre.ca
    Updated Oct 19, 2025
    Cite
    (2025). Industry Guide for the labelling of cosmetics - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-f78ec8b0-fa1b-476b-abde-b47b6669163b
    Explore at:
    Dataset updated
    Oct 19, 2025
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    This guide is designed to assist in the preparation of labels that comply with Canadian regulatory requirements for cosmetics.

  2. Fuel Economy Label and CAFE Data Inventory

    • catalog.data.gov
    • data.amerigeoss.org
    • +1more
    Updated Jul 12, 2021
    Cite
    U.S. EPA Office of Air and Radiation (OAR) - Office of Transportation and Air Quality (OTAQ) (2021). Fuel Economy Label and CAFE Data Inventory [Dataset]. https://catalog.data.gov/dataset/fuel-economy-label-and-cafe-data-inventory
    Explore at:
    Dataset updated
    Jul 12, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The Fuel Economy Label and CAFE Data asset contains measured summary fuel economy estimates and test data, by model, for light-duty vehicle manufacturers seeking certification, as required under the Energy Policy and Conservation Act of 1975 (EPCA) and the Energy Independence and Security Act of 2007 (EISA), which direct the EPA to collect vehicle fuel economy estimates for the creation of Fuel Economy Labels and for the calculation of Corporate Average Fuel Economy (CAFE). Manufacturers submit data on an annual basis, or as needed to document vehicle model changes.

    The EPA performs targeted fuel economy confirmatory tests on approximately 15% of vehicles submitted for validation. Confirmatory data on a vehicle is associated with its corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Submitted data arrives in XML format or as documents, with the majority of submissions sent in XML, and includes descriptive information on the vehicle itself, fuel economy information, and the manufacturer's testing approach. This data may contain confidential business information (CBI), such as estimated sales or other data elements indicated by the submitter as confidential. CBI data is not publicly available; within the EPA, however, the data can be accessed under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Datasets are segmented by vehicle model/manufacturer and/or year with corresponding fuel economy, test, and certification data. Data assets are stored in EPA's Verify system.

    Coverage began in 1974; early records were primarily paper documents that did not go through the same level of validation as the primarily digital submissions that started in 2008. Early data is available to the public digitally starting from 1978, but more complete digital certification data is available starting in 2008. Fuel economy submission data prior to 2006 was calculated using an older formula; however, mechanisms exist to make this data comparable to current results.

    Fuel Economy Label and CAFE Data submission documents with metadata, certificate, and summary decision information are used and made publicly available through the EPA/DOE's Fuel Economy Guide website (https://www.fueleconomy.gov/) as well as EPA's SmartWay program website (https://www.epa.gov/smartway/) and Green Vehicle Guide website (http://ofmpub.epa.gov/greenvehicles/Index.do;jsessionid=3F4QPhhYDYJxv1L3YLYxqh6J2CwL0GkxSSJTl2xgMTYPBKYS00vw!788633877) after it has been quality assured. Where summary data appears inaccurate, OTAQ returns the entries for review to their originator.

  3. Lunar Reconnaissance Orbiter Imagery for LROCNet Moon Classifier

    • zenodo.org
    bin, zip
    Updated Nov 1, 2022
    Cite
    Emily Dunkel; Emily Dunkel (2022). Lunar Reconnaissance Orbiter Imagery for LROCNet Moon Classifier [Dataset]. http://doi.org/10.5281/zenodo.7041842
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Emily Dunkel; Emily Dunkel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    We provide imagery used to train LROCNet -- our Convolutional Neural Network classifier of orbital imagery of the moon. Images are divided into train, validation, and test zip files, which contain class specific sub-folders. We have three classes: "fresh crater", "old crater", and "none". Classes are described in detail in the attached labeling guide.

    Directory Contents

    We include the labeling guide and training, testing, and validation data. Training data was split to avoid upload timeouts.

    • LROC_Labeling_Intro_for_release.ppt: Labeling guide
    • val: Validation images divided into class sub-folders
      • ejecta: "fresh crater" class
      • oldcrater: "old crater" class
      • none: "none" class
    • test: Testing images divided into class sub-folders
      • ejecta: "fresh crater" class
      • oldcrater: "old crater" class
      • none: "none" class
    • ejecta_train: Training images of "fresh crater" class
    • oldcrater_train: Training images of "old crater" class
    • none_train1-4: Training images of "none" class (divided into 4 just for uploading)

    Data Description

    We use CDR (Calibrated Data Record) browse imagery (50% resolution) from the Lunar Reconnaissance Orbiter's Narrow Angle Cameras (NACs). The data we get from the NACs are 5-km swaths at nominal orbit, so we perform a saliency detection step to find surface features of interest. A detector developed for Mars HiRISE (Wagstaff et al.) worked well for our purposes after updating it for the LROC NAC image resolution. We use this detector to create a set of image chipouts (small 227x227 cutouts) from the larger image, sampling the lunar globe.
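    The chipout extraction can be illustrated with a toy sketch. The saliency detector itself (Wagstaff et al.) is not reproduced here; the `chipout` helper and array sizes below are our own illustration, using the 227-pixel chip size from the description.

```python
import numpy as np

CHIP = 227  # chip side length, per the description

def chipout(image: np.ndarray, row: int, col: int) -> np.ndarray:
    """Extract a CHIP x CHIP window centered near (row, col), clipped to bounds."""
    half = CHIP // 2
    r0 = min(max(row - half, 0), image.shape[0] - CHIP)
    c0 = min(max(col - half, 0), image.shape[1] - CHIP)
    return image[r0:r0 + CHIP, c0:c0 + CHIP]

# Stand-in for a NAC browse swath; real swaths are much larger images.
swath = np.zeros((5000, 1000))
chip = chipout(swath, 120, 40)   # a detection near the swath edge
```

    Clipping at the borders keeps every chip the same size, so detections near a swath edge still yield full-size training examples.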

    Class Labeling

    We select classes of interest based on what is visible at the NAC resolution, consulting with scientists and performing a literature review. Initially, we have 7 classes: "fresh crater", "old crater", "overlapping craters", "irregular mare patches", "rockfalls and landfalls", "of scientific interest", and "none".

    Using the Zooniverse platform, we set up a labeling tool and labeled 5,000 images. We found that "fresh crater" makes up 11% of the data and "old crater" 18%, with the vast majority "none". Due to limited examples of the other classes, we reduce our initial class set to: "fresh crater" (with impact ejecta), "old crater", and "none".

    We divide the images into train/validation/test sets making sure no image swaths span multiple sets.

    Data Augmentation

    Using PyTorch, we apply the following augmentation on the training set only: horizontal flip, vertical flip, rotation by 90/180/270 degrees, and brightness adjustment (0.5, 2). In addition, we use weighted sampling so that each class is weighted equally. The training set included here does not include augmentation since that was performed within PyTorch.
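    The augmentation scheme above can be sketched as follows. The released pipeline used PyTorch transforms, so the NumPy version below is only an illustrative stand-in; the function name, toy labels, and flip probabilities are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Randomly flip, rotate by a multiple of 90 degrees, and scale brightness."""
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)             # horizontal flip
    if rng.random() < 0.5:
        img = np.flip(img, axis=0)             # vertical flip
    img = np.rot90(img, k=rng.integers(0, 4))  # 0/90/180/270 degrees
    factor = rng.uniform(0.5, 2.0)             # brightness adjustment (0.5, 2)
    return np.clip(img * factor, 0, 255)

# Weighted sampling so each class is drawn equally often on average.
labels = np.array([0, 0, 0, 1, 2, 2])          # toy labels for 3 classes
counts = np.bincount(labels)
weights = (1.0 / counts)[labels]               # per-sample weight
probs = weights / weights.sum()
batch_idx = rng.choice(len(labels), size=8, p=probs)
```

    Because augmentation is applied on the fly at training time, the released images remain unaugmented, as noted above.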

    Acknowledgements

    The author would like to thank the volunteers who provided annotations for this data set, as well as others who contributed to this work (as in the Contributor list). We would also like to thank the PDS Imaging Node for support of this work.

    The research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

    CL#22-4763

    © 2022 California Institute of Technology. Government sponsorship acknowledged.

  4. A Guide to The Thai Green Label Scheme

    • data.opendevelopmentmekong.net
    Updated Jun 20, 2018
    Cite
    (2018). A Guide to The Thai Green Label Scheme [Dataset]. https://data.opendevelopmentmekong.net/dataset/a-guide-to-the-thai-green-label-scheme
    Explore at:
    Dataset updated
    Jun 20, 2018
    Description

    The scheme is developed to promote the concept of resource conservation, pollution reduction, and waste management. The purposes of awarding the green label are:

    • To provide reliable information and guide customers in their choice of products.
    • To create an opportunity for consumers to make an environmentally conscious decision, thus creating market incentives for manufacturers to develop and supply more environmentally sound products.
    • To reduce environmental impact which may occur during the manufacturing, utilization, consumption and disposal phases of a product.

  5. Dataset of Video Comments of a Vision Video Classified by Their Relevance,...

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Karras, Oliver; Kristo, Eklekta (2024). Dataset of Video Comments of a Vision Video Classified by Their Relevance, Polarity, Intention, and Topic [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4533301
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    TIB - Leibniz Information Centre for Science and Technology
    Leibniz University Hannover
    Authors
    Karras, Oliver; Kristo, Eklekta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all comments (comments and replies) on the YouTube vision video "Tunnels" by "The Boring Company", fetched on 2020-10-13 using the YouTube API. The comments were classified manually by three persons. We performed a single-class labeling of the video comments regarding their relevance for requirements engineering (RE) (ham/spam) and their polarity (positive/neutral/negative). Furthermore, we performed a multi-class labeling of the comments regarding their intention (feature request and problem report) and their topic (efficiency and safety). While a comment can only be relevant or not relevant and have only one polarity, a comment can have one or more intentions and also one or more topics.

    For the replies, one person also classified them regarding their relevance for RE. However, the investigation of the replies is ongoing and future work.

    Remark: For 126 comments and 26 replies, we could not determine the date and time since they were no longer accessible on YouTube at the time this data set was created. In the case of a missing date and time, we inserted "NULL" in the corresponding cell.
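    A minimal sketch of handling the "NULL" placeholder when loading the comments, assuming pandas. The toy frame below stands in for Dataset.xlsx; with `pd.read_excel`, passing `na_values=["NULL"]` achieves the same effect at read time.

```python
import pandas as pd

# Toy stand-in for two rows of Dataset.xlsx; "NULL" marks a missing timestamp.
raw = pd.DataFrame({
    "Date": ["2020-10-01 12:00", "NULL"],
    "Comment": ["Nice tunnel!", "spam link"],
})
# errors="coerce" turns the "NULL" placeholder into NaT (pandas' missing time)
raw["Date"] = pd.to_datetime(raw["Date"], errors="coerce")
```

    After this step, rows with unknown creation times can be filtered or kept explicitly rather than silently parsed as strings.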

    This data set includes the following files:

    Dataset.xlsx contains the raw and labeled video comments and replies:

    For each comment, the data set contains:

    ID: An identification number generated by YouTube for the comment

    Date: The date and time of the creation of the comment

    Author: The username of the author of the comment

    Likes: The number of likes of the comment

    Replies: The number of replies to the comment

    Comment: The written comment

    Relevance: Label indicating the relevance of the comment for RE (ham = relevant, spam = irrelevant)

    Polarity: Label indicating the polarity of the comment

    Feature request: Label indicating that the comment requests a feature

    Problem report: Label indicating that the comment reports a problem

    Efficiency: Label indicating that the comment deals with the topic efficiency

    Safety: Label indicating that the comment deals with the topic safety

    For each reply, the data set contains:

    ID: The identification number of the comment to which the reply belongs

    Date: The date and time of the creation of the reply

    Author: The username of the author of the reply

    Likes: The number of likes of the reply

    Comment: The written reply

    Relevance: Label indicating the relevance of the reply for RE (ham = relevant, spam = irrelevant)

    Detailed analysis results.xlsx contains the detailed results of the ten-times-repeated 10-fold cross-validation analyses for each considered combination of machine learning algorithms and features

    Guide Sheet - Multi-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual multi-class labeling

    Guide Sheet - Single-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual single-class labeling

    Python scripts for analysis.zip contains the scripts (as jupyter notebooks) and prepared data (as csv-files) for the analyses

  6. Good Label and Package Practices Guide for Non-prescription Drugs and...

    • data.urbandatacentre.ca
    Updated Oct 19, 2025
    Cite
    (2025). Good Label and Package Practices Guide for Non-prescription Drugs and Natural Health Products - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-9aa482a3-bad7-4abb-8292-4be888ee84bd
    Explore at:
    Dataset updated
    Oct 19, 2025
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    This best practices guide provides direction to industry on the design of safe and clear health product labels.

  7. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).

    Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until the files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  8. Microsoft Security Incident Prediction

    • kaggle.com
    zip
    Updated Jul 11, 2024
    Cite
    Microsoft (2024). Microsoft Security Incident Prediction [Dataset]. https://www.kaggle.com/datasets/Microsoft/microsoft-security-incident-prediction
    Explore at:
    Available download formats: zip (538236447 bytes)
    Dataset updated
    Jul 11, 2024
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    Description

    Microsoft is challenging the data science community to develop techniques for predicting the next significant cybersecurity incident. GUIDE, the largest publicly available collection of real-world cybersecurity incidents, enables researchers and practitioners to experiment with authentic cybersecurity data to advance the state of cybersecurity. This groundbreaking dataset contains over 13 million pieces of evidence across 33 entity types, covering 1.6 million alerts and 1 million annotated incidents with triage labels from customers over a two-week period. Of these incidents, 26,000 contain additional remediation action labels from customers. The dataset includes telemetry from over 6,100 organizations, featuring 9,100 unique custom and built-in DetectorIds across numerous security products, encompassing 441 MITRE ATT&CK techniques. GUIDE offers a first-of-its-kind opportunity to develop and benchmark next-generation machine learning models on comprehensive guided response telemetry, supporting efforts to tackle one of cybersecurity's most challenging problems.

    For additional information on GUIDE and Microsoft's approach to Guided Response in Copilot for Security, see the arXiv paper here.

    Introduction

    In the rapidly evolving cybersecurity landscape, the sharp rise in threat actors has overwhelmed enterprise security operation centers (SOCs) with an unprecedented volume of incidents to triage. This surge requires solutions that can either partially or fully automate the remediation process. Fully automated systems demand an exceptionally high confidence threshold to ensure correct actions are taken 99% of the time to avoid inadvertently disabling critical enterprise assets. Consequently, attaining such a high level of confidence often renders full automation impractical.

    This challenge has catalyzed the development of guided response (GR) systems to support SOC analysts by facilitating informed decision-making. Extended Detection and Response (XDR) products are ideally positioned to deliver precise, context-rich guided response recommendations thanks to their comprehensive visibility across the entire enterprise security landscape. By consolidating telemetry across endpoints, network devices, cloud environments, email systems, and more, XDR systems can harness a wide array of data to provide historical context, generate detailed insights into the nature of threats, and recommend tailored remediation actions.

    Dataset Overview

    We provide three hierarchies of data: (1) evidence, (2) alert, and (3) incident. At the bottom level, evidence supports an alert. For example, an alert may be associated with multiple pieces of evidence such as an IP address, email, and user details, each containing specific supporting metadata. Above that, we have alerts that consolidate multiple pieces of evidence to signify a potential security incident. These alerts provide a broader context by aggregating related evidence to present a more comprehensive picture of the potential threat. At the highest level, incidents encompass one or more alerts, representing a cohesive narrative of a security breach or threat scenario.

    Benchmarking

    With the release of GUIDE, we aim to establish a standardized benchmark for guided response systems using real-world data. The primary objective of the dataset is to accurately predict incident triage grades—true positive (TP), benign positive (BP), and false positive (FP)—based on historical customer responses. To support this, we provide a training dataset containing 45 features, labels, and unique identifiers across 1M triage-annotated incidents. We divide the dataset into a train set containing 70% of the data and a test set with 30%, stratified based on triage grade ground-truth, OrgId, and DetectorId. All rows of an incident are kept together within the train or test set so that its evidence and alert rows remain relevant.

    A secondary objective of GUIDE is to benchmark the remediation capabilities of guided response systems. To this end, we release 26k ground-truth labels for predicting remediation actions for alerts, available at both granular and aggregate levels. The recommended metric for evaluating research using the GUIDE dataset is macro-F1 score, along with details on precision and recall.
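    The recommended macro-F1 metric can be sketched in a few lines. The toy predictions below are illustrative, not from GUIDE; in practice one would call `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy triage grades: one TP incident misclassified as BP.
score = macro_f1(["TP", "TP", "BP", "FP"], ["TP", "BP", "BP", "FP"],
                 classes=["TP", "BP", "FP"])
```

    Macro averaging weights each triage grade equally regardless of class frequency, which matters here since grades are imbalanced.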

    Privacy

    To ensure privacy, we implement a stringent anonymization process. Initially, sensitive values are pseudo-anonymized using SHA1 hashing techniques. This step ensures that unique identifiers are obfuscated while maintaining their uniqueness for consistency across the dataset. Following this, we replace these hashed values with randomly generated IDs to further enhance anonymity and prevent any potential re-identification. Additionally, we introduce noise to the timestamps, ensuring that t...
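    The two-step anonymization described above can be sketched as follows. Salts, ID formats, and the timestamp-noise details of Microsoft's actual process are not public, so everything below (including the helper name) is illustrative.

```python
import hashlib
import secrets

id_map: dict[str, str] = {}

def anonymize(value: str) -> str:
    """Step 1: SHA-1 pseudonymization; step 2: swap in a random ID,
    kept consistent so the same input always maps to the same ID."""
    digest = hashlib.sha1(value.encode()).hexdigest()
    if digest not in id_map:
        id_map[digest] = secrets.token_hex(8)
    return id_map[digest]

a1 = anonymize("alice@example.com")
a2 = anonymize("alice@example.com")   # same input -> same random ID
b = anonymize("bob@example.com")      # different input -> different ID
```

    The random-ID substitution removes any structural link back to the hash, while the mapping table preserves cross-row consistency within the dataset.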

  9. Data from: IsoSolve: An Integrative Framework to Improve Isotopic Coverage...

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Pierre Millard; Serguei Sokol; Michael Kohlstedt; Christoph Wittmann; Fabien Létisse; Guy Lippens; Jean-Charles Portais (2023). IsoSolve: An Integrative Framework to Improve Isotopic Coverage and Consolidate Isotopic Measurements by Mass Spectrometry and/or Nuclear Magnetic Resonance [Dataset]. http://doi.org/10.1021/acs.analchem.1c01064.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Pierre Millard; Serguei Sokol; Michael Kohlstedt; Christoph Wittmann; Fabien Létisse; Guy Lippens; Jean-Charles Portais
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Stable-isotope labeling experiments are widely used to investigate the topology and functioning of metabolic networks. Label incorporation into metabolites can be quantified using a broad range of mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy methods, but in general, no single approach can completely cover isotopic space, even for small metabolites. The number of quantifiable isotopic species could be increased and the coverage of isotopic space improved by integrating measurements obtained by different methods; however, this approach has remained largely unexplored because no framework able to deal with partial, heterogeneous isotopic measurements has yet been developed. Here, we present a generic computational framework based on symbolic calculus that can integrate any isotopic data set by connecting measurements to the chemical structure of the molecules. As a test case, we apply this framework to isotopic analyses of amino acids, which are ubiquitous to life, central to many biological questions, and can be analyzed by a broad range of MS and NMR methods. We demonstrate how this integrative framework helps to (i) clarify and improve the coverage of isotopic space, (ii) evaluate the complementarity and redundancy of different techniques, (iii) consolidate isotopic data sets, (iv) design experiments, and (v) guide future analytical developments. This framework, which can be applied to any labeled element, isotopic tracer, metabolite, and analytical platform, has been implemented in IsoSolve (available at https://github.com/MetaSys-LISBP/IsoSolve and https://pypi.org/project/IsoSolve), an open-source software that can be readily integrated into data analysis pipelines.

  10. Summary: Packaging and labelling guide for cannabis products - Catalogue -...

    • data.urbandatacentre.ca
    Updated Oct 19, 2025
    + more versions
    Cite
    (2025). Summary: Packaging and labelling guide for cannabis products - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-17e70eb6-8fa9-4d7e-ac17-ff17bf70d241
    Explore at:
    Dataset updated
    Oct 19, 2025
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    This guide provides information about the packaging and labelling requirements for cannabis and cannabis products under the Cannabis Act and the Cannabis Regulations.

  11. Neighborhood Labels

    • catalog.data.gov
    • opendata.dc.gov
    • +3more
    Updated Feb 5, 2025
    Cite
    D.C. Office of the Chief Technology Officer (2025). Neighborhood Labels [Dataset]. https://catalog.data.gov/dataset/neighborhood-labels
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    D.C. Office of the Chief Technology Officer
    Description

    This dataset was created by the DC Office of Planning and provides a simplified representation of the neighborhoods of the District of Columbia. These boundaries are used by the Office of Planning to determine appropriate locations for placement of neighborhood names on maps. They do not reflect detailed boundary information, do not necessarily include all commonly-used neighborhood designations, do not match planimetric centerlines, and do not necessarily match Neighborhood Cluster boundaries. There is no formal set of standards that describes which neighborhoods are represented or where boundaries are placed. These informal boundaries are not appropriate for display, calculation, or reporting. Their only appropriate use is to guide the placement of text labels for DC's neighborhoods. This is an informal product used for internal mapping purposes only. It should be considered draft, will be subject to change on an irregular basis, and is not intended for publication.

  12. PlantVillage Crop Type Kenya

    • cmr.earthdata.nasa.gov
    • access.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). PlantVillage Crop Type Kenya [Dataset]. http://doi.org/10.34911/rdnt.u41j87
    Explore at:
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    This dataset contains field boundaries and crop type information for fields in Kenya. The PlantVillage app is used to collect multiple points around each field, and collectors have access to basemap imagery in the app during data collection. They use the basemap as a guide when collecting and verifying the points.

    After ground data collection, Radiant Earth Foundation conducted a quality control of the polygons using Sentinel-2 imagery of the growing season as well as Google basemap imagery. Two actions were taken on the data: (1) several polygons that had overlapping areas with different crop labels were removed; (2) invalid polygons, where multiple points were collected in corners of the field (within a distance of less than 0.5 m) and the overall shape was not convex, were corrected. Finally, ground reference polygons were matched with corresponding time series data from Sentinel-2 satellites (listed in the source imagery property of each label item).

  13. StreetSurfaceVis: a dataset of street-level imagery with annotations of road...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 20, 2025
    + more versions
    Cite
    Kapp, Alexandra; Hoffmann, Edith; Weigmann, Esther; Mihaljevic, Helena (2025). StreetSurfaceVis: a dataset of street-level imagery with annotations of road surface type and quality [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11449976
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Hochschule für Technik und Wirtschaft Berlin
    HTW Berlin - University of Applied Sciences
    Authors
    Kapp, Alexandra; Hoffmann, Edith; Weigmann, Esther; Mihaljevic, Helena
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    StreetSurfaceVis

    StreetSurfaceVis is an image dataset containing 9,122 street-level images from Germany with labels on road surface type and quality. The CSV file streetSurfaceVis_v1_0.csv contains all image metadata, and four folders contain the image files. All images are available in four different sizes based on image width: 256px, 1024px, 2048px, and the original size. The folders containing the images are named according to the respective image size, and image files are named based on the mapillary_image_id.

    You can find the corresponding publication here: StreetSurfaceVis: a dataset of crowdsourced street-level imagery with semi-automated annotations of road surface type and quality

    Image metadata

    Each CSV record contains information about one street-level image with the following attributes:

    mapillary_image_id: ID provided by Mapillary (see information below on Mapillary)

    user_id: Mapillary user ID of contributor

    user_name: Mapillary user name of contributor

    captured_at: timestamp, capture time of image

    longitude, latitude: location the image was taken at

    train: suggested train/test split. True for training data and False for test data. The test data contains images from 5 cities that are excluded from the training data.

    surface_type: Surface type of the road in the focal area (the center of the lower image half) of the image. Possible values: asphalt, concrete, paving_stones, sett, unpaved

    surface_quality: Surface quality of the road in the focal area of the image. Possible values: (1) excellent, (2) good, (3) intermediate, (4) bad, (5) very bad (see the attached Labeling Guide document for details)

    Image source

    Images are obtained from Mapillary, a crowd-sourcing platform for street-level imagery. More metadata about each image can be obtained via the Mapillary API. User-generated images are shared by Mapillary under the CC BY-SA license.

    For each image, the dataset contains the mapillary_image_id and user_name. You can access user information on the Mapillary website by appending the user_name to https://www.mapillary.com/app/user/ and image information by appending the mapillary_image_id to https://www.mapillary.com/app/?focus=photo&pKey=

    If you use the provided images, please adhere to the terms of use of Mapillary.
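    For illustration, a minimal sketch of assembling those two URLs from the CSV fields (the user name and image ID below are hypothetical placeholder values, not entries from the dataset):

```python
# Assemble Mapillary profile and photo URLs from the CSV fields.
# "some_user" and "123456789" are hypothetical placeholder values.
user_name = "some_user"
mapillary_image_id = "123456789"

profile_url = f"https://www.mapillary.com/app/user/{user_name}"
photo_url = f"https://www.mapillary.com/app/?focus=photo&pKey={mapillary_image_id}"

print(profile_url)  # → https://www.mapillary.com/app/user/some_user
print(photo_url)    # → https://www.mapillary.com/app/?focus=photo&pKey=123456789
```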

    Instances per class

    Total number of images: 9,122

                   excellent   good   intermediate   bad   very bad
    asphalt              971   1697            821   246          -
    concrete             314    350            250    58          -
    paving stones        385   1063            519    70          -
    sett                   -    129            694   540          -
    unpaved                -      -            326   387        303

    For modeling, we recommend using a train-test split where the test data includes geospatially distinct areas, thereby ensuring the model's ability to generalize to unseen regions is tested. We propose five cities varying in population size and from different regions in Germany for testing - images are tagged accordingly.

    Number of test images (train-test split): 776
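    A minimal sketch of applying the provided split with pandas (the inline rows are made-up placeholders with the same columns; in practice you would read streetSurfaceVis_v1_0.csv instead):

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("streetSurfaceVis_v1_0.csv")
# The rows below are hypothetical placeholders, not real dataset entries.
csv_data = io.StringIO(
    "mapillary_image_id,train,surface_type,surface_quality\n"
    "111,True,asphalt,good\n"
    "222,False,sett,bad\n"
    "333,True,unpaved,very bad\n"
)
df = pd.read_csv(csv_data)

train_df = df[df["train"]]    # images from the training regions
test_df = df[~df["train"]]    # images from the 5 held-out test cities

print(len(train_df), len(test_df))
```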

    Inter-rater reliability

    Three annotators labeled the dataset, such that each image was annotated by one person; annotators were encouraged to consult each other for a second opinion when uncertain. 1,800 images were annotated by all three annotators, resulting in a Krippendorff's alpha of 0.96 for surface type and 0.74 for surface quality.

    Recommended image preprocessing

    As the label refers to the focal road located in the bottom center of the street-level image, it is recommended to crop images to their lower middle half before using them for classification tasks.

    This is an example of the recommended image preprocessing in Python:

    from PIL import Image

    img = Image.open(image_path)
    width, height = img.size
    # Keep the middle half horizontally and the lower half vertically.
    img_cropped = img.crop((0.25 * width, 0.5 * height, 0.75 * width, height))

    License

    CC-BY-SA

    Citation

    If you use this dataset, please cite as:

    Kapp, A., Hoffmann, E., Weigmann, E. et al. StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality. Sci Data 12, 92 (2025). https://doi.org/10.1038/s41597-024-04295-9

    @article{kapp_streetsurfacevis_2025,
      title = {{StreetSurfaceVis}: a dataset of crowdsourced street-level imagery annotated by road surface type and quality},
      volume = {12},
      issn = {2052-4463},
      url = {https://doi.org/10.1038/s41597-024-04295-9},
      doi = {10.1038/s41597-024-04295-9},
      pages = {92},
      number = {1},
      journaltitle = {Scientific Data},
      shortjournal = {Scientific Data},
      author = {Kapp, Alexandra and Hoffmann, Edith and Weigmann, Esther and Mihaljević, Helena},
      date = {2025-01-16},
    }

    This is part of the SurfaceAI project at the University of Applied Sciences, HTW Berlin.

    • Prof. Dr. Helena Mihaljević
    • Alexandra Kapp
    • Edith Hoffmann
    • Esther Weigmann

    Contact: surface-ai@htw-berlin.de

    https://surfaceai.github.io/surfaceai/

    Funding: SurfaceAI is an mFUND project funded by the German Federal Ministry for Digital and Transport.

  14. Data from: Comparative Evaluation of Proteome Discoverer and FragPipe for...

    • acs.figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Tianen He; Youqi Liu; Yan Zhou; Lu Li; He Wang; Shanjun Chen; Jinlong Gao; Wenhao Jiang; Yi Yu; Weigang Ge; Hui-Yin Chang; Ziquan Fan; Alexey I. Nesvizhskii; Tiannan Guo; Yaoting Sun (2023). Comparative Evaluation of Proteome Discoverer and FragPipe for the TMT-Based Proteome Quantification [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00390.s002
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tianen He; Youqi Liu; Yan Zhou; Lu Li; He Wang; Shanjun Chen; Jinlong Gao; Wenhao Jiang; Yi Yu; Weigang Ge; Hui-Yin Chang; Ziquan Fan; Alexey I. Nesvizhskii; Tiannan Guo; Yaoting Sun
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Isobaric labeling-based proteomics is widely applied in deep proteome quantification. Among the platforms for isobaric labeled proteomic data analysis, the commercial software Proteome Discoverer (PD) is widely used, incorporating the search engine CHIMERYS, while FragPipe (FP) is relatively new, free for noncommercial purposes, and integrates the engine MSFragger. Here, we compared PD and FP over three public proteomic data sets labeled using 6plex, 10plex, and 16plex tandem mass tags. Our results showed the protein abundances generated by the two software packages are highly correlated. PD quantified more proteins (10.02%, 15.44%, 8.19%) than FP with comparable NA ratios (0.00% vs. 0.00%, 0.85% vs. 0.38%, and 11.74% vs. 10.52%) in the three data sets. Using the 16plex data set, PD and FP outputs showed high consistency in quantifying technical replicates, batch effects, and functional enrichment in differentially expressed proteins. However, FP saved 93.93%, 96.65%, and 96.41% of processing time compared to PD for analyzing the three data sets, respectively. In conclusion, while PD is well-maintained commercial software integrating various additional functions and can quantify more proteins, FP is freely available and achieves similar output with a shorter computational time. Our results will guide users in choosing the most suitable quantification software for their needs.

  15. Quantified Dynamics-Property Relationships: Data-Efficient Protein...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Oct 23, 2025
    Cite
    T. Emme Burgin (2025). Quantified Dynamics-Property Relationships: Data-Efficient Protein Engineering with Machine Learning of Protein Dynamics [Dataset]. http://doi.org/10.1021/acs.jcim.5c01813.s003
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Oct 23, 2025
    Dataset provided by
    ACS Publications
    Authors
    T. Emme Burgin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning has proven to be very powerful for predicting mutation effects in proteins, but the simplest approaches require a substantial amount of training data. Because experiments to collect training data are often expensive, time-consuming, and/or otherwise limited, alternatives that make good use of small amounts of data to guide protein engineering are of high potential value. One potential alternative to large-scale benchtop experiments for collecting training data is high-throughput molecular dynamics simulation; however, to date, this source of data has been largely absent from the literature. Here, I introduce a new method for selecting desirable protein variants based on quantified relationships between a small number of experimentally determined labels and descriptors of their dynamic properties. These descriptors are provided by deep neural networks trained on data from molecular dynamics simulations of variants of the protein of interest. I demonstrate that this approach can obtain very highly optimized variants based on small amounts of experimental data, outperforming alternative supervised approaches to machine learning-guided directed evolution with the same amount of experimental data. Furthermore, I show that quantified dynamics-property relationships based on only a handful of experimentally labeled example sequences can be used to accurately predict the key residues that are most relevant to determining the property in question, even when that information could not have been known or predicted based on either the molecular dynamics simulations or the experimental data alone. This work establishes a new and practical framework for incorporating general protein dynamics information from simulations of mutants to guide protein engineering.

  16. Data from: 6‑Plex mdSUGAR Isobaric-Labeling Guide Fingerprint Embedding for...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Nov 20, 2023
    Cite
    Min Ma; Miyang Li; Yinlong Zhu; Yingyi Zhao; Feixuan Wu; Zicong Wang; Yu Feng; Hung-Yu Chiang; Manish S. Patankar; Cheng Chang; Lingjun Li (2023). 6‑Plex mdSUGAR Isobaric-Labeling Guide Fingerprint Embedding for Glycomics Analysis [Dataset]. http://doi.org/10.1021/acs.analchem.3c03342.s003
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    ACS Publications
    Authors
    Min Ma; Miyang Li; Yinlong Zhu; Yingyi Zhao; Feixuan Wu; Zicong Wang; Yu Feng; Hung-Yu Chiang; Manish S. Patankar; Cheng Chang; Lingjun Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Glycans are vital biomolecules with diverse functions in biological processes. Mass spectrometry (MS) has become the most widely employed technology for glycomics studies. However, in the traditional data-dependent acquisition mode, only a subset of the abundant ions during MS1 scans are isolated and fragmented in subsequent MS2 events, which reduces reproducibility and prevents the measurement of low-abundance glycan species. Here, we reported a new method termed 6-plex mdSUGAR isobaric-labeling guide fingerprint embedding (MAGNI), to achieve multiplexed, quantitative, and targeted glycan analysis. The glycan peak signature was embedded by a triplicate-labeling strategy with a 6-plex mdSUGAR tag, and using ultrahigh-resolution mass spectrometers, the low-abundance glycans that carry the mass fingerprints can be recognized on the MS1 spectra through an in-house developed software tool, MAGNIFinder. These embedded unique fingerprints can guide the selection and fragmentation of targeted precursor ions and further provide rich information on glycan structures. Quantitative analysis of two standard glycoproteins demonstrated the accuracy and precision of MAGNI. Using this approach, we identified 304 N-glycans in two ovarian cancer cell lines. Among them, 65 unique N-glycans were found differentially expressed, which indicates a distinct glycosylation pattern for each cell line. Remarkably, 31 N-glycans can be quantified in only 1 × 103 cells, demonstrating the high sensitivity of our method. Taken together, our MAGNI method offers a useful tool for low-abundance N-glycan characterization and is capable of determining small quantitative differences in N-glycan profiling. Therefore, it will be beneficial to the field of glycobiology and will expand our understanding of glycosylation.

  17. XNLI: 18-Langauge NLI Dataset

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Cite
    The Devastator (2023). XNLI: 18-Langauge NLI Dataset [Dataset]. https://www.kaggle.com/thedevastator/xnli-18-langauge-nli-dataset
    Explore at:
    zip (1130133924 bytes)
    Available download formats
    Dataset updated
    Nov 27, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    XNLI: 18-Langauge NLI Dataset

    Unlocking Multi-Language Natural Language Inference

    By Huggingface Hub [source]

    About this dataset

    The XNLI: Cross-Lingual Natural Language Inference Dataset is an 18-language dataset containing information on natural language inference. This dataset has been designed to give researchers the ability to better understand the complexities of cross-lingual understanding by providing groups of premise, hypothesis, and label data in diverse languages. With this data, machine learning models can be trained and tested in both English and various non-English languages - such as Spanish, Arabic, and Russian - for performance optimization in AI applications. Each entry of this dataset contains a unique premise sentence as well as an associated hypothesis statement, together with a label (either entailment, neutral, or contradiction) describing the implication relation between them. So whether your focus is language modeling or natural language processing, the XNLI dataset offers a wealth of study material that can open up new research opportunities for you!


    How to use the dataset

    This dataset, XNLI: Cross-Lingual Natural Language Inference, offers an opportunity to benchmark models in the field of natural language processing. It contains parallel inference examples in multiple languages for testing and validating natural language inference. This guide provides an overview of the data and instructions on how to use it.

    The XNLI dataset consists of three sub-datasets: en_test.csv, el_validation.csv, and ur_test.csv. Each CSV file has three columns: premise, hypothesis, and label. The premise column provides a statement or phrase; the hypothesis column presents a new statement that may or may not be true given the premise; the label column indicates whether the pair is an entailment (1), a contradiction (-1), or neither (0).

    To get started with machine learning algorithms such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), there are two common setups. The first is multilingual training with translation-based transfer learning: train on English data and transfer to other languages, translating non-English text into English with a translation service before inference. The second is monolingual training: choose a single target language and train and evaluate on that language's files alone, for example en_test.csv for English or ur_test.csv for Urdu. In either setup, hold out validation data (or use k-fold cross-validation) for hyperparameter tuning, and take the usual care when fine-tuning deep architectures, since the same overfitting risks apply to these comparatively small per-language splits.

    For a monolingual pipeline, a typical workflow is to load the text data from the chosen language's files, clean it, fit a tokenizer, and then train a classifier on the premise-hypothesis pairs.
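    A minimal loading sketch with pandas, assuming the three-column layout described above (the inline rows are made-up placeholders standing in for one of the CSV files, e.g. en_test.csv):

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("en_test.csv")
# The rows below are hypothetical placeholders, not real XNLI entries.
csv_data = io.StringIO(
    "premise,hypothesis,label\n"
    "A man is playing a guitar.,A person is making music.,1\n"
    "A man is playing a guitar.,Nobody is playing an instrument.,-1\n"
    "A man is playing a guitar.,The man is a professional musician.,0\n"
)
df = pd.read_csv(csv_data)

# Map the numeric labels from the description to readable names.
label_names = {1: "entailment", -1: "contradiction", 0: "neutral"}
df["label_name"] = df["label"].map(label_names)

print(df["label_name"].tolist())
```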

    Research Ideas

    • Training and testing a cross-lingual NLI model for language translation applications.
    • Building a sentiment analyzer that can accurately classify sentiment in 18 different languages.
    • Constructing an AI assistant that is capable of understanding natural language in 18 languages and providing appropriate responses accordingly

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and pe...

  18. f

    Data from: PhoXplex: Combining Phospho-enrichable Cross-Linking with...

    • acs.figshare.com
    zip
    Updated Oct 18, 2024
    + more versions
    Cite
    Runa D. Hoenger Ramazanova; Theodoros I. Roumeliotis; James C. Wright; Jyoti S. Choudhary (2024). PhoXplex: Combining Phospho-enrichable Cross-Linking with Isobaric Labeling for Quantitative Proteome-Wide Mapping of Protein Interfaces [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00567.s001
    Explore at:
    zip
    Available download formats
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    ACS Publications
    Authors
    Runa D. Hoenger Ramazanova; Theodoros I. Roumeliotis; James C. Wright; Jyoti S. Choudhary
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Integrating cross-linking mass spectrometry (XL-MS) into structural biology workflows provides valuable information about the spatial arrangement of amino acid stretches, which can guide elucidation of protein assembly architecture. Additionally, the combination of XL-MS with peptide quantitation techniques is a powerful approach to delineate protein interface dynamics across diverse conditions. While XL-MS is increasingly effective with isolated proteins or small complexes, its application to whole-cell samples poses technical challenges related to analysis depth and throughput. The use of enrichable cross-linkers has greatly improved the detectability of protein interfaces on a proteome-wide scale, facilitating global protein–protein interaction mapping. Therefore, bringing together enrichable cross-linking and multiplexed peptide quantification is an appealing approach to enable comparative characterization of structural attributes of proteins and protein interactions. Here, we combined phospho-enrichable cross-linking with TMT labeling to develop a streamlined workflow (PhoXplex) for the detection of differential structural features across a panel of cell lines on a global scale. We achieved deep coverage with quantification of over 9000 cross-links and long loop-links in total, including potentially novel interactions. Overlaying AlphaFold predictions and disordered protein annotations enables exploration of the quantitative cross-linking data set, to reveal possible associations between mutations and protein structures. Lastly, we discuss current shortcomings and perspectives for deep whole-cell profiling of protein interfaces at large scale.

  19. Data from: S1 Dataset -

    • plos.figshare.com
    xlsx
    Updated Dec 6, 2024
    Cite
    Shanmugam Shobana; Gopalakrishnan Sangavi; Ramatu Wuni; Bakshi Priyanka; Arun Leelavady; Dhanushkodi Kayalvizhi; Ranjit Mohan Anjana; Kamala Krishnaswamy; Karani Santhanakrishnan Vimaleswaran; Viswanathan Mohan (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0314819.s004
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Shanmugam Shobana; Gopalakrishnan Sangavi; Ramatu Wuni; Bakshi Priyanka; Arun Leelavady; Dhanushkodi Kayalvizhi; Ranjit Mohan Anjana; Kamala Krishnaswamy; Karani Santhanakrishnan Vimaleswaran; Viswanathan Mohan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nutrition labels on packaged food items provide at-a-glance information about the nutritional composition of the food, serving as a quick guide for consumers to assess the quality of food products. The aim of the current study is to evaluate the nutritional information on the front- and back-of-pack labels of selected packaged foods in the Indian market. A total of 432 food products in six categories (idli mix, breakfast cereals, porridge mix, soup mix, beverage mix and extruded snacks) were investigated by a survey. Nutritional profiling of the foods was done based on the Food Safety and Standards Authority of India (FSSAI) claims regulations. The healthiness of the packaged foods was assessed utilising a nutritional traffic light system. The products were classified into 'healthy', 'moderately healthy' and 'less healthy' based on the fat, saturated fat, and sugar content. Most of the food products evaluated belong to the 'healthy' and 'moderately healthy' categories, except for products in the extruded snacks category. Reformulation of extruded snacks is necessary to decrease the total and saturated fat content. The nutrient content claims were classified using the International Network for Food and Obesity / NCDs Research, Monitoring and Action Support (INFORMAS) taxonomy. Protein, dietary fibre, fat, sugar, vitamins and minerals were the most referred-to nutrients in the nutrient content claims. Breakfast cereals carried the highest number of nutritional claims, while porridge mix had the lowest. The overall compliance of the nutrient content claims for the studied food products is 80.5%. This study gives an overall view of the nutritional quality of the studied convenience food products and snacks in the Indian market.
