60 datasets found
  1. MSL Curiosity Rover Images with Science and Engineering Classes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 17, 2020
    Cite
    Steven Lu; Kiri L. Wagstaff (2020). MSL Curiosity Rover Images with Science and Engineering Classes [Dataset]. http://doi.org/10.5281/zenodo.4033453
    Available download formats: zip
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steven Lu; Kiri L. Wagstaff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.

    Data Set Description

    The data set consists of 6,820 images collected by the Mars Science Laboratory (MSL) Curiosity Rover with three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera (Mastcam) Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original aspect ratio.

    Directory Contents

    • images - contains all 6,820 images
    • class_map.csv - string-integer class mappings
    • train-set-v2.1.txt - label file for the training set
    • val-set-v2.1.txt - label file for the validation set
    • test-set-v2.1.txt - label file for the test set

    The label files are formatted as below:

    "Image-file-name class_in_integer_representation"

    Labeling Process

    Each image was labeled with help from three different volunteers (see the Contributor list). The final labels were determined using the following process:

    • If all three labels agree with each other, then use the label as the final label.
    • If the three labels do not agree with each other, then we manually review the labels and decide the final label.
    • We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.

    Classes

    There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The class names, string-integer mappings, and class distributions are shown below:

    Class name, counts (training set), counts (validation set), counts (test set), integer representation
    Arm cover, 10, 1, 4, 0
    Other rover part, 190, 11, 10, 1
    Artifact, 680, 62, 132, 2
    Nearby surface, 1554, 74, 187, 3
    Close-up rock, 1422, 50, 84, 4
    DRT, 8, 4, 6, 5
    DRT spot, 214, 1, 7, 6
    Distant landscape, 342, 14, 34, 7
    Drill hole, 252, 5, 12, 8
    Night sky, 40, 3, 4, 9
    Float, 190, 5, 1, 10
    Layers, 182, 21, 17, 11
    Light-toned veins, 42, 4, 27, 12
    Mastcam cal target, 122, 12, 29, 13
    Sand, 228, 19, 16, 14
    Sun, 182, 5, 19, 15
    Wheel, 212, 5, 5, 16
    Wheel joint, 62, 1, 5, 17
    Wheel tracks, 26, 3, 1, 18

    Image Augmentation

    Only the training set contains augmented images: 3,920 of the 5,920 training images are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different augmentation methods. Images taken by the Mastcam left- and right-eye cameras were augmented using only horizontal flipping, while images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter on the file names listed in train-set-v2.1.txt to obtain the set of non-augmented images; a short filtering sketch follows the list below.

    • 90 degrees clockwise rotation (file name ends with -r90.jpg)
    • 180 degrees clockwise rotation (file name ends with -r180.jpg)
    • 270 degrees clockwise rotation (file name ends with -r270.jpg)
    • Horizontal flip (file name ends with -fh.jpg)
    • Vertical flip (file name ends with -fv.jpg)
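
    Because augmented files are distinguishable only by these suffixes, the non-augmented training images can be recovered with a short filter over the label file. A minimal sketch, assuming the label-file format shown above:

    # Drop augmented images (suffixes -r90/-r180/-r270/-fh/-fv) from the training labels.
    AUG_SUFFIXES = ("-r90.jpg", "-r180.jpg", "-r270.jpg", "-fh.jpg", "-fv.jpg")

    def original_images(label_path="train-set-v2.1.txt"):
        originals = []
        with open(label_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 2:
                    continue
                name, label = parts[0], int(parts[1])
                if not name.endswith(AUG_SUFFIXES):  # str.endswith accepts a tuple of suffixes
                    originals.append((name, label))
        return originals

    print(len(original_images()), "non-augmented training images")  # expected around 2,000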

    Acknowledgment

    The authors would like to thank the volunteers (listed as Contributors) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continued support of this work.

  2. Clickbaits Labeling Data on Instagram

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Yu-i Ha; Jeongmin Kim; Donghyeon Won; Meeyoung Cha; Jungseock Joo (2023). Clickbaits Labeling Data on Instagram [Dataset]. http://doi.org/10.7910/DVN/DEZMRA
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Yu-i Ha; Jeongmin Kim; Donghyeon Won; Meeyoung Cha; Jungseock Joo
    Description

    Our dataset is composed of information about 7,769 posts on Instagram. The data collection was done over a two-week period in July 2017 using the InstaLooter API. We searched for posts mentioning 62 internationally renowned fashion brand names as hashtags.

  3. Data from: SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning With Semantic Role Labeling

    • heidata.uni-heidelberg.de
    • tudatalib.ulb.tu-darmstadt.de
    zip
    Updated Feb 4, 2019
    + more versions
    Cite
    Ana Marasovic (2019). SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning With Semantic Role Labeling [Source Code] [Dataset]. http://doi.org/10.11588/DATA/LWN9XE
    Available download formats: zip (14676065 bytes)
    Dataset updated
    Feb 4, 2019
    Dataset provided by
    heiDATA
    Authors
    Ana Marasovic
    License

    Custom dataset license: https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/LWN9XE

    Description

    This repository contains code for reproducing experiments done in Marasovic and Frank (2018).

    Paper abstract: For over a decade, machine learning has been used to extract opinion-holder-target structures from text to answer the question "Who expressed what kind of sentiment towards what?". Recent neural approaches do not outperform the state-of-the-art feature-based models for Opinion Role Labeling (ORL). We suspect this is due to the scarcity of labeled training data and address this issue using different multi-task learning (MTL) techniques with a related task which has substantially more data, i.e. Semantic Role Labeling (SRL). We show that two MTL models improve significantly over the single-task model for labeling of both holders and targets, on the development and the test sets. We found that the vanilla MTL model, which makes predictions using only shared ORL and SRL features, performs the best. With deeper analysis, we determine what works and what might be done to make further improvements for ORL.

    Data for ORL: Download the MPQA 2.0 corpus. Check mpqa2-pytools for example usage. Splits can be found in the datasplit folder.

    Data for SRL: The data is provided by the CoNLL-2005 Shared Task, but the original words are from the Penn Treebank dataset, which is not publicly available.

    How to train models?

    python main.py --adv_coef 0.0 --model fs --exp_setup_id new --n_layers_orl 0 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.0 --model html --exp_setup_id new --n_layers_orl 1 --n_layers_shared 2 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.0 --model sp --exp_setup_id new --n_layers_orl 3 --begin_fold 0 --end_fold 4
    python main.py --adv_coef 0.1 --model asp --exp_setup_id prior --n_layers_orl 3 --begin_fold 0 --end_fold 10

  4. Quantitative Content Analysis Data for Hand Labeling Road Surface Conditions in New York State Department of Transportation Camera Images

    • zenodo-rdm.web.cern.ch
    • data.niaid.nih.gov
    zip
    Updated Sep 27, 2023
    Cite
    Carly Sutter; Kara Sulia; Nick P. Bassill; Christopher D. Thorncroft; Christopher D. Wirz; Vanessa Przybylo; Mariana G. Cains; Jacob Radford; David Aaron Evans (2023). Quantitative Content Analysis Data for Hand Labeling Road Surface Conditions in New York State Department of Transportation Camera Images [Dataset]. http://doi.org/10.5281/zenodo.8370665
    Available download formats: zip
    Dataset updated
    Sep 27, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carly Sutter; Kara Sulia; Nick P. Bassill; Christopher D. Thorncroft; Christopher D. Wirz; Vanessa Przybylo; Mariana G. Cains; Jacob Radford; David Aaron Evans
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Traffic camera images from the New York State Department of Transportation (511ny.org) are used to create a hand-labeled dataset of images classified into one of six road surface conditions: 1) severe snow, 2) snow, 3) wet, 4) dry, 5) poor visibility, or 6) obstructed. Six labelers (authors Sutter, Wirz, Przybylo, Cains, Radford, and Evans) went through a series of four labeling trials in which reliability across all six labelers was assessed using the Krippendorff's alpha (KA) metric (Krippendorff, 2007). The online tool by Dr. Freelon (Freelon, 2013; Freelon, 2010) was used to calculate reliability metrics after each trial, and the group achieved inter-coder reliability with a KA of 0.888 on the 4th trial. This process is known as quantitative content analysis, and three pieces of data used in this process are shared: 1) a PDF of the codebook, which serves as the set of rules for labeling images; 2) images from each of the four labeling trials, including the use of New York State Mesonet weather observation data (Brotzge et al., 2020); and 3) an Excel spreadsheet with the calculated inter-coder reliability metrics and other summaries used to assess reliability after each trial.

    The broader purpose of this work is that the six human labelers, having achieved inter-coder reliability, can label large sets of images independently, each contributing to a larger labeled dataset used for training supervised machine learning models to predict road surface conditions from camera images. The xCITE lab (xCITE, 2023) is used to store camera images from 511ny.org and provides computing resources for training machine learning models.

  5. Data from: Fashion conversation data on Instagram

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Ha, Yu-i; Kwon, Sejeong; Cha, Meeyoung; Joo, Jungseock (2023). Fashion conversation data on Instagram [Dataset]. https://search.dataone.org/view/sha256%3Ac3de287ab8b375881b5922ac14887dfb46780a2bb7434e64bc7f71f2da7868fa
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Ha, Yu-i; Kwon, Sejeong; Cha, Meeyoung; Joo, Jungseock
    Description

    Our fashion dataset is composed of information about 24,752 posts by 13,350 people on Instagram. The data collection was done over a one-month period in January 2015. We searched for posts mentioning 48 internationally renowned fashion brand names as hashtags. Our data contain information about hashtags as well as image features based on deep learning (a Convolutional Neural Network, or CNN). The learned features include selfies, body snaps, marketing shots, non-fashion, faces, logos, etc. Please refer to our paper for a full description of how we built our deep learning model.

  6. LandCoverNet Asia

    • access.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). LandCoverNet Asia [Dataset]. http://doi.org/10.34911/rdnt.63fxe5
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    LandCoverNet is a global annual land cover classification training dataset with labels for the multi-spectral satellite imagery from Sentinel-1, Sentinel-2 and Landsat-8 missions in 2018. LandCoverNet Asia contains data across Asia, which accounts for ~31% of the global dataset. Each pixel is identified as one of the seven land cover classes based on its annual time series. These classes are water, natural bare ground, artificial bare ground, woody vegetation, cultivated vegetation, (semi) natural vegetation, and permanent snow/ice.
    There are a total of 2753 image chips of 256 x 256 pixels in LandCoverNet Asia V1.0 spanning 92 tiles. Each image chip contains temporal observations from the following satellite products with an annual class label, all stored in raster format (GeoTIFF files):
    * Sentinel-1 ground range distance (GRD) with radiometric calibration and orthorectification at 10m spatial resolution
    * Sentinel-2 surface reflectance product (L2A) at 10m spatial resolution
    * Landsat-8 surface reflectance product from Collection 2 Level-2

    Radiant Earth Foundation designed and generated this dataset with a grant from Schmidt Futures with additional support from NASA ACCESS, Microsoft AI for Earth and in kind technology support from Sinergise.

  7. Statistics and Evaluation Data for Publication "Using Supervised Learning to Classify Metadata of Research Data by Field of Study"

    • zenodo.org
    application/gzip
    Updated May 24, 2020
    + more versions
    Cite
    Tobias Weber; Michael Fromm; Nelson Tavares de Sousa (2020). Statistics and Evaluation Data for Publication "Using Supervised Learning to Classify Metadata of Research Data by Field of Study" [Dataset]. http://doi.org/10.5281/zenodo.3757468
    Available download formats: application/gzip
    Dataset updated
    May 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tobias Weber; Michael Fromm; Nelson Tavares de Sousa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata of the DataCite index for research data were used to compile a large training and evaluation set comprised of 609,524 records. This publication contains aggregated data for the paper. It also contains the evaluation data of all model/hyper-parameter training and test runs.

  8. Healthcare Data Collection and Labeling Market Size Report 2037

    • researchnester.com
    Updated Jan 10, 2025
    Cite
    Research Nester (2025). Healthcare Data Collection and Labeling Market Size Report 2037 [Dataset]. https://www.researchnester.com/reports/healthcare-data-collection-and-labeling-market/6612
    Dataset updated
    Jan 10, 2025
    Dataset authored and provided by
    Research Nester
    License

    https://www.researchnester.com

    Description

    The global healthcare data collection and labeling market size surpassed USD 1.11 billion in 2024 and is forecast to grow at a steady pace of 25.8% CAGR, reaching USD 21.94 billion by 2037. The North America industry is estimated to account for the largest revenue share of 37.8% by 2037, owing to the use of state-of-the-art tools such as artificial intelligence (AI) and machine learning to improve efficiency and accuracy in data labeling and annotation.

  9. modis-lake-powell-raster-dataset

    • huggingface.co
    Updated Apr 19, 2023
    Cite
    NASA CISTO Data Science Group (2023). modis-lake-powell-raster-dataset [Dataset]. https://huggingface.co/datasets/nasa-cisto-data-science-group/modis-lake-powell-raster-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    NASA (http://nasa.gov/)
    Authors
    NASA CISTO Data Science Group
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Lake Powell
    Description

    MODIS Water Lake Powell Raster Dataset

    Dataset Summary

    Raster dataset comprised of MODIS surface reflectance bands along with calculated indices and a label (water/not-water).

    Data Fields

    water: label, water or not-water (binary)
    sur_refl_b01_1: MODIS surface reflectance band 1 (-100, 16000)
    sur_refl_b02_1: MODIS surface reflectance band 2 (-100, 16000)
    sur_refl_b03_1: MODIS surface reflectance band 3 (-100, 16000)
    … See the full description on the dataset page: https://huggingface.co/datasets/nasa-cisto-data-science-group/modis-lake-powell-raster-dataset.

  10. Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models

    • data.nist.gov
    • cloud.csiss.gmu.edu
    • +1more
    Updated Oct 23, 2020
    Cite
    Brian DeCost (2020). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. http://doi.org/10.18434/mds2-2301
    Dataset updated
    Oct 23, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Brian DeCost
    License

    https://www.nist.gov/open/license

    Description

    The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.

  11. LandCoverNet North America

    • cmr.earthdata.nasa.gov
    • access.earthdata.nasa.gov
    Updated Oct 10, 2023
    + more versions
    Cite
    (2023). LandCoverNet North America [Dataset]. http://doi.org/10.34911/rdnt.jx15e8
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    LandCoverNet is a global annual land cover classification training dataset with labels for the multi-spectral satellite imagery from Sentinel-1, Sentinel-2 and Landsat-8 missions in 2018. LandCoverNet North America contains data across North America, which accounts for ~13% of the global dataset. Each pixel is identified as one of the seven land cover classes based on its annual time series. These classes are water, natural bare ground, artificial bare ground, woody vegetation, cultivated vegetation, (semi) natural vegetation, and permanent snow/ice.

    There are a total of 1561 image chips of 256 x 256 pixels in LandCoverNet North America V1.0 spanning 40 tiles. Each image chip contains temporal observations from the following satellite products with an annual class label, all stored in raster format (GeoTIFF files):
    * Sentinel-1 ground range distance (GRD) with radiometric calibration and orthorectification at 10m spatial resolution
    * Sentinel-2 surface reflectance product (L2A) at 10m spatial resolution
    * Landsat-8 surface reflectance product from Collection 2 Level-2

    Radiant Earth Foundation designed and generated this dataset with a grant from Schmidt Futures with additional support from NASA ACCESS, Microsoft AI for Earth and in kind technology support from Sinergise.

  12. Human Faces and Objects Mix Image Dataset

    • data.mendeley.com
    Updated Mar 13, 2025
    Cite
    Bindu Garg (2025). Human Faces and Objects Mix Image Dataset [Dataset]. http://doi.org/10.17632/nzwvnrmwp3.1
    Dataset updated
    Mar 13, 2025
    Authors
    Bindu Garg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description: Human Faces and Objects Dataset (HFO-5000) The Human Faces and Objects Dataset (HFO-5000) is a curated collection of 5,000 images, categorized into three distinct classes: male faces (1,500), female faces (1,500), and objects (2,000). This dataset is designed for machine learning and computer vision applications, including image classification, face detection, and object recognition. The dataset provides high-quality, labeled images with a structured CSV file for seamless integration into deep learning pipelines.

    Column Description: The dataset is accompanied by a CSV file that contains essential metadata for each image, with the following columns:

    file_name: the name of the image file (e.g., image_001.jpg).
    label: the category of the image, with three possible values: "male" (male face images), "female" (female face images), or "object" (images of various objects).
    file_path: the full or relative path to the image file within the dataset directory.
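
    A short consumption sketch follows; the file name labels.csv is an assumption (the listing does not give it), and the columns are those described above.

    import csv
    from collections import Counter

    def class_counts(csv_path="labels.csv"):  # assumed file name
        """Count images per label ("male", "female", "object") in the metadata CSV."""
        counts = Counter()
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                counts[row["label"]] += 1
        return counts

    print(class_counts())  # expected roughly: male 1500, female 1500, object 2000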

    Uniqueness and Key Features:
    1) Balanced Distribution: the dataset maintains an even distribution of human faces (male and female) to minimize bias in classification tasks.
    2) Diverse Object Selection: the object category consists of a wide variety of items, ensuring robustness in distinguishing between human and non-human entities.
    3) High-Quality Images: the dataset consists of clear and well-defined images, suitable for both training and testing AI models.
    4) Structured Annotations: the CSV file simplifies dataset management and integration into machine learning workflows.
    5) Potential Use Cases: the dataset can be used for tasks such as gender classification, facial recognition benchmarking, human-object differentiation, and transfer learning applications.

    Conclusion: The HFO-5000 dataset provides a well-structured, diverse, and high-quality set of labeled images that can be used for various computer vision tasks. Its balanced distribution of human faces and objects ensures fairness in training AI models, making it a valuable resource for researchers and developers. By offering structured metadata and a wide range of images, this dataset facilitates advancements in deep learning applications related to facial recognition and object classification.

  13. Data from: DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild

    • scidb.cn
    • observatorio-cientifico.ua.es
    Updated Sep 4, 2024
    Cite
    Luis Márquez-Carpintero; Sergio Suescun-Ferrandiz; Carolina Lorenzo Álvarez; Jorge Fernandez-Herrero; Diego Viejo; Rosabel Roig-Vila; Miguel Cazorla (2024). DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild [Dataset]. http://doi.org/10.57760/sciencedb.11541
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 4, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Luis Márquez-Carpintero; Sergio Suescun-Ferrandiz; Carolina Lorenzo Álvarez; Jorge Fernandez-Herrero; Diego Viejo; Rosabel Roig-Vila; Miguel Cazorla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description

    The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings. It consists of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

    Data Collection and Generation Procedures

    The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

    Experimental Sessions

    Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:
    1. News Reading - students read projected or device-displayed news.
    2. Brainstorming Session - idea generation for problem-solving.
    3. Lecture - passive listening to an instructor-led session.
    4. Information Organization - synthesizing information from different sources.
    5. Lecture Test - assessment of lecture content via mobile devices.
    6. Individual Presentations - students present their projects.
    7. Knowledge Test - conducted using Kahoot.
    8. Robotics Experimentation - hands-on session with robotics.
    9. MTINY Activity Design - development of educational activities with computational thinking.

    Technical Specifications

    RGB Cameras: individual cameras recorded at 640x480 pixels, while context cameras captured at 1280x720 pixels.
    Frame Rate: 9-10 FPS depending on the setup.
    Smartwatch Sensors: heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1-100 Hz.

    Data Organization and Formats

    The dataset follows a structured directory format (a small traversal sketch follows this description):

    /groupX/experimentY/subjectZ.zip

    Each subject-specific folder contains:
    images/ (individual facial images)
    watch_sensors/ (sensor readings in JSON format)
    labels/ (engagement & emotion annotations)
    metadata/ (subject demographics & session details)

    Annotations and Labeling

    Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

    Missing Data and Data Quality

    Synchronization: a centralized server ensured time alignment across devices; brightness changes were used to verify synchronization.
    Completeness: no major missing data, except for occasional random frame drops due to embedded device performance.
    Data Consistency: uniform collection methodology across sessions, ensuring high reliability.

    Data Processing Methods

    To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

    File Formats and Accessibility

    Images: stored in standard JPEG format.
    Sensor Data: provided as structured JSON files.
    Labels: available as CSV files with timestamps.
    The dataset is publicly available under the CC-BY license and can be accessed along with the necessary processing scripts via the DIPSER GitHub repository.

    Potential Errors and Limitations

    Due to camera angles, some student movements may be out of frame in collaborative sessions.
    Lighting conditions vary slightly across experiments.
    Sensor latency variations are minimal but exist due to embedded device constraints.

    Citation

    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
      title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
      year={2025},
      eprint={2502.20209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20209},
    }

    Usage and Reproducibility

    Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
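
    A small traversal sketch for the directory layout described above; the root path and the exact file names inside each archive are assumptions, and only the documented top-level folders are listed.

    import zipfile
    from pathlib import Path

    def iter_subject_archives(root="."):
        """Yield (group, experiment, zip_path) for every subjectZ.zip under groupX/experimentY/."""
        for zip_path in Path(root).glob("group*/experiment*/subject*.zip"):
            yield zip_path.parent.parent.name, zip_path.parent.name, zip_path

    def top_level_folders(zip_path):
        """Return the top-level folders in one archive (expected: images/, watch_sensors/, labels/, metadata/)."""
        with zipfile.ZipFile(zip_path) as zf:
            return sorted({name.split("/")[0] + "/" for name in zf.namelist() if "/" in name})

    for group, experiment, zp in iter_subject_archives():
        print(group, experiment, zp.name, top_level_folders(zp))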

  14. Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification

    • data.mendeley.com
    • paperswithcode.com
    • +1more
    Updated Jan 6, 2018
    + more versions
    Cite
    Daniel Kermany (2018). Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification [Dataset]. http://doi.org/10.17632/rscbjbr9sj.2
    Dataset updated
    Jan 6, 2018
    Authors
    Daniel Kermany
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of validated OCT and Chest X-Ray images described and analyzed in "Deep learning-based classification and referral of treatable human diseases". The OCT Images are split into a training set and a testing set of independent patients. OCT Images are labeled as (disease)-(randomized patient ID)-(image number by this patient) and split into 4 directories: CNV, DME, DRUSEN, and NORMAL.

  15. CloudSEN12 - a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2

    • scidb.cn
    • produccioncientifica.usal.es
    • +2more
    Updated Nov 28, 2022
    Cite
    Cesar Luis; Luis Ysuhuaylas; Jhomira Loja; Karen Gonzales; Fernando Herrera; Lesly Bautista; Roy Yali; Angie Flores; Lissette Diaz; Nicole Cuenca; Wendy Espinoza; Fernando Prudencio; Joselyn Inga; Valeria Llactayo; David Montero; Martin Sudmanns; Dirk Tiede; Gonzalo Mateo-García; Luis Gómez-Chova (2022). CloudSEN12 - a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2 [Dataset]. http://doi.org/10.57760/sciencedb.06669
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 28, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Cesar Luis; Luis Ysuhuaylas; Jhomira Loja; Karen Gonzales; Fernando Herrera; Lesly Bautista; Roy Yali; Angie Flores; Lissette Diaz; Nicole Cuenca; Wendy Espinoza; Fernando Prudencio; Joselyn Inga; Valeria Llactayo; David Montero; Martin Sudmanns; Dirk Tiede; Gonzalo Mateo-García; Luis Gómez-Chova
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    CloudSEN12 is a large dataset for cloud semantic understanding that consists of 9880 regions of interest (ROIs). Each ROI has five 5090 x 5090 meter image patches (IPs) collected on different dates; we manually chose the images to guarantee that each IP inside an ROI matches one of the following cloud cover groups:

    - clear (0%)
    - low-cloudy (1% - 25%)
    - almost clear (25% - 45%)
    - mid-cloudy (45% - 65%)
    - cloudy (> 65%)

    An IP is the core unit in CloudSEN12. Each IP contains data from Sentinel-2 optical levels 1C and 2A, Sentinel-1 Synthetic Aperture Radar (SAR), a digital elevation model, surface water occurrence, land cover classes, and cloud mask results from eight cutting-edge cloud detection algorithms. In addition, in order to support standard, weakly, and self-/semi-supervised learning procedures, CloudSEN12 includes three distinct forms of hand-crafted labelling data: high-quality, scribble, and no annotation. Consequently, each ROI is randomly assigned to a different annotation group:

    - 2000 ROIs with pixel-level annotation, where the average annotation time is 150 minutes (high-quality group).
    - 2000 ROIs with scribble-level annotation, where the annotation time is 15 minutes (scribble group).
    - 5880 ROIs with annotation only in the cloud-free (0%) image (no-annotation group).

    For high-quality labels, we use the Intelligence foR Image Segmentation (IRIS) active learning technology, combining human photo-interpretation and machine learning. For scribble, ground truth pixels were drawn using IRIS but without ML support. Finally, the no-annotation dataset is generated automatically, with manual annotation only in the clear image patch. A backup of the dataset in STAC format is available here: https://shorturl.at/cgjtz. Check out our website https://cloudsen12.github.io/ for examples.

  16. Calcite data set with boolean label

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 12, 2023
    Cite
    Tomescu, Vlad Ioan (2023). Calcite data set with boolean label [Dataset]. http://doi.org/10.7910/DVN/VDBQVV
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Tomescu, Vlad Ioan
    Description

    A data set consisting of various literature metrics computed on the Apache Calcite project. The presence of defects is marked with boolean labels.

  17. LandCoverNet Australia

    • access.earthdata.nasa.gov
    • cmr.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). LandCoverNet Australia [Dataset]. http://doi.org/10.34911/rdnt.0vgi25
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    LandCoverNet is a global annual land cover classification training dataset with labels for the multi-spectral satellite imagery from Sentinel-1, Sentinel-2 and Landsat-8 missions in 2018. LandCoverNet Australia contains data across Australia, which accounts for ~7% of the global dataset. Each pixel is identified as one of the seven land cover classes based on its annual time series. These classes are water, natural bare ground, artificial bare ground, woody vegetation, cultivated vegetation, (semi) natural vegetation, and permanent snow/ice.
    There are a total of 600 image chips of 256 x 256 pixels in LandCoverNet Australia V1.0 spanning 20 tiles. Each image chip contains temporal observations from the following satellite products with an annual class label, all stored in raster format (GeoTIFF files):
    * Sentinel-1 ground range distance (GRD) with radiometric calibration and orthorectification at 10m spatial resolution
    * Sentinel-2 surface reflectance product (L2A) at 10m spatial resolution
    * Landsat-8 surface reflectance product from Collection 2 Level-2

    Radiant Earth Foundation designed and generated this dataset with a grant from Schmidt Futures with additional support from NASA ACCESS, Microsoft AI for Earth and in kind technology support from Sinergise.

  18. Replication Data for "An Investigation of Social Media Labeling Decisions Preceding the 2020 U.S. Election"

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Grossman, Shelby (2023). Replication Data for "An Investigation of Social Media Labeling Decisions Preceding the 2020 U.S. Election" [Dataset]. https://search.dataone.org/view/sha256%3A829cb59643165543633d8b9d5b68da4131d0824ed75da18f1abd4b29aef09b49
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Grossman, Shelby
    Description

    Attachments include replication data, replication code, and a short document explaining the dataset with variable descriptions for the paper: "An Investigation of Social Media Labeling Decisions Preceding the 2020 U.S. Election" by Samantha Bradshaw, Shelby Grossman, and Miles McCain.

  19. Replication Data for: Message or Messenger? Source and Labeling Effects in Authoritarian Response to Protest

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Arnon, Daniel; Edwards, Pearce; Li, Handi (2023). Replication Data for: Message or Messenger? Source and Labeling Effects in Authoritarian Response to Protest [Dataset]. http://doi.org/10.7910/DVN/ZMRFYS
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Arnon, Daniel; Edwards, Pearce; Li, Handi
    Description

    Authoritarian regimes in the 21st century have increasingly turned to information control rather than kinetic force to respond to threats to their rule. This paper studies an often overlooked type of information control: strategic labeling and public statements by regime sources in response to protests. Labeling protesters as violent criminals may increase support for repression by signaling that protests are illegitimate and deviant. Regime sources, compared to more independent sources, could increase support for repression even more when paired with such an accusatory label. Accommodative labels should have the opposite effect, decreasing support for repression. The argument is tested with a survey experiment in China which labels environmental protests. Accusatory labels increase support for repression of protests. Regime sources, meanwhile, have no advantage over nongovernmental sources in shifting opinion. The findings suggest that negative labels de-legitimize protesters and legitimize repression, while the sources matter less in this contentious authoritarian context.

  20. Antisemitism on Twitter: A Dataset for Machine Learning and Text Analytics

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Miehling, Daniel (2023). Antisemitism on Twitter: A Dataset for Machine Learning and Text Analytics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7872834
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    Soemer, Katharina
    Miehling, Daniel
    Karali, Sameer
    Jikeli, Gunther
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Institute For the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset:

    The ISCA project has compiled this dataset using an annotation portal, which was used to label tweets as either antisemitic or non-antisemitic, among other labels. Please note that the annotation was done with live data, including images and the context, such as threads. The original data was sourced from annotationportal.com.

    Content:

    This dataset contains 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021. The dataset is drawn from representative samples during this period with relevant keywords. 1,250 tweets (18%) meet the IHRA definition of antisemitic messages.

    The distribution of messages by year is as follows: 1,499 (22%) from 2019, 3,716 (54%) from 2020, and 1,726 (25%) from 2021. 4,605 (66%) contain the keyword "Jews," 1,524 (22%) include "Israel," 529 (8%) feature the derogatory term "ZioNazi*," and 283 (4%) use the slur "K---s." Some tweets may contain multiple keywords.

    483 out of the 4,605 tweets with the keyword "Jews" (11%) and 203 out of the 1,524 tweets with the keyword "Israel" (13%) were classified as antisemitic. 97 out of the 283 tweets using the antisemitic slur "K---s" (34%) are antisemitic. Interestingly, many tweets featuring the slur "K---s" actually call out its usage. In contrast, the majority of tweets with the derogatory term "ZioNazi*" are antisemitic, with 467 out of 529 (88%) being classified as such.

    File Description:

    The dataset is provided in CSV format, with each row representing a single message, including replies, quotes, and retweets (a minimal loading sketch follows the column list). The file contains the following columns:

    TweetID: the tweet ID.
    Username: the username that published the tweet.
    Text: the full text of the tweet (not pre-processed).
    CreateDate: the date the tweet was created.
    Biased: the annotators' label indicating whether the tweet is antisemitic or non-antisemitic.
    Keyword: the keyword that was used in the query. The keyword can be in the text, including mentioned names, or in the username.
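
    A minimal loading sketch, assuming a file name of tweets.csv (not stated in the listing) and the column names above:

    import csv
    from collections import Counter

    def label_counts_by_keyword(csv_path="tweets.csv"):  # assumed file name
        """Count (Keyword, Biased) pairs to summarize labels per query keyword."""
        counts = Counter()
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                counts[(row["Keyword"], row["Biased"])] += 1
        return counts

    for (keyword, label), n in sorted(label_counts_by_keyword().items()):
        print(keyword, label, n)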

    Licences

    Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)

    R code is published under the terms of the "MIT" licence (https://opensource.org/licenses/MIT).

    Acknowledgements

    We are grateful for the support of Indiana University’s Observatory on Social Media (OSoMe) (Davis et al. 2016) and the contributions and annotations of all team members in our Social Media & Hate Research Lab at Indiana University’s Institute for the Study of Contemporary Antisemitism, especially Grace Bland, Elisha S. Breton, Kathryn Cooper, Robin Forstenhäusler, Sophie von Máriássy, Mabel Poindexter, Jenna Solomon, Clara Schilling, and Victor Tschiskale.

    This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
