Data Description
The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings, consisting of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

Data Collection and Generation Procedures
The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

Experimental Sessions
Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:
1. News Reading – Students read projected or device-displayed news.
2. Brainstorming Session – Idea generation for problem-solving.
3. Lecture – Passive listening to an instructor-led session.
4. Information Organization – Synthesizing information from different sources.
5. Lecture Test – Assessment of lecture content via mobile devices.
6. Individual Presentations – Students present their projects.
7. Knowledge Test – Conducted using Kahoot.
8. Robotics Experimentation – Hands-on session with robotics.
9. MTINY Activity Design – Development of educational activities with computational thinking.

Technical Specifications
RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
Frame Rate: 9-10 FPS depending on the setup.
Smartwatch Sensors: Collected heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1-100 Hz.

Data Organization and Formats
The dataset follows a structured directory format:
/groupX/experimentY/subjectZ.zip
Each subject-specific folder contains:
images/ (individual facial images)
watch_sensors/ (sensor readings in JSON format)
labels/ (engagement & emotion annotations)
metadata/ (subject demographics & session details)

Annotations and Labeling
Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

Missing Data and Data Quality
Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

Data Processing Methods
To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

File Formats and Accessibility
Images: Stored in standard JPEG format.
Sensor Data: Provided as structured JSON files.
Labels: Available as CSV files with timestamps.
The dataset is publicly available under the CC BY license and can be accessed, along with the necessary processing scripts, via the DIPSER GitHub repository.

Potential Errors and Limitations
Due to camera angles, some student movements may be out of frame in collaborative sessions.
Lighting conditions vary slightly across experiments.
Sensor latency variations are minimal but exist due to embedded device constraints.

Citation
If you find this project helpful for your research, please cite our work using the following BibTeX entry:

@misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
  title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
  year={2025},
  eprint={2502.20209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20209},
}

Usage and Reproducibility
Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
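The short sketch below shows one way to load a single subject's archive following the directory layout described above. It is a hedged illustration: the exact file names and extensions inside images/, watch_sensors/, and labels/ are assumptions, not documented specifics.

```python
# Hedged sketch: loading one DIPSER subject's data, assuming the layout
# /groupX/experimentY/subjectZ.zip with images/, watch_sensors/, labels/.
# File names and extensions inside each folder are hypothetical.
import json
import zipfile
from pathlib import Path

import pandas as pd

def load_subject(zip_path: str, out_dir: str = "subject_data"):
    # Extract the subject archive, e.g. group1/experiment2/subject3.zip
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    root = Path(out_dir)

    # Facial images (JPEG frames)
    images = sorted(root.glob("images/*.jpg"))

    # Smartwatch sensor readings stored as JSON
    sensors = []
    for f in sorted(root.glob("watch_sensors/*.json")):
        with open(f) as fh:
            sensors.append(json.load(fh))

    # Engagement / emotion labels as timestamped CSV
    label_files = sorted(root.glob("labels/*.csv"))
    labels = pd.concat(map(pd.read_csv, label_files)) if label_files else pd.DataFrame()

    return images, sensors, labels
```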
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Depression is a psychological state of mind that often influences a person in an unfavorable manner. While it can occur in people of all ages, students are especially vulnerable to it throughout their academic careers. Beginning in 2020, the COVID-19 pandemic caused major problems in people's lives by driving them into quarantine and forcing them to be connected continually with mobile devices, such that mobile connectivity became the new norm during the pandemic and beyond. This situation is further accelerated for students as universities move towards a blended learning mode. In these circumstances, monitoring student mental health in terms of mobile and Internet connectivity is crucial for their wellbeing. This study focuses on students attending an International University of Bangladesh to investigate their mental health in relation to their continual use of mobile devices (e.g., smartphones, tablets, laptops, etc.). A cross-sectional survey method was employed to collect data from 444 participants. Following exploratory data analysis, eight machine learning (ML) algorithms were used to develop an automated normal-to-extremely-severe depression identification and classification system. When the automated detection was combined with feature selection methods such as the Chi-square test and Recursive Feature Elimination (RFE), an increase in accuracy of about 3 to 5% was observed. Similarly, an increase of 5 to 15% in accuracy was observed when a feature extraction method such as Principal Component Analysis (PCA) was applied. The SparsePCA feature extraction technique in combination with the CatBoost classifier showed the best results in terms of accuracy, F1-score, and ROC-AUC. The data analysis revealed no sign of depression in about 44% of the total participants, while about 25% of students showed mild-to-moderate and 31% showed severe-to-extreme signs of depression. The results suggest that ML models, incorporating a proper feature engineering method, can serve adequately in multi-stage depression detection among students. This model might be utilized in other disciplines for detecting early signs of depression among people.
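The following is a minimal sketch of the kind of pipeline the abstract describes, SparsePCA feature extraction feeding a CatBoost classifier with a 70/30-style split. The file name, target column, component count, and hyperparameters are placeholders, not the study's actual configuration.

```python
# Minimal sketch of the reported best-performing combination (SparsePCA +
# CatBoost); data loading and column names are hypothetical placeholders.
import pandas as pd
from sklearn.decomposition import SparsePCA
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier  # assumed available via `pip install catboost`

df = pd.read_csv("survey_responses.csv")                    # hypothetical file
X, y = df.drop(columns=["depression_level"]), df["depression_level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature extraction: sparse principal components of the survey features
spca = SparsePCA(n_components=10, random_state=42)
X_train_t = spca.fit_transform(X_train)
X_test_t = spca.transform(X_test)

clf = CatBoostClassifier(verbose=0, random_state=42)
clf.fit(X_train_t, y_train)
pred = clf.predict(X_test_t).ravel()

print("accuracy:", accuracy_score(y_test, pred))
print("macro F1:", f1_score(y_test, pred, average="macro"))
```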
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning can be used to predict fault properties such as shear stress, friction, and time to failure using continuous records of fault zone acoustic emissions. The files are extracted features and labels from lab data (experiment p4679). The features are extracted with a non-overlapping window from the original acoustic data. The first column is the time of the window. The second and third columns are the mean and the variance of the acoustic data in this window, respectively. The 4th to 11th columns are the power spectral density, ranging from low to high frequency. The last column is the corresponding label (shear stress level). The name of each file indicates which driving velocity the sequence is generated from. Data were generated from laboratory friction experiments conducted with a biaxial shear apparatus. Experiments were conducted in the double direct shear configuration, in which two fault zones are sheared between three rigid forcing blocks. Our samples consisted of two 5-mm-thick layers of simulated fault gouge with a nominal contact area of 10 by 10 cm^2. Gouge material consisted of soda-lime glass beads with an initial particle size between 105 and 149 micrometers. Prior to shearing, we impose a constant fault normal stress of 2 MPa using a servo-controlled load-feedback mechanism and allow the sample to compact. Once the sample has reached a constant layer thickness, the central block is driven down at a constant rate of 10 micrometers per second. In tandem, we collect an AE signal continuously at 4 MHz from a piezoceramic sensor embedded in a steel forcing block about 22 mm from the gouge layer. The data from this experiment can be used with a deep learning algorithm to train it for future fault property prediction.
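A short sketch of how per-window features of the kind described (window time, mean, variance, eight power-spectral-density bands) can be computed from a raw acoustic trace. The window length, Welch settings, and band edges are assumptions for illustration, not the authors' exact values.

```python
# Sketch of the per-window feature layout described above; window length and
# PSD band edges are assumptions, not the original processing parameters.
import numpy as np
from scipy.signal import welch

FS = 4_000_000          # 4 MHz acoustic sampling rate
WINDOW = 4096           # assumed non-overlapping window length (samples)

def window_features(ae_signal: np.ndarray, t0: float = 0.0) -> np.ndarray:
    rows = []
    n_windows = len(ae_signal) // WINDOW
    for i in range(n_windows):
        w = ae_signal[i * WINDOW:(i + 1) * WINDOW]
        t = t0 + (i + 0.5) * WINDOW / FS            # window centre time (s)
        _, psd = welch(w, fs=FS, nperseg=1024)
        # Aggregate the PSD into 8 bands, ordered from low to high frequency
        bands = [band.mean() for band in np.array_split(psd, 8)]
        rows.append([t, w.mean(), w.var(), *bands])
    return np.asarray(rows)   # columns: time, mean, variance, 8 PSD bands
```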
Repository for the data generated as part of the 2023-2024 ALCC project "Machine Learning-Enhanced Multiphase CFD for Carbon Capture Modeling." The data was generated with MFIX-Exa's CFD-DEM model. The problem of interest is gravity driven, particle-laden, gas-solid flow in a triply-periodic domain of length 2048 particle diameters with an aspect ratio of 4. The mean particle concentration ranges from 1% to 40% and the Archimedes number ranges from 18 to 90. The particle-to-fluid density ratio, particle-particle restitution and friction coefficients and domain aspect ratio are held constant at values of 1000, 0.9, 0.25 and 4, respectively. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award ALCC-ERCAP0025948.
Abstract:
In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers have taken on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.
General Information:
This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.
You may find details of this dataset from the original paper:
Sasho Nedelkoski, Ajay Kumar Mandapati, Jasmin Bogatinovski, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics". [link very soon]
If you use the data, implementation, or any details of the paper, please cite!
The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts together with the Rally report, which can serve as ground truth (all at the Zenodo link below). We provide two datasets, which differ in how the workload is executed. The openstack_multimodal_sequential_actions dataset is generated by executing a workload of sequential user requests. The openstack_multimodal_concurrent_actions dataset is generated by executing a workload of concurrent user requests.
The concurrent dataset differs in the following ways:
Due to the heavy load on the control node, the metric data for wally113 (control node) is not representative and we excluded it.
Three rally actions are executed in parallel: boot_and_delete, create_and_delete_networks, create_and_delete_image, whereas for the sequential there were 5 actions executed.
The raw logs in both datasets contain the same files. If the user wants the logs filtered by time with respect to the two datasets, they should refer to the timestamps in the metrics (these provide the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.
Important: The logs and the metrics are synchronized with respect to time, and both are recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, i.e., CEST minus 2 hours). They should be synchronized if the user develops multimodal methods.
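A small pandas sketch of the timestamp alignment described above, converting trace timestamps from UTC to the CEST clock used by the logs and metrics. The file and column names are placeholders.

```python
# Sketch: align trace timestamps (UTC) with logs/metrics (CEST = UTC + 2 h).
# "traces.csv" and the "timestamp" column are placeholder names.
import pandas as pd

traces = pd.read_csv("traces.csv")
traces["timestamp"] = (
    pd.to_datetime(traces["timestamp"], utc=True)
      .dt.tz_convert("Europe/Berlin")   # CEST during the recording period
      .dt.tz_localize(None)             # drop tz info to match the other sources
)
```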
Our GitHub repository can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason it has not improved could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: this approach is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
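A minimal, self-contained sketch of the preprocessing idea discussed above: run k-means before classification and append the cluster id as an extra feature, then compare against a plain baseline. The data is synthetic and the cluster count and classifier are illustrative choices, not the project's actual setup.

```python
# Sketch: clustering prior to classification, with cluster membership as a
# new feature. Synthetic data; k=5 and the random forest are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Fixing random_state makes the cluster assignment reproducible; leaving it
# unset (as in the project) lets you probe how stable the clustering is.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_id = km.fit_predict(X).reshape(-1, 1)

X_aug = np.hstack([X, cluster_id])

clf = RandomForestClassifier(random_state=0)
print("baseline :", cross_val_score(clf, X, y, cv=5).mean())
print("+clusters:", cross_val_score(clf, X_aug, y, cv=5).mean())
```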
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was collected by an unmanned aerial vehicle in Hermitage, Réunion, on 2023-12-01.

Underwater or aerial images collected by scientists or citizens can have a wide variety of uses for science, management, or conservation. These images can be annotated and shared to train AI models which can in turn predict the objects in the images. We provide a set of tools (hardware and software) to collect marine data, predict species or habitat, and provide maps.

Survey information
Camera: Hasselblad L1D-20C
Number of images: 119
Total size: 1.04 GB
Flight start: 2023:12:01 15:07:31
Flight end: 2023:12:01 15:13:31
Flight duration: 0h 6min 0sec
Median height: 79.9 m
Area covered: 3.93 ha

Generic folder structure
yyyymmdd_countrycode-optionalplace_device_session-number
├── dcim : folder to store videos and photos depending on the media collected.
├── gps : folder to store any positioning related file. If any kind of correction is possible on files (e.g. post-processed kinematic thanks to RINEX data) then the distinction between device data and base data is made. If, on the other hand, only device position data are present and the files cannot be corrected by post-processing techniques (e.g. GPX files), then the distinction between base and device is not made and the files are placed directly at the root of the gps folder.
│   ├── base : files coming from an RTK station or any static positioning instrument.
│   └── device : files coming from the device.
├── metadata : folder with general information files about the session.
├── processed_data : contains all the folders needed to store the results of the data processing of the current session.
│   ├── bathy : output folder for bathymetry raw data extracted from mission logs.
│   ├── frames : output folder for georeferenced frames extracted from dcim videos.
│   ├── ia : destination folder for image recognition predictions.
│   └── photogrammetry : destination folder for reconstructed models in photogrammetry.
└── sensors : folder to store files coming from other sources (bathymetry data from the echosounder, log file from the autopilot, mission plan, etc.).

Software
All the raw data was processed using our workflow. All predictions were generated by our inference pipeline. You can find all the necessary scripts to download this data in this repository. Enjoy your data with SeatizenDOI!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Intending to cover the existing gap regarding behavioral datasets that model the interactions of users with individual and multiple devices in a Smart Office, in order to later authenticate them continuously, we publish the following collection of datasets, generated after five users interacted for 60 days with their personal computers and mobile devices. Below you can find a brief description of each dataset.

Dataset 1 (2.3 GB). This dataset contains 92,975 feature vectors (8,096 features per vector) that model the interactions of the five users with their personal computers. Each vector contains aggregated data about keyboard and mouse activity, as well as application usage statistics. More information about the meaning of the features can be found in the readme file. Originally, the number of features in this dataset was 24,065, but after filtering out the constant features this number was reduced to 8,096. There was a high number of features constant at 0, since every possible digraph (two-key combination) was considered when collecting the data. However, many unusual digraphs were never typed by the users on their computers, so these features were deleted from the uploaded dataset (a minimal sketch of this filtering step is given below).

Dataset 2 (8.9 MB). This dataset contains 61,918 feature vectors (15 features per vector) that model the interactions of the five users with their mobile devices. Each vector contains aggregated data about application usage statistics. More information about the meaning of the features can be found in the readme file.

Dataset 3 (28.9 MB). This dataset contains 133,590 feature vectors (42 features per vector) that model the interactions of the five users with their mobile devices. Each vector contains aggregated data from the gyroscope and accelerometer sensors. More information about the meaning of the features can be found in the readme file.

Dataset 4 (162.4 MB). This dataset contains 145,465 feature vectors (241 features per vector) that model the interactions of the five users with both personal computers and mobile devices. Each vector contains the aggregation of the most relevant features of both devices. More information about the meaning of the features can be found in the readme file.

Dataset 5 (878.7 KB). This dataset is composed of 7 datasets. Each of them contains an aggregation of feature vectors generated from the active/inactive intervals of personal computers and mobile devices, considering different time windows ranging from 1 h to 24 h:
1h: 4074 vectors
2h: 2149 vectors
3h: 1470 vectors
4h: 1133 vectors
6h: 770 vectors
12h: 440 vectors
24h: 229 vectors
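The sketch below shows the constant-feature filtering described for Dataset 1 using scikit-learn's zero-variance filter. The input file name is a placeholder; the only assumption is that the feature vectors can be loaded as a single numeric matrix.

```python
# Sketch of the constant-feature filtering described for Dataset 1: drop
# digraph features that never vary (always zero). Input file is hypothetical.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.load("pc_interaction_vectors.npy")      # hypothetical (n_vectors, 24065) matrix

selector = VarianceThreshold(threshold=0.0)    # removes zero-variance features
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)          # e.g. (92975, 24065) -> (92975, 8096)
```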
Subsurface data analysis, reservoir modeling, and machine learning (ML) techniques have been applied to the Brady Hot Springs (BHS) geothermal field in Nevada, USA to further characterize the subsurface and assist with optimizing reservoir management. Hundreds of reservoir simulations have been conducted in TETRAD-G and CMG STARS to explore different injection and production fluid flow rates and allocations and to develop a training data set for ML. This process included simulating the historical injection and production since 1979 and predicting future performance through 2040. ML networks were created and trained using TensorFlow based on multilayer perceptron, long short-term memory, and convolutional neural network architectures. These networks took as input selected flow rates, injection temperatures, and historical field operation data and produced estimates of future production temperatures. This approach was first successfully tested on a simplified single-fracture doublet system, followed by application to the BHS reservoir. Using an initial BHS data set with 37 simulated scenarios, the trained and validated network predicted the production temperature for six production wells with a mean absolute percentage error of less than 8%. In a complementary analysis effort, principal component analysis applied to 13 BHS geological parameters revealed that vertical fracture permeability shows the strongest correlation with fault density and fault intersection density. A new BHS reservoir model was developed considering the fault intersection density as a proxy for permeability. This new reservoir model helps to explore underexploited zones in the reservoir. A data gathering plan to obtain additional subsurface data was developed; it includes temperature surveying for three idle injection wells at which the reservoir simulations indicate high bottom-hole temperatures. The collected data will assist with calibrating the reservoir model. Data gathering activities are planned for the first quarter of 2021. This GDR submission includes a preprint of the paper titled "Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs" presented at the 46th Stanford Geothermal Workshop (SGW) on Geothermal Reservoir Engineering, February 16-18, 2021.
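As a hedged illustration of the multilayer-perceptron variant mentioned above, the sketch below maps operational input features to predicted production temperatures for six wells in TensorFlow/Keras. The feature count, layer sizes, and loss are assumptions, not the study's actual architecture.

```python
# Hedged sketch of an MLP mapping flow rates, injection temperatures, and
# historical operation features to six production-well temperatures.
# n_features, layer widths, and the loss are illustrative assumptions.
import tensorflow as tf

n_features = 32          # assumed number of input features
n_wells = 6              # six production wells, as described above

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_wells),            # predicted production temperatures
])
model.compile(optimizer="adam", loss="mae", metrics=["mape"])
# model.fit(X_train, y_train, validation_split=0.2, epochs=100)  # data not shown
```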
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a dataset of images(.jpeg) with their corresponding annotations files(.xml) used to train a bird detection deep learning model. These images were collected from the live stream feeds of Cornell Lab of Ornithology (https://www.allaboutbirds.org/cams/) situated in 6 unique locations around the world as follows:
Treman bird feeding garden at the Cornell Ornithology Laboratory in Ithaca, New York. At this station, Axis P11448-LE cameras are used to capture the recordings from feeders perched on the edge of both Sapsucker Woods and its 10-acre ponds. This site mainly attracts forest species like chickadees (Poecile atricapillus), red-winged blackbirds (Agelaius phoeniceus), and woodpeckers (Picidae). A total of 2065 images were captured from this location.
Fort Davis in Western Texas, USA. At this site, a total of 30 hummingbird feeder cams are hosted at an elevation of over 5500 feet. From this site, 1440 images were captured.
Sachatamia Lodge in Mindo, Ecuador. This site has a live hummingbird feed watcher that attracts over 132 species of hummingbirds including: Fawn-breasted Brilliant, White-necked Jacobin, Purple-bibbed Whitetip, Violet-tailed Sylph, Velvet-purple Coronet, and many others. A total of 2063 images were captured from this location.
Morris County, New Jersey, USA. Feeders at this location attract over 39 species including Red-bellied Woodpecker, Red-winged Blackbird, Purple Finch, Blue Jay, Pine Siskin, Hairy Woodpecker, and others. Footage at this site is captured by an Axis P1448-LE Camera and Axis T8351 Microphone. A total of 1876 images were recorded from this site.
Canopy Lodge in El Valle de Anton, Panama. Over 158 bird species visit this location annually and these include Gray-headed Chachalaca, Ruddy Ground-Dove, White-tipped Dove, Green Hermit, and others. A total of 1600 images were captured.
Southeast tip of South Island, New Zealand. Nearly 10,000 seabirds visit this location annually, and a total of 1548 images were captured here.
The Cornell Lab of Ornithology is an institute dedicated to biodiversity conservation, with a main focus on birds, through research, citizen science, and education. The autoscreen software (https://sourceforge.net/projects/autoscreen/) was used to capture images from the live feeds; approximately 1-megapixel JPEG (Joint Photographic Experts Group) colour images with a resolution of 1366 × 768 × 3 pixels were collected. The software took a new image every 30 seconds, and images were captured at different times of the day in order to avoid a sample-biased dataset. In total, 10592 images were collected for this study.
Files provided
Train.zip – contains 6779 image files (.jpeg) and 6779 annotation files (.xml); a minimal sketch for parsing these annotation files follows the file list.
Validation.zip – contains 1695 image files (.jpeg) and 1695 annotation files (.xml)
Test.zip – contains 2118 image files (.jpeg)
Scripts.zip – contains the scripts needed to manipulate the dataset, such as dataset partitioning and creation of CSV and TFRecord files.
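The sketch below reads bounding boxes from one annotation file. It assumes the .xml files follow the Pascal VOC layout commonly produced by image-labelling tools; that layout is an assumption, not something stated in the dataset description.

```python
# Minimal sketch for reading one annotation file, assuming Pascal VOC-style XML
# (object/name and object/bndbox with xmin/ymin/xmax/ymax). This layout is an
# assumption about the dataset, not a documented fact.
import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path: str):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.findtext("name")             # e.g. "bird"
        bb = obj.find("bndbox")
        boxes.append((
            name,
            int(float(bb.findtext("xmin"))),
            int(float(bb.findtext("ymin"))),
            int(float(bb.findtext("xmax"))),
            int(float(bb.findtext("ymax"))),
        ))
    return boxes

# Example (hypothetical path): read_voc_boxes("Train/annotations/image_0001.xml")
```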
This dataset was used in the MSc thesis titled “Investigating automated bird detection from webcams using machine learning” by Alex Mirugwe, University of Cape Town – South Africa.
https://spdx.org/licenses/CC0-1.0.html
Offline reinforcement learning (RL) is a promising direction that allows RL agents to be pre-trained from large datasets, avoiding recurrence of expensive data collection. To advance the field, it is crucial to generate large-scale datasets. Compositional RL is particularly appealing for generating such large datasets, since 1) it permits creating many tasks from few components, and 2) the task structure may enable trained agents to solve new tasks by combining relevant learned components. This submission provides four offline RL datasets for simulated robotic manipulation created using the 256 tasks from CompoSuite (Mendez et al., 2022). In every CompoSuite task, a robot arm is used to manipulate an object to achieve an objective, all while trying to avoid an obstacle. There are four components for each of these four axes that can be combined arbitrarily, leading to a total of 256 tasks. The component choices are:
* Robot: IIWA, Jaco, Kinova3, Panda
* Object: Hollow box, box, dumbbell, plate
* Objective: Push, pick and place, put in shelf, put in trashcan
* Obstacle: None, wall between robot and object, wall between goal and object, door between goal and object
The four included datasets are collected using separate agents, each trained to a different degree of performance, and each dataset consists of 256 million transitions. The degrees of performance are expert data, medium data, warmstart data, and replay data:
* Expert dataset: Transitions from an expert agent that was trained to achieve 90% success on every task.
* Medium dataset: Transitions from a medium agent that was trained to achieve 30% success on every task.
* Warmstart dataset: Transitions from a Soft Actor-Critic agent trained for a fixed duration of one million steps.
* Medium-replay-subsampled dataset: Transitions that were stored during the training of a medium agent up to 30% success.
These datasets are intended for the combined study of compositional generalization and offline reinforcement learning.

Methods: The datasets were collected by using several deep reinforcement learning agents trained to the various degrees of performance described above on the CompoSuite benchmark (https://github.com/Lifelong-ML/CompoSuite), which builds on top of robosuite (https://github.com/ARISE-Initiative/robosuite) and uses the MuJoCo simulator (https://github.com/deepmind/mujoco). During reinforcement learning training, we stored the data that was collected by each agent in a separate buffer for post-processing. Then, after training, to collect the expert and medium datasets, we ran the trained agents for 2000 trajectories of length 500 online in the CompoSuite benchmark and stored the trajectories. These add up to a total of 1 million state-transition tuples per task, totalling a full 256 million datapoints per dataset. The warmstart and medium-replay-subsampled datasets contain trajectories from the stored training buffers of the SAC agent trained for a fixed duration and the medium agent, respectively. For the medium-replay-subsampled data, we uniformly sample trajectories from the training buffer until we reach more than 1 million transitions. Since some of the tasks have termination conditions, some of these trajectories are truncated and not of length 500. This sometimes results in a number of sampled transitions larger than 1 million. Therefore, after sub-sampling, we artificially truncate the last trajectory and place a timeout at the final position.
This can in some rare cases lead to one incorrect trajectory if the datasets are used for finite horizon experimentation. However, this truncation is required to ensure consistent dataset sizes, easy data readability and compatibility with other standard code implementations. The four datasets are split into four tar.gz folders each yielding a total of 12 compressed folders. Every sub-folder contains all the tasks for one of the four robot arms for that dataset. In other words, every tar.gz folder contains a total of 64 tasks using the same robot arm and four tar.gz files form a full dataset. This is done to enable people to only download a part of the dataset in case they do not need all 256 tasks. For every task, the data is separately stored in an hdf5 file allowing for the usage of arbitrary task combinations and mixing of data qualities across the four datasets. Every task is contained in a folder that is named after the CompoSuite elements it uses. In other words, every task is represented as a folder named
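A hedged sketch for inspecting one task's data with h5py follows. The folder and file names, and the dataset keys inside the file, are assumptions based on typical offline-RL layouts (observations/actions/rewards), not details confirmed by the dataset description; `visit(print)` lists whatever keys the file actually contains.

```python
# Hedged sketch for inspecting a per-task hdf5 file; paths and key names are
# assumptions, so first list the file contents before reading anything.
import h5py

task_dir = "IIWA_Box_Push_None"          # hypothetical <robot>_<object>_<objective>_<obstacle>
with h5py.File(f"{task_dir}/data.hdf5", "r") as f:
    f.visit(print)                        # print every group/dataset name in the file
    # obs = f["observations"][:]          # assumed key names, verify with the listing above
    # act = f["actions"][:]
```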
https://dataverse.no/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.18710/HSMJLL
The dataset comprises the pretraining and testing data for our work: Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations. The pretraining data consists of images corresponding to the Digital Surface Models (DSM) and Digital Terrain Models (DTM) obtained from Norway, with a ground resolution of 1 meter, using the UTM 33N projection. The primary data source for this dataset is the Norwegian Mapping Authority (Kartverket), which has made the data freely available on their website under the CC BY 4.0 license (Source: https://hoydedata.no/, License terms: https://creativecommons.org/licenses/by/4.0/). The DSM and DTM models are generated from 3D LiDAR point clouds collected through periodic aerial campaigns. During these campaigns, the LiDAR sensors capture data with a maximum offset of 20 degrees from the nadir. Additionally, a subset of the data also includes building footprints/labels created using the OpenStreetMap (OSM) database. Specifically, building footprints extracted from the OSM database were rasterized to match the grid of the DTM and DSM models. These rasterized labels are made available under the Open Database License (ODbL) in compliance with the OSM license requirements. We hope this dataset facilitates various applications in geographic analysis, remote sensing, and machine learning research.
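As an illustration only (not part of the dataset or the paper's stated pipeline), the sketch below reads matching DSM and DTM tiles with rasterio and derives a height-above-terrain raster, a common way to combine the two models so that buildings stand out from the terrain. File names are placeholders.

```python
# Illustrative sketch: derive a normalized DSM (DSM - DTM) from co-registered
# 1 m tiles. File names are placeholders; this is not the authors' method.
import rasterio

with rasterio.open("dsm_tile.tif") as dsm_src, rasterio.open("dtm_tile.tif") as dtm_src:
    dsm = dsm_src.read(1).astype("float32")
    dtm = dtm_src.read(1).astype("float32")
    profile = dsm_src.profile

ndsm = dsm - dtm            # height above terrain; buildings/vegetation stand out

profile.update(dtype="float32", count=1)
with rasterio.open("ndsm_tile.tif", "w", **profile) as dst:
    dst.write(ndsm, 1)
```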
https://www.verifiedmarketresearch.com/privacy-policy/
Federated Learning Solutions Market size was valued at USD 151.03 Million in 2024 and is projected to reach USD 292.47 Million by 2031, growing at a CAGR of 9.50% from 2024 to 2031.
Global Federated Learning Solutions Market Drivers
The market drivers for the Federated Learning Solutions Market can be influenced by various factors. These may include:
Data Privacy: Worries about data privacy are becoming more and more of a concern. Federated learning provides a mechanism to train machine learning models without gathering sensitive data centrally, which makes it a desirable solution for companies and organizations.
Data Security: Federated learning makes it possible for data to stay on local devices, lowering the possibility of data breaches and guaranteeing data security, which is essential for sectors like healthcare and finance that handle sensitive data.
Cost-Effectiveness: Federated learning can save organizations money by reducing the requirement for large-scale centralized infrastructure by dispersing the training process to local devices.
Regulatory Compliance: By keeping data local and minimizing data transfer, federated learning offers a solution for enterprises to comply with increasingly strict data protection rules, such as GDPR and HIPAA.
Edge Computing: By enabling model training directly on edge devices, edge computing—where data processing is done closer to the source of data—has boosted the viability and efficiency of federated learning.
Industry Adoption: To capitalize on the advantages of machine learning while resolving privacy and security concerns, a number of businesses, including healthcare, banking, and telecommunications, are progressively implementing federated learning solutions.
Technological developments in AI and ML: Federated learning has become a viable method for training models on dispersed data sources as AI and ML technologies develop, spurring additional market innovation and uptake.
Large language models are enabling rapid progress in robotic verbal communication, but nonverbal communication is not keeping pace. Physical humanoid robots struggle to express and communicate using facial movement, relying primarily on voice. The challenge is twofold: First, the actuation of an expressively versatile robotic face is mechanically challenging. A second challenge is knowing what expression to generate so that the robot appears natural, timely, and genuine. Here we propose that both barriers can be alleviated by training a robot to anticipate future facial expressions and execute them simultaneously with a human. Whereas delayed facial mimicry looks disingenuous, facial co-expression feels more genuine since it requires correctly inferring the human's emotional state for timely execution. We find that a robot can learn to predict a forthcoming smile about 839 milliseconds before the human smiles and, using a learned inverse kinematic facial self-model, co-express the smile simultaneously with the human. During the data collection phase, the robot generated symmetrical facial expressions, which we thought could cover most situations and could reduce the size of the model. We used an Intel RealSense D435i to capture RGB images and cropped them to 480 × 320. We logged each motor command value and robot image to form a single data pair without any human labeling.

# Dataset for Paper "Human-Robot Facial Co-expression"
This dataset accompanies the research on human-robot facial co-expression, aiming to enhance nonverbal interaction by training robots to anticipate and simultaneously execute human facial expressions. Our study proposes a method where robots can learn to predict forthcoming human facial expressions and execute them in real time, thereby making the interaction feel more genuine and natural.
https://doi.org/10.5061/dryad.gxd2547t7
The dataset is organized into several zip files, each containing different components essential for replicating our study's results or for use in related research projects:
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms, such as K-Nearest Neighbors, Bayes' theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the Multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.
Methods
Prior to the collection of data, the researcher was guided by all ethical training certifications on data collection and the right to confidentiality and privacy, overseen by an Institutional Review Board (IRB). Data were collected from the manual archives of the hospitals, purposively selected using a stratified sampling technique, then transformed to electronic form and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, and then checked for the laboratory confirmation result from diagnosis. The data were divided into two tables: the first table, called data1, contains the data used in phase 1 of the classification, while the second table, data2, contains the data used in phase 2 of the classification.
Data Source Collection
The malaria incidence data set was obtained from public hospitals, covering 2017 to 2021. These are the data used for modeling and analysis, also bearing in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading accordingly.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data will be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The first training portion shall be taken from a table stored in the database and will be called training set 1, while the training portion taken from the other table stored in the database shall be called training set 2.
For the purpose of this research, the dataset was split into two parts: a sample containing 70% of the data for training and the remaining 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using the standard metrics.
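A minimal sketch of the 70/30 split and Multinomial Naive Bayes training described above, using scikit-learn. The file name and target column are placeholders standing in for the MySQL export of the hospital records.

```python
# Sketch of the 70/30 split and MNB training described above; file and column
# names are hypothetical placeholders for the exported hospital data.
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = pd.read_csv("malaria.csv")                     # hypothetical export of the database table
X, y = data.drop(columns=["diagnosis"]), data["diagnosis"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

mnb = MultinomialNB()                                  # requires non-negative feature values
mnb.fit(X_train, y_train)
print(classification_report(y_test, mnb.predict(X_test)))
```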
Classification and prediction:
Based on the nature of the variables in the dataset, this study will use Naïve Bayes (Multinomial) classification techniques in two phases: Classification phase 1 and Classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as the test data set table.
iv. Part of the test data set must be compared for classification using classifier 1 and the remaining part must be classified with classifier 2 as follows:
Classifier phase 1: It classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: It classifies only the records that have been classified as positive by classifier 1, and further classifies them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed such that the core parameters, as determining factors, supply their values.
https://www.marketresearchforecast.com/privacy-policy
The Artificial Intelligence (AI) in Manufacturing Market size was valued at USD 1.82 billion in 2023 and is projected to reach USD 6.64 billion by 2032, exhibiting a CAGR of 20.3% during the forecast period. AI in manufacturing is the use of intelligent systems and algorithms in industrial settings to improve productivity and decision making. It uses machine learning, robotics, and analytics to optimize manufacturing operations. Industrial areas of application are supply chain management (SCM), predictive maintenance (PM), quality control (QC), and autonomous robotics (AR). AI systems in manufacturing can be classified in the following ways: supervised learning for predictive maintenance, unsupervised learning for anomaly detection, reinforcement learning for autonomous robotics, and natural language processing for human-machine interaction. Crucial parts of such systems include sensors for data gathering, data processing systems, machine learning systems, robotics, and human-machine interfaces. Right now, trendsetting technologies such as AI with IoT for real-time monitoring, explainable AI for transparency, and AI-driven generative design for product innovation are the most important ingredients for the progress of the technology. Companies experiment with AI-enabled replicas of the manufacturing process and AI-based supply chains that enable them to be more efficient and resilient. Recent developments include: Microsoft and Siemens announce a partnership to develop AI-powered manufacturing solutions
Google and ABB collaborate on AI-based cloud solutions for industrial robotics
IBM and Samsung join forces to advance AI for semiconductor manufacturing. Key drivers for this market are: Rising Demand from the Automotive and Construction Sectors to Aid Market Growth. Potential restraints include: The Change in International Policies is Expected to Impact the Market Growth.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The “Fused Image dataset for convolutional neural Network-based crack Detection” (FIND) is a large-scale image dataset with pixel-level ground truth crack data for deep learning-based crack segmentation analysis. It features four types of image data including raw intensity image, raw range (i.e., elevation) image, filtered range image, and fused raw image. The FIND dataset consists of 2500 image patches (dimension: 256x256 pixels) and their ground truth crack maps for each of the four data types.
The images contained in this dataset were collected from multiple bridge decks and roadways under real-world conditions. A laser scanning device was adopted for data acquisition such that the captured raw intensity and raw range images have pixel-to-pixel location correspondence (i.e., spatial co-registration feature). The filtered range data were generated by applying frequency domain filtering to eliminate image disturbances (e.g., surface variations, and grooved patterns) from the raw range data [1]. The fused image data were obtained by combining the raw range and raw intensity data to achieve cross-domain feature correlation [2,3]. Please refer to [4] for a comprehensive benchmark study performed using the FIND dataset to investigate the impact from different types of image data on deep convolutional neural network (DCNN) performance.
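As a generic illustration of why the pixel-to-pixel co-registration matters, the sketch below stacks the intensity and range modalities channel-wise into a single array for a DCNN input. This is not the FIND fusion procedure, which is the one described in references [2] and [3]; file names and formats are placeholders.

```python
# Generic illustration only: because raw intensity and raw range images are
# pixel-aligned, they can be combined channel-wise. The actual FIND fusion
# follows references [2] and [3], not this sketch. File names are placeholders.
import numpy as np
from PIL import Image

intensity = np.asarray(Image.open("raw_intensity.png").convert("L"), dtype=np.float32)
rng = np.asarray(Image.open("raw_range.png").convert("L"), dtype=np.float32)

def normalize(a: np.ndarray) -> np.ndarray:
    # Scale each modality to [0, 1] before stacking
    return (a - a.min()) / (a.max() - a.min() + 1e-8)

fused = np.stack([normalize(intensity), normalize(rng)], axis=-1)
print(fused.shape)   # e.g. (256, 256, 2) for FIND-sized patches
```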
If you share or use this dataset, please cite [4] and [5] in any relevant documentation.
In addition, an image dataset for crack classification has also been published at [6].
References:
[1] Shanglian Zhou, & Wei Song. (2020). Robust Image-Based Surface Crack Detection Using Range Data. Journal of Computing in Civil Engineering, 34(2), 04019054. https://doi.org/10.1061/(asce)cp.1943-5487.0000873
[2] Shanglian Zhou, & Wei Song. (2021). Crack segmentation through deep convolutional neural networks and heterogeneous image fusion. Automation in Construction, 125. https://doi.org/10.1016/j.autcon.2021.103605
[3] Shanglian Zhou, & Wei Song. (2020). Deep learning–based roadway crack classification with heterogeneous image data fusion. Structural Health Monitoring, 20(3), 1274-1293. https://doi.org/10.1177/1475921720948434
[4] Shanglian Zhou, Carlos Canchila, & Wei Song. (2023). Deep learning-based crack segmentation for civil infrastructure: data types, architectures, and benchmarked performance. Automation in Construction, 146. https://doi.org/10.1016/j.autcon.2022.104678
[5] Shanglian Zhou, Carlos Canchila, & Wei Song. (2022). Fused Image dataset for convolutional neural Network-based crack Detection (FIND) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6383044
[6] Wei Song, & Shanglian Zhou. (2020). Laser-scanned roadway range image dataset (LRRD). Laser-scanned Range Image Dataset from Asphalt and Concrete Roadways for DCNN-based Crack Classification, DesignSafe-CI. https://doi.org/10.17603/ds2-bzv3-nc78
Monitoring asthma condition is essential to asthma self-management. However, traditional methods of monitoring require high levels of active engagement, and patients may regard this level of monitoring as tedious. Passive monitoring with mobile health devices, especially when combined with machine learning, provides an avenue to dramatically reduce management burden. However, data for developing machine learning algorithms are scarce, and gathering new data is expensive. A few asthma mHealth datasets are publicly available, but they lack the objective and passively collected data which may enhance asthma attack prediction systems. To fill this gap, we carried out the 2-phase, 7-month AAMOS-00 observational study to collect data about asthma status using three smart monitoring devices (smart peak flow meter, smart inhaler, smartwatch) and daily symptom questionnaires. Combined with localised weather, pollen, and air quality reports, we have collected a rich longitudinal dataset to explore the feasibility of passive monitoring and asthma attack prediction. During phase 2 of device monitoring, conducted over 12 months from June 2021 to June 2022 and during the COVID-19 pandemic, 22 participants across the UK provided 2,054 unique patient-days of data. This valuable anonymised dataset has been made publicly available with the consent of participants. Ethics approval was provided by the East of England - Cambridge Central Research Ethics Committee, IRAS project ID: 285505, with governance approval from ACCORD (Academic and Clinical Central Office for Research and Development), project number: AC20145. The study sponsor was ACCORD, the University of Edinburgh. The anonymised dataset was produced with statistical advice from Aryelly Rodriguez, Statistician, Edinburgh Clinical Trials Unit, University of Edinburgh.
Protocol: 'Predicting asthma attacks using connected mobile devices and machine learning; the AAMOS-00 observational study protocol', BMJ Open, DOI: 10.1136/bmjopen-2022-064166
Thesis: Tsang, Kevin CH, 'Application of data-driven technologies for asthma self-management' (2022) [Doctoral Thesis], University of Edinburgh, https://era.ed.ac.uk/handle/1842/40547
The dataset also relates to the publication: K.C.H. Tsang, H. Pinnock, A.M. Wilson, D. Salvi and S.A. Shah (2023), 'Home monitoring with connected mobile devices for asthma attack prediction with machine learning', Scientific Data 10 (https://doi.org/10.1038/s41597-023-02241-9).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". The source data contains CSV files with dataset result summaries, false positive lists, the evaluated sentences, and their keystroke timings. The results data contains training and evaluation ARFF files for each user and sentence with the calculated Manhattan and Euclidean distances, the R metric, and the directionality index for each challenge instance. The source data comes from three free-text keystroke dynamics datasets used in previous studies by the authors (LSIA) and two other unrelated groups (KM, and PROSODY, subdivided into GAY, GUN, and REVIEW). Two different languages are represented: Spanish in LSIA and English in KM and PROSODY. The original KM dataset was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original PROSODY dataset was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y. We proposed a method to find, using only flight times (keydown/keydown), whether a medium-sized candidate list of possible texts includes the one to which the timings belong. Neither the text length nor the candidate text list was restricted, and previous samples of the timing parameters for the candidates were not required to train the model. The method was evaluated using three datasets collected by non-mutually-collaborating sets of authors in different environments. False acceptance and false rejection rates were found to remain below or very near 1% when user data was available for training. The former increased between two- and three-fold when the models were trained with data from other users, while the latter jumped to around 15%. These error rates are competitive against current methods for text recovery based on keystroke timings, and show that the method can be used effectively even without user-specific samples for training, by resorting to general population data.
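An illustrative sketch of two of the per-challenge features mentioned above, the Manhattan and Euclidean distances between flight-time vectors. The R metric and directionality index from the paper are not reproduced here, and the timing vectors below are placeholders.

```python
# Sketch: Manhattan and Euclidean distance between an observed flight-time
# vector and a candidate text's expected timings. Values are placeholders.
import numpy as np

observed = np.array([112.0, 95.0, 143.0, 88.0])    # keydown/keydown flight times (ms)
candidate = np.array([105.0, 99.0, 150.0, 80.0])   # expected timings for a candidate text

manhattan = np.abs(observed - candidate).sum()
euclidean = np.linalg.norm(observed - candidate)
print(manhattan, euclidean)
```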