100+ datasets found
  1. Eye Tracking based Learning Style Identification for Learning Management...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf, tsv
    Updated Jul 11, 2024
    Cite
    Dominik Bittner; Timur Ezer; Lisa Grabinger; Florian Hauser; Jürgen Mottok (2024). Eye Tracking based Learning Style Identification for Learning Management Systems [Dataset]. http://doi.org/10.5281/zenodo.8349468
    Explore at:
    Available download formats: bin, tsv, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Bittner; Timur Ezer; Lisa Grabinger; Florian Hauser; Jürgen Mottok
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Abstract:

    In recent years, universities have been faced with increasing numbers of students dropping out. This is partly due to the fact that students are limited in their ability to explore individual learning paths through different course materials. However, a promising remedy to this issue is the implementation of adaptive learning management systems. These systems recommend customised learning paths to students - based on their individual learning styles. Learning styles are commonly classified using questionnaires and learning analytics, but both methods are prone to error. Questionnaires may yield superficial responses due to time constraints or lack of motivation, while learning analytics ignore offline learning behaviour. To address these limitations, this study aims to integrate Eye Tracking for a more accurate classification of students' learning styles. Ultimately, this comprehensive approach could not only open up a deeper understanding of subconscious processes, but also provide valuable insights into students' unique learning preferences.

    Research:

    As an example of a possible analysis of the eye-tracking stimuli and eye movement recordings available here, as well as the corresponding ILS questionnaire responses, we refer to the following research works, which should be cited where applicable:

    • Bittner, D., Nadimpalli, V. K., Grabinger, L., Ezer, T., Hauser, F., & Mottok, J. (2024, June), Uncovering Learning Styles through Eye Tracking and Artificial Intelligence, In 2024 Symposium on Eye Tracking Research and Applications. ETRA.
    • Bittner, D. (2024), Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence. Master’s Thesis, Regensburg University of Applied Sciences (OTH), Regensburg, Germany
    • Bittner, D., Ezer, T., Grabinger, L., Hauser, F., & Mottok, J. (2023). Unveiling the secrets of learning styles: decoding eye movements via machine learning. In ICERI2023 Proceedings (pp. 5153-5162). IATED.
    • Bittner, D., Hauser, F., Nadimpalli, V. K., Grabinger, L., Staufer, S., & Mottok, J. (2023, June). Towards eye tracking based learning style identification. In Proceedings of the 5th European Conference on Software Engineering Education (pp. 138-147). ECSEE.

    The following descriptions and the previous abstract are part of the Master's thesis "Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence" by Bittner D. and have to be cited accordingly.

    Experimental Setup:

    In the following section, crucial notes on the circumstances, the experiment itself, and the equipment are given.
    In order to reduce external influences on the experiment, variables such as:

    • order, number, and presentation of the stimuli,
    • instruction to the participant prior to the experiment,
    • position of the participant in respect to the Eye Tracking equipment,
    • environment such as illuminance and ambient noise for the participant,
    • Eye Tracking equipment, software, settings such as sampling frequency and latency as well as calibration

    were kept as constant and consistent as possible throughout the experiment.

    Equipment:

    In this study, the Tobii Pro Fusion (https://go.tobii.com/tobii-pro-fusion-user-manual) eye tracker is utilized without a chin rest, along with the Tobii IVT filter for fixation detection and Tobii Pro Lab software for data collection. The Tobii Pro Fusion is a video-based eye tracker using combined pupil and corneal reflection technology. This tracker provides several advantages, such as the collection of comprehensive data comprising gaze, pupil, and eye-opening metrics. The eye tracker captures up to 250 images per second (250 Hz), enhancing the precision of eye movement analysis. In addition, the Tobii Pro Fusion is capable of performing under different lighting conditions, making this portable device ideal for off-site studies.

    Ensuring consistent quality across all experiment participants is crucial. Prior to each individual experiment, the eye tracker is calibrated, aiming for a maximum reproduction error of at most 0.2 degrees during calibration to minimize deviations. The calibration is excluded from the experiment recording. Each participant is given the same instructions for their single trial of the experiment. The stimuli are displayed on a 24-inch monitor in 16:9 format, positioned approximately 65 cm away from the participants' eyes. Any effects related to the characteristics of the participants, such as age, visual acuity, eye colour, pupil size, etc., are considered in the experiment design.

    Procedure:

    Initially, the participants are requested to confirm their ability to conduct the experiment based on their current condition. Subsequently, the participant is positioned comfortably and accurately in relation to the eye tracker. The eye tracker calibration is carried out for each participant to ensure a suitable experimental configuration. Once a successful calibration is achieved, the Eye Tracking experiment begins, with instructions given prior to each task. The stimuli presentation is unrestricted by time constraints, and no prior knowledge of the stimuli contents is necessary. Employing a within-subject design, each subject is exposed to every stimulus. Following completion of the experiment, participants anonymously answer the ILS questionnaire. To prevent any impact on the experiment, it is important that the questionnaire is only seen and completed after the experiment.

    Stimuli:

    The specially designed stimuli shown to participants during the study are illustrated in the left-hand column of the figure in the PDF file "[Documentation]stimuli_preview.pdf", which is part of the Master's thesis "Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence" by Bittner D. For this research, only specific regions of a stimulus, referred to as areas of interest (AOIs), are taken into consideration. The size of an AOI depends on both the stimulus information and the distance between multiple AOIs. Adequate results are ensured by non-overlapping AOIs and appropriate spacing. The AOIs of the various stimuli employed in this research are illustrated in the right-hand column of the figure in the same PDF file. The stimuli are presented in German, ensuring reliable Eye Tracking measurements without any interference from language barriers. Each stimulus comprises diverse learning materials to engage students with varying learning styles, with some general information about the quantitative research cycle. Some stimuli feature identical types of material, e.g. illustrations or key words, but with different contexts and positions on the stimuli. Rearranging the identical material reduces the influence of reading style and enhances the impact of the learning style, producing a more reliable experiment. These identical types of material or AOIs on different stimuli can be grouped together, identified by the same colour and title, and referred to as AOI groupings.
    There are ten different AOI groupings in total, as illustrated in the figure in the "[Documentation]stimuli_preview.pdf" file, where each grouping consists of several AOIs.
    In detail, the AOI grouping regarding:

    • table of contents and summary contain only a single AOI each,
    • illustrations, key words, theory, exercise, example and additional material contain three AOIs each,
    • supporting text and multiple choice question contain two AOIs each.
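
    As an illustration of how the exported eye movement recordings and these AOI definitions might be combined, the following sketch assigns fixation points to AOI groupings. It is only a hypothetical example: the TSV column names follow typical Tobii Pro Lab exports and the bounding-box coordinates are placeholders, so both must be adapted to the actual files in this dataset.

```python
# Hypothetical sketch: assigning exported fixations to AOI groupings.
# Column names follow typical Tobii Pro Lab TSV exports and the AOI
# bounding boxes are placeholders; adjust both to the actual files.
import pandas as pd

# AOI groupings as (x_min, y_min, x_max, y_max) in screen pixels (placeholders)
AOI_GROUPINGS = {
    "illustrations": [(100, 200, 400, 500)],
    "key_words": [(450, 200, 700, 300)],
    "summary": [(100, 900, 900, 1000)],
}

def label_fixation(x, y):
    """Return the AOI grouping that contains the fixation point, if any."""
    for group, boxes in AOI_GROUPINGS.items():
        for (x0, y0, x1, y1) in boxes:
            if x0 <= x <= x1 and y0 <= y <= y1:
                return group
    return "outside_aoi"

fixations = pd.read_csv("recording_export.tsv", sep="\t")
fixations = fixations.dropna(subset=["Fixation point X", "Fixation point Y"])
fixations["aoi_group"] = [
    label_fixation(x, y)
    for x, y in zip(fixations["Fixation point X"], fixations["Fixation point Y"])
]

# Total fixation count per AOI grouping, e.g. as input for learning style analysis
print(fixations["aoi_group"].value_counts())
```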

    Research data management:

    To ensure the transparency and reproducibility of this study, effective management of research data is essential. This section provides details on the management, storage and analysis of the extensive dataset collected as part of the study. Importantly, this research, the study and its processes adhered to ethical guidelines at all times, including informed consent, participant anonymity and secure data handling. The data collected will only be kept for a specific period of time as defined in the research project guidelines. The collection itself involves the recording of participants' eye movements during the ET study and the collection of their demographic data and responses to the ILS questionnaire.

  2. A Messy Handwriting Dataset with Student Crossouts and Corrections...

    • research-repository.rmit.edu.au
    • researchdata.edu.au
    zip
    Updated Oct 24, 2023
    Cite
    Hiqmat Nisa (2023). A Messy Handwriting Dataset with Student Crossouts and Corrections (Line-version) [Dataset]. http://doi.org/10.25439/rmt.24419986.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    RMIT University
    Authors
    Hiqmat Nisa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1). Within the central repository, there are subfolders with each document converted into lines. All images are in .png format. In the main folder there are three .txt files:

    1) SMHD.txt contains all the line-level transcriptions in the form of image name, threshold value, label (e.g. 0001-000,178 Bombay Phenotype :-).
    2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset having crossed-out and inserted text.
    3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.

    In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating easy identification of files with or without such modifications.

    Dataset Description: We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy, with all kinds of insertions and corrections, and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines on them. For this purpose, we drew lines using light colours on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi with a grey-level resolution of 8 bits.

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words; we called it the summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of printed text articles given to the participants was collected from the Internet on different topics; the articles were related to current political situations, daily life activities, and the Covid-19 pandemic. In the second exercise, we asked participants to write an essay from a given list of topics, or on any topic of their choice; we called it the essay-based dataset. This dataset was collected from 250 high school students, who were given 30 minutes to think about the topic and write. In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study; we called it the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies. Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with their classmates; we did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class notes compared to the summary-based and academic-based collections. In all four exercises, we did not impose any rules on the writers, for example regarding spacing or usage of a pen. We asked them to cross out text if it seemed inappropriate. Although writers usually made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
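
    As a hedged illustration of how the transcription files described above could be read, the following sketch parses SMHD.txt and flags lines containing crossed-out text (marked with '#'). The field layout (image name, a comma, then the threshold value and the label) is inferred from the example above and should be checked against the actual file.

```python
# Hedged sketch: reading SMHD.txt line-level transcriptions and flagging
# lines that contain crossed-out text (marked with '#').
# The field layout is inferred from the example "0001-000,178 Bombay Phenotype :-";
# adjust the parsing if the actual delimiter differs.
records = []
with open("SMHD.txt", encoding="utf-8") as f:
    for raw in f:
        raw = raw.strip()
        if not raw:
            continue
        image_name, rest = raw.split(",", 1)   # image name before the comma
        threshold, label = rest.split(" ", 1)  # threshold value, then transcription
        records.append({
            "image": image_name,
            "threshold": int(threshold),
            "label": label,
            "has_crossout": "#" in label,
        })

messy = [r for r in records if r["has_crossout"]]
print(f"{len(messy)} of {len(records)} lines contain crossed-out text")
```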

  3. Industrial screw driving dataset collection: Time series data for process...

    • data.niaid.nih.gov
    Updated Feb 18, 2025
    + more versions
    Cite
    West, Nikolai (2025). Industrial screw driving dataset collection: Time series data for process monitoring and anomaly detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14729547
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Deuse, Jochen
    West, Nikolai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Industrial Screw Driving Datasets

    Overview

    This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.

    Scenario name | Number of work pieces | Repetitions (screw cycles) per workpiece | Individual screws per workpiece | Observations | Unique classes | Purpose
    s01_thread-degradation | 100 | 25 | 2 | 5,000 | 1 | Investigation of thread degradation through repeated fastening
    s02_surface-friction | 250 | 25 | 2 | 12,500 | 8 | Surface friction effects on screw driving operations
    s03_error-collection-1 | – | 1 | 2 | – | 20 | –
    s04_error-collection-2 | 2,500 | 1 | 2 | 5,000 | 25 | –
    s05_injection-molding-manipulations-upper-workpiece | 1,200 | 1 | 2 | 2,400 | 44 | Investigation of changes in the injection molding process of the workpieces

    Dataset Collection

    The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:

    Currently Available Datasets:

    1. s01_thread-degradation

    Focus: Investigation of thread degradation through repeated fastening

    Samples: 5,000 screw operations (4,089 normal, 911 faulty)

    Features: Natural degradation patterns, no artificial error induction

    Equipment: Delta PT 40x12 screws, thermoplastic components

    Process: 25 cycles per location, two locations per workpiece

    First published in: HICSS 2024 (West & Deuse, 2024)

    2. s02_surface-friction

    Focus: Surface friction effects on screw driving operations

    Samples: 12,500 screw operations (9,512 normal, 2,988 faulty)

    Features: Eight distinct surface conditions (baseline to mechanical damage)

    Equipment: Delta PT 40x12 screws, thermoplastic components, surface treatment materials

    Process: 25 cycles per location, two locations per workpiece

    First published in: CIE51 2024 (West & Deuse, 2024)

    3. s05_injection-molding-manipulations-upper-workpiece

    Manipulations of the injection molding process with no changes during tightening

    Samples: 2,400 screw operations (2,397 normal, 3 faulty)

    Features: 44 classes in five distinct groups:

    Mold temperature

    Glass fiber content

    Recyclate content

    Switching point

    Injection velocity

    Equipment: Delta PT 40x12 screws, thermoplastic components

    Unpublished, work in progress

    Upcoming Datasets:

    1. s03_screw-error-collection-1 (recorded but unpublished)

    Focus: Various manipulations of the screw driving process

    Features: More than 20 different errors recorded

    First published in: Publication planned

    Status: In preparation

    2. s04_screw-error-collection-2 (recorded but unpublished)

    Focus: Various manipulations of the screw driving process

    Features: 25 distinct errors recorded over the course of a week

    First published in: Publication planned

    Status: In preparation

    3. s06_injection-molding-manipulations-lower-workpiece (recorded but unpublished)

    Manipulations of the injection molding process with no changes during tightening

    Additional scenarios may be added to this collection as they become available.

    Data Format

    Each dataset follows a standardized structure:

    JSON files containing individual screw operation data

    CSV files with operation metadata and labels

    Comprehensive documentation in README files

    Example code for data loading and processing is available in the companion library PyScrew
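
    For orientation, the sketch below reads one scenario using only the Python standard library, assuming the layout implied by the change log further down (a label.csv file and per-class subdirectories under json/). PyScrew remains the recommended loader, and the exact file names should be taken from each scenario's README.

```python
# Hedged sketch using only standard JSON/CSV tooling (the PyScrew library is
# the recommended route). The scenario directory is a placeholder; the
# label.csv and json/ layout follow the change log below.
import csv
import json
from pathlib import Path

scenario_dir = Path("s01_thread-degradation")  # placeholder path

# Operation metadata and labels (CSV)
with open(scenario_dir / "label.csv", newline="") as f:
    labels = list(csv.DictReader(f))
print(f"{len(labels)} labeled operations")

# Individual screw operations (JSON), organized as subdirectories per class
operations = []
for json_file in sorted((scenario_dir / "json").rglob("*.json")):
    with open(json_file) as f:
        operations.append(json.load(f))
print(f"{len(operations)} time series loaded (torque/angle curves)")
```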

    Research Applications

    These datasets are suitable for various research purposes:

    Machine learning model development and validation

    Process monitoring and control systems

    Quality assurance methodology development

    Manufacturing analytics research

    Anomaly detection algorithm benchmarking

    Usage Notes

    All datasets include both normal operations and process anomalies

    Complete time series data for torque, angle, and additional parameters available

    Detailed documentation of experimental conditions and setup

    Data collection procedures and equipment specifications available

    Access and Citation

    These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.

    Related Tools

    We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries. Common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.

    Contact and Support

    For questions, issues, or collaboration interests regarding these datasets, either:

    Open an issue in our GitHub repository PyScrew

    Contact us directly via email

    Acknowledgments

    These datasets were collected and prepared by:

    RIF Institute for Research and Transfer e.V.

    University of Kassel, Institute of Material Engineering

    Technical University Dortmund, Institute for Production Systems

    The preparation and provision of the research was supported by:

    German Ministry of Education and Research (BMBF)

    European Union's "NextGenerationEU" program

    The research is part of this funding program

    More information regarding the research project is available here

    Change Log

    Version Date Features

    v1.1.3 18.02.2025

    • Upload of s05 with injection molding manipulations in 44 classes

    v1.1.2 12.02.2025

    • Change to default names label.csv and README.md in all scenarios

    v1.1.1 12.02.2025

    • Reupload of both s01 and s02 as zip (smaller size) and tar (faster extraction) files

    • Change to the data structure (now organized as subdirectories per class in json/)

    v1.1.0 30.01.2025

    • Initial upload of the second scenario s02_surface-friction

    v1.0.0 24.01.2025

    • Initial upload of the first scenario s01_thread-degradation
  4. Data from: Dataset for Investigating Anomalies in Compute Clusters

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 29, 2023
    Cite
    Hild, Laura (2023). Dataset for Investigating Anomalies in Compute Clusters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10058229
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Jones, Mark
    Schram, Malachi
    Hild, Laura
    Moore, Wesley
    McSpadden, Diana
    Smirni, Evgenia
    Mohammed, Ahmed
    Lu, Yiyang
    Hess, Bryan
    Ren, Jie
    Yasir, Alanazi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180 GB of raw data.

    Background

    Motivated by the goal to develop a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which is used to run an assortment of physics analysis and simulation jobs, where analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff. The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network have the capability to detect anomalies that may propagate to compute nodes executing the same job.

    Usage Notes

    While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the set of affected nodes and the exact start and end times during which they exhibit abnormal behavior are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster. https://doi.org/10.48550/arXiv.2311.16129
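
    Because the anomaly is not precisely labeled, an unsupervised detector is one natural way to use the data. The sketch below fits an Isolation Forest on the May 19-22 records and scores May 23; it is an assumption-laden example, as the CSV path and the timestamp/node column names are placeholders rather than the dataset's actual layout.

```python
# Hedged sketch: unsupervised anomaly detection on per-node metrics with an
# Isolation Forest. The CSV path and metric column names are placeholders;
# the actual dataset layout is described in its documentation.
import pandas as pd
from sklearn.ensemble import IsolationForest

metrics = pd.read_csv("node_metrics.csv", parse_dates=["timestamp"])

# Assume all remaining columns are numeric metric values
feature_cols = [c for c in metrics.columns if c not in ("timestamp", "node")]
train = metrics[metrics["timestamp"] < "2023-05-23"]   # normal behavior (May 19-22)
test = metrics[metrics["timestamp"] >= "2023-05-23"]   # day containing the anomaly

model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(train[feature_cols])

# Lower scores indicate more anomalous records
test = test.assign(anomaly_score=model.decision_function(test[feature_cols]))
print(test.nsmallest(10, "anomaly_score")[["timestamp", "node", "anomaly_score"]])
```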

  5. Data for: The Bystander Affect Detection (BAD) Dataset for Failure Detection...

    • data.qdr.syr.edu
    pdf, tsv, txt, zip
    Updated Sep 25, 2023
    Cite
    Alexandra Bremers; Xuanyu Fang; Natalie Friedman; Wendy Ju (2023). Data for: The Bystander Affect Detection (BAD) Dataset for Failure Detection in HRI [Dataset]. http://doi.org/10.5064/F6TAWBGS
    Explore at:
    Available download formats: zip, tsv, txt, pdf
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Alexandra Bremers; Xuanyu Fang; Natalie Friedman; Wendy Ju
    License

    https://qdr.syr.edu/policies/qdr-restricted-access-conditions

    Description

    Project Overview

    For a robot to repair its own error, it must first know it has made a mistake. One way that people detect errors is from the implicit reactions from bystanders – their confusion, smirks, or giggles clue us in that something unexpected occurred. To enable robots to detect and act on bystander responses to task failures, we developed a novel method to elicit bystander responses to human and robot errors.

    Data Overview

    This project introduces the Bystander Affect Detection (BAD) dataset – a dataset of videos of bystander reactions to videos of failures. This dataset includes 2,452 human reactions to failure, collected in contexts that approximate “in-the-wild” data collection – including natural variances in webcam quality, lighting, and background. The BAD dataset may be requested for use in related research projects. As the dataset contains facial video data of participants, access can be requested along with the presentation of a research protocol and data use agreement that protects participants.

    Data Collection Overview and Access Conditions

    Using 46 different stimulus videos featuring a variety of human and machine task failures, we collected a total of 2,452 webcam videos of human reactions from 54 participants. Recruitment happened through the online behavioral research platform Prolific (https://www.prolific.co/about), where the options were selected to recruit a gender-balanced sample across all countries available. Participants had to use a laptop or desktop. Compensation was set at the Prolific rate of $12/hr, which came down to about $8 per participant for about 40 minutes of participation. Participants agreed that their data can be shared for future research projects and the data were approved to be shared publicly by IRB review. However, considering the fact that this is a machine-learning dataset containing identifiable crowdsourced human subjects data, the research team has decided that potential secondary users of the data must meet the following criteria for the access request to be granted:

    1. Agreement to three usage terms:
       - I will not redistribute the contents of the BAD Dataset
       - I will not use videos for purposes outside of human interaction research (broadly defined as any project that aims to study or develop improvements to human interactions with technology to result in a better user experience)
       - I will not use the videos to identify, defame, or otherwise negatively impact the health, welfare, employment or reputation of human participants
    2. A description of what you want to use the BAD dataset for, indicating any applicable human subjects protection measures that are in place. (For instance, "Me and my fellow researchers at University of X, lab of Y, will use the BAD dataset to train a model to detect when our Nao robot interrupts people at awkward times. The PI is Professor Z. Our protocol was approved under IRB #.")
    3. A copy of the IRB record or ethics approval document, confirming the research protocol and institutional approval.

    Data Analysis

    To test the viability of the collected data, we used the Bystander Reaction Dataset as input to a deep-learning model, BADNet, to predict failure occurrence. We tested different data labeling methods and learned how they affect model performance, achieving precisions above 90%.

    Shared Data Organization

    This data project consists of 54 zipped folders of recorded video data organized by participant, totaling 2,452 videos. The accompanying documentation includes a file containing the text of the consent form used for the research project, an inventory of the stimulus videos used, aggregate survey data, this data narrative, and an administrative readme file.

    Special Notes

    The data were approved to be shared publicly by IRB review. However, considering the fact that this is a machine-learning dataset containing identifiable crowdsourced human subjects data, the research team has decided that potential secondary users of the data must meet specific criteria before they qualify for access. Please consult the Terms tab below for more details and follow the instructions there if interested in requesting access.

  6. WaivOps EDM-HSE: Open Audio Resources for Machine Learning in Music

    • data.niaid.nih.gov
    Updated Oct 11, 2024
    Cite
    Patchbanks (2024). WaivOps EDM-HSE: Open Audio Resources for Machine Learning in Music [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13769543
    Explore at:
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    WaivOps
    Patchbanks
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EDM-HSE Dataset

    EDM-HSE is an open audio dataset containing a collection of code-generated drum recordings in the style of modern electronic house music. It includes 8,000 audio loops recorded in uncompressed stereo WAV format, created using custom audio samples and a MIDI drum dataset. The dataset also comes with paired JSON files containing MIDI note numbers (pitch) and tempo data, intended for supervised training of generative AI audio models.

    Overview

    The EDM-HSE Dataset was developed using an algorithmic framework to generate probable drum notations commonly played by EDM music producers. For supervised training with labeled data, a variational mixing technique was applied to the rendered audio files. This method systematically includes or excludes drum notes, assisting the model in recognizing patterns and relationships between drum instruments, thereby enhancing its generalization capabilities.

    The primary purpose of this dataset is to provide accessible content for machine learning applications in music and audio. Potential use cases include generative music, feature extraction, tempo detection, audio classification, rhythm analysis, drum synthesis, music information retrieval (MIR), sound design and signal processing.

    Specifications

    8,000 audio loops (approximately 17 hours)

    16-bit WAV format

    Tempo range: 120–130 BPM

    Paired label data (WAV + JSON)

    Variational drum patterns

    Subgenre styles (Big room, electro, minimal, classic)

    A JSON file is provided for referencing and converting MIDI note numbers to text labels. You can update the text labels to suit your preferences.
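
    A minimal loading sketch for one loop and its paired label file is shown below. The file naming and the JSON keys ("tempo", "notes") are assumptions; consult the GitHub repository referenced under Additional Info for the real schema.

```python
# Hedged sketch: loading one audio loop and its paired JSON label file.
# The file stem and the JSON keys ("tempo", "notes") are assumptions;
# check the dataset's own documentation for the exact schema.
import json
import wave

loop_id = "edm_hse_0001"  # placeholder file stem

with wave.open(f"{loop_id}.wav", "rb") as wav:
    sample_rate = wav.getframerate()
    duration_s = wav.getnframes() / sample_rate
    print(f"{loop_id}: {sample_rate} Hz, {duration_s:.2f} s, {wav.getnchannels()} channels")

with open(f"{loop_id}.json") as f:
    labels = json.load(f)

tempo = labels.get("tempo")      # expected within the 120-130 BPM range
notes = labels.get("notes", [])  # MIDI note numbers (pitch) per drum hit
print(f"tempo: {tempo} BPM, {len(notes)} note events")
```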

    License

    This dataset was compiled by WaivOps, a crowdsourced music project managed by the sound label company Patchbanks. All recordings have been compiled by verified sources for copyright clearance.

    The EDM-HSE dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

    Additional Info

    Please note that this dataset has not been fully reviewed and may contain minor notational errors or audio defects.

    For audio examples or more information about this dataset, please refer to the GitHub repository.

  7. Probabilistic AI: A New Approach to Artificial Intelligence (Forecast)

    • kappasignal.com
    Updated May 27, 2023
    + more versions
    Cite
    KappaSignal (2023). Probabilistic AI: A New Approach to Artificial Intelligence (Forecast) [Dataset]. https://www.kappasignal.com/2023/05/probabilistic-ai-new-approach-to.html
    Explore at:
    Dataset updated
    May 27, 2023
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    Probabilistic AI: A New Approach to Artificial Intelligence

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
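
    As a small, hedged illustration of the feature engineering listed above, the sketch below derives a 20-day simple moving average and a basic 14-day RSI from a daily price table; the file name and the "close" column are assumptions about how such data might be laid out.

```python
# Hedged sketch of basic feature engineering on daily price data:
# a simple moving average and a standard 14-day RSI. The input file
# and column names ("date", "close") are assumptions.
import pandas as pd

prices = pd.read_csv("daily_prices.csv", parse_dates=["date"]).sort_values("date")

# 20-day simple moving average of the closing price
prices["sma_20"] = prices["close"].rolling(window=20).mean()

# 14-day Relative Strength Index (simple-average variant)
delta = prices["close"].diff()
gain = delta.clip(lower=0).rolling(window=14).mean()
loss = (-delta.clip(upper=0)).rolling(window=14).mean()
prices["rsi_14"] = 100 - 100 / (1 + gain / loss)

print(prices[["date", "close", "sma_20", "rsi_14"]].tail())
```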

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data

  8. IoT Monitoring Dataset of Water Quality and Tilapia (Oreochromis niloticus)...

    • data.mendeley.com
    Updated Nov 5, 2024
    + more versions
    Cite
    Rubén Baena-Navarro (2024). IoT Monitoring Dataset of Water Quality and Tilapia (Oreochromis niloticus) Health in Aquaculture Ponds in Montería, Colombia (2024) [Dataset]. http://doi.org/10.17632/3g2b4sh65m.1
    Explore at:
    Dataset updated
    Nov 5, 2024
    Authors
    Rubén Baena-Navarro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Colombia, Montería
    Description

    This dataset contains six months of water quality and tilapia (Oreochromis niloticus) health monitoring data collected from aquaculture ponds in Montería, Colombia. Using an IoT-based monitoring system, critical parameters such as dissolved oxygen (DO), pH, water temperature, and turbidity were recorded. Fish health indicators, including average fish weight and survival rate, are also included. Data was collected from January to June 2024, with hourly readings to capture daily fluctuations and ensure comprehensive monitoring of aquaculture conditions and tilapia well-being.

    Included Files

    1. Data Model IoTMLCQ 2024.xlsx
       • Contains sensor readings and fish health data collected over six months.
       • Columns:
         • Datetime: Date and time of each reading.
         • Month: Data collection month (January to June).
         • Average Fish Weight (g): Average weight of the tilapia fish in grams.
         • Survival Rate (%): Percentage of fish survival during the monitoring period.
         • Disease Occurrence (Cases): Number of disease cases observed.
         • Temperature (°C): Water temperature readings.
         • Dissolved Oxygen (mg/L): Levels of dissolved oxygen in the water.
         • pH: Water pH values.
         • Turbidity (NTU): Water turbidity measured in Nephelometric Turbidity Units (NTU).
         • Oxygenation Automatic: Indicates if automatic oxygenation was applied (Yes/No).
         • Oxygenation Interventions: Oxygenation interventions applied (Yes/No).
         • Corrective Interventions: Number of corrective measures taken.
         • Thermal Risk Index: Indicates if the thermal risk is "High" or "Normal."
         • Low Oxygen Alert: Shows "Critical" if DO levels are below 5 mg/L, otherwise "Safe."
         • Health Status: Fish health status, showing "At Risk" or "Stable" based on thermal and oxygen risk alerts.

    Data Collection Method

    Data was collected using IoT sensors strategically placed in the aquaculture ponds. Readings were taken every hour throughout the monitoring period. This dataset provides valuable insights into the relationship between water quality parameters and the health of tilapia (Oreochromis niloticus) in controlled aquaculture conditions.

    Usage Notes

    • This dataset is useful for research in aquaculture management, water quality monitoring, and predictive modeling of fish health and growth.
    • Missing data due to sensor or communication failures were addressed using interpolation methods.
    • Regular sensor calibrations were performed to ensure accuracy in the collected data.
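
    The sketch below shows how the "Low Oxygen Alert" rule described in the column list above (Critical when dissolved oxygen is below 5 mg/L) could be recomputed from the spreadsheet. The column names follow the file description, and reading the .xlsx file assumes an engine such as openpyxl is installed.

```python
# Hedged sketch: recomputing the "Low Oxygen Alert" flag described above
# (Critical when dissolved oxygen falls below 5 mg/L, otherwise Safe).
# Column names follow the file description; reading .xlsx requires openpyxl.
import pandas as pd

df = pd.read_excel("Data Model IoTMLCQ 2024.xlsx")

df["Low Oxygen Alert (derived)"] = df["Dissolved Oxygen (mg/L)"].apply(
    lambda do: "Critical" if do < 5 else "Safe"
)

# Share of hourly readings in the critical range, per month
critical_share = (
    df.assign(is_critical=df["Dissolved Oxygen (mg/L)"] < 5)
      .groupby("Month")["is_critical"]
      .mean()
)
print(critical_share)
```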

  9. Data from: OpenChart-SE: A corpus of artificial Swedish electronic health...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, txt
    Updated Jul 15, 2024
    Cite
    Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits (2024). OpenChart-SE: A corpus of artificial Swedish electronic health records for imagined emergency care patients written by physicians in a crowd-sourcing project [Dataset]. http://doi.org/10.5281/zenodo.7499831
    Explore at:
    Available download formats: txt, csv, bin, pdf
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.

    Dataset content

    OpenChart-SE, version 1 corpus (txt files and dataset.csv)

    The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5 as 1-4 were test cases that were not suitable for publication). The EHRs are available in two formats, structured as a .csv file and as separate textfiles for annotation. Note that flaws in the data were not cleaned up so that it simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.
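
    A minimal loading sketch for both published formats is given below; the CSV column names are not described here, so the sketch only inspects the shape, and the text-file naming pattern is an assumption.

```python
# Hedged sketch: loading the corpus in its two published formats.
# The CSV column names are not specified here, so only the shape is shown;
# the text-file location and naming pattern are assumptions.
from pathlib import Path
import pandas as pd

# Structured format
charts = pd.read_csv("dataset.csv")
print(charts.shape, list(charts.columns)[:5])

# Separate text files for annotation (numbering starts at 5, as noted above)
texts = {p.stem: p.read_text(encoding="utf-8") for p in sorted(Path(".").glob("*.txt"))}
print(f"{len(texts)} chart text files loaded")
```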

    Codebook.xlsx

    The codebook contains information about each variable used. It is in XLSForm format, which can be re-used in several different applications for data collection.

    suppl_data_1_openchart-se_form.pdf

    OpenChart-SE mock emergency care EHR form.

    suppl_data_3_openchart-se_dataexploration.ipynb

    This jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus.

    More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).

  10. AI-based Clinical Trials Solution Provider Market Report | Global Forecast...

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 5, 2024
    Cite
    Dataintelo (2024). AI-based Clinical Trials Solution Provider Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-based-clinical-trials-solution-provider-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Authors
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI-based Clinical Trials Solution Provider Market Outlook



    The global AI-based Clinical Trials Solution Provider market size was valued at USD 1.5 billion in 2023 and is projected to reach USD 7.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. The rapid growth of this market can be attributed to the increasing adoption of artificial intelligence (AI) in clinical trials to enhance data accuracy, reduce trial times, and cut costs. The integration of AI in clinical trials is revolutionizing the pharmaceutical and biotechnology industries by providing more efficient, cost-effective, and reliable solutions.



    One of the primary growth factors for this market is the rising complexity and cost of traditional clinical trials. AI-based solutions offer a significant reduction in the time and resources required for clinical trials by automating various processes such as patient recruitment, data collection, and data analysis. This not only accelerates the trial process but also minimizes human errors, thus enhancing the reliability of the results. Moreover, the increasing incidence of chronic diseases and the subsequent rise in the number of clinical trials are further driving the demand for AI-based solutions.



    Another crucial growth factor is the growing awareness and acceptance of AI technology within the healthcare sector. As more pharmaceutical companies and contract research organizations (CROs) recognize the benefits of AI, there is an increasing willingness to invest in these technologies. AI can analyze vast amounts of data much faster and more accurately than traditional methods, leading to more effective and personalized treatments. Additionally, regulatory bodies are beginning to support the use of AI in clinical trials, which is further fueling market growth.



    The advancements in AI technology itself are also a significant growth driver. Innovations such as machine learning, natural language processing, and deep learning are continually being refined and applied to clinical trials. These technologies can predict patient outcomes more accurately, identify suitable candidates for trials more efficiently, and provide valuable insights from unstructured data. Consequently, the continuous improvement in AI technologies is expected to sustain market growth in the coming years.



    Regionally, North America is expected to dominate the market, followed by Europe and the Asia Pacific. The robust healthcare infrastructure, high adoption rate of advanced technologies, and presence of major pharmaceutical companies in North America are key factors contributing to its leading position. Europe is also a significant market due to its strong emphasis on research and development (R&D) and favorable regulatory environment. Meanwhile, the Asia Pacific region is anticipated to witness the highest growth rate due to increasing investments in healthcare infrastructure and the growing number of clinical trials in emerging economies like China and India.



    Component Analysis



    The AI-based Clinical Trials Solution Provider market by component is segmented into software and services. The software segment is expected to hold a substantial share of the market, driven by the increasing demand for advanced analytics and predictive modeling tools. These software solutions are designed to streamline various aspects of clinical trials, from patient recruitment to data analysis, thereby reducing trial timelines and costs. The rapid adoption of cloud-based solutions is further propelling the growth of this segment, enabling real-time data access and enhanced collaboration among stakeholders.



    Within the software segment, predictive analytics tools are gaining significant traction. These tools leverage machine learning algorithms to predict patient outcomes and identify potential risks, thereby enabling more informed decision-making. Natural language processing (NLP) software is another critical component, used to extract valuable insights from unstructured data such as clinical notes and research papers. The continuous advancements in these technologies are expected to drive substantial growth in the software segment over the forecast period.



    The services segment, comprising consulting, implementation, and support services, is also poised for significant growth. As pharmaceutical companies and CROs increasingly adopt AI-based solutions, the demand for expert consulting services to guide them through the implementation process is rising. These services ensure that the AI solutions are effectively integrated into existin

  11. Forensic Toolkit Dataset

    • kaggle.com
    Updated May 26, 2025
    + more versions
    Cite
    SUNNY THAKUR (2025). Forensic Toolkit Dataset [Dataset]. https://www.kaggle.com/datasets/cyberprince/forensic-toolkit-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SUNNY THAKUR
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Forensic Toolkit Dataset

    Overview

    The Forensic Toolkit Dataset is a comprehensive collection of 300 digital forensics and incident response (DFIR) tools, designed for training AI models, supporting forensic investigations, and enhancing cybersecurity workflows. The dataset includes both mainstream and unconventional tools, covering disk imaging, memory analysis, network forensics, mobile forensics, cloud forensics, blockchain analysis, and AI-driven forensic techniques. Each entry provides detailed information about the tool's name, commands, usage, description, supported platforms, and official links, making it a valuable resource for forensic analysts, data scientists, and machine learning engineers.

    Dataset Description

    The dataset is provided in JSON Lines (JSONL) format, with each line representing a single tool as a JSON object. It is optimized for AI training, data analysis, and integration into forensic workflows.

    Schema

    Each entry contains the following fields:

    id: Sequential integer identifier (1–300).
    tool_name: Name of the forensic tool.
    commands: List of primary commands or usage syntax (if applicable; GUI-based tools noted).
    usage: Brief description of how the tool is used in forensic or incident response tasks.
    description: Detailed explanation of the tool’s purpose, capabilities, and forensic applications.
    link: URL to the tool’s official website or documentation (verified as of May 26, 2025).
    system: List of supported platforms (e.g., Linux, Windows, macOS, Android, iOS, Cloud).
    
    
    Sample Entry
    {
     "id": 1,
     "tool_name": "The Sleuth Kit (TSK)",
     "commands": ["fls -r -m / image.dd > bodyfile", "ils -e image.dd", "icat image.dd 12345 > output.file", "istat image.dd 12345"],
     "usage": "Analyze disk images to recover files, list file metadata, and create timelines.",
     "description": "Open-source collection of command-line tools for analyzing disk images and file systems (NTFS, FAT, ext). Enables recovery of deleted files, metadata examination, and timeline generation.",
     "link": "https://www.sleuthkit.org/sleuthkit/",
     "system": ["Linux", "Windows", "macOS"]
    }
    

    Dataset Structure

    Total Entries: 300

    Content Focus:
    • Mainstream tools (e.g., The Sleuth Kit, FTK Imager).
    • Unconventional tools (e.g., IoTSeeker, Chainalysis Reactor, DeepCase).
    • Specialized areas: IoT, blockchain, cloud, mobile, and AI-driven forensics.

    Purpose

    The dataset is designed for:

    • AI Training: Fine-tuning machine learning models for forensic tool recommendation, command generation, or artifact analysis.
    • Forensic Analysis: Reference for forensic analysts to identify tools for specific investigative tasks.
    • Cybersecurity Research: Supporting incident response, threat hunting, and vulnerability analysis.
    • Education: Providing a structured resource for learning about DFIR tools and their applications.

    Usage

    Accessing the Dataset

    Download the JSONL files from the repository. Each file can be parsed using standard JSONL libraries (e.g., jsonlines in Python, jq in Linux). Combine files for a complete dataset or use individual segments as needed.

```python
# Example: parsing the dataset with Python
import json

with open('forensic_toolkit_dataset_1_50.jsonl', 'r') as file:
    for line in file:
        tool = json.loads(line)
        print(f"Tool: {tool['tool_name']}, Supported Systems: {tool['system']}")
```
    Applications
    
    AI Model Training: Use the dataset to train models for predicting tool usage based on forensic tasks or generating command sequences.
    Forensic Workflows: Query the dataset to select tools for specific platforms (e.g., Cloud, Android) or tasks (e.g., memory analysis).
    Data Analysis: Analyze tool distribution across platforms or forensic categories using data science tools (e.g., Pandas, R).
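
    As a hedged illustration of the platform-distribution analysis mentioned above, the following sketch counts tools per supported platform, reusing the same placeholder file segment as the parsing example:

```python
# Hedged sketch of the platform-distribution analysis mentioned above,
# building on the JSONL parsing example. The file name is the same
# placeholder segment used earlier.
import json
from collections import Counter

platform_counts = Counter()
with open('forensic_toolkit_dataset_1_50.jsonl', 'r') as file:
    for line in file:
        tool = json.loads(line)
        platform_counts.update(tool['system'])  # 'system' is a list of platforms

for platform, count in platform_counts.most_common():
    print(f"{platform}: {count} tools")
```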
    
    Contribution Guidelines
    We welcome contributions to expand or refine the dataset. To contribute:
    
    Fork the repository.
    Add new tools or update existing entries in JSONL format, ensuring adherence to the schema.
    Verify links and platform compatibility as of the contribution date.
    Submit a pull request with a clear description of changes.
    Avoid duplicating tools from existing entries (check IDs 1–300).
    
    Contribution Notes
    
    Ensure tools are forensically sound (preserve evidence integrity, court-admissible where applicable).
    Include unconventional or niche tools to maintain dataset diversity.
    Validate links and commands against official documentation.
    
    License
    This dataset is licensed under the MIT License. See the LICENSE file for details.
    Acknowledgments
    
    Inspired by forensic toolkits and resources from ForensicArtifacts.com, SANS, and open-source communities.
    Thanks to contributors for identifying unique and unconventional DFIR tools.
    
    Contact
    For issues, suggestions, or inquiries, please open an issue on the repository or contact the maintainers at sunny48445@gmail.com.
    
  12. Replication Package for 'How do Machine Learning Models Change?'

    • zenodo.org
    Updated Nov 13, 2024
    Cite
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández (2024). Replication Package for 'How do Machine Learning Models Change?' [Dataset]. http://doi.org/10.5281/zenodo.14128997
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.

    Our research addresses three main aspects:

    1. Categorization of Commit Changes: We classified over 200,000 commits on HF using an extended ML change taxonomy, providing a detailed breakdown of change types and their distribution across models.
    2. Analysis of Commit Sequences: We examined the sequence and dependencies of commit types using Bayesian networks to identify temporal patterns and common progression paths in model changes.
    3. Release Analysis: We investigated the distribution and evolution of release types, analyzing how model attributes and metadata change across successive releases.

    This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.

    Data Collection and Preprocessing

    Data Collection

    We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:

    • Model Information: Details of over 380,000 models, including dataset sizes, training hardware, evaluation metrics, model file sizes, number of downloads and likes, tags, and the raw text of model cards.
    • Commit Histories: Comprehensive commit details, including commit messages, dates, authors, and the list of files edited in each commit.
    • Release Information: Information on model releases marked by tags in their repositories.

    To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
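    A hedged sketch of this kind of extraction with the huggingface_hub client is given below; the method and attribute names may differ between library versions, and the paper's actual extraction scripts are in the Collection notebooks.

    ```python
    # Hedged sketch: list a few models and fetch their commit histories via the Hugging Face Hub API.
    # Attribute names (id, commit_id, created_at, title) may vary between huggingface_hub versions.
    from huggingface_hub import HfApi

    api = HfApi()

    # Small sample for illustration; the study's extraction covered the whole platform.
    for model in api.list_models(limit=3, full=True):
        print(model.id)
        for commit in api.list_repo_commits(model.id):
            print("  ", commit.commit_id[:8], commit.created_at, commit.title)
    ```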

    Data Preprocessing

    Commit Diffs

    We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
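    A minimal sketch of such a key-level diff between two versions of a JSON configuration file (illustrative only, not the exact script used in the package):

    ```python
    # Minimal sketch: added / deleted / updated keys between two versions of config.json.
    import json

    def diff_json_keys(old_text: str, new_text: str):
        old, new = json.loads(old_text), json.loads(new_text)
        added = set(new) - set(old)
        deleted = set(old) - set(new)
        updated = {k for k in set(old) & set(new) if old[k] != new[k]}
        return added, deleted, updated

    before = '{"hidden_size": 768, "num_layers": 12, "dropout": 0.1}'
    after  = '{"hidden_size": 1024, "num_layers": 12, "activation": "gelu"}'

    added, deleted, updated = diff_json_keys(before, after)
    print("added:", added, "deleted:", deleted, "updated:", updated)
    ```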

    Commit Classification

    We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
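    The agreement check between LLM-produced labels and manual labels can be reproduced along these lines; a sketch with scikit-learn, using made-up labels:

    ```python
    # Sketch: validate LLM-based commit labels against manual labels with Cohen's kappa.
    from sklearn.metrics import cohen_kappa_score

    manual_labels = ["add layer", "update docs", "fix config", "add layer", "update docs"]
    llm_labels    = ["add layer", "update docs", "fix config", "add layer", "fix config"]

    kappa = cohen_kappa_score(manual_labels, llm_labels)
    print(f"Cohen's kappa: {kappa:.2f}")  # the paper iterated until kappa >= 0.9
    ```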

    Model Metadata

    We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.

    Folder Structure

    The replication package is organized as follows:

    - code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.

    • Collection/: Contains two Jupyter notebooks for data collection:
      • HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
      • HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
    • Preprocessing/: Contains preprocessing scripts:
      • HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
      • HFCommitsPreprocessing.ipynb: Processes commit data, including:
        • Retrieval of diff information between commits.
        • Classification of commits following Bhatia et al.'s taxonomy using LLMs.
        • Extension and adaptation of the final commits dataset, including additional variables for Bayesian network analysis.
      • HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
    • Analysis/: Contains three Jupyter notebooks with the analysis for each research question:
      • RQ1_Analysis.ipynb: Analysis for RQ1.
      • RQ2_Analysis.ipynb: Analysis for RQ2.
      • RQ3_Analysis.ipynb: Analysis for RQ3.

    - datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.

    • Main Datasets:
      • HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
      • HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
      • HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
      • model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
      • These datasets correspond to the following dataset splits:
        • +200,000 commits from 50,000 models: Used for RQ1. Provides a broad overview of commit types and patterns across diverse models.
        • +200,000 commits from 10,000 models: Used for RQ2. Focuses on models with at least 10 commits for detailed evolutionary study.
        • +1,200 releases from 127 models: Used for RQ3.1, RQ3.2, and RQ3.3. Facilitates the investigation of release patterns and their evolution.
        • Metadata of 173 releases from 27 models: Used for RQ3.4. Analyzes the evolution of model parameters and configurations.
    • Additional Datasets:
      • HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
      • HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
      • Auxiliary datasets generated during processing are also included to facilitate reproduction of specific parts of the code without time-consuming steps.

    - metadata/: Contains the tags_metadata.yaml file used during preprocessing.

    - models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.

    - requirements.txt: Lists the required Python packages to set up the environment and run the code.

    Setup and Execution

    Prerequisites

    • Python 3.10.11 or later.
    • Jupyter Notebook or JupyterLab.

    Installation

    1. Download and extract the replication package.
    2. Create a virtual environment (recommended):
       python -m venv venv
       source venv/bin/activate   # On Windows, use venv\Scripts\activate
    3. Install the required packages:
       pip install -r requirements.txt

    Notes

    - LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.

    - Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.

    - Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.

    Additional Information

    Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.

    This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.

  13. Data for: Automated Generation of Structure Datasets for Machine Learning...

    • edmond.mpg.de
    application/gzip +3
    Updated Jun 10, 2025
    Cite
    Marvin Poul; Marvin Poul (2025). Data for: Automated Generation of Structure Datasets for Machine Learning Potentials and Alloys [Dataset]. http://doi.org/10.17617/3.DYLLSS
    Explore at:
    text/comma-separated-values(1657), application/x-gzip(1616269), application/x-gzip(86419116), application/x-gzip(9050356), application/x-gzip(1960106), application/x-gzip(4819324), application/x-gzip(2836478), bin(86419036), application/x-gzip(1425372), application/gzip(3716967195)Available download formats
    Dataset updated
    Jun 10, 2025
    Dataset provided by
    Edmond
    Authors
    Marvin Poul; Marvin Poul
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    DFT Training Data for fitting Moment Tensor Potentials for the system Mg/Al/Ca. See https://github.com/eisenforschung/mgalca-mtp-data for further notes and usage examples.

  14. 100K+ Text Rich Images | AI Training Data | Annotated imagery data for AI |...

    • datarade.ai
    Updated Mar 18, 2024
    Cite
    Data Seeds (2024). 100K+ Text Rich Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/100k-text-rich-images-ai-training-data-annotated-imagery-data-seeds
    Explore at:
    .bin, .csv, .json, .sql, .txt, .xmlAvailable download formats
    Dataset updated
    Mar 18, 2024
    Dataset authored and provided by
    Data Seeds
    Area covered
    Bonaire, Réunion, Montenegro, Senegal, Mongolia, Côte d'Ivoire, Italy, Turks and Caicos Islands, Papua New Guinea, Timor-Leste
    Description

    This dataset features over 100,000 high-quality images containing visible, naturally occurring text, sourced from photographers worldwide. Designed to support AI and machine learning applications, it offers a richly annotated and globally diverse collection ideal for training models in OCR, scene text recognition, and multimodal understanding.

    Key Features:

    1. Comprehensive Metadata: Each image includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Pre-annotations include object detection, scene classification, and text presence. Many images contain metadata on language type, script, and text region properties. Popularity metrics derived from user engagement on our proprietary platform are also included.

    2. Unique Sourcing Capabilities: Images are sourced through a gamified photography platform that runs themed competitions, in this case focused on capturing text in real-world environments. This ensures a steady flow of fresh, relevant, and contextually diverse submissions. Custom datasets can be sourced within 72 hours, including requests for specific languages, signage types, or visual environments (e.g., storefronts, menus, documents, public transport).

    3. Global Diversity: Contributors from over 100 countries provide a vast array of languages, scripts (Latin, Cyrillic, Arabic, Chinese, etc.), and contexts. The dataset includes urban signage, handwritten notes, printed posters, digital displays, packaging, street graffiti, books, and more, offering a robust training set for global OCR and text-detection models.

    4. High-Quality Imagery: Resolution varies from standard to high-definition, supporting a range of computer vision tasks. The collection includes a mix of candid, environmental shots and deliberate, close-up captures of text, enabling both practical OCR training and stylistic or multimodal research.

    5. Popularity Scores: Each image is assigned a popularity score based on performance in our GuruShots photography competitions. This provides additional insight into user-perceived relevance and aesthetic appeal, useful for building models around user engagement, content filtering, or recommendation systems.

    6. AI-Ready Design: Optimized for AI workflows, this dataset supports applications in OCR, text spotting, translation, semantic understanding, and cross-modal retrieval. It integrates smoothly into popular machine learning frameworks and pipelines.

    7. Licensing & Compliance: The dataset is fully compliant with data privacy regulations and comes with clear, transparent licensing for commercial and academic use. All images have appropriate contributor agreements and usage rights in place.

    Use Cases:

    1. Training OCR and scene text recognition models across multiple scripts and environments.
    2. Powering AI for multilingual translation, navigation, and AR applications.
    3. Supporting retail and logistics models through packaging and signage text extraction.
    4. Enhancing multimodal AI systems that combine visual and textual understanding.
    5. Enabling research in typography, linguistics, and global textual design.

    This dataset offers a rich, AI-optimized collection of real-world, text-containing imagery — diverse in content, language, and style — with customization options available for your specific needs. Contact us to request samples or a tailored delivery.

  15. Vocalizations in the plains zebra (Equus quagga)

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jun 22, 2024
    Cite
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer (2024). Vocalizations in the plains zebra (Equus quagga) [Dataset]. http://doi.org/10.5061/dryad.v9s4mw73w
    Explore at:
    Dataset updated
    Jun 22, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer
    Description

    Acoustic signals are vital in animal communication, and quantifying these signals is fundamental for understanding animal behaviour and ecology. Vocalizations can be classified into acoustically and functionally or contextually distinct categories, but establishing these categories can be challenging. Newly developed methods, such as machine learning, can provide solutions for classification tasks. The plains zebra is known for its loud and specific vocalizations, yet limited knowledge exists on the structure and information content of these vocalizations. In this study, we employed both feature-based and spectrogram-based algorithms, incorporating supervised and unsupervised machine learning methods, to enhance robustness in categorizing zebra vocalization types. Additionally, we implemented a permuted discriminant function analysis (pDFA) to examine the individual identity information contained in the identified vocalization types. The findings revealed at least four distinct ...

    Data collection and sampling: We collected data in three locations in Denmark and South Africa: 1) 10 months between December 2020 and July 2021 and between September and December 2021 at Pilanesberg National Park (hereafter "PNP"), South Africa, covering both the dry season (i.e. from May to September) and the wet season (i.e. from October to April) (1); 2) 16 days between May and June 2019 and 33 days between February and May 2022 at Knuthenborg Safari Park (hereafter "KSP"), Denmark, covering the periods both before the park's opening to tourists (i.e. from November to March) and after (i.e. from April to October); 3) 4 days in August 2019 at Givskud Zoo (hereafter "GKZ"), Denmark. For all places and periods, three types of data were collected as follows: 1) pictures were taken of each individual from both sides using a camera (Nikon COOLPIX P950); 2) contexts of vocal production were recorded either through notes (in the first period of KSP and in GKZ) or videos (in the second period of KS...

    Vocalizations in the plains zebra (Equus quagga)

    Data and Scripts

    • 1_Praat_Script_Zebra_Vocalisations.praat: This script is used to extract vocal features using the software Praat.
    • 2_Data_Script_Vocal_Repertoire.zip: This archive contains data and scripts for analyzing the vocal repertoire. It includes two folders:
      • Feature_based_analyses:
      • The dataset "feature_based_input.csv" is the input for both scripts in this folder.
      • "feature_based_supervised_classification_xgboost.ipynb" is used for supervised analysis.
      • "feature_based_unsupervised_clustering.ipynb" is used for unsupervised analysis.
      • Spectrogram_based_analyses:
      • The "spectrogram_based_classification" folder contains the input data "calltype_spec.npz" and "calltype_y.csv", as well as the notebook script "spectrogram_based_classification_cnn.ipynb" for supervised machine learning analysis.
      • The "spectrogram_based_clustering" folder contains subfolders "audio", "data"...
  16. NBA Player Dataset & Prediction Model Artifacts

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png +2
    Updated Apr 28, 2025
    Cite
    Burak Baltali; Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43
    Explore at:
    json, png, csv, bin, txt, text/markdownAvailable download formats
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Burak Baltali; Burak Baltali
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

    Brief overview of Files

    1. End-of-season box-score aggregates (2012–13 through 2023–24), split into train/test sets.

    2. The Jupyter notebook (Analysis.ipynb); all the code can be executed there.

    3. The trained model binary (nba_model.pkl), a serialized Random Forest model artifact (a loading sketch is given under Additional Notes below).

    4. Evaluation plots (LAL vs. whole league) for regular-season and playoff predictions, provided as PNG outputs.

    5. FAIR4ML metadata (fair4ml_metadata.jsonld); see README.md and abbreviations.txt for file details.

    6. For further information, see the GitHub repository (link below).

    File Details

    Notebook

    Analysis.ipynb: Contains the graphical output of the training and testing runs.

    Training/Test CSV Data

    • regular_train.csv: Regular-season training data, covering the 2012-2013 through 2021-2022 seasons. PID: 4421e56c-4cd3-4ec1-a566-a89d7ec0bced
    • regular_test.csv: Regular-season test data, covering the 2022-2023 season. PID: f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
    • playoff_train.csv: Playoff training data, covering the 2012-2013 through 2022-2023 seasons. PID: bcb3cf2b-27df-48cc-8b76-9e49254783d0
    • playoff_test.csv: Playoff test data, covering the 2023-2024 season. PID: de37d568-e97f-4cb9-bc05-2e600cc97102

    Others

    abbrevations.txt: Lists the fundamental abbreviations used for the columns in the CSV data.

    Additional Notes

    Raw csv files are taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)

    Some preprocessing is required before uploading into DBRepo.

    Plots have also been uploaded as an output for visual purposes.

    A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
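    The following hedged sketch shows how the serialized Random Forest could be loaded and scored on the regular-season test split. The feature and target column names are assumptions; consult abbrevations.txt, README.md, and the GitHub repository for the real schema.

    ```python
    # Hedged sketch: load the trained Random Forest and score the regular-season test split.
    # Column names below are assumptions; consult abbrevations.txt / README.md for the real schema.
    import pickle
    import pandas as pd

    with open("nba_model.pkl", "rb") as f:
        model = pickle.load(f)

    test = pd.read_csv("regular_test.csv")

    # Assumed feature/target split: box-score aggregates as features, points as the target.
    feature_cols = ["rebounds", "steals", "turnovers", "three_pt_attempts", "fg_attempts"]
    X_test = test[feature_cols]
    y_true = test["points"]

    y_pred = model.predict(X_test)
    print("Mean absolute error:", abs(y_pred - y_true).mean())
    ```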

  17. Bangladeshi Currency (Coins & Notes) Recognition Dataset

    • data.mendeley.com
    Updated Jan 20, 2025
    Cite
    Shuvo Kumar Basak Shuvo (2025). Bangladeshi Currency (Coins & Notes) Recognition Dataset [Dataset]. http://doi.org/10.17632/xn44yz596n.2
    Explore at:
    Dataset updated
    Jan 20, 2025
    Authors
    Shuvo Kumar Basak Shuvo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The Bangladeshi Currency (Coins & Notes) Recognition Dataset is a comprehensive collection of high-quality images of Bangladeshi coins and banknotes. It is designed to facilitate machine learning and computer vision applications for currency recognition, classification, and detection.

    This dataset is organized into various denominations of coins and notes, with each folder representing a specific currency denomination. Each folder contains 10,000 images, providing a total of 100,000 images in the dataset.

    The images have been resized to a uniform dimension of 256x256 pixels, ensuring consistency and enabling easy integration into machine learning workflows. The images are saved in JPEG format to optimize storage and speed for large-scale training tasks.

    Currency Denominations Included: 10 Poisha (small denomination coin), 1 Poisha, 1 Taka, 25 Poisha, 2 Taka, 50 Poisha, 5 Poisha, 5 Taka, Commemorative Coins, and Demonetized Notes.

    Features:

    • Image Size: All images have been resized to 256x256 pixels (width x height).
    • Image Format: JPEG.
    • Total Images: 100,000 (10,000 images per folder, one per denomination).
    • Categories: Each folder corresponds to a unique denomination of currency. The folder names are aligned with the specific denominations, such as 10_Poisha, 1_Taka, 5_Taka, etc.

    Objective: This dataset is ideal for training and evaluating models for the following tasks:

    • Currency Classification: Identifying the denomination of a given image of a coin or banknote.
    • Currency Recognition: Detecting and recognizing specific Bangladeshi coins and notes from real-world images.
    • Coin and Note Detection: Identifying and classifying multiple coins and notes in a single image.

    Possible Use Cases:

    • Currency detection systems: Automated systems in ATMs, vending machines, or cash counting machines that recognize Bangladeshi coins and banknotes.
    • Banknote and Coin Classification: Machine learning models that classify various denominations of coins and notes for digital payment applications.
    • Real-world Applications: Currency recognition for mobile apps, kiosks, or any system that needs to automatically recognize Bangladeshi currency.
    • Research in Currency Image Recognition: Researchers working on currency recognition problems using computer vision techniques.

    Sources: Collected from https://www.bb.org.bd/currency plus the authors' own images.

    Note for Researchers Using the Dataset: This dataset was created by Shuvo Kumar Basak. If you use this dataset for research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.
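    Because the images are organised in one folder per denomination, they can be loaded directly with standard folder-based dataset utilities. A minimal sketch using torchvision follows; the root path "bd_currency_dataset/" is an assumption.

    ```python
    # Minimal sketch: load the folder-per-denomination images with torchvision's ImageFolder.
    # "bd_currency_dataset/" is an assumed root path containing folders such as 10_Poisha, 1_Taka, 5_Taka, ...
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize((256, 256)),   # images are already 256x256; kept for safety
        transforms.ToTensor(),
    ])

    dataset = datasets.ImageFolder("bd_currency_dataset/", transform=transform)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    print("Classes:", dataset.classes)   # folder names become class labels
    images, labels = next(iter(loader))
    print(images.shape, labels[:8])
    ```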

  18. Indian Currency Notes Classifier

    • kaggle.com
    Updated Jul 5, 2020
    Cite
    Gaurav Rajesh Sahani (2020). Indian Currency Notes Classifier [Dataset]. https://www.kaggle.com/gauravsahani/indian-currency-notes-classifier/notebooks
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 5, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Gaurav Rajesh Sahani
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    India
    Description

    This dataset contains 195 images across 7 categories of Indian currency notes. The data was collected from Google Images, Shutterstock, and Flickr. You can use it to recognize the type of Indian note from a photo or in real-time applications.

    This dataset is intended for image classification and covers 7 distinct types of Indian currency notes. The images are not reduced to any single size and may have different proportions.


    The distinct types of Indian currency notes are: 1) Ten Rupee Notes, 2) Twenty Rupee Notes, 3) Fifty Rupee Notes, 4) Hundred Rupee Notes, 5) Two Hundred Rupee Notes, 6) Five Hundred Rupee Notes, and 7) Two Thousand Rupee Notes.

    Do download and explore this dataset. If you like it, please upvote. Thank you.

  19. Dataset of Spoilt Banknotes of India (Rupees)

    • data.mendeley.com
    Updated Jul 17, 2023
    + more versions
    Cite
    Vidula Meshram (2023). Dataset of Spoilt Banknotes of India (Rupees) [Dataset]. http://doi.org/10.17632/jh6979fg2t.4
    Explore at:
    Dataset updated
    Jul 17, 2023
    Authors
    Vidula Meshram
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Accurate currency recognition and classification is one of the most challenging tasks for visually impaired people. Because damaged banknotes are not accepted by vendors during financial transactions, it is necessary to distinguish between spoilt and unspoilt banknotes. With this objective, we created the Spoilt Indian (Rupees) banknotes dataset. The dataset consists of a total of 5125 high-quality images (2584 old banknotes and 2541 new banknotes). A mobile phone's rear camera was used to take the images of spoilt Indian banknotes. The dataset consists of 8 classes, namely Spoilt New 10 Rupees, Spoilt Old 10 Rupees, Spoilt New 20 Rupees, Spoilt Old 20 Rupees, Spoilt New 50 Rupees, Spoilt Old 50 Rupees, Spoilt New 100 Rupees, and Spoilt Old 100 Rupees. Banknote images that were soiled, mutilated, holed, or torn were considered for dataset creation. The images were taken against dark, illuminated, and cluttered backgrounds.

  20. Machine Learning Predicts QQQ to Increase in Value by 5% in the Next 3...

    • kappasignal.com
    Updated Jun 2, 2023
    Cite
    KappaSignal (2023). Machine Learning Predicts QQQ to Increase in Value by 5% in the Next 3 Months (Forecast) [Dataset]. https://www.kappasignal.com/2023/06/machine-learning-predicts-qqq-to.html
    Explore at:
    Dataset updated
    Jun 2, 2023
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.htmlhttps://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    Machine Learning Predicts QQQ to Increase in Value by 5% in the Next 3 Months

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)
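    For example, indicators such as the moving averages and RSI listed above can be derived directly from the daily price history. A small sketch with pandas follows; the column name and sample prices are assumptions for illustration.

    ```python
    # Sketch: compute a 20-day moving average and a 14-day RSI from daily closing prices.
    # Assumes a DataFrame with a "close" column, e.g. loaded from a CSV of daily prices.
    import pandas as pd

    prices = pd.DataFrame({"close": [100, 101, 102, 101, 103, 105, 104, 106, 108, 107,
                                     109, 111, 110, 112, 113, 112, 114, 116, 115, 117,
                                     118, 117, 119, 121, 120]})

    # Simple moving average over 20 trading days.
    prices["sma_20"] = prices["close"].rolling(window=20).mean()

    # Relative Strength Index over 14 days (simple-average variant).
    delta = prices["close"].diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    prices["rsi_14"] = 100 - 100 / (1 + gain / loss)

    print(prices.tail())
    ```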

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data


Eye Tracking based Learning Style Identification for Learning Management Systems


Experimental Setup:

In the following section, crucial notes on the circumstances and the experiment itself as well as the equipment are given.
In order to reduce the external influence on the experiment, variables such as:

  • order, number, and presentation of the stimuli,
  • instruction to the participant prior to the experiment,
  • position of the participant in respect to the Eye Tracking equipment,
  • environment such as illuminance and ambient noise for the participant,
  • Eye Tracking equipment, software, settings such as sampling frequency and latency as well as calibration

were kept as constant and consistent as possible throughout the experiment.

Equipment:

In this study, the Tobii Pro Fusion (https://go.tobii.com/tobii-pro-fusion-user-manual) eye tracker is used without a chin rest, along with the Tobii IVT filter for fixation detection and the Tobii Pro Lab software for data collection. The Tobii Pro Fusion is categorised as a video-based combined pupil and corneal reflection technology. This tracker offers several advantages, such as the collection of comprehensive data comprising gaze, pupil, and eye-opening metrics. The eye tracker captures up to 250 images per second (250 Hz), which improves the precision of eye movement analysis. In addition, the Tobii Pro Fusion performs under different lighting conditions, making this portable device ideal for off-site studies.

Ensuring consistent quality across all experiment participants is crucial. Prior to each individual experiment, the eye tracker is calibrated, aiming for a maximum reproduction error of at most 0.2 degrees during calibration to minimize deviations. The calibration is excluded from the experiment recording. Each participant is given the same instructions for their single trial of the experiment. The stimuli are displayed on a 24-inch monitor in 16:9 format, positioned approximately 65 cm away from the participants' eyes. Any effects related to participant characteristics, such as age, visual acuity, eye colour, and pupil size, are considered in the experiment design.

Procedure:

Initially, the participants are asked to confirm that they are able to take part in the experiment given their current condition. Subsequently, each participant is positioned comfortably and accurately in relation to the eye tracker. The eye tracker calibration is carried out for each participant to ensure a suitable experimental configuration. Once a successful calibration is achieved, the Eye Tracking experiment begins, with instructions given prior to each task. The stimuli presentation is not restricted by time constraints, and no prior knowledge of the stimuli contents is necessary. Employing a within-subject design, each subject is exposed to every stimulus. Following completion of the experiment, participants anonymously answer the ILS questionnaire. To prevent any impact on the experiment, it is important that the questionnaire is only seen and completed after the experiment.

Stimuli:

The specially designed stimuli shown to participants during the study are illustrated in the left-hand column of the figure in the PDF file "[Documentation]stimuli_preview.pdf", which is part of the Master's thesis "Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence" by Bittner D. For this research, only specific regions of a stimulus, referred to as areas of interest (AOIs), are taken into consideration. The size of an AOI depends on both the stimulus information and the distance between multiple AOIs. Adequate results are ensured by non-overlapping AOIs and appropriate spacing. The AOIs of the various stimuli employed in this research are illustrated in the right-hand column of the figure in the same PDF file. The stimuli are presented in German, ensuring reliable Eye Tracking measurements without any interference from language barriers. Each stimulus comprises diverse learning materials to engage students with varying learning styles, covering some general information about the quantitative research cycle. Some stimuli feature identical types of material, e.g. illustrations or key words, but with different contexts and positions on the stimuli. Rearranging the identical material reduces the influence of reading style and enhances the impact of the learning style, producing a more reliable experiment. These identical types of material or AOIs on different stimuli can be grouped together, identified by the same colour and title, and referred to as AOI groupings.
There are ten different AOI groupings in total, as illustrated in the figure in the "[Documentation]stimuli_preview.pdf" file, where each grouping consists of several AOIs.
In detail, the AOI grouping regarding:

  • table of contents and summary contain only a single AOI each,
  • illustrations, key words, theory, exercise, example and additional material contain three AOIs each,
  • supporting text and multiple choice question contain two AOIs each.
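To illustrate how recorded fixations can be related to such AOIs, the following minimal sketch assigns each fixation to the AOI it falls in. All bounding boxes and fixation coordinates are hypothetical and are not taken from the published analysis.

```python
# Minimal sketch: map fixation coordinates to AOIs (all coordinates are hypothetical).
from dataclasses import dataclass

@dataclass
class AOI:
    name: str
    x: float       # left edge in pixels
    y: float       # top edge in pixels
    width: float
    height: float

    def contains(self, fx: float, fy: float) -> bool:
        return self.x <= fx <= self.x + self.width and self.y <= fy <= self.y + self.height

# Hypothetical AOIs for one stimulus (names follow the groupings above).
aois = [
    AOI("illustration", 100, 200, 400, 300),
    AOI("key_words", 600, 200, 300, 300),
    AOI("summary", 100, 600, 800, 150),
]

# Hypothetical fixations: (x, y, duration in ms) as exported from Tobii Pro Lab.
fixations = [(150, 250, 312), (700, 320, 254), (50, 50, 180)]

for fx, fy, dur in fixations:
    hit = next((a.name for a in aois if a.contains(fx, fy)), "outside_AOIs")
    print(f"fixation at ({fx}, {fy}) for {dur} ms -> {hit}")
```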

Research data management:

To ensure the transparency and reproducibility of this study, effective management of research data is essential. This section provides details on the management, storage and analysis of the extensive dataset collected as part of the study. Importantly, this research, the study and its processes adhered to ethical guidelines at all times, including informed consent, participant anonymity and secure data handling. The data collected will only be kept for a specific period of time as defined in the research project guidelines. The collection itself involves the recording of participants' eye movements during the ET study and the collection of their demographic data and responses to the ILS questionnaire.
