100+ datasets found
  1. Eye Tracking based Learning Style Identification for Learning Management...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf, tsv
    Updated Jul 11, 2024
    Cite
    Dominik Bittner; Timur Ezer; Lisa Grabinger; Florian Hauser; Jürgen Mottok (2024). Eye Tracking based Learning Style Identification for Learning Management Systems [Dataset]. http://doi.org/10.5281/zenodo.8349468
    Explore at:
    Available download formats: bin, tsv, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Bittner; Timur Ezer; Lisa Grabinger; Florian Hauser; Jürgen Mottok
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Abstract:

    In recent years, universities have been faced with increasing numbers of students dropping out. This is partly due to the fact that students are limited in their ability to explore individual learning paths through different course materials. However, a promising remedy to this issue is the implementation of adaptive learning management systems. These systems recommend customised learning paths to students - based on their individual learning styles. Learning styles are commonly classified using questionnaires and learning analytics, but both methods are prone to error. Questionnaires may yield superficial responses due to time constraints or lack of motivation, while learning analytics ignore offline learning behaviour. To address these limitations, this study aims to integrate Eye Tracking for a more accurate classification of students' learning styles. Ultimately, this comprehensive approach could not only open up a deeper understanding of subconscious processes, but also provide valuable insights into students' unique learning preferences.

    Research:

    As an example of a possible analysis of the eye-tracking stimuli and eye movement recordings available here, as well as the corresponding ILS questionnaire responses, we refer to the following research works, which should be cited where applicable:

    • Bittner, D., Nadimpalli, V. K., Grabinger, L., Ezer, T., Hauser, F., & Mottok, J. (2024, June), Uncovering Learning Styles through Eye Tracking and Artificial Intelligence, In 2024 Symposium on Eye Tracking Research and Applications. ETRA.
    • Bittner, D. (2024), Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence. Master’s Thesis, Regensburg University of Applied Sciences (OTH), Regensburg, Germany
    • Bittner, D., Ezer, T., Grabinger, L., Hauser, F., & Mottok, J. (2023). Unveiling the secrets of learning styles: decoding eye movements via machine learning. In ICERI2023 Proceedings (pp. 5153-5162). IATED.
    • Bittner, D., Hauser, F., Nadimpalli, V. K., Grabinger, L., Staufer, S., & Mottok, J. (2023, June). Towards eye tracking based learning style identification. In Proceedings of the 5th European Conference on Software Engineering Education (pp. 138-147). ECSEE.

    The following descriptions and the previous abstract are part of the Master's thesis "Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence" by Bittner D. and have to be cited accordingly.

    Experimental Setup:

    In the following section, crucial notes on the circumstances, the experiment itself, and the equipment are given.
    In order to reduce external influences on the experiment, variables such as:

    • order, number, and presentation of the stimuli,
    • instruction to the participant prior to the experiment,
    • position of the participant in respect to the Eye Tracking equipment,
    • environment such as illuminance and ambient noise for the participant,
    • Eye Tracking equipment, software, settings such as sampling frequency and latency as well as calibration

    were kept as constant and consistent as possible throughout the experiment.

    Equipment:

    In this study, the Tobii Pro Fusion (https://go.tobii.com/tobii-pro-fusion-user-manual) eye tracker is utilized without a chin rest, along with the Tobii IVT filter for fixation detection and Tobii Pro Lab software for data collection. The Tobii Pro Fusion is a video-based eye tracker using combined pupil and corneal reflection technology. This tracker provides several advantages, such as the collection of comprehensive data comprising gaze, pupil, and eye-opening metrics. The eye tracker captures up to 250 images per second (250 Hz), enhancing the precision of eye movement analysis. In addition, the Tobii Pro Fusion is capable of performing under different lighting conditions, making this portable device ideal for off-site studies.

    Ensuring consistent quality across all experiment participants is crucial. Prior to each individual experiment, the eye tracker is calibrated, aiming for a maximum reproduction error of at most 0.2 degrees during calibration to minimize deviations. The calibration is excluded from the experiment recording. Each participant is given the same instructions for their single trial of the experiment. The stimuli are displayed on a 24-inch monitor in 16:9 format, positioned approximately 65 cm away from the participants' eyes. Any effects related to the characteristics of the participants, such as age, visual acuity, eye colour, pupil size, etc., are considered in the experiment design.

    Procedure:

    Initially, the participants are requested to confirm their ability to conduct the experiment based on their current condition. Subsequently, the participant is positioned comfortably and accurately in relation to the eye tracker. The eye tracker calibration is carried out for each participant to ensure a suitable experimental configuration. Once a successful calibration is achieved, the Eye Tracking experiment begins, with instructions given prior to each task. The stimuli presentation is unrestricted by time constraints, and no prior knowledge of the stimuli contents is necessary. Employing a within-subject design, each subject is exposed to every stimulus. Following completion of the experiment, participants anonymously answer the ILS questionnaire. To prevent any impact on the experiment, it is important that the questionnaire is only seen and completed after the experiment.

    Stimuli:

    The specially designed stimuli shown to participants during the study are illustrated in the left-hand column of the figure in the PDF file "[Documentation]stimuli_preview.pdf", which is part of the Master's thesis "Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence" by Bittner D. For this research, only specific regions of a stimulus, referred to as areas of interest (AOIs), are taken into consideration. The size of an AOI depends on both the stimulus information and the distance between multiple AOIs. Adequate results are ensured by non-overlapping AOIs and appropriate spacing. The AOIs of the various stimuli employed in this research are illustrated in the right-hand column of the figure in the same PDF file. The stimuli are presented in German, ensuring reliable Eye Tracking measurements without any interference from language barriers. Each stimulus comprises diverse learning materials to engage students with varying learning styles, with some general information about the quantitative research cycle. Some stimuli feature identical types of material, e.g. illustrations or key words, but with different contexts and positions on the stimuli. Rearranging the identical material reduces the influence of reading style and enhances the impact of the learning style, producing a more reliable experiment. These identical types of material or AOIs on different stimuli can be grouped together, identified by the same colour and title, and referred to as AOI groupings.
    There are ten different AOI groupings in total, as illustrated in the figure in the "[Documentation]stimuli_preview.pdf" file, where each grouping consists of several AOIs.
    In detail, the AOI grouping regarding:

    • table of contents and summary contain only a single AOI each,
    • illustrations, key words, theory, exercise, example and additional material contain three AOIs each,
    • supporting text and multiple choice question contain two AOIs each.
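
    As an illustration of how the exported eye movement recordings and these AOI definitions might be combined, the following sketch assigns fixation points to AOI groupings. It is only a hypothetical example: the TSV column names follow typical Tobii Pro Lab exports and the bounding-box coordinates are placeholders, so both must be adapted to the actual files in this dataset.

```python
# Hypothetical sketch: assigning exported fixations to AOI groupings.
# Column names follow typical Tobii Pro Lab TSV exports and the AOI
# bounding boxes are placeholders; adjust both to the actual files.
import pandas as pd

# AOI groupings as (x_min, y_min, x_max, y_max) in screen pixels (placeholders)
AOI_GROUPINGS = {
    "illustrations": [(100, 200, 400, 500)],
    "key_words": [(450, 200, 700, 300)],
    "summary": [(100, 900, 900, 1000)],
}

def label_fixation(x, y):
    """Return the AOI grouping that contains the fixation point, if any."""
    for group, boxes in AOI_GROUPINGS.items():
        for (x0, y0, x1, y1) in boxes:
            if x0 <= x <= x1 and y0 <= y <= y1:
                return group
    return "outside_aoi"

fixations = pd.read_csv("recording_export.tsv", sep="\t")
fixations = fixations.dropna(subset=["Fixation point X", "Fixation point Y"])
fixations["aoi_group"] = [
    label_fixation(x, y)
    for x, y in zip(fixations["Fixation point X"], fixations["Fixation point Y"])
]

# Total fixation count per AOI grouping, e.g. as input for learning style analysis
print(fixations["aoi_group"].value_counts())
```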

    Research data management:

    To ensure the transparency and reproducibility of this study, effective management of research data is essential. This section provides details on the management, storage and analysis of the extensive dataset collected as part of the study. Importantly, this research, the study and its processes adhered to ethical guidelines at all times, including informed consent, participant anonymity and secure data handling. The data collected will only be kept for a specific period of time as defined in the research project guidelines. The collection itself involves the recording of participants' eye movements during the ET study and the collection of their demographic data and responses to the ILS questionnaire.

  2. A Messy Handwriting Dataset with Student Crossouts and Corrections...

    • research-repository.rmit.edu.au
    • researchdata.edu.au
    zip
    Updated Oct 24, 2023
    Cite
    Hiqmat Nisa (2023). A Messy Handwriting Dataset with Student Crossouts and Corrections (Line-version) [Dataset]. http://doi.org/10.25439/rmt.24419986.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    RMIT University
    Authors
    Hiqmat Nisa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1). Within the central repository, there are subfolders with each document converted into lines. All images are in .png format. In the main folder there are three .txt files:

    1) SMHD.txt contains all the line-level transcriptions in the form of image name, threshold value, label (e.g. 0001-000,178 Bombay Phenotype :-).
    2) SMHD-Cross-outsandInsertions.txt contains all the line images from the dataset having crossed-out and inserted text.
    3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting. This can be used as a test set. The images in this file are not included in SMHD.txt.

    In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating easy identification of files with or without such modifications.

    Dataset Description: We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy, with all kinds of insertions and corrections, and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines on them. For this purpose, we drew lines using light colours on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi with a grey-level resolution of 8 bits.

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words; we called it the summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of printed text articles given to the participants was collected from the Internet on different topics; the articles were related to current political situations, daily life activities, and the Covid-19 pandemic. In the second exercise, we asked participants to write an essay from a given list of topics, or on any topic of their choice; we called it the essay-based dataset. This dataset was collected from 250 high school students, who were given 30 minutes to think about the topic and write. In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study; we called it the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies. Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with their classmates; we did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class notes compared to the summary-based and academic-based collections. In all four exercises, we did not impose any rules on the writers, for example regarding spacing or usage of a pen. We asked them to cross out text if it seemed inappropriate. Although writers usually made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
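
    As a hedged illustration of how the transcription files described above could be read, the following sketch parses SMHD.txt and flags lines containing crossed-out text (marked with '#'). The field layout (image name, a comma, then the threshold value and the label) is inferred from the example above and should be checked against the actual file.

```python
# Hedged sketch: reading SMHD.txt line-level transcriptions and flagging
# lines that contain crossed-out text (marked with '#').
# The field layout is inferred from the example "0001-000,178 Bombay Phenotype :-";
# adjust the parsing if the actual delimiter differs.
records = []
with open("SMHD.txt", encoding="utf-8") as f:
    for raw in f:
        raw = raw.strip()
        if not raw:
            continue
        image_name, rest = raw.split(",", 1)   # image name before the comma
        threshold, label = rest.split(" ", 1)  # threshold value, then transcription
        records.append({
            "image": image_name,
            "threshold": int(threshold),
            "label": label,
            "has_crossout": "#" in label,
        })

messy = [r for r in records if r["has_crossout"]]
print(f"{len(messy)} of {len(records)} lines contain crossed-out text")
```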

  3. Industrial screw driving dataset collection: Time series data for process...

    • data.niaid.nih.gov
    Updated Feb 18, 2025
    + more versions
    Cite
    West, Nikolai (2025). Industrial screw driving dataset collection: Time series data for process monitoring and anomaly detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14729547
    Explore at:
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Deuse, Jochen
    West, Nikolai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Industrial Screw Driving Datasets

    Overview

    This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.

    Scenario name | Number of work pieces | Repetitions (screw cycles) per workpiece | Individual screws per workpiece | Observations | Unique classes | Purpose
    s01_thread-degradation | 100 | 25 | 2 | 5,000 | 1 | Investigation of thread degradation through repeated fastening
    s02_surface-friction | 250 | 25 | 2 | 12,500 | 8 | Surface friction effects on screw driving operations
    s03_error-collection-1 | – | 1 | 2 | – | 20 | –
    s04_error-collection-2 | 2,500 | 1 | 2 | 5,000 | 25 | –
    s05_injection-molding-manipulations-upper-workpiece | 1,200 | 1 | 2 | 2,400 | 44 | Investigation of changes in the injection molding process of the workpieces

    Dataset Collection

    The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:

    Currently Available Datasets:

    1. s01_thread-degradation

    Focus: Investigation of thread degradation through repeated fastening

    Samples: 5,000 screw operations (4,089 normal, 911 faulty)

    Features: Natural degradation patterns, no artificial error induction

    Equipment: Delta PT 40x12 screws, thermoplastic components

    Process: 25 cycles per location, two locations per workpiece

    First published in: HICSS 2024 (West & Deuse, 2024)

    2. s02_surface-friction

    Focus: Surface friction effects on screw driving operations

    Samples: 12,500 screw operations (9,512 normal, 2,988 faulty)

    Features: Eight distinct surface conditions (baseline to mechanical damage)

    Equipment: Delta PT 40x12 screws, thermoplastic components, surface treatment materials

    Process: 25 cycles per location, two locations per workpiece

    First published in: CIE51 2024 (West & Deuse, 2024)

    3. s05_injection-molding-manipulations-upper-workpiece

    Manipulations of the injection molding process with no changes during tightening

    Samples: 2,400 screw operations (2,397 normal, 3 faulty)

    Features: 44 classes in five distinct groups:

    Mold temperature

    Glass fiber content

    Recyclate content

    Switching point

    Injection velocity

    Equipment: Delta PT 40x12 screws, thermoplastic components

    Unpublished, work in progress

    Upcoming Datasets:

    1. s03_screw-error-collection-1 (recorded but unpublished)

    Focus: Various manipulations of the screw driving process

    Features: More than 20 different errors recorded

    First published in: Publication planned

    Status: In preparation

    2. s04_screw-error-collection-2 (recorded but unpublished)

    Focus: Various manipulations of the screw driving process

    Features: 25 distinct errors recorded over the course of a week

    First published in: Publication planned

    Status: In preparation

    3. s06_injection-molding-manipulations-lower-workpiece (recorded but unpublished)

    Manipulations of the injection molding process with no changes during tightening

    Additional scenarios may be added to this collection as they become available.

    Data Format

    Each dataset follows a standardized structure:

    JSON files containing individual screw operation data

    CSV files with operation metadata and labels

    Comprehensive documentation in README files

    Example code for data loading and processing is available in the companion library PyScrew
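
    For orientation, the sketch below reads one scenario using only the Python standard library, assuming the layout implied by the change log further down (a label.csv file and per-class subdirectories under json/). PyScrew remains the recommended loader, and the exact file names should be taken from each scenario's README.

```python
# Hedged sketch using only standard JSON/CSV tooling (the PyScrew library is
# the recommended route). The scenario directory is a placeholder; the
# label.csv and json/ layout follow the change log below.
import csv
import json
from pathlib import Path

scenario_dir = Path("s01_thread-degradation")  # placeholder path

# Operation metadata and labels (CSV)
with open(scenario_dir / "label.csv", newline="") as f:
    labels = list(csv.DictReader(f))
print(f"{len(labels)} labeled operations")

# Individual screw operations (JSON), organized as subdirectories per class
operations = []
for json_file in sorted((scenario_dir / "json").rglob("*.json")):
    with open(json_file) as f:
        operations.append(json.load(f))
print(f"{len(operations)} time series loaded (torque/angle curves)")
```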

    Research Applications

    These datasets are suitable for various research purposes:

    Machine learning model development and validation

    Process monitoring and control systems

    Quality assurance methodology development

    Manufacturing analytics research

    Anomaly detection algorithm benchmarking

    Usage Notes

    All datasets include both normal operations and process anomalies

    Complete time series data for torque, angle, and additional parameters available

    Detailed documentation of experimental conditions and setup

    Data collection procedures and equipment specifications available

    Access and Citation

    These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.

    Related Tools

    We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries. Common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.

    Contact and Support

    For questions, issues, or collaboration interests regarding these datasets, either:

    Open an issue in our GitHub repository PyScrew

    Contact us directly via email

    Acknowledgments

    These datasets were collected and prepared by:

    RIF Institute for Research and Transfer e.V.

    University of Kassel, Institute of Material Engineering

    Technical University Dortmund, Institute for Production Systems

    The preparation and provision of the research was supported by:

    German Ministry of Education and Research (BMBF)

    European Union's "NextGenerationEU" program

    The research is part of this funding program

    More information regarding the research project is available here

    Change Log

    Version Date Features

    v1.1.3 18.02.2025

    • Upload of s05 with injection molding manipulations in 44 classes

    v1.1.2 12.02.2025

    • Change to default names label.csv and README.md in all scenarios

    v1.1.1 12.02.2025

    • Reupload of both s01 and s02 as zip (smaller size) and tar (faster extraction) files

    • Change to the data structure (now organized as subdirectories per class in json/)

    v1.1.0 30.01.2025

    • Initial upload of the second scenario s02_surface-friction

    v1.0.0 24.01.2025

    • Initial upload of the first scenario s01_thread-degradation
  4. Data from: Dataset for Investigating Anomalies in Compute Clusters

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 29, 2023
    Cite
    Hild, Laura (2023). Dataset for Investigating Anomalies in Compute Clusters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10058229
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Jones, Mark
    Schram, Malachi
    Hild, Laura
    Moore, Wesley
    McSpadden, Diana
    Smirni, Evgenia
    Mohammed, Ahmed
    Lu, Yiyang
    Hess, Bryan
    Ren, Jie
    Yasir, Alanazi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180 GB of raw data.

    Background

    Motivated by the goal to develop a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which is used to run an assortment of physics analysis and simulation jobs, where analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff. The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network have the capability to detect anomalies that may propagate to compute nodes executing the same job.

    Usage Notes

    While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the set of affected nodes and the exact start and end times during which they exhibit abnormal behavior are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster. https://doi.org/10.48550/arXiv.2311.16129
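
    Because the anomaly is not precisely labeled, an unsupervised detector is one natural way to use the data. The sketch below fits an Isolation Forest on the May 19-22 records and scores May 23; it is an assumption-laden example, as the CSV path and the timestamp/node column names are placeholders rather than the dataset's actual layout.

```python
# Hedged sketch: unsupervised anomaly detection on per-node metrics with an
# Isolation Forest. The CSV path and metric column names are placeholders;
# the actual dataset layout is described in its documentation.
import pandas as pd
from sklearn.ensemble import IsolationForest

metrics = pd.read_csv("node_metrics.csv", parse_dates=["timestamp"])

# Assume all remaining columns are numeric metric values
feature_cols = [c for c in metrics.columns if c not in ("timestamp", "node")]
train = metrics[metrics["timestamp"] < "2023-05-23"]   # normal behavior (May 19-22)
test = metrics[metrics["timestamp"] >= "2023-05-23"]   # day containing the anomaly

model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(train[feature_cols])

# Lower scores indicate more anomalous records
test = test.assign(anomaly_score=model.decision_function(test[feature_cols]))
print(test.nsmallest(10, "anomaly_score")[["timestamp", "node", "anomaly_score"]])
```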

  5. Data for: The Bystander Affect Detection (BAD) Dataset for Failure Detection...

    • data.qdr.syr.edu
    pdf, tsv, txt, zip
    Updated Sep 25, 2023
    Cite
    Alexandra Bremers; Xuanyu Fang; Natalie Friedman; Wendy Ju (2023). Data for: The Bystander Affect Detection (BAD) Dataset for Failure Detection in HRI [Dataset]. http://doi.org/10.5064/F6TAWBGS
    Explore at:
    Available download formats: zip, tsv, txt, pdf
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Alexandra Bremers; Xuanyu Fang; Natalie Friedman; Wendy Ju
    License

    https://qdr.syr.edu/policies/qdr-restricted-access-conditions

    Description

    Project Overview

    For a robot to repair its own error, it must first know it has made a mistake. One way that people detect errors is from the implicit reactions from bystanders – their confusion, smirks, or giggles clue us in that something unexpected occurred. To enable robots to detect and act on bystander responses to task failures, we developed a novel method to elicit bystander responses to human and robot errors.

    Data Overview

    This project introduces the Bystander Affect Detection (BAD) dataset – a dataset of videos of bystander reactions to videos of failures. This dataset includes 2,452 human reactions to failure, collected in contexts that approximate “in-the-wild” data collection – including natural variances in webcam quality, lighting, and background. The BAD dataset may be requested for use in related research projects. As the dataset contains facial video data of participants, access can be requested along with the presentation of a research protocol and data use agreement that protects participants.

    Data Collection Overview and Access Conditions

    Using 46 different stimulus videos featuring a variety of human and machine task failures, we collected a total of 2,452 webcam videos of human reactions from 54 participants. Recruitment happened through the online behavioral research platform Prolific (https://www.prolific.co/about), where the options were selected to recruit a gender-balanced sample across all countries available. Participants had to use a laptop or desktop. Compensation was set at the Prolific rate of $12/hr, which came down to about $8 per participant for about 40 minutes of participation. Participants agreed that their data can be shared for future research projects and the data were approved to be shared publicly by IRB review. However, considering the fact that this is a machine-learning dataset containing identifiable crowdsourced human subjects data, the research team has decided that potential secondary users of the data must meet the following criteria for the access request to be granted:

    1. Agreement to three usage terms:
       - I will not redistribute the contents of the BAD Dataset
       - I will not use videos for purposes outside of human interaction research (broadly defined as any project that aims to study or develop improvements to human interactions with technology to result in a better user experience)
       - I will not use the videos to identify, defame, or otherwise negatively impact the health, welfare, employment or reputation of human participants
    2. A description of what you want to use the BAD dataset for, indicating any applicable human subjects protection measures that are in place. (For instance, "Me and my fellow researchers at University of X, lab of Y, will use the BAD dataset to train a model to detect when our Nao robot interrupts people at awkward times. The PI is Professor Z. Our protocol was approved under IRB #.")
    3. A copy of the IRB record or ethics approval document, confirming the research protocol and institutional approval.

    Data Analysis

    To test the viability of the collected data, we used the Bystander Reaction Dataset as input to a deep-learning model, BADNet, to predict failure occurrence. We tested different data labeling methods and learned how they affect model performance, achieving precisions above 90%.

    Shared Data Organization

    This data project consists of 54 zipped folders of recorded video data organized by participant, totaling 2,452 videos. The accompanying documentation includes a file containing the text of the consent form used for the research project, an inventory of the stimulus videos used, aggregate survey data, this data narrative, and an administrative readme file.

    Special Notes

    The data were approved to be shared publicly by IRB review. However, considering the fact that this is a machine-learning dataset containing identifiable crowdsourced human subjects data, the research team has decided that potential secondary users of the data must meet specific criteria before they qualify for access. Please consult the Terms tab below for more details and follow the instructions there if interested in requesting access.

  6. WaivOps EDM-HSE: Open Audio Resources for Machine Learning in Music

    • data.niaid.nih.gov
    Updated Oct 11, 2024
    Cite
    Patchbanks (2024). WaivOps EDM-HSE: Open Audio Resources for Machine Learning in Music [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13769543
    Explore at:
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    WaivOps
    Patchbanks
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EDM-HSE Dataset

    EDM-HSE is an open audio dataset containing a collection of code-generated drum recordings in the style of modern electronic house music. It includes 8,000 audio loops recorded in uncompressed stereo WAV format, created using custom audio samples and a MIDI drum dataset. The dataset also comes with paired JSON files containing MIDI note numbers (pitch) and tempo data, intended for supervised training of generative AI audio models.

    Overview

    The EDM-HSE Dataset was developed using an algorithmic framework to generate probable drum notations commonly played by EDM music producers. For supervised training with labeled data, a variational mixing technique was applied to the rendered audio files. This method systematically includes or excludes drum notes, assisting the model in recognizing patterns and relationships between drum instruments, thereby enhancing its generalization capabilities.

    The primary purpose of this dataset is to provide accessible content for machine learning applications in music and audio. Potential use cases include generative music, feature extraction, tempo detection, audio classification, rhythm analysis, drum synthesis, music information retrieval (MIR), sound design and signal processing.

    Specifications

    8,000 audio loops (approximately 17 hours)

    16-bit WAV format

    Tempo range: 120–130 BPM

    Paired label data (WAV + JSON)

    Variational drum patterns

    Subgenre styles (Big room, electro, minimal, classic)

    A JSON file is provided for referencing and converting MIDI note numbers to text labels. You can update the text labels to suit your preferences.
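
    A minimal loading sketch for one loop and its paired label file is shown below. The file naming and the JSON keys ("tempo", "notes") are assumptions; consult the GitHub repository referenced under Additional Info for the real schema.

```python
# Hedged sketch: loading one audio loop and its paired JSON label file.
# The file stem and the JSON keys ("tempo", "notes") are assumptions;
# check the dataset's own documentation for the exact schema.
import json
import wave

loop_id = "edm_hse_0001"  # placeholder file stem

with wave.open(f"{loop_id}.wav", "rb") as wav:
    sample_rate = wav.getframerate()
    duration_s = wav.getnframes() / sample_rate
    print(f"{loop_id}: {sample_rate} Hz, {duration_s:.2f} s, {wav.getnchannels()} channels")

with open(f"{loop_id}.json") as f:
    labels = json.load(f)

tempo = labels.get("tempo")      # expected within the 120-130 BPM range
notes = labels.get("notes", [])  # MIDI note numbers (pitch) per drum hit
print(f"tempo: {tempo} BPM, {len(notes)} note events")
```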

    License

    This dataset was compiled by WaivOps, a crowdsourced music project managed by the sound label company Patchbanks. All recordings have been compiled by verified sources for copyright clearance.

    The EDM-HSE dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

    Additional Info

    Please note that this dataset has not been fully reviewed and may contain minor notational errors or audio defects.

    For audio examples or more information about this dataset, please refer to the GitHub repository.

  7. Probabilistic AI: A New Approach to Artificial Intelligence (Forecast)

    • kappasignal.com
    Updated May 27, 2023
    + more versions
    Cite
    KappaSignal (2023). Probabilistic AI: A New Approach to Artificial Intelligence (Forecast) [Dataset]. https://www.kappasignal.com/2023/05/probabilistic-ai-new-approach-to.html
    Explore at:
    Dataset updated
    May 27, 2023
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    Probabilistic AI: A New Approach to Artificial Intelligence

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
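
    As a small, hedged illustration of the feature engineering listed above, the sketch below derives a 20-day simple moving average and a basic 14-day RSI from a daily price table; the file name and the "close" column are assumptions about how such data might be laid out.

```python
# Hedged sketch of basic feature engineering on daily price data:
# a simple moving average and a standard 14-day RSI. The input file
# and column names ("date", "close") are assumptions.
import pandas as pd

prices = pd.read_csv("daily_prices.csv", parse_dates=["date"]).sort_values("date")

# 20-day simple moving average of the closing price
prices["sma_20"] = prices["close"].rolling(window=20).mean()

# 14-day Relative Strength Index (simple-average variant)
delta = prices["close"].diff()
gain = delta.clip(lower=0).rolling(window=14).mean()
loss = (-delta.clip(upper=0)).rolling(window=14).mean()
prices["rsi_14"] = 100 - 100 / (1 + gain / loss)

print(prices[["date", "close", "sma_20", "rsi_14"]].tail())
```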

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data

  8. IoT Monitoring Dataset of Water Quality and Tilapia (Oreochromis niloticus)...

    • data.mendeley.com
    Updated Nov 5, 2024
    + more versions
    Cite
    Rubén Baena-Navarro (2024). IoT Monitoring Dataset of Water Quality and Tilapia (Oreochromis niloticus) Health in Aquaculture Ponds in Montería, Colombia (2024) [Dataset]. http://doi.org/10.17632/3g2b4sh65m.1
    Explore at:
    Dataset updated
    Nov 5, 2024
    Authors
    Rubén Baena-Navarro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Colombia, Montería
    Description

    This dataset contains six months of water quality and tilapia (Oreochromis niloticus) health monitoring data collected from aquaculture ponds in Montería, Colombia. Using an IoT-based monitoring system, critical parameters such as dissolved oxygen (DO), pH, water temperature, and turbidity were recorded. Fish health indicators, including average fish weight and survival rate, are also included. Data was collected from January to June 2024, with hourly readings to capture daily fluctuations and ensure comprehensive monitoring of aquaculture conditions and tilapia well-being.

    Included Files

    1. Data Model IoTMLCQ 2024.xlsx
       • Contains sensor readings and fish health data collected over six months.
       • Columns:
         • Datetime: Date and time of each reading.
         • Month: Data collection month (January to June).
         • Average Fish Weight (g): Average weight of the tilapia fish in grams.
         • Survival Rate (%): Percentage of fish survival during the monitoring period.
         • Disease Occurrence (Cases): Number of disease cases observed.
         • Temperature (°C): Water temperature readings.
         • Dissolved Oxygen (mg/L): Levels of dissolved oxygen in the water.
         • pH: Water pH values.
         • Turbidity (NTU): Water turbidity measured in Nephelometric Turbidity Units (NTU).
         • Oxygenation Automatic: Indicates if automatic oxygenation was applied (Yes/No).
         • Oxygenation Interventions: Oxygenation interventions applied (Yes/No).
         • Corrective Interventions: Number of corrective measures taken.
         • Thermal Risk Index: Indicates if the thermal risk is "High" or "Normal."
         • Low Oxygen Alert: Shows "Critical" if DO levels are below 5 mg/L, otherwise "Safe."
         • Health Status: Fish health status, showing "At Risk" or "Stable" based on thermal and oxygen risk alerts.

    Data Collection Method

    Data was collected using IoT sensors strategically placed in the aquaculture ponds. Readings were taken every hour throughout the monitoring period. This dataset provides valuable insights into the relationship between water quality parameters and the health of tilapia (Oreochromis niloticus) in controlled aquaculture conditions.

    Usage Notes

    • This dataset is useful for research in aquaculture management, water quality monitoring, and predictive modeling of fish health and growth.
    • Missing data due to sensor or communication failures were addressed using interpolation methods.
    • Regular sensor calibrations were performed to ensure accuracy in the collected data.
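
    The sketch below shows how the "Low Oxygen Alert" rule described in the column list above (Critical when dissolved oxygen is below 5 mg/L) could be recomputed from the spreadsheet. The column names follow the file description, and reading the .xlsx file assumes an engine such as openpyxl is installed.

```python
# Hedged sketch: recomputing the "Low Oxygen Alert" flag described above
# (Critical when dissolved oxygen falls below 5 mg/L, otherwise Safe).
# Column names follow the file description; reading .xlsx requires openpyxl.
import pandas as pd

df = pd.read_excel("Data Model IoTMLCQ 2024.xlsx")

df["Low Oxygen Alert (derived)"] = df["Dissolved Oxygen (mg/L)"].apply(
    lambda do: "Critical" if do < 5 else "Safe"
)

# Share of hourly readings in the critical range, per month
critical_share = (
    df.assign(is_critical=df["Dissolved Oxygen (mg/L)"] < 5)
      .groupby("Month")["is_critical"]
      .mean()
)
print(critical_share)
```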

  9. Data from: OpenChart-SE: A corpus of artificial Swedish electronic health...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, txt
    Updated Jul 15, 2024
    Cite
    Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits (2024). OpenChart-SE: A corpus of artificial Swedish electronic health records for imagined emergency care patients written by physicians in a crowd-sourcing project [Dataset]. http://doi.org/10.5281/zenodo.7499831
    Explore at:
    Available download formats: txt, csv, bin, pdf
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johanna Berg; Carl Ollvik Aasa; Björn Appelgren Thorell; Sonja Aits
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.

    Dataset content

    OpenChart-SE, version 1 corpus (txt files and dataset.csv)

    The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5 as 1-4 were test cases that were not suitable for publication). The EHRs are available in two formats, structured as a .csv file and as separate textfiles for annotation. Note that flaws in the data were not cleaned up so that it simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.
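
    A minimal loading sketch for both published formats is given below; the CSV column names are not described here, so the sketch only inspects the shape, and the text-file naming pattern is an assumption.

```python
# Hedged sketch: loading the corpus in its two published formats.
# The CSV column names are not specified here, so only the shape is shown;
# the text-file location and naming pattern are assumptions.
from pathlib import Path
import pandas as pd

# Structured format
charts = pd.read_csv("dataset.csv")
print(charts.shape, list(charts.columns)[:5])

# Separate text files for annotation (numbering starts at 5, as noted above)
texts = {p.stem: p.read_text(encoding="utf-8") for p in sorted(Path(".").glob("*.txt"))}
print(f"{len(texts)} chart text files loaded")
```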

    Codebook.xlsx

    The codebook contains information about each variable used. It is in XLSForm format, which can be re-used in several different applications for data collection.

    suppl_data_1_openchart-se_form.pdf

    OpenChart-SE mock emergency care EHR form.

    suppl_data_3_openchart-se_dataexploration.ipynb

    This jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus.

    More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).

  10. AI-based Clinical Trials Solution Provider Market Report | Global Forecast...

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 5, 2024
    Cite
    Dataintelo (2024). AI-based Clinical Trials Solution Provider Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-based-clinical-trials-solution-provider-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    Authors
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI-based Clinical Trials Solution Provider Market Outlook



    The global AI-based Clinical Trials Solution Provider market size was valued at USD 1.5 billion in 2023 and is projected to reach USD 7.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. The rapid growth of this market can be attributed to the increasing adoption of artificial intelligence (AI) in clinical trials to enhance data accuracy, reduce trial times, and cut costs. The integration of AI in clinical trials is revolutionizing the pharmaceutical and biotechnology industries by providing more efficient, cost-effective, and reliable solutions.



    One of the primary growth factors for this market is the rising complexity and cost of traditional clinical trials. AI-based solutions offer a significant reduction in the time and resources required for clinical trials by automating various processes such as patient recruitment, data collection, and data analysis. This not only accelerates the trial process but also minimizes human errors, thus enhancing the reliability of the results. Moreover, the increasing incidence of chronic diseases and the subsequent rise in the number of clinical trials are further driving the demand for AI-based solutions.



    Another crucial growth factor is the growing awareness and acceptance of AI technology within the healthcare sector. As more pharmaceutical companies and contract research organizations (CROs) recognize the benefits of AI, there is an increasing willingness to invest in these technologies. AI can analyze vast amounts of data much faster and more accurately than traditional methods, leading to more effective and personalized treatments. Additionally, regulatory bodies are beginning to support the use of AI in clinical trials, which is further fueling market growth.



    The advancements in AI technology itself are also a significant growth driver. Innovations such as machine learning, natural language processing, and deep learning are continually being refined and applied to clinical trials. These technologies can predict patient outcomes more accurately, identify suitable candidates for trials more efficiently, and provide valuable insights from unstructured data. Consequently, the continuous improvement in AI technologies is expected to sustain market growth in the coming years.



    Regionally, North America is expected to dominate the market, followed by Europe and the Asia Pacific. The robust healthcare infrastructure, high adoption rate of advanced technologies, and presence of major pharmaceutical companies in North America are key factors contributing to its leading position. Europe is also a significant market due to its strong emphasis on research and development (R&D) and favorable regulatory environment. Meanwhile, the Asia Pacific region is anticipated to witness the highest growth rate due to increasing investments in healthcare infrastructure and the growing number of clinical trials in emerging economies like China and India.



    Component Analysis



    The AI-based Clinical Trials Solution Provider market by component is segmented into software and services. The software segment is expected to hold a substantial share of the market, driven by the increasing demand for advanced analytics and predictive modeling tools. These software solutions are designed to streamline various aspects of clinical trials, from patient recruitment to data analysis, thereby reducing trial timelines and costs. The rapid adoption of cloud-based solutions is further propelling the growth of this segment, enabling real-time data access and enhanced collaboration among stakeholders.



    Within the software segment, predictive analytics tools are gaining significant traction. These tools leverage machine learning algorithms to predict patient outcomes and identify potential risks, thereby enabling more informed decision-making. Natural language processing (NLP) software is another critical component, used to extract valuable insights from unstructured data such as clinical notes and research papers. The continuous advancements in these technologies are expected to drive substantial growth in the software segment over the forecast period.



    The services segment, comprising consulting, implementation, and support services, is also poised for significant growth. As pharmaceutical companies and CROs increasingly adopt AI-based solutions, the demand for expert consulting services to guide them through the implementation process is rising. These services ensure that the AI solutions are effectively integrated into existin

  11. Forensic Toolkit Dataset

    • kaggle.com
    Updated May 26, 2025
    + more versions
    Cite
    SUNNY THAKUR (2025). Forensic Toolkit Dataset [Dataset]. https://www.kaggle.com/datasets/cyberprince/forensic-toolkit-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SUNNY THAKUR
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Forensic Toolkit Dataset

    Overview

    The Forensic Toolkit Dataset is a comprehensive collection of 300 digital forensics and incident response (DFIR) tools, designed for training AI models, supporting forensic investigations, and enhancing cybersecurity workflows. The dataset includes both mainstream and unconventional tools, covering disk imaging, memory analysis, network forensics, mobile forensics, cloud forensics, blockchain analysis, and AI-driven forensic techniques. Each entry provides detailed information about the tool's name, commands, usage, description, supported platforms, and official links, making it a valuable resource for forensic analysts, data scientists, and machine learning engineers.

    Dataset Description

    The dataset is provided in JSON Lines (JSONL) format, with each line representing a single tool as a JSON object. It is optimized for AI training, data analysis, and integration into forensic workflows.

    Schema

    Each entry contains the following fields:

    id: Sequential integer identifier (1–300).
    tool_name: Name of the forensic tool.
    commands: List of primary commands or usage syntax (if applicable; GUI-based tools noted).
    usage: Brief description of how the tool is used in forensic or incident response tasks.
    description: Detailed explanation of the tool’s purpose, capabilities, and forensic applications.
    link: URL to the tool’s official website or documentation (verified as of May 26, 2025).
    system: List of supported platforms (e.g., Linux, Windows, macOS, Android, iOS, Cloud).
    
    
    Sample Entry
    {
     "id": 1,
     "tool_name": "The Sleuth Kit (TSK)",
     "commands": ["fls -r -m / image.dd > bodyfile", "ils -e image.dd", "icat image.dd 12345 > output.file", "istat image.dd 12345"],
     "usage": "Analyze disk images to recover files, list file metadata, and create timelines.",
     "description": "Open-source collection of command-line tools for analyzing disk images and file systems (NTFS, FAT, ext). Enables recovery of deleted files, metadata examination, and timeline generation.",
     "link": "https://www.sleuthkit.org/sleuthkit/",
     "system": ["Linux", "Windows", "macOS"]
    }
    

    Dataset Structure

    Total Entries: 300

    Content Focus:
    • Mainstream tools (e.g., The Sleuth Kit, FTK Imager).
    • Unconventional tools (e.g., IoTSeeker, Chainalysis Reactor, DeepCase).
    • Specialized areas: IoT, blockchain, cloud, mobile, and AI-driven forensics.

    Purpose

    The dataset is designed for:

    • AI Training: Fine-tuning machine learning models for forensic tool recommendation, command generation, or artifact analysis.
    • Forensic Analysis: Reference for forensic analysts to identify tools for specific investigative tasks.
    • Cybersecurity Research: Supporting incident response, threat hunting, and vulnerability analysis.
    • Education: Providing a structured resource for learning about DFIR tools and their applications.

    Usage

    Accessing the Dataset

    Download the JSONL files from the repository. Each file can be parsed using standard JSONL libraries (e.g., jsonlines in Python, jq in Linux). Combine files for a complete dataset or use individual segments as needed.

```python
# Example: parsing the dataset with Python
import json

with open('forensic_toolkit_dataset_1_50.jsonl', 'r') as file:
    for line in file:
        tool = json.loads(line)
        print(f"Tool: {tool['tool_name']}, Supported Systems: {tool['system']}")
```
    Applications
    
    AI Model Training: Use the dataset to train models for predicting tool usage based on forensic tasks or generating command sequences.
    Forensic Workflows: Query the dataset to select tools for specific platforms (e.g., Cloud, Android) or tasks (e.g., memory analysis).
    Data Analysis: Analyze tool distribution across platforms or forensic categories using data science tools (e.g., Pandas, R).
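
    As a hedged illustration of the platform-distribution analysis mentioned above, the following sketch counts tools per supported platform, reusing the same placeholder file segment as the parsing example:

```python
# Hedged sketch of the platform-distribution analysis mentioned above,
# building on the JSONL parsing example. The file name is the same
# placeholder segment used earlier.
import json
from collections import Counter

platform_counts = Counter()
with open('forensic_toolkit_dataset_1_50.jsonl', 'r') as file:
    for line in file:
        tool = json.loads(line)
        platform_counts.update(tool['system'])  # 'system' is a list of platforms

for platform, count in platform_counts.most_common():
    print(f"{platform}: {count} tools")
```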
    
    Contribution Guidelines
    We welcome contributions to expand or refine the dataset. To contribute:
    
    Fork the repository.
    Add new tools or update existing entries in JSONL format, ensuring adherence to the schema.
    Verify links and platform compatibility as of the contribution date.
    Submit a pull request with a clear description of changes.
    Avoid duplicating tools from existing entries (check IDs 1–300).
    
    Contribution Notes
    
    Ensure tools are forensically sound (preserve evidence integrity, court-admissible where applicable).
    Include unconventional or niche tools to maintain dataset diversity.
    Validate links and commands against official documentation.
    
    License
    This dataset is licensed under the MIT License. See the LICENSE file for details.
    Acknowledgments
    
    Inspired by forensic toolkits and resources from ForensicArtifacts.com, SANS, and open-source communities.
    Thanks to contributors for identifying unique and unconventional DFIR tools.
    
    Contact
    For issues, suggestions, or inquiries, please open an issue on the repository or contact the maintainers at sunny48445@gmail.com.
    
  12. Replication Package for 'How do Machine Learning Models Change?'

    • zenodo.org
    Updated Nov 13, 2024
    Cite
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández (2024). Replication Package for 'How do Machine Learning Models Change?' [Dataset]. http://doi.org/10.5281/zenodo.14128997
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.

    Our research addresses three main aspects:

    1. Categorization of Commit Changes: We classified over 200,000 commits on HF using an extended ML change taxonomy, providing a detailed breakdown of change types and their distribution across models.
    2. Analysis of Commit Sequences: We examined the sequence and dependencies of commit types using Bayesian networks to identify temporal patterns and common progression paths in model changes.
    3. Release Analysis: We investigated the distribution and evolution of release types, analyzing how model attributes and metadata change across successive releases.

    This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.

    Data Collection and Preprocessing

    Data Collection

    We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:

    • Model Information: Details of over 380,000 models, including dataset sizes, training hardware, evaluation metrics, model file sizes, number of downloads and likes, tags, and the raw text of model cards.
    • Commit Histories: Comprehensive commit details, including commit messages, dates, authors, and the list of files edited in each commit.
    • Release Information: Information on model releases marked by tags in their repositories.

    To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
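    A hedged sketch of this kind of extraction with the huggingface_hub client is given below; the method and attribute names may differ between library versions, and the paper's actual extraction scripts are in the Collection notebooks.

    ```python
    # Hedged sketch: list a few models and fetch their commit histories via the Hugging Face Hub API.
    # Attribute names (id, commit_id, created_at, title) may vary between huggingface_hub versions.
    from huggingface_hub import HfApi

    api = HfApi()

    # Small sample for illustration; the study's extraction covered the whole platform.
    for model in api.list_models(limit=3, full=True):
        print(model.id)
        for commit in api.list_repo_commits(model.id):
            print("  ", commit.commit_id[:8], commit.created_at, commit.title)
    ```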

    Data Preprocessing

    Commit Diffs

    We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
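    A minimal sketch of such a key-level diff between two versions of a JSON configuration file (illustrative only, not the exact script used in the package):

    ```python
    # Minimal sketch: added / deleted / updated keys between two versions of config.json.
    import json

    def diff_json_keys(old_text: str, new_text: str):
        old, new = json.loads(old_text), json.loads(new_text)
        added = set(new) - set(old)
        deleted = set(old) - set(new)
        updated = {k for k in set(old) & set(new) if old[k] != new[k]}
        return added, deleted, updated

    before = '{"hidden_size": 768, "num_layers": 12, "dropout": 0.1}'
    after  = '{"hidden_size": 1024, "num_layers": 12, "activation": "gelu"}'

    added, deleted, updated = diff_json_keys(before, after)
    print("added:", added, "deleted:", deleted, "updated:", updated)
    ```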

    Commit Classification

    We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
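    The agreement check between LLM-produced labels and manual labels can be reproduced along these lines; a sketch with scikit-learn, using made-up labels:

    ```python
    # Sketch: validate LLM-based commit labels against manual labels with Cohen's kappa.
    from sklearn.metrics import cohen_kappa_score

    manual_labels = ["add layer", "update docs", "fix config", "add layer", "update docs"]
    llm_labels    = ["add layer", "update docs", "fix config", "add layer", "fix config"]

    kappa = cohen_kappa_score(manual_labels, llm_labels)
    print(f"Cohen's kappa: {kappa:.2f}")  # the paper iterated until kappa >= 0.9
    ```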

    Model Metadata

    We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.

    Folder Structure

    The replication package is organized as follows:

    - code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.

    • Collection/: Contains two Jupyter notebooks for data collection:
      • HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
      • HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
    • Preprocessing/: Contains preprocessing scripts:
      • HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
      • HFCommitsPreprocessing.ipynb: Processes commit data, including:
        • Retrieval of diff information between commits.
        • Classification of commits following Bhatia et al.'s taxonomy using LLMs.
        • Extension and adaptation of the final commits dataset, including additional variables for Bayesian network analysis.
      • HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
    • Analysis/: Contains three Jupyter notebooks with the analysis for each research question:
      • RQ1_Analysis.ipynb: Analysis for RQ1.
      • RQ2_Analysis.ipynb: Analysis for RQ2.
      • RQ3_Analysis.ipynb: Analysis for RQ3.

    - datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.

    • Main Datasets:
      • HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
      • HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
      • HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
      • model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
      • These datasets correspond to the following dataset splits:
        • +200,000 commits from 50,000 models: Used for RQ1. Provides a broad overview of commit types and patterns across diverse models.
        • +200,000 commits from 10,000 models: Used for RQ2. Focuses on models with at least 10 commits for detailed evolutionary study.
        • +1,200 releases from 127 models: Used for RQ3.1, RQ3.2, and RQ3.3. Facilitates the investigation of release patterns and their evolution.
        • Metadata of 173 releases from 27 models: Used for RQ3.4. Analyzes the evolution of model parameters and configurations.
    • Additional Datasets:
      • HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
      • HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
      • Auxiliary datasets generated during processing are also included to facilitate reproduction of specific parts of the code without time-consuming steps.

    - metadata/: Contains the tags_metadata.yaml file used during preprocessing.

    - models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.

    - requirements.txt: Lists the required Python packages to set up the environment and run the code.

    Setup and Execution

    Prerequisites

    • Python 3.10.11 or later.
    • Jupyter Notebook or JupyterLab.

    Installation

    1. Download and extract the replication package.
    2. Create a virtual environment (recommended):
       python -m venv venv
       source venv/bin/activate   # On Windows, use venv\Scripts\activate
    3. Install the required packages:
       pip install -r requirements.txt

    Notes

    - LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.

    - Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.

    - Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.

    Additional Information

    Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.

    This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.

  13. Data for: Automated Generation of Structure Datasets for Machine Learning...

    • edmond.mpg.de
    application/gzip +3
    Updated Jun 10, 2025
    Cite
    Marvin Poul; Marvin Poul (2025). Data for: Automated Generation of Structure Datasets for Machine Learning Potentials and Alloys [Dataset]. http://doi.org/10.17617/3.DYLLSS
    Explore at:
    text/comma-separated-values(1657), application/x-gzip(1616269), application/x-gzip(86419116), application/x-gzip(9050356), application/x-gzip(1960106), application/x-gzip(4819324), application/x-gzip(2836478), bin(86419036), application/x-gzip(1425372), application/gzip(3716967195)Available download formats
    Dataset updated
    Jun 10, 2025
    Dataset provided by
    Edmond
    Authors
    Marvin Poul; Marvin Poul
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    DFT Training Data for fitting Moment Tensor Potentials for the system Mg/Al/Ca. See https://github.com/eisenforschung/mgalca-mtp-data for further notes and usage examples.

  14. 100K+ Text Rich Images | AI Training Data | Annotated imagery data for AI |...

    • datarade.ai
    Updated Mar 18, 2024
    Cite
    Data Seeds (2024). 100K+ Text Rich Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/100k-text-rich-images-ai-training-data-annotated-imagery-data-seeds
    Explore at:
    .bin, .csv, .json, .sql, .txt, .xmlAvailable download formats
    Dataset updated
    Mar 18, 2024
    Dataset authored and provided by
    Data Seeds
    Area covered
    Bonaire, Réunion, Montenegro, Senegal, Mongolia, Côte d'Ivoire, Italy, Turks and Caicos Islands, Papua New Guinea, Timor-Leste
    Description

    This dataset features over 100,000 high-quality images containing visible, naturally occurring text, sourced from photographers worldwide. Designed to support AI and machine learning applications, it offers a richly annotated and globally diverse collection ideal for training models in OCR, scene text recognition, and multimodal understanding.

    Key Features:

    1. Comprehensive Metadata: Each image includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Pre-annotations include object detection, scene classification, and text presence. Many images contain metadata on language type, script, and text region properties. Popularity metrics derived from user engagement on our proprietary platform are also included.

    2. Unique Sourcing Capabilities: Images are sourced through a gamified photography platform that runs themed competitions, in this case focused on capturing text in real-world environments. This ensures a steady flow of fresh, relevant, and contextually diverse submissions. Custom datasets can be sourced within 72 hours, including requests for specific languages, signage types, or visual environments (e.g., storefronts, menus, documents, public transport).

    3. Global Diversity: Contributors from over 100 countries provide a vast array of languages, scripts (Latin, Cyrillic, Arabic, Chinese, etc.), and contexts. The dataset includes urban signage, handwritten notes, printed posters, digital displays, packaging, street graffiti, books, and more, offering a robust training set for global OCR and text-detection models.

    4. High-Quality Imagery: Resolution varies from standard to high-definition, supporting a range of computer vision tasks. The collection includes a mix of candid, environmental shots and deliberate, close-up captures of text, enabling both practical OCR training and stylistic or multimodal research.

    5. Popularity Scores: Each image is assigned a popularity score based on performance in our GuruShots photography competitions. This provides additional insight into user-perceived relevance and aesthetic appeal, useful for building models around user engagement, content filtering, or recommendation systems.

    6. AI-Ready Design: Optimized for AI workflows, this dataset supports applications in OCR, text spotting, translation, semantic understanding, and cross-modal retrieval. It integrates smoothly into popular machine learning frameworks and pipelines.

    7. Licensing & Compliance: The dataset is fully compliant with data privacy regulations and comes with clear, transparent licensing for commercial and academic use. All images have appropriate contributor agreements and usage rights in place.

    Use Cases:

    1. Training OCR and scene text recognition models across multiple scripts and environments.
    2. Powering AI for multilingual translation, navigation, and AR applications.
    3. Supporting retail and logistics models through packaging and signage text extraction.
    4. Enhancing multimodal AI systems that combine visual and textual understanding.
    5. Enabling research in typography, linguistics, and global textual design.

    This dataset offers a rich, AI-optimized collection of real-world, text-containing imagery — diverse in content, language, and style — with customization options available for your specific needs. Contact us to request samples or a tailored delivery.

  15. Vocalizations in the plains zebra (Equus quagga)

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jun 22, 2024
    Cite
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer (2024). Vocalizations in the plains zebra (Equus quagga) [Dataset]. http://doi.org/10.5061/dryad.v9s4mw73w
    Explore at:
    Dataset updated
    Jun 22, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer
    Description

    Acoustic signals are vital in animal communication, and quantifying these signals is fundamental for understanding animal behaviour and ecology. Vocalizations can be classified into acoustically and functionally or contextually distinct categories, but establishing these categories can be challenging. Newly developed methods, such as machine learning, can provide solutions for classification tasks. The plains zebra is known for its loud and specific vocalizations, yet limited knowledge exists on the structure and information content of these vocalizations. In this study, we employed both feature-based and spectrogram-based algorithms, incorporating supervised and unsupervised machine learning methods, to enhance robustness in categorizing zebra vocalization types. Additionally, we implemented a permuted discriminant function analysis (pDFA) to examine the individual identity information contained in the identified vocalization types. The findings revealed at least four distinct ...

    Data collection and sampling: We collected data in three locations in Denmark and South Africa: 1) 10 months between December 2020 and July 2021 and between September and December 2021 at Pilanesberg National Park (hereafter "PNP"), South Africa, covering both the dry season (i.e. from May to September) and the wet season (i.e. from October to April) (1); 2) 16 days between May and June 2019 and 33 days between February and May 2022 at Knuthenborg Safari Park (hereafter "KSP"), Denmark, covering the periods both before the park's opening to tourists (i.e. from November to March) and after (i.e. from April to October); 3) 4 days in August 2019 at Givskud Zoo (hereafter "GKZ"), Denmark. For all places and periods, three types of data were collected as follows: 1) pictures were taken of each individual from both sides using a camera (Nikon COOLPIX P950); 2) contexts of vocal production were recorded either through notes (in the first period of KSP and in GKZ) or videos (in the second period of KS...

    Vocalizations in the plains zebra (Equus quagga)

    Data and Scripts

    • 1_Praat_Script_Zebra_Vocalisations.praat: This script is used to extract vocal features using the software Praat.
    • 2_Data_Script_Vocal_Repertoire.zip: This archive contains data and scripts for analyzing the vocal repertoire. It includes two folders:
      • Feature_based_analyses:
      • The dataset "feature_based_input.csv" is the input for both scripts in this folder.
      • "feature_based_supervised_classification_xgboost.ipynb" is used for supervised analysis.
      • "feature_based_unsupervised_clustering.ipynb" is used for unsupervised analysis.
      • Spectrogram_based_analyses:
      • The "spectrogram_based_classification" folder contains the input data "calltype_spec.npz" and "calltype_y.csv", as well as the notebook script "spectrogram_based_classification_cnn.ipynb" for supervised machine learning analysis.
      • The "spectrogram_based_clustering" folder contains subfolders "audio", "data"...
  16. NBA Player Dataset & Prediction Model Artifacts

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png +2
    Updated Apr 28, 2025
    Cite
    Burak Baltali; Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43
    Explore at:
    json, png, csv, bin, txt, text/markdownAvailable download formats
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Burak Baltali; Burak Baltali
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

    Brief overview of Files

    1. End-of-season box-score aggregates (2012–13 through 2023–24), split into train/test sets.

    2. The Jupyter notebook (Analysis.ipynb); all the code can be executed there.

    3. The trained model binary (nba_model.pkl), a serialized Random Forest model artifact (a loading sketch is given under Additional Notes below).

    4. Evaluation plots (LAL vs. whole league) for regular-season and playoff predictions, provided as PNG outputs.

    5. FAIR4ML metadata (fair4ml_metadata.jsonld); see README.md and abbreviations.txt for file details.

    6. For further information, see the GitHub repository (link below).

    File Details

    Notebook

    Analysis.ipynb: Contains the graphical output of the training and testing runs.

    Training/Test CSV Data

    • regular_train.csv: Regular-season training data, covering the 2012-2013 through 2021-2022 seasons. PID: 4421e56c-4cd3-4ec1-a566-a89d7ec0bced
    • regular_test.csv: Regular-season test data, covering the 2022-2023 season. PID: f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
    • playoff_train.csv: Playoff training data, covering the 2012-2013 through 2022-2023 seasons. PID: bcb3cf2b-27df-48cc-8b76-9e49254783d0
    • playoff_test.csv: Playoff test data, covering the 2023-2024 season. PID: de37d568-e97f-4cb9-bc05-2e600cc97102

    Others

    abbrevations.txt: Lists the fundamental abbreviations used for the columns in the CSV data.

    Additional Notes

    Raw csv files are taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)

    Some preprocessing is required before uploading into DBRepo.

    Plots have also been uploaded as an output for visual purposes.

    A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
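    The following hedged sketch shows how the serialized Random Forest could be loaded and scored on the regular-season test split. The feature and target column names are assumptions; consult abbrevations.txt, README.md, and the GitHub repository for the real schema.

    ```python
    # Hedged sketch: load the trained Random Forest and score the regular-season test split.
    # Column names below are assumptions; consult abbrevations.txt / README.md for the real schema.
    import pickle
    import pandas as pd

    with open("nba_model.pkl", "rb") as f:
        model = pickle.load(f)

    test = pd.read_csv("regular_test.csv")

    # Assumed feature/target split: box-score aggregates as features, points as the target.
    feature_cols = ["rebounds", "steals", "turnovers", "three_pt_attempts", "fg_attempts"]
    X_test = test[feature_cols]
    y_true = test["points"]

    y_pred = model.predict(X_test)
    print("Mean absolute error:", abs(y_pred - y_true).mean())
    ```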

  17. Bangladeshi Currency (Coins & Notes) Recognition Dataset

    • data.mendeley.com
    Updated Jan 20, 2025
    Cite
    Shuvo Kumar Basak Shuvo (2025). Bangladeshi Currency (Coins & Notes) Recognition Dataset [Dataset]. http://doi.org/10.17632/xn44yz596n.2
    Explore at:
    Dataset updated
    Jan 20, 2025
    Authors
    Shuvo Kumar Basak Shuvo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The Bangladeshi Currency (Coins & Notes) Recognition Dataset is a comprehensive collection of high-quality images of Bangladeshi coins and banknotes. It is designed to facilitate machine learning and computer vision applications for currency recognition, classification, and detection.

    This dataset is organized into various denominations of coins and notes, with each folder representing a specific currency denomination. Each folder contains 10,000 images, providing a total of 100,000 images in the dataset.

    The images have been resized to a uniform dimension of 256x256 pixels, ensuring consistency and enabling easy integration into machine learning workflows. The images are saved in JPEG format to optimize storage and speed for large-scale training tasks.

    Currency Denominations Included: 10 Poisha (small denomination coin), 1 Poisha, 1 Taka, 25 Poisha, 2 Taka, 50 Poisha, 5 Poisha, 5 Taka, Commemorative Coins, and Demonetized Notes.

    Features:

    • Image Size: All images have been resized to 256x256 pixels (width x height).
    • Image Format: JPEG.
    • Total Images: 100,000 (10,000 images per folder, one per denomination).
    • Categories: Each folder corresponds to a unique denomination of currency. The folder names are aligned with the specific denominations, such as 10_Poisha, 1_Taka, 5_Taka, etc.

    Objective: This dataset is ideal for training and evaluating models for the following tasks:

    • Currency Classification: Identifying the denomination of a given image of a coin or banknote.
    • Currency Recognition: Detecting and recognizing specific Bangladeshi coins and notes from real-world images.
    • Coin and Note Detection: Identifying and classifying multiple coins and notes in a single image.

    Possible Use Cases:

    • Currency detection systems: Automated systems in ATMs, vending machines, or cash counting machines that recognize Bangladeshi coins and banknotes.
    • Banknote and Coin Classification: Machine learning models that classify various denominations of coins and notes for digital payment applications.
    • Real-world Applications: Currency recognition for mobile apps, kiosks, or any system that needs to automatically recognize Bangladeshi currency.
    • Research in Currency Image Recognition: Researchers working on currency recognition problems using computer vision techniques.

    Sources: Collected from https://www.bb.org.bd/currency plus the authors' own images.

    Note for Researchers Using the Dataset: This dataset was created by Shuvo Kumar Basak. If you use this dataset for research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.
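    Because the images are organised in one folder per denomination, they can be loaded directly with standard folder-based dataset utilities. A minimal sketch using torchvision follows; the root path "bd_currency_dataset/" is an assumption.

    ```python
    # Minimal sketch: load the folder-per-denomination images with torchvision's ImageFolder.
    # "bd_currency_dataset/" is an assumed root path containing folders such as 10_Poisha, 1_Taka, 5_Taka, ...
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize((256, 256)),   # images are already 256x256; kept for safety
        transforms.ToTensor(),
    ])

    dataset = datasets.ImageFolder("bd_currency_dataset/", transform=transform)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    print("Classes:", dataset.classes)   # folder names become class labels
    images, labels = next(iter(loader))
    print(images.shape, labels[:8])
    ```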

  18. Indian Currency Notes Classifier

    • kaggle.com
    Updated Jul 5, 2020
    Cite
    Gaurav Rajesh Sahani (2020). Indian Currency Notes Classifier [Dataset]. https://www.kaggle.com/gauravsahani/indian-currency-notes-classifier/notebooks
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 5, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Gaurav Rajesh Sahani
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    India
    Description

    This dataset contains 195 images across 7 categories of Indian currency notes. The data was collected from Google Images, Shutterstock, and Flickr. You can use it to recognize the type of Indian note from a photo or in real-time applications.

    This dataset is intended for image classification and covers 7 distinct types of Indian currency notes. The images are not reduced to any single size and may have different proportions.


    The distinct types of Indian currency notes are: 1) Ten Rupee Notes, 2) Twenty Rupee Notes, 3) Fifty Rupee Notes, 4) Hundred Rupee Notes, 5) Two Hundred Rupee Notes, 6) Five Hundred Rupee Notes, and 7) Two Thousand Rupee Notes.

    Do download and explore this dataset. If you like it, please upvote. Thank you.

  19. Dataset of Spoilt Banknotes of India (Rupees)

    • data.mendeley.com
    Updated Jul 17, 2023
    + more versions
    Cite
    Vidula Meshram (2023). Dataset of Spoilt Banknotes of India (Rupees) [Dataset]. http://doi.org/10.17632/jh6979fg2t.4
    Explore at:
    Dataset updated
    Jul 17, 2023
    Authors
    Vidula Meshram
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Accurate currency recognition and classification is one of the most challenging tasks for visually impaired people. Because damaged banknotes are not accepted by vendors during financial transactions, it is necessary to distinguish between spoilt and unspoilt banknotes. With this objective, we created the Spoilt Indian (Rupees) banknotes dataset. The dataset consists of a total of 5125 high-quality images (2584 old banknotes and 2541 new banknotes). A mobile phone's rear camera was used to take the images of spoilt Indian banknotes. The dataset consists of 8 classes, namely Spoilt New 10 Rupees, Spoilt Old 10 Rupees, Spoilt New 20 Rupees, Spoilt Old 20 Rupees, Spoilt New 50 Rupees, Spoilt Old 50 Rupees, Spoilt New 100 Rupees, and Spoilt Old 100 Rupees. Banknote images that were soiled, mutilated, holed, or torn were considered for dataset creation. The images were taken against dark, illuminated, and cluttered backgrounds.

  20. Machine Learning Predicts QQQ to Increase in Value by 5% in the Next 3...

    • kappasignal.com
    Updated Jun 2, 2023
    Cite
    KappaSignal (2023). Machine Learning Predicts QQQ to Increase in Value by 5% in the Next 3 Months (Forecast) [Dataset]. https://www.kappasignal.com/2023/06/machine-learning-predicts-qqq-to.html
    Explore at:
    Dataset updated
    Jun 2, 2023
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.htmlhttps://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    Machine Learning Predicts QQQ to Increase in Value by 5% in the Next 3 Months

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)
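    For example, indicators such as the moving averages and RSI listed above can be derived directly from the daily price history. A small sketch with pandas follows; the column name and sample prices are assumptions for illustration.

    ```python
    # Sketch: compute a 20-day moving average and a 14-day RSI from daily closing prices.
    # Assumes a DataFrame with a "close" column, e.g. loaded from a CSV of daily prices.
    import pandas as pd

    prices = pd.DataFrame({"close": [100, 101, 102, 101, 103, 105, 104, 106, 108, 107,
                                     109, 111, 110, 112, 113, 112, 114, 116, 115, 117,
                                     118, 117, 119, 121, 120]})

    # Simple moving average over 20 trading days.
    prices["sma_20"] = prices["close"].rolling(window=20).mean()

    # Relative Strength Index over 14 days (simple-average variant).
    delta = prices["close"].diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    prices["rsi_14"] = 100 - 100 / (1 + gain / loss)

    print(prices.tail())
    ```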

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data


Eye Tracking based Learning Style Identification for Learning Management Systems


Experimental Setup:

In the following section, crucial notes on the circumstances and the experiment itself as well as the equipment are given.
In order to reduce the external influence on the experiment, variables such as:

  • order, number, and presentation of the stimuli,
  • instruction to the participant prior to the experiment,
  • position of the participant in respect to the Eye Tracking equipment,
  • environment such as illuminance and ambient noise for the participant,
  • Eye Tracking equipment, software, settings such as sampling frequency and latency as well as calibration

were kept as constant and consistent as possible throughout the experiment.

Equipment:

In this study, the Tobii Pro Fusion (https://go.tobii.com/tobii-pro-fusion-user-manual) eye tracker is used without a chin rest, along with the Tobii IVT filter for fixation detection and the Tobii Pro Lab software for data collection. The Tobii Pro Fusion is categorised as a video-based combined pupil and corneal reflection technology. This tracker offers several advantages, such as the collection of comprehensive data comprising gaze, pupil, and eye-opening metrics. The eye tracker captures up to 250 images per second (250 Hz), which improves the precision of eye movement analysis. In addition, the Tobii Pro Fusion performs under different lighting conditions, making this portable device ideal for off-site studies.

Ensuring consistent quality across all experiment participants is crucial. Prior to each individual experiment, the eye tracker is calibrated, aiming for a maximum reproduction error of at most 0.2 degrees during calibration to minimize deviations. The calibration is excluded from the experiment recording. Each participant is given the same instructions for their single trial of the experiment. The stimuli are displayed on a 24-inch monitor in 16:9 format, positioned approximately 65 cm away from the participants' eyes. Any effects related to participant characteristics, such as age, visual acuity, eye colour, and pupil size, are considered in the experiment design.

Procedure:

Initially, the participants are asked to confirm that they are able to take part in the experiment given their current condition. Subsequently, each participant is positioned comfortably and accurately in relation to the eye tracker. The eye tracker calibration is carried out for each participant to ensure a suitable experimental configuration. Once a successful calibration is achieved, the Eye Tracking experiment begins, with instructions given prior to each task. The stimuli presentation is not restricted by time constraints, and no prior knowledge of the stimuli contents is necessary. Employing a within-subject design, each subject is exposed to every stimulus. Following completion of the experiment, participants anonymously answer the ILS questionnaire. To prevent any impact on the experiment, it is important that the questionnaire is only seen and completed after the experiment.

Stimuli:

The specially designed stimuli shown to participants during the study are illustrated in the left-hand column of the figure in the PDF file "[Documentation]stimuli_preview.pdf", which is part of the Master's thesis "Behind the Scenes - Learning Style Uncovered using Eye Tracking and Artificial Intelligence" by Bittner D. For this research, only specific regions of a stimulus, referred to as areas of interest (AOIs), are taken into consideration. The size of an AOI depends on both the stimulus information and the distance between multiple AOIs. Adequate results are ensured by non-overlapping AOIs and appropriate spacing. The AOIs of the various stimuli employed in this research are illustrated in the right-hand column of the figure in the same PDF file. The stimuli are presented in German, ensuring reliable Eye Tracking measurements without any interference from language barriers. Each stimulus comprises diverse learning materials to engage students with varying learning styles, covering some general information about the quantitative research cycle. Some stimuli feature identical types of material, e.g. illustrations or key words, but with different contexts and positions on the stimuli. Rearranging the identical material reduces the influence of reading style and enhances the impact of the learning style, producing a more reliable experiment. These identical types of material or AOIs on different stimuli can be grouped together, identified by the same colour and title, and referred to as AOI groupings.
There are ten different AOI groupings in total, as illustrated in the figure in the "[Documentation]stimuli_preview.pdf" file, where each grouping consists of several AOIs.
In detail, the AOI grouping regarding:

  • table of contents and summary contain only a single AOI each,
  • illustrations, key words, theory, exercise, example and additional material contain three AOIs each,
  • supporting text and multiple choice question contain two AOIs each.
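To illustrate how recorded fixations can be related to such AOIs, the following minimal sketch assigns each fixation to the AOI it falls in. All bounding boxes and fixation coordinates are hypothetical and are not taken from the published analysis.

```python
# Minimal sketch: map fixation coordinates to AOIs (all coordinates are hypothetical).
from dataclasses import dataclass

@dataclass
class AOI:
    name: str
    x: float       # left edge in pixels
    y: float       # top edge in pixels
    width: float
    height: float

    def contains(self, fx: float, fy: float) -> bool:
        return self.x <= fx <= self.x + self.width and self.y <= fy <= self.y + self.height

# Hypothetical AOIs for one stimulus (names follow the groupings above).
aois = [
    AOI("illustration", 100, 200, 400, 300),
    AOI("key_words", 600, 200, 300, 300),
    AOI("summary", 100, 600, 800, 150),
]

# Hypothetical fixations: (x, y, duration in ms) as exported from Tobii Pro Lab.
fixations = [(150, 250, 312), (700, 320, 254), (50, 50, 180)]

for fx, fy, dur in fixations:
    hit = next((a.name for a in aois if a.contains(fx, fy)), "outside_AOIs")
    print(f"fixation at ({fx}, {fy}) for {dur} ms -> {hit}")
```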

Research data management:

To ensure the transparency and reproducibility of this study, effective management of research data is essential. This section provides details on the management, storage and analysis of the extensive dataset collected as part of the study. Importantly, this research, the study and its processes adhered to ethical guidelines at all times, including informed consent, participant anonymity and secure data handling. The data collected will only be kept for a specific period of time as defined in the research project guidelines. The collection itself involves the recording of participants' eye movements during the ET study and the collection of their demographic data and responses to the ILS questionnaire.
