80 datasets found
  1. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    figshare
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, MitoTracker Red CMXRos area and intensity (3 h and 24 h incubations with both compounds), MitoSOX oxidation (3 h incubation with the aforementioned compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples precluded a full, robust statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) to instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when the majority reaches 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure, and area under the ROC curve (AUC) metrics.
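    The gain-ratio criterion named above can be illustrated with a short stand-alone computation. This is a generic sketch of the measure, not the Orange implementation, and the toy labels are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, partitions):
    """Information gain of a split, normalized by the split's intrinsic
    information. `partitions` is one list of labels per branch."""
    n = len(parent_labels)
    weights = [len(p) / n for p in partitions]
    info_gain = entropy(parent_labels) - sum(
        w * entropy(p) for w, p in zip(weights, partitions))
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    return info_gain / split_info if split_info > 0 else 0.0

# A perfectly separating binary split of a balanced two-class sample
# (invented labels, echoing the 4-samples-per-class design):
labels = ["control"] * 4 + ["treated"] * 4
print(gain_ratio(labels, [labels[:4], labels[4:]]))  # → 1.0
```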

  2. Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

    • acs.figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
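    Several of the summary statistics compared in this work (mean, STD, 1-norm, range, SSQ) reduce each spectrum to a single number and are easy to sketch. The PRE line below, the Shannon entropy of the intensities normalized to unit sum, is only one plausible formulation assumed here, not taken from the paper:

```python
import math

def summary_stats(spectrum):
    """Reduce one spectrum (a list of intensities) to a dict of summary statistics."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    stats = {
        "mean": mean,
        "std": math.sqrt(sum((x - mean) ** 2 for x in spectrum) / n),
        "1-norm": sum(abs(x) for x in spectrum),
        "range": max(spectrum) - min(spectrum),
        "ssq": sum(x ** 2 for x in spectrum),
    }
    # PRE, assumed here to be the Shannon entropy of the intensities
    # normalized to unit sum (one plausible reading of the name):
    total = stats["1-norm"]
    probs = [abs(x) / total for x in spectrum]
    stats["pre"] = -sum(p * math.log2(p) for p in probs if p > 0)
    return stats

s = summary_stats([1.0, 2.0, 3.0, 2.0])
print(s["mean"], s["range"], s["ssq"])  # → 2.0 2.0 18.0
```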

  3. Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54257
    Explore at:
    ppt, doc, pdf (available download formats)
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing need for businesses to derive actionable insights from their ever-expanding datasets. The market, currently estimated at $15 billion in 2025, is projected to grow at a compound annual growth rate (CAGR) of 15% from 2025 to 2033, reaching an estimated $45 billion by 2033.

    This growth is fueled by several factors: the rising adoption of big data analytics, the proliferation of cloud-based solutions offering enhanced accessibility and scalability, and the growing demand for data-driven decision-making across industries such as finance, healthcare, and retail. The market is segmented by application (large enterprises and SMEs) and type (graphical and non-graphical tools). Graphical tools currently hold the larger market share thanks to their user-friendly interfaces and ability to communicate complex data patterns effectively. Large enterprises are currently the dominant segment, but the SME segment is anticipated to grow faster as EDA solutions become more affordable and accessible.

    Geographic expansion is another key driver: North America currently holds the largest market share due to early adoption and a strong technological ecosystem, while regions like Asia-Pacific exhibit high growth potential, fueled by rapid digitalization and a burgeoning data science talent pool. Despite these opportunities, the market faces certain restraints, including the complexity of some EDA tools, which require specialized skills, and the challenge of integrating EDA tools with existing business intelligence platforms. Nonetheless, the overall market outlook remains highly positive, driven by ongoing technological advancements and the increasing importance of data analytics across all sectors. Competition among established players like IBM Cognos Analytics and Altair RapidMiner and emerging innovators like Polymer Search and KNIME further fuels market dynamism and innovation.

  4. Solar Panel Eda Dataset

    • universe.roboflow.com
    zip
    Updated Aug 29, 2024
    Cite
    Ramkumar (2024). Solar Panel Eda Dataset [Dataset]. https://universe.roboflow.com/ramkumar/solar-panel-eda
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 29, 2024
    Dataset authored and provided by
    Ramkumar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Solar Panel Bounding Boxes
    Description

    Solar Panel EDA

    ## Overview
    
    Solar Panel EDA is a dataset for object detection tasks - it contains Solar Panel annotations for 721 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. Enbor Eda Dataset

    • universe.roboflow.com
    zip
    Updated Sep 27, 2024
    Cite
    2014 Series License Plate (2024). Enbor Eda Dataset [Dataset]. https://universe.roboflow.com/2014-series-license-plate/enbor-eda
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 27, 2024
    Dataset authored and provided by
    2014 Series License Plate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    License Plate MgLz Licenseplate Bounding Boxes
    Description

    Enbor Eda

    ## Overview
    
    Enbor Eda is a dataset for object detection tasks - it contains License Plate MgLz Licenseplate annotations for 3,120 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    docx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. 
Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
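    The PCA step described above, projecting high-dimensional spectra into a lower-dimensional space before clustering, can be sketched with a plain eigendecomposition of the feature covariance matrix. This is a generic illustration on random data, not the authors' pipeline:

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components.
    Returns (scores, explained-variance ratio of the kept components)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order], eigvals[order] / eigvals.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))   # 30 synthetic "spectra" with 8 features each
scores, ratio = pca(X)
print(scores.shape)  # → (30, 2)
```

The scores returned here are what a clustering algorithm would then group into data-driven clusters.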

  7. EDA augmentation parameters.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). EDA augmentation parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t009
    Explore at:
    xls (available download formats)
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
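    Easy Data Augmentation (EDA), one of the techniques evaluated above, produces label-preserving variants of a sentence with simple word-level operations. Below is a minimal sketch of two of its four operations (random swap and random deletion), illustrative only and without the synonym resources the full method uses:

```python
import random

def random_swap(words, n_swaps=1, rng=None):
    """Return a copy of `words` with n random position swaps applied."""
    rng = rng or random.Random(0)
    words = words[:]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.2, rng=None):
    """Drop each word with probability p, keeping at least one word."""
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

sentence = "data augmentation expands small labeled datasets".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```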

  8. Eda_all Dataset

    • universe.roboflow.com
    zip
    Updated May 24, 2024
    Cite
    cropperyash (2024). Eda_all Dataset [Dataset]. https://universe.roboflow.com/cropperyash/eda_all
    Explore at:
    zip (available download formats)
    Dataset updated
    May 24, 2024
    Dataset authored and provided by
    cropperyash
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    All Polygons
    Description

    Eda_all

    ## Overview
    
    Eda_all is a dataset for instance segmentation tasks - it contains All annotations for 1,314 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  9. Data: Anscombe's quintet

    • kaggle.com
    Updated Apr 17, 2025
    Cite
    Carl McBride Ellis (2025). Data: Anscombe's quintet [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/data-anscombes-quartet/data
    Explore at:
    Croissant (available download formats). Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This file is the data set from the famous publication Francis J. Anscombe, "*Graphs in Statistical Analysis*", The American Statistician 27, pp. 17-21 (1973) (doi: 10.1080/00031305.1973.10478966). It consists of four data sets of 11 points each. Note the peculiarity that the same 'x' values are used for the first three data sets; I have followed the original publication exactly (this was originally done to save space), i.e. the first column (x123) serves as the 'x' for the next three 'y' columns: y1, y2 and y3.

    In the dataset Anscombe_quintet_data.csv there is a new column (y5) as an example of Simpson's paradox (C. McBride Ellis, "*Anscombe dataset No. 5: Simpson's paradox*", Zenodo, doi: 10.5281/zenodo.15209087 (2025)).
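    Anscombe's point, that sets with very different shapes share near-identical summary statistics, is easy to verify with the well-known values of the shared x column and the first two y columns:

```python
from statistics import mean, variance

# The shared x column and the first two y columns of Anscombe's quartet.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5

for y in (y1, y2):
    print(round(mean(y), 2), round(variance(y), 2), round(pearson(x, y), 3))
# Each set gives mean ≈ 7.5, variance ≈ 4.1, and correlation ≈ 0.816
```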

  10. Enbor Eda Article 2 V4 Dataset

    • universe.roboflow.com
    zip
    Updated Sep 29, 2024
    Cite
    2014 Series License Plate (2024). Enbor Eda Article 2 V4 Dataset [Dataset]. https://universe.roboflow.com/2014-series-license-plate/enbor-eda-article-2-v4
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 29, 2024
    Dataset authored and provided by
    2014 Series License Plate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Plate AQOx Plate XoZD License Plate MgLz Licenseplate E90T 00Hp Bounding Boxes
    Description

    Enbor Eda Article 2 V4

    ## Overview
    
    Enbor Eda Article 2 V4 is a dataset for object detection tasks - it contains Plate AQOx Plate XoZD License Plate MgLz Licenseplate E90T 00Hp annotations for 2,125 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. ‘US Health Insurance Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 15, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘US Health Insurance Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-health-insurance-dataset-8b56/068994aa/?iid=012-655&v=presentation
    Explore at:
    Dataset updated
    Nov 15, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘US Health Insurance Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/teertha/ushealthinsurancedataset on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared to other industries, even within the BFSI sphere (compared to the banking sector, for example). Extremely complex underwriting rule-sets that differ radically across product lines, many non-KYC environments lacking a centralized customer information base, complex relationships with consumers in traditional risk underwriting (where customer centricity sometimes runs counter to business profit), and the inertia of regulatory compliance are some of the unique challenges faced by the insurance business.

    Despite this, emergent technologies like AI and Block Chain have brought a radical change in Insurance, and Data Analytics sits at the core of this transformation. We can identify 4 key factors behind the emergence of Analytics as a crucial part of InsurTech:

    • Big Data: the explosion of unstructured data in the form of images, videos, text, emails, and social media
    • AI: recent advances in machine learning and deep learning that enable businesses to gain insight, do predictive analytics, and build cost- and time-efficient innovative solutions
    • Real-time processing: the ability to process information in real time through various data feeds (e.g., social media, news)
    • Increased computing power: a complex ecosystem of new analytics vendors and solutions that enables carriers to combine data sources, external insights, and advanced modeling techniques to glean insights that were not possible before

    This dataset can be helpful in a simple yet illuminating study in understanding the risk underwriting in Health Insurance, the interplay of various attributes of the insured and see how they affect the insurance premium.

    Content

    This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.

    Inspiration

    This relatively simple dataset should be an excellent starting point for EDA, Statistical Analysis and Hypothesis testing and training Linear Regression models for predicting Insurance Premium Charges.

    Proposed tasks:

    • Exploratory data analytics
    • Statistical hypothesis testing
    • Statistical modeling
    • Linear regression

    --- Original source retains full ownership of the source dataset ---
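    As a sketch of the proposed linear-regression task, ordinary least squares can be fit in closed form. The handful of rows below is invented for illustration and is not drawn from the dataset:

```python
import numpy as np

# Invented toy rows (not from the dataset): age, BMI, smoker (1/0) -> charges.
X = np.array([
    [19, 27.9, 1],
    [33, 22.7, 0],
    [28, 33.0, 0],
    [45, 25.7, 1],
    [52, 30.9, 0],
    [23, 34.4, 0],
], dtype=float)
y = np.array([16000.0, 4400.0, 4600.0, 22000.0, 10600.0, 1800.0])

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef                      # fitted charges for the training rows
print(coef.shape)  # → (4,)
```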

  12. Understanding Fatigue Through Biosignals: A Comprehensive Dataset

    • zenodo.org
    Updated Mar 26, 2024
    Cite
    Marta Gabbi; Luca Cornia; Valeria Villani; Lorenzo Sabattini (2024). Understanding Fatigue Through Biosignals: A Comprehensive Dataset [Dataset]. http://doi.org/10.5281/zenodo.8423405
    Explore at:
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marta Gabbi; Luca Cornia; Valeria Villani; Lorenzo Sabattini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fatigue is a multifaceted construct that represents an important part of human experience. Its two main aspects, mental and physical, often intertwine, intensifying their collective impact on daily life and overall well-being.
    To soften this impact, understanding and quantifying fatigue is crucial. Physiological data play a pivotal role in the comprehension of fatigue, offering precious insight into the level and type of fatigue experienced.

    The MePhy dataset includes physiological data gathered while inducing different types of fatigue conditions, in particular mental and physical fatigue. We collected various biosignals closely associated with fatigue (ECG, EDA, EMG and Eye Blinking). Test participants endured a four-part experiment that aimed to elicit mental fatigue, physical fatigue and a combination of both.

    The main folder contains:

    • MePhy Dataset folder, which contains the dataset;
    • ReadMe.pdf, which provides more information about the dataset;
    • Mental Fatigue Inducing Test folder, which includes the HTML application used to simulate mental fatigue in the test participants.

    A more in-depth description of the MePhy dataset can be found in the following paper: https://doi.org/10.1145/3610977.3637485.

    Marta Gabbi, Luca Cornia, Valeria Villani, and Lorenzo Sabattini (2024) Understanding Fatigue Through Biosignals: A Comprehensive Dataset. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI ’24).

  13. ‘Groceries dataset for Market Basket Analysis(MBA)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Groceries dataset for Market Basket Analysis(MBA)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-groceries-dataset-for-market-basket-analysis-mba-d4c7/a0d6998a/?iid=009-334&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Groceries dataset for Market Basket Analysis(MBA)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rashikrahmanpritom/groceries-dataset-for-market-basket-analysismba on 13 November 2021.

    --- Dataset description provided by original source is as follows ---

    The initial data were collected from the Groceries dataset, then modified and fragmented into two datasets for ease of MBA implementation. Here, "groceries data.csv" contains groceries transaction data on which you can do EDA, and which you can pre-process to feed into the apriori algorithm. I have also added pre-processed data as "basket.csv": you need only replace the NaN values and encode the data using TransactionEncoder, after which you can feed the encoded data into the apriori algorithm.

    --- Original source retains full ownership of the source dataset ---
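    The apriori step described, encoding transactions and mining frequent itemsets, reduces at its core to counting itemset support. Below is a stdlib-only sketch of that core (the TransactionEncoder mentioned is from mlxtend and is not reproduced here):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5, max_size=2):
    """Return {itemset: support} for itemsets appearing in at least
    min_support of the transactions, up to max_size items per set."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Invented toy baskets for illustration:
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "jam"],
    ["milk", "butter"],
]
freq = frequent_itemsets(baskets)
print(freq[("bread", "milk")])  # → 0.5
```

Unlike this exhaustive sketch, apriori prunes the search space by only extending itemsets whose subsets are already frequent.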

  14. Surface Water Stations - MPCA Environmental Data Access

    • gisdata.mn.gov
    • data.wu.ac.at
    fgdb, gpkg, html +2
    Updated Aug 30, 2025
    Cite
    Pollution Control Agency (2025). Surface Water Stations - MPCA Environmental Data Access [Dataset]. https://gisdata.mn.gov/dataset/env-eda-surfacewater-stations
    Explore at:
    gpkg, html, jpeg, fgdb, shp (available download formats)
    Dataset updated
    Aug 30, 2025
    Dataset provided by
    Minnesota Pollution Control Agency
    Description

    Minnesota Pollution Control Agency (MPCA) surface water monitoring station locations, including lake, stream, biological, and discharge stations. Locations of United States Geological Survey (USGS) stream flow stations are also included. This data set was created as part of MPCA's Environmental Data Access project, which was designed to provide internet access to MPCA's surface water monitoring data. The data set contains locational data and limited attributes for all MPCA stream chemistry stations, MPCA lake monitoring stations, MPCA stream biology stations, MPCA permitted dischargers [National Pollutant Discharge Elimination System (NPDES) permits], and the locations (only) of USGS stream flow stations. MPCA lake and stream monitoring stations are the same stations found in MPCA's EQuIS database.

  15. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    csv, text/markdown, json, bin (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
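    For modeling, the two CompetitionOpenSince fields are often combined into a single "months of competition" feature. A minimal sketch with made-up values (the field names follow the description above):

    ```python
    import pandas as pd

    # Hypothetical row values for one store; the real data comes from store.csv.
    row = {"CompetitionOpenSinceYear": 2012, "CompetitionOpenSinceMonth": 9}
    date = pd.Timestamp("2015-07-31")  # the sales date being featurized

    # Number of whole months the competitor has been open as of `date`.
    months_open = (date.year - row["CompetitionOpenSinceYear"]) * 12 \
        + (date.month - row["CompetitionOpenSinceMonth"])
    print(months_open)  # → 34
    ```

    The same year/week arithmetic applies to the Promo2Since fields.
    
    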

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.
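    As a minimal illustration of how these libraries fit together, the sketch below joins store metadata onto the daily sales records and drops closed days. Tiny inline frames stand in for the real train and store CSVs; the column names follow the field description above, the values are invented:

    ```python
    import pandas as pd

    # Stand-ins for train.csv and store.csv (values are illustrative only).
    train = pd.DataFrame({
        "Store": [1, 1, 2],
        "Date": pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-01"]),
        "Sales": [5263, 6064, 8314],
        "Open": [1, 1, 0],
    })
    store = pd.DataFrame({
        "Store": [1, 2],
        "StoreType": ["c", "a"],
        "CompetitionDistance": [1270.0, 570.0],
    })

    # Attach per-store metadata to every daily record on the Store key,
    # then keep only days the store was actually open (closed days have
    # Sales == 0 by definition and would distort the error metrics).
    df = train.merge(store, on="Store", how="left")
    df = df[df["Open"] == 1]
    print(df[["Store", "Sales", "StoreType"]])
    ```
    
    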

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
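    For reference, the two error metrics reported in the model evaluation can be computed as follows; the values below are toy numbers, not results from the project:

    ```python
    import numpy as np

    # Toy actual and predicted daily sales.
    y_true = np.array([100.0, 200.0, 400.0])
    y_pred = np.array([110.0, 180.0, 400.0])

    # Mean Absolute Percentage Error, in percent.
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    # Root Mean Squared Error, in the units of Sales.
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print(round(mape, 2), round(rmse, 2))  # → 6.67 12.91
    ```
    
    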

  16. Cognitive Fatigue

    • figshare.com
    csv
    Updated Jun 4, 2025
    Cite
    Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Cognitive Fatigue [Dataset]. http://doi.org/10.6084/m9.figshare.28188143.v3
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    figshare
    Authors
    Rui Varandas; Inês Silveira; Hugo Gamboa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Cognitive Fatigue

    2.1. Experimental design
    Cognitive fatigue (CF) is a phenomenon that arises following prolonged engagement in mentally demanding cognitive tasks. We therefore developed an experimental procedure involving three demanding tasks: a digital lesson in Jupyter Notebook format, three repetitions of the Corsi-Block task, and two repetitions of a concentration test. Before the Corsi-Block task and after the concentration task there were two-minute baseline periods. In our analysis, the first baseline period, although not explicitly present in the dataset, was designated as representing no CF, whereas the final baseline period was designated as representing the presence of CF. Between repetitions of the Corsi-Block task there were baseline periods of 15 s after the task and of 30 s before the beginning of each repetition.

    2.2. Data recording
    A data sample of 10 volunteer participants (4 females) aged between 22 and 48 years (M = 28.2, SD = 7.6) took part in this study. All volunteers were recruited at NOVA School of Science and Technology, were fluent in English and right-handed, and none reported suffering from psychological disorders or taking regular medication. Written informed consent was obtained before participation, and all ethical procedures approved by the Ethics Committee of NOVA University of Lisbon were thoroughly followed. The data from one participant were omitted due to insufficient duration of data acquisition.

    2.3. Data labelling
    The labels easy, difficult, very difficult and repeat found in the ECG_lesson_answers.txt files represent the subjects' opinion of the content read in the ECG lesson. The repeat label represents the most difficult level: pressing it shows the answer to the question again. This system is based on the Anki system, which has been proposed and used to memorise information effectively. In addition, the PB description JSON files include timestamps indicating the start and end of cognitive tasks, baseline periods, and other events, which are useful for defining CF states as described in 2.1.

    2.4. Data description
    Biosignals include EEG, fNIRS (not converted to oxy- and deoxyHb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button (PB) data. All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps. HCI features encompass keyboard, mouse, and screenshot data. Below is a Python snippet for extracting screenshot files from the screenshots CSV file (the original listing has been reindented, and the directory is now created once before the loop rather than on every iteration, which would raise FileExistsError):

        import base64
        import os

        file = '...'  # path to the screenshots CSV file
        with open(file, 'r') as f:
            lines = f.readlines()

        os.makedirs('screenshot', exist_ok=True)  # create the output folder once
        for line in lines[1:]:
            timestamp = line.split(',')[0]
            code = line.split(',')[-1][:-2]       # drop the trailing newline characters
            imgdata = base64.b64decode(code)
            filename = str(timestamp) + '.jpeg'
            with open(os.path.join('screenshot', filename), 'wb') as out:
                out.write(imgdata)

    A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D2_subject-info.csv). Other complementary files include (i) a description of the push-buttons to help segment the signals (e.g., D2_S2_PB_description.json) and (ii) labelling (e.g., D2_S2_ECG_lesson_results.txt). The files D2_Sx_results_corsi-block_board_1.json and D2_Sx_results_corsi-block_board_2.json show the results for the first and second iterations of the Corsi-Block task, where, for example, row_0_1 = 12 means that the subject got 12 pairs right in the first row of the first board, and row_0_2 = 12 means the same for the second board.
  17. Replication Package of Deep Learning and Data Augmentation for Detecting...

    • zenodo.org
    zip
    Updated Apr 24, 2024
    Cite
    Anonymous Anonymous (2024). Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [Dataset]. http://doi.org/10.5281/zenodo.10521909
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 24, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 17, 2024
    Description

    Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying SATD instances as SATD or not) or categorizing SATD (labeling instances as SATD that pertain to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly when dealing with specific types of SATD, such as test and requirement debt. This is mostly because the used datasets are extremely imbalanced.

    In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.
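    The two-step flow can be illustrated schematically. In the sketch below, simple keyword rules are placeholders standing in for the BiLSTM identifier and the BERT categorizer; they are not the authors' models, only an illustration of how the two stages compose:

    ```python
    # Step 1 placeholder: binary SATD identification (stands in for BiLSTM).
    def identify_satd(text: str) -> bool:
        return "todo" in text.lower() or "hack" in text.lower()

    # Step 2 placeholder: SATD type categorization (stands in for BERT).
    def categorize_satd(text: str) -> str:
        return "TES" if "test" in text.lower() else "C/D"

    def two_step(texts):
        # Only texts flagged by step 1 are passed to the step-2 categorizer;
        # everything else is labeled Not-SATD.
        return [categorize_satd(t) if identify_satd(t) else "Not-SATD"
                for t in texts]

    print(two_step(["TODO: fix flaky test", "compute checksum"]))
    # → ['TES', 'Not-SATD']
    ```
    
    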

    Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:

    1. Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [Github]
    2. eXtreme Gradient Boosting+Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
    3. eXtreme Gradient Boosting+Easy Data Augmentation (XGBoost+EDA) [Github]
    4. MT-Text-CNN [Github]

    Structure of the Replication Package:

    In accordance with the original dataset, this dataset comprises four distinct CSV files, one per artifact considered in this study. Each CSV file contains a text column and a class column, the latter denoting the SATD type: code/design debt (C/D), documentation debt (DOC), test debt (TES), requirement debt (REQ), or Not-SATD.

    ├── SATD Keywords
    │ ├── Keywords based on Source of Artifacts
    │ │ ├── Code comment.txt
    │ │ ├── Commit message.txt
    │ │ ├── Issue section.txt
    │ │ └── Pull section.txt
    │ ├── Keywords based on Types of SATD
    │ │ ├── code-design debt.txt
    │ │ ├── documentation debt.txt
    │ │ ├── requirement debt.txt
    │ │ └── test debt.txt
    ├── src
    │ ├── bert.py
    │ ├── bilstm.py
    │ └── preprocessing.py
    ├── data-augmentation-code_comments.csv
    ├── data-augmentation-commit_messages.csv
    ├── data-augmentation-issues.csv
    ├── data-augmentation-pull_requests.csv
    └── Supplementary Material.docx

    Requirements:

    nltk
    transformers
    torch
    tensorflow
    keras
    langdetect
    inflect
    inflection
    Project sources for each artifact are as follows:
    Source code comment | Issue section | Pull section | Commit message (the project lists below follow in this order)
    ant
    argouml
    columba
    emf
    hibernate
    jedit
    jfreechart
    jmeter
    jruby
    squirrel
    camel
    chromium
    gerrit
    hadoop
    hbase
    impala
    thrift
    accumulo
    activemq
    activemq-artemis
    airflow
    ambari
    apisix
    apisix-dashboard
    arrow
    attic-apex-core
    attic-apex-malhar
    attic-stratos
    avro
    beam
    bigtop
    bookkeeper
    brooklyn-server
    calcite
    camel
    camel-k
    camel-quarkus
    camel-website
    carbondata
    cassandra
    cloudstack
    commons-lang
    couchdb
    cxf
    daffodil
    drill
    druid
    dubbo
    echarts
    fineract
    flink
    fluo
    geode
    geode-native
    gobblin
    griffin
    groovy
    guacamole-client
    hadoop
    hawq
    hbase
    helix
    hive
    hudi
    iceberg
    ignite
    incubator-brooklyn
    incubator-dolphinscheduler
    incubator-doris
    incubator-heron
    incubator-hop
    incubator-mxnet
    incubator-pagespeed-ngx
    incubator-pinot
    incubator-weex
    infrastructure-puppet
    jena
    jmeter
    kafka
    karaf
    kylin
    lucene-solr
    madlib
    myfaces-tobago
    netbeans
    netbeans-website
    nifi
    nifi-minifi-cpp
    nutch
    openwhisk
    openwhisk-wskdeploy
    orc
    ozone
    parquet-mr
    phoenix
    pulsar
    qpid-dispatch
    reef
    rocketmq
    samza
    servicecomb-java-chassis
    shardingsphere
    shardingsphere-elasticjob
    skywalking
    spark
    storm
    streams
    superset
    systemds
    tajo
    thrift
    tinkerpop
    tomee
    trafficcontrol
    trafficserver
    trafodion
    tvm
    usergrid
    zeppelin
    zookeeper
    accumulo
    activemq
    activemq-artemis
    airflow
    ambari
    apisix
    apisix-dashboard
    arrow
    attic-apex-core
    attic-apex-malhar
    attic-stratos
    avro
    beam
    bigtop
    bookkeeper
    brooklyn-server
    calcite
    camel
    camel-k
    camel-quarkus
    camel-website
    carbondata
    cassandra
    cloudstack
    commons-lang
    couchdb
    cxf
    daffodil
    drill
    druid
    dubbo
    echarts
    fineract
    flink
    fluo
    geode
    geode-native
    gobblin
    griffin
    groovy
    guacamole-client
    hadoop
    hawq
    hbase
    helix
    hive
    hudi
    iceberg
    ignite
    incubator-brooklyn
    incubator-dolphinscheduler
    incubator-doris
    incubator-heron
    incubator-hop
    incubator-mxnet
    incubator-pagespeed-ngx
    incubator-pinot
    incubator-weex
    infrastructure-puppet
    jena
    jmeter
    kafka
    karaf
    kylin
    lucene-solr
    madlib
    myfaces-tobago
    netbeans
    netbeans-website
    nifi
    nifi-minifi-cpp
    nutch
    openwhisk
    openwhisk-wskdeploy
    orc
    ozone
    parquet-mr
    phoenix
    pulsar
    qpid-dispatch
    reef
    rocketmq
    samza
    servicecomb-java-chassis
    shardingsphere
    shardingsphere-elasticjob
    skywalking
    spark
    storm
    streams
    superset
    systemds
    tajo
    thrift
    tinkerpop
    tomee
    trafficcontrol
    trafficserver
    trafodion
    tvm
    usergrid
    zeppelin
    zookeeper

    This dataset has undergone a data augmentation process using the AugGPT technique. Meanwhile, the original dataset can be downloaded via the following link: https://github.com/yikun-li/satd-different-sources-data

  18. G-REx: A Real-World Dataset of Group Emotion Experiences based on...

    • zenodo.org
    Updated Jul 12, 2023
    Cite
    Patricia Bota; Joana Brito; Ana Fred; Pablo Cesar; Hugo Plácido Silva (2023). G-REx: A Real-World Dataset of Group Emotion Experiences based on Physiological Data [Dataset]. http://doi.org/10.5281/zenodo.8136135
    Explore at:
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patricia Bota; Joana Brito; Ana Fred; Pablo Cesar; Hugo Plácido Silva
    Description

    G-REX is a novel dataset for real-world affective computing, with data collected in a naturalistic setting during movie sessions. Group physiological data (Photoplethysmography – PPG and Electrodermal Activity – EDA) are collected using a wrist-worn unobtrusive device and emotion ground-truth annotation is performed retrospectively, based on selected segments where each subject showed an elevated physiological response. In total, we provide data over 14 movie sessions, making a total of more than 300 hours of physiological data collected over 100+ subjects in groups.

  19. ‘IMDB Movies Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 13, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘IMDB Movies Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-imdb-movies-dataset-f301/9b433bd2/?iid=018-445&v=presentation
    Explore at:
    Dataset updated
    Nov 13, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘IMDB Movies Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows on 13 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    IMDB Dataset of top 1000 movies and tv shows. You can find the EDA Process on - https://www.kaggle.com/harshitshankhdhar/eda-on-imdb-movies-dataset

    Please consider UPVOTE if you found it useful.

    Content

    Data fields:

    • Poster_Link - link of the poster that IMDB is using
    • Series_Title - name of the movie
    • Released_Year - year in which the movie was released
    • Certificate - certificate earned by the movie
    • Runtime - total runtime of the movie
    • Genre - genre of the movie
    • IMDB_Rating - rating of the movie on the IMDB site
    • Overview - mini story/summary
    • Meta_score - score earned by the movie
    • Director - name of the director
    • Star1, Star2, Star3, Star4 - names of the stars
    • No_of_votes - total number of votes
    • Gross - money earned by the movie
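    The Runtime and Gross columns usually need numeric conversion before any analysis. Assuming the string formats commonly seen in this Kaggle dataset ("142 min" for Runtime, comma-separated figures for Gross), a sketch:

    ```python
    import pandas as pd

    # Illustrative rows in the assumed string formats of the source CSV.
    df = pd.DataFrame({
        "Runtime": ["142 min", "175 min"],
        "Gross": ["28,341,469", "134,966,411"],
    })

    # Strip the unit suffix and the thousands separators, then cast.
    df["Runtime_min"] = df["Runtime"].str.replace(" min", "").astype(int)
    df["Gross_num"] = df["Gross"].str.replace(",", "").astype(float)
    print(df[["Runtime_min", "Gross_num"]])
    ```
    
    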

    Inspiration

    • Analysis of a movie's gross vs. its director.
    • Analysis of a movie's gross vs. its various stars.
    • Analysis of a movie's No_of_votes vs. its director.
    • Analysis of a movie's No_of_votes vs. its various stars.
    • Which genres does each actor prefer?
    • Which combinations of actors most often achieve a good IMDB_Rating?
    • Which combinations of actors achieve a good gross?

    --- Original source retains full ownership of the source dataset ---

  20. Impact of AI in Education Processes

    • dataverse.tdl.org
    Updated Feb 20, 2024
    Cite
    Saksham Adhikari (2024). Impact of AI in Education Processes [Dataset]. http://doi.org/10.18738/T8/RXUCHK
    Explore at:
    application/x-ipynb+json(428065), pptx(80640), tsv(7079)Available download formats
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Texas Data Repository
    Authors
    Saksham Adhikari
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We performed data analysis on an open dataset containing survey responses about how useful students find AI in the educational process. We cleaned and preprocessed the data, carried out an exploratory data analysis (EDA), visualized the results and our findings, and interpreted them in our digital poster.
