100+ datasets found
  1. A/B Testing Data

    • kaggle.com
    Updated Jun 4, 2025
    Cite
    Sanchi (2025). A/B Testing Data [Dataset]. https://www.kaggle.com/datasets/sanxhi/ab-testing-data-simulated-web-user-engagement
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    Kaggle
    Authors
    Sanchi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Simulated A/B Testing Data for Web User Engagement This dataset contains synthetically generated A/B testing data that mimics user behavior on a website with two versions: Control (con) and Experimental (exp). The dataset is designed for practicing data cleaning, statistical testing (e.g., Z-test, T-test), and pipeline development.

    Each row represents an individual user session, with attributes capturing click behavior, session duration, access device, referral source, and timestamp.

    Features: click — Binary (1 if clicked, 0 if not)

    group — A/B group assignment (con or exp, with injected label inconsistencies)

    session_time — Time spent in the session (in minutes), including outliers

    click_time — Timestamp of user interaction (nullable)

    device_type — Device used (mobile or desktop, mixed casing)

    referral_source — Where the user came from (e.g., social, email, with some typos/whitespace)

    Use Cases: A/B testing analysis (CTR, CVR)

    Hypothesis testing (Z-test, T-test)

    ETL pipeline design

    Data cleaning and standardization practice

    Dashboard creation and segmentation analysis

    Notes: The dataset includes intentional inconsistencies (nulls, duplicates, casing issues, typos) to reflect real-world challenges.

    Fully synthetic — safe for public use.
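
    A natural first exercise with this dataset is the documented Z-test on click-through rate after cleaning the injected noise. Below is a minimal sketch; the file name ab_testing_data.csv is an assumption (not stated above), while the column names come from the feature list.

```python
# Clean the injected inconsistencies, then run a two-proportion z-test on CTR.
# Column names are from the dataset card; the file name is a placeholder.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("ab_testing_data.csv")

# Standardize the noisy group labels (casing, whitespace, assumed typo variants).
df["group"] = df["group"].str.strip().str.lower().replace(
    {"control": "con", "experimental": "exp"})
df = df[df["group"].isin(["con", "exp"])].drop_duplicates()

clicks = df.groupby("group")["click"].sum()
sessions = df.groupby("group")["click"].count()

# H0: the click-through rates of con and exp are equal.
stat, pval = proportions_ztest(count=[clicks["exp"], clicks["con"]],
                               nobs=[sessions["exp"], sessions["con"]])
print(f"z = {stat:.3f}, p = {pval:.4f}")
```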

  2. Pandas Test Data

    • kaggle.com
    zip
    Updated Aug 23, 2020
    Cite
    Gyan Kumar (2020). Pandas Test Data [Dataset]. https://www.kaggle.com/kgmgyan57/pandas-test-data
    Explore at:
    Available download formats: zip (63445451 bytes)
    Dataset updated
    Aug 23, 2020
    Authors
    Gyan Kumar
    Description

    Dataset

    This dataset was created by Gyan Kumar

    Contents

    It contains the following files:

  3. Data from: Functional Time Series Analysis and Visualization Based on Records

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Sep 19, 2024
    Cite
    Israel Martínez-Hernández; Marc G. Genton (2024). Functional Time Series Analysis and Visualization Based on Records [Dataset]. http://doi.org/10.6084/m9.figshare.26207477.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 19, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Israel Martínez-Hernández; Marc G. Genton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In many phenomena, data are collected on a large scale and at different frequencies. In this context, functional data analysis (FDA) has become an important statistical methodology for analyzing and modeling such data. The approach of FDA is to assume that data are continuous functions and that each continuous function is considered as a single observation. Thus, FDA deals with large-scale and complex data. However, visualization and exploratory data analysis, which are very important in practice, can be challenging due to the complexity of the continuous functions. Here we introduce a type of record concept for functional data, and we propose some nonparametric tools based on the record concept for functional data observed over time (functional time series). We study the properties of the trajectory of the number of record curves under different scenarios. Also, we propose a unit root test based on the number of records. The trajectory of the number of records over time and the unit root test can be used for visualization and exploratory data analysis. We illustrate the advantages of our proposal through a Monte Carlo simulation study. We also illustrate our method on two different datasets: Daily wind speed curves at Yanbu, Saudi Arabia and annual mortality rates in France. Overall, we can identify the type of functional time series being studied based on the number of record curves observed. Supplementary materials for this article are available online.
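
    To make the record idea concrete, here is a toy sketch that counts record curves in a simulated functional time series. It assumes one simple notion of a record (a curve lying above the pointwise maximum of all earlier curves); the paper's formal definition for functional data may differ.

```python
# Count "record" curves over time in a simulated functional time series.
# Assumption: a curve is a record if it exceeds the pointwise maximum of
# every earlier curve; this is an illustration, not the paper's definition.
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_grid = 300, 50
curves = rng.normal(size=(n_curves, n_grid)).cumsum(axis=1)

running_max = curves[0].copy()
count = 1                     # the first curve is a record by convention
trajectory = [count]          # number of records observed up to time t
for curve in curves[1:]:
    if np.all(curve > running_max):
        count += 1
    running_max = np.maximum(running_max, curve)
    trajectory.append(count)

# For a stationary i.i.d. sequence this trajectory grows very slowly; trends
# or unit roots change its shape, which is what the proposed test exploits.
print(trajectory[-1])
```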

  4. Marketing Analytics

    • kaggle.com
    zip
    Updated Mar 6, 2022
    Cite
    Jack Daoud (2022). Marketing Analytics [Dataset]. https://www.kaggle.com/datasets/jackdaoud/marketing-data/discussion
    Explore at:
    Available download formats: zip (658411 bytes)
    Dataset updated
    Mar 6, 2022
    Authors
    Jack Daoud
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This data is publicly available on GitHub. It can be utilized for EDA, statistical analysis, and visualizations.

    Content

    The dataset ifood_df.csv consists of 2206 customers of XYZ company, with data on:

    • Customer profiles

    • Product preferences

    • Campaign successes/failures

    • Channel performance
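
    A minimal EDA sketch for this file; only the file name ifood_df.csv and the row count are documented above, so no specific column names are assumed.

```python
# Quick-look EDA: shape, dtypes, summary statistics, and missingness.
import pandas as pd

df = pd.read_csv("ifood_df.csv")
print(df.shape)                         # expect 2206 rows
print(df.dtypes.value_counts())         # column type mix
print(df.describe().T.head(15))         # numeric summaries
print(df.isna().mean().sort_values(ascending=False).head(10))  # missing %
```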

    Acknowledgement

    I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.

  5. A/B Test Aggregated Data

    • kaggle.com
    zip
    Updated Sep 18, 2022
    Cite
    Sergei Logvinov (2022). A/B Test Aggregated Data [Dataset]. https://www.kaggle.com/datasets/sergylog/ab-test-aggregated-data/discussion
    Explore at:
    Available download formats: zip (394999 bytes)
    Dataset updated
    Sep 18, 2022
    Authors
    Sergei Logvinov
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Simulated user-aggregated data from an experiment, with webpage-view and button-click attributes. It can be very useful for interview preparation and for practicing statistical tests. The data was generated using a specific selection of parameters: success_rate, uplift, beta, skew.
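
    A sketch of one statistical test this data supports: a Welch t-test on per-user click-through. The file name and the column names ("group", "views", "clicks") are assumptions based on the description, not documented fields.

```python
# Compare per-user CTR between variants with a Welch t-test (the skew
# parameter above suggests unequal variances are plausible).
import pandas as pd
from scipy import stats

df = pd.read_csv("ab_test_aggregated.csv")   # assumed file name
df["ctr"] = df["clicks"] / df["views"]

a = df.loc[df["group"] == "A", "ctr"]        # assumed group labels
b = df.loc[df["group"] == "B", "ctr"]
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t:.3f}, p = {p:.4f}")
```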

  6. Evaluate Drone-AI Models for Crowd & Traffic Monitoring - EDA

    • ai.tracebloc.io
    json
    Updated Nov 27, 2025
    Cite
    tracebloc (2025). Evaluate Drone-AI Models for Crowd & Traffic Monitoring - EDA [Dataset]. https://ai.tracebloc.io/explore/drones-object-detection-for-traffic-monitoring?tab=exploratory-data-analysis
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    Tracebloc GmbH
    Authors
    tracebloc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Missing Values
    Measurement technique
    Statistical and exploratory data analysis
    Description

    Discover, test and benchmark 3rd-party AI models for drone-based crowd and traffic detection — accuracy, latency & rare-object performance for enterprise use.

  7. Data from: Supplementary Material for "Sonification for Exploratory Data Analysis"

    • pub.uni-bielefeld.de
    • search.datacite.org
    Updated Feb 5, 2019
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. https://pub.uni-bielefeld.de/record/2920448
    Explore at:
    Dataset updated
    Feb 5, 2019
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    Chapter 8: Sonification Models

    In Chapter 8 of the thesis, six sonification models are presented as examples for the framework of Model-Based Sonification developed in Chapter 7. Sonification models determine the rendering of the sonification and the possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.

    8.1 Data Sonograms

    Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.

    • Table 8.2, page 87: Sound examples for Data Sonograms
    File:
    Iris dataset: started in plot (a) at S0, (b) at S1, (c) at S2 (https://pub.uni-bielefeld.de/download/2920448/2920454)
    10d noisy circle dataset: started in plot (c) at S0 (mean), (d) at S1 (edge) (https://pub.uni-bielefeld.de/download/2920448/2920451)
    10d Gaussian: plot (d) started at S0
    3 clusters: Example 1
    3 clusters: invisible columns used as output variables: Example 2 (https://pub.uni-bielefeld.de/download/2920448/2920450)
    Description:
    Data Sonogram Sound examples for synthetic datasets and the Iris dataset
    Duration:
    about 5 s
    8.2 Particle Trajectory Sonification Model

    This sonification model explores features of a data distribution by computing the trajectories of test particles that are injected into model space and move according to Newton's laws of motion in a potential given by the dataset (a toy sketch of this idea follows the sound examples below).

    • Sound example: page 93, PTSM-Ex-1 Audification of 1 particle in the potential of phi(x).
    • Sound example: page 93, PTSM-Ex-2 Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-3 Audification of 25 particles simultaneously in the potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-4 Audification of 25 particles simultaneously in the potential of a dataset with 1 cluster.
    • Sound example: page 95, PTSM-Ex-5 sigma-step sequence for a mixture of three Gaussian clusters
    • Sound example: page 95, PTSM-Ex-6 sigma-step sequence for a Gaussian cluster
    • Sound example: page 96, PTSM-Iris-1 Sonification for the Iris Dataset with 20 particles per step.
    • Sound example: page 96, PTSM-Iris-2 Sonification for the Iris Dataset with 3 particles per step.
    • Sound example: page 96, PTSM-Tetra-1 Sonification for a 4d tetrahedron clusters dataset.
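
    The following toy sketch illustrates the particle-trajectory idea under stated assumptions (Gaussian potential wells at the data points, simple numerical integration, audifying one velocity component); it is a schematic of the concept, not the thesis implementation.

```python
# Inject a test particle into a potential built from the data and audify
# its motion. All parameters (bandwidth h, step size, sample count) are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(-2, 0.3, (50, 2)),   # two synthetic clusters
                  rng.normal(2, 0.3, (50, 2))])

def force(x, h=0.5):
    # Gradient of a sum of Gaussian wells centered on the data points.
    d = data - x
    w = np.exp(-np.sum(d**2, axis=1) / (2 * h**2))
    return (w[:, None] * d).sum(axis=0) / h**2

x, v = np.array([0.0, 3.0]), np.zeros(2)
dt, n = 1e-3, 44100                        # one second at audio rate
signal = np.empty(n)
for i in range(n):
    v += dt * force(x)                     # semi-implicit Euler step
    x += dt * v
    signal[i] = v[0]                       # audify the x-velocity

signal /= np.max(np.abs(signal))           # normalize for playback
```
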
    8.3 Markov chain Monte Carlo Sonification

    The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p by sound, particularly concerning the modes of p.

    • Sound Example: page 105, MCMC-Ex-1 McMC Sonification, stabilization of amplitudes.
    • Sound Example: page 106, MCMC-Ex-2 Trajectory Audification for 100 McMC steps in 3 cluster dataset
    • McMC Sonification for Cluster Analysis, dataset with three clusters, page 107
    • McMC Sonification for Cluster
  8. Eukaryotes test, kingdom level: Distribution of Eukaryotes into four groups for every graph kernel

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 21, 2023
    Cite
    Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni (2023). Eukaryotes test, kingdom level: Distribution of Eukaryotes into four groups for every graph kernel. [Dataset]. http://doi.org/10.1371/journal.pone.0281047.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Eukaryotes test, kingdom level: Distribution of Eukaryotes into four groups for every graph kernel.

  9. Data and Code for the paper "An Empirical Study on Exploratory Crowdtesting of Android Applications"

    • zenodo.org
    zip
    Updated Sep 25, 2023
    Cite
    Sergio Di Martino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Porfirio Tramontana (2023). Data and Code for the paper "An Empirical Study on Exploratory Crowdtesting of Android Applications" [Dataset]. http://doi.org/10.5281/zenodo.7260112
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Di Martino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Porfirio Tramontana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains data and code to replicate the findings presented in our paper titled "Influence of the Number of Testers in Exploratory Crowd-Testing of Android Applications".

    Abstract

    Crowdtesting is an emerging paradigm in which a "crowd" of people independently carries out testing tasks; it has proved especially promising in the mobile apps domain and in combination with exploratory testing strategies, in which individual testers pursue a creative, experience-based approach to test design.

    Managing the crowdtesting process, however, is still a challenging task that can easily result either in wasteful spending or in inadequate software quality, due to the unpredictability of remote testing activities. A number of works in the literature have investigated the application of crowdtesting in the mobile apps domain. These works, however, consider crowdtesting tasks in which the goal is to find bugs, not to generate a re-executable test suite. Moreover, existing works do not consider the impact of applying different exploratory testing strategies.

    As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open source Android apps and twenty master's students, who we believe can be representative of practitioners partaking in crowdtesting activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized crowds of students achieve using different exploratory testing strategies. The results provide useful insights to project managers interested in using crowdtesting to produce GUI test suites for mobile apps, on which they can make more informed decisions.

    Contents and Instructions

    This package contains:

    • apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.
    • students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.
    • compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate coverage of all possible subsets of students (the subset-union idea is sketched after this list). The scripts have been tested on MS Windows. To compute code coverage unions, run the analysisAndReport.py script.
    • data-analysis-scripts.zip A zip archive containing an RStudio project and all the R scripts we developed to carry out statistical analysis and draw plots. All data is available as an R object in the ./data/ce/data_augmented.rds file. Moreover, the hypotheses_testing.R script performs statistical tests and measures effect size for RQ1. The script hypotheses_testing_across_strategy.R performs statistical tests and measures effect size for RQ2.
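
    The subset-union computation at the heart of compute-coverage-unions.zip can be sketched in a few lines; the coverage sets below are toy values, whereas the real scripts read the students' coverage reports.

```python
# Aggregate coverage of a crowd = union of the line sets its members cover.
from itertools import combinations

coverage = {                 # tester -> set of covered lines (toy data)
    "s1": {1, 2, 3, 7},
    "s2": {2, 3, 4},
    "s3": {5, 6, 7},
}
total_lines = 10

for k in range(1, len(coverage) + 1):
    best = max(combinations(coverage, k),
               key=lambda sub: len(set().union(*(coverage[s] for s in sub))))
    covered = set().union(*(coverage[s] for s in best))
    print(k, best, f"{100 * len(covered) / total_lines:.0f}% aggregate")
```
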
  10. Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies"

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Sep 25, 2023
    Cite
    Sergio Di Martino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Porfirio Tramontana (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Università degli Studi di Napoli Federico II, Naples, Italy
    Authors
    Sergio Di Martino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Porfirio Tramontana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

    Abstract

    Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers’ performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, not in scenarios in which the goal is also to generate a re-executable test suite. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty master's students, who we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide deeper insights into code coverage dynamics to project managers interested in using exploratory approaches to test simple Android apps, on which they can make more informed decisions.

    Contents and Instructions

    This package contains:

    apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

    apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

    students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

    compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subset of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

    branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subset of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

    data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.

  11. Descriptive statistics.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    + more versions
    Cite
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83, 3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
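
    A sketch of the modeling recipe described in the abstract: a Random Forest tuned with 3-fold cross-validation and a bootstrapped confidence interval for RMSE. Synthetic data stands in for the 50-patient clinical dataset, and the hyperparameter grid is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_regression(n_samples=50, n_features=8, noise=5.0, random_state=0)

# 3-fold CV to tune the forest and limit overfitting on a small dataset.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [100, 300], "max_depth": [2, 4, None]},
    cv=3, scoring="neg_root_mean_squared_error").fit(X, y)

# Out-of-fold predictions, then bootstrap the RMSE for a 95% interval.
pred = cross_val_predict(search.best_estimator_, X, y, cv=3)
rng = np.random.default_rng(0)
rmses = [mean_squared_error(y[i], pred[i]) ** 0.5
         for i in (rng.integers(0, len(y), len(y)) for _ in range(1000))]
print(np.percentile(rmses, [2.5, 97.5]))
```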

  12. ML models performance (Chi-square test).

    • plos.figshare.com
    xls
    Updated Nov 27, 2023
    Cite
    Muntequa Imtiaz Siraji; Ahnaf Akif Rahman; Mirza Muntasir Nishat; Md Abdullah Al Mamun; Fahim Faisal; Lamim Ibtisam Khalid; Ashik Ahmed (2023). ML models performance (Chi-square test). [Dataset]. http://doi.org/10.1371/journal.pone.0294803.t014
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Muntequa Imtiaz Siraji; Ahnaf Akif Rahman; Mirza Muntasir Nishat; Md Abdullah Al Mamun; Fahim Faisal; Lamim Ibtisam Khalid; Ashik Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression is a psychological state of mind that often influences a person in an unfavorable manner. While it can occur in people of all ages, students are especially vulnerable to it throughout their academic careers. Beginning in 2020, the COVID-19 epidemic caused major problems in people’s lives by driving them into quarantine and forcing them to be connected continually with mobile devices, such that mobile connectivity became the new norm during the pandemic and beyond. This situation is further accelerated for students as universities move towards a blended learning mode. In these circumstances, monitoring student mental health in terms of mobile and Internet connectivity is crucial for their wellbeing. This study focuses on students attending an International University of Bangladesh to investigate their mental health due to their continual use of mobile devices (e.g., smartphones, tablets, laptops, etc.). A cross-sectional survey method was employed to collect data from 444 participants. Following the exploratory data analysis, eight machine learning (ML) algorithms were used to develop an automated normal-to-extreme severe depression identification and classification system. When the automated detection incorporated feature selection methods such as the Chi-square test and Recursive Feature Elimination (RFE), an increase in accuracy of about 3 to 5% was observed. Similarly, a 5 to 15% increase in accuracy was observed when a feature extraction method such as Principal Component Analysis (PCA) was applied. Also, the SparsePCA feature extraction technique in combination with the CatBoost classifier showed the best results in terms of accuracy, F1-score, and ROC-AUC. The data analysis revealed no sign of depression in about 44% of the total participants. About 25% of students showed mild-to-moderate and 31% of students showed severe-to-extreme signs of depression. The results suggest that ML models incorporating a proper feature engineering method can serve adequately in multi-stage depression detection among students. Such a model might be utilized in other disciplines for detecting early signs of depression among people.
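
    The feature-engineering pipeline the abstract describes (feature selection, then feature extraction, then a boosted classifier) can be sketched with scikit-learn; synthetic data replaces the survey, and GradientBoostingClassifier stands in for CatBoost.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=444, n_features=30, n_informative=8,
                           random_state=0)

pipe = make_pipeline(
    MinMaxScaler(),             # chi2 requires non-negative features
    SelectKBest(chi2, k=15),    # Chi-square feature selection
    PCA(n_components=8),        # feature extraction
    GradientBoostingClassifier(random_state=0),
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe.fit(X_tr, y_tr)
print(classification_report(y_te, pipe.predict(X_te)))
```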

  13. Whole dataset test: Evaluation of the SVM results for discerning Eukaryotes/Prokaryotes and Kingdoms

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni (2023). Whole dataset test: Evaluation of the SVM results for discerning Eukaryotes/Prokaryotes and Kingdoms. [Dataset]. http://doi.org/10.1371/journal.pone.0281047.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Whole dataset test: Evaluation of the SVM results for discerning Eukaryotes/Prokaryotes and Kingdoms.

  14. Students Performance in Exams.

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    zahranusrat (2025). Students Performance in Exams. [Dataset]. https://www.kaggle.com/datasets/zahranusrat/students-performance-in-exams
    Explore at:
    Available download formats: zip (8915 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    zahranusrat
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    This dataset includes comprehensive data on students' exam results in three subjects: writing, reading, and math. Each record represents a single student and contains information about the student's gender, race and ethnicity, type of meal (reduced or free), parental education level, and whether or not the student took a test-prep course. These variables help explain potential relationships between socioeconomic and personal characteristics and academic results. Each subject's exam results are included in numerical form, enabling comparison and trend analysis of performance.

    Context

    The purpose of this dataset is to analyze how different personal and socio-economic factors relate to student academic performance. By comparing test scores with variables like parental education level, test preparation, and lunch type, you can explore these relationships.
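
    For example, a few group-by summaries address these questions directly. The file and column names follow the usual Kaggle layout for this dataset and are assumptions here.

```python
# Mean scores by parental education, test preparation, and lunch type.
import pandas as pd

df = pd.read_csv("StudentsPerformance.csv")
print(df.groupby("parental level of education")["math score"]
        .mean().sort_values())
print(df.groupby("test preparation course")
        [["math score", "reading score", "writing score"]].mean())
print(df.groupby("lunch")["math score"].describe())
```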

  15. ML models performance (PCA).

    • plos.figshare.com
    xls
    Updated Nov 27, 2023
    + more versions
    Cite
    Muntequa Imtiaz Siraji; Ahnaf Akif Rahman; Mirza Muntasir Nishat; Md Abdullah Al Mamun; Fahim Faisal; Lamim Ibtisam Khalid; Ashik Ahmed (2023). ML models performance (PCA). [Dataset]. http://doi.org/10.1371/journal.pone.0294803.t016
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Muntequa Imtiaz Siraji; Ahnaf Akif Rahman; Mirza Muntasir Nishat; Md Abdullah Al Mamun; Fahim Faisal; Lamim Ibtisam Khalid; Ashik Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression is a psychological state of mind that often influences a person in an unfavorable manner. While it can occur in people of all ages, students are especially vulnerable to it throughout their academic careers. Beginning in 2020, the COVID-19 epidemic caused major problems in people’s lives by driving them into quarantine and forcing them to be connected continually with mobile devices, such that mobile connectivity became the new norm during the pandemic and beyond. This situation is further accelerated for students as universities move towards a blended learning mode. In these circumstances, monitoring student mental health in terms of mobile and Internet connectivity is crucial for their wellbeing. This study focuses on students attending an International University of Bangladesh to investigate their mental health due to their continual use of mobile devices (e.g., smartphones, tablets, laptops, etc.). A cross-sectional survey method was employed to collect data from 444 participants. Following the exploratory data analysis, eight machine learning (ML) algorithms were used to develop an automated normal-to-extreme severe depression identification and classification system. When the automated detection incorporated feature selection methods such as the Chi-square test and Recursive Feature Elimination (RFE), an increase in accuracy of about 3 to 5% was observed. Similarly, a 5 to 15% increase in accuracy was observed when a feature extraction method such as Principal Component Analysis (PCA) was applied. Also, the SparsePCA feature extraction technique in combination with the CatBoost classifier showed the best results in terms of accuracy, F1-score, and ROC-AUC. The data analysis revealed no sign of depression in about 44% of the total participants. About 25% of students showed mild-to-moderate and 31% of students showed severe-to-extreme signs of depression. The results suggest that ML models incorporating a proper feature engineering method can serve adequately in multi-stage depression detection among students. Such a model might be utilized in other disciplines for detecting early signs of depression among people.

  16. Academic level of the study group.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Nov 27, 2023
    + more versions
    Cite
    Muntequa Imtiaz Siraji; Ahnaf Akif Rahman; Mirza Muntasir Nishat; Md Abdullah Al Mamun; Fahim Faisal; Lamim Ibtisam Khalid; Ashik Ahmed (2023). Academic level of the study group. [Dataset]. http://doi.org/10.1371/journal.pone.0294803.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Muntequa Imtiaz Siraji; Ahnaf Akif Rahman; Mirza Muntasir Nishat; Md Abdullah Al Mamun; Fahim Faisal; Lamim Ibtisam Khalid; Ashik Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression is a psychological state of mind that often influences a person in an unfavorable manner. While it can occur in people of all ages, students are especially vulnerable to it throughout their academic careers. Beginning in 2020, the COVID-19 epidemic caused major problems in people’s lives by driving them into quarantine and forcing them to be connected continually with mobile devices, such that mobile connectivity became the new norm during the pandemic and beyond. This situation is further accelerated for students as universities move towards a blended learning mode. In these circumstances, monitoring student mental health in terms of mobile and Internet connectivity is crucial for their wellbeing. This study focuses on students attending an International University of Bangladesh to investigate their mental health due to their continual use of mobile devices (e.g., smartphones, tablets, laptops, etc.). A cross-sectional survey method was employed to collect data from 444 participants. Following the exploratory data analysis, eight machine learning (ML) algorithms were used to develop an automated normal-to-extreme severe depression identification and classification system. When the automated detection incorporated feature selection methods such as the Chi-square test and Recursive Feature Elimination (RFE), an increase in accuracy of about 3 to 5% was observed. Similarly, a 5 to 15% increase in accuracy was observed when a feature extraction method such as Principal Component Analysis (PCA) was applied. Also, the SparsePCA feature extraction technique in combination with the CatBoost classifier showed the best results in terms of accuracy, F1-score, and ROC-AUC. The data analysis revealed no sign of depression in about 44% of the total participants. About 25% of students showed mild-to-moderate and 31% of students showed severe-to-extreme signs of depression. The results suggest that ML models incorporating a proper feature engineering method can serve adequately in multi-stage depression detection among students. Such a model might be utilized in other disciplines for detecting early signs of depression among people.

  17. Replication data for: Reassessing Schoenfeld Residual Tests of Proportional Hazards in Political Science Event History Analyses

    • dataverse.harvard.edu
    Updated Nov 18, 2015
    Cite
    Sunhee Park; David Hendry (2015). Replication data for: Reassessing Schoenfeld Residual Tests of Proportional Hazards in Political Science Event History Analyses [Dataset]. http://doi.org/10.7910/DVN/27682
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Sunhee Park; David Hendry
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    An underlying assumption of proportional hazards models is that the effect of a change in a covariate on the hazard rate of event occurrence is constant over time. For scholars using the Cox model, a Schoenfeld residual-based test has become the disciplinary standard for detecting violations of this assumption. However, using this test requires researchers to make a choice about a transformation of the time scale. In practice, this choice has largely consisted of arbitrary decisions made without justification. Using replications and simulations, we demonstrate that the decision about time transformations can have profound implications for the conclusions reached. In particular, we show that researchers can make far more informed decisions by paying closer attention to the presence of outlier survival times and levels of censoring in their data. We suggest a new standard for best practices in Cox diagnostics that buttresses the current standard with in-depth exploratory data analysis.
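
    In Python, the sensitivity the authors highlight can be reproduced with lifelines: fit a Cox model, then run the Schoenfeld-residual test under several time transformations and compare the p-values. The snippet below uses a demo dataset bundled with lifelines as a stand-in for the replication data.

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
from lifelines.statistics import proportional_hazard_test

df = load_rossi()
cph = CoxPHFitter().fit(df, duration_col="week", event_col="arrest")

# The choice of time transformation is exactly the arbitrary decision the
# paper scrutinizes; p-values can differ materially across transforms.
for transform in ["rank", "km", "identity", "log"]:
    res = proportional_hazard_test(cph, df, time_transform=transform)
    print(transform, res.p_value.round(4))
```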

  18. A test case data set with requirements

    • kaggle.com
    zip
    Updated Jun 11, 2021
    Cite
    Zumar Khalid (2021). A test case data set with requirements [Dataset]. https://www.kaggle.com/zumarkhalid/a-test-case-data-set-with-requirements
    Explore at:
    Available download formats: zip (54803 bytes)
    Dataset updated
    Jun 11, 2021
    Authors
    Zumar Khalid
    Description

    Context

    Since I started research in the field of data science, I have noticed that there are lots of datasets available for NLP, medicine, images, and other subjects, but I could not find a single adequate dataset for the domain of software testing. The few datasets that exist are extracted from some piece of code or from historical data that is not available publicly to analyze. The combination of software testing and data science, especially machine learning, has a lot of potential. When conducting research on test case prioritization, especially for the initial stages of the software test cycle and the way companies set priorities in the software industry, there is no black-box dataset available in that format. This was the reason I wanted such a dataset to exist, so I collected the necessary attributes, arranged them against their values, and made one.

    Content

    This data was gathered in August 2020 from a software company that worked on a car financing lease company's whole software package, from the web to their management system. The dataset is in .csv format, with 2000 rows and 6 columns. The six attributes are as follows:

    B_Req --> Business Requirement

    R_Prioirty --> Requirement priority of the particular business requirement

    FP --> Function point of each testing task; in our case, test cases against each requirement cover a particular FP

    Complexity --> Complexity of a particular function point or related modules (the criteria for assigning complexity are listed in the .txt file attached with the new version)

    Time --> Estimated maximum time assigned to each function point of a particular testing task by the QA team lead or senior SQA analyst

    Cost --> Calculated cost for each function point, using complexity and time with the function point estimation technique and the formula: Cost = (Complexity * Time) * average amount set per task or per function point. Note: in this case it is set at 5$ per FP.
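
    The cost formula above can be checked directly against the Cost column; a small sketch, assuming the file name test_case_dataset.csv (the column names are documented above):

```python
# Recompute Cost = (Complexity * Time) * rate and compare to the stored column.
import pandas as pd

RATE_PER_FP = 5   # dollars per function point, as stated in the description

df = pd.read_csv("test_case_dataset.csv")   # assumed file name
df["Cost_check"] = df["Complexity"] * df["Time"] * RATE_PER_FP
print((df["Cost_check"] - df["Cost"]).abs().max())  # ~0 if consistent
```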

    Acknowledgements

    I would like to thank the people from the QA departments of different software companies, especially the team of the company that provided me with this estimation data and the traceability matrix used to extract and compile the dataset. I got great help from websites like www.softwaretestinghelp.com, www.coderus.com, and many other sources, which helped me understand the whole testing process and the phases in which priorities are usually assigned.

    Inspiration

    My inspiration to collect this data was the shortage of datasets showing the priority of test cases together with their requirements and estimated metrics, for research on automating test case prioritization using machine learning.

    --> The dataset can be used to analyze and apply classification or any other machine learning algorithm to prioritize test cases.

    --> It can be used to reduce, select, or automate testing based on priority, cost and time, or complexity and requirements.

    --> It can be used to build recommendation systems for software testing that help testing teams with task estimation and recommendation.

  19. Shell Buckling Knockdown Factors

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Shell Buckling Knockdown Factors [Dataset]. https://data.nasa.gov/dataset/Shell-Buckling-Knockdown-Factors/n5qw-angw
    Explore at:
    Available download formats: csv, json, xml, application/rdfxml, application/rssxml, tsv
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The Shell Buckling Knockdown Factor (SBKF) Project, NASA Engineering and Safety Center (NESC) Assessment #: 07-010-E, was established in March of 2007 by the NESC in collaboration with the former NASA Constellation Program (CxP) and now the Space Launch System (SLS) Program. The SBKF Project has the goal of developing and experimentally validating improved (i.e., less-conservative, more robust) analysis-based shell buckling design factors (a.k.a., knockdown factors (KDFs)) and developing design recommendations for launch vehicle structures.

    Shell buckling knockdown factors have been historically based on test data from laboratory-scale test articles obtained from the 1930s through the 1960s. The knockdown factors are used to account for the differences observed between the theoretical buckling load and the buckling load obtained from test. However, these test-based KDFs may not be relevant for modern launch-vehicle designs, and are likely overly conservative for many designs. Significant advances in structural stability theory, high-fidelity analysis methods, manufacturing, and testing are enabling the development of new, less conservative, robust analysis-based knockdown factors for modern structural concepts. Preliminary design studies indicate that implementation of new knockdown factors can enable significant weight savings in these vehicles and will help mitigate some of NASA’s launch-vehicle development and performance risks, by reducing reliance on large-scale testing, and providing high-fidelity estimates of as-built structural performance, increased payload capability, and improved structural reliability.

    To achieve its KDF development and implementation goals, the SBKF Project is engaged in several work areas including launch-vehicle design trade studies, subcomponent- and component-level design, analysis and structural testing, and shell buckling design technology development, including analysis-method development, analysis benchmarking and standardization, and analysis-based KDF development. Finite-element analysis is used extensively in all these work areas. In particular, SBKF conducts four main categories of analyses: 1) high-fidelity structural simulations, 2) imperfection sensitivity studies, 3) test article design and analysis, and 4) exploratory studies. Each of these types of analysis may have different objectives and utilize different modeling approaches, depending on the results required to meet the Project's needs. A description of the four main categories follows.

    High-fidelity structural simulations

    High-fidelity structural simulations are defined as simulations that can predict accurately the complex behavior of a structural component or an assembly of components (e.g., virtual structural test) and often require a significant level of modeling detail and knowledge of the structural system (e.g., its physical behavior and expected variability). Models are considered high-fidelity if results predicted with these models correlate with test data to within a small range of variance and represent accurately the true physical behavior of the structure. The permissible amount of variance is determined based on the analysis requirements defined by the Project in accordance with the intended end use of the predicted data. High-fidelity shell buckling analysis objectives considered by the SBKF Project often require the accurate prediction of stiffnesses, local and global deformations, strains, load paths and buckling-induced load redistribution, and buckling and failure loads and modes. To achieve these analysis goals, the models typically must accurately represent loading and boundary conditions, and expected or measured geometric and material variations (imperfections). It is expected that high-fidelity models developed by SBKF will predict effective axial stiffness (slope of the load versus end-shortening curve) within ±2%, buckling loads and point displacements (displacement measured at a point) within ±5%, and point strains within ±10%. However, if the displacements or strains of interest are in a high-gradient location, then the overall trend will be assessed for correlation.

    Imperfection sensitivity studies
    Imperfection sensitivity studies

    Imperfection sensitivity studies are used to assess the sensitivity of a structure’s nonlinear response and buckling load to initial imperfections, such as geometric imperfections (imperfections in the shell wall geometry, including out-of-roundness or local dimples) and loading and material non-uniformities. Geometric imperfections included in an analysis model can be based upon the measured geometry of test articles or flight hardware, or they can be defined analytically using eigenmode shapes or other perturbations. The SBKF Project is developing analysis-based KDFs that are derived from imperfection sensitivity studies, and several imperfection types are being investigated. First, a single dimple-shaped imperfection is being used as a “worst-expected” imperfection shape; it is similar to the initial dimple that is observed in the shell wall at the onset of buckling. The dimple is created in the shell by applying a radially inward lateral load at the mid-length of the cylinder. The magnitude of the lateral load is held fixed and the active destabilizing load (e.g., axial compression) is then applied until buckling occurs in the shell. The magnitude of the lateral load is increased incrementally in subsequent buckling analyses until a minimum or lower-bound buckling load is achieved. A second imperfection type includes actual measured geometry data from as-built launch-vehicle-like test articles and flight hardware. These measured geometric imperfections are included in the model by adjusting the original geometrically perfect finite-element mesh nodal coordinates to the perturbed imperfect geometry. Finally, the effects of loading imperfections are investigated by applying localized concentrated loads on the ends of the shell, either in combination with the geometric imperfections or separately. Loading imperfections can occur due to manufacturing/machining variabilities and/or fit-up mismatch at component interfaces.
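
    The incremental dimple procedure lends itself to a simple control loop. The sketch below is schematic only: buckling_load() is a hypothetical placeholder for a nonlinear finite-element buckling analysis, not an SBKF tool, and the step and tolerance values are illustrative.

```python
# Schematic of the lower-bound dimple-imperfection search described above.
def buckling_load(lateral_load: float) -> float:
    # Hypothetical placeholder: run the FE model with the dimple formed by
    # this lateral load and return the axial buckling load.
    raise NotImplementedError

def lower_bound_kdf(perfect_load: float, q_step: float = 0.1,
                    q_max: float = 10.0, tol: float = 1e-3) -> float:
    """Increase the dimple-forming lateral load until the buckling load
    plateaus; return the knockdown factor Pcr(imperfect) / Pcr(perfect)."""
    q, prev = q_step, perfect_load
    while q <= q_max:
        p = buckling_load(q)
        if prev - p < tol * perfect_load:   # plateau: lower bound reached
            return p / perfect_load
        prev, q = p, q + q_step
    return prev / perfect_load
```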

    Test article design and analysis

    Test article design and analysis encompass unique requirements that differ significantly from those associated with the design of aircraft, spacecraft, or launch-vehicle structures. Aerospace structures are designed and evaluated to ensure that they are able to sustain the required loads, but they are not typically required to exhibit a specific controlling or critical failure mode (i.e., they are not typically designed such that a specific failure mode has the minimum design margin). In contrast, test articles used in the SBKF Project are designed and evaluated to ensure that a particular failure mechanism is exhibited during a test, so that the resulting test data may be used to validate modeling and analysis methods for predicting specific behaviors. In addition, the test articles are typically designed such that they lie within the same design space as the full-scale structure they represent and exhibit similar response characteristics.

    Exploratory studies
    Exploratory studies are typically quick assessments used to guide future detailed analysis tasks. Data from these exploratory studies are not intended for future use or as decisional data and are often only used by the analyst to make informed decisions on the direction of future work. Thus, rigorous quality control and reporting of these analysis studies is typically not required.

    The specific class of analysis and corresponding analysis and data requirements shall be determined by the SBKF team leads and the analyst. The analysis approach shall be based on standard best practices, when possible, and shall be uniform across all related analysis activities to ensure consistency. However, deviations from standard practice may be required and/or new approaches may be necessary to meet the analysis objectives. In such circumstances, the analyst and team lead will work together to develop and validate any new approach required.

  20. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA) The dataset was examined using visuals such as scatter plots and histograms. The data was checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development The CNN model's architecture consists of layers, units, and activation operations. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
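
    A sketch of the kind of CNN this section describes, with dropout and L2 regularization; the input size, layer sizes, and hyperparameters are illustrative rather than the project's actual configuration.

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.5),                    # dropout against overfitting
    tf.keras.layers.Dense(2, activation="softmax"),  # healthy vs. diseased
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```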

    7. Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

    8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion Key project findings include model performance and disease detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project and the methods used to solve them.

    10. Conclusion A recap of the project's key learnings, highlighting its importance to early disease detection in agriculture. Future enhancements and potential research directions are suggested.

    11. References Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111
