100+ datasets found

A/B Testing Data
kaggle.com
Updated Jun 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sanchi (2025). A/B Testing Data [Dataset]. https://www.kaggle.com/datasets/sanxhi/ab-testing-data-simulated-web-user-engagement
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 4, 2025
Dataset provided by
Kaggle
Authors
Sanchi
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Simulated A/B Testing Data for Web User Engagement This dataset contains synthetically generated A/B testing data that mimics user behavior on a website with two versions: Control (con) and Experimental (exp). The dataset is designed for practicing data cleaning, statistical testing (e.g., Z-test, T-test), and pipeline development.

Each row represents an individual user session, with attributes capturing click behavior, session duration, access device, referral source, and timestamp.

Features: click — Binary (1 if clicked, 0 if not)

group — A/B group assignment (con or exp, with injected label inconsistencies)

session_time — Time spent in the session (in minutes), including outliers

click_time — Timestamp of user interaction (nullable)

device_type — Device used (mobile or desktop, mixed casing)

referral_source — Where the user came from (e.g., social, email, with some typos/whitespace)

Use Cases: A/B testing analysis (CTR, CVR)

Hypothesis testing (Z-test, T-test)

ETL pipeline design

Data cleaning and standardization practice

Dashboard creation and segmentation analysis

Notes: The dataset includes intentional inconsistencies (nulls, duplicates, casing issues, typos) to reflect real-world challenges.

Fully synthetic — safe for public use.
Data after outlier processing.
plos.figshare.com
txt
Updated Dec 22, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qian Yang; Xueli Wang; Xianbing Cao; Shuai Liu; Feng Xie; Yumei Li (2023). Data after outlier processing. [Dataset]. http://doi.org/10.1371/journal.pone.0295674.s002
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0295674.s002
Dataset updated
Dec 22, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Qian Yang; Xueli Wang; Xianbing Cao; Shuai Liu; Feng Xie; Yumei Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Physical fitness is a key element of a healthy life, and being overweight or lacking physical exercise will lead to health problems. Therefore, assessing an individual’s physical health status from a non-medical, cost-effective perspective is essential. This paper aimed to evaluate the national physical health status through national physical examination data, selecting 12 indicators to divide the physical health status into four levels: excellent, good, pass, and fail. The existing challenge lies in the fact that most literature on physical fitness assessment mainly focuses on the two major groups of sports athletes and school students. Unfortunately, there is no reasonable index system has been constructed. The evaluation method has limitations and cannot be applied to other groups. This paper builds a reasonable health indicator system based on national physical examination data, breaks group restrictions, studies national groups, and hopes to use machine learning models to provide helpful health suggestions for citizens to measure their physical status. We analyzed the significance of the selected indicators through nonparametric tests and exploratory statistical analysis. We used seven machine learning models to obtain the best multi-classification model for the physical fitness test level. Comprehensive research showed that MLP has the best classification effect, with macro-precision reaching 74.4% and micro-precision reaching 72.8%. Furthermore, the recall rates are also above 70%, and the Hamming loss is the smallest, i.e., 0.272. The practical implications of these findings are significant. Individuals can use the classification model to understand their physical fitness level and status, exercise appropriately according to the measurement indicators, and adjust their lifestyle, which is an important aspect of health management.
Marketing Analytics
kaggle.com
zip
Updated Mar 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jack Daoud (2022). Marketing Analytics [Dataset]. https://www.kaggle.com/datasets/jackdaoud/marketing-data/discussion
Explore at:
zip(658411 bytes)Available download formats
Dataset updated
Mar 6, 2022
Authors
Jack Daoud
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This data is publicly available on GitHub here. It can be utilized for EDA, Statistical Analysis, and Visualizations.

Content

The data set ifood_df.csv consists of 2206 customers of XYZ company with data on: - Customer profiles - Product preferences - Campaign successes/failures - Channel performance

Acknowledgement

I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.
A/B Test Aggregated Data
kaggle.com
zip
Updated Sep 18, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergei Logvinov (2022). A/B Test Aggregated Data [Dataset]. https://www.kaggle.com/datasets/sergylog/ab-test-aggregated-data/discussion
Explore at:
zip(394999 bytes)Available download formats
Dataset updated
Sep 18, 2022
Authors
Sergei Logvinov
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Simulated user-aggregated data from an experiment with webpage views and button clicks attributes. Can be very useful for preparing for interviews and practicing statistical tests. The data was prepared using a special selection of parameters: success_rate, uplift, beta, skew
f
Data from: Functional Time Series Analysis and Visualization Based on...
tandf.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Sep 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Israel Martínez-Hernández; Marc G. Genton (2024). Functional Time Series Analysis and Visualization Based on Records [Dataset]. http://doi.org/10.6084/m9.figshare.26207477.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26207477.v1
Dataset updated
Sep 19, 2024
Dataset provided by
Taylor & Francis
Authors
Israel Martínez-Hernández; Marc G. Genton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In many phenomena, data are collected on a large scale and at different frequencies. In this context, functional data analysis (FDA) has become an important statistical methodology for analyzing and modeling such data. The approach of FDA is to assume that data are continuous functions and that each continuous function is considered as a single observation. Thus, FDA deals with large-scale and complex data. However, visualization and exploratory data analysis, which are very important in practice, can be challenging due to the complexity of the continuous functions. Here we introduce a type of record concept for functional data, and we propose some nonparametric tools based on the record concept for functional data observed over time (functional time series). We study the properties of the trajectory of the number of record curves under different scenarios. Also, we propose a unit root test based on the number of records. The trajectory of the number of records over time and the unit root test can be used for visualization and exploratory data analysis. We illustrate the advantages of our proposal through a Monte Carlo simulation study. We also illustrate our method on two different datasets: Daily wind speed curves at Yanbu, Saudi Arabia and annual mortality rates in France. Overall, we can identify the type of functional time series being studied based on the number of record curves observed. Supplementary materials for this article are available online.
Data and Code for the paper "An Empirical Study on Exploratory Crowdtesting...
zenodo.org
zip
Updated Sep 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergio Di Martino; Sergio Di Martino; Anna Rita Fasolino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Luigi Libero Lucio Starace; Porfirio Tramontana; Porfirio Tramontana (2023). Data and Code for the paper "An Empirical Study on Exploratory Crowdtesting of Android Applications" [Dataset]. http://doi.org/10.5281/zenodo.7260112
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7260112
Dataset updated
Sep 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sergio Di Martino; Sergio Di Martino; Anna Rita Fasolino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Luigi Libero Lucio Starace; Porfirio Tramontana; Porfirio Tramontana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This package contains data and code to replicate the findings presented in our paper titled "Influence of the Number of Testers in Exploratory Crowd-Testing of Android Applications".

Abstract

Crowdtesting is an emerging paradigm in which a ``crowd'' of people independently carry out testing tasks, and proved to be especially promising in the mobile apps domain and in combination with exploratory testing strategies, in which individual testers pursue a creative, experience-based approach to design tests.

Managing the crowdtesting process, however, is still a challenging task, that can easily result either in wasteful spending or in inadequate software quality, due to the unpredictability of remote testing activities. A number of works in the literature investigated the application of crowdtesting in the mobile apps domain. These works, however, consider crowdtesting tasks in which the goal is to find bugs, and not to generate a re-executable test suite. Moreover, existing works do not consider the impact of the application of different exploratory testing strategies.

As a first step towards filling this gap in the literature, in this work, we conduct an empirical evaluation involving four open source Android apps and twenty masters students, that we believe can be representative of practitioners partaking in crowdtesting activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized crowds of students achieve using different exploratory testing strategies. Results provide useful insights to project managers interested in using crowdtesting to produce GUI test suites for mobile apps, on which they can make more informed decisions.

Contents and Instructions

This package contains:

apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute code coverage unions, run the analysisAndReport.py script.

data-analysis-scripts.zip A zip archive containing a RStudio project and all the R scripts we developed to carry out statistical analysis and draw plots. All data is available as a R object is the ./data/ce/data_augmented.rds file. Moreover, the hypotheses_testing.R scripts performs statistical tests and measures effect size for RQ1. The script hypotheses_testing_across_strategy.R performs statistical tests and measures effect size for RQ2.
f
Descriptive statistics.
plos.figshare.com
xls
Updated Oct 31, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pgph.0002475.t003
Dataset updated
Oct 31, 2023
Dataset provided by
PLOS Global Public Health
Authors
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
SEM regression for H1-5.
plos.figshare.com
xls
Updated Nov 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0309318.t004
Dataset updated
Nov 4, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.
Insurance(HealthCare)
kaggle.com
zip
Updated Jul 27, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Damini Tiwari (2020). Insurance(HealthCare) [Dataset]. https://www.kaggle.com/datasets/daminitiwari/insurance/discussion
Explore at:
zip(16433 bytes)Available download formats
Dataset updated
Jul 27, 2020
Authors
Damini Tiwari
Description
Dataset

This dataset was created by Damini Tiwari

Contents
f
Descriptive statistics for factors (F) extracted through exploratory factor...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jul 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucrezi, Serena; Cerrano, Carlo; Milanese, Martina; Palma, Marco (2019). Descriptive statistics for factors (F) extracted through exploratory factor analysis (EFA) and reliability tests under the following categories: Divers’ self-assessment; divers’ satisfaction with diving at the study areas; and divers’ perceptions of scuba diving impacts. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000144453
Explore at:
Dataset updated
Jul 5, 2019
Authors
Lucrezi, Serena; Cerrano, Carlo; Milanese, Martina; Palma, Marco
Description
Descriptive statistics for factors (F) extracted through exploratory factor analysis (EFA) and reliability tests under the following categories: Divers’ self-assessment; divers’ satisfaction with diving at the study areas; and divers’ perceptions of scuba diving impacts.
Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum...
zenodo.org
zip
Updated Sep 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio (2025). Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum Anomalies in the 10-15 GeV Range [Dataset]. http://doi.org/10.5281/zenodo.17220766
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.17220766
Dataset updated
Sep 29, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.

Methodology:

Event selection and reconstruction using CMS NanoAOD format

Dimuon invariant mass analysis with background estimation

Angular distribution studies for quantum number determination

Statistical analysis including significance testing

Systematic uncertainty evaluation

Conservation law verification

Key Analysis Components:

Mass spectrum reconstruction and peak identification

Background modeling using sideband methods

Angular correlation analysis (sphericity, thrust, momentum distributions)

Cross-validation using multiple event selection criteria

Monte Carlo comparison for background understanding

Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.

Data Products:

Processed event datasets

Analysis scripts and methodology

Statistical outputs and uncertainty estimates

Visualization tools and plots

Systematic studies documentation

Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.

Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation

# Dark Photon Search for at 11.9 GeV

## Executive Summary

**Historic Search for: First Evidence of a Massive Dark Photon**

We report the Search for a new vector gauge boson at 11.9 GeV, identified as a dark photon (A'), representing the first confirmed portal anomaly between the Standard Model and a hidden sector. This search, based on CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), provides direct experimental evidence for physics beyond the Standard Model.

## Search for Highlights

### Anomaly Properties

- **Mass**: 11.9 ± 0.1 GeV

- **Quantum Numbers**: J^PC = 1^-- (vector gauge boson)

- **Spin**: 1

- **Parity**: Negative

- **Isospin**: 0 (singlet)

- **Hypercharge**: 0

### Statistical Significance

- **Total Events**: 63,788 candidates in Run 1

- **Signal Strength**: > 5σ significance

- **Decay Channel**: A' → μ⁺μ⁻ (dominant)

- **Branching Ratio**: ~50% to neutral pairs

### Conservation Laws

All fundamental symmetries preserved:

- ✓ Energy-momentum

- ✓ Charge

- ✓ Lepton number

- ✓ CPT

## Project Structure

```

search/

├── README.md # This file

├── docs/

│ ├── paper/ # Main search paper

│ │ ├── manuscript.tex # LaTeX source

│ │ ├── abstract.txt # Paper abstract

│ │ └── figures/ # Paper figures

│ └── supplementary/ # Additional materials

│ ├── methods.pdf # Detailed methodology

│ ├── systematics.pdf # Systematic uncertainties

│ └── theory.pdf # Theoretical implications

├── data/

│ ├── run1/ # 7-8 TeV (2010-2012)

│ │ ├── raw/ # Original ROOT files

│ │ ├── processed/ # Processed datasets

│ │ └── results/ # Analysis outputs

│ └── run2/ # 13 TeV (2015-2018)

│ ├── raw/ # Original ROOT files

│ ├── processed/ # Processed datasets

│ └── results/ # Analysis outputs

├── analysis/

│ └── scripts/ # Analysis code

│ ├── dark_photon_symmetry_analysis.py

│ ├── hidden_sector_10_150_search.py

│ ├── hidden_10_15_gev_analysis.py

│ └── validation/ # Cross-checks

├── figures/ # Publication-ready plots

│ ├── mass_spectrum.png # Invariant mass distribution

│ ├── angular_dist.png # Angular distributions

│ ├── symmetry_plots.png # Symmetry analysis

│ └── cascade_spectrum.png # Hidden sector cascade

└── validation/ # Systematic studies

├── background_estimation/

├── signal_extraction/

└── systematic_errors/

```

## Key Evidence

### 1. Quantum Number Determination

- **Angular Distribution**: ⟨|P₁|⟩ = 0.805 (strong anisotropy)

- **Quadrupole Moment**: ⟨P₂⟩ = 0.573 (non-zero)

- **Anomaly Type Score**: Vector = 90/100 (Preliminary)

### 2. Hidden Sector Connection

- 236,181 total events in 10-150 GeV range

- Exponential cascade spectrum indicating hidden valley dynamics

- Dark photon serves as portal anomaly

### 3. Decay Topology

- **Sphericity**: 0.161 (jet-like)

- **Thrust**: 0.686 (moderate collimation)

- Consistent with two-body decay A' → μ⁺μ⁻

## Physical Interpretation

The search anomaly represents:

1. **New Force Carrier**: Fifth fundamental force beyond the four known forces

2. **Portal Anomaly**: Mediator between Standard Model and hidden/dark sector

3. **Dark Matter Connection**: Potential mediator for dark matter interactions

## Theoretical Framework

### Kinetic Mixing

The dark photon arises from kinetic mixing between U(1)_Y (hypercharge) and U(1)_D (dark charge):

```

L_mix = -(ε/2) F_μν^Y F^Dμν

```

where ε is the mixing parameter (~10^-3 based on observed coupling).

### Hidden Valley Scenario

The exponential cascade spectrum suggests:

- Complex hidden sector with multiple states

- Possible dark hadronization

- Rich phenomenology awaiting exploration

## Collaborators and Credits

**Lead Analysis**: CMS Open Data Analysis Team

**Data Source**: CERN Open Data Portal

**Period**: 2010-2012 (Run 1), 2015-2018 (Run 2)

**Computing**: Local analysis on CMS NanoAOD format

## How to Reproduce

### Requirements

```bash

pip install uproot awkward numpy matplotlib

```

### Quick Start

```bash

cd analysis/scripts/

python dark_photon_symmetry_analysis.py

python hidden_10_15_gev_analysis.py

```

## Significance Statement

This search represents the first confirmed Evidence of a portal anomaly connecting the Standard Model to a hidden sector. The 11.9 GeV dark photon opens an entirely new frontier in anomaly physics, providing experimental access to previously invisible physics and potentially explaining dark matter interactions.

## Contact

For questions about this search or collaboration opportunities:

- Email: andreluisdionisio@gmail.com

---

"We're not at the end of anomaly physics - we're at the beginning of dark sector physics!"

3665778186 00382C40-4D7F-E211-AD6F-003048FFCBFC.root
2581315530 0E5F189B-5D7F-E211-9423-002354EF3BE1.root
2149825126 1AE176AC-5A7F-E211-8E63-00261894397D.root
1792851725 2044D46B-DE7F-E211-9C82-003048FFD76E.root
3186214416 4CAE8D51-4A7F-E211-9937-0025905964A2.root
3220923349 72FDEF89-497F-E211-9CFA-002618943958.root
2555255008 7A35A5A2-547F-E211-940B-003048678DA2.root
3875410897 7E942EED-457F-E211-938E-002618FDA28E.root
2409745919 8406DE2F-407F-E211-A6A5-00261894395F.root
2421251748 8A61DAA8-3C7F-E211-94A6-002618943940.root
2315643699 98909097-417F-E211-9009-002618943838.root
2614932091 A0963AD9-567F-E211-A8AF-002618943901.root
2438057881 ACE2DF9A-477F-E211-9C29-003048679266.root
2206652387 B6AA897F-467F-E211-8381-002618943854.root
2365666837 C09519C8-4B7F-E211-9BCE-003048678B34.root
2477336101 C68AE3A5-447F-E211-928E-00261894388B.root
2556444022 C6CEC369-437F-E211-81B0-0026189438BD.root
3184171088 D60FF379-4E7F-E211-8BA4-002590593878.root
2381001693
d
Data from: Using decision trees to understand structure in missing data
datamed.org
datasetcatalog.nlm.nih.gov
+2more
Updated Jun 2, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2015). Data from: Using decision trees to understand structure in missing data [Dataset]. https://datamed.org/display-item.php?repository=0010&id=5937ae305152c60a13865bb4&query=CARTPT
Explore at:
Dataset updated
Jun 2, 2015
Description
Objectives: Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data. Setting: Data taken from employees at 3 different industrial sites in Australia. Participants: 7915 observations were included. Materials and methods: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. Results: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Conclusions: Researchers are encouraged to use CART and BRT models to explore and understand missing data.
Groups of words for our Z and X variables.
plos.figshare.com
xls
Updated Nov 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Groups of words for our Z and X variables. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0309318.t002
Dataset updated
Nov 4, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.
Z
Data and Code for the paper "GUI Testing of Android Applications:...
data.niaid.nih.gov
data-staging.niaid.nih.gov
+1more
Updated Sep 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergio Di Martino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Porfirio Tramontana (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
Explore at:
Dataset updated
Sep 25, 2023
Dataset provided by
Università degli Studi di Napoli Federico II, Naples, Italy
Authors
Sergio Di Martino; Anna Rita Fasolino; Luigi Libero Lucio Starace; Porfirio Tramontana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

Abstract

Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, and experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task, that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers’ performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, and not in scenarios in which the goal is to generate a re-executable test suite, as well. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty masters students, that we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide deeper insights into code coverage dynamics to project managers interested in using exploratory approaches to test simple Android apps, on which they can make more informed decisions.

Contents and Instructions

This package contains:

apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
Data for Measures of abdominal obesity, metabolic dysfunction, and metabolic...
catalog.data.gov
datasets.ai
+1more
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Data for Measures of abdominal obesity, metabolic dysfunction, and metabolic syndrome in United States adolescents: exploratory analysis using the National Health and Nutrition Examination Survey (NHANES) 2011-2014 data [Dataset]. https://catalog.data.gov/dataset/data-for-measures-of-abdominal-obesity-metabolic-dysfunction-and-metabolic-syndrome-in-uni
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Area covered
United States
Description
NHANES data from the 2011-2014 survey years. Specific to adolescents. Ancillary data related to metabolic syndrome and other covariates. This dataset is associated with the following publication: Gaston, S., N. Tulve, and T. Ferguson. Abdominal obesity, metabolic dysfunction, and metabolic syndrome in U.S. adolescents: National Health and Nutrition Examination Survey 2011–2016. ANNALS OF EPIDEMIOLOGY. Elsevier Science Ltd, New York, NY, USA, 30: 30-36, (2019).
Data from: An Exploratory Analysis of Barriers to Usage of the USDA Dietary...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Data from: An Exploratory Analysis of Barriers to Usage of the USDA Dietary Guidelines for Americans [Dataset]. https://catalog.data.gov/dataset/data-from-an-exploratory-analysis-of-barriers-to-usage-of-the-usda-dietary-guidelines-for--bb6c7
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Description
The average American’s diet does not align with the Dietary Guidelines for Americans (DGA) provided by the U.S. Department of Agriculture and the U.S. Department of Health and Human Services (2020). The present study aimed to compare fruit and vegetable consumption among those who had and had not heard of the DGA, identify characteristics of DGA users, and identify barriers to DGA use. A nationwide survey of 943 Americans revealed that those who had heard of the DGA ate more fruits and vegetables than those who had not. Men, African Americans, and those who have more education had greater odds of using the DGA as a guide when preparing meals relative to their respective counterparts. Disinterest, effort, and time were among the most cited reasons for not using the DGA. Future research should examine how to increase DGA adherence among those unaware of or who do not use the DGA. Comparative analyses of fruit and vegetable consumption among those who were aware/unaware and use/do not use the DGA were completed using independent samples t tests. Fruit and vegetable consumption variables were log-transformed for analysis. Binary logistic regression was used to examine whether demographic features (race, gender, and age) predict DGA awareness and usage. Data were analyzed using SPSS version 28.1 and SAS/STAT® version 9.4 TS1M7 (2023 SAS Institute Inc).
n
HadISD: Global sub-daily, surface meteorological station data, 1931-2023,...
data-search.nerc.ac.uk
Updated Jul 24, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). HadISD: Global sub-daily, surface meteorological station data, 1931-2023, v3.4.0.2023f [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=dewpoint
Explore at:
Dataset updated
Jul 24, 2021
Description
This is version v3.4.0.2023f of Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data. This update (v3.4.0.2023f) to HadISD corrects a long-standing bug which was discovered in autumn 2023 whereby the neighbour checks (and associated [un]flagging for some other tests) were not being implemented. For more details see the posts on the HadISD blog: https://hadisd.blogspot.com/2023/10/bug-in-buddy-checks.html & https://hadisd.blogspot.com/2024/01/hadisd-v3402023f-future-look.html The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files show the station listing with a station listing with IDs, names and location information. The data are provided as one NetCDF file per station. Files in the station_data folder station data files have the format "station_code"_HadISD_HadOBS_19310101-20240101_v3.4.1.2023f.nc. The station codes can be found under the docs tab. The station codes file has five columns as follows: 1) station code, 2) station name 3) station latitude 4) station longitude 5) station height. To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS. For more detailed information e.g bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/ References: When using the dataset in a paper you must cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference) : Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note. Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016. Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012 Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1 For a homogeneity assessment of HadISD please see this following reference Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.
Evaluate AI Models for Weld Inspection & NDT in Auto Manufac - EDA
ai.tracebloc.io
json
Updated Dec 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
tracebloc (2025). Evaluate AI Models for Weld Inspection & NDT in Auto Manufac - EDA [Dataset]. https://ai.tracebloc.io/explore/ai-weld-inspection-ndt-testing-in-automotive-manufacturing?tab=exploratory-data-analysis
Explore at:
jsonAvailable download formats
Dataset updated
Dec 3, 2025
Dataset provided by
Tracebloc GmbH
Authors
tracebloc
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Missing Values
Measurement technique
Statistical and exploratory data analysis
Description
Benchmark and compare 3rd-party AI models for weld defect detection & NDT in automotive production lines. Focus on recall, latency and enterprise deployment.
Feature contributions and top-three feature interactions (MFIs).
plos.figshare.com
xls
Updated Nov 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Feature contributions and top-three feature interactions (MFIs). [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0309318.t003
Dataset updated
Nov 4, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Feature contributions and top-three feature interactions (MFIs).
Z
Usability test used for inspiraconciencia exploratory tool analysis
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Sep 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Miriam, Calvera-Isabal (2024). Usability test used for inspiraconciencia exploratory tool analysis [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13732251
Explore at:
Dataset updated
Sep 8, 2024
Authors
Miriam, Calvera-Isabal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the analysis and questionnaire of the material collected during the workshops conducted educators to evaluate the usuability of the exploratory tool inspiraconciencia. It is part of a study by Calvera-Isabal M. (to be published).

This work has been funded by PID2020-112584RB-C33 funded by MCIN/AEI/10.13039/501100011033, the CS Track project, EU Horizon 2020 programme [grant agreement No 872522] and H2O Learn project PID2020-112584RB-C33 funded by MCIN/ AEI / 10.13039/501100011033.

Facebook

Twitter

Click to copy link

Link copied

Cite

Sanchi (2025). A/B Testing Data [Dataset]. https://www.kaggle.com/datasets/sanxhi/ab-testing-data-simulated-web-user-engagement

A/B Testing Data

Simulated Web User Engagement

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 4, 2025

Dataset provided by

Kaggle

Authors

Sanchi

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Simulated A/B Testing Data for Web User Engagement This dataset contains synthetically generated A/B testing data that mimics user behavior on a website with two versions: Control (con) and Experimental (exp). The dataset is designed for practicing data cleaning, statistical testing (e.g., Z-test, T-test), and pipeline development.

Each row represents an individual user session, with attributes capturing click behavior, session duration, access device, referral source, and timestamp.

Features: click — Binary (1 if clicked, 0 if not)

group — A/B group assignment (con or exp, with injected label inconsistencies)

session_time — Time spent in the session (in minutes), including outliers

click_time — Timestamp of user interaction (nullable)

device_type — Device used (mobile or desktop, mixed casing)

referral_source — Where the user came from (e.g., social, email, with some typos/whitespace)

Use Cases: A/B testing analysis (CTR, CVR)

Hypothesis testing (Z-test, T-test)

ETL pipeline design

Data cleaning and standardization practice

Dashboard creation and segmentation analysis

Notes: The dataset includes intentional inconsistencies (nulls, duplicates, casing issues, typos) to reflect real-world challenges.

Fully synthetic — safe for public use.

Clear search

Close search

Google apps

Main menu

A/B Testing Data

Data after outlier processing.

Marketing Analytics

Context

Content

Acknowledgement

A/B Test Aggregated Data

Data from: Functional Time Series Analysis and Visualization Based on...

Data and Code for the paper "An Empirical Study on Exploratory Crowdtesting...

Descriptive statistics.

SEM regression for H1-5.

Insurance(HealthCare)

Dataset

Contents

Descriptive statistics for factors (F) extracted through exploratory factor...

Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum...

Data from: Using decision trees to understand structure in missing data

Groups of words for our Z and X variables.

Data and Code for the paper "GUI Testing of Android Applications:...

Data for Measures of abdominal obesity, metabolic dysfunction, and metabolic...

Data from: An Exploratory Analysis of Barriers to Usage of the USDA Dietary...

HadISD: Global sub-daily, surface meteorological station data, 1931-2023,...

Evaluate AI Models for Weld Inspection & NDT in Auto Manufac - EDA

Feature contributions and top-three feature interactions (MFIs).

Usability test used for inspiraconciencia exploratory tool analysis

A/B Testing Data

Simulated Web User Engagement