31 datasets found

Exploratory Data Analysis
kaggle.com
Updated Feb 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saubhagya Mishra (2025). Exploratory Data Analysis [Dataset]. https://www.kaggle.com/datasets/saubhagyamishra1992/exploratory-data-analysis
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 26, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Saubhagya Mishra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Saubhagya Mishra

Released under MIT

Contents
Cyclistic Bike - Data Analysis (Python)
kaggle.com
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 25, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Amirthavarshini
Description
Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
Z
Dataset for "Machine learning predictions on an extensive geotechnical...
data.niaid.nih.gov
Updated Dec 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Soranzo, Enrico (2024). Dataset for "Machine learning predictions on an extensive geotechnical dataset of laboratory tests in Austria" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14251190
Explore at:
Dataset updated
Dec 5, 2024
Dataset authored and provided by
Soranzo, Enrico
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Austria
Description
This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.

The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.

Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.

This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.

Key Features:

Temporal Coverage: Over 20 years of data.

Geographical Coverage: Vienna, Lower Austria, and Burgenland.

Tests Included:

Particle Size Distribution

Atterberg Limits

Proctor Tests

Permeability Tests

Direct Shear Tests

Number of Variables: 24

Potential Applications: Correlation analysis, predictive modeling, and geotechnical design.

Technical Details:

Missing values have been addressed using K-Nearest Neighbors (KNN) imputation, and anomalies identified using Local Outlier Factor (LOF) methods in previous studies.

Data normalization and standardization steps are recommended for specific analyses.

Acknowledgments:The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).
Keith Galli's Sales Analysis Exercise
kaggle.com
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zulkhairee Sulaiman (2022). Keith Galli's Sales Analysis Exercise [Dataset]. https://www.kaggle.com/datasets/zulkhaireesulaiman/sales-analysis-2019-excercise/suggestions
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 28, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Zulkhairee Sulaiman
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video. Where he analyzes and answers business questions for 12 months worth of business data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

I decided to upload the data here so that I can carry out the exercise straight on Kaggle Notebooks. Making it ready for viewing as a portfolio project.

Content

12 .csv files containing sales data for each month of 2019.

Acknowledgements

Of course, all thanks goes to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow and subscribe at his channel.
Insurance(HealthCare)
kaggle.com
Updated Jul 27, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Damini Tiwari (2020). Insurance(HealthCare) [Dataset]. https://www.kaggle.com/daminitiwari/insurance/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 27, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Damini Tiwari
Description
Dataset

This dataset was created by Damini Tiwari

Contents
House Prices
kaggle.com
Updated May 13, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tanya Chawla (2021). House Prices [Dataset]. https://www.kaggle.com/tanyachawla412/house-prices/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 13, 2021
Dataset provided by
Kaggle
Authors
Tanya Chawla
Description
Context

To explore and learn more on Multiple Linear Regression.

Content

The dataset consists of house prices across the USA. It has the following columns: - Avg. Area Income: Numerical data about the average area of the income where the house is located. - House Age: Age of the house in years. - Number of Rooms - Number of Bedrooms - Area Population: Population of the area where the house is located. - Price - Address: The only textual data in the dataset consisting of the address of the house.
Sales Data (Project1 IIITD)
kaggle.com
Updated Jan 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rahul Sharma (2022). Sales Data (Project1 IIITD) [Dataset]. https://www.kaggle.com/rahultheogre/iiitd-project1/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 16, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rahul Sharma
Description
Dataset

This dataset was created by Rahul Sharma

Contents
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Google Play Store_Cleaned
kaggle.com
Updated Mar 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yash (2023). Google Play Store_Cleaned [Dataset]. https://www.kaggle.com/datasets/yash16jr/google-play-store-cleaned
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 26, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yash
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This Dataset is the cleaned up version of the Google Play Store Data dataset , available on Kaggle. The EDA and data cleaning was performed using Python .
Representations of Sound and Music in the Middle Ages: Analysis and...
zenodo.org
json
Updated Mar 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho (2025). Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database (Records and Performances) [Dataset]. http://doi.org/10.5281/zenodo.15037823
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15037823
Dataset updated
Mar 17, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.

It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.

The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:

• The evolution and spread of musical instruments across Europe and the Mediterranean.

• Performer typologies and their representation in medieval art.

• The relationships between musical practices and social or religious contexts.

Contents:

• Musiconis Dataset (JSON format, December 2024 version):

• Musical scenes and their descriptions

• Performer metadata (roles, social status, gender, interactions)

• Instrument classifications (without detailed organological descriptions)

• Colab Notebook (Python):

• Data processing and structuring

• Visualization of performer distributions and instrument usage

• Exploratory statistics and mapping

Tools Used:

• Python (Pandas, Seaborn, Matplotlib, Plotly)

• Statistical and exploratory data analysis

• Visualization of instrument distributions, performer interactions, and musical context
RICardo dataset 2017.12
zenodo.org
zip
Updated Jan 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Béatrice Dedinger; Paul Girard; Paul Girard; Béatrice Dedinger (2020). RICardo dataset 2017.12 [Dataset]. http://doi.org/10.5281/zenodo.1119592
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1119592
Dataset updated
Jan 21, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Béatrice Dedinger; Paul Girard; Paul Girard; Béatrice Dedinger
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
This is the first public release of the RICardo dataset under the licence odbl v1.0. This dataset is precisely described un der the data package format.

This release includes 368,871 bilateral or total trade flows from 1787 to 1938 for 373 reporting entities. It also contains python scripts used to compile and filter the flows to fuel our exploratory data analysis online tool.
m
ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured...
data.mendeley.com
Updated Aug 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christopher Lynch (2025). ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods [Dataset]. http://doi.org/10.17632/g2sdzmssgh.1
Explore at:
Unique identifier
https://doi.org/10.17632/g2sdzmssgh.1
Dataset updated
Aug 15, 2025
Authors
Christopher Lynch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. Including:

Tagged datasets (.csv): human-tagged gold labels for evaluation

Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative

Suitable for inference, semi-automatic labeling, or transfer learning

Python and R code for preprocessing, model training, evaluation, and visualization

Configuration files and environment specifications to enable end-to-end reproducibility

The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

Value of the Data: * Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers. * Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis. * Offers untagged datasets for new annotation or domain adaptation. * Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows. * Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

Data Description: * /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv. * /data/untagged/*.csv – Clean datasets without labels for inference or annotation. * /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting. * /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.

File Formats: * Data: CSV (UTF-8, RFC 4180) * Code: .py, .R, .Rproj

Ethics & Licensing * All data are de-identified and contain no PII. * Released under CC BY 4.0 (data) and MIT License (code).

Limitations * Labels reflect annotator interpretations and may encode bias. * Models trained on English text; generalization to other languages requires adaptation.

Funding Note * Funding sources provided time in support of human taggers annotating the data sets.
Digital_Payments_2025_Dataset
figshare.com
csv
Updated Apr 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
shreyash tiwari (2025). Digital_Payments_2025_Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28873229.v1
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28873229.v1
Dataset updated
Apr 25, 2025
Dataset provided by
figshare
Authors
shreyash tiwari
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The "Digital Payments 2025 Dataset" is a synthetic dataset representing digital payment transactions across various payment applications in India for the year 2025. It captures monthly transaction data for multiple payment apps, including banks, UPI platforms, and mobile payment services, reflecting the growing adoption of digital payments in India. The dataset was created as part of a college project to simulate realistic transaction patterns for research, education, and analysis in data science, economics, and fintech studies. It includes metrics such as customer transaction counts and values, total transaction counts and values, and temporal data (month and year). The data is synthetic, generated using Python libraries to mimic real-world digital payment trends, and is suitable for academic research, teaching, and exploratory data analysis.
Parkison Diseases EEG Dataset
kaggle.com
Updated Jun 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WARNER (2024). Parkison Diseases EEG Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8600168
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8600168
Dataset updated
Jun 4, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
WARNER
License
https://www.reddit.com/wiki/apihttps://www.reddit.com/wiki/api
Description
This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column which is set to 0 for healthy and 1 for PD.

Attribute Information: Matrix column entries (attributes): name - ASCII subject name and recording number MDVP:Fo(Hz) - Average vocal fundamental frequency MDVP:Fhi(Hz) - Maximum vocal fundamental frequency MDVP:Flo(Hz) - Minimum vocal fundamental frequency MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude NHR, HNR - Two measures of the ratio of noise to tonal components in the voice status - The health status of the subject (one) - Parkinson's, (zero) - healthy RPDE, D2 - Two nonlinear dynamical complexity measures DFA - Signal fractal scaling exponent spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
D
Data from: Exploratory temporal ICA based analysis in task and resting-state...
data.ru.nl
narcis.nl
07_720_v1
Updated Dec 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel E. P. Gomez; Alberto Llera; José Marques; Christian Beckmann; David Norris (2022). Exploratory temporal ICA based analysis in task and resting-state fMRI [Dataset]. http://doi.org/10.34973/g044-ka42
Explore at:
07_720_v1(12350478903 bytes)Available download formats
Unique identifier
https://doi.org/10.34973/g044-ka42
Dataset updated
Dec 6, 2022
Dataset provided by
Radboud University
Authors
Daniel E. P. Gomez; Alberto Llera; José Marques; Christian Beckmann; David Norris
Description
Temporally independent functional modes (TFMs) are functional brain networks identified based on their temporal independence. The rationale behind identifying TFMs is that different functional networks may share a common anatomical infrastructure yet display distinct temporal dynamics. Extracting TFMs usually require a larger number of samples than acquired in standard fMRI experiments, and thus have therefore previously only been performed at the group level. Here, using an ultra-fast fMRI sequence, MESH-EPI, with a volume repetition time of 158 ms, we conducted an exploratory study with n = 6 subjects and computed TFMs at the single subject level on both task and resting-state datasets. We identified 6 common temporal modes of activity in our participants, including a temporal default mode showing patterns of anti-correlation between the default mode and the task-positive networks, a lateralised motor mode and a visual mode integrating the visual cortex and the visual streams. In alignment with other findings reported recently, we also showed that independent time-series are largely free from confound contamination. In particular for ultra-fast fMRI, TFMs can separate the cardiac signal from other fluctuations. Using a non-linear dimensionality reduction technique, UMAP, we obtained preliminary evidence that combinations of spatial networks as described by the TFM model are highly individual. Our results show that it is feasible to measure reproducible TFMs at the single-subject level, opening new possibilities for investigating functional networks and their integration. Finally, we provide a python toolbox for generating TFMs and comment on possible applications of the technique and avenues for further investigation.
Z
Data and Code for the paper "GUI Testing of Android Applications:...
data.niaid.nih.gov
zenodo.org
Updated Sep 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergio Di Martino (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
Explore at:
Dataset updated
Sep 25, 2023
Dataset provided by
Anna Rita Fasolino
Luigi Libero Lucio Starace
Porfirio Tramontana
Sergio Di Martino
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

Abstract

Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, and experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task, that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers’ performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, and not in scenarios in which the goal is to generate a re-executable test suite, as well. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty masters students, that we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide deeper insights into code coverage dynamics to project managers interested in using exploratory approaches to test simple Android apps, on which they can make more informed decisions.

Contents and Instructions

This package contains:

apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
Stack Overflow tags
kaggle.com
Updated Jan 8, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abid Ali Awan (2021). Stack Overflow tags [Dataset]. https://www.kaggle.com/kingabzpro/stack-overflow-tags/activity
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 8, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Abid Ali Awan
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Context

How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?

One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine the relative popularity of languages like R, Python, Java and Javascript have changed over time.

Content

Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.

We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.

Acknowledgements

DataCamp
Replication package for EMSE article: "Does Microservice Adoption Impact the...
zenodo.org
zip
Updated Jul 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mikel Robredo; Mikel Robredo; Nyyti Saarimäki; Nyyti Saarimäki; Agbonvihele Gregrey Oko-oboh; Agbonvihele Gregrey Oko-oboh; Davide Taibi; Davide Taibi; Valentina Lenarduzzi; Valentina Lenarduzzi (2025). Replication package for EMSE article: "Does Microservice Adoption Impact the Velocity? A Cohort Study" [Dataset]. http://doi.org/10.5281/zenodo.16407138
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.16407138
Dataset updated
Jul 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mikel Robredo; Mikel Robredo; Nyyti Saarimäki; Nyyti Saarimäki; Agbonvihele Gregrey Oko-oboh; Agbonvihele Gregrey Oko-oboh; Davide Taibi; Davide Taibi; Valentina Lenarduzzi; Valentina Lenarduzzi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains all the Python and R source code to conduct the data collection, preprocessing and analysis of this study.

Contents

This repository contains the following

INSTALL: Detailed installation instructions for each of the used tools as well as the required Python dependencies.

Figures: Figures added in the PDF version of the manuscript. The analysis and scripts generate further figures that support the results of the study.

Continuous cohort outline: The overall design of a cohort study with a continuous outcome.

Fixed follow-up dates: The start and end of follow-up dates fixed.

Relative follow-up dates: The start and end of follow-up dates relative to the subject.

Confounding: Relationships between an exposure, outcome, and confounding factors.

Study design: Graphical description of the stages addressed in the study design.

Retrospective cohort study design: The design of the retrospective cohort study and timeline

DAG: Directed acyclic graph demonstrating the relationships between the dependent, independent and confounding variables.

Velocity calculation: Velocity calculation example diagram.

Data collection: Data sources and measurement process diagram.

Data analysis: Flow diagram of the data analysis process undergone and the methods used.

Subject selection: Workflow of the data filtering and subject selection procedure.

Exploratory Boxplot analysis: Exploratory visualization of the modelled confounder variables.

Velocity Normality in studied groups: Probability density function plot and QQ plot of the dependent variable for each studied group.

VIF results: Barplot identifying confounding variables that might be causing multicollinearity in our model.

SMD results (Loveplot): Loveplot of the Standardized Mean Differences of the confounding variables.

Velocity GLM residuls: Residuals plot (left) and Q-Q Residuals plot (right) between standardized model residuals (y-axis) and fitted values (left x-axis), and theoretical quantiles (right x-axis).

Codes:

Jupyter notebooks

apacheGithub.ipynb: Downloads project metadata from the selected Apache projects.

collectionMerge.ipynb: Merges and cleans attributes from the collected data into the final shape of the dataset to be used in the data analysis.

commitCrawler.ipynb: GitHub API crawler to fetch project data from GitHub repositories.

issueCrawlerGithub.ipynb: GitHub API crawler to fetch project issue data from GitHub repositories.

jiraCrawler.ipynb: Jira API crawler to fetch project issue data from Jira repositories.

sonarQubeCrawler.ipynb: SonarQube API crawler to fetch project data from SonarQube repositories.

Python codes:

aggregateExp.py: Aggregates the overall experience at a project level from all the activity before and during follow-up.

calculate_velocity.py: Script to calculate the development velocity for projects.

clone_projects.py: Performs the repostory clonning.

commons.py: Stores the global paths and variables.

contributors_api.py: Script to extract information from the collaborators involved in the software projects.

format_data_for_analysis.py: Reformats the time data format to adjust for the data analysis structure required by R functions.

get_commits.py: Auxiliary script to fetch and store project commits at a repository level (Some repositories differed with the data fetched from GitHub)

get_confounders_from_repos.py: Script to fetch identified confounders (e.g. Size...) from the cloned repositories.

get_developer_experience.py: Script to quantify the developer experience.

mergeAttributes.py: Script to merge attributes into single data file (some scripts did simply collect the data but this was not added into the same single file with the rest of attributes)

repo_experience.py: Aggregates the developer experience into a single metric per project.

R codes:

confounder-matching_EMSE.R: Performs the Matching stage to reach variable balance and optimal overlap.

crudeanalysis_EMSE.R: Performs the crude (unadjusted) analysis of the study.

data-transformation_EMSE.R: Performs the multicollinearity check of the study.

descanalysis_EMSE.R: Performs the descriptive analysis of the study.

regressionanalysis_EMSE.R: Performs the statistical adjustment of the study.

Datasets: Contains all the required data to start, follow and finish the analysis of this study.

Getting Started

These instructions will get you a copy of the project up and running on your local machine. Beforehand, please follow the installation
instructions in the INSTALL documentation.

Prerequisites

Running the code requires Python3.9. See installation instructions here.

The dependencies needed to run the code are all listed in the file requirements.txt. They can be installed using pip:

pip install -r requirements.txt

You might also want to consider using virtual env.

Running the R code requires installing RStudio. Installation instructions can be found from the official webpage of the CRAN project.

For installing the necessary libraries. A two-step process is needed to run in any of the used R scripts.

For installing the packages: install.packages("package")
For importing the package: library(package)

List of required packages:
effsize
dplyr
psych
corrplot
AICcmodavg
xtable

Running the code

NOTE: Remember featuring the project folders as in the code. Change the name of the path names in each of the Python files.

DATA-MINING PROCEDURES (All the content is described in the code)

NOTE: Make the structure of the folders in the same way displayed in figshare so that the code works, or else manage on your own the locations through the code.
Subsequent csv files made out from the crawlers will be stored in the mentioned folders until the merge stage.

1.1. Mining projects files with initial confounders from GitHub API

- Use notebook apacheGitHub.ipynb - Remember to get create a token in (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token).

1.2. Mining registered ASF projects in SonarCloud from its API

- Execute sonarQubeCrawler.ipynb - Remember to use create a token in (https://docs.sonarcloud.io/advanced-setup/web-api/).

1.3. Mining issues from ASF repositories in Jira and GitHub

- Use notebook issueCrawlerGithub.ipynb for issues tracked in GitHub and jiraCrawler.ipynb for issues tracked in Jira. (No need to use token with Atlassian for Jira issues)

1.4. Mining commits from ASF projects in GitHub.

- Use commitCrawler.ipynb to crawl over the considered repositories and mine their commits. In addition it will handle the name difference for projects using SQ since their names in
BlocPower - Summarize, plot and validate
redivis.com
Updated Oct 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kumar H (2023). BlocPower - Summarize, plot and validate [Dataset]. https://redivis.com/workflows/tajy-74j9c5jyx
Explore at:
Dataset updated
Oct 22, 2023
Dataset provided by
Redivis Inc.
Authors
Kumar H
Description
Abstract

This project uses Python to load BlocPower's data for 121 million buildings in the US, summarize it to the spatial unit of interest (state, county or zipcode) and plot key statistics. It also compares and validates the zipcode-level statistics with other independent data sources - Microsoft (for building counts) and Goldstein et al (2022) for energy use.
Wuzzuf Data Analyst jobs
kaggle.com
Updated Nov 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ahmed abbas (2023). Wuzzuf Data Analyst jobs [Dataset]. https://www.kaggle.com/datasets/ahmedabbas757/wuzzuf-data-analyst-jobs
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 8, 2023
Dataset provided by
Kaggle
Authors
ahmed abbas
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Content

This is a Web scraping project, to extract jobs from wuzzuf website using python language to analyze this data and create a dashboard that shows the distribution of jobs by location, industry, and experience level.

Dataset Glossary (Column-Wise)

Job title : Job name.

Company name : Name of the company that owns the advertisement.

Location : Company location.

Job type : Type of job: is it full time or part time.

Exp level : Required level is whether senior or manger or ect.

Exp years : Number of years required to obtain job.

Skills : Skills required to obtain job.

Facebook

Twitter

Click to copy link

Link copied

Cite

Saubhagya Mishra (2025). Exploratory Data Analysis [Dataset]. https://www.kaggle.com/datasets/saubhagyamishra1992/exploratory-data-analysis

Exploratory Data Analysis

Exploratory Data Analysis using Python

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Feb 26, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Saubhagya Mishra

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Dataset

This dataset was created by Saubhagya Mishra

Released under MIT

Clear search

Close search

Google apps

Main menu

Exploratory Data Analysis

Dataset

Contents

Cyclistic Bike - Data Analysis (Python)

Dataset for "Machine learning predictions on an extensive geotechnical...

Keith Galli's Sales Analysis Exercise

Context

Content

Acknowledgements

Insurance(HealthCare)

Dataset

Contents

House Prices

Context

Content

Sales Data (Project1 IIITD)

Dataset

Contents

Reddit r/AskScience Flair Dataset

Google Play Store_Cleaned

Representations of Sound and Music in the Middle Ages: Analysis and...

RICardo dataset 2017.12

ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured...

Digital_Payments_2025_Dataset

Parkison Diseases EEG Dataset

Data from: Exploratory temporal ICA based analysis in task and resting-state...

Data and Code for the paper "GUI Testing of Android Applications:...

Stack Overflow tags

Context

Content

Acknowledgements

Replication package for EMSE article: "Does Microservice Adoption Impact the...

Contents

Getting Started

Prerequisites

Running the code

BlocPower - Summarize, plot and validate

Abstract

Wuzzuf Data Analyst jobs

Content

Dataset Glossary (Column-Wise)

Exploratory Data Analysis

Exploratory Data Analysis using Python

Dataset

Contents