31 datasets found
  1. Exploratory Data Analysis

    • kaggle.com
    Updated Feb 26, 2025
    Cite
    Saubhagya Mishra (2025). Exploratory Data Analysis [Dataset]. https://www.kaggle.com/datasets/saubhagyamishra1992/exploratory-data-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saubhagya Mishra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Saubhagya Mishra

    Released under MIT


  2. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
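    As an illustration of the EDA workflow described above (a minimal sketch; the file name and column names such as started_at and ended_at are assumptions, not taken from the dataset's documentation):

    import matplotlib.pyplot as plt
    import pandas as pd

    # Load the trip data (file and column names are hypothetical).
    df = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])

    # Derive trip duration and the hour each trip started.
    df["duration_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
    df["start_hour"] = df["started_at"].dt.hour

    # Peak usage times: number of trips per starting hour.
    df.groupby("start_hour").size().plot(kind="bar", title="Trips by hour of day")
    plt.show()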

  3. Dataset for "Machine learning predictions on an extensive geotechnical...

    • data.niaid.nih.gov
    Updated Dec 5, 2024
    Cite
    Soranzo, Enrico (2024). Dataset for "Machine learning predictions on an extensive geotechnical dataset of laboratory tests in Austria" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14251190
    Explore at:
    Dataset updated
    Dec 5, 2024
    Dataset authored and provided by
    Soranzo, Enrico
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Austria
    Description

    This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.

    The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.

    Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.

    This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.

    Key Features:

    Temporal Coverage: Over 20 years of data.

    Geographical Coverage: Vienna, Lower Austria, and Burgenland.

    Tests Included:

    Particle Size Distribution

    Atterberg Limits

    Proctor Tests

    Permeability Tests

    Direct Shear Tests

    Number of Variables: 24

    Potential Applications: Correlation analysis, predictive modeling, and geotechnical design.

    Technical Details:

    Missing values have been addressed using K-Nearest Neighbors (KNN) imputation, and anomalies identified using Local Outlier Factor (LOF) methods in previous studies.

    Data normalization and standardization steps are recommended for specific analyses.
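    A minimal sketch of the preprocessing steps named above, using scikit-learn's KNNImputer and LocalOutlierFactor (the file name is an assumption, and the parameter choices are illustrative rather than those of the original studies):

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("geotechnical_tests.csv")  # hypothetical file name
    X = df.select_dtypes(include="number")

    # KNN imputation for missing values.
    X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

    # Standardize, then flag anomalies with Local Outlier Factor (-1 = outlier).
    X_scaled = StandardScaler().fit_transform(X_imputed)
    labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_scaled)
    print(f"{(labels == -1).sum()} potential outliers out of {len(labels)} samples")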

    Acknowledgments: The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).

  4. Keith Galli's Sales Analysis Exercise

    • kaggle.com
    Updated Jan 28, 2022
    Cite
    Zulkhairee Sulaiman (2022). Keith Galli's Sales Analysis Exercise [Dataset]. https://www.kaggle.com/datasets/zulkhaireesulaiman/sales-analysis-2019-excercise/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zulkhairee Sulaiman
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video, in which he analyzes and answers business questions over 12 months' worth of business data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

    I decided to upload the data here so that I can carry out the exercise directly in Kaggle Notebooks, making it ready for viewing as a portfolio project.

    Content

    12 .csv files containing sales data for each month of 2019.
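    To analyze the full year at once, the twelve monthly files can be concatenated into a single frame (a sketch; the file-name pattern is an assumption based on the exercise):

    import glob
    import pandas as pd

    # Read every monthly CSV and stack them into one year-long frame.
    files = sorted(glob.glob("Sales_*_2019.csv"))
    year_df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    print(year_df.shape)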

    Acknowledgements

    Of course, all thanks go to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow, and you can subscribe to his channel.

  5. Insurance(HealthCare)

    • kaggle.com
    Updated Jul 27, 2020
    Cite
    Damini Tiwari (2020). Insurance(HealthCare) [Dataset]. https://www.kaggle.com/daminitiwari/insurance/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Damini Tiwari
    Description

    Dataset

    This dataset was created by Damini Tiwari


  6. House Prices

    • kaggle.com
    Updated May 13, 2021
    Cite
    Tanya Chawla (2021). House Prices [Dataset]. https://www.kaggle.com/tanyachawla412/house-prices/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 13, 2021
    Dataset provided by
    Kaggle
    Authors
    Tanya Chawla
    Description

    Context

    To explore and learn more about multiple linear regression.

    Content

    The dataset consists of house prices across the USA. It has the following columns:

    • Avg. Area Income: average income of residents in the area where the house is located.
    • House Age: age of the house in years.
    • Number of Rooms
    • Number of Bedrooms
    • Area Population: population of the area where the house is located.
    • Price
    • Address: the only textual column, containing the address of the house.
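    A minimal multiple linear regression sketch over these columns (the CSV file name is an assumption; Address is dropped as the only textual field):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("USA_Housing.csv")  # hypothetical file name
    X = df.drop(columns=["Price", "Address"])
    y = df["Price"]

    # Hold out 20% of the houses to check generalization.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))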

  7. Sales Data (Project1 IIITD)

    • kaggle.com
    Updated Jan 16, 2022
    Cite
    Rahul Sharma (2022). Sales Data (Project1 IIITD) [Dataset]. https://www.kaggle.com/rahultheogre/iiitd-project1/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rahul Sharma
    Description

    Dataset

    This dataset was created by Rahul Sharma


  8. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is organized into different subreddits; here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022 and contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with a little cleaning done using NumPy and pandas (see the descriptions of the individual columns below).

    The dataset contains the following columns and descriptions:

    • author - Redditor name
    • author_fullname - Redditor full name
    • contest_mode - Contest mode (implements obscured scores and randomized sorting)
    • created_utc - Time the submission was created, represented in Unix time
    • domain - Domain of the submission
    • edited - Whether the post has been edited
    • full_link - Link to the post on the subreddit
    • id - ID of the submission
    • is_self - Whether the submission is a self post (text-only)
    • link_flair_css_class - CSS class used to identify the flair
    • link_flair_text - The link flair's text content
    • locked - Whether the submission has been locked
    • num_comments - Number of comments on the submission
    • over_18 - Whether the submission has been marked as NSFW
    • permalink - Permalink for the submission
    • retrieved_on - Time ingested
    • score - Number of upvotes for the submission
    • description - Description of the submission
    • spoiler - Whether the submission has been marked as a spoiler
    • stickied - Whether the submission is stickied
    • thumbnail - Thumbnail of the submission
    • question - Question asked in the submission
    • url - The URL the submission links to, or the permalink if a self post
    • year - Year of the submission
    • banned - Whether banned by a moderator

    This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be done to get insights and see trends and patterns over the years.
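    For example, a first look at the flair distribution for a flair-prediction task might start like this (a sketch; the CSV file name is an assumption, and the column names follow the glossary above):

    import pandas as pd

    df = pd.read_csv("askscience_submissions.csv")  # hypothetical file name

    # Most common flairs: candidate target classes for flair prediction.
    print(df["link_flair_text"].value_counts().head(10))

    # Submissions per year, to see the trend over time.
    print(df.groupby("year").size())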

  9. Google Play Store_Cleaned

    • kaggle.com
    Updated Mar 26, 2023
    Cite
    Yash (2023). Google Play Store_Cleaned [Dataset]. https://www.kaggle.com/datasets/yash16jr/google-play-store-cleaned
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yash
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is the cleaned-up version of the Google Play Store Data dataset available on Kaggle. The EDA and data cleaning were performed using Python.

  10. Representations of Sound and Music in the Middle Ages: Analysis and...

    • zenodo.org
    json
    Updated Mar 17, 2025
    Cite
    Xavier Fresquet; Frederic BILLIET; Edmundo Camacho (2025). Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database (Records and Performances) [Dataset]. http://doi.org/10.5281/zenodo.15037823
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xavier Fresquet; Frederic BILLIET; Edmundo Camacho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.

    It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.

    The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:

    • The evolution and spread of musical instruments across Europe and the Mediterranean.

    • Performer typologies and their representation in medieval art.

    • The relationships between musical practices and social or religious contexts.

    Contents:

    Musiconis Dataset (JSON format, December 2024 version):

    • Musical scenes and their descriptions

    • Performer metadata (roles, social status, gender, interactions)

    • Instrument classifications (without detailed organological descriptions)

    Colab Notebook (Python):

    • Data processing and structuring

    • Visualization of performer distributions and instrument usage

    • Exploratory statistics and mapping

    Tools Used:

    • Python (Pandas, Seaborn, Matplotlib, Plotly)

    • Statistical and exploratory data analysis

    • Visualization of instrument distributions, performer interactions, and musical context
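    A sketch of how the JSON export might be loaded and summarized with the tools listed above (the file name and the instrument field are assumptions about the export's schema):

    import pandas as pd

    # Load the Musiconis JSON export (file and field names are hypothetical).
    scenes = pd.read_json("musiconis_2024_12.json")

    # Distribution of instruments across the recorded performances.
    print(scenes["instrument"].value_counts().head(20))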

  11. RICardo dataset 2017.12

    • zenodo.org
    zip
    Updated Jan 21, 2020
    Cite
    Béatrice Dedinger; Paul Girard (2020). RICardo dataset 2017.12 [Dataset]. http://doi.org/10.5281/zenodo.1119592
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Béatrice Dedinger; Paul Girard
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This is the first public release of the RICardo dataset under the ODbL v1.0 license. This dataset is precisely described under the data package format.

    This release includes 368,871 bilateral or total trade flows from 1787 to 1938 for 373 reporting entities. It also contains the Python scripts used to compile and filter the flows that fuel our exploratory data analysis online tool.

  12. ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured...

    • data.mendeley.com
    Updated Aug 15, 2025
    Cite
    Christopher Lynch (2025). ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods [Dataset]. http://doi.org/10.17632/g2sdzmssgh.1
    Explore at:
    Dataset updated
    Aug 15, 2025
    Authors
    Christopher Lynch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. The package includes:

    • Tagged datasets (.csv): human-tagged gold labels for evaluation
    • Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative
      • Suitable for inference, semi-automatic labeling, or transfer learning
    • Python and R code for preprocessing, model training, evaluation, and visualization
    • Configuration files and environment specifications to enable end-to-end reproducibility

    The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

    Value of the Data:

    • Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
    • Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
    • Offers untagged datasets for new annotation or domain adaptation.
    • Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
    • Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

    Data Description:

    • /data/tagged/*.csv - Human-labeled datasets with schema defined in data_dictionary.csv.
    • /data/untagged/*.csv - Clean datasets without labels for inference or annotation.
    • /code/python/ - Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
    • /code/r/ - R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
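    As an illustration of how the tagged gold labels might be used to reproduce evaluation metrics (a sketch; the file path and the label and model_prediction columns are assumptions, the real schema is defined in data_dictionary.csv):

    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score

    df = pd.read_csv("data/tagged/example.csv")  # hypothetical file
    gold = df["label"]             # human-tagged gold label (assumed column)
    pred = df["model_prediction"]  # a classifier's output (assumed column)

    print("accuracy:", accuracy_score(gold, pred))
    print("macro F1:", f1_score(gold, pred, average="macro"))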

    File Formats:

    • Data: CSV (UTF-8, RFC 4180)
    • Code: .py, .R, .Rproj

    Ethics & Licensing:

    • All data are de-identified and contain no PII.
    • Released under CC BY 4.0 (data) and MIT License (code).

    Limitations:

    • Labels reflect annotator interpretations and may encode bias.
    • Models trained on English text; generalization to other languages requires adaptation.

    Funding Note:

    • Funding sources provided time in support of human taggers annotating the data sets.

  13. Digital_Payments_2025_Dataset

    • figshare.com
    csv
    Updated Apr 25, 2025
    Cite
    shreyash tiwari (2025). Digital_Payments_2025_Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28873229.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    figshare
    Authors
    shreyash tiwari
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The "Digital Payments 2025 Dataset" is a synthetic dataset representing digital payment transactions across various payment applications in India for the year 2025. It captures monthly transaction data for multiple payment apps, including banks, UPI platforms, and mobile payment services, reflecting the growing adoption of digital payments in India. The dataset was created as part of a college project to simulate realistic transaction patterns for research, education, and analysis in data science, economics, and fintech studies. It includes metrics such as customer transaction counts and values, total transaction counts and values, and temporal data (month and year). The data is synthetic, generated using Python libraries to mimic real-world digital payment trends, and is suitable for academic research, teaching, and exploratory data analysis.

  14. Parkison Diseases EEG Dataset

    • kaggle.com
    Updated Jun 4, 2024
    Cite
    WARNER (2024). Parkison Diseases EEG Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8600168
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    WARNER
    License

    https://www.reddit.com/wiki/api

    Description

    This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column which is set to 0 for healthy and 1 for PD.

    Attribute information (matrix column entries):

    • name - ASCII subject name and recording number
    • MDVP:Fo(Hz) - Average vocal fundamental frequency
    • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
    • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
    • MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
    • MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
    • NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
    • status - Health status of the subject: one for Parkinson's, zero for healthy
    • RPDE, D2 - Two nonlinear dynamical complexity measures
    • DFA - Signal fractal scaling exponent
    • spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation
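    A minimal classification sketch against the status column (the CSV file name is an assumption; the columns follow the attribute list above):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("parkinsons.csv")  # hypothetical file name
    X = df.drop(columns=["name", "status"])
    y = df["status"]  # 1 = Parkinson's, 0 = healthy

    clf = RandomForestClassifier(random_state=0)
    print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())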

  15. Data from: Exploratory temporal ICA based analysis in task and resting-state...

    • data.ru.nl
    • narcis.nl
    07_720_v1
    Updated Dec 6, 2022
    Cite
    Daniel E. P. Gomez; Alberto Llera; José Marques; Christian Beckmann; David Norris (2022). Exploratory temporal ICA based analysis in task and resting-state fMRI [Dataset]. http://doi.org/10.34973/g044-ka42
    Explore at:
    Available download formats: 07_720_v1 (12350478903 bytes)
    Dataset updated
    Dec 6, 2022
    Dataset provided by
    Radboud University
    Authors
    Daniel E. P. Gomez; Alberto Llera; José Marques; Christian Beckmann; David Norris
    Description

    Temporally independent functional modes (TFMs) are functional brain networks identified based on their temporal independence. The rationale behind identifying TFMs is that different functional networks may share a common anatomical infrastructure yet display distinct temporal dynamics. Extracting TFMs usually requires a larger number of samples than is acquired in standard fMRI experiments, and TFMs have therefore previously only been computed at the group level. Here, using an ultra-fast fMRI sequence, MESH-EPI, with a volume repetition time of 158 ms, we conducted an exploratory study with n = 6 subjects and computed TFMs at the single-subject level on both task and resting-state datasets. We identified 6 common temporal modes of activity in our participants, including a temporal default mode showing patterns of anti-correlation between the default mode and the task-positive networks, a lateralised motor mode, and a visual mode integrating the visual cortex and the visual streams. In alignment with other findings reported recently, we also showed that independent time series are largely free from confound contamination; in particular for ultra-fast fMRI, TFMs can separate the cardiac signal from other fluctuations. Using a non-linear dimensionality reduction technique, UMAP, we obtained preliminary evidence that combinations of spatial networks as described by the TFM model are highly individual. Our results show that it is feasible to measure reproducible TFMs at the single-subject level, opening new possibilities for investigating functional networks and their integration. Finally, we provide a Python toolbox for generating TFMs and comment on possible applications of the technique and avenues for further investigation.
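    The authors provide their own Python toolbox for this analysis (referenced above). Purely as a rough illustration of the underlying idea, temporal ICA applied to fMRI time courses, here is a sketch using scikit-learn's FastICA; the array shape and the choice of 6 components echo the abstract, everything else is an assumption:

    import numpy as np
    from sklearn.decomposition import FastICA

    # Placeholder time courses: (timepoints, spatial components),
    # e.g. obtained from a prior spatial ICA of the fMRI data.
    data = np.random.randn(2000, 50)

    # Temporal ICA: unmix the columns into temporally independent modes.
    tica = FastICA(n_components=6, random_state=0)
    temporal_modes = tica.fit_transform(data)  # (timepoints, 6)
    mixing = tica.mixing_                      # how modes combine the spatial maps
    print(temporal_modes.shape, mixing.shape)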

  16. Data and Code for the paper "GUI Testing of Android Applications:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 25, 2023
    Cite
    Sergio Di Martino (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Anna Rita Fasolino
    Luigi Libero Lucio Starace
    Porfirio Tramontana
    Sergio Di Martino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

    Abstract

    Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers' performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, and not in scenarios in which the goal is also to generate a re-executable test suite. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty master's students, who we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide project managers interested in using exploratory approaches to test simple Android apps with deeper insights into code coverage dynamics, on which they can base more informed decisions.

    Contents and Instructions

    This package contains:

    apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

    apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

    students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

    compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

    branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

    data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
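    Conceptually, the aggregate coverage computed by the scripts above is the union of the lines each tester covered, evaluated over every subset of testers of a given size. A minimal sketch of that computation (the coverage sets are illustrative, not the study's data):

    from itertools import combinations

    # Lines covered by each tester's suite (illustrative only).
    coverage = {
        "t1": {1, 2, 3, 7},
        "t2": {2, 3, 4},
        "t3": {5, 6, 7},
    }
    total_lines = 10

    for size in range(1, len(coverage) + 1):
        best = max(
            (set().union(*(coverage[t] for t in group))
             for group in combinations(coverage, size)),
            key=len,
        )
        print(f"best {size}-tester aggregate coverage: {len(best) / total_lines:.0%}")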

  17. Stack Overflow tags

    • kaggle.com
    Updated Jan 8, 2021
    Cite
    Abid Ali Awan (2021). Stack Overflow tags [Dataset]. https://www.kaggle.com/kingabzpro/stack-overflow-tags/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abid Ali Awan
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?

    One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java, and JavaScript has changed over time.

    Content

    Each Stack Overflow question has tags that mark its topic or technology. For instance, there are tags for languages like R or Python, and for packages like ggplot2 or pandas.

    We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.
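    With one observation per tag per year, a tag's relative popularity is its question count divided by that year's total. A sketch (the file name and the column names number and year_total are assumptions about the schema):

    import pandas as pd

    df = pd.read_csv("stack_overflow_tags.csv")  # hypothetical file name

    # Fraction of all questions asked in a year that carry each tag.
    df["fraction"] = df["number"] / df["year_total"]

    # How the python tag's share has changed over time.
    print(df[df["tag"] == "python"][["year", "fraction"]])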

    Acknowledgements

    DataCamp

  18. Replication package for EMSE article: "Does Microservice Adoption Impact the...

    • zenodo.org
    zip
    Updated Jul 24, 2025
    Cite
    Mikel Robredo; Nyyti Saarimäki; Agbonvihele Gregrey Oko-oboh; Davide Taibi; Valentina Lenarduzzi (2025). Replication package for EMSE article: "Does Microservice Adoption Impact the Velocity? A Cohort Study" [Dataset]. http://doi.org/10.5281/zenodo.16407138
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mikel Robredo; Nyyti Saarimäki; Agbonvihele Gregrey Oko-oboh; Davide Taibi; Valentina Lenarduzzi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains all the Python and R source code to conduct the data collection, preprocessing and analysis of this study.

    Contents

    This repository contains the following:

    • INSTALL: Detailed installation instructions for each of the used tools as well as the required Python dependencies.

    • Figures: Figures added in the PDF version of the manuscript. The analysis and scripts generate further figures that support the results of the study.

    • Codes:

    • Datasets: Contains all the required data to start, follow and finish the analysis of this study.

    Getting Started

    These instructions will get you a copy of the project up and running on your local machine. Beforehand, please follow the installation instructions in the INSTALL documentation.

    Prerequisites

    Running the code requires Python 3.9; see the installation instructions in the INSTALL documentation.

    The dependencies needed to run the code are all listed in the file requirements.txt. They can be installed using pip:

    pip install -r requirements.txt

    You might also want to consider using a virtual environment.

    Running the R code requires installing RStudio. Installation instructions can be found on the official webpage of the CRAN project.

    Installing the necessary libraries is a two-step process, needed in any of the R scripts used:

    For installing the packages: install.packages("package")
    For importing the package: library(package)

    List of required packages: effsize, dplyr, psych, corrplot, AICcmodavg, xtable.

    Running the code

    NOTE: Set up the project folders as they are referenced in the code, and change the path names in each of the Python files.

    1. DATA-MINING PROCEDURES (All the content is described in the code)

      • NOTE: Arrange the folder structure in the same way as displayed in figshare so that the code works, or else manage the locations yourself throughout the code.
        Subsequent CSV files produced by the crawlers will be stored in the mentioned folders until the merge stage.

      1.1. Mining project files with initial confounders from the GitHub API

       - Use notebook apacheGitHub.ipynb
       - Remember to create a token (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token).
      

      1.2. Mining registered ASF projects in SonarCloud from its API

         - Execute sonarQubeCrawler.ipynb 
         - Remember to create a token (https://docs.sonarcloud.io/advanced-setup/web-api/).
      

      1.3. Mining issues from ASF repositories in Jira and GitHub

       - Use notebook issueCrawlerGithub.ipynb for issues tracked in GitHub and jiraCrawler.ipynb for issues tracked in Jira. 
        (No token is needed with Atlassian for Jira issues.)
      

      1.4. Mining commits from ASF projects in GitHub.

       - Use commitCrawler.ipynb to crawl over the considered repositories and mine their commits. In addition, it will handle the name difference for projects using SQ, since their names in

  19. BlocPower - Summarize, plot and validate

    • redivis.com
    Updated Oct 22, 2023
    Cite
    Kumar H (2023). BlocPower - Summarize, plot and validate [Dataset]. https://redivis.com/workflows/tajy-74j9c5jyx
    Explore at:
    Dataset updated
    Oct 22, 2023
    Dataset provided by
    Redivis Inc.
    Authors
    Kumar H
    Description

    Abstract

    This project uses Python to load BlocPower's data for 121 million buildings in the US, summarize it to the spatial unit of interest (state, county, or zipcode), and plot key statistics. It also compares and validates the zipcode-level statistics with other independent data sources: Microsoft (for building counts) and Goldstein et al. (2022) (for energy use).
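    A sketch of the summarize-and-plot step described above (the file name and columns such as zipcode, building_id, and site_energy_kbtu are assumptions about the schema):

    import matplotlib.pyplot as plt
    import pandas as pd

    buildings = pd.read_csv("blocpower_buildings.csv")  # hypothetical file name

    # Summarize to the spatial unit of interest, e.g. zipcode.
    summary = buildings.groupby("zipcode").agg(
        building_count=("building_id", "count"),
        total_energy_kbtu=("site_energy_kbtu", "sum"),
    )
    summary["building_count"].plot(kind="hist", bins=50, title="Buildings per zipcode")
    plt.show()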

  20. Wuzzuf Data Analyst jobs

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    ahmed abbas (2023). Wuzzuf Data Analyst jobs [Dataset]. https://www.kaggle.com/datasets/ahmedabbas757/wuzzuf-data-analyst-jobs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle
    Authors
    ahmed abbas
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Content

    This is a web scraping project that extracts job postings from the Wuzzuf website using Python, in order to analyze the data and create a dashboard showing the distribution of jobs by location, industry, and experience level (a scraping sketch follows the glossary below).

    Dataset Glossary (Column-Wise)

    • Job title: name of the job.
    • Company name: name of the company that posted the advertisement.
    • Location: company location.
    • Job type: type of job, i.e., full time or part time.
    • Exp level: required seniority level, e.g., senior or manager.
    • Exp years: number of years of experience required for the job.
    • Skills: skills required for the job.
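    A minimal sketch of the scraping approach described above (the URL and CSS selectors are placeholders; Wuzzuf's real markup differs and its terms of use apply):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "https://wuzzuf.net/search/jobs/?q=data+analyst",  # hypothetical query URL
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    soup = BeautifulSoup(resp.text, "html.parser")

    # Selector is a placeholder; inspect the live page to find the real one.
    for card in soup.select("div.job-card"):
        title = card.select_one("h2 a")
        if title:
            print(title.get_text(strip=True))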