36 datasets found
  1. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/suggestions
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
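
    As a rough illustration of the kind of cleaning and EDA described (the file and column names here are assumptions, not taken from the dataset):

    import pandas as pd

    # Assumed file and column names, for illustration only.
    trips = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])

    # Trip duration in minutes and start hour, for peak-usage analysis.
    trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
    trips["start_hour"] = trips["started_at"].dt.hour

    # Peak usage times by rider type (e.g., member vs. casual).
    print(trips.groupby(["member_casual", "start_hour"]).size().unstack(fill_value=0))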

  2. IPhone Customer Survey | NLP

    • opendatabay.com
    Updated Jun 20, 2025
    Cite
    Datasimple (2025). IPhone Customer Survey | NLP [Dataset]. https://www.opendatabay.com/data/ai-ml/8496ac33-2bc1-4401-868d-3cc6c5369f16
    Explore at:
    Available download formats
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    Context This dataset offers a treasure trove for conducting sentiment analysis, feature analysis, and topic modeling on customer reviews. It includes vital information like product ASIN, country, and date, which help gauge customer trust and engagement. Each review features a rating score, along with a compelling review title and detailed description, providing a window into customer emotions and preferences. Additionally, the review URL, reviewed language/region, and variant ASIN enrich the analysis, allowing for a deeper understanding of how different product versions resonate with consumers in various markets. This comprehensive approach not only highlights customer sentiments but also reveals key insights that can drive product development and marketing strategies.

    Dataset Glossary (column-wise):
    • productAsin: Unique identifier for the product.
    • country: Location where the review was submitted.
    • date: Date of the review.
    • isVerified: Indicates if the reviewer is a verified purchaser.
    • ratingScore: Numerical score given by the reviewer (typically 1-5).
    • reviewTitle: Brief summary of the review.
    • reviewDescription: Detailed feedback from the reviewer.
    • reviewUrl: Link to the full review online.
    • reviewedIn: Language or region in which the review was written.
    • variant: Specific version of the product reviewed.
    • variantAsin: Unique identifier for the product variant.
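
    A minimal sentiment-analysis sketch over these columns; the CSV file name is an assumption, and NLTK's VADER is one common choice rather than anything prescribed by the dataset:

    import pandas as pd
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")               # one-time download of the VADER lexicon
    reviews = pd.read_csv("iphone_reviews.csv")  # assumed file name

    sia = SentimentIntensityAnalyzer()
    # Compound score in [-1, 1]: negative to positive sentiment.
    reviews["sentiment"] = reviews["reviewDescription"].fillna("").map(
        lambda text: sia.polarity_scores(text)["compound"]
    )
    # Sanity check: sentiment should correlate with the numeric rating.
    print(reviews.groupby("ratingScore")["sentiment"].mean())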

    License

    CC0

    Original Data Source: IPhone Customer Survey | NLP

  3. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and they are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is built in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
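
    ImputEHR itself is a graphical tool, but the style of imputation it wraps can be sketched with scikit-learn (illustrative only, not the tool's actual code):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
    from sklearn.impute import SimpleImputer, IterativeImputer
    from sklearn.ensemble import HistGradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[rng.random(X.shape) < 0.2] = np.nan  # ~20% missing, mimicking EHR sparsity

    # Simple baseline: mean imputation.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # More sophisticated: iterative imputation with a gradient-boosted tree model.
    X_gbt = IterativeImputer(estimator=HistGradientBoostingRegressor(), max_iter=5,
                             random_state=0).fit_transform(X)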

  4. Amazon Sales Data

    • kaggle.com
    Updated Jun 24, 2024
    Cite
    Mithilesh Kale (2024). Amazon Sales Data [Dataset]. https://www.kaggle.com/datasets/mithilesh9/amazon-sales-data-analysis/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mithilesh Kale
    Description

    https://www.kaggle.com/code/mithilesh9/amazon-sales-data-analysis-using-python

    Dataset Description: This dataset contains 100 rows of sales data for Amazon, including the region, country, item type, sales channel, order priority, order date, order ID, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit.
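
    Given the fields listed above, a quick profitability breakdown might look like this (the file name and exact column spellings are assumptions):

    import pandas as pd

    sales = pd.read_csv("amazon_sales.csv")  # assumed file name
    # Total revenue and profit per region, sorted by profit.
    summary = (sales.groupby("Region")[["Total Revenue", "Total Profit"]]
                    .sum()
                    .sort_values("Total Profit", ascending=False))
    print(summary)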


  5. Apple IPhone Customer Reviews

    • opendatabay.com
    Updated Jun 10, 2025
    Cite
    Datasimple (2025). Apple IPhone Customer Reviews [Dataset]. https://www.opendatabay.com/data/consumer/42533232-0299-4752-8408-4579f2251a34
    Explore at:
    Available download formats
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Reviews & Ratings
    Description

    Based on this dataset of iPhone reviews from Amazon, here are some project areas to explore:

    -> Sentiment analysis: Determine overall sentiment and identify trends.

    -> Feature analysis: Analyze user satisfaction with specific features.

    -> Topic modeling: Discover underlying themes and discussion points.
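
    For the topic-modeling direction, a small sketch with scikit-learn's LDA (the file and column names are assumptions):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    reviews = pd.read_csv("apple_iphone_reviews.csv")  # assumed file name
    texts = reviews["reviewDescription"].fillna("")    # assumed column name

    vec = CountVectorizer(max_features=2000, stop_words="english")
    dtm = vec.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(dtm)
    words = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-8:][::-1]]  # 8 strongest words per topic
        print(f"Topic {i}: {', '.join(top)}")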

    Original Data Source: Apple IPhone Customer Reviews

  6. World's Air Quality and Water Pollution Dataset

    • kaggle.com
    Updated Oct 30, 2023
    Cite
    VICTOR AHAJI (2023). World's Air Quality and Water Pollution Dataset [Dataset]. https://www.kaggle.com/datasets/victorahaji/worlds-air-quality-and-water-pollution-dataset/data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    VICTOR AHAJI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    World
    Description

    The Dataset "World's Air Quality and Water Pollution" was obtained from Jack Jae Hwan Kim Kaggle page. This Dataset is comprized of 5 columns; "City", "Region", "Country", "Air Quality" and "Water Pollution". The last two columns consist of values varying from 0 to 100; Air Quality Column: Air quality varies from 0 (bad quality) to 100 (top good quality) Water Pollution Column: Water pollution varies from 0 (no pollution) to 100 (extreme pollution).

  7. YouTube Trending Videos of the Day

    • opendatabay.com
    Updated Jun 20, 2025
    Cite
    Datasimple (2025). YouTube Trending Videos of the Day [Dataset]. https://www.opendatabay.com/data/ai-ml/34cfa60b-afac-4753-9409-bc00f9e8fbec
    Explore at:
    Available download formats
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    YouTube, Data Science and Analytics
    Description

    The dataset includes YouTube trending-video statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and covers the following 20 countries:

    IT - Italy, ES - Spain, GR - Greece, HR - Croatia, TR - Turkey, AL - Albania, DZ - Algeria, EG - Egypt, LY - Libya, TN - Tunisia, MA - Morocco, IL - Israel, ME - Montenegro, LB - Lebanon, FR - France, BA - Bosnia and Herzegovina, MT - Malta, SI - Slovenia, CY - Cyprus, SY - Syria

    The columns are the following:

    • country: country in which the video was published.
    • video_id: video identification number. Each video has one; you can find it by right-clicking a video and selecting 'stats for nerds'.
    • title: title of the video.
    • publishedAt: publication date of the video.
    • channelId: identification number of the channel that published the video.
    • channelTitle: name of the channel that published the video.
    • categoryId: identification number of the video's category. Each number corresponds to a certain category; for example, 10 corresponds to the 'music' category.
    • trending_date: trending date of the video.
    • tags: tags present in the video.
    • view_count: view count of the video.
    • comment_count: number of comments on the video.
    • thumbnail_link: link of the image that appears before clicking the video.
    • comments_disabled: whether comments are disabled for the video.
    • ratings_disabled: whether ratings are disabled for the video.
    • description: description below the video.

    Inspiration

    You can perform an exploratory data analysis of the dataset, working with pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the pandas functions. It is also possible to analyze the titles, tags, and descriptions of the videos to search for relevant information. Remember to upvote if you found the dataset useful :)
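
    For instance, a pandas take on a couple of such queries (the file name is an assumption; the columns follow the glossary above):

    import pandas as pd

    videos = pd.read_csv("trending_videos.csv", parse_dates=["publishedAt"])  # assumed file name
    # Most-viewed trending video per country.
    top = videos.sort_values("view_count", ascending=False).groupby("country").head(1)
    print(top[["country", "title", "channelTitle", "view_count"]])
    # Share of trending videos with comments disabled, by country.
    print(videos.groupby("country")["comments_disabled"].mean().sort_values())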

    License

    CC0

    Original Data Source: YouTube Trending Videos of the Day

  8. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces the Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:

       git clone
       cd

    2. Create and activate a virtual environment (optional but recommended):

       python -m venv venv
       source venv/bin/activate # On Windows, use venv\Scripts\activate

    3. Install the required packages. All dependencies are listed in the requirements.txt file; install them using pip:

       pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
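
    The group-aware cross-validation in step 1 can be sketched with scikit-learn's GroupKFold; the feature, target, and group column names below are hypothetical stand-ins, and the notebook remains the authoritative implementation:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, cross_val_score

    data = pd.read_csv("Inference_data_Extended.csv")
    X = data[["num_accelerators", "memory_gb"]]  # hypothetical feature columns
    y = data["throughput"]                       # hypothetical target column
    groups = data["system_name"]                 # hypothetical key keeping each system's runs in one fold

    scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                             groups=groups, cv=GroupKFold(n_splits=5))
    print(scores)
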
  9. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:
    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance (a sketch follows below).
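
    A hedged sketch of the tuning in step 4, using synthetic stand-in data (the features are illustrative, not the dataset's actual columns):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))     # stand-ins for engine temperature and vibration level
    y = rng.integers(0, 3, size=300)  # 0=Normal, 1=Requires Maintenance, 2=Critical

    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)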

    Tools Used
    1. Python: Data processing, analysis and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated
    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings
    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: A balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced
    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: Accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action
    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  10. Machine Learning Foundations Course

    • explore.openaire.eu
    Updated Nov 17, 2022
    Cite
    SumuduTennakoon (2022). Machine Learning Foundations Course [Dataset]. http://doi.org/10.5281/zenodo.7329327
    Explore at:
    Dataset updated
    Nov 17, 2022
    Authors
    SumuduTennakoon
    Description

    Course Description

    Machine learning enables us to uncover trends and patterns hidden in data and make predictions based on historical observations. It is crucial in implementing Artificial Intelligence (AI) systems and helps industry and academia in complex problem-solving, predictive analytics, automation, etc. Machine learning is therefore an essential skill that Data Science and related technical professionals should carry in their toolboxes. This course aims to provide a fundamental understanding of the core principles of Machine Learning (ML), with hands-on training in applying machine learning to solve real-world problems. A learner who completes this course should be able to define a machine learning problem, understand the solution path, and carry out the end-to-end process of building a machine learning application.

    Topics Covered
    • Introduction to Machine Learning (ML), History, and Applications
    • Setting up a Computing Environment, Python and Required Libraries
    • Knowledge Foundations for ML (Computing, Statistics, and Mathematics)
    • Exploratory Data Analysis (EDA) and Feature Engineering
    • Supervised Machine Learning
    • Unsupervised Machine Learning
    • Explaining ML Models and Predictions
    • Introduction to Deep Learning and Neural Networks
    • Design, Develop and Deploy ML Solutions
    • Capstone Project

    Prerequisites: Basics of computer programming, mathematics, and statistics. Basic knowledge of computer applications: spreadsheet, word processor, and presentation authoring.

    This is the initial release of the Machine Learning Foundations Course Repository by Sumudu Tennakoon. Full Changelog: https://github.com/SumuduTennakoon/MachineLearningFoundations/commits/v1.0.0

  11. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. It has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is organized into subreddits; here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022 and contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, and a little cleaning was done using NumPy and pandas (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:
    • author - Redditor name.
    • author_fullname - Redditor full name.
    • contest_mode - Contest mode (implements obscured scores and randomized sorting).
    • created_utc - Time the submission was created, represented in Unix time.
    • domain - Domain of the submission.
    • edited - Whether the post is edited or not.
    • full_link - Link of the post on the subreddit.
    • id - ID of the submission.
    • is_self - Whether or not the submission is a self post (text-only).
    • link_flair_css_class - CSS class used to identify the flair.
    • link_flair_text - The link flair's text content.
    • locked - Whether or not the submission has been locked.
    • num_comments - The number of comments on the submission.
    • over_18 - Whether or not the submission has been marked as NSFW.
    • permalink - A permalink for the submission.
    • retrieved_on - Time the submission was ingested.
    • score - The number of upvotes for the submission.
    • description - Description of the submission.
    • spoiler - Whether or not the submission has been marked as a spoiler.
    • stickied - Whether or not the submission is stickied.
    • thumbnail - Thumbnail of the submission.
    • question - Question asked in the submission.
    • url - The URL the submission links to, or the permalink if a self post.
    • year - Year of the submission.
    • banned - Whether banned by a moderator or not.

    This dataset can be used for flair prediction, NSFW classification, and different text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and see trends and patterns over the years.
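
    As one concrete starting point, a minimal flair-prediction baseline (TF-IDF features plus logistic regression; the CSV file name is an assumption):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    posts = pd.read_csv("askscience_submissions.csv")  # assumed file name
    posts = posts.dropna(subset=["question", "link_flair_text"])

    X_train, X_test, y_train, y_test = train_test_split(
        posts["question"], posts["link_flair_text"], test_size=0.2, random_state=0)

    model = make_pipeline(TfidfVectorizer(max_features=50000),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))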

  12. bookstore dataset

    • kaggle.com
    Updated Aug 16, 2022
    Cite
    Sbonelo Ndhlazi (2022). bookstore dataset [Dataset]. https://www.kaggle.com/datasets/sbonelondhlazi/bookstore-dataset/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sbonelo Ndhlazi
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This data set was scraped using Python from http://books.toscrape.com/, a fictional book store. It contains 1,000 books with different categories, star ratings, and prices. The data set can be used by anyone who wants to practice data cleaning and simple data manipulation.

    The code I used to scrape this data can be found on my GitHub: https://github.com/Sbonelondhlazi/dummybooks
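
    A scraping sketch in the same spirit (the selectors below reflect the site's well-known markup, but they are an assumption here; the author's actual script is in the linked repository):

    import requests
    from bs4 import BeautifulSoup

    books = []
    for page in range(1, 3):  # the full site has 50 pages of 20 books
        url = f"http://books.toscrape.com/catalogue/page-{page}.html"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for item in soup.select("article.product_pod"):  # one <article> per book
            books.append({
                "title": item.h3.a["title"],
                "price": item.select_one("p.price_color").text,
                "rating": item.select_one("p.star-rating")["class"][1],  # e.g. "Three"
            })
    print(len(books), books[0])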

  13. Noon & Amazon

    • kaggle.com
    Updated Apr 19, 2025
    Cite
    Mohammed Elghannam (2025). Noon & Amazon [Dataset]. https://www.kaggle.com/datasets/mohamedelghannam15/noon-and-amazon
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 19, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohammed Elghannam
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🛍️ Amazon vs Noon: Electronics Price & Discount Comparison This dataset contains scraped product information from two major e-commerce platforms: Amazon and Noon, focusing on electronics. The goal is to compare pricing strategies and discounts offered by each platform.

    📌 Dataset Summary
    • Sources: Amazon & Noon (scraped using custom Python scripts)
    • Categories: Electronics (Laptops, Accessories, etc.)
    • Data Fields: Product Title, Brand, Price, Original Price, Discount, Rating, and more
    • Processing: The data needs to be cleaned.
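
    Since the data needs cleaning, a first pass might normalize the price strings; the column spellings follow the field list above, and the exact string formats are assumptions:

    import pandas as pd

    df = pd.read_csv("noon_amazon_electronics.csv")  # assumed file name
    # Strip currency symbols/commas and coerce to numbers; bad values become NaN.
    for col in ["Price", "Original Price"]:
        df[col] = pd.to_numeric(df[col].astype(str).str.replace(r"[^\d.]", "", regex=True),
                                errors="coerce")
    # Recompute the discount from the two prices as a consistency check.
    df["discount_pct"] = 100 * (1 - df["Price"] / df["Original Price"])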

  14. Digital_Payments_2025_Dataset

    • figshare.com
    csv
    Updated Apr 25, 2025
    Cite
    shreyash tiwari (2025). Digital_Payments_2025_Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28873229.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    figshare
    Authors
    shreyash tiwari
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The "Digital Payments 2025 Dataset" is a synthetic dataset representing digital payment transactions across various payment applications in India for the year 2025. It captures monthly transaction data for multiple payment apps, including banks, UPI platforms, and mobile payment services, reflecting the growing adoption of digital payments in India. The dataset was created as part of a college project to simulate realistic transaction patterns for research, education, and analysis in data science, economics, and fintech studies. It includes metrics such as customer transaction counts and values, total transaction counts and values, and temporal data (month and year). The data is synthetic, generated using Python libraries to mimic real-world digital payment trends, and is suitable for academic research, teaching, and exploratory data analysis.

  15. Diwali_Sales_Dataset

    • kaggle.com
    Updated Aug 30, 2024
    Cite
    BharathiD8 (2024). Diwali_Sales_Dataset [Dataset]. https://www.kaggle.com/datasets/bharathid8/diwali-sales-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BharathiD8
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Project Overview

    Objective: Analyze Diwali sales data to uncover trends, customer behavior, and sales performance during the festive season.
    Tools Used: Python, Pandas, NumPy, Matplotlib, Seaborn

    Data Collection and Preparation

    • Dataset: A dataset containing sales data for Diwali, including details like product categories, customer demographics, sales amounts, discounts, etc.
    • Data Cleaning: Handle missing values, remove duplicates, and correct any inconsistencies in the data.
    • Feature Engineering: Create new features if necessary, such as total sales per customer, average discount per sale, etc.

    Exploratory Data Analysis (EDA)

    • Descriptive Statistics: Calculate basic statistics (mean, median, mode) to get a sense of the data distribution.
    • Visualizations:
      - Sales Trends: Plot sales over time to see how they varied during the Diwali season.
      - Top-Selling Products: Identify the products or categories with the highest sales.
      - Customer Demographics: Analyze sales by age, gender, and location to understand customer behavior.
      - Discount Impact: Evaluate how different discount levels affected sales volume.
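
    For instance, the demographic views might be drawn as follows (the file and column names are assumptions based on the description):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    sales = pd.read_csv("diwali_sales.csv")  # assumed file name
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.barplot(data=sales, x="Age Group", y="Amount", estimator=sum, ax=axes[0])
    axes[0].set_title("Sales by age group")
    sns.barplot(data=sales, x="Gender", y="Amount", estimator=sum, ax=axes[1])
    axes[1].set_title("Sales by gender")
    plt.tight_layout()
    plt.show()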

    Key Findings

    • Customer Behavior: Insights on which customer segments contributed the most to sales.
    • Sales Performance: Which products or categories had the highest sales, and on which days of Diwali sales peaked.
    • Discount Effectiveness: The impact of discounts on sales, and whether higher discounts led to significantly higher sales or not.

    Conclusion

    Summarize the key insights derived from the EDA. Discuss any patterns or trends that were unexpected or particularly interesting. Provide recommendations for future sales strategies based on the findings.

  16. Singapore Street Co-ordinates

    • kaggle.com
    Updated Mar 30, 2025
    Cite
    Nithesh Karthik (2025). Singapore Street Co-ordinates [Dataset]. https://www.kaggle.com/datasets/nitheshkarthik/singapore-street-co-ordinates/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nithesh Karthik
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Area covered
    Singapore
    Description

    Context This dataset was created as part of a project in my coursework module, Python for Business Analytics (DAO2702). We are performing data analysis on historical HDB resale data in Singapore, and as part of this analysis I had to create this dataset containing the coordinates of the streets listed in the resale data.

    Content Most of the streets in the list were geo-coded using Python packages, and some were collected manually by searching for the streets on Google Maps and copying the latitude and longitude. Currently, the total number of streets geo-coded is 589. The list of street names might grow in the future as new streets are formed, and some street names may have changed over time. 100% accuracy is not guaranteed, as there might be slight errors. Please keep these aspects in mind when using the dataset.
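
    The geocoding step can be sketched with geopy's Nominatim geocoder, one reasonable package choice (not necessarily the one used here):

    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent="hdb-street-geocoder")  # Nominatim requires a user_agent
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # be polite to the free service

    location = geocode("Ang Mo Kio Avenue 3, Singapore")
    if location is not None:
        print(location.latitude, location.longitude)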

  17. Data and Code for the paper "GUI Testing of Android Applications:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 25, 2023
    Cite
    Luigi Libero Lucio Starace (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Anna Rita Fasolino
    Porfirio Tramontana
    Luigi Libero Lucio Starace
    Sergio Di Martino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

    Abstract

    Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, and experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task, that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers’ performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, and not in scenarios in which the goal is to generate a re-executable test suite, as well. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty masters students, that we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide deeper insights into code coverage dynamics to project managers interested in using exploratory approaches to test simple Android apps, on which they can make more informed decisions.

    Contents and Instructions

    This package contains:

    apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

    apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

    students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

    compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

    branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

    data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
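
    The aggregate-coverage computation behind those scripts can be illustrated as a set union over subsets of testers (toy data; the packaged scripts are the authoritative versions):

    from itertools import combinations

    # Hypothetical per-tester sets of covered line IDs.
    covered = {
        "t1": {1, 2, 3, 7},
        "t2": {2, 3, 4},
        "t3": {5, 6, 7},
    }
    total_lines = 10  # hypothetical LOC count of the app under test

    for k in range(1, len(covered) + 1):
        # Best k-tester crowd: the subset whose union covers the most lines.
        best = max(combinations(covered, k),
                   key=lambda group: len(set().union(*(covered[t] for t in group))))
        pct = 100 * len(set().union(*(covered[t] for t in best))) / total_lines
        print(f"best {k}-tester crowd: {best} -> {pct:.0f}% coverage")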

  18. Representations of Sound and Music in the Middle Ages: Analysis and...

    • zenodo.org
    json
    Updated Mar 17, 2025
    Cite
    Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho (2025). Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database (Records and Performances) [Dataset]. http://doi.org/10.5281/zenodo.15037823
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.

    It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.

    The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:

    • The evolution and spread of musical instruments across Europe and the Mediterranean.

    • Performer typologies and their representation in medieval art.

    • The relationships between musical practices and social or religious contexts.

    Contents:

    Musiconis Dataset (JSON format, December 2024 version):

    • Musical scenes and their descriptions

    • Performer metadata (roles, social status, gender, interactions)

    • Instrument classifications (without detailed organological descriptions)

    Colab Notebook (Python):

    • Data processing and structuring

    • Visualization of performer distributions and instrument usage

    • Exploratory statistics and mapping

    Tools Used:

    • Python (Pandas, Seaborn, Matplotlib, Plotly)

    • Statistical and exploratory data analysis

    • Visualization of instrument distributions, performer interactions, and musical context

  19. Train file sizes Google Identify Contrails

    • kaggle.com
    Updated May 11, 2023
    Cite
    Sergey Saharovskiy (2023). Train file sizes Google Identify Contrails [Dataset]. https://www.kaggle.com/datasets/sergiosaharovskiy/train-file-sizes-google-identify-contrails/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sergey Saharovskiy
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset comprises metadata for the 225,819 train files of the Google Research - Identify Contrails to Reduce Global Warming challenge.

    The file listing was obtained using a simple bash script:

    shopt -s globstar dotglob nullglob  # recursive globs, include dotfiles, expand to nothing on no match
    
    for pathname in train/**/*; do
      if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then  # regular files only, skip symlinks
        stat -c $'%s\t%n' "$pathname"  # print "<size><TAB><path>"
      fi
    done >train_file_sizes.csv
    
    

    After the bash script, the file was preprocessed with the following Python code:

    import pandas as pd
    
    # Read the size/path pairs produced by the bash script.
    train_sizes = pd.read_csv('data/train_file_sizes.csv', delim_whitespace=True, names=['file_size', 'file_path'])
    # The record id is the second path component (train/<record_id>/...).
    train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
    train_sizes.to_csv('data/train_file_sizes.csv', index=False)
    
  20. BlocPower - Summarize, plot and validate

    • redivis.com
    Updated Oct 22, 2023
    Cite
    Kumar H (2023). BlocPower - Summarize, plot and validate [Dataset]. https://redivis.com/workflows/tajy-74j9c5jyx
    Explore at:
    Dataset updated
    Oct 22, 2023
    Dataset provided by
    Redivis Inc.
    Authors
    Kumar H
    Description

    This project uses Python to load BlocPower's data for 121 million buildings in the US, summarize it to the spatial unit of interest (state, county, or zipcode), and plot key statistics. It also compares and validates the zipcode-level statistics against other independent data sources: Microsoft (for building counts) and Goldstein et al. (2022) (for energy use).
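
    In outline, the summarize-and-plot step is a groupby over the chosen spatial unit; the file and column names below are assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    buildings = pd.read_csv("blocpower_buildings.csv")  # assumed file name
    # Summarize to the zipcode level: building counts and mean energy use intensity.
    by_zip = buildings.groupby("zipcode").agg(
        n_buildings=("building_id", "count"),  # hypothetical ID column
        mean_eui=("site_eui", "mean"),         # hypothetical energy-use column
    )
    by_zip["n_buildings"].hist(bins=50)
    plt.xlabel("buildings per zipcode")
    plt.show()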
