36 datasets found
  1. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/suggestions
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
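
    As a rough illustration of the kind of cleaning and EDA described (the file and column names here are assumptions, not taken from the dataset):

    import pandas as pd

    # Assumed file and column names, for illustration only.
    trips = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])

    # Trip duration in minutes and start hour, for peak-usage analysis.
    trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
    trips["start_hour"] = trips["started_at"].dt.hour

    # Peak usage times by rider type (e.g., member vs. casual).
    print(trips.groupby(["member_casual", "start_hour"]).size().unstack(fill_value=0))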

  2. IPhone Customer Survey | NLP

    • opendatabay.com
    Updated Jun 20, 2025
    Cite
    Datasimple (2025). IPhone Customer Survey | NLP [Dataset]. https://www.opendatabay.com/data/ai-ml/8496ac33-2bc1-4401-868d-3cc6c5369f16
    Explore at:
    Available download formats
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    Context This dataset offers a treasure trove for conducting sentiment analysis, feature analysis, and topic modeling on customer reviews. It includes vital information like product ASIN, country, and date, which help gauge customer trust and engagement. Each review features a rating score, along with a compelling review title and detailed description, providing a window into customer emotions and preferences. Additionally, the review URL, reviewed language/region, and variant ASIN enrich the analysis, allowing for a deeper understanding of how different product versions resonate with consumers in various markets. This comprehensive approach not only highlights customer sentiments but also reveals key insights that can drive product development and marketing strategies.

    Dataset Glossary (column-wise):
    • productAsin: Unique identifier for the product.
    • country: Location where the review was submitted.
    • date: Date of the review.
    • isVerified: Indicates if the reviewer is a verified purchaser.
    • ratingScore: Numerical score given by the reviewer (typically 1-5).
    • reviewTitle: Brief summary of the review.
    • reviewDescription: Detailed feedback from the reviewer.
    • reviewUrl: Link to the full review online.
    • reviewedIn: Language or region in which the review was written.
    • variant: Specific version of the product reviewed.
    • variantAsin: Unique identifier for the product variant.
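
    A minimal sentiment-analysis sketch over these columns; the CSV file name is an assumption, and NLTK's VADER is one common choice rather than anything prescribed by the dataset:

    import pandas as pd
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")               # one-time download of the VADER lexicon
    reviews = pd.read_csv("iphone_reviews.csv")  # assumed file name

    sia = SentimentIntensityAnalyzer()
    # Compound score in [-1, 1]: negative to positive sentiment.
    reviews["sentiment"] = reviews["reviewDescription"].fillna("").map(
        lambda text: sia.polarity_scores(text)["compound"]
    )
    # Sanity check: sentiment should correlate with the numeric rating.
    print(reviews.groupby("ratingScore")["sentiment"].mean())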

    License

    CC0

    Original Data Source: IPhone Customer Survey | NLP

  3. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and they are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is built in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
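
    ImputEHR itself is a graphical tool, but the style of imputation it wraps can be sketched with scikit-learn (illustrative only, not the tool's actual code):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
    from sklearn.impute import SimpleImputer, IterativeImputer
    from sklearn.ensemble import HistGradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[rng.random(X.shape) < 0.2] = np.nan  # ~20% missing, mimicking EHR sparsity

    # Simple baseline: mean imputation.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # More sophisticated: iterative imputation with a gradient-boosted tree model.
    X_gbt = IterativeImputer(estimator=HistGradientBoostingRegressor(), max_iter=5,
                             random_state=0).fit_transform(X)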

  4. Amazon Sales Data

    • kaggle.com
    Updated Jun 24, 2024
    Cite
    Mithilesh Kale (2024). Amazon Sales Data [Dataset]. https://www.kaggle.com/datasets/mithilesh9/amazon-sales-data-analysis/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mithilesh Kale
    Description

    https://www.kaggle.com/code/mithilesh9/amazon-sales-data-analysis-using-python

    Dataset Description: This dataset contains 100 rows of sales data for Amazon, including the region, country, item type, sales channel, order priority, order date, order ID, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit.
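
    Given the fields listed above, a quick profitability breakdown might look like this (the file name and exact column spellings are assumptions):

    import pandas as pd

    sales = pd.read_csv("amazon_sales.csv")  # assumed file name
    # Total revenue and profit per region, sorted by profit.
    summary = (sales.groupby("Region")[["Total Revenue", "Total Profit"]]
                    .sum()
                    .sort_values("Total Profit", ascending=False))
    print(summary)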


  5. Apple IPhone Customer Reviews

    • opendatabay.com
    Updated Jun 10, 2025
    Cite
    Datasimple (2025). Apple IPhone Customer Reviews [Dataset]. https://www.opendatabay.com/data/consumer/42533232-0299-4752-8408-4579f2251a34
    Explore at:
    Available download formats
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Reviews & Ratings
    Description

    Based on this dataset of iPhone reviews from Amazon, here are some project areas to explore:

    -> Sentiment analysis: Determine overall sentiment and identify trends.

    -> Feature analysis: Analyze user satisfaction with specific features.

    -> Topic modeling: Discover underlying themes and discussion points.
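
    For the topic-modeling direction, a small sketch with scikit-learn's LDA (the file and column names are assumptions):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    reviews = pd.read_csv("apple_iphone_reviews.csv")  # assumed file name
    texts = reviews["reviewDescription"].fillna("")    # assumed column name

    vec = CountVectorizer(max_features=2000, stop_words="english")
    dtm = vec.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(dtm)
    words = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-8:][::-1]]  # 8 strongest words per topic
        print(f"Topic {i}: {', '.join(top)}")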

    Original Data Source: Apple IPhone Customer Reviews

  6. World's Air Quality and Water Pollution Dataset

    • kaggle.com
    Updated Oct 30, 2023
    Cite
    VICTOR AHAJI (2023). World's Air Quality and Water Pollution Dataset [Dataset]. https://www.kaggle.com/datasets/victorahaji/worlds-air-quality-and-water-pollution-dataset/data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    VICTOR AHAJI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    World
    Description

    The Dataset "World's Air Quality and Water Pollution" was obtained from Jack Jae Hwan Kim Kaggle page. This Dataset is comprized of 5 columns; "City", "Region", "Country", "Air Quality" and "Water Pollution". The last two columns consist of values varying from 0 to 100; Air Quality Column: Air quality varies from 0 (bad quality) to 100 (top good quality) Water Pollution Column: Water pollution varies from 0 (no pollution) to 100 (extreme pollution).

  7. YouTube Trending Videos of the Day

    • opendatabay.com
    Updated Jun 20, 2025
    Cite
    Datasimple (2025). YouTube Trending Videos of the Day [Dataset]. https://www.opendatabay.com/data/ai-ml/34cfa60b-afac-4753-9409-bc00f9e8fbec
    Explore at:
    Available download formats
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    YouTube, Data Science and Analytics
    Description

    The dataset includes YouTube trending-video statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and covers the following 20 countries:

    IT - Italy, ES - Spain, GR - Greece, HR - Croatia, TR - Turkey, AL - Albania, DZ - Algeria, EG - Egypt, LY - Libya, TN - Tunisia, MA - Morocco, IL - Israel, ME - Montenegro, LB - Lebanon, FR - France, BA - Bosnia and Herzegovina, MT - Malta, SI - Slovenia, CY - Cyprus, SY - Syria

    The columns are the following:

    • country: country in which the video was published.
    • video_id: video identification number. Each video has one; you can find it by right-clicking a video and selecting 'stats for nerds'.
    • title: title of the video.
    • publishedAt: publication date of the video.
    • channelId: identification number of the channel that published the video.
    • channelTitle: name of the channel that published the video.
    • categoryId: identification number of the video's category. Each number corresponds to a certain category; for example, 10 corresponds to the 'music' category.
    • trending_date: trending date of the video.
    • tags: tags present in the video.
    • view_count: view count of the video.
    • comment_count: number of comments on the video.
    • thumbnail_link: link of the image that appears before clicking the video.
    • comments_disabled: whether comments are disabled for the video.
    • ratings_disabled: whether ratings are disabled for the video.
    • description: description below the video.

    Inspiration

    You can perform an exploratory data analysis of the dataset, working with pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the pandas functions. It is also possible to analyze the titles, tags, and descriptions of the videos to search for relevant information. Remember to upvote if you found the dataset useful :)
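
    For instance, a pandas take on a couple of such queries (the file name is an assumption; the columns follow the glossary above):

    import pandas as pd

    videos = pd.read_csv("trending_videos.csv", parse_dates=["publishedAt"])  # assumed file name
    # Most-viewed trending video per country.
    top = videos.sort_values("view_count", ascending=False).groupby("country").head(1)
    print(top[["country", "title", "channelTitle", "view_count"]])
    # Share of trending videos with comments disabled, by country.
    print(videos.groupby("country")["comments_disabled"].mean().sort_values())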

    License

    CC0

    Original Data Source: YouTube Trending Videos of the Day

  8. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces the Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:

       git clone
       cd

    2. Create and activate a virtual environment (optional but recommended):

       python -m venv venv
       source venv/bin/activate # On Windows, use venv\Scripts\activate

    3. Install the required packages. All dependencies are listed in the requirements.txt file; install them using pip:

       pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
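
    The group-aware cross-validation in step 1 can be sketched with scikit-learn's GroupKFold; the feature, target, and group column names below are hypothetical stand-ins, and the notebook remains the authoritative implementation:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, cross_val_score

    data = pd.read_csv("Inference_data_Extended.csv")
    X = data[["num_accelerators", "memory_gb"]]  # hypothetical feature columns
    y = data["throughput"]                       # hypothetical target column
    groups = data["system_name"]                 # hypothetical key keeping each system's runs in one fold

    scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                             groups=groups, cv=GroupKFold(n_splits=5))
    print(scores)
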
  9. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:
    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance (a sketch follows below).
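
    A hedged sketch of the tuning in step 4, using synthetic stand-in data (the features are illustrative, not the dataset's actual columns):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))     # stand-ins for engine temperature and vibration level
    y = rng.integers(0, 3, size=300)  # 0=Normal, 1=Requires Maintenance, 2=Critical

    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)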

    Tools Used
    1. Python: Data processing, analysis and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated
    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings
    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: A balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced
    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: Accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action
    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  10. Machine Learning Foundations Course

    • explore.openaire.eu
    Updated Nov 17, 2022
    Cite
    SumuduTennakoon (2022). Machine Learning Foundations Course [Dataset]. http://doi.org/10.5281/zenodo.7329327
    Explore at:
    Dataset updated
    Nov 17, 2022
    Authors
    SumuduTennakoon
    Description

    Course Description

    Machine learning enables us to uncover trends and patterns hidden in data and make predictions based on historical observations. It is crucial in implementing Artificial Intelligence (AI) systems and helps industry and academia in complex problem-solving, predictive analytics, automation, etc. Machine learning is therefore an essential skill that Data Science and related technical professionals should carry in their toolboxes. This course aims to provide a fundamental understanding of the core principles of Machine Learning (ML), with hands-on training in applying machine learning to solve real-world problems. A learner who completes this course should be able to define a machine learning problem, understand the solution path, and carry out the end-to-end process of building a machine learning application.

    Topics Covered
    • Introduction to Machine Learning (ML), History, and Applications
    • Setting up a Computing Environment, Python and Required Libraries
    • Knowledge Foundations for ML (Computing, Statistics, and Mathematics)
    • Exploratory Data Analysis (EDA) and Feature Engineering
    • Supervised Machine Learning
    • Unsupervised Machine Learning
    • Explaining ML Models and Predictions
    • Introduction to Deep Learning and Neural Networks
    • Design, Develop and Deploy ML Solutions
    • Capstone Project

    Prerequisites: Basics of computer programming, mathematics, and statistics. Basic knowledge of computer applications: spreadsheet, word processor, and presentation authoring.

    This is the initial release of the Machine Learning Foundations Course Repository by Sumudu Tennakoon. Full Changelog: https://github.com/SumuduTennakoon/MachineLearningFoundations/commits/v1.0.0

  11. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. It has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is organized into subreddits; here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022 and contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, and a little cleaning was done using NumPy and pandas (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:
    • author - Redditor name.
    • author_fullname - Redditor full name.
    • contest_mode - Contest mode (implements obscured scores and randomized sorting).
    • created_utc - Time the submission was created, represented in Unix time.
    • domain - Domain of the submission.
    • edited - Whether the post is edited or not.
    • full_link - Link of the post on the subreddit.
    • id - ID of the submission.
    • is_self - Whether or not the submission is a self post (text-only).
    • link_flair_css_class - CSS class used to identify the flair.
    • link_flair_text - The link flair's text content.
    • locked - Whether or not the submission has been locked.
    • num_comments - The number of comments on the submission.
    • over_18 - Whether or not the submission has been marked as NSFW.
    • permalink - A permalink for the submission.
    • retrieved_on - Time the submission was ingested.
    • score - The number of upvotes for the submission.
    • description - Description of the submission.
    • spoiler - Whether or not the submission has been marked as a spoiler.
    • stickied - Whether or not the submission is stickied.
    • thumbnail - Thumbnail of the submission.
    • question - Question asked in the submission.
    • url - The URL the submission links to, or the permalink if a self post.
    • year - Year of the submission.
    • banned - Whether banned by a moderator or not.

    This dataset can be used for flair prediction, NSFW classification, and different text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and see trends and patterns over the years.
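
    As one concrete starting point, a minimal flair-prediction baseline (TF-IDF features plus logistic regression; the CSV file name is an assumption):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    posts = pd.read_csv("askscience_submissions.csv")  # assumed file name
    posts = posts.dropna(subset=["question", "link_flair_text"])

    X_train, X_test, y_train, y_test = train_test_split(
        posts["question"], posts["link_flair_text"], test_size=0.2, random_state=0)

    model = make_pipeline(TfidfVectorizer(max_features=50000),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))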

  12. bookstore dataset

    • kaggle.com
    Updated Aug 16, 2022
    Cite
    Sbonelo Ndhlazi (2022). bookstore dataset [Dataset]. https://www.kaggle.com/datasets/sbonelondhlazi/bookstore-dataset/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sbonelo Ndhlazi
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This data set was scraped using Python from http://books.toscrape.com/, a fictional book store. It contains 1,000 books with different categories, star ratings, and prices. The data set can be used by anyone who wants to practice data cleaning and simple data manipulation.

    The code I used to scrape this data can be found on my GitHub: https://github.com/Sbonelondhlazi/dummybooks
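
    A scraping sketch in the same spirit (the selectors below reflect the site's well-known markup, but they are an assumption here; the author's actual script is in the linked repository):

    import requests
    from bs4 import BeautifulSoup

    books = []
    for page in range(1, 3):  # the full site has 50 pages of 20 books
        url = f"http://books.toscrape.com/catalogue/page-{page}.html"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for item in soup.select("article.product_pod"):  # one <article> per book
            books.append({
                "title": item.h3.a["title"],
                "price": item.select_one("p.price_color").text,
                "rating": item.select_one("p.star-rating")["class"][1],  # e.g. "Three"
            })
    print(len(books), books[0])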

  13. Noon & Amazon

    • kaggle.com
    Updated Apr 19, 2025
    Cite
    Mohammed Elghannam (2025). Noon & Amazon [Dataset]. https://www.kaggle.com/datasets/mohamedelghannam15/noon-and-amazon
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 19, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohammed Elghannam
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🛍️ Amazon vs Noon: Electronics Price & Discount Comparison This dataset contains scraped product information from two major e-commerce platforms: Amazon and Noon, focusing on electronics. The goal is to compare pricing strategies and discounts offered by each platform.

    📌 Dataset Summary
    • Sources: Amazon & Noon (scraped using custom Python scripts)
    • Categories: Electronics (Laptops, Accessories, etc.)
    • Data Fields: Product Title, Brand, Price, Original Price, Discount, Rating, and more
    • Processing: The data needs to be cleaned.
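
    Since the data needs cleaning, a first pass might normalize the price strings; the column spellings follow the field list above, and the exact string formats are assumptions:

    import pandas as pd

    df = pd.read_csv("noon_amazon_electronics.csv")  # assumed file name
    # Strip currency symbols/commas and coerce to numbers; bad values become NaN.
    for col in ["Price", "Original Price"]:
        df[col] = pd.to_numeric(df[col].astype(str).str.replace(r"[^\d.]", "", regex=True),
                                errors="coerce")
    # Recompute the discount from the two prices as a consistency check.
    df["discount_pct"] = 100 * (1 - df["Price"] / df["Original Price"])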

  14. Digital_Payments_2025_Dataset

    • figshare.com
    csv
    Updated Apr 25, 2025
    Cite
    shreyash tiwari (2025). Digital_Payments_2025_Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28873229.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    figshare
    Authors
    shreyash tiwari
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The "Digital Payments 2025 Dataset" is a synthetic dataset representing digital payment transactions across various payment applications in India for the year 2025. It captures monthly transaction data for multiple payment apps, including banks, UPI platforms, and mobile payment services, reflecting the growing adoption of digital payments in India. The dataset was created as part of a college project to simulate realistic transaction patterns for research, education, and analysis in data science, economics, and fintech studies. It includes metrics such as customer transaction counts and values, total transaction counts and values, and temporal data (month and year). The data is synthetic, generated using Python libraries to mimic real-world digital payment trends, and is suitable for academic research, teaching, and exploratory data analysis.

  15. Diwali_Sales_Dataset

    • kaggle.com
    Updated Aug 30, 2024
    Cite
    BharathiD8 (2024). Diwali_Sales_Dataset [Dataset]. https://www.kaggle.com/datasets/bharathid8/diwali-sales-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BharathiD8
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Project Overview

    Objective: Analyze Diwali sales data to uncover trends, customer behavior, and sales performance during the festive season.
    Tools Used: Python, Pandas, NumPy, Matplotlib, Seaborn

    Data Collection and Preparation

    • Dataset: A dataset containing sales data for Diwali, including details like product categories, customer demographics, sales amounts, discounts, etc.
    • Data Cleaning: Handle missing values, remove duplicates, and correct any inconsistencies in the data.
    • Feature Engineering: Create new features if necessary, such as total sales per customer, average discount per sale, etc.

    Exploratory Data Analysis (EDA)

    • Descriptive Statistics: Calculate basic statistics (mean, median, mode) to get a sense of the data distribution.
    • Visualizations:
      - Sales Trends: Plot sales over time to see how they varied during the Diwali season.
      - Top-Selling Products: Identify the products or categories with the highest sales.
      - Customer Demographics: Analyze sales by age, gender, and location to understand customer behavior.
      - Discount Impact: Evaluate how different discount levels affected sales volume.
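
    For instance, the demographic views might be drawn as follows (the file and column names are assumptions based on the description):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    sales = pd.read_csv("diwali_sales.csv")  # assumed file name
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.barplot(data=sales, x="Age Group", y="Amount", estimator=sum, ax=axes[0])
    axes[0].set_title("Sales by age group")
    sns.barplot(data=sales, x="Gender", y="Amount", estimator=sum, ax=axes[1])
    axes[1].set_title("Sales by gender")
    plt.tight_layout()
    plt.show()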

    Key Findings

    • Customer Behavior: Insights on which customer segments contributed the most to sales.
    • Sales Performance: Which products or categories had the highest sales, and on which days of Diwali sales peaked.
    • Discount Effectiveness: The impact of discounts on sales, and whether higher discounts led to significantly higher sales or not.

    Conclusion

    Summarize the key insights derived from the EDA. Discuss any patterns or trends that were unexpected or particularly interesting. Provide recommendations for future sales strategies based on the findings.

  16. Singapore Street Co-ordinates

    • kaggle.com
    Updated Mar 30, 2025
    Cite
    Nithesh Karthik (2025). Singapore Street Co-ordinates [Dataset]. https://www.kaggle.com/datasets/nitheshkarthik/singapore-street-co-ordinates/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nithesh Karthik
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Area covered
    Singapore
    Description

    Context This dataset was created as part of a project in my coursework module, Python for Business Analytics (DAO2702). We are performing data analysis on historical HDB resale data in Singapore, and as part of this analysis I had to create this dataset containing the coordinates of the streets listed in the resale data.

    Content Most of the streets in the list were geo-coded using Python packages, and some were collected manually by searching for the streets on Google Maps and copying the latitude and longitude. Currently, the total number of streets geo-coded is 589. The list of street names might grow in the future as new streets are formed, and some street names may have changed over time. 100% accuracy is not guaranteed, as there might be slight errors. Please keep these aspects in mind when using the dataset.
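
    The geocoding step can be sketched with geopy's Nominatim geocoder, one reasonable package choice (not necessarily the one used here):

    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent="hdb-street-geocoder")  # Nominatim requires a user_agent
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # be polite to the free service

    location = geocode("Ang Mo Kio Avenue 3, Singapore")
    if location is not None:
        print(location.latitude, location.longitude)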

  17. Data and Code for the paper "GUI Testing of Android Applications:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 25, 2023
    Cite
    Luigi Libero Lucio Starace (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Anna Rita Fasolino
    Porfirio Tramontana
    Luigi Libero Lucio Starace
    Sergio Di Martino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

    Abstract

    Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, and experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task, that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers’ performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, and not in scenarios in which the goal is to generate a re-executable test suite, as well. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty masters students, that we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide deeper insights into code coverage dynamics to project managers interested in using exploratory approaches to test simple Android apps, on which they can make more informed decisions.

    Contents and Instructions

    This package contains:

    apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

    apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

    students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

    compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

    branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

    data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
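
    The aggregate-coverage computation behind those scripts can be illustrated as a set union over subsets of testers (toy data; the packaged scripts are the authoritative versions):

    from itertools import combinations

    # Hypothetical per-tester sets of covered line IDs.
    covered = {
        "t1": {1, 2, 3, 7},
        "t2": {2, 3, 4},
        "t3": {5, 6, 7},
    }
    total_lines = 10  # hypothetical LOC count of the app under test

    for k in range(1, len(covered) + 1):
        # Best k-tester crowd: the subset whose union covers the most lines.
        best = max(combinations(covered, k),
                   key=lambda group: len(set().union(*(covered[t] for t in group))))
        pct = 100 * len(set().union(*(covered[t] for t in best))) / total_lines
        print(f"best {k}-tester crowd: {best} -> {pct:.0f}% coverage")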

  18. Representations of Sound and Music in the Middle Ages: Analysis and...

    • zenodo.org
    json
    Updated Mar 17, 2025
    Cite
    Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho (2025). Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database (Records and Performances) [Dataset]. http://doi.org/10.5281/zenodo.15037823
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.

    It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.

    The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:

    • The evolution and spread of musical instruments across Europe and the Mediterranean.

    • Performer typologies and their representation in medieval art.

    • The relationships between musical practices and social or religious contexts.

    Contents:

    Musiconis Dataset (JSON format, December 2024 version):

    • Musical scenes and their descriptions

    • Performer metadata (roles, social status, gender, interactions)

    • Instrument classifications (without detailed organological descriptions)

    Colab Notebook (Python):

    • Data processing and structuring

    • Visualization of performer distributions and instrument usage

    • Exploratory statistics and mapping

    Tools Used:

    • Python (Pandas, Seaborn, Matplotlib, Plotly)

    • Statistical and exploratory data analysis

    • Visualization of instrument distributions, performer interactions, and musical context

  19. Train file sizes Google Identify Contrails

    • kaggle.com
    Updated May 11, 2023
    Cite
    Sergey Saharovskiy (2023). Train file sizes Google Identify Contrails [Dataset]. https://www.kaggle.com/datasets/sergiosaharovskiy/train-file-sizes-google-identify-contrails/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sergey Saharovskiy
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset comprises metadata for the 225,819 train files of the Google Research - Identify Contrails to Reduce Global Warming challenge.

    The file listing was obtained using a simple bash script:

    shopt -s globstar dotglob nullglob  # recursive globs, include dotfiles, expand to nothing on no match
    
    for pathname in train/**/*; do
      if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then  # regular files only, skip symlinks
        stat -c $'%s\t%n' "$pathname"  # print "<size><TAB><path>"
      fi
    done >train_file_sizes.csv
    
    

    After the bash script, the file was preprocessed with the following Python code:

    import pandas as pd
    
    # Read the size/path pairs produced by the bash script.
    train_sizes = pd.read_csv('data/train_file_sizes.csv', delim_whitespace=True, names=['file_size', 'file_path'])
    # The record id is the second path component (train/<record_id>/...).
    train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
    train_sizes.to_csv('data/train_file_sizes.csv', index=False)
    
  20. BlocPower - Summarize, plot and validate

    • redivis.com
    Updated Oct 22, 2023
    Cite
    Kumar H (2023). BlocPower - Summarize, plot and validate [Dataset]. https://redivis.com/workflows/tajy-74j9c5jyx
    Explore at:
    Dataset updated
    Oct 22, 2023
    Dataset provided by
    Redivis Inc.
    Authors
    Kumar H
    Description

    This project uses Python to load BlocPower's data for 121 million buildings in the US, summarize it to the spatial unit of interest (state, county, or zipcode), and plot key statistics. It also compares and validates the zipcode-level statistics against other independent data sources: Microsoft (for building counts) and Goldstein et al. (2022) (for energy use).
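
    In outline, the summarize-and-plot step is a groupby over the chosen spatial unit; the file and column names below are assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    buildings = pd.read_csv("blocpower_buildings.csv")  # assumed file name
    # Summarize to the zipcode level: building counts and mean energy use intensity.
    by_zip = buildings.groupby("zipcode").agg(
        n_buildings=("building_id", "count"),  # hypothetical ID column
        mean_eui=("site_eui", "mean"),         # hypothetical energy-use column
    )
    by_zip["n_buildings"].hist(bins=50)
    plt.xlabel("buildings per zipcode")
    plt.show()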
