24 datasets found

World's Air Quality and Water Pollution Dataset
kaggle.com
Updated Oct 30, 2023
Cite
VICTOR AHAJI (2023). World's Air Quality and Water Pollution Dataset [Dataset]. https://www.kaggle.com/datasets/victorahaji/worlds-air-quality-and-water-pollution-dataset/data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
VICTOR AHAJI
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Area covered
World
Description
The dataset "World's Air Quality and Water Pollution" was obtained from Jack Jae Hwan Kim's Kaggle page. It comprises 5 columns: "City", "Region", "Country", "Air Quality" and "Water Pollution". The last two columns hold values ranging from 0 to 100. Air Quality ranges from 0 (bad quality) to 100 (top quality); Water Pollution ranges from 0 (no pollution) to 100 (extreme pollution).
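As a rough illustration of the column layout described above, the sketch below builds a tiny in-memory table with the five columns and checks that both score columns stay inside the documented 0 to 100 range. The sample rows and the helper name `out_of_range` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the five column names come from the description.
df = pd.DataFrame(
    {
        "City": ["Alpha", "Beta", "Gamma"],
        "Region": ["R1", "R2", "R3"],
        "Country": ["X", "Y", "Z"],
        "Air Quality": [85.0, 12.5, 101.0],  # 101.0 is deliberately invalid
        "Water Pollution": [10.0, 90.0, 55.0],
    }
)


def out_of_range(frame: pd.DataFrame, column: str, low: float = 0, high: float = 100) -> pd.DataFrame:
    """Return the rows whose value in `column` falls outside [low, high]."""
    return frame[~frame[column].between(low, high)]


bad_air = out_of_range(df, "Air Quality")
bad_water = out_of_range(df, "Water Pollution")
print(len(bad_air), len(bad_water))
```

Running the same check on the real file would only require replacing the sample frame with the loaded CSV.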
Amazon Sales Data
kaggle.com
Updated Jun 24, 2024
Cite
Mithilesh Kale (2024). Amazon Sales Data [Dataset]. https://www.kaggle.com/datasets/mithilesh9/amazon-sales-data-analysis/code
Dataset updated
Jun 24, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mithilesh Kale
Description
https://www.kaggle.com/code/mithilesh9/amazon-sales-data-analysis-using-python

Dataset Description: This dataset contains 100 rows of Amazon sales data, including the region, country, item type, sales channel, order priority, order date, order ID, ship date, units sold, unit price, unit cost, total revenue, total cost, and total profit.

IPhone Customer Survey | NLP
opendatabay.com
Updated Jun 20, 2025
Cite
Datasimple (2025). IPhone Customer Survey | NLP [Dataset]. https://www.opendatabay.com/data/ai-ml/8496ac33-2bc1-4401-868d-3cc6c5369f16
Dataset updated
Jun 20, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Data Science and Analytics
Description
Context This dataset offers a treasure trove for conducting sentiment analysis, feature analysis, and topic modeling on customer reviews. It includes vital information like product ASIN, country, and date, which help gauge customer trust and engagement. Each review features a rating score, along with a compelling review title and detailed description, providing a window into customer emotions and preferences. Additionally, the review URL, reviewed language/region, and variant ASIN enrich the analysis, allowing for a deeper understanding of how different product versions resonate with consumers in various markets. This comprehensive approach not only highlights customer sentiments but also reveals key insights that can drive product development and marketing strategies.

Dataset Glossary (Column-wise)

productAsin: Unique identifier for the product.
country: Location where the review was submitted.
date: Date of the review.
isVerified: Indicates if the reviewer is a verified purchaser.
ratingScore: Numerical score given by the reviewer (typically 1-5).
reviewTitle: Brief summary of the review.
reviewDescription: Detailed feedback from the reviewer.
reviewUrl: Link to the full review online.
reviewedIn: Language or region in which the review was written.
variant: Specific version of the product reviewed.
variantAsin: Unique identifier for the product variant.

License

CC0

Original Data Source: IPhone Customer Survey | NLP
Cyclistic Bike - Data Analysis (Python)
kaggle.com
Updated Sep 25, 2024
Cite
Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/suggestions
Dataset updated
Sep 25, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Amirthavarshini
Description
Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
Apple IPhone Customer Reviews
opendatabay.com
Updated Jun 10, 2025
Cite
Datasimple (2025). Apple IPhone Customer Reviews [Dataset]. https://www.opendatabay.com/data/consumer/42533232-0299-4752-8408-4579f2251a34
Dataset updated
Jun 10, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Reviews & Ratings
Description
Based on the dataset of iPhone reviews from Amazon, here are some possible project areas:

-> Sentiment analysis: Determine overall sentiment and identify trends.

-> Feature analysis: Analyze user satisfaction with specific features.

-> Topic modeling: Discover underlying themes and discussion points.

Original Data Source: Apple IPhone Customer Reviews
Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
Cite
Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fgene.2021.691274.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Yi-Hui Zhou; Ehsan Saghapour
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and the methods are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but should be considered as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is built on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
YouTube Trending Videos of the Day
opendatabay.com
Updated Jun 20, 2025
Cite
Datasimple (2025). YouTube Trending Videos of the Day [Dataset]. https://www.opendatabay.com/data/ai-ml/34cfa60b-afac-4753-9409-bc00f9e8fbec
Dataset updated
Jun 20, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
YouTube, Data Science and Analytics
Description
The dataset includes YouTube trending videos statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and it's related to 19 countries:

IT - Italy
ES - Spain
GR - Greece
HR - Croatia
TR - Turkey
AL - Albania
DZ - Algeria
EG - Egypt
LY - Libya
TN - Tunisia
MA - Morocco
IL - Israel
ME - Montenegro
LB - Lebanon
FR - France
BA - Bosnia and Herzegovina
MT - Malta
SI - Slovenia
CY - Cyprus
SY - Syria

The columns are the following:

country: country in which the video was published.
video_id: video identification number. Each video has one. You can find it by right-clicking on a video and selecting 'stats for nerds'.
title: title of the video.
publishedAt: publication date of the video.
channelId: identification number of the channel that published the video.
channelTitle: name of the channel that published the video.
categoryId: identification number of the video's category. Each number corresponds to a certain category. For example, 10 corresponds to the 'music' category. Check here for the complete list.
trending_date: trending date of the video.
tags: tags present in the video.
view_count: view count of the video.
comment_count: number of comments on the video.
thumbnail_link: link to the image that appears before clicking the video.
comments_disabled: tells whether comments are disabled for the video.
ratings_disabled: tells whether ratings are disabled for the video.
description: description below the video.

Inspiration

You can perform an exploratory data analysis of the dataset using Pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries with SQL or the Pandas functions. It is also possible to analyze the titles, tags and descriptions of the videos to search for relevant information. Remember to upvote if you found the dataset useful :).

License

CC0

Original Data Source: YouTube Trending Videos of the Day
Preventive Maintenance for Marine Engines
kaggle.com
Updated Feb 13, 2025
Cite
Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
Dataset updated
Feb 13, 2025
Dataset provided by
Kaggle
Authors
Fijabi J. Adekunle
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Preventive Maintenance for Marine Engines: Data-Driven Insights

Introduction:

Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

Overview

This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

Key steps include:
1. Data Simulation: Creating a realistic dataset with engine performance metrics.
2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.

Tools Used
1. Python: Data processing, analysis and modeling
2. Pandas & NumPy: Data manipulation
3. Scikit-Learn & XGBoost: Machine learning model training
4. Matplotlib & Seaborn: Data visualization

Skills Demonstrated
✔ Data Simulation & Preprocessing
✔ Exploratory Data Analysis (EDA)
✔ Feature Engineering & Encoding
✔ Supervised Machine Learning (Classification)
✔ Model Evaluation & Hyperparameter Tuning

Key Insights & Findings
📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training.
📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

Challenges Faced
🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction.
🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

Call to Action
🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
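The model comparison and tuning steps listed in this description could look roughly like the sketch below. It is not the author's notebook: the data is synthetic three-class data from scikit-learn's `make_classification`, the hyperparameter grids are arbitrary, and XGBoost is left out so the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the simulated engine dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models with small, illustrative hyperparameter grids.
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [5, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 3-fold cross-validated grid search on the training split only.
    search = GridSearchCV(model, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "test_accuracy": search.score(X_test, y_test),
    }

for name, info in results.items():
    print(name, info)
```

Keeping the test split out of the grid search means the reported accuracy is not inflated by the tuning itself.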
Replication Package for 'Data-Driven Analysis and Optimization of Machine...
zenodo.org
zip
Updated Jun 11, 2025
Cite
Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15643706
Dataset updated
Jun 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Joel Castaño; Joel Castaño
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.

The framework considers the trade-offs between three key objectives:

1. Performance (maximizing throughput)

2. Energy Efficiency (minimizing estimated energy per unit)

3. Cost (minimizing estimated hardware cost)
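To make the trade-off between the three objectives concrete, here is a minimal sketch of Pareto filtering over hardware configurations. It is only an illustration of the general idea, not the thesis code: the `Config` type, the field names, and all numbers are invented for the example.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    throughput: float  # performance, higher is better
    energy: float      # estimated energy per unit, lower is better
    cost: float        # estimated hardware cost, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    no_worse = a.throughput >= b.throughput and a.energy <= b.energy and a.cost <= b.cost
    better = a.throughput > b.throughput or a.energy < b.energy or a.cost < b.cost
    return no_worse and better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


configs = [
    Config("A", throughput=100.0, energy=5.0, cost=10.0),
    Config("B", throughput=80.0, energy=6.0, cost=12.0),  # dominated by A
    Config("C", throughput=120.0, energy=7.0, cost=15.0),
    Config("D", throughput=60.0, energy=2.0, cost=4.0),
]
front = pareto_front(configs)
print([c.name for c in front])
```

Configuration B drops out because A is better on all three objectives, while A, C and D each win on at least one objective and therefore stay on the front.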

Repository Structure

This repository is organized as follows:

Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.

Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.

Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.

Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.

Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.

eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.

requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.

eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.

optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.

pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.

shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

Requirements and Installation

To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.

1. Clone the repository:

bash

git clone

cd

2. Create and activate a virtual environment (optional but recommended):

bash

python -m venv venv

source venv/bin/activate # On Windows, use `venv\Scripts\activate`

3. Install the required packages:

All dependencies are listed in the `requirements.txt` file. Install them using pip:

bash

pip install -r requirements.txt

Step-by-Step Reproduction Workflow

The notebooks are designed to be run in a logical sequence.

Step 1: Data Enrichment (Optional)

The final enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the `Dataset_Extension.ipynb` notebook. It will take `Inference_data.csv` as input and generate the extended version.

Step 2: Exploratory Data Analysis (Optional)

All plots from the EDA are pre-generated and available in the `eda_plots/` directory. To regenerate them, run the `Data_Analysis.ipynb` notebook. This will overwrite the existing plots and the `eda_log.txt` file.

Step 3: Main Model Training, Validation, and Recommendation

This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:

It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.

It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.

It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.

It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
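The package does not spell out how the group-aware split is built, so the following is only a sketch of the general technique using scikit-learn's `GroupKFold`. The grouping key (ten hypothetical systems with four rows each) and the random data are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_samples = 40
X = rng.normal(size=(n_samples, 3))
y = rng.normal(size=n_samples)
# Hypothetical grouping key: 10 systems with 4 benchmark rows each.
groups = np.repeat(np.arange(10), 4)

cv = GroupKFold(n_splits=5)
fold_overlaps = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):
    # A group-aware split never puts rows of one group on both sides.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    fold_overlaps.append(len(shared))
    print(f"fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows, {len(shared)} shared groups")
```

Splitting by group rather than by row keeps results from the same system out of both the training and validation sides of a fold, which would otherwise make the validation score look better than it is.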
bookstore dataset
kaggle.com
Updated Aug 16, 2022
Cite
Sbonelo Ndhlazi (2022). bookstore dataset [Dataset]. https://www.kaggle.com/datasets/sbonelondhlazi/bookstore-dataset/code
Dataset updated
Aug 16, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sbonelo Ndhlazi
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This data set was scraped using Python from http://books.toscrape.com/, which is a fictional book store. It contains 1000 books, with different categories, star ratings and prices. This data set can be used by anyone who wants to practice data cleaning and simple data manipulations.

The code I used to scrape this data can be found on my GitHub: https://github.com/Sbonelondhlazi/dummybooks
‘COVID-19 dataset in Japan’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/latest
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Japan
Description
Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

--- Dataset description provided by original source is as follows ---

1. Context

This is a COVID-19 dataset in Japan. This does not include the cases in Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) and Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture).
- Total number of cases in Japan
- The number of vaccinated people (New/experimental)
- The number of cases at prefecture level
- Metadata of each prefecture

Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

This dataset can be retrieved with CovsirPhy (Python library).

pip install covsirphy --upgrade

import covsirphy as cs
data_loader = cs.DataLoader()
japan_data = data_loader.japan()
# The number of cases (Total/each province)
clean_df = japan_data.cleaned()
# Metadata
meta_df = japan_data.meta()

Please refer to CovsirPhy Documentation: Japan-specific dataset.

Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

1.1 Total number of cases in Japan

covid_jpn_total.csv

Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
- discharged
- fatal

The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
- requiring hospitalization, but waiting in hotels or at home (to 08May2020)

In primary source, some variables were removed on 09May2020. Values are NA in this dataset from 09May2020.

Manually collected the data from Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)

The number of vaccinated people:
- Vaccinated_1st: the number of persons receiving the first dose on the date
- Vaccinated_2nd: the number of persons receiving the second dose on the date
- Vaccinated_3rd: the number of persons receiving the third dose on the date

Data sources for vaccination:
- To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (in Japanese)
- 首相官邸新型コロナワクチンについて
- From 10Apr2021: Twitter: 首相官邸（新型コロナワクチン情報）

1.2 The number of cases at prefecture level

covid_jpn_prefecture.csv

Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- discharged
- fatal

The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with severe symptoms (from 09May2020)

Using pdf-excel converter, manually collected the data from Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)

Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.
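The mismatch mentioned in this note can be inspected with a per-date comparison like the sketch below. The two small frames and the column name `Positive` are stand-ins invented for the example; with the real files, the frames would be loaded from covid_jpn_prefecture.csv and covid_jpn_total.csv instead.

```python
import pandas as pd

# Made-up numbers; "Positive" is used here only as an example column name.
prefecture = pd.DataFrame(
    {
        "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
        "Prefecture": ["Tokyo", "Osaka", "Tokyo", "Osaka"],
        "Positive": [100, 50, 120, 60],
    }
)
total = pd.DataFrame({"Date": ["2021-01-01", "2021-01-02"], "Positive": [155, 185]})

# Sum the prefecture rows per date, then line them up with the national totals.
summed = prefecture.groupby("Date")["Positive"].sum().rename("Positive_prefecture_sum")
comparison = total.set_index("Date").join(summed)
comparison["Difference"] = comparison["Positive"] - comparison["Positive_prefecture_sum"]
print(comparison)
```

A non-zero `Difference` column is the expected outcome here, which is why the national file is the one to use for country-level analysis.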

1.3 Metadata of each prefecture

covid_jpn_metadata.csv - Population (Total, Male, Female): 厚生労働省厚生統計要覧（2017年度）第１－５表 - Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)

Hospital_bed: With the primary data of 厚生労働省感染症指定医療機関の指定状況（平成31年4月1日現在）, 厚生労働省第二種感染症指定医療機関の指定状況（平成31年4月1日現在）, 厚生労働省医療施設動態調査（令和２年１月末概数）, 厚生労働省感染症指定医療機関について and secondary data of COVID-19 Japan 都道府県別感染症病床数,

Specific: Hospital beds of medical institutions designated for specific infectious diseases

Type-I: Hospital beds of medical institutions designated for type I infectious diseases

Type-II: Hospital beds of medical institutions designated for type II infectious diseases

Tuberculosis: Hospital beds of medical institutions designated for tuberculosis (outpatient care)

Care: long term care bed of hospitals

Total: Beds of all hospitals

Clinic_bed: With the primary data of 医療施設動態調査（令和２年１月末概数） ,

Care: long term care beds of clinics

Total: Beds of all clinics

Location: Data is from LinkData 都道府県庁所在地 (Public Domain) (secondary data).

Latitude

Longitude

Admin

Capital: Prefectural capital city. Data is from LinkData 都道府県庁所在地 (Public Domain) (secondary data).

Region: Region name. Data is from Wikipedia (secondary data). "Kyushu-Okinawa region" was separated into "Kyushu" and "Okinawa" by this dataset's author.

Num: Prefecture code (JIS X 0401: Hokkaido=1,...Okinawa=47). Data is from 国土交通省 GIS HP Pref code. cf. (not source) Japan Visitor: Japan Prefectures Map.

2. Acknowledgements

To create this dataset, edited and transformed data of the following sites was used.

厚生労働省 Ministry of Health, Labour and Welfare, Japan:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English) 厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan: 国土交通省 HP (in Japanese) 国土交通省 HP (in English) 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

Code for Japan / COVID-19 Japan: Code for Japan COVID-19 Japan Dashboard (CC BY 4.0) COVID-19 Japan 都道府県別感染症病床数 (CC BY)

Wikipedia: Wikipedia

LinkData: LinkData (Public Domain)

Inspiration

Changes in number of cases over time

Percentage of patients without symptoms / mild or severe symptoms

What to do next to prevent outbreak

License and how to cite

Kindly cite this dataset under CC BY-4.0 license as follows.
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

--- Original source retains full ownership of the source dataset ---
Noon & Amazon
kaggle.com
Updated Apr 19, 2025
Cite
Mohammed Elghannam (2025). Noon & Amazon [Dataset]. https://www.kaggle.com/datasets/mohamedelghannam15/noon-and-amazon
Dataset updated
Apr 19, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mohammed Elghannam
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
🛍️ Amazon vs Noon: Electronics Price & Discount Comparison

This dataset contains scraped product information from two major e-commerce platforms, Amazon and Noon, focusing on electronics. The goal is to compare pricing strategies and discounts offered by each platform.

📌 Dataset Summary
Sources: Amazon & Noon (scraped using custom Python scripts)
Categories: Electronics (Laptops, Accessories, etc.)
Data Fields: Product Title, Brand, Price, Original Price, Discount, Rating, and more
Processing: The data needs to be cleaned.
Electrical-engineering
huggingface.co
Updated Jan 16, 2024
Cite
mod (2024). Electrical-engineering [Dataset]. https://huggingface.co/datasets/STEM-AI-mtl/Electrical-engineering
Dataset updated
Jan 16, 2024
Authors
mod
License
https://choosealicense.com/licenses/other/
Description
To the electrical engineering community

This dataset contains Q&A prompts about electrical engineering, KiCad's EDA software features, and Python code for the scripting console.

Authors

STEM.AI: stem.ai.mtl@gmail.com
William Harbec
All Lending Club loan data
kaggle.com
Updated Apr 10, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nathan George (2019). All Lending Club loan data [Dataset]. https://www.kaggle.com/wordsforthewise/lending-club/code
Dataset updated
Apr 10, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nathan George
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Update: I probably won't be able to update the data anymore, as LendingClub now has a scary 'TOS' popup when downloading the data. Worst case, they will ask me/Kaggle to take it down from here.

This dataset contains the full LendingClub data available from their site. There are separate files for accepted and rejected loans. The accepted loans also include the FICO scores, which can only be downloaded when you are signed in to LendingClub and download the data.

See the Python and R getting started kernels to get started:

R: https://www.kaggle.com/wordsforthewise/eda-in-r-arggghh

Python: https://www.kaggle.com/wordsforthewise/eda-with-python

I created a git repo for the code which is used to create this data: https://github.com/nateGeorge/preprocess_lending_club_data

Background

I wanted an easy way to share all the lending club data with others. Unfortunately, the data on their site is fragmented into many smaller files. There is another lending club dataset on Kaggle, but it wasn't updated in years. It seems like the "Kaggle Team" is updating it now. I think it also doesn't include the full rejected loans, which are included here. It seems like the other dataset confusingly has some of the rejected loans mixed into the accepted ones. Now there are a ton of other LendingClub datasets on here too, most of which seem to have no documentation or explanation of what the data actually is.

Content

The definitions for the fields are on the LendingClub site, at the bottom of the page. Kaggle won't let me upload the .xlsx file for some reason since it seems to be in multiple other data repos. This file seems to be in the other main repo, but again, it's better to get it directly from the source.

Unfortunately, there is (maybe "was" now?) a limit of 500MB for dataset files, so I had to compress the files with gzip in the Python pandas package.

I cleaned the data a tiny bit: I removed percent symbols (%) from int_rate and revol_util columns in the accepted loans and converted those columns to floats.
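A cleaning and compression step like the one described here might look roughly as follows. This is a sketch, not the author's preprocessing code: the sample values and the file name `accepted_sample.csv.gz` are invented, and only the column names `int_rate` and `revol_util` come from the description.

```python
import os
import tempfile

import pandas as pd

# Made-up rows; only the two column names are taken from the description.
raw = pd.DataFrame(
    {"int_rate": ["13.56%", " 7.02%"], "revol_util": ["45.1%", None]}
)

cleaned = raw.copy()
for col in ["int_rate", "revol_util"]:
    # Strip whitespace and the trailing percent sign, then convert to numbers.
    cleaned[col] = pd.to_numeric(cleaned[col].str.strip().str.rstrip("%"))

# Write and re-read a gzip-compressed CSV through pandas.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "accepted_sample.csv.gz")
    cleaned.to_csv(path, index=False, compression="gzip")
    restored = pd.read_csv(path, compression="gzip")

print(restored.dtypes)
```

Missing values pass through the conversion as NaN, so the columns end up as floats even when some rows have no rate recorded.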

Update

The URL column is in the dataset for completeness, as of 2018 Q2.
ML-Based RUL Prediction for NPP Transformers
kaggle.com
Updated Apr 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers
Dataset updated
Apr 10, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dmitry_Menyailov
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description

Notebooks

1. Exploratory_Data_Analysis

https://www.kaggle.com/code/idmitri/exploratory-data-analysis

2. RUL_Prediction_Modeling

https://www.kaggle.com/code/idmitri/rul-prediction-modeling

About the project

Power transformers at nuclear power plants (NPPs) may be operated beyond their design service life (25 years), which calls for enhanced condition monitoring to keep operation reliable and safe.

Transformer condition is assessed with chromatographic dissolved gas analysis, which detects defects from the gas concentrations in the oil and makes it possible to predict the transformer's remaining useful life (RUL). Traditional monitoring systems are limited to fixed concentration thresholds, which reduces diagnostic accuracy and the degree of automation. Machine learning methods can uncover hidden dependencies and improve prediction accuracy. More details: https://habr.com/ru/articles/743682/

Results

This project carries out an in-depth exploratory data analysis (EDA) and creates 12 feature groups:
- gases (gas concentrations)
- trend (trend components)
- seasonal (seasonal components)
- resid (residual components)
- quantiles (distribution quantiles)
- volatility (volatility of the concentrations)
- range (range of values)
- coefficient of variation
- standard deviation
- skewness (asymmetry of the distribution)
- kurtosis (peakedness of the distribution)
- category (categorical fault features)
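As a rough illustration of the distribution-based groups listed above (range, standard deviation, coefficient of variation, skewness, kurtosis), the sketch below computes them over trailing windows of a gas-concentration series. The window length, the function names, and the sample values are assumptions made for this example; the project's notebooks may compute these features differently.

```python
import statistics


def window_features(values):
    """Return simple distribution features for one window of gas readings."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Standardized central moments; defined as 0.0 for a constant window.
    if std == 0:
        skewness = kurtosis = 0.0
    else:
        skewness = statistics.fmean(((v - mean) / std) ** 3 for v in values)
        kurtosis = statistics.fmean(((v - mean) / std) ** 4 for v in values) - 3.0
    return {
        "range": max(values) - min(values),
        "std": std,
        "cv": std / mean if mean else float("nan"),
        "skewness": skewness,
        "kurtosis": kurtosis,  # excess kurtosis (normal distribution = 0)
    }


def rolling_features(series, window):
    """Compute window_features over each trailing window of the series."""
    return [
        window_features(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


# Made-up hydrogen concentrations (ppm), for illustration only.
h2_ppm = [10.0, 12.0, 11.0, 15.0, 30.0, 28.0]
features = rolling_features(h2_ppm, window=3)
```

Each entry of `features` corresponds to one window end point, so a series of length 6 with a window of 3 yields 4 feature dictionaries.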

Using statistical and decomposition features made it possible to match the accuracy of the RUL distribution silhouette with automatic outlier handling, something that previously required manual adjustment.

Machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used for modeling. The best accuracy was achieved by the LightGBM model with hyperparameters optimized using Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.
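For readers unfamiliar with the three reported metrics, the sketch below shows how MAE, RMSE, and R2 are conventionally computed from true and predicted values. The function name and the sample RUL values are made up for illustration and are not taken from the project's notebooks.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up RUL values, for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
```

MAE and RMSE are expressed in the units of the target, while R2 is unitless and equals 1.0 for a perfect fit.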

Comment

The exploratory data analysis (EDA) code was developed and tested locally in a VSC Jupyter Notebook with a Python 3.10.16 environment. On Kaggle, most plots render correctly, but some complex visualizations (for example, multidimensional plots with a color scale) are not adapted because of limitations of the environment. Despite attempts to optimize the code without substantial changes, full compatibility could not be achieved. The main problems were library version conflicts and a significant drop in performance: the computation took roughly 10 times longer than on a local MacBook M3 Pro. On Kaggle, either the PyCaret operations ran correctly or the machine learning models worked, but not both at the same time.

A hybrid workflow is proposed:
- Publishing and displaying the metrics on Kaggle to visualize the results.
- Local computation and model training in a preconfigured Python 3.10.16 environment.

To reproduce the experiments, a Codes folder is provided with the VSC EDA and RUL code and a libraries_for_modeling file that lists the versions of all libraries used.

I am happy to answer any questions about setting up and running the code in the comments, and I would appreciate advice on how to prevent problems like these.
Volunteer NGO Match Dataset
kaggle.com
Updated May 11, 2023
Cite
Yash Yadav (2023). Volunteer NGO Match Dataset [Dataset]. https://www.kaggle.com/datasets/yash92328/volunteer-ngo-match-dataset
Dataset updated
May 11, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Yash Yadav
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset provides information about volunteers and their preferences for the type of organization they would like to volunteer for. The dataset is ideal for building a volunteer matching platform or conducting data analysis related to volunteerism and social causes. It contains various attributes of volunteers, including their names, ages, genders, skills, availability, locations, and the types of organizations they are interested in.

The dataset includes 50 rows, with each row representing a volunteer profile. Volunteers have provided information about their skills and availability for volunteering, allowing organizations to match them with suitable opportunities. The dataset also categorizes the preferred types of organizations into three categories: pet and animal service, healthcare, and youth development.

This dataset can be utilized for a variety of purposes, including:

Volunteer Matching: Use this dataset to develop a volunteer matching platform that connects volunteers with organizations based on their skills, availability, and interests.

Data Analysis: Explore the dataset to gain insights into the preferences, skills, and availability of volunteers in different locations. Analyze trends in volunteerism and identify patterns that can inform strategies for engaging volunteers effectively.

Python Projects: Utilize this dataset for practicing data analysis skills using Python libraries such as pandas, NumPy, or scikit-learn. Perform exploratory data analysis, create visualizations, and build predictive models related to volunteerism and social causes.

Web Development: Incorporate this dataset into web development projects to create interactive volunteer matching platforms or visualizations related to volunteer engagement and social causes.

Whether you are a data scientist, a web developer, or someone interested in volunteerism and social causes, this dataset provides a valuable resource for analysis and application development. Start exploring and contributing to the field of volunteer matching and social impact!

Note: The dataset is simulated and does not contain real personal information. It has been generated for educational and illustrative purposes.
Singapore Street Co-ordinates
kaggle.com
Updated Mar 30, 2025
Cite
Nithesh Karthik (2025). Singapore Street Co-ordinates [Dataset]. https://www.kaggle.com/datasets/nitheshkarthik/singapore-street-co-ordinates/code
Dataset updated
Mar 30, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nithesh Karthik
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Singapore
Description
Context

This dataset was created for a project in my coursework, Python for Business Analytics (DAO2702). We are analyzing historical resale data of HDBs in Singapore, and as part of this analysis I created this dataset containing the coordinates of the streets listed in the resale data.

Content

Most of the streets in the list were geocoded using Python packages; the rest were collected manually by searching for the street on Google Maps and copying its latitude and longitude. Currently, 589 streets are geocoded. The list of street names may grow in the future as new streets are created, and some street names may have changed over time. I cannot guarantee 100% accuracy, as there may be slight errors. Please keep these limitations in mind when using the dataset.
Train file sizes Google Identify Contrails
kaggle.com
Updated May 11, 2023
Cite
Sergey Saharovskiy (2023). Train file sizes Google Identify Contrails [Dataset]. https://www.kaggle.com/datasets/sergiosaharovskiy/train-file-sizes-google-identify-contrails/code
Dataset updated
May 11, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sergey Saharovskiy
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset comprises metadata for the 225,819 train files of the Google Research - Identify Contrails to Reduce Global Warming challenge.

The file sizes were collected with a simple bash script:

shopt -s globstar dotglob nullglob
for pathname in train/**/*; do
    if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then
        stat -c $'%s\t%n' "$pathname"
    fi
done >train_file_sizes.csv

After the bash script ran, the file was preprocessed with the following Python code:

import pandas as pd

# stat wrote "<size>\t<path>" on each line, so read the file as tab-separated.
train_sizes = pd.read_csv('data/train_file_sizes.csv', sep='\t', names=['file_size', 'file_path'])
# The second path component (train/<record_id>/...) is the record ID.
train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
train_sizes.to_csv('data/train_file_sizes.csv', index=False)
App Store Mobile Games 2008 - 2019
kaggle.com
Updated Sep 11, 2024
Cite
Mayank Singh (2024). App Store Mobile Games 2008 - 2019 [Dataset]. https://www.kaggle.com/datasets/mayanksinghr/app-store-mobile-games-2008-2019
Dataset updated
Sep 11, 2024
Dataset provided by
Kaggle
Authors
Mayank Singh
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset contains 1 excel workbook (.xlsx) with 2 sheets.

Sheet 1 - App Store Games contains the mobile games launched on App Store from 2008 - 2019.

Sheet 2 - Data Dictionary is just the explanation of columns in data.

This data can be used to practice EDA and data cleaning tasks, and for data visualization with the Python libraries Matplotlib and Seaborn.

I also used this dataset for a Power BI project and built a dashboard on it. I used Python inside Power Query to clean and convert some encoded and Unicode characters in the App URL, Name, and Description columns.
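The description does not say which encodings were involved, so the following is only a guess at what such a cleanup might look like. It assumes the text contains percent-encoded bytes (as in URLs) and literal `\uXXXX` escape sequences; the function name `clean_text` and the sample strings are hypothetical.

```python
import re
import unicodedata
from urllib.parse import unquote

# Matches a literal backslash-u escape followed by four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def clean_text(value):
    """Decode percent-encoding and literal \\uXXXX escapes, then normalize."""
    text = unquote(value)
    # Replace each escape with the character it names (BMP code points only).
    text = _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)
    return unicodedata.normalize("NFC", text)


# Made-up inputs, for illustration only.
examples = [clean_text("Caf%C3%A9"), clean_text("Don\\u2019t Starve")]
```

In Power Query the same function could be applied column by column to App URL, Name, and Description, although how the author wired this up is not stated.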

Total Columns: 16

App URL

App ID

Name

Subtitle

Icon URL

Average User Rating

User Rating Count

Price per App (USD)

Description

Developer

Age Rating

Languages

Size in Bytes

Primary Genre

Genres

Release Date
A Study of Data Science with Tips Dataset in seabo
kaggle.com
Updated Jul 21, 2024
Cite
Spotlightkid (2024). A Study of Data Science with Tips Dataset in seabo [Dataset]. https://www.kaggle.com/datasets/spotlightkid/datasets-from-python-seaborn-library
Dataset updated
Jul 21, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Spotlightkid
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
SAMPLE CASE STUDY:

1. Predicting Tip Amount

Objective: Build a model to predict the tip amount based on features like total bill, sex, smoker status, day, time, and size.

Approach: Use regression algorithms (e.g., linear regression, decision trees, or gradient boosting) to predict the tip amount. Evaluate performance with metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).
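As a minimal sketch of the simplest option named above, the code below fits a one-feature linear regression of tip on total bill by ordinary least squares and reports the MAE. The numbers are made up for illustration and are not rows from the tips dataset; `fit_line` is a hypothetical helper, and a real analysis would more likely use a library such as scikit-learn with all the listed features.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up bills and tips (not rows from the seaborn tips dataset).
total_bill = [10.0, 20.0, 30.0, 40.0]
tip = [1.5, 3.0, 4.5, 6.5]

intercept, slope = fit_line(total_bill, tip)
predicted = [intercept + slope * bill for bill in total_bill]

# Mean Absolute Error of the fitted line on the same sample.
mae = sum(abs(t - p) for t, p in zip(tip, predicted)) / len(tip)
```

Evaluating on the training sample, as done here for brevity, overstates accuracy; a held-out split would give a fairer estimate.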

2. Classifying Smokers

Objective: Predict whether a customer is a smoker based on other attributes.

Approach: Use classification algorithms (e.g., logistic regression, random forests, or support vector machines) to classify the smoker status. Evaluate with metrics like accuracy, precision, recall, and F1-score.
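The evaluation metrics named above can be computed directly from true and predicted labels. The sketch below assumes the smoker label takes the values "Yes" and "No" and treats "Yes" as the positive class; the labels and predictions are invented for the example.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Return accuracy, precision, recall and F1 for a binary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
smoker_true = ["Yes", "No", "Yes", "No", "Yes"]
smoker_pred = ["Yes", "Yes", "No", "No", "Yes"]
scores = classification_metrics(smoker_true, smoker_pred)
```

Reporting precision and recall alongside accuracy matters when one class is much rarer than the other, because accuracy alone can look good for a model that never predicts the rare class.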

3. Clustering Customers

Objective: Identify different customer segments based on their spending behavior and attributes.

Approach: Apply clustering algorithms (e.g., k-means, hierarchical clustering) to group customers into clusters with similar characteristics. Analyze the clusters to derive insights about different types of customers.

4. Analyzing the Effect of Time on Tips

Objective: Study how the time of day (Lunch vs. Dinner) affects the amount of tip given.

Approach: Perform exploratory data analysis (EDA) and statistical tests to determine if there is a significant difference in tips between different times of day. Visualize the results with plots.
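One statistical test that fits this comparison is Welch's t-test, which compares two group means without assuming equal variances. The sketch below computes only the t statistic and the approximate degrees of freedom with the standard library; obtaining a p-value would require a t-distribution function from a library such as SciPy. The choice of test and the sample values are assumptions, not something the case study specifies.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Squared standard errors of the two sample means.
    se_a, se_b = var_a / len(sample_a), var_b / len(sample_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Made-up tips for each time of day, for illustration only.
lunch_tips = [2.0, 3.0, 4.0]
dinner_tips = [3.0, 5.0, 7.0]
t_stat, dof = welch_t(lunch_tips, dinner_tips)
```

A negative statistic here means the first group (Lunch) has the lower mean tip in the sample.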

5. Estimating Tip Percentage

Objective: Estimate the tip percentage relative to the total bill amount.

Approach: Create a new feature for tip percentage and use regression models to predict this percentage based on other features. This can also involve feature engineering and creating visualizations to understand the relationship between the tip percentage and other factors.
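The new feature described above is a simple ratio. The sketch below adds it to a list of row dictionaries; `total_bill` and `tip` are the column names used in the seaborn tips dataset, while `tip_pct`, the helper name, and the sample rows are assumptions for the example.

```python
def add_tip_percentage(rows):
    """Return copies of the rows with tip_pct (tip as a % of total_bill)."""
    return [
        {**row, "tip_pct": 100.0 * row["tip"] / row["total_bill"]}
        for row in rows
    ]


# Made-up rows, for illustration only.
sample_rows = [
    {"total_bill": 20.0, "tip": 3.0},
    {"total_bill": 50.0, "tip": 5.0},
]
with_pct = add_tip_percentage(sample_rows)
```

The resulting `tip_pct` column can then serve as the regression target in place of the raw tip amount.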
