18 datasets found
  1. EDA Using Python

    • kaggle.com
    Updated Dec 6, 2024
    Cite
    Rashmi Tiwari (2024). EDA Using Python [Dataset]. https://www.kaggle.com/datasets/rosetiwari/eda-using-python/suggestions
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rashmi Tiwari
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Rashmi Tiwari

    Released under CC0: Public Domain

  2. Udemy Dataset - EDA using Python

    • kaggle.com
    Updated Nov 20, 2022
    Cite
    Bhagya sree (2022). Udemy Dataset - EDA using Python [Dataset]. https://www.kaggle.com/bhagya20/udemy-dataset-eda-using-python/discussion
    Dataset updated
    Nov 20, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhagya sree
    Description

    Dataset

    This dataset was created by Bhagya sree

  3. Chicago_Crimes_2005_to_2007

    • kaggle.com
    Updated Jul 17, 2020
    Cite
    Shivam Chaurasia (2020). Chicago_Crimes_2005_to_2007 [Dataset]. https://www.kaggle.com/shivamchaurasia/chicago-crimes-2005-to-2007/discussion
    Dataset updated
    Jul 17, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivam Chaurasia
    Area covered
    Chicago
    Description

    Dataset

    This dataset was created by Shivam Chaurasia

  4. ‘COVID-19 dataset in Japan’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/latest
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    1. Context

    This is a COVID-19 dataset for Japan. It does not include cases on the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) or the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture). It covers:
    - Total number of cases in Japan
    - The number of vaccinated people (New/experimental)
    - The number of cases at prefecture level
    - Metadata of each prefecture

    Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

    This dataset can be retrieved with CovsirPhy (Python library).

    pip install covsirphy --upgrade
    
    import covsirphy as cs
    data_loader = cs.DataLoader()
    japan_data = data_loader.japan()
    # The number of cases (Total/each province)
    clean_df = japan_data.cleaned()
    # Metadata
    meta_df = japan_data.meta()
    

    Please refer to CovsirPhy Documentation: Japan-specific dataset.

    Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

    1.1 Total number of cases in Japan

    covid_jpn_total.csv
    Cumulative number of cases:
    - PCR-tested / PCR-tested and positive
    - with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
    - discharged
    - fatal

    The number of cases:
    - requiring hospitalization (from 09May2020)
    - hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
    - requiring hospitalization, but waiting in hotels or at home (to 08May2020)

    In the primary source, some variables were removed on 09May2020; their values are NA in this dataset from that date.

    The data was collected manually from the Ministry of Health, Labour and Welfare website:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    The number of vaccinated people:
    - Vaccinated_1st: the number of people vaccinated with the first dose on the date
    - Vaccinated_2nd: the number of people vaccinated with the second dose on the date
    - Vaccinated_3rd: the number of people vaccinated with the third dose on the date

    Data sources for vaccination:
    - To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (in Japanese), 首相官邸 新型コロナワクチンについて
    - From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報)

    1.2 The number of cases at prefecture level

    covid_jpn_prefecture.csv
    Cumulative number of cases:
    - PCR-tested / PCR-tested and positive
    - discharged
    - fatal

    The number of cases:
    - requiring hospitalization (from 09May2020)
    - hospitalized with severe symptoms (from 09May2020)

    The data was collected manually from the Ministry of Health, Labour and Welfare website using a PDF-to-Excel converter:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.
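
    A minimal pandas sketch of that check (not part of the original dataset description; the column name "Positive" is an assumption about the CSV layout):

    import pandas as pd

    total_df = pd.read_csv("covid_jpn_total.csv", parse_dates=["Date"])
    pref_df = pd.read_csv("covid_jpn_prefecture.csv", parse_dates=["Date"])

    # Daily national figures vs. the sum of the prefecture-level records
    national = total_df.groupby("Date").sum(numeric_only=True)
    summed = pref_df.groupby("Date").sum(numeric_only=True)

    # Expect non-zero gaps here, which is why covid_jpn_total should be used
    # for nationwide analysis.
    gap = national["Positive"] - summed["Positive"]
    print(gap.describe())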

    1.3 Metadata of each prefecture

    covid_jpn_metadata.csv
    - Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表
    - Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)

    2. Acknowledgements

    To create this dataset, edited and transformed data from the following sources was used.

    厚生労働省 Ministry of Health, Labour and Welfare, Japan:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)
    厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan:
    国土交通省 HP (in Japanese)
    国土交通省 HP (in English)
    国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    Code for Japan / COVID-19 Japan:
    Code for Japan COVID-19 Japan Dashboard (CC BY 4.0)
    COVID-19 Japan 都道府県別 感染症病床数 (CC BY)

    Wikipedia: Wikipedia

    LinkData: LinkData (Public Domain)

    Inspiration

    1. Changes in number of cases over time
    2. Percentage of patients without symptoms / mild or severe symptoms
    3. What to do next to prevent outbreak

    License and how to cite

    Kindly cite this dataset under the CC BY 4.0 license as follows:
    - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
    - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

    --- Original source retains full ownership of the source dataset ---

  5. Electrical-engineering

    • huggingface.co
    Updated Jan 16, 2024
    Cite
    mod (2024). Electrical-engineering [Dataset]. https://huggingface.co/datasets/STEM-AI-mtl/Electrical-engineering
    Dataset updated
    Jan 16, 2024
    Authors
    mod
    License

    https://choosealicense.com/licenses/other/

    Description

    To the electrical engineering community

    This dataset contains Q&A prompts about electrical engineering, KiCad's EDA software features, and Python code for its scripting console.

    Authors

    STEM.AI: stem.ai.mtl@gmail.com
    William Harbec

  6. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview

    This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:
    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance (see the sketch below).
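
    A minimal sketch of step 4; the file name and the column names (engine_temp, vibration_level, maintenance_status) are hypothetical placeholders, not taken from the actual dataset:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    df = pd.read_csv("marine_engine_data.csv")   # hypothetical file name
    X = df[["engine_temp", "vibration_level"]]   # hypothetical feature columns
    y = df["maintenance_status"]                 # Normal / Requires Maintenance / Critical

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=42
    )

    # Small grid over a Random Forest; extend with XGBoost for a full comparison
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,
        scoring="accuracy",
    )
    grid.fit(X_train, y_train)
    print("Best params:", grid.best_params_)
    print("Held-out accuracy:", grid.score(X_test, y_test))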

    Tools Used
    1. Python: Data processing, analysis, and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated
    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings
    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: A balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced
    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: Accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action
    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  7. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
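
    A short sketch of the kind of cleaning and EDA described above; the file name and column names (started_at, ended_at, member_casual) follow the public Cyclistic/Divvy exports and are assumptions here:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    rides = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])

    # Cleaning: drop duplicates and non-positive trip durations
    rides = rides.drop_duplicates()
    rides["trip_minutes"] = (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60
    rides = rides[rides["trip_minutes"] > 0]

    # EDA: peak usage by start hour, split by rider type
    rides["start_hour"] = rides["started_at"].dt.hour
    sns.countplot(data=rides, x="start_hour", hue="member_casual")
    plt.title("Rides per start hour")
    plt.show()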

  8. ML-Based RUL Prediction for NPP Transformers

    • kaggle.com
    Updated Apr 10, 2025
    Cite
    Dmitry_Menyailov (2025). ML-Based RUL Prediction for NPP Transformers [Dataset]. https://www.kaggle.com/datasets/idmitri/ml-based-rul-prediction-for-npp-transformers
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dmitry_Menyailov
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Notebooks

    1. Exploratory_Data_Analysis

    https://www.kaggle.com/code/idmitri/exploratory-data-analysis

    2. RUL_Prediction_Modeling

    https://www.kaggle.com/code/idmitri/rul-prediction-modeling

    About the Project

    Power transformers at nuclear power plants (NPPs) can be operated beyond their design service life (25 years), which requires enhanced condition monitoring to ensure reliable and safe operation.

    Transformer condition is assessed with dissolved gas analysis, which detects defects from the gas concentrations in the oil and makes it possible to predict the transformer's remaining useful life (RUL). Traditional monitoring systems are limited to fixed concentration thresholds, which reduces diagnostic accuracy and automation. Machine learning methods can reveal hidden dependencies and improve prediction accuracy. More details (in Russian): https://habr.com/ru/articles/743682/

    Results

    This project performs in-depth exploratory data analysis (EDA) and builds 12 groups of features:
    - gases (gas concentrations)
    - trend (trend components)
    - seasonal (seasonal components)
    - resid (residual components)
    - quantiles (distribution quantiles)
    - volatility (volatility of the concentrations)
    - range (range of values)
    - coefficient of variation
    - standard deviation
    - skewness (of the distribution)
    - kurtosis (of the distribution)
    - category (categorical fault features)

    Using statistical and decomposition features made it possible to reproduce the shape of the RUL distribution with automatic outlier handling, which previously required manual correction.

    Machine learning algorithms (LightGBM, CatBoost, Extra Trees) and their ensemble were used for modeling. The best accuracy was achieved by a LightGBM model with hyperparameters optimized using Optuna: MAE = 61.85, RMSE = 88.21, R2 = 0.8634.
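
    An illustrative sketch of that setup (synthetic data stands in for the engineered features; this is not the project's actual code):

    import lightgbm as lgb
    import optuna
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder for the engineered feature matrix and RUL target
    X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)

    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 16, 128),
        }
        model = lgb.LGBMRegressor(**params, random_state=42)
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
        return -scores.mean()  # minimize MAE

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=30)
    print("Best MAE:", study.best_value)
    print("Best params:", study.best_params)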

    Notes

    The exploratory data analysis (EDA) code was developed and tested locally in a VS Code Jupyter Notebook with a Python 3.10.16 environment. On Kaggle, most of the plots render correctly, but some complex visualizations (for example, multidimensional plots with a color scale) are not adapted because of platform limitations. Despite attempts to optimize the code without major changes, full compatibility could not be achieved. The main problems were library version conflicts and a significant drop in performance: computations took roughly 10 times longer than on a local MacBook M3 Pro machine. On Kaggle, either the PyCaret-based operations ran correctly or the machine learning models did, but not both at the same time.

    A hybrid workflow is therefore suggested:
    - Publish and report metrics on Kaggle to visualize the results.
    - Run computations and model training locally in the pre-configured Python 3.10.16 environment. To reproduce the experiments, a Codes folder is provided with the VSC EDA and RUL code and a libraries_for_modeling file listing the versions of all libraries used.

    I am happy to answer questions in the comments about setting up and running the code, and I would appreciate advice on preventing such problems.

  9. Subreddit Interactions for 25,000 Users

    • kaggle.com
    Updated Feb 19, 2017
    Cite
    colemaclean (2017). Subreddit Interactions for 25,000 Users [Dataset]. https://www.kaggle.com/datasets/colemaclean/subreddit-interactions/data
    Dataset updated
    Feb 19, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    colemaclean
    Description

    Context

    The dataset is a CSV file compiled with a Python scraper built on Reddit's PRAW API. The raw data is a list of 3-tuples of [username, subreddit, utc timestamp]. Each row represents a single comment made by a user, and the file covers about 5 days' worth of Reddit data. Note that the actual comment text is not included, only the user, subreddit, and timestamp of the comment. The goal of the dataset is to provide a lens for discovering user patterns from Reddit metadata alone. The original use case was to compile a dataset suitable for training a neural network for a subreddit recommender system. That final system can be found here.

    A very unpolished EDA for the dataset can be found here. Note that the published dataset is only half of the one used in the EDA and recommender system, to meet Kaggle's 500 MB size limitation.

    Content

    user - The username of the person submitting the comment
    subreddit - The title of the subreddit the user made the comment in
    utc_stamp - The UTC timestamp of when the user made the comment
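
    A quick pandas look at the three columns above (the CSV file name and presence of a header row are assumptions):

    import pandas as pd

    df = pd.read_csv("reddit_data.csv")  # expected columns: user, subreddit, utc_stamp

    # Most active subreddits and how many distinct subreddits each user touches
    print(df["subreddit"].value_counts().head(10))
    print(df.groupby("user")["subreddit"].nunique().describe())

    # Convert the UTC timestamp to a datetime for time-based analysis
    df["timestamp"] = pd.to_datetime(df["utc_stamp"], unit="s")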

    Acknowledgements

    The dataset was compiled as part of a school project. The final project report, with my collaborators, can be found here.

    Inspiration

    We were able to build a pretty cool subreddit recommender with the dataset. A blog post for it can be found here, and the stand alone jupyter notebook for it here. Our final model is very undertuned, so there's definitely improvements to be made there, but I think there are many other cool data projects and visualizations that could be built from this dataset. One example would be to analyze the spread of users through the Reddit ecosystem, whether the average user clusters in close communities, or traverses wide and far to different corners. If you do end up building something on this, please share! And have fun!

    Released under Reddit's API licence

  10. All Lending Club loan data

    • kaggle.com
    Updated Apr 10, 2019
    Cite
    Nathan George (2019). All Lending Club loan data [Dataset]. https://www.kaggle.com/wordsforthewise/lending-club/code
    Dataset updated
    Apr 10, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nathan George
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Update: I probably won't be able to update the data anymore, as LendingClub now has a scary 'TOS' popup when downloading the data. Worst case, they will ask me/Kaggle to take it down from here.

    This dataset contains the full LendingClub data available from their site. There are separate files for accepted and rejected loans. The accepted loans also include the FICO scores, which can only be downloaded when you are signed in to LendingClub and download the data.

    See the Python and R getting-started kernels:

    I created a git repo for the code which is used to create this data: https://github.com/nateGeorge/preprocess_lending_club_data

    Background

    I wanted an easy way to share all the LendingClub data with others. Unfortunately, the data on their site is fragmented into many smaller files. There is another LendingClub dataset on Kaggle, but it hasn't been updated in years. It seems like the "Kaggle Team" is updating it now. I think it also doesn't include the full rejected loans, which are included here. It seems like the other dataset confusingly has some of the rejected loans mixed into the accepted ones. Now there are a ton of other LendingClub datasets on here too, most of which seem to have no documentation or explanation of what the data actually is.

    Content

    The definitions for the fields are on the LendingClub site, at the bottom of the page. Kaggle won't let me upload the .xlsx file for some reason since it seems to be in multiple other data repos. This file seems to be in the other main repo, but again, it's better to get it directly from the source.

    Unfortunately, there is (maybe "was" now?) a limit of 500MB for dataset files, so I had to compress the files with gzip in the Python pandas package.

    I cleaned the data a tiny bit: I removed percent symbols (%) from int_rate and revol_util columns in the accepted loans and converted those columns to floats.
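
    A rough reconstruction of that cleaning step (not the author's exact script; the input file name is an assumption):

    import pandas as pd

    accepted = pd.read_csv("accepted_loans.csv", low_memory=False)  # assumed file name

    # Strip '%' from the rate/utilization columns and convert to float
    for col in ["int_rate", "revol_util"]:
        accepted[col] = accepted[col].astype(str).str.rstrip("%").astype(float)

    # Write back gzip-compressed, as described above, to stay under the size limit
    accepted.to_csv("accepted_loans.csv.gz", index=False, compression="gzip")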

    Update

    The URL column is in the dataset for completeness, as of 2018 Q2.

  11. Phonpe Pulse Master

    • kaggle.com
    Updated Jan 21, 2025
    Cite
    HARSHAVARDHAN AEITY (2025). Phonpe Pulse Master [Dataset]. https://www.kaggle.com/datasets/harshavardhan0022/phonpe-pulse-master
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HARSHAVARDHAN AEITY
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Github

    PhonePe Insurance, Transaction and User Engagement Analysis

    Project Overview

    This project analyzes insurance transactions and user engagement trends on PhonePe, a leading digital payments platform in India. The goal is to provide data-driven insights into geographical, brand-specific, and user engagement performance metrics, helping to optimize insurance transaction efficiency and customer engagement on the platform.

    Business Problem Statement

    With PhonePe expanding its financial services, particularly in insurance, it's essential to:
    - Analyze geographical performance (state-wise and district-wise).
    - Identify trends across various brands.
    - Examine user engagement to understand regions with low app activity despite high registration numbers.

    These insights are valuable in developing targeted strategies for market penetration, revenue growth, and user re-engagement.

    Data Processing

    The raw data was initially available in JSON format. Using Python libraries—os, json, and pandas—the data was converted to CSV files to facilitate easier manipulation and analysis.

    Steps:
    1. Data Loading and Transformation: Read the JSON data, clean it, and structure it into CSV format (see the sketch below).
    2. Data Storage: Store processed data in a format compatible with analytical tools like Power BI.
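
    A sketch of step 1; the directory layout and JSON keys below follow the public PhonePe Pulse repository and are assumptions about this dataset:

    import json
    import os

    import pandas as pd

    base = "pulse/data/aggregated/insurance/country/india/state"  # assumed layout
    rows = []
    for state in os.listdir(base):
        for year in os.listdir(os.path.join(base, state)):
            for fname in os.listdir(os.path.join(base, state, year)):
                with open(os.path.join(base, state, year, fname)) as f:
                    payload = json.load(f)
                for item in payload["data"]["transactionData"]:
                    rows.append({
                        "state": state,
                        "year": int(year),
                        "quarter": int(fname.removesuffix(".json")),
                        "count": item["paymentInstruments"][0]["count"],
                        "amount": item["paymentInstruments"][0]["amount"],
                    })

    pd.DataFrame(rows).to_csv("insurance_transactions.csv", index=False)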

    Exploratory Data Analysis (EDA)

    EDA was performed to understand the dataset and discover patterns and correlations in:
    - State-wise and district-wise insurance transaction trends.
    - Brand-wise transaction volumes.
    - User engagement metrics, correlating registered users with app opens.

    Data Visualization

    The project includes interactive visualizations for decision-making, developed in two tools:
    - Power BI Dashboard: Displays transaction metrics across quarters, years, states, districts, and brands, with engagement insights.
    - Streamlit Application: Provides a user-friendly, web-based interface for real-time data insights.

    Power BI Dashboard Visuals

    1. Quarterly and Yearly Transaction Trends: Transaction counts and amounts over time.
    2. Geographic Insights: State-wise and district-wise transaction performance.
    3. User Engagement Metrics: Comparison of registered users and app opens to identify engagement opportunities.

    Technologies Used

    • Data Processing: Python (pandas, json, os)
    • Visualization: Power BI, Streamlit
  12. LCK Spring 2024 Players Statistics

    • kaggle.com
    Updated Dec 1, 2024
    Cite
    Lukas Rozado (2024). LCK Spring 2024 Players Statistics [Dataset]. https://www.kaggle.com/datasets/lukasrozado/lck-spring-2024-players-statistics/data
    Dataset updated
    Dec 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lukas Rozado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides an in-depth look at the League of Legends Champions Korea (LCK) Spring 2024 season. It includes detailed metrics for players, champions, and matches, meticulously cleaned and organized for easy analysis and modeling.

    Data Collection

    The data was collected using a combination of manual efforts and automated web scraping tools. Specifically:

    - Source: Data was gathered from Gol.gg, a well-known platform for League of Legends statistics.
    - Automation: Web scraping was performed using Python libraries such as BeautifulSoup and Selenium to extract information on players, matches, and champions efficiently.
    - Focus: The scripts were designed to capture relevant performance metrics for each player and champion used during the Spring 2024 split.

    Data Cleaning and Processing

    The raw data obtained from web scraping required significant preprocessing to ensure its usability. The following steps were taken:

    Handling Raw Data:

    Extracted key performance indicators like KDA, Win Rate, Games Played, and Match Durations from the source. Normalized inconsistent formats for metrics such as win rates (e.g., removing %) and durations (e.g., converting MM:SS to total seconds).
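
    A small sketch of those two normalizations (the file and column names are assumptions about the final tables):

    import pandas as pd

    players = pd.read_csv("lck_spring_2024_players.csv")  # assumed file name

    # "57%" -> 57.0
    players["Win Rate"] = players["Win Rate"].str.rstrip("%").astype(float)

    # "32:45" (MM:SS) -> total seconds
    def mmss_to_seconds(value: str) -> int:
        minutes, seconds = value.split(":")
        return int(minutes) * 60 + int(seconds)

    players["Game Duration (s)"] = players["Game Duration"].map(mmss_to_seconds)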

    Data Cleaning:

    Removed duplicate rows and ensured no missing values. Fixed inconsistencies in player and champion names to maintain uniformity. Checked for outliers in numerical metrics (e.g., unrealistically high KDA values).

    Data Organization:

    Created three separate tables for better data management:

    - Player Statistics: General player performance metrics like KDA, win rates, and average kills.
    - Champion Statistics: Data on games played, win rates, and KDA for each champion.
    - Match List: Details of each match, including players, champions, and results.

    Sequential Player IDs were added to connect the three datasets, facilitating relational analysis. For date formatting, all date fields were converted to the DD/MM/YYYY format for consistency, and irrelevant time data was removed to focus solely on match dates.

    Tools and Libraries Used

    The following tools were used throughout the project:

    - Python: Pandas and NumPy for data manipulation; BeautifulSoup and Selenium for web scraping; Matplotlib, Seaborn, and Plotly for visualization.
    - Excel: Consolidated the final datasets into a structured Excel file with multiple sheets.
    - Data Validation: Python scripts were used to check for missing data, validate numerical columns, and ensure data consistency.
    - Kaggle Integration: Cleaned datasets and a comprehensive README file were prepared for direct upload to Kaggle.

    Applications

    This dataset is ready for use in:
    - Exploratory Data Analysis (EDA): Visualize player and champion performance trends across matches.
    - Machine Learning: Develop models to predict match outcomes based on player and champion statistics.
    - Sports Analytics: Gain insights into champion picks, win rates, and individual player strategies.

    Acknowledgments

    This dataset was made possible by the extensive statistics available on Gol.gg and the use of Python-based web scraping and data cleaning methodologies. It is shared under the CC BY 4.0 License to encourage reuse and collaboration.

  13. App Store Mobile Games 2008 - 2019

    • kaggle.com
    Updated Sep 11, 2024
    Cite
    Mayank Singh (2024). App Store Mobile Games 2008 - 2019 [Dataset]. https://www.kaggle.com/datasets/mayanksinghr/app-store-mobile-games-2008-2019
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Kaggle
    Authors
    Mayank Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains one Excel workbook (.xlsx) with two sheets.

    • Sheet 1 - App Store Games: the mobile games launched on the App Store from 2008 to 2019.
    • Sheet 2 - Data Dictionary: an explanation of the columns in the data.

    This data can be used to practice EDA and data cleaning tasks, and for data visualization using the Python Matplotlib and Seaborn libraries (see the sketch after the column list below).

    I also used this dataset for a Power BI project and created a dashboard on it, using Python inside Power Query to clean and convert some encoded and Unicode characters in the App URL, Name, and Description columns.

    Total Columns: 16

    • App URL
    • App ID
    • Name
    • Subtitle
    • Icon URL
    • Average User Rating
    • User Rating Count
    • Price per App (USD)
    • Description
    • Developer
    • Age Rating
    • Languages
    • Size in Bytes
    • Primary Genre
    • Genres
    • Release Date
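
    A minimal visualization sketch using the columns above (the workbook file name is an assumption; the sheet name matches the description, but check the exact header spellings in the file):

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    games = pd.read_excel("appstore_games.xlsx", sheet_name="App Store Games")

    # Average user rating for the ten highest-rated primary genres
    top = (
        games.groupby("Primary Genre")["Average User Rating"]
        .mean()
        .sort_values(ascending=False)
        .head(10)
    )
    sns.barplot(x=top.values, y=top.index)
    plt.xlabel("Average user rating")
    plt.tight_layout()
    plt.show()
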
  14. Phone Price Predict 2020-2024

    • kaggle.com
    Updated Dec 10, 2024
    Cite
    Jerowai (2024). Phone Price Predict 2020-2024 [Dataset]. https://www.kaggle.com/datasets/jerowai/phone-price-predict-2020-2024/data
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jerowai
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview

    This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.

    What’s Inside?

    - Phone Name & Release Year: Quickly reference the time frame and model.
    - Camera Specs: Understand the rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
    - Processor & GPU: Gain insights into the performance capabilities by checking the processor and graphics chip.
    - Memory & Storage: Review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
    - Display & Battery: Compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
    - Operating System: Note the Android version at release.
    - Price (USD): Examine relative pricing trends over the years.

    How to Use This Dataset

    - Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities. Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.

    - Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users. Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]

    - Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you could still practice predictive modeling, for example by creating a simple regression model to predict price based on features like RAM and display size. Example: train a regression model (e.g., LinearRegression in scikit-learn) to see if increasing RAM or battery capacity correlates with higher prices.

    - Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved. See if there's a noticeable shift towards larger displays, bigger batteries, or higher camera megapixels over the years.
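
    The inline examples above, consolidated into one runnable sketch (the CSV file name and the "Phone Name" column are assumptions; the other column names are taken from the examples):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("samsung_phones_2020_2024.csv")  # assumed file name

    # Average price per release year
    print(df.groupby("Release Year")["Price (USD)"].mean())

    # Filter devices suitable for power users
    power = df[(df["RAM (GB)"] >= 8) & (df["Battery Capacity (mAh)"] >= 5000)]
    print(power[["Phone Name", "Price (USD)"]])

    # Toy regression: price from RAM and battery capacity
    X = df[["RAM (GB)", "Battery Capacity (mAh)"]]
    y = df["Price (USD)"]
    model = LinearRegression().fit(X, y)
    print(dict(zip(X.columns, model.coef_)))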

    Recommended Tools & Libraries

    - Python & Pandas: For data cleaning, manipulation, and initial analysis.
    - Matplotlib & Seaborn: For creating visualizations to understand trends and distributions.
    - scikit-learn: For modeling and basic predictive tasks, if you choose to use these example values as a training ground.
    - Jupyter Notebooks or Kaggle Kernels: For interactive analysis and iterative exploration.

    Disclaimer

    This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It's intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.

  15. Salary_time

    • kaggle.com
    Updated Nov 23, 2020
    Cite
    gaurav9712 (2020). Salary_time [Dataset]. https://www.kaggle.com/datasets/gaurav9712/salary-time/versions/1
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    gaurav9712
    Description

    1) Salary_hike -> Build a prediction model for Salary_hike

    Build a simple linear regression model: perform EDA, apply the necessary transformations, and select the best model using R or Python.
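
    One possible sketch of that workflow in Python (the file and column names, Salary_Data.csv, YearsExperience, and Salary, are assumptions):

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("Salary_Data.csv")

    # Quick EDA: summary statistics and correlation
    print(df.describe())
    print(df.corr(numeric_only=True))

    # Simple linear regression: Salary ~ YearsExperience
    X = sm.add_constant(df["YearsExperience"])
    model = sm.OLS(df["Salary"], X).fit()
    print(model.summary())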

  16. Delivery Time

    • kaggle.com
    Updated Nov 23, 2020
    Cite
    gaurav9712 (2020). Delivery Time [Dataset]. https://www.kaggle.com/gaurav9712/delivery-time/code
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    gaurav9712
    Description

    This is for beginners who are just starting machine learning. 1) Delivery_time -> Predict delivery time using sorting time.

    Build a simple linear regression model: perform EDA, apply the necessary transformations, and select the best model using R or Python.

  17. Sephora Skincare Reviews

    • kaggle.com
    Updated Dec 22, 2024
    Cite
    Melissa Monfared (2024). Sephora Skincare Reviews [Dataset]. https://www.kaggle.com/datasets/melissamonfared/sephora-skincare-reviews
    Dataset updated
    Dec 22, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Melissa Monfared
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the original dataset was uploaded by nadyinky on Kaggle and is accessible through the following link: https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews

    In this dataset, the skincare products have been separated from other products in the reviews datasets, such as cosmetics and makeup, for use in the intended project.

    This dataset was collected via a Python scraper in March 2023 by https://www.kaggle.com/nadyinky and contains:

    - Information about all beauty products (over 8,000) from the Sephora online store, including product and brand names, prices, ingredients, ratings, and all features.
    - User reviews (about 1 million on over 2,000 products) of all products from the Skincare category, including user appearances and review ratings by other users.

    Dataset Usage Examples:
    - Exploratory Data Analysis (EDA): Explore product categories, regular and discount prices, brand popularity, the impact of different characteristics on price, and ingredient trends (see the starter sketch below).
    - Sentiment Analysis: Is the emotional tone of a review positive, negative, or neutral? Which brands or products have the most positive or negative reviews?
    - Text Analysis: What do customers say most often in their negative and positive reviews? Do customers have any common problems with their skincare?
    - Recommender System: Analyzing a customer's past purchase history and reviews, suggest products that are likely to be of interest to them.
    - Data Visualization: What are the most popular brands and products? What is the distribution of prices? Which products are closest to each other in ingredients? What does the cloud of the most frequently used words look like?
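
    A starter sketch for the EDA and sentiment ideas above (the file name and column names, brand_name, price_usd, and rating, are assumptions about the CSV layout):

    import pandas as pd

    reviews = pd.read_csv("skincare_reviews.csv")

    # EDA: most-reviewed brands and the price distribution
    print(reviews["brand_name"].value_counts().head(10))
    print(reviews["price_usd"].describe())

    # Very rough sentiment proxy from star ratings
    reviews["sentiment"] = pd.cut(
        reviews["rating"], bins=[0, 2, 3, 5], labels=["negative", "neutral", "positive"]
    )
    print(reviews["sentiment"].value_counts())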

  18. Forbes billionaires 2022

    • kaggle.com
    Updated May 2, 2022
    Cite
    gsusAguirreArz (2022). Forbes billionaires 2022 [Dataset]. https://www.kaggle.com/datasets/jjdaguirre/forbes-billionaires-2022/data
    Dataset updated
    May 2, 2022
    Dataset provided by
    Kaggle
    Authors
    gsusAguirreArz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset containing the list of 2,500+ people with fortunes valued at 1 billion USD or more.

    Dataset Features

    • Rank
    • Name
    • Net Worth
    • Age
    • Country
    • Source
    • Industry

    Source

    Scraping Python script here
