13 datasets found
  1. Household Energy Consumption

    • kaggle.com
    zip
    Updated Apr 5, 2025
    Cite
    Samharison (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/samxsam/household-energy-consumption
Available download formats: zip (748210 bytes)
    Dataset updated
    Apr 5, 2025
    Authors
    Samharison
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🏡 Household Energy Consumption - April 2025 (90,000 Records)

    📌 Overview

This dataset presents detailed energy consumption records from various households over the month of April 2025. With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak-hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.

| Column Name | Data Type Category | Description |
|---|---|---|
| Household_ID | Categorical (Nominal) | Unique identifier for each household |
| Date | Datetime | The date of the energy usage record |
| Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
| Household_Size | Numerical (Discrete) | Number of individuals living in the household |
| Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
| Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
| Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |

    📂 Dataset Summary

    • Rows: 90,000
    • Time Range: April 1, 2025 – April 30, 2025
    • Data Granularity: Daily per household
    • Location: Simulated global coverage
    • Format: CSV (Comma-Separated Values)

    📚 Libraries Used for Working with household_energy_consumption_2025.csv

    🔍 1. Data Manipulation & Analysis

| Library | Purpose |
|---|---|
| pandas | Reading, cleaning, and transforming tabular data |
| numpy | Numerical operations, working with arrays |

📊 2. Data Visualization

| Library | Purpose |
|---|---|
| matplotlib | Creating static plots (line, bar, histograms, etc.) |
| seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
| plotly | Interactive charts (time series, pie, bar, scatter, etc.) |

📈 3. Machine Learning / Modeling

| Library | Purpose |
|---|---|
| scikit-learn | Preprocessing, regression, classification, clustering |
| xgboost / lightgbm | Gradient boosting models for better accuracy |

🧹 4. Data Preprocessing

| Library | Purpose |
|---|---|
| sklearn.preprocessing | Encoding categorical features, scaling, normalization |
| datetime / pandas | Date-time conversion and manipulation |

🧪 5. Model Evaluation

| Library | Purpose |
|---|---|
| sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |

    ✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.

    📈 Potential Use Cases

    This dataset is ideal for a wide variety of analytics and machine learning projects:

    🔮 Forecasting & Time Series Analysis

    • Predict future household energy consumption based on previous trends and weather conditions.
    • Identify seasonal and daily consumption patterns.

    💡 Energy Efficiency Analysis

    • Analyze differences in energy consumption between households with and without air conditioning.
    • Compare energy usage efficiency across varying household sizes.

    🌡️ Climate Impact Studies

    • Investigate how temperature affects electricity usage across households.
    • Model the potential impact of climate change on residential energy demand.

    🔌 Peak Load Management

    • Build models to predict and manage energy demand during peak hours.
    • Support research on smart grid technologies and dynamic pricing.

    🧠 Machine Learning Projects

    • Supervised learning (regression/classification) to predict energy consumption.
    • Clustering households by usage patterns for targeted energy programs.
    • Anomaly detection in energy usage for fault detection.

    🛠️ Example Starter Projects

    • Time-series forecasting using Facebook Prophet or ARIMA
    • Regression modeling using XGBoost or LightGBM
    • Classification of AC vs. non-AC household behavior
    • Energy-saving recommendation systems
    • Heatmaps of temperature vs. energy usage
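
    🧪 Quick Start Sketch

    As a concrete starting point, here is a minimal regression sketch using the columns documented above. The file name household_energy_consumption_2025.csv comes from the Libraries section; the local path and the model choice are illustrative assumptions, not part of the dataset.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Path is an assumption; point it at your local copy of the dataset
    df = pd.read_csv("household_energy_consumption_2025.csv")

    # Encode the binary Has_AC column (Yes/No) as 0/1
    df["Has_AC"] = (df["Has_AC"] == "Yes").astype(int)

    features = ["Household_Size", "Avg_Temperature_C", "Has_AC", "Peak_Hours_Usage_kWh"]
    X, y = df[features], df["Energy_Consumption_kWh"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))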
  2. Table_2_XCast: A python climate forecasting toolkit.docx

    • frontiersin.figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Kyle Joseph Chen Hall; Nachiketa Acharya (2023). Table_2_XCast: A python climate forecasting toolkit.docx [Dataset]. http://doi.org/10.3389/fclim.2022.953262.s002
Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Kyle Joseph Chen Hall; Nachiketa Acharya
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Climate forecasts, both experimental and operational, are often made by calibrating Global Climate Model (GCM) outputs with observed climate variables using statistical and machine learning models. Often, machine learning techniques are applied to gridded data independently at each gridpoint. However, the implementation of these gridpoint-wise operations is a significant barrier to entry to climate data science. Unfortunately, there is a significant disconnect between the Python data science ecosystem and the gridded earth data ecosystem. Traditional Python data science tools are not designed to be used with gridded datasets, like those commonly used in climate forecasting. Heavy data preprocessing is needed: gridded data must be aggregated, reshaped, or reduced in dimensionality in order to fit the strict formatting requirements of Python's data science tools. Efficiently implementing this gridpoint-wise workflow is a time-consuming logistical burden which presents a high barrier to entry to earth data science. A set of high-performance, easy-to-use Python climate forecasting tools is needed to bridge the gap between Python's data science ecosystem and its gridded earth data ecosystem. XCast, an Xarray-based climate forecasting Python library developed by the authors, bridges this gap. XCast wraps underlying two-dimensional data science methods, like those of Scikit-Learn, with data structures that allow them to be applied to each gridpoint independently. XCast uses high-performance computing libraries to efficiently parallelize the gridpoint-wise application of data science utilities and make Python's traditional data science toolkits compatible with multidimensional gridded data. XCast also implements a diverse set of climate forecasting tools including traditional statistical methods, state-of-the-art machine learning approaches, preprocessing functionality (regridding, rescaling, smoothing), and postprocessing modules (cross validation, forecast verification, visualization). These tools are useful for producing and analyzing both experimental and operational climate forecasts. In this study, we describe the development of XCast, and present in-depth technical details on how XCast brings highly parallelized gridpoint-wise versions of traditional Python data science tools into Python's gridded earth data ecosystem. We also demonstrate a case study where XCast was used to generate experimental real-time deterministic and probabilistic forecasts for South Asian Summer Monsoon Rainfall in 2022 using different machine learning-based multi-model ensembles.
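
    To make the gridpoint-wise workflow described above concrete, here is a minimal sketch, written with plain xarray, NumPy, and scikit-learn rather than XCast's actual API, of fitting one scikit-learn model independently at each gridpoint; this is exactly the kind of loop that XCast wraps and parallelizes. The synthetic data and all names are illustrative.

    import numpy as np
    import xarray as xr
    from sklearn.linear_model import LinearRegression

    # Synthetic "GCM output" predictor and observed target: (year, lat, lon)
    years, lats, lons = 30, 4, 5
    rng = np.random.default_rng(0)
    X = xr.DataArray(rng.normal(size=(years, lats, lons)), dims=("year", "lat", "lon"))
    y = X * 0.7 + rng.normal(scale=0.3, size=(years, lats, lons))

    # Gridpoint-wise calibration: fit an independent model at each (lat, lon) cell
    pred = np.empty((lats, lons))
    for i in range(lats):
        for j in range(lons):
            model = LinearRegression()
            model.fit(X[:, i, j].values.reshape(-1, 1), y[:, i, j].values)
            # Forecast from the most recent year's predictor value
            pred[i, j] = model.predict(X[-1, i, j].values.reshape(1, 1))[0]

    forecast = xr.DataArray(pred, dims=("lat", "lon"))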

  3. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
Available download formats: text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Neelima Brahmbhatt
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The methodology is the core component of any research work; it describes the methods used to obtain the results. Here, the entire implementation is done in Python. The work proceeds through the following steps:

    1. Acquire Personality Dataset

Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file via the link provided, and it consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, or extraverted. The preprocessed dataset is then split into training and testing sets by passing the feature values, target values, and test size to scikit-learn's train-test split method. After the split, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.

    3. Feature Extraction

The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

        EXT1: I am the life of the party.
        EXT2: I don't talk a lot.
        EXT3: I feel comfortable around people.
        EXT4: I am quiet around strangers.
        EST1: I get stressed out easily.
        EST2: I get irritated easily.
        EST3: I worry about things.
        EST4: I change my mood a lot.
        AGR1: I have a soft heart.
        AGR2: I am interested in people.
        AGR3: I insult people.
        AGR4: I am not really interested in others.
        CSN1: I am always prepared.
        CSN2: I leave my belongings around.
        CSN3: I follow a schedule.
        CSN4: I make a mess of things.
        OPN1: I have a rich vocabulary.
        OPN2: I have difficulty understanding abstract ideas.
        OPN3: I do not have a good imagination.
        OPN4: I use difficult words.

    4. Training the Model

Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set: 80% for training and 20% for testing. You train the model using the training set. In this work we trained the models using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.

    5. Personality Prediction Output

After training, the Logistic Regression and SVM models are evaluated on the test set using cohen_kappa_score and accuracy_score. A minimal sketch of this workflow follows.
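
    This sketch uses the scikit-learn functions named above; the file name train.csv comes from the description, while the label column name ("Personality") is an illustrative assumption.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    df = pd.read_csv("train.csv")  # personality dataset downloaded from Kaggle

    # "Personality" as the target column is an assumption; check the file's schema
    X = df.drop(columns=["Personality"])
    y = df["Personality"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    for model in (LogisticRegression(max_iter=1000), SVC()):
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(type(model).__name__,
              "accuracy:", accuracy_score(y_test, preds),
              "kappa:", cohen_kappa_score(y_test, preds))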

  4. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
Sufyan Yousaf
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model that predicts outcomes from a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing the model.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
  To work with this dataset, use a Python environment (for example in VS Code or Jupyter) with tools such as:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
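
      A minimal loading sketch, assuming the file names described above and a target column called "label" (the actual column name is not specified in this description):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # File names follow the naming conventions described above
    train = pd.read_csv("train_data.csv")
    val = pd.read_csv("validation_data.csv")
    test = pd.read_csv("test_data.csv")

    # "label" as the target column is an assumption; check the actual schema
    X_train, y_train = train.drop(columns=["label"]), train["label"]
    X_val, y_val = val.drop(columns=["label"]), val["label"]
    X_test, y_test = test.drop(columns=["label"]), test["label"]

    clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))  # final evaluation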

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  5. Sentiment Prediction Outputs for Twitter Dataset

    • test.researchdata.tuwien.at
    bin, csv, png, txt
    Updated May 20, 2025
    Cite
Hachem Bouhamidi (2025). Sentiment Prediction Outputs for Twitter Dataset [Dataset]. http://doi.org/10.70124/c8v83-0sy11
Available download formats: bin, csv, png, txt
    Dataset updated
    May 20, 2025
    Dataset provided by
    TU Wien
    Authors
Hachem Bouhamidi
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Context and Methodology:

    This dataset was created as part of a sentiment analysis project using enriched Twitter data. The objective was to train and test a machine learning model to automatically classify the sentiment of tweets (e.g., Positive, Negative, Neutral).
    The data was generated using tweets that were sentiment-scored with a custom sentiment scorer. A machine learning pipeline was applied, including text preprocessing, feature extraction with CountVectorizer, and prediction with a HistGradientBoostingClassifier.

    Technical Details:

    The dataset includes five main files:

    • test_predictions_full.csv – Predicted sentiment labels for the test set.

    • sentiment_model.joblib – Trained machine learning model.

    • count_vectorizer.joblib – Text feature extraction model (CountVectorizer).

    • model_performance.txt – Evaluation metrics and performance report of the trained model.

    • confusion_matrix.png – Visualization of the model’s confusion matrix.

    The files follow standard naming conventions based on their purpose.
    The .joblib files can be loaded into Python using the joblib and scikit-learn libraries.
The .csv, .txt, and .png files can be opened with any standard text editor, spreadsheet software, or image viewer.
    Additional performance documentation is included within the model_performance.txt file.
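
    A minimal sketch of reusing the two .joblib artifacts named above to score new tweets; densifying the features assumes the classifier was trained on dense arrays, since HistGradientBoostingClassifier does not accept sparse input:

    import joblib

    model = joblib.load("sentiment_model.joblib")
    vectorizer = joblib.load("count_vectorizer.joblib")

    tweets = ["What a fantastic day!", "This is the worst service ever."]
    X = vectorizer.transform(tweets)  # sparse bag-of-words features

    # HistGradientBoostingClassifier requires dense input, hence .toarray()
    print(model.predict(X.toarray()))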

    Additional Details:

    • The data was constructed to ensure reproducibility.

    • No personal or sensitive information is present.

    • It can be reused by researchers, data scientists, and students interested in Natural Language Processing (NLP), machine learning classification, and sentiment analysis tasks.

  6. Data Set for Probabilistic Indoor Temperature Forecasting

    • zenodo.org
    bin
    Updated Oct 16, 2024
    Cite
Roman Kempf; Marcel Arpogaus; Tim Baur; Gunnar Schubert (2024). Data Set for Probabilistic Indoor Temperature Forecasting [Dataset]. http://doi.org/10.5281/zenodo.11911791
Available download formats: bin
    Dataset updated
    Oct 16, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Roman Kempf; Marcel Arpogaus; Tim Baur; Gunnar Schubert
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Dataset Manifest

    This text provides a description of the dataset used for model training and evaluation in our study "A Tutorial on Deep Learning for Probabilistic Indoor Temperature Forecasting". The dataset consists of various simulated thermal and environmental parameters for different room configurations. Below, you will find a table detailing each column in the dataset along with its description and unit of measurement.

    1.1. Columns Description

| Column Name | Description | Unit |
|---|---|---|
| time | Time stamp of the measurement | - |
| ZweiPersonenBuero.TAir | Air temperature inside a two-person office | °C |
| heatStat.Heat.Q_flow | Heating rate in the room | W |
| weaDat.AirPressure | Atmospheric pressure | Pa |
| weaDat.AirTemp | Outside air temperature | °C |
| weaDat.SkyRadiation | Longwave sky radiation | W/m² |
| weaDat.TerrestrialRadiation | Terrestrial radiation | W/m² |
| weaDat.WaterInAir | Absolute humidity | g/kg |
| VAir | Air volume in the room | m³ |
| AExt0 | Exterior wall area facing the south | m² |
| AExt1 | Exterior wall area facing the north | m² |
| AInt | Total interior wall area | m² |
| AFloor | Floor area of the room | m² |
| AWin0 | Window area facing the south | m² |
| AWin1 | Window area facing the north | m² |
| azi0 | Azimuth (direction) of the first exterior wall | rad |
| azi1 | Azimuth (direction) of the second exterior wall | rad |
| id | Unique identifier for the room configuration | - |
| is_holiday | Indicator whether the day is a holiday (1 for yes, 0 for no) | - |

    1.2. Note on Multi-Value Columns

    For rooms with multiple exterior walls (rooms 15-30):

    • AExt: {Exterior wall 1 area, Exterior wall 2 area}
    • AWin: {Window area on exterior wall 1, Window area on exterior wall 2}
    • azi: {Azimuth of exterior wall 1, Azimuth of exterior wall 2}

    Example:

    • AExt = {10, 15}
    • AWin = {2, 0}
    • azi = {0, 3.1415}

    This indicates two exterior walls with areas of 10 m² and 15 m² facing south (0 rad) and north (3.1415 rad), respectively. The south-facing wall has a window of 2 m², while the north-facing wall has no window.

    1.3. Data Sources

    • Room Model: Simulated using the reduced-order package of the Modelica Buildings Library.
    • Weather Data: Provided by the German Meteorological Service (DWD) in Test Reference Year (TRY) format.

    This comprehensive dataset provides crucial parameters required to train and evaluate thermal models for different room configurations. The simulation data ensures a diverse range of environmental and occupancy conditions, enhancing the robustness of the models.

    1.4. Data scaling

The data set contains the raw data as well as the scaled data used for training and testing the model. The scaling was carried out using the StandardScaler package; a minimal sketch of such scaling follows.
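
    A minimal sketch, assuming scikit-learn's sklearn.preprocessing.StandardScaler is the scaler meant here, showing how such scaling is applied and inverted (toy values, not taken from the dataset):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    raw = np.array([[18.5], [21.0], [23.4], [19.9]])  # e.g. indoor temperatures in °C

    scaler = StandardScaler()
    scaled = scaler.fit_transform(raw)            # zero mean, unit variance
    recovered = scaler.inverse_transform(scaled)  # back to raw units

    print(scaled.ravel(), recovered.ravel())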

    1.5. Weather data license

    This data set contains weather data recorded by the DWD under license „Datenlizenz Deutschland – Namensnennung – Version 2.0" (URL). The data is provided by "Bundesinstitut für Bau-, Stadt- und Raumforschung". The data can be downloaded from here. We use data from the year 2015 from Heilbronn. We have added the weather data to the data set unchanged.

  7. working with pipeline

    • kaggle.com
    Updated Sep 2, 2025
    Cite
    Fiza Aslam1 (2025). working with pipeline [Dataset]. https://www.kaggle.com/datasets/fizaaslam12/working-with-pipeline
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2025
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Fiza Aslam1
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🚀 Feature Engineering with Scikit-Learn (Titanic Case Study)

    This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
    It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.

    📌 About

    Feature Engineering is a crucial step in Machine Learning.
In this project, I show:

• Handling missing values with SimpleImputer
• Encoding categorical variables with OneHotEncoder
• Building models manually vs using Pipeline
• Saving models and pipelines with pickle
• Making predictions with and without pipelines

A minimal sketch of such a pipeline follows.
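
    This sketch uses the standard Titanic schema (column names like Age, Sex, and Embarked are assumptions; the exact features used in the notebooks may differ):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("train.csv")
    X = df[["Age", "SibSp", "Parch", "Fare", "Sex", "Embarked"]]
    y = df["Survived"]

    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Age", "SibSp", "Parch", "Fare"]),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex", "Embarked"]),
    ])

    pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
    pipe.fit(X, y)  # fitting everything inside one Pipeline avoids train/test leakage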

    📂 Content

    • train.csv → Titanic dataset
    • withpipeline.ipynb → End-to-end pipeline workflow
    • withoutpipeline.ipynb → Manual preprocessing workflow
    • predictusingpipeline.ipynb → Predictions with saved pipeline (pipe.pkl)
    • predictwithoutpipeline.ipynb → Predictions with classifier + encoders
    • models/
      • pipe.pkl → Complete ML pipeline (recommended for predictions)
      • clf.pkl → Classifier without pipeline
      • ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features

    ⚡ Usage

    1️⃣ Load and Use Pipeline

    import pickle

    # Load the complete preprocessing + model pipeline
    pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))

    # One sample row; the column order must match the training data
    # (assumed here to be Age, SibSp, Parch, Fare, Sex, Embarked)
    sample = [[22, 1, 0, 7.25, 'male', 'S']]
    print(pipe.predict(sample))
2️⃣ Predict without Pipeline
    import pickle
    import numpy as np

    clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
    ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
    ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))

    # Preprocess input manually using the encoders, then predict with clf.
    # The feature layout below is assumed for illustration; match the
    # withoutpipeline notebook, and note this assumes the encoders were
    # fit to return dense arrays (sparse_output=False).
    num = np.array([[22, 1, 0, 7.25]])        # e.g. Age, SibSp, Parch, Fare
    sex_enc = ohe_sex.transform([['male']])
    emb_enc = ohe_embarked.transform([['S']])
    X = np.hstack([num, sex_enc, emb_enc])
    print(clf.predict(X))
    🎯 Inspiration

    • Learn the difference between manual feature engineering and pipeline-based workflows
    • Understand how to avoid data leakage using Pipeline
    • Explore cross-validation with pipelines
    • Practice model persistence and deployment strategies

    ✅ Best Practice: Use pipe.pkl (the complete pipeline) for predictions: it automatically handles preprocessing + modeling in one step!
    
    
    
  8. 3.Preprocessing

    • kaggle.com
    zip
    Updated Aug 24, 2023
    Cite
    omerkrbck (2023). 3.Preprocessing [Dataset]. https://www.kaggle.com/datasets/omerkrbck/3preprocessing
Available download formats: zip (104440219 bytes)
    Dataset updated
    Aug 24, 2023
    Authors
    omerkrbck
    Description

This project is about predicting whether a flight will be delayed by over 15 minutes upon arrival, using a scikit-learn Decision Tree Classifier on US flight data from 2022. The URL of the dataset and the variable descriptions: https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGK&QO_fu146_anzr=b0-gvzr

Context: The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. This dataset was collected from the Bureau of Transportation Statistics and is open-sourced under U.S. Govt. Works. I downloaded 12 CSV files, one for each month of 2022; together they cover all US domestic flights in 2022.

Description of Columns

• Quarter: Quarter (1-4)
• Month: Month
• DayofMonth: Day of Month
• DayOfWeek: Day of Week
• FlightDate: Date of the Flight
• Marketing_Airline_Network: Airline Identifier
• OriginCityName: Origin Airport, City Name
• DestCityName: Destination Airport, City Name
• DepDelay: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers
• ArrDelay: Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers
• Cancelled: Cancelled Flight (1=Yes)
• Diverted: Diverted Flight (1=Yes)
• AirTime: Flight Time, in Minutes
• Distance: Distance between airports (miles)
• CarrierDelay: Delay caused by the airline, in minutes
• WeatherDelay: Delay caused by weather
• NASDelay: Delay caused by the air system
• SecurityDelay: Delay caused by security reasons
• LateAircraftDelay: Delay caused by a delayed earlier flight on the same aircraft
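
    A minimal modeling sketch for the task described above (ArrDelay > 15 minutes as the target, scikit-learn's DecisionTreeClassifier); the CSV file name is a hypothetical stand-in for one of the 12 monthly files:

    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("On_Time_2022_01.csv")  # hypothetical name for one monthly file

    df = df.dropna(subset=["ArrDelay", "DepDelay", "AirTime", "Distance"])
    y = (df["ArrDelay"] > 15).astype(int)  # delayed over 15 minutes on arrival
    X = df[["Month", "DayOfWeek", "DepDelay", "AirTime", "Distance"]]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    tree = DecisionTreeClassifier(max_depth=6, random_state=42).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))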

  9. Cleaned ISIC Skin Cancer Dataset (6 Classes)

    • kaggle.com
    zip
    Updated Feb 10, 2025
    Cite
    Aayyyyyyuuussshhh (2025). Cleaned ISIC Skin Cancer Dataset (6 Classes) [Dataset]. https://www.kaggle.com/datasets/aayyyyyyuuussshhh/cleaned-isic-skin-cancer-dataset-6-classes
Available download formats: zip (538458444 bytes)
    Dataset updated
    Feb 10, 2025
    Authors
    Aayyyyyyuuussshhh
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

This dataset contains dermatoscopic images of skin lesions organized into six classes:

• Melanoma
• Nevus (Mole)
• Basal Cell Carcinoma
• Actinic Keratosis
• Benign Keratosis
• Vascular Lesion

The dataset has been preprocessed to remove duplicate images and ensure consistency between the training and test sets. It is structured into train and test folders, with subfolders for each class. This makes it ready for use in machine learning and deep learning projects.

Key Features:

• Total Images: 1888 (1820 train, 68 test)
• Classes: 6
• Image Size: Variable (can be resized during preprocessing)
• Preprocessing: Duplicate images removed using perceptual hashing

Use Case: This dataset is ideal for training and evaluating models for skin cancer classification. It can be used with frameworks like TensorFlow, PyTorch, or scikit-learn. The cleaned structure ensures that the dataset is free from duplicates and ready for immediate use.

Acknowledgments: The original dataset was sourced from the International Skin Imaging Collaboration (ISIC). Cleaning and preprocessing were performed to remove duplicates and prepare the dataset for modeling. Please refer to the ISIC Archive website for more information about the original dataset.

    License: This dataset is derived from the ISIC dataset and is made available under the CC BY-NC-SA license. Any use of this dataset must comply with the original licensing terms, including non-commercial use and attribution.
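
    For reference, duplicate removal via perceptual hashing (mentioned under Key Features) can be sketched as follows; the imagehash library is one common choice, and the folder layout and file extension here are assumptions, not the dataset author's exact tooling:

    from pathlib import Path

    import imagehash
    from PIL import Image

    # Flag near-duplicate images by comparing perceptual hashes
    seen = {}
    for path in sorted(Path("train").rglob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if h in seen:
            print("duplicate:", path, "matches", seen[h])
        else:
            seen[h] = path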

  10. Product_Return_Risk_Prediction

    • kaggle.com
    zip
    Updated May 6, 2025
    Cite
    Soumendu Ray (2025). Product_Return_Risk_Prediction [Dataset]. https://www.kaggle.com/datasets/soumenduray99/product-return-risk-prediction
Available download formats: zip (139556 bytes)
    Dataset updated
    May 6, 2025
    Authors
    Soumendu Ray
    Description

🚀 Project Overview

The project focuses on analyzing product return risks in e-commerce using machine learning and data visualization tools. It includes:

• Data Generation: Simulates product-related data (e.g., categories, brands, prices, ratings) for analysis.
• Model Development: Builds a predictive model using XGBoost to classify whether a product has a high return risk.
• Dashboard Creation: Implements an interactive Streamlit dashboard for data exploration, model predictions, and insights.
• Document Q&A Assistant: Integrates a Retrieval-Augmented Generation (RAG) chatbot to answer user queries based on uploaded documents.

🛠️ Setup and Execution Instructions

1. Environment Setup

Install the required Python libraries:

    • pandas
    • numpy
    • matplotlib
    • seaborn
    • scikit-learn
    • xgboost
    • streamlit
    • imblearn
    • plotly
    • fitz (PyMuPDF)
    • pinecone-client
    • sentence-transformers
    • transformers
Then configure your Pinecone API key for document indexing.

2. Execution Steps

1. Run the script to generate synthetic product data using the data_gen function.
2. Train the XGBoost model using the model_train function, including preprocessing (outlier treatment, label encoding, scaling, and SMOTE-Tomek).
3. Save the trained model using pickle for future use.
4. Launch the Streamlit dashboard to interact with the data:
   • Explore product return risk data in tabular format or via visualizations.
   • Predict return risk for individual products or bulk datasets.
   • Use the Q&A assistant to query uploaded documents.

📱 Streamlit Dashboard Features

Tabs for:

• Product_Return_Table (Data Exploration)
• Dashboard (Visual Analytics)
• Model_Prediction (Return Risk Prediction)
• Q&A (Document-Based Assistant)

Options to export datasets and predictions as CSV files.

🧠 Model and Tool Explanation

1. Machine Learning Model
• XGBoost Classifier; hyperparameters include learning rate, gamma, and regularization
• Handles imbalanced data using SMOTE-Tomek
• Outputs a binary classification for return risk

2. Data Preprocessing Tools
• StandardScaler: Normalizes numerical features
• LabelEncoder: Converts categorical variables to numeric

3. Visualization Tools
• matplotlib and seaborn: Bar charts, pie charts, histograms
• plotly.express: Interactive line plots for trend analysis

4. Document Q&A Assistant
• Uses Pinecone for vector indexing
• Uses SentenceTransformer for creating text embeddings
• Uses GPT-2 to generate answers to user queries

5. Interactive Dashboard
• Built using Streamlit
• Provides user-friendly access to data insights, model predictions, and document-based Q&A
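
A minimal sketch of the model/preprocessing combination listed above (StandardScaler, SMOTE-Tomek via imblearn, and an XGBoost classifier), using synthetic stand-in data rather than the project's generated product data:

    import numpy as np
    from imblearn.combine import SMOTETomek
    from sklearn.preprocessing import StandardScaler
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))             # stand-in for encoded product features
    y = (rng.random(500) < 0.15).astype(int)  # imbalanced return-risk labels

    X = StandardScaler().fit_transform(X)                         # normalize features
    X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)  # rebalance classes

    model = XGBClassifier(learning_rate=0.1, gamma=0.1, reg_lambda=1.0)
    model.fit(X_res, y_res)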

✅ Conclusion

• Improve return forecasting by leveraging ML modeling
• Reduce return-related losses by targeting high-risk products
• Enhance support with a smart Q&A assistant for document-based decision making
• Enable business teams to monitor return trends through visual dashboards

  11. synthetic but realistic salary prediction dataset

    • kaggle.com
    zip
    Updated Oct 29, 2025
    Cite
    Arif Miah (2025). synthetic but realistic salary prediction dataset [Dataset]. https://www.kaggle.com/datasets/miadul/synthetic-but-realistic-salary-prediction-dataset
Available download formats: zip (38665 bytes)
    Dataset updated
    Oct 29, 2025
    Authors
    Arif Miah
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📊 Synthetic Salary Prediction Dataset (with Missing Values & Outliers)

    🧠 Overview

    This dataset is a synthetic but realistic salary prediction dataset designed to simulate real-world employee compensation data. It is ideal for practicing data preprocessing, EDA, machine learning model building, and deployment (e.g., Flask or Streamlit apps).

    The dataset captures a range of demographic, educational, and professional attributes that typically influence salary outcomes, along with intentional missing values and outliers to provide a challenging and practical experience for learners and researchers.

    🧩 Key Features

| Column | Description |
|---|---|
| age | Employee's age (20–60 years) |
| gender | Gender of the employee (Male, Female, Other) |
| education | Highest educational qualification |
| experience_years | Total years of work experience |
| role_seniority | Current job level (Junior, Mid, Senior, Lead) |
| company_size | Size of the organization (Startup, SME, Enterprise) |
| location_tier | Job location category (Tier-1, Tier-2, Tier-3, Remote) |
| skills_count | Number of professional/technical skills |
| certifications | Count of relevant certifications |
| worked_remote | Whether the employee works remotely (0 = No, 1 = Yes) |
| last_promotion_years_ago | Years since last promotion |
| recent_project_description_length | Word count of recent project summary |
| recent_note | Short note describing work experience or project type |
| survey_date | Synthetic date when data was recorded |
| salary_bdt | Target variable: Monthly salary in Bangladeshi Taka (BDT) |

    🧮 Dataset Summary

    • Total Rows: 2000
    • Total Columns: 15
    • Missing Values: Yes (intentionally introduced)
    • Outliers: Yes (~1% high-salary records to mimic real-world noise)
    • Use Case: Regression (Salary Prediction), EDA, Feature Engineering, Data Cleaning Practice

    💡 Possible Use Cases

    • Predict employee salary based on experience and education
    • Handle missing values and perform imputation
    • Detect and treat outliers
    • Explore correlation between experience and salary
    • Build ML models using scikit-learn, TensorFlow, or PyTorch
    • Deploy salary prediction apps with Streamlit or Flask
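
    A minimal cleaning-and-modeling sketch over the columns listed above; the CSV file name and the numeric/categorical split are illustrative assumptions:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("salary_prediction.csv")  # file name is an assumption

    numeric = ["age", "experience_years", "skills_count", "certifications",
               "last_promotion_years_ago"]
    categorical = ["gender", "education", "role_seniority", "company_size",
                   "location_tier"]

    X, y = df[numeric + categorical], df["salary_bdt"]

    prep = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric),  # impute missing values
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
    ])

    model = Pipeline([("prep", prep), ("rf", RandomForestRegressor(random_state=42))])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_train, y_train)
    print("R²:", model.score(X_test, y_test))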

    🧰 Tech Stack for Analysis (Recommended)

    • Python, Pandas, NumPy, Matplotlib, Seaborn, Plotly
    • Scikit-learn, TensorFlow, PyTorch
    • Streamlit / Flask for app deployment

    🧑‍💻 Author

• Name: Arif Miah
• Background: Final Year B.Sc. Student (Computer Science and Engineering) at Port City International University
• Focus Areas: Machine Learning, Deep Learning, NLP, Streamlit Apps, Data Science Projects
• Contact: arifmiahcse@gmail.com
• GitHub: github.com/your-github-username

    ⚠️ Disclaimer

    This dataset is synthetic and generated for educational and research purposes only. It does not represent any real individuals or organizations.

  12. Rotten Tomatoes Movie Reviews

    • kaggle.com
    Updated Nov 20, 2022
    Cite
    The Devastator (2022). Rotten Tomatoes Movie Reviews [Dataset]. https://www.kaggle.com/datasets/thedevastator/movie-review-data-set-from-rotten-tomatoes
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Rotten Tomatoes Movie Reviews

    Predicting Movie Review Sentiment

    Source

    Huggingface Hub: link

    About this dataset

The Rotten Tomatoes Movie Review Sentiment Analysis Dataset contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. Bo Pang and Lillian Lee first used this data in their paper Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, published in Proceedings of the ACL in 2005. All of the data fields are identical in every one of the splits. The text column contains the review itself, and the label column indicates whether the review is positive or negative.

    How to use the dataset

The Performance of Sentiment Analysis

In this post we take a look at the performance of different sentiment analysis systems on a movie review dataset from Rotten Tomatoes. This data was first used in Bo Pang and Lillian Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, Proceedings of the ACL, 2005. The data fields are the same across all splits.

    We will be using three different libraries for this post: 1) Scikit-learn, 2) NLTK, and 3) TextBlob. We will also compare the results of these systems with those from human raters. Each library takes different amounts of time and resources to run, so we will also be considering these factors in our comparisons.

    NLTK

    NLTK is a popular library for working with text data in Python. It includes many useful features for pre-processing text data, including tokenization, lemmatization, and part-of-speech tagging. NLTK also includes a number of helpful classes for building and evaluating predictive models (such as decision trees and maximum entropy classifiers).

    TextBlob

    TextBlob is a relatively new library that attempts to provide an easy-to-use interface for common text processing tasks (such as part-of-speech tagging, sentence parsing, spelling correction, etc). TextBlob is built on top of NLTK and Pattern, another Python library for web mining (see below).

    Scikit-learn

    Scikit-learn is a popular machine learning library for Python that provides efficient implementations of common algorithms such as support vector machines, random forests, and k-nearest neighbors classifiers. It also includes helpful utilities for pre-processing data and assessing model performance
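
To make the comparison concrete, here is a minimal sketch contrasting TextBlob's lexicon-based polarity with a trained scikit-learn classifier on the text/label columns described in the Columns section below; it assumes the labels are encoded as 0/1:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from textblob import TextBlob

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Lexicon-based: polarity in [-1, 1]; call a review positive if > 0
    tb_preds = [int(TextBlob(t).sentiment.polarity > 0) for t in test["text"]]

    # Learned: TF-IDF features + logistic regression
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train["text"]), train["label"])
    sk_preds = clf.predict(vec.transform(test["text"]))

    print("TextBlob accuracy:", accuracy_score(test["label"], tb_preds))  # assumes 0/1 labels
    print("scikit-learn accuracy:", accuracy_score(test["label"], sk_preds))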

    Research Ideas

    • Identify positive and negative sentiment in movie reviews
    • Categorize movie reviews by rating
    • Cluster movie reviews to group together similar reviews

    Acknowledgements

    Huggingface Hub: link

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: validation.csv

| Column name | Description |
|:------------|:------------|
| text | The text of the review. (String) |
| label | The label of the review. (String) |

File: train.csv

| Column name | Description |
|:------------|:------------|
| text | The text of the review. (String) |
| label | The label of the review. (String) |

File: test.csv

| Column name | Description |
|:------------|:------------|
| text | The text of the review. (String) |
| label | The label of the review. (String) |

  13. CroppedYaleFaces

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    Omar Rehan (2025). CroppedYaleFaces [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/croppedyalefaces
Available download formats: zip (58366379 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    Omar Rehan
    Description

    Cropped Yale Face Dataset (Grayscale Images)

    The Cropped Yale Face Dataset is a widely used benchmark in computer vision and machine learning for face recognition tasks. It consists of grayscale images of human faces captured under varying lighting conditions and expressions. The dataset is well-suited for research in facial recognition, image preprocessing, and machine learning model evaluation.

    Dataset Overview

| Feature | Description |
|---|---|
| Number of subjects | 38 individuals |
| Number of images | 2,414 images |
| Image size | 192 × 168 pixels |
| Color | Grayscale (single channel) |
| Variations | Lighting conditions, facial expressions, and slight head rotations |
| Format | .pgm images (can be converted to .png or .jpg) |
| Common usage | Face recognition, PCA/LDA experiments, image classification |

    Example of Dataset Structure

    CroppedYale/
    ├── yaleB01/
    │  ├── yaleB01_P00A+000E+00.pgm
    │  ├── yaleB01_P00A+000E+05.pgm
    │  └── ...
    ├── yaleB02/
    │  └── ...
    └── ...
    
    • Each folder corresponds to a single subject.
• File naming convention: yaleB<subject_id>_P<pose>A<light azimuth>E<light elevation>.pgm, where the A and E fields give the light-source direction in degrees.

    Example Use Cases

    1. Face Recognition

    The dataset is perfect for evaluating facial recognition algorithms under controlled lighting and expression variations.

    from pathlib import Path

    import numpy as np
    from PIL import Image
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    # Load all .pgm images and subject labels from the CroppedYale/ layout shown above
    root = Path("CroppedYale")
    images, labels = [], []
    for subject_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for pgm_file in sorted(subject_dir.glob("*.pgm")):
            images.append(np.asarray(Image.open(pgm_file), dtype=np.float32))
            labels.append(subject_dir.name)

    # Flatten each 192 x 168 image into a single feature vector
    X = np.stack(images).reshape(len(images), -1)
    y = np.array(labels)

    # Reduce dimensions using PCA (eigenfaces)
    pca = PCA(n_components=100)
    X_pca = pca.fit_transform(X)

    # Train a linear SVM on the PCA features
    clf = SVC(kernel='linear')
    clf.fit(X_pca, y)

    2. Dimensionality Reduction

    Due to its moderate image size, the dataset is ideal for testing dimensionality reduction methods like PCA, LDA, or t-SNE.

    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt
    import numpy as np

    # Embed the PCA features in 2-D for visualization
    X_embedded = TSNE(n_components=2).fit_transform(X_pca)

    # Convert string subject labels to integer codes for coloring
    _, y_codes = np.unique(y, return_inverse=True)
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y_codes, s=5)
    plt.show()

    3. Lighting & Expression Robustness

    Researchers can use this dataset to study the effect of lighting conditions and facial expressions on recognition accuracy.

• yaleB01_P00A+000E+00.pgm → frontal lighting
• yaleB01_P00A+000E+05.pgm → light source raised 5° in elevation
• yaleB01_P00A+010E+00.pgm → light source shifted 10° in azimuth

    Key Advantages

    • Controlled environment: Minimal background noise, making it easier to focus on the face features.
    • Diverse lighting conditions: Excellent for testing illumination-invariant algorithms.
    • Compact size: Easy to load and experiment with on most machines without high computational cost.
    • Grayscale: Simplifies preprocessing while still retaining critical facial features.
