License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset presents detailed energy consumption records from various households over a month. With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak-hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.
| Column Name | Data Type Category | Description |
|---|---|---|
| Household_ID | Categorical (Nominal) | Unique identifier for each household |
| Date | Datetime | The date of the energy usage record |
| Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
| Household_Size | Numerical (Discrete) | Number of individuals living in the household |
| Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
| Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
| Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
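As a minimal sketch of how the columns above might be prepared with pandas: the rows below are made up for illustration (only the column names come from the table), and the derived `Peak_Share` column is a hypothetical feature, not part of the dataset.

```python
import pandas as pd

# Illustrative rows only: the values are invented, the column names follow the table above.
records = pd.DataFrame(
    {
        "Household_ID": ["H001", "H001", "H002", "H002"],
        "Date": ["2025-04-01", "2025-04-02", "2025-04-01", "2025-04-02"],
        "Energy_Consumption_kWh": [10.0, 12.0, 20.0, 16.0],
        "Household_Size": [2, 2, 4, 4],
        "Avg_Temperature_C": [18.5, 21.0, 18.5, 21.0],
        "Has_AC": ["No", "No", "Yes", "Yes"],
        "Peak_Hours_Usage_kWh": [4.0, 6.0, 5.0, 8.0],
    }
)

# Parse the date strings and turn the Yes/No flag into booleans.
records["Date"] = pd.to_datetime(records["Date"])
records["Has_AC"] = records["Has_AC"].map({"Yes": True, "No": False})

# Hypothetical derived feature: fraction of daily energy used during peak hours.
records["Peak_Share"] = records["Peak_Hours_Usage_kWh"] / records["Energy_Consumption_kWh"]

# Two simple aggregations: total consumption per day, mean peak share per household.
daily_total = records.groupby("Date")["Energy_Consumption_kWh"].sum()
per_household = records.groupby("Household_ID")["Peak_Share"].mean()
print(daily_total)
print(per_household)
```

With the real file, the `pd.DataFrame(...)` literal would presumably be replaced by a `pd.read_csv` call on the downloaded CSV.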
| Library | Purpose |
|---|---|
| pandas | Reading, cleaning, and transforming tabular data |
| numpy | Numerical operations, working with arrays |
| matplotlib | Creating static plots (line, bar, histograms, etc.) |
| seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
| plotly | Interactive charts (time series, pie, bar, scatter, etc.) |
| scikit-learn | Preprocessing, regression, classification, clustering |
| xgboost / lightgbm | Gradient boosting models for better accuracy |
| sklearn.preprocessing | Encoding categorical features, scaling, normalization |
| datetime / pandas | Date-time conversion and manipulation |
| sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |
✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
This dataset is ideal for a wide variety of analytics and machine learning projects:

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Climate forecasts, both experimental and operational, are often made by calibrating Global Climate Model (GCM) outputs with observed climate variables using statistical and machine learning models. Often, machine learning techniques are applied to gridded data independently at each gridpoint. However, the implementation of these gridpoint-wise operations is a significant barrier to entry to climate data science. Unfortunately, there is a significant disconnect between the Python data science ecosystem and the gridded earth data ecosystem. Traditional Python data science tools are not designed to be used with gridded datasets, like those commonly used in climate forecasting. Heavy data preprocessing is needed: gridded data must be aggregated, reshaped, or reduced in dimensionality in order to fit the strict formatting requirements of Python's data science tools. Efficiently implementing this gridpoint-wise workflow is a time-consuming logistical burden which presents a high barrier to entry to earth data science. A set of high-performance, easy-to-use Python climate forecasting tools is needed to bridge the gap between Python's data science ecosystem and its gridded earth data ecosystem. XCast, an Xarray-based climate forecasting Python library developed by the authors, bridges this gap. XCast wraps underlying two-dimensional data science methods, like those of Scikit-Learn, with data structures that allow them to be applied to each gridpoint independently. XCast uses high-performance computing libraries to efficiently parallelize the gridpoint-wise application of data science utilities and make Python's traditional data science toolkits compatible with multidimensional gridded data. 
XCast also implements a diverse set of climate forecasting tools including traditional statistical methods, state-of-the-art machine learning approaches, preprocessing functionality (regridding, rescaling, smoothing), and postprocessing modules (cross validation, forecast verification, visualization). These tools are useful for producing and analyzing both experimental and operational climate forecasts. In this study, we describe the development of XCast, and present in-depth technical details on how XCast brings highly parallelized gridpoint-wise versions of traditional Python data science tools into Python's gridded earth data ecosystem. We also demonstrate a case study where XCast was used to generate experimental real-time deterministic and probabilistic forecasts for South Asian Summer Monsoon Rainfall in 2022 using different machine learning-based multi-model ensembles.
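To make the idea of a gridpoint-wise operation concrete, here is a minimal NumPy sketch. It does not use XCast or its API; it simply fits an independent least-squares regression at every gridpoint of a synthetic (time, lat, lon) array, which is the kind of loop that XCast is described as wrapping and parallelizing. All array names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_lat, n_lon = 30, 4, 5

# Synthetic stand-ins: a GCM predictor field and an "observed" field on the same grid.
gcm = rng.normal(size=(n_time, n_lat, n_lon))
true_slope = rng.uniform(0.5, 2.0, size=(n_lat, n_lon))
obs = true_slope * gcm + 1.0  # noise-free, so the fit can recover the slope


def fit_gridpoint(x, y):
    """Ordinary least squares y ~ a*x + b for one gridpoint's time series."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [slope, intercept]


# Reshape (time, lat, lon) -> (time, points), fit each column independently, reshape back.
x2d = gcm.reshape(n_time, -1)
y2d = obs.reshape(n_time, -1)
coefs = np.array([fit_gridpoint(x2d[:, i], y2d[:, i]) for i in range(x2d.shape[1])])
slope = coefs[:, 0].reshape(n_lat, n_lon)
intercept = coefs[:, 1].reshape(n_lat, n_lon)
print(slope.shape, intercept.shape)
```

The reshape-loop-reshape bookkeeping is the preprocessing burden the abstract refers to; the serial Python loop is also the part that benefits from parallelization on real grids.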

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it describes the methods used to obtain the results. Here, the entire implementation is done in Python. The steps involved are as follows:
1. Acquire Personality Dataset
Kaggle hosts a collection of datasets and data generators that the machine learning community uses for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected from 2016 to 2018 through an interactive online personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file from the link provided. It consists of two CSV files (test.csv and train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output. The dataset also has multivariate characteristics. Data preprocessing is performed to check for inconsistent behaviors or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The features in the dataset are numerical. The target is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is then split into training and testing sets by passing the feature values, target values, and test size to the train_test_split function of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.
3. Feature Extraction
The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method for measuring the accuracy of a model. It is called Train/Test because the dataset is split into two sets: a training set (80%) and a testing set (20%). The model is trained using the training set. Here, the models were trained using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated on the test data using cohen_kappa_score and accuracy_score.
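The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data is synthetic (500 rows, 7 numeric attributes, 5 classes, chosen to resemble the description), and `max_iter`, `random_state`, and `stratify` are assumptions added to make the run stable and repeatable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic stand-in for the real data: 7 numeric attributes and a 5-level label.
n_rows, n_features, n_classes = 500, 7, 5
y = rng.integers(0, n_classes, size=n_rows)
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
X = centers[y] + rng.normal(size=(n_rows, n_features))

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {"LogisticRegression": LogisticRegression(max_iter=1000), "SVC": SVC()}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "kappa": cohen_kappa_score(y_test, pred),
    }
    print(name, scores[name])
```

Cohen's kappa corrects the raw agreement for the agreement expected by chance, which is why it is a useful companion to accuracy when the five classes are not equally frequent.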

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you can use an environment such as VS Code or Jupyter, together with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
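A minimal sketch of how the three subsets are typically used together is given below. The arrays are synthetic stand-ins for train_data.csv, validation_data.csv, and test_data.csv; the logistic regression model and the candidate values of `C` are assumptions for illustration, since the description does not name a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)


def make_split(n_rows):
    """Synthetic stand-in for one of the CSV files (features X, binary label y)."""
    X = rng.normal(size=(n_rows, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y


X_train, y_train = make_split(300)
X_val, y_val = make_split(100)
X_test, y_test = make_split(100)

# Hyperparameter tuning and model selection use the validation set only.
candidates = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidates:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = accuracy_score(y_val, model.predict(X_val))
best_C = max(val_scores, key=val_scores.get)

# The test set is used once, after the hyperparameter has been fixed.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(best_C, test_accuracy)
```

Keeping the test set out of the selection loop is what makes the final score an honest estimate of performance on unseen data.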
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created as part of a sentiment analysis project using enriched Twitter data. The objective was to train and test a machine learning model to automatically classify the sentiment of tweets (e.g., Positive, Negative, Neutral).
The data was generated using tweets that were sentiment-scored with a custom sentiment scorer. A machine learning pipeline was applied, including text preprocessing, feature extraction with CountVectorizer, and prediction with a HistGradientBoostingClassifier.
The dataset includes five main files:
test_predictions_full.csv – Predicted sentiment labels for the test set.
sentiment_model.joblib – Trained machine learning model.
count_vectorizer.joblib – Text feature extraction model (CountVectorizer).
model_performance.txt – Evaluation metrics and performance report of the trained model.
confusion_matrix.png – Visualization of the model’s confusion matrix.
The files follow standard naming conventions based on their purpose.
The .joblib files can be loaded into Python using the joblib and scikit-learn libraries.
The .csv, .txt, and .png files can be opened with any standard text reader, spreadsheet software, or image viewer.
Additional performance documentation is included within the model_performance.txt file.
The data was constructed to ensure reproducibility.
No personal or sensitive information is present.
It can be reused by researchers, data scientists, and students interested in Natural Language Processing (NLP), machine learning classification, and sentiment analysis tasks.

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This text provides a description of the dataset used for model training and evaluation in our study "A Tutorial on Deep Learning for Probabilistic Indoor Temperature Forecasting". The dataset consists of various simulated thermal and environmental parameters for different room configurations. Below, you will find a table detailing each column in the dataset along with its description and unit of measurement.
| Column Name | Description | Unit |
|---|---|---|
| time | Time stamp of the measurement | - |
| ZweiPersonenBuero.TAir | Air temperature inside a two-person office | °C |
| heatStat.Heat.Q_flow | Heating rate in the room | W |
| weaDat.AirPressure | Atmospheric pressure | Pa |
| weaDat.AirTemp | Outside air temperature | °C |
| weaDat.SkyRadiation | Longwave sky radiation | W/m² |
| weaDat.TerrestrialRadiation | Terrestrial radiation | W/m² |
| weaDat.WaterInAir | Absolute humidity | g/kg |
| VAir | Air volume in the room | m³ |
| AExt0 | Exterior wall area facing the south | m² |
| AExt1 | Exterior wall area facing the north | m² |
| AInt | Total interior wall area | m² |
| AFloor | Floor area of the room | m² |
| AWin0 | Window area facing the south | m² |
| AWin1 | Window area facing the north | m² |
| azi0 | Azimuth (direction) of the first exterior wall | rad |
| azi1 | Azimuth (direction) of the second exterior wall | rad |
| id | Unique identifier for the room configuration | - |
| is_holiday | Indicator whether the day is a holiday (1 for yes, 0 for no) | - |
For rooms with multiple exterior walls (rooms 15-30):
Example:
This indicates two exterior walls with areas of 10 m² and 15 m² facing south (0 rad) and north (3.1415 rad), respectively. The south-facing wall has a window of 2 m², while the north-facing wall has no window.
This comprehensive dataset provides crucial parameters required to train and evaluate thermal models for different room configurations. The simulation data ensures a diverse range of environmental and occupancy conditions, enhancing the robustness of the models.
The data set contains the raw data as well as the scaled data used for training and testing the model. The scaling was carried out using the StandardScaler class from scikit-learn.
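As a small illustration of what this scaling does, the sketch below standardizes two of the columns with StandardScaler and checks the result against the manual formula z = (x - mean) / std. The numbers are made up, and fitting the scaler on the training rows only is an assumption about the workflow, not something stated in the description.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up raw values for two columns from the table above (not taken from the dataset):
# column 0 is ZweiPersonenBuero.TAir in °C, column 1 is heatStat.Heat.Q_flow in W.
raw = np.array(
    [
        [20.0, 0.0],
        [21.0, 500.0],
        [22.0, 1000.0],
        [23.0, 1500.0],
    ]
)
train, test = raw[:3], raw[3:]

# Statistics come from the training rows only and are reused for the test rows.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Equivalent manual computation, using the population standard deviation (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)

# Predictions made in scaled units can be mapped back to physical units.
restored = scaler.inverse_transform(test_scaled)
print(train_scaled)
print(restored)
```

`inverse_transform` is the step that would convert a forecast in scaled units back to °C or W.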
This data set contains weather data recorded by the DWD under license „Datenlizenz Deutschland – Namensnennung – Version 2.0" (URL). The data is provided by "Bundesinstitut für Bau-, Stadt- und Raumforschung". The data can be downloaded from here. We use data from the year 2015 from Heilbronn. We have added the weather data to the data set unchanged.

License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.
Feature Engineering is a crucial step in Machine Learning.
In this project, I show:
- Handling missing values with SimpleImputer
- Encoding categorical variables with OneHotEncoder
- Building models manually vs using Pipeline
- Saving models and pipelines with pickle
- Making predictions with and without pipelines
- pipe.pkl → Complete ML pipeline (recommended for predictions)
- clf.pkl → Classifier without pipeline
- ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features
Predict with pipeline
import pickle
# Load the complete pipeline (preprocessing + model)
with open("/kaggle/input/featureengineering/models/pipe.pkl", "rb") as f:
    pipe = pickle.load(f)
# One passenger; the value order must match the one used when the pipeline was fitted
sample = [[22, 1, 0, 7.25, 'male', 'S']]
print(pipe.predict(sample))
Predict without pipeline
import pickle
# Load the classifier and the two encoders separately
with open("/kaggle/input/featureengineering/models/clf.pkl", "rb") as f:
    clf = pickle.load(f)
with open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb") as f:
    ohe_sex = pickle.load(f)
with open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb") as f:
    ohe_embarked = pickle.load(f)
# Preprocess input manually using the encoders, then predict with clf
🎯 Inspiration
- Learn the difference between manual feature engineering and pipeline-based workflows
- Understand how to avoid data leakage using Pipeline
- Explore cross-validation with pipelines
- Practice model persistence and deployment strategies
✅ Best Practice: Use pipe.pkl (pipeline) for predictions — it automatically handles preprocessing + modeling in one step!
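For readers who want to see how such a pipeline could be assembled, here is a minimal sketch. It is not the notebook's actual code: the eight rows are invented, the column selection is reduced to four features, and the choice of LogisticRegression as the final estimator is an assumption. It combines SimpleImputer and OneHotEncoder inside a Pipeline and round-trips the fitted object through pickle.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up sample in the spirit of the Titanic data (not real passenger records).
X = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
        "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 8.05],
        "Sex": ["male", "female", "female", "female", "male", "male", "female", "male"],
        "Embarked": ["S", "C", "S", "S", "S", np.nan, "S", "Q"],
    }
)
y = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Numeric columns: fill gaps with the mean. Categorical columns: fill, then one-hot encode.
numeric = Pipeline([("impute", SimpleImputer(strategy="mean"))])
categorical = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    [("num", numeric, ["Age", "Fare"]), ("cat", categorical, ["Sex", "Embarked"])]
)
pipe = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# Round-trip through pickle: the restored object carries imputers, encoders, and model together.
restored = pickle.loads(pickle.dumps(pipe))
sample = pd.DataFrame({"Age": [np.nan], "Fare": [7.25], "Sex": ["male"], "Embarked": ["S"]})
print(restored.predict(sample))
```

Because the imputers and encoders are fitted inside the pipeline, cross-validation refits them on each training fold, which is how a pipeline helps avoid leaking statistics from held-out data.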

This project predicts whether a flight will arrive more than 15 minutes late, using the Scikit-learn Decision Tree Classifier and US flight data from 2022. The dataset and variable descriptions are available at: https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGK&QO_fu146_anzr=b0-gvzr
Context
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. This dataset was collected from the Bureau of Transportation Statistics, Govt. of the USA, and is open-sourced under U.S. Govt. Works. I downloaded 12 CSV files, one for each month of 2022. The dataset contains all US domestic flights in 2022.
Description of Columns
- Quarter: Quarter (1-4)
- Month: Month
- DayofMonth: Day of Month
- DayOfWeek: Day of Week
- FlightDate: Date of the Flight
- Marketing_Airline_Network: Airline Identifier
- OriginCityName: Origin Airport, City Name
- DestCityName: Destination Airport, City Name
- DepDelay: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers
- ArrDelay: Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers
- Cancelled: Cancelled Flight (1=Yes)
- Diverted: Diverted Flight (1=Yes)
- AirTime: Flight Time, in Minutes
- Distance: Distance between airports (miles)
- CarrierDelay: Delay caused by the airline in minutes
- WeatherDelay: Delay caused by weather
- NASDelay: Delay caused by air system
- SecurityDelay: Delay caused by security reasons
- LateAircraftDelay: Delay caused as a result of another flight on the same aircraft delayed
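A minimal sketch of how the prediction target could be derived from these columns and passed to a decision tree is shown below. The rows are invented, the feature selection (including `DepDelay`) and `max_depth=3` are assumptions, and the label name `Delayed15` is hypothetical; the project's actual notebook may differ.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows using column names from the list above (not real BTS records).
flights = pd.DataFrame(
    {
        "Month": [1, 1, 2, 2, 3, 3, 4, 4],
        "DayOfWeek": [1, 5, 2, 6, 3, 7, 4, 1],
        "Distance": [300.0, 1200.0, 450.0, 2400.0, 300.0, 800.0, 1500.0, 600.0],
        "DepDelay": [-3.0, 40.0, 0.0, 75.0, 5.0, 22.0, -6.0, 18.0],
        "ArrDelay": [-10.0, 35.0, 2.0, 90.0, 15.0, 16.0, -12.0, None],
        "Cancelled": [0, 0, 0, 0, 0, 0, 0, 1],
    }
)

# Cancelled flights have no arrival delay, so they are dropped before labelling.
flown = flights[(flights["Cancelled"] == 0) & flights["ArrDelay"].notna()].copy()

# "Over 15 minutes" is a strict inequality: a delay of exactly 15 minutes is not labelled late.
flown["Delayed15"] = (flown["ArrDelay"] > 15).astype(int)

features = ["Month", "DayOfWeek", "Distance", "DepDelay"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(flown[features], flown["Delayed15"])
print(flown["Delayed15"].tolist())
```

Whether `DepDelay` should be a feature depends on when the prediction is made: it is only known once the flight has left the gate, so a model meant to predict before departure would have to exclude it.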

License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Description
Title: Cleaned Skin Cancer Dataset (6 Classes)
Description:
This dataset contains dermatoscopic images of skin lesions organized into six classes:
- Melanoma
- Nevus (Mole)
- Basal Cell Carcinoma
- Actinic Keratosis
- Benign Keratosis
- Vascular Lesion
The dataset has been preprocessed to remove duplicate images and ensure consistency between the training and test sets. It is structured into train and test folders, with subfolders for each class. This makes it ready for use in machine learning and deep learning projects.
Key Features:
- Total Images: 1888 (1820 train, 68 test)
- Classes: 6
- Image Size: Variable (can be resized during preprocessing)
- Preprocessing: Duplicate images removed using perceptual hashing.
Use Case: This dataset is ideal for training and evaluating models for skin cancer classification. It can be used with frameworks like TensorFlow, PyTorch, or scikit-learn. The cleaned structure ensures that the dataset is free from duplicates and ready for immediate use.
Acknowledgments: The original dataset was sourced from the International Skin Imaging Collaboration (ISIC). Cleaning and preprocessing were performed to remove duplicates and prepare the dataset for modeling. Please refer to the ISIC website for more information about the original dataset: ISIC Archive.
License: This dataset is derived from the ISIC dataset and is made available under the CC BY-NC-SA license. Any use of this dataset must comply with the original licensing terms, including non-commercial use and attribution.
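The description does not say which perceptual hash was used for duplicate removal. As one simple possibility, the sketch below implements an average hash in NumPy: each image is reduced to an 8 × 8 grid of block means, thresholded at the overall mean, and two images are treated as duplicates when their hashes differ in only a few bits. The synthetic images, the file names, and the threshold of 5 bits are all assumptions for illustration.

```python
import numpy as np


def average_hash(image, hash_size=8):
    """Average hash: block-mean downsample to hash_size x hash_size, then threshold at the mean."""
    h, w = image.shape
    bh, bw = h // hash_size, w // hash_size
    cropped = image[: bh * hash_size, : bw * hash_size].astype(float)
    small = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).flatten()


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))


# Synthetic grayscale images: a gradient, a brightened copy of it, and its transpose.
rows, cols = np.arange(64), np.arange(64)
original = np.add.outer(rows, 2 * cols).astype(float)
near_copy = original + 3.0   # same picture, slightly brighter
different = original.T       # gradient running in a different direction

images = {"a.jpg": original, "a_copy.jpg": near_copy, "b.jpg": different}
kept = {}  # file name -> hash of every image kept so far
for name, img in images.items():
    img_hash = average_hash(img)
    # Keep the image only if it is not within 5 bits of an already kept one.
    if all(hamming(img_hash, other) > 5 for other in kept.values()):
        kept[name] = img_hash
print(sorted(kept))
```

Unlike a byte-level checksum, this kind of hash tolerates small brightness or compression changes, which is why it is a common choice for finding near-duplicate images across train and test folders.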

🚀 Project Overview
The project focuses on analyzing product return risks in e-commerce using machine learning and data visualization tools. It includes:
- Data Generation: Simulates product-related data (e.g., categories, brands, prices, ratings) for analysis.
- Model Development: Builds a predictive model using XGBoost to classify whether a product has a high return risk.
- Dashboard Creation: Implements an interactive Streamlit dashboard for data exploration, model predictions, and insights.
- Document Q&A Assistant: Integrates a Retrieval-Augmented Generation (RAG) chatbot to answer user queries based on uploaded documents.
🛠️ Setup and Execution Instructions
1. Environment Setup
Install required Python libraries:
📱 Streamlit Dashboard Features
Tabs for:
- Product_Return_Table (Data Exploration)
- Dashboard (Visual Analytics)
- Model_Prediction (Return Risk Prediction)
- Q&A (Document-Based Assistant)
Options to export datasets and predictions as CSV files
🧠 Model and Tool Explanation
1. Machine Learning Model
- XGBoost Classifier
- Hyperparameters include learning rate, gamma, and regularization
- Handles imbalanced data using SMOTE-Tomek
- Outputs binary classification for return risk
2. Data Preprocessing Tools
- StandardScaler: Normalizes numerical features
- LabelEncoder: Converts categorical variables to numeric
3. Visualization Tools
- matplotlib and seaborn: Bar charts, pie charts, histograms
- plotly.express: Interactive line plots for trend analysis
4. Document Q&A Assistant
- Uses Pinecone for vector indexing
- Uses SentenceTransformer for creating text embeddings
- Uses GPT-2 to generate answers to user queries
5. Interactive Dashboard
- Built using Streamlit
- Provides user-friendly access to data insights, model predictions, and document-based Q&A
✅ Conclusion
- Improve return forecasting by leveraging ML modeling
- Reduce return-related losses by targeting high-risk products
- Enhance support with smart Q&A assistant for document-based decision making
- Enable business teams to monitor return trends through visual dashboards

License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic but realistic salary prediction dataset designed to simulate real-world employee compensation data. It is ideal for practicing data preprocessing, EDA, machine learning model building, and deployment (e.g., Flask or Streamlit apps).
The dataset captures a range of demographic, educational, and professional attributes that typically influence salary outcomes, along with intentional missing values and outliers to provide a challenging and practical experience for learners and researchers.
| Column | Description |
|---|---|
| age | Employee’s age (20–60 years) |
| gender | Gender of the employee (Male, Female, Other) |
| education | Highest educational qualification |
| experience_years | Total years of work experience |
| role_seniority | Current job level (Junior, Mid, Senior, Lead) |
| company_size | Size of the organization (Startup, SME, Enterprise) |
| location_tier | Job location category (Tier-1, Tier-2, Tier-3, Remote) |
| skills_count | Number of professional/technical skills |
| certifications | Count of relevant certifications |
| worked_remote | Whether the employee works remotely (0 = No, 1 = Yes) |
| last_promotion_years_ago | Years since last promotion |
| recent_project_description_length | Word count of recent project summary |
| recent_note | Short note describing work experience or project type |
| survey_date | Synthetic date when data was recorded |
| salary_bdt | Target variable: Monthly salary in Bangladeshi Taka (BDT) |
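Since the dataset deliberately contains missing values and outliers, a first preprocessing pass might look like the sketch below. The rows are made up (only the column names come from the table), and both median/mode imputation and clipping at the 1.5 × IQR fences are common conventions chosen here as assumptions, not steps prescribed by the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the table above, values are illustrative only.
df = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 45, 29, 38],
        "experience_years": [2, 8, 5, 20, np.nan, 12],
        "role_seniority": ["Junior", "Mid", "Mid", "Lead", None, "Senior"],
        "salary_bdt": [30000, 60000, 52000, 150000, 41000, 2000000],  # last value is an outlier
    }
)

# Impute numeric gaps with the median and categorical gaps with the most frequent level.
for col in ["age", "experience_years"]:
    df[col] = df[col].fillna(df[col].median())
df["role_seniority"] = df["role_seniority"].fillna(df["role_seniority"].mode()[0])

# Clip the salary to the 1.5 * IQR fences (one common convention, not the only option).
q1, q3 = df["salary_bdt"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary_bdt_clipped"] = df["salary_bdt"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df)
```

In a modeling workflow, the medians, modes, and fences would normally be computed on the training split only and then applied to the test split.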
scikit-learn, TensorFlow, or PyTorch
Name: Arif Miah
Background: Final Year B.Sc. Student (Computer Science and Engineering) at Port City International University
Focus Areas: Machine Learning, Deep Learning, NLP, Streamlit Apps, Data Science Projects
Contact: arifmiahcse@gmail.com
GitHub: github.com/your-github-username
This dataset is synthetic and generated for educational and research purposes only. It does not represent any real individuals or organizations.

License: https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
The Rotten Tomatoes Movie Review Sentiment Analysis Dataset contains a set of 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. Bo Pang and Lillian Lee first used this data in their paper "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", published in Proceedings of the ACL in 2005. The data fields are identical across all splits. The text column contains the review itself, and the label column indicates whether the review is positive or negative.
The Performance of Sentiment Analysis
In this post we take a look at the performance of different sentiment analysis systems on this movie review dataset from Rotten Tomatoes.
We will be using three different libraries for this post: 1) Scikit-learn, 2) NLTK, and 3) TextBlob. We will also compare the results of these systems with those from human raters. Each library takes different amounts of time and resources to run, so we will also be considering these factors in our comparisons.
NLTK
NLTK is a popular library for working with text data in Python. It includes many useful features for pre-processing text data, including tokenization, lemmatization, and part-of-speech tagging. NLTK also includes a number of helpful classes for building and evaluating predictive models (such as decision trees and maximum entropy classifiers).
TextBlob
TextBlob is a relatively new library that attempts to provide an easy-to-use interface for common text processing tasks (such as part-of-speech tagging, sentence parsing, spelling correction, etc). TextBlob is built on top of NLTK and Pattern, another Python library for web mining (see below).
Scikit-learn
Scikit-learn is a popular machine learning library for Python that provides efficient implementations of common algorithms such as support vector machines, random forests, and k-nearest neighbors classifiers. It also includes helpful utilities for pre-processing data and assessing model performance.
- Identify positive and negative sentiment in movie reviews
- Categorize movie reviews by rating
- Cluster movie reviews to group together similar reviews
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|---|---|
| text | The text of the review. (String) |
| label | The label of the review. (String) |
File: train.csv
| Column name | Description |
|---|---|
| text | The text of the review. (String) |
| label | The label of the review. (String) |
File: test.csv
| Column name | Description |
|---|---|
| text | The text of the review. (String) |
| label | The label of the review. (String) |

The Cropped Yale Face Dataset is a widely used benchmark in computer vision and machine learning for face recognition tasks. It consists of grayscale images of human faces captured under varying lighting conditions and expressions. The dataset is well-suited for research in facial recognition, image preprocessing, and machine learning model evaluation.
| Feature | Description |
|---|---|
| Number of subjects | 38 individuals |
| Number of images | 2,414 images |
| Image size | 192 × 168 pixels |
| Color | Grayscale (single channel) |
| Variations | Lighting conditions, facial expressions, and slight head rotations |
| Format | .pgm images (can be converted to .png or .jpg) |
| Common usage | Face recognition, PCA/LDA experiments, image classification |
CroppedYale/
├── yaleB01/
│ ├── yaleB01_P00A+000E+00.pgm
│ ├── yaleB01_P00A+000E+05.pgm
│ └── ...
├── yaleB02/
│ └── ...
└── ...
File names follow the pattern yaleB<subject_id>_P<pose>A<ambient>E<expression>.pgm.
The dataset is well suited for evaluating facial recognition algorithms under controlled lighting and expression variations.
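from sklearn.decomposition import PCA
from sklearn.svm import SVC
import numpy as np
# `images` is assumed to hold the loaded faces with shape (n_samples, 192, 168)
# and `labels` the matching subject IDs; both must be loaded beforehand
# Flatten each image into a single feature vector
X = np.asarray(images).reshape(len(images), -1)
y = labels
# Reduce dimensions using PCA
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X)
# Train a linear SVM on the PCA features
clf = SVC(kernel='linear')
clf.fit(X_pca, y)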
Due to its moderate image size, the dataset is ideal for testing dimensionality reduction methods like PCA, LDA, or t-SNE.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
X_embedded = TSNE(n_components=2).fit_transform(X_pca)
plt.scatter(X_embedded[:,0], X_embedded[:,1], c=y)
plt.show()
Researchers can use this dataset to study the effect of lighting conditions and facial expressions on recognition accuracy.
- yaleB01_P00A+000E+00.pgm → Normal expression
- yaleB01_P00A+000E+05.pgm → Smiling expression
- yaleB01_P00A+010E+00.pgm → Slightly rotated face
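File names of this form can be split into their numeric fields with a regular expression, as in the sketch below. One caveat: the description above reads the A and E fields as ambient lighting and expression, whereas documentation for the original Extended Yale B collection generally describes them as the azimuth and elevation of the light source in degrees. The sketch therefore uses the neutral keys `a` and `e` and leaves the interpretation as an assumption to verify against the dataset's own documentation. The helper name `parse_name` is hypothetical.

```python
import re

# Pattern for names such as yaleB01_P00A+000E+00.pgm.
# The meaning attached to the A and E fields is an assumption; only the layout is parsed here.
FILENAME_RE = re.compile(
    r"^yaleB(?P<subject>\d+)_P(?P<pose>\d+)A(?P<a>[+-]\d+)E(?P<e>[+-]\d+)\.pgm$"
)


def parse_name(filename):
    """Return the integer fields encoded in a CroppedYale file name, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


for name in [
    "yaleB01_P00A+000E+00.pgm",
    "yaleB01_P00A+000E+05.pgm",
    "yaleB01_P00A+010E+00.pgm",
]:
    print(name, parse_name(name))
```

The parsed `subject` field can serve directly as the class label when building the `labels` array for a recognition experiment.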