Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 2,245 cleaned ageing test traces (time vs. MPPT PCE, i.e. power conversion efficiency under maximum power point tracking) for perovskite solar cells with various device stacks and architectures, in pickle (.pkl) format.
The dataset can be loaded with the following commands in Python.
import pickle5 as pickle
import pandas as pd
import numpy as np
with open('20230303_mySeriesDrop.pkl', "rb") as fh:
    mySeriesDrop = pickle.load(fh)
The following command can be used to call a specific row (row 0) within the dataset.
mySeriesDrop[0]
The next steps in using the dataset are scaling/normalisation (for instance with sklearn.preprocessing.MaxAbsScaler) and smoothing (for instance with a Savitzky-Golay filter), as sketched below.
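A minimal sketch of these two steps, assuming each entry of mySeriesDrop is a pandas Series of MPPT PCE values over time; the window length and polynomial order are illustrative choices, not values from the original analysis:
from sklearn.preprocessing import MaxAbsScaler
from scipy.signal import savgol_filter
trace = mySeriesDrop[0].to_numpy().reshape(-1, 1)      # one ageing trace as a column vector
scaled = MaxAbsScaler().fit_transform(trace).ravel()   # scale by the maximum absolute value
smoothed = savgol_filter(scaled, window_length=51, polyorder=3)  # Savitzky-Golay smoothing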
The code to run the complete analysis, including self-organising map clustering, can be accessed here: https://doi.org/10.5281/zenodo.8181602.
import pandas as pd
import numpy as np
PERFORMING EDA
data.head()
data.info()
attributes_data = data.iloc[:, 1:]
attributes_data
attributes_data.describe()
attributes_data.corr()
import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = attributes_data.corr()
plt.figure(figsize=(18, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
CHECKING IF DATASET IS LINEAR OR NON-LINEAR
correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')
plt.figure(figsize=(10, 6))
correlations.plot(kind='bar')
plt.xlabel('Predictor Columns')
plt.ylabel('Correlation values')
plt.title('Correlation between Diabetes_binary and Predictors')
plt.show()
CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM
print(data.isnull().sum())
print(data.isna().sum())
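The checks above only report missing values; a minimal, illustrative way to also clean them (the column name in the commented line is hypothetical):
data = data.dropna()  # drop any rows with missing values
# data['BMI'] = data['BMI'].fillna(data['BMI'].median())  # or impute a specific column instead (hypothetical column name)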
LASSO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold
X = data.iloc[:, 1:]
y = data.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
parameters = {"alpha": np.linspace(0.00001, 10, 500)}  # 500 candidate alpha values between 1e-5 and 10
kfold = KFold(n_splits = 10, shuffle=True, random_state = 42)
lassoReg = Lasso()
lasso_cv = GridSearchCV(lassoReg, param_grid = parameters, cv = kfold)
lasso_cv.fit(X, y)
print("Best Params {}".format(lasso_cv.best_params_))
column_names = list(data)
column_names = column_names[1:]
column_names
lassoModel = Lasso(alpha=0.00001)
lassoModel.fit(X_train, y_train)
lasso_coeff = np.abs(lassoModel.coef_)  # take absolute values so all coefficients are positive
plt.bar(column_names, lasso_coeff, color='orange')
plt.xticks(rotation=90)
plt.grid()
plt.title("Feature Selection Based on Lasso")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.ylim(0, 0.16)
plt.show()
RFE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
rfecv = rfecv.fit(X_train, y_train)
cv_scores = rfecv.cv_results_["mean_test_score"]  # mean cross-validated accuracy for each number of features (sklearn >= 1.0)
num_features_selected = len(cv_scores)
plt.figure(figsize=(10, 6))
plt.xlabel("Number of features selected")
plt.ylabel("Score (accuracy)")
plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
plt.xticks(range(1, num_features_selected + 1))  # set x-ticks to integers
plt.grid()
plt.title("RFECV: Number of Features vs. Score (accuracy)")
plt.show()
print("The optimal number of features:", rfecv.n_features_) print("Best features:", X_train.columns[rfecv.support_])
PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = data.drop(["Diabetes_binary"], axis=1) y = data["Diabetes_binary"]
df1 = pd.DataFrame(data=data, columns=data.columns)
print(df1)
scaling = StandardScaler()
scaling.fit(df1)
Scaled_data = scaling.transform(df1)
principal = PCA(n_components=3)
principal.fit(Scaled_data)
x = principal.transform(Scaled_data)
print(x.shape)
principal.components_
plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
print(principal.explained_variance_ratio_)
T-SNE
from sklearn.manifold import TSNE
from numpy import reshape
import seaborn as sns
tsne = TSNE(n_components=3, verbose=1, random_state=42)
z = tsne.fit_transform(X)
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:, 0]
df["comp-2"] = z[:, 1]
df["comp-3"] = z[:, 2]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("husl", 2),
                data=df).set(title="Diabetes data T-SNE projection")
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This text provides a description of the dataset used for model training and evaluation in our study "A Tutorial on Deep Learning for Probabilistic Indoor Temperature Forecasting". The dataset consists of various simulated thermal and environmental parameters for different room configurations. Below, you will find a table detailing each column in the dataset along with its description and unit of measurement.
| Column Name | Description | Unit |
|---|---|---|
| time | Time stamp of the measurement | - |
| ZweiPersonenBuero.TAir | Air temperature inside a two-person office | °C |
| heatStat.Heat.Q_flow | Heating rate in the room | W |
| weaDat.AirPressure | Atmospheric pressure | Pa |
| weaDat.AirTemp | Outside air temperature | °C |
| weaDat.SkyRadiation | Longwave sky radiation | W/m² |
| weaDat.TerrestrialRadiation | Terrestrial radiation | W/m² |
| weaDat.WaterInAir | Absolute humidity | g/kg |
| VAir | Air volume in the room | m³ |
| AExt0 | Exterior wall area facing the south | m² |
| AExt1 | Exterior wall area facing the north | m² |
| AInt | Total interior wall area | m² |
| AFloor | Floor area of the room | m² |
| AWin0 | Window area facing the south | m² |
| AWin1 | Window area facing the north | m² |
| azi0 | Azimuth (direction) of the first exterior wall | rad |
| azi1 | Azimuth (direction) of the second exterior wall | rad |
| id | Unique identifier for the room configuration | - |
| is_holiday | Indicator whether the day is a holiday (1 for yes, 0 for no) | - |
For rooms with multiple exterior walls (rooms 15-30):
Example:
This indicates two exterior walls with areas of 10 m² and 15 m² facing south (0 rad) and north (3.1415 rad), respectively. The south-facing wall has a window of 2 m², while the north-facing wall has no window.
This comprehensive dataset provides crucial parameters required to train and evaluate thermal models for different room configurations. The simulation data ensures a diverse range of environmental and occupancy conditions, enhancing the robustness of the models.
The data set contains the raw data as well as the scaled data used for training and testing the model. The scaling was carried out with scikit-learn's StandardScaler, as sketched below.
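A minimal sketch of that scaling step; the file name below is hypothetical, while the column names come from the table above:
import pandas as pd
from sklearn.preprocessing import StandardScaler
raw = pd.read_csv("room_01_raw.csv", parse_dates=["time"])  # hypothetical file name
feature_cols = ["ZweiPersonenBuero.TAir", "heatStat.Heat.Q_flow", "weaDat.AirTemp"]
scaler = StandardScaler()
scaled = raw.copy()
scaled[feature_cols] = scaler.fit_transform(raw[feature_cols])  # standardise the selected channels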
This data set contains weather data recorded by the DWD under license „Datenlizenz Deutschland – Namensnennung – Version 2.0" (URL). The data is provided by "Bundesinstitut für Bau-, Stadt- und Raumforschung". The data can be downloaded from here. We use data from the year 2015 from Heilbronn. We have added the weather data to the data set unchanged.
To use the facial expression dataset:
1. Clone the repository.
2. Create a Python 3.9 environment.
3. Install tensorflow, opencv-python (cv2), scikit-learn, and keras with pip.
4. Run the code in the same kernel environment (this takes time).
5. The process starts by collecting and preprocessing datasets of facial expressions captured in different contexts.
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset presents detailed energy consumption records from various households over a one-month period. With 90,000 rows and multiple features such as temperature, household size, air conditioning usage, and peak-hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.
| Column Name | Data Type Category | Description |
|---|---|---|
| Household_ID | Categorical (Nominal) | Unique identifier for each household |
| Date | Datetime | The date of the energy usage record |
| Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
| Household_Size | Numerical (Discrete) | Number of individuals living in the household |
| Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
| Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
| Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
| Library | Purpose |
|---|---|
| pandas | Reading, cleaning, and transforming tabular data |
| numpy | Numerical operations, working with arrays |

| Library | Purpose |
|---|---|
| matplotlib | Creating static plots (line, bar, histograms, etc.) |
| seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
| plotly | Interactive charts (time series, pie, bar, scatter, etc.) |

| Library | Purpose |
|---|---|
| scikit-learn | Preprocessing, regression, classification, clustering |
| xgboost / lightgbm | Gradient boosting models for better accuracy |

| Library | Purpose |
|---|---|
| sklearn.preprocessing | Encoding categorical features, scaling, normalization |
| datetime / pandas | Date-time conversion and manipulation |

| Library | Purpose |
|---|---|
| sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |
✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
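A minimal, illustrative sketch of how these pieces fit together; the CSV file name is a placeholder, while the column names come from the table above:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("household_energy_consumption.csv", parse_dates=["Date"])  # placeholder file name
X = pd.get_dummies(df[["Household_Size", "Avg_Temperature_C", "Has_AC", "Peak_Hours_Usage_kWh"]],
                   columns=["Has_AC"], drop_first=True)  # encode the Yes/No column
y = df["Energy_Consumption_kWh"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R² on held-out data:", model.score(X_test, y_test))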
This dataset is ideal for a wide variety of analytics and machine learning projects:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research-related work; it describes the methods used to obtain the results. Here, the whole research implementation is done using Python. The work involves the following steps:
1. Acquire Personality Dataset
Kaggle hosts a collection of machine learning datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file by clicking the link provided. It consists of two CSV files (test.csv and train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is then performed to check for inconsistent behaviours or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.
3. Feature Extraction
The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set and a testing set, here 80% for training and 20% for testing. You train the model using the training set. In this project we trained the model using linear_model.LogisticRegression() and svm.SVC() from the scikit-learn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated on the test data using cohen_kappa_score and accuracy_score.
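A minimal sketch of steps 4 and 5, assuming X holds the questionnaire features and y the five-level personality label; the split size and solver settings are illustrative:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)), ("SVM", SVC())]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred),
          "kappa:", cohen_kappa_score(y_test, y_pred))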
Dataset Creation Code:
import pandas as pd
import numpy as np
import gc
from sklearn.preprocessing import LabelEncoder
import pickle
pickle.HIGHEST_PROTOCOL = 4
BASE_PATH = "./datasets/amex-default-prediction"
# Index file
train_data_index = pd.read_csv(f"{BASE_PATH}/train_labels.csv")
test_data_index = pd.read_csv(f"{BASE_PATH}/sample_submission.csv")
print(train_data_index.shape, test_data_index.shape)
all_ids = np.concatenate([train_data_index["customer_ID"], test_data_index["customer_ID"]])
print(len(all_ids))
# Train an id encoder and save it.
id_encoder = LabelEncoder()
id_encoder.fit(all_ids)
np.save("id_encodings.npy", id_encoder.classes_)
# Make sure we can load it back
loaded_encoder = LabelEncoder()
loaded_encoder.classes_ = np.load("id_encodings.npy", allow_pickle=True)
assert (id_encoder.classes_ == loaded_encoder.classes_).all()
# Make sure we can reverse it (1-index)
print(loaded_encoder.inverse_transform([1, 2]))
print(train_data_index["customer_ID"].values[0: 2])
del loaded_encoder
train_data_index["customer_ID"] = id_encoder.transform(train_data_index["customer_ID"])
test_data_index["customer_ID"] = id_encoder.transform(test_data_index["customer_ID"])
# Encode the index files
train_data_index.to_pickle("id_encoded_train_labels.pkl", protocol=4)
test_data_index.to_pickle("id_encoded_sample_submission.pkl", protocol=4)
del train_data_index
del test_data_index
gc.collect()
# Test files are too large for a Kaggle Notebook
main_train = pd.read_csv(
    f"{BASE_PATH}/train_data.csv"
)
main_test = pd.read_csv(
    f"{BASE_PATH}/test_data.csv"
)
main_files = [
    main_train,
    main_test,
]
for main_file in main_files:
    print(main_file.shape)
    main_file["customer_ID"] = id_encoder.transform(main_file["customer_ID"])
main_train.to_pickle("id_encoded_train_data.pkl", protocol=4)
main_test.to_pickle("id_encoded_test_data.pkl", protocol=4)
def reduce_mem_usage(df, use_fp16=False):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    if use_fp16:
                        df[col] = df[col].astype(np.float16)
                    else:
                        df[col] = df[col].astype(np.float32)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
main_train = pd.read_pickle("id_encoded_train_data.pkl")
main_test = pd.read_pickle("id_encoded_test_data.pkl")
reduce_mem_usage(main_train)
reduce_mem_usage(main_test)
main_train.to_pickle("id_encoded_fp32_train_data.pkl", protocol=4)
main_test.to_pickle("id_encoded_fp32_test_data.pkl", protocol=4)
main_train = pd.read_pickle("id_encoded_train_data.pkl")
main_test = pd.read_pickle("id_encoded_test_data.pkl")
reduce_mem_usage(main_train, use_fp16=True)
reduce_mem_usage(main_test, use_fp16=True)
main_train.to_pickle("id_encoded_fp16_train_data.pkl", protocol=4)
main_test.to_pickle("id_encoded_fp16_test_data.pkl", protocol=4)
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Ukrainian translation of the GoEmotions dataset for emotion classification in text.
This dataset is a high-quality Ukrainian translation of Google's GoEmotions dataset, which contains Reddit comments labeled with 28 emotion categories.
The dataset includes 28 emotion categories:
| Category | Ukrainian | Category | Ukrainian |
|---|---|---|---|
| admiration | захоплення | amusement | розвага |
| anger | гнів | annoyance | роздратування |
| approval | схвалення | caring | турбота |
| confusion | розгубленість | curiosity | цікавість |
| desire | бажання | disappointment | розчарування |
| disapproval | несхвалення | disgust | відраза |
| embarrassment | збентеження | excitement | збудження |
| fear | страх | gratitude | вдячність |
| grief | горе | joy | радість |
| love | любов | nervousness | нервозність |
| optimism | оптимізм | pride | гордість |
| realization | усвідомлення | relief | полегшення |
| remorse | каяття | sadness | сум |
| surprise | здивування | neutral | нейтрально |
The dataset is provided in CSV format with the following columns:
text,text_uk,labels,id,split
"My favourite food is anything I didn't have to cook myself.","Моя улюблена їжа - це все, що я не мусив сам готувати.",[27],eebbqej,train
import pandas as pd
# Load dataset
df = pd.read_csv('goemotions_uk.csv')
# Parse labels
import ast
df['labels'] = df['labels'].apply(ast.literal_eval)
# Split by data split
train_df = df[df['split'] == 'train']
val_df = df[df['split'] == 'validation']
test_df = df[df['split'] == 'test']
from sklearn.preprocessing import MultiLabelBinarizer
# Convert labels to multi-hot encoding
mlb = MultiLabelBinarizer(classes=list(range(28)))
mlb.fit([list(range(28))])
train_labels = mlb.transform(train_df['labels'])
val_labels = mlb.transform(val_df['labels'])
test_labels = mlb.transform(test_df['labels'])
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Use multilingual models
model_name = "xlm-roberta-base" # or "TurkuNLP/bert-base-ukrainian-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=28,
problem_type="multi_label_classification"
)
# Tokenize Ukrainian text
encodings = tokenizer(
train_df['text_uk'].tolist(),
truncation=True,
padding=True,
max_length=128
)
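A minimal training sketch that continues from the tokenization above, wrapping the encodings and multi-hot labels in a torch Dataset and fine-tuning with the Hugging Face Trainer; the output directory, epoch count, and batch size are illustrative choices:
import torch
from transformers import Trainer, TrainingArguments

class EmotionDataset(torch.utils.data.Dataset):
    """Pairs the tokenized encodings with multi-hot labels (floats, as BCE loss expects)."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

train_dataset = EmotionDataset(encodings, train_labels)
training_args = TrainingArguments(
    output_dir="goemotions_uk_model",      # illustrative output path
    num_train_epochs=3,                    # illustrative hyperparameters
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()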
This dataset is suitable for:
If you use this dataset, please cite the original GoEmotions paper:
@inproceedings{demszky2020goemotions,
title={{GoEmotions: A Dataset of Fine-Grained Emotions}},
author={Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
booktitle={58th Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2020}
}
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.
Feature Engineering is a crucial step in Machine Learning.
In this project, I show:
- Handling missing values with SimpleImputer
- Encoding categorical variables with OneHotEncoder
- Building models manually vs using Pipeline
- Saving models and pipelines with pickle
- Making predictions with and without pipelines
Model files:
- pipe.pkl → Complete ML pipeline (recommended for predictions)
- clf.pkl → Classifier without pipeline
- ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features

Predict with pipeline:
import pickle
pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
sample = [[22, 1, 0, 7.25, 'male', 'S']]
print(pipe.predict(sample))

Predict without pipeline (manual preprocessing):
import pickle
clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
# Preprocess input manually using the encoders, then predict with clf
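A sketch of what that manual path might look like, assuming the encoders were fit on the single 'Sex' and 'Embarked' columns and that clf expects the numeric columns first; the exact feature order must match how clf was trained:
import numpy as np
age, sibsp, parch, fare, sex, embarked = 22, 1, 0, 7.25, 'male', 'S'  # assumed feature order
sex_encoded = ohe_sex.transform([[sex]])
embarked_encoded = ohe_embarked.transform([[embarked]])
# OneHotEncoder may return sparse output depending on how it was created
if hasattr(sex_encoded, "toarray"):
    sex_encoded = sex_encoded.toarray()
if hasattr(embarked_encoded, "toarray"):
    embarked_encoded = embarked_encoded.toarray()
features = np.hstack([[[age, sibsp, parch, fare]], sex_encoded, embarked_encoded])
print(clf.predict(features))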
🎯 Inspiration
Learn difference between manual feature engineering and pipeline-based workflows
Understand how to avoid data leakage using Pipeline
Explore cross-validation with pipelines
Practice model persistence and deployment strategies
✅ Best Practice: Use pipe.pkl (pipeline) for predictions — it automatically handles preprocessing + modeling in one step!
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Daily Machine Learning Practice – 1 Commit per Day
Author: Astrid Villalobos
Location: Montréal, QC
LinkedIn: https://www.linkedin.com/in/astridcvr/
Objective
The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.
Dataset
Source: Kaggle – Sample Sales Data
File: data/sales_data_sample.csv
Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.
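A quick, illustrative first step with this file; the latin-1 encoding is an assumption that commonly applies to this Kaggle CSV:
import pandas as pd
df = pd.read_csv("data/sales_data_sample.csv", encoding="latin-1")  # encoding assumed
print(df.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False).head())  # total sales per country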
**Project Rules**

| Rule | Description |
|---|---|
| 🟩 1 Commit per Day | Minimum one line of code daily to ensure consistency and discipline |
| 🌍 Bilingual Comments | Code and documentation in English and French |
| 📈 Visible Progress | Daily green squares = daily learning |

🧰 Tech Stack
Languages: Python
Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Tools: Jupyter Notebook, GitHub, Kaggle
Learning Outcomes
By the end of this challenge:
- Develop a stronger understanding of data preprocessing, modeling, and evaluation.
- Build consistent coding habits through daily practice.
- Apply ML techniques to real-world sales data scenarios.
The Cropped Yale Face Dataset is a widely used benchmark in computer vision and machine learning for face recognition tasks. It consists of grayscale images of human faces captured under varying lighting conditions and expressions. The dataset is well-suited for research in facial recognition, image preprocessing, and machine learning model evaluation.
| Feature | Description |
|---|---|
| Number of subjects | 38 individuals |
| Number of images | 2,414 images |
| Image size | 192 × 168 pixels |
| Color | Grayscale (single channel) |
| Variations | Lighting conditions, facial expressions, and slight head rotations |
| Format | .pgm images (can be converted to .png or .jpg) |
| Common usage | Face recognition, PCA/LDA experiments, image classification |
CroppedYale/
├── yaleB01/
│ ├── yaleB01_P00A+000E+00.pgm
│ ├── yaleB01_P00A+000E+05.pgm
│ └── ...
├── yaleB02/
│ └── ...
└── ...
File naming convention: yaleB<subject_id>_P<pose>A<ambient>E<expression>.pgm. The dataset is perfect for evaluating facial recognition algorithms under controlled lighting and expression variations.
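The snippet below assumes images and labels arrays have already been built; a minimal sketch of one way to load them from the folder structure above, using Pillow and labelling each image by its subject folder:
import os
import numpy as np
from PIL import Image
root = "CroppedYale"
images, labels = [], []
for subject_id, subject_dir in enumerate(sorted(os.listdir(root))):
    subject_path = os.path.join(root, subject_dir)
    for fname in sorted(os.listdir(subject_path)):
        if fname.endswith(".pgm"):
            images.append(np.asarray(Image.open(os.path.join(subject_path, fname))))
            labels.append(subject_id)
images = np.asarray(images)   # shape: (n_images, 192, 168)
labels = np.asarray(labels)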
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import numpy as np
# Load images and flatten
X = images.reshape(len(images), -1)
y = labels
# Reduce dimensions using PCA
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X)
# Train classifier
clf = SVC(kernel='linear')
clf.fit(X_pca, y)
Due to its moderate image size, the dataset is ideal for testing dimensionality reduction methods like PCA, LDA, or t-SNE.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
X_embedded = TSNE(n_components=2).fit_transform(X_pca)
plt.scatter(X_embedded[:,0], X_embedded[:,1], c=y)
plt.show()
Researchers can use this dataset to study the effect of lighting conditions and facial expressions on recognition accuracy.
yaleB01_P00A+000E+00.pgm → Normal expression
yaleB01_P00A+000E+05.pgm → Smiling expression
yaleB01_P00A+010E+00.pgm → Slightly rotated face
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains 2 folders: one with the test data and the other one with train data. The test-train-split ratio is 0.14, with the test dataset containing 114 images and the train dataset containing 711. The images have a resolution of 240x240 pixels in RGB color model. Both the folders contain 3 classes:
This dataset is ideal for performing multiclass classification with deep neural networks like CNNs or simpler machine learning classification models.
You can use TensorFlow, its high-level API Keras, scikit-learn, PyTorch, or other deep/machine learning libraries to build a model from scratch or, alternatively, fetch pretrained models and fine-tune them.
It is also possible to resize the images or preprocess them with OpenCV and check whether the accuracy of the model improves; a minimal Keras sketch follows.
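A minimal baseline CNN in Keras, assuming the archive was extracted into folders named train and test with one subfolder per class; the folder names and hyperparameters are assumptions:
import tensorflow as tf
train_ds = tf.keras.utils.image_dataset_from_directory("train", image_size=(240, 240), batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory("test", image_size=(240, 240), batch_size=32)
num_classes = len(train_ds.class_names)
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(240, 240, 3)),  # scale pixels to [0, 1]
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(train_ds, validation_data=test_ds, epochs=5)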
Remember to upvote if you found the dataset useful :).
The dataset was obtained by downloading images from Google Images.
The images with a .webp format were transformed into .jpg images. The obtained images were randomly shuffled and resized so that all the images had a resolution of 240x240 pixels.
Then, they were split into train and test datasets and saved.