Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The methodology is the core component of any research work: it describes the methods used to obtain the results. The entire implementation here is done in Python, and the work proceeds through the following steps:
1. Acquire Personality Dataset
Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file by clicking the link provided, and it consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is then performed to check for inconsistent values or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The dataset contains numerical features, and the target is a five-level personality label: serious, lively, responsible, dependable & extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. The training data is then used to fit the Logistic Regression & SVM models, and the test data is used to estimate the accuracy of the trained models.
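A minimal sketch of this split step (the file and column names below are assumptions for illustration, not taken from the actual CSVs):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")            # assumed file name
X = df.drop(columns=["Personality"])     # feature values (assumed target column name)
y = df["Personality"]                    # five-level target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)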
3. Feature Extraction
The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/test is a method to measure the accuracy of your model. It is called train/test because you split the data set into two sets: a training set and a testing set, here 80% for training and 20% for testing. You train the model using the training set. In this work the models were trained using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.
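A sketch of this training step, continuing from the split above (default hyperparameters; not necessarily the exact settings used):

from sklearn import linear_model, svm

log_reg = linear_model.LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

svc = svm.SVC()
svc.fit(X_train, y_train)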
5. Personality Prediction Output
After training, the Logistic Regression & SVM models are evaluated on the test data using cohen_kappa_score & accuracy_score.
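A sketch of this evaluation step, assuming the models and split from the sketches above:

from sklearn.metrics import accuracy_score, cohen_kappa_score

for name, model in [("Logistic Regression", log_reg), ("SVM", svc)]:
    preds = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, preds),
          "kappa:", cohen_kappa_score(y_test, preds))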
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset contains 2,245 cleaned ageing test traces (time vs. MPPT PCE, i.e. maximum power point tracking power conversion efficiency) for perovskite solar cells with various device stacks and architectures, in pickle (.pkl) format.
The dataset can be loaded with the following commands in Python.
import pickle5 as pickle
import pandas as pd
import numpy as np

with open('20230303_mySeriesDrop.pkl', "rb") as fh:
    mySeriesDrop = pickle.load(fh)
The following command can be used to call a specific row (row 0) within the dataset.
mySeriesDrop[0]
The next steps in using the dataset are scaling/normalisation (for instance with sklearn.preprocessing.MaxAbsScaler) and smoothing (for instance with a Savitzky-Golay filter).
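A sketch of these two steps for a single trace, assuming each entry of mySeriesDrop is a pandas Series of PCE values (the window length and polynomial order below are illustrative choices, not values from the published analysis):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler
from scipy.signal import savgol_filter

trace = mySeriesDrop[0].to_numpy().reshape(-1, 1)      # one ageing trace as a column vector
scaled = MaxAbsScaler().fit_transform(trace).ravel()   # scale by the maximum absolute value
smoothed = savgol_filter(scaled, window_length=51, polyorder=3)  # Savitzky-Golay smoothing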
The code to run the complete analysis, including self-organising map clustering, can be accessed here: https://doi.org/10.5281/zenodo.8181602.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset presents detailed energy consumption records from various households over a month. With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak-hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.
| Column Name | Data Type Category | Description |
|---|---|---|
| Household_ID | Categorical (Nominal) | Unique identifier for each household |
| Date | Datetime | The date of the energy usage record |
| Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
| Household_Size | Numerical (Discrete) | Number of individuals living in the household |
| Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
| Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
| Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
| Library | Purpose |
|---|---|
| pandas | Reading, cleaning, and transforming tabular data |
| numpy | Numerical operations, working with arrays |
| matplotlib | Creating static plots (line, bar, histograms, etc.) |
| seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
| plotly | Interactive charts (time series, pie, bar, scatter, etc.) |
| scikit-learn | Preprocessing, regression, classification, clustering |
| xgboost / lightgbm | Gradient boosting models for better accuracy |
| sklearn.preprocessing | Encoding categorical features, scaling, normalization |
| datetime / pandas | Date-time conversion and manipulation |
| sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |
✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
This dataset is ideal for a wide variety of analytics and machine learning projects.
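For example, a quick-start sketch using the columns above (the CSV file name is an assumption; adjust it to the actual file in this dataset):

import pandas as pd

df = pd.read_csv("household_energy_consumption.csv", parse_dates=["Date"])  # assumed file name

# Average daily consumption across all households
daily = df.groupby("Date")["Energy_Consumption_kWh"].mean()

# Compare households with and without air conditioning
print(df.groupby("Has_AC")["Energy_Consumption_kWh"].mean())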
To use the Facial Expression dataset:
1. Clone it.
2. Create a Python 3.9 environment.
3. Install tensorflow, opencv-python (cv2), scikit-learn, and keras using pip.
4. Run in the same kernel environment (it takes time).
5. The process starts collecting and preprocessing datasets of facial expressions captured in different contexts.
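A rough preprocessing sketch for a single image under that setup (the file path and the 48x48 target size are assumptions for illustration):

import cv2
import numpy as np

img = cv2.imread("data/happy/sample_001.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
img = cv2.resize(img, (48, 48)).astype("float32") / 255.0            # resize and normalise to [0, 1]
batch = np.expand_dims(img, axis=(0, -1))                            # shape (1, 48, 48, 1) for a Keras model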
MIT License: https://opensource.org/licenses/MIT
Daily Machine Learning Practice – 1 Commit per Day
Author: Astrid Villalobos
Location: Montréal, QC
LinkedIn: https://www.linkedin.com/in/astridcvr/
Objective The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.
Dataset
Source: Kaggle – Sample Sales Data
File: data/sales_data_sample.csv
Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.
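An example of the kind of small daily contribution this project targets, loading the file above and summarising sales by country (a sketch; this Kaggle file usually needs a non-UTF-8 encoding such as latin-1):

import pandas as pd

sales = pd.read_csv("data/sales_data_sample.csv", encoding="latin-1")
print(sales.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False).head())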
**Project Rules**

| Rule | Description |
|---|---|
| 🟩 1 Commit per Day | Minimum one line of code daily to ensure consistency and discipline |
| 🌍 Bilingual Comments | Code and documentation in English and French |
| 📈 Visible Progress | Daily green squares = daily learning |

🧰 Tech Stack
Languages: Python
Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Tools: Jupyter Notebook, GitHub, Kaggle

Learning Outcomes
By the end of this challenge:
- Develop a stronger understanding of data preprocessing, modeling, and evaluation.
- Build consistent coding habits through daily practice.
- Apply ML techniques to real-world sales data scenarios.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.
Feature Engineering is a crucial step in Machine Learning.
In this project, I show:
- Handling missing values with SimpleImputer
- Encoding categorical variables with OneHotEncoder
- Building models manually vs using Pipeline (see the sketch after this list)
- Saving models and pipelines with pickle
- Making predictions with and without pipelines
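A minimal sketch of such a pipeline (the column names follow the usual Titanic conventions; the actual pipeline saved as pipe.pkl may differ):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric = ["Age", "SibSp", "Parch", "Fare"]
categorical = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train); pickle.dump(pipe, open("pipe.pkl", "wb"))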
Saved models:
- pipe.pkl → Complete ML pipeline (recommended for predictions)
- clf.pkl → Classifier without pipeline
- ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features

Predict with pipeline

import pickle
pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
sample = [[22, 1, 0, 7.25, 'male', 'S']]
print(pipe.predict(sample))
Predict without pipeline
import pickle
clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
# Preprocess input manually using the encoders, then predict with clf
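A sketch of that manual path (the column order below is an assumption based on the sample used earlier: Age, SibSp, Parch, Fare, then the encoded Sex and Embarked columns; the real order depends on how clf was trained):

import numpy as np

age, sibsp, parch, fare, sex, embarked = 22, 1, 0, 7.25, 'male', 'S'

sex_enc = ohe_sex.transform([[sex]])
embarked_enc = ohe_embarked.transform([[embarked]])

# OneHotEncoder may return a sparse matrix depending on how it was configured
if hasattr(sex_enc, "toarray"):
    sex_enc = sex_enc.toarray()
if hasattr(embarked_enc, "toarray"):
    embarked_enc = embarked_enc.toarray()

X_manual = np.hstack([[[age, sibsp, parch, fare]], sex_enc, embarked_enc])
print(clf.predict(X_manual))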
🎯 Inspiration
Learn difference between manual feature engineering and pipeline-based workflows
Understand how to avoid data leakage using Pipeline
Explore cross-validation with pipelines
Practice model persistence and deployment strategies
✅ Best Practice: Use pipe.pkl (pipeline) for predictions — it automatically handles preprocessing + modeling in one step!
Dataset Creation Code:
import pandas as pd
import numpy as np
import gc
from sklearn.preprocessing import LabelEncoder
import pickle
pickle.HIGHEST_PROTOCOL = 4
BASE_PATH = "./datasets/amex-default-prediction"
# Index file
train_data_index = pd.read_csv(f"{BASE_PATH}/train_labels.csv")
test_data_index = pd.read_csv(f"{BASE_PATH}/sample_submission.csv")
print(train_data_index.shape, test_data_index.shape)
all_ids = np.concatenate([train_data_index["customer_ID"], test_data_index["customer_ID"]])
print(len(all_ids))
# Train an id encoder and save it.
id_encoder = LabelEncoder()
id_encoder.fit(all_ids)
np.save("id_encodings.npy", id_encoder.classes_)
# Make sure we can load it back
loaded_encoder = LabelEncoder()
loaded_encoder.classes_ = np.load("id_encodings.npy", allow_pickle=True)
assert (id_encoder.classes_ == loaded_encoder.classes_).all()
# Make sure we can reverse it (1-index)
print(loaded_encoder.inverse_transform([1, 2]))
print(train_data_index["customer_ID"].values[0: 2])
del loaded_encoder
train_data_index["customer_ID"] = id_encoder.transform(train_data_index["customer_ID"])
test_data_index["customer_ID"] = id_encoder.transform(test_data_index["customer_ID"])
# Encode the index files
train_data_index.to_pickle("id_encoded_train_labels.pkl", protocol=4)
test_data_index.to_pickle("id_encoded_sample_submission.pkl", protocol=4)
del train_data_index
del test_data_index
gc.collect()
# Test files are too large for a Kaggle Notebook
main_train = pd.read_csv(
f"{BASE_PATH}/train_data.csv"
)
main_test = pd.read_csv(
f"{BASE_PATH}/test_data.csv"
)
main_files = [
main_train,
main_test,
]
for main_file in main_files:
    print(main_file.shape)
    main_file["customer_ID"] = id_encoder.transform(main_file["customer_ID"])
main_train.to_pickle("id_encoded_train_data.pkl", protocol=4)
main_test.to_pickle("id_encoded_test_data.pkl", protocol=4)
def reduce_mem_usage(df, use_fp16=False):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    if use_fp16:
                        df[col] = df[col].astype(np.float16)
                    else:
                        df[col] = df[col].astype(np.float32)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
main_train = pd.read_pickle("id_encoded_train_data.pkl")
main_test = pd.read_pickle("id_encoded_test_data.pkl")
reduce_mem_usage(main_train)
reduce_mem_usage(main_test)
main_train.to_pickle("id_encoded_fp32_train_data.pkl", protocol=4)
main_test.to_pickle("id_encoded_fp32_test_data.pkl", protocol=4)
main_train = pd.read_pickle("id_encoded_train_data.pkl")
main_test = pd.read_pickle("id_encoded_test_data.pkl")
reduce_mem_usage(main_train, use_fp16=True)
reduce_mem_usage(main_test, use_fp16=True)
main_train.to_pickle("id_encoded_fp16_train_data.pkl", protocol=4)
main_test.to_pickle("id_encoded_fp16_test_data.pkl", protocol=4)
The Cropped Yale Face Dataset is a widely used benchmark in computer vision and machine learning for face recognition tasks. It consists of grayscale images of human faces captured under varying lighting conditions and expressions. The dataset is well-suited for research in facial recognition, image preprocessing, and machine learning model evaluation.
| Feature | Description |
|---|---|
| Number of subjects | 38 individuals |
| Number of images | 2,414 images |
| Image size | 192 × 168 pixels |
| Color | Grayscale (single channel) |
| Variations | Lighting conditions, facial expressions, and slight head rotations |
| Format | .pgm images (can be converted to .png or .jpg) |
| Common usage | Face recognition, PCA/LDA experiments, image classification |
CroppedYale/
├── yaleB01/
│ ├── yaleB01_P00A+000E+00.pgm
│ ├── yaleB01_P00A+000E+05.pgm
│ └── ...
├── yaleB02/
│ └── ...
└── ...
File names follow the pattern yaleB<subject_id>_P<pose>A<azimuth>E<elevation>.pgm, where A and E encode the azimuth and elevation of the light source. The dataset is well suited to evaluating facial recognition algorithms under controlled lighting and expression variations.
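A minimal loading sketch, assuming the CroppedYale/ layout above and Pillow installed (ambient or corrupted images are not filtered out here):

import os
import numpy as np
from PIL import Image

root = "CroppedYale"
images, labels = [], []
for subject in sorted(os.listdir(root)):
    subject_dir = os.path.join(root, subject)
    if not os.path.isdir(subject_dir):
        continue
    for fname in sorted(os.listdir(subject_dir)):
        if fname.endswith(".pgm"):
            img = Image.open(os.path.join(subject_dir, fname))
            images.append(np.asarray(img, dtype=np.float32))
            labels.append(subject)      # e.g. "yaleB01"

images = np.stack(images)               # shape (n_images, 192, 168)
labels = np.array(labels)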
from sklearn.decomposition import PCA
from sklearn.svm import SVC
import numpy as np
# Flatten each image into a 1-D feature vector (uses the images/labels arrays loaded above)
X = images.reshape(len(images), -1)
y = labels
# Reduce dimensions using PCA
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X)
# Train classifier
clf = SVC(kernel='linear')
clf.fit(X_pca, y)
Due to its moderate image size, the dataset is ideal for testing dimensionality reduction methods like PCA, LDA, or t-SNE.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
X_embedded = TSNE(n_components=2).fit_transform(X_pca)
plt.scatter(X_embedded[:,0], X_embedded[:,1], c=y)
plt.show()
Researchers can use this dataset to study the effect of lighting conditions and facial expressions on recognition accuracy.
Example file names:
- yaleB01_P00A+000E+00.pgm → frontal illumination (azimuth 0°, elevation 0°)
- yaleB01_P00A+000E+05.pgm → light source raised 5° in elevation
- yaleB01_P00A+010E+00.pgm → light source shifted 10° in azimuth