100+ datasets found

Feature Engineering Data
Data for the Feature Engineering Mini-Course
kaggle.com
Updated Jul 23, 2019
Cite
Mat Leonard (2019). Feature Engineering Data [Dataset]. https://www.kaggle.com/matleonard/feature-engineering-data/metadata
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 23, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mat Leonard
Description
This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.
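
The sampling described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function name `downsample_negatives`, the `keep_fraction` parameter, the fixed seed, and the toy row counts are assumptions; only the `is_attributed` column and the 99% discard rate come from the description.

```python
import random

def downsample_negatives(rows, keep_fraction=0.01, seed=0):
    """Keep every positive row and a random fraction of the negative rows.

    `rows` is a list of dicts with an integer `is_attributed` field.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r["is_attributed"] == 1]
    negatives = [r for r in rows if r["is_attributed"] != 1]
    n_keep = round(len(negatives) * keep_fraction)
    # Sample without replacement so no negative row is duplicated.
    return positives + rng.sample(negatives, n_keep)

# Toy data (made-up counts): 25 positives and 10,000 negatives.
rows = [{"is_attributed": 1}] * 25 + [{"is_attributed": 0}] * 10_000
sample = downsample_negatives(rows)
positive_share = sum(r["is_attributed"] for r in sample) / len(sample)
```

With these toy counts the sample keeps all 25 positives plus 100 negatives, which shows how discarding most negatives raises the share of positives.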

For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.

File descriptions

train_sample.csv - Sampled data

Data fields

Each row of the training data contains a click record, with the following features.

ip: ip address of click.

app: app id for marketing.

device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)

os: os version id of user mobile phone

channel: channel id of mobile ad publisher

click_time: timestamp of click (UTC)

attributed_time: if the user downloaded the app after clicking an ad, this is the time of the app download

is_attributed: the target that is to be predicted, indicating the app was downloaded

Note that ip, app, device, os, and channel are encoded.
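
As an illustration of how these fields might be turned into model inputs, the sketch below parses click records and derives simple time features from click_time. The timestamp format (`%Y-%m-%d %H:%M:%S`), the sample values, and the helper name `click_time_features` are assumptions, not taken from the dataset.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row excerpt using the column names listed above.
SAMPLE_CSV = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
1234,3,1,13,379,2017-11-07 09:30:38,,0
5678,19,0,24,213,2017-11-08 14:02:11,2017-11-08 14:05:40,1
"""

def click_time_features(row, fmt="%Y-%m-%d %H:%M:%S"):
    """Return day, hour, minute and second extracted from `click_time`."""
    t = datetime.strptime(row["click_time"], fmt)
    return {"day": t.day, "hour": t.hour, "minute": t.minute, "second": t.second}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features = [click_time_features(r) for r in rows]
```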

I'm also including Parquet files with various features for use within the course.
PS3E23 | EDA | Feature Engineering | Ensemble
kaggle.com
Updated Nov 9, 2023
Cite
Chokaew Phonhan (2023). PS3E23 | EDA | Feature Engineering | Ensemble [Dataset]. https://www.kaggle.com/datasets/chokaewphonhan/ps3e23-eda-feature-engineering-ensemble
Dataset updated
Nov 9, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Chokaew Phonhan
Description
This dataset was created by Chokaew Phonhan.
IceCube_FeatureEngineering
kaggle.com
Updated Feb 6, 2023
Cite
utm529f (2023). IceCube_FeatureEngineering [Dataset]. https://www.kaggle.com/datasets/utm529fg/icecube-featureengineering
Dataset updated
Feb 6, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
utm529f
Description
sensor_geometory_add_feature.csv
Added basic statistics and a DeepCore flag (whether or not the sensor is DeepCore) for each of the 5160 sensors.

train_meta_add_feature.parquet
Added basic statistics for each event.
MAG Papers
kaggle.com
Updated Sep 24, 2021
Cite
Aleks Mashanski (2021). MAG Papers [Dataset]. https://www.kaggle.com/alexmaszanski/mag-papers/code
Dataset updated
Sep 24, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aleks Mashanski
Description
Context

This is data from Chapter 9 of the book Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari. The data is suitable for the last project of this book: Academic Papers Recommendation System.
[Otto]Feature-engineering
kaggle.com
Updated Mar 21, 2023
Cite
furu-nag (2023). [Otto]Feature-engineering [Dataset]. https://www.kaggle.com/datasets/kunihikofurugori/ottofeatureengineering/discussion
Dataset updated
Mar 21, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
furu-nag
Description
This dataset was created by furu-nag.
titanic_preprocess
kaggle.com
Updated Dec 26, 2021
Cite
blue7red (2021). titanic_preprocess [Dataset]. https://www.kaggle.com/rhythmcam/titanic-preprocess/code
Dataset updated
Dec 26, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
blue7red
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
How to use the library to load and preprocess the data at the same time.

Use the code below:

train, test = loadAndPreprocess(train_path, test_path)
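
The library's own implementation is not shown in the description. Purely as an illustration of the load-and-preprocess-together pattern, a helper of this shape might look like the sketch below. Everything in it is hypothetical: the name `load_and_preprocess_sketch`, the CSV input format, and the choice to fill missing `Age` values with the training median are assumptions, not the dataset author's method.

```python
import csv
import statistics

def _read_rows(path):
    """Read a CSV file into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_and_preprocess_sketch(train_path, test_path, column="Age"):
    """Load both files, then fill blanks in `column` with the train median."""
    train, test = _read_rows(train_path), _read_rows(test_path)
    # Compute the fill value from the training data only, to avoid leakage.
    known = [float(r[column]) for r in train if r[column] != ""]
    fill = statistics.median(known)
    for r in train + test:
        r[column] = float(r[column]) if r[column] != "" else fill
    return train, test
```

Fitting the fill value on the training file and reusing it for the test file keeps both splits consistent, which is the main reason to preprocess them in one call.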
mystery feature engineering
kaggle.com
Updated May 6, 2025
Cite
Aleksandr Razin (2025). mystery feature engineering [Dataset]. https://www.kaggle.com/datasets/alndralndr/mystery-feature-engineering/code
Dataset updated
May 6, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aleksandr Razin
Description
This dataset was created by Aleksandr Razin.
Null Data Feature Engineering Util
kaggle.com
Updated Mar 20, 2022
Cite
blue7red (2022). Null Data Feature Engineering Util [Dataset]. https://www.kaggle.com/datasets/rhythmcam/null-data-feature-engineering-util
Dataset updated
Mar 20, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
blue7red
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset was created by blue7red.

Released under CC0: Public Domain
House Prices With Advanced Feature Engineering
kaggle.com
Updated Jan 16, 2022
Cite
ItaiHarpaz (2022). House Prices With Advanced Feature Engineering [Dataset]. https://www.kaggle.com/datasets/itai2468/house-prices-with-advanced-feature-engineering
Dataset updated
Jan 16, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
ItaiHarpaz
Description
This dataset was created by ItaiHarpaz.
Feature Extraction
kaggle.com
Updated Sep 4, 2019
Cite
Jason (2019). Feature Extraction [Dataset]. https://www.kaggle.com/jclchan/feature-extraction/notebooks
Dataset updated
Sep 4, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Jason
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.

Unlike most participants, who took a deep learning approach to this classification problem, we tried using Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) to extract features from the images as inputs to simpler ML algorithms such as SVM. This approach shows some promising results.

There are three files in this dataset:

Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.

train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.

Columns in train_features.RDS & test_features.RDS:

id_code - image id

diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR. Artificially set to 0 in test_features.RDS.

n - number of persistent homology components detected from the image

fd1 to fd21 - proportion of sliding windows having a specific fractal dimension (FD): fd1 = proportion of windows with FD = 2; fd2 = proportion of windows with FD in (2, 2.05]; ... fd21 = proportion of windows with FD in (2.95, 3.00]

l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
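
The binning behind fd1 to fd21 can be sketched as follows, assuming each sliding window has already been assigned a fractal dimension between 2 and 3. The function name `fd_proportions` and the small tolerance used to keep values such as 2.10 in their intended bin are assumptions; the original feature extraction was done in R and is not reproduced here.

```python
import math

def fd_proportions(fd_values, width=0.05, eps=1e-9):
    """Map per-window fractal dimensions in [2, 3] to the features fd1..fd21."""
    if not fd_values:
        raise ValueError("need at least one window")
    counts = [0] * 21
    for fd in fd_values:
        if not 2.0 <= fd <= 3.0:
            raise ValueError(f"fractal dimension out of range: {fd}")
        # Bin 0 holds FD == 2 exactly; bin k (k >= 1) holds the interval
        # (2 + (k - 1) * width, 2 + k * width]. The tolerance `eps` guards
        # against floating-point noise at the right-closed bin edges.
        k = 0 if fd == 2.0 else max(1, math.ceil((fd - 2.0) / width - eps))
        counts[k] += 1
    return {f"fd{k + 1}": c / len(fd_values) for k, c in enumerate(counts)}

# Four hypothetical windows, one per bin edge of interest.
features = fd_proportions([2.0, 2.05, 2.10, 3.0])
```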
amex-feature-engineering-dataset
kaggle.com
Updated Jul 16, 2022
Cite
Hongyi Shao (2022). amex-feature-engineering-dataset [Dataset]. https://www.kaggle.com/datasets/hongyishao/amexfeatureengineeringdataset
Dataset updated
Jul 16, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Hongyi Shao
Description
This dataset was created by Hongyi Shao.
DimensionalityReduction&FeatureSelection
kaggle.com
Updated Mar 16, 2022
Cite
Mukesh Manral (2022). DimensionalityReduction&FeatureSelection [Dataset]. https://www.kaggle.com/mukeshmanral/dimensionalityreductionfeatureselection/code
Dataset updated
Mar 16, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mukesh Manral
Description
This dataset was created by Mukesh Manral.
Predicting Tweet Sentiments
kaggle.com
Updated Jun 10, 2020
Cite
Bhuwanesh Tripathi (2020). Predicting Tweet Sentiments [Dataset]. https://www.kaggle.com/datasets/bhuwanesh340/predicting-tweet-sentiments
Dataset updated
Jun 10, 2020
Dataset provided by
Kaggle
Authors
Bhuwanesh Tripathi
Description
This dataset was created by Bhuwanesh Tripathi.

Released under Other (specified in description)
google_feature_engineering
kaggle.com
Updated Oct 18, 2019
Cite
Zhou Hong (2019). google_feature_engineering [Dataset]. https://www.kaggle.com/zhouhong0/google-feature-engineering/discussion
Dataset updated
Oct 18, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Zhou Hong
Description
This dataset was created by Zhou Hong.
Feature Importance Analysis- Anagha Joshi
kaggle.com
Updated Mar 22, 2019
Cite
Anagha Joshi (2019). Feature Importance Analysis- Anagha Joshi [Dataset]. https://www.kaggle.com/anajoshi/heartrate/code
Dataset updated
Mar 22, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Anagha Joshi
Description
This dataset was created by Anagha Joshi.
SmilesStrings
kaggle.com
Updated Mar 20, 2025
Cite
D Friday (2025). SmilesStrings [Dataset]. https://www.kaggle.com/datasets/dfriday/smilesstrings
Dataset updated
Mar 20, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
D Friday
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This is just a list of SMILES strings for 9 molecules.

The attached code is an example of how to generate new molecular features from these (or other) SMILES strings.
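
The attached code itself is not reproduced in this listing. As a rough stand-in, the sketch below derives a few crude, string-level descriptors directly from SMILES text. It is not the dataset's code and not a substitute for a cheminformatics toolkit: the function name is hypothetical, and the ring count assumes single-digit ring closures with no digits inside bracket atoms.

```python
def smiles_string_features(smiles):
    """Compute crude, string-level descriptors for one SMILES string."""
    return {
        "length": len(smiles),
        # "(" opens a branch, "=" is a double bond, "#" is a triple bond.
        "branches": smiles.count("("),
        "double_bonds": smiles.count("="),
        "triple_bonds": smiles.count("#"),
        # Each ring closure is written as a matching pair of digits.
        "ring_closures": sum(ch.isdigit() for ch in smiles) // 2,
    }

# Two well-known examples: acetic acid and benzene.
features = {s: smiles_string_features(s) for s in ("CC(=O)O", "c1ccccc1")}
```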
pseudolabeling-features-engineering
kaggle.com
Updated Dec 17, 2021
Cite
Mathurin Aché (2021). pseudolabeling-features-engineering [Dataset]. https://www.kaggle.com/datasets/mathurinache/pseudolabelingfeaturesengineering/code
Dataset updated
Dec 17, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mathurin Aché
Description
This dataset was created by Mathurin Aché.
youtubecommentsdataset
kaggle.com
Updated Jun 7, 2025
Cite
Afryan Fernando (2025). youtubecommentsdataset [Dataset]. https://www.kaggle.com/datasets/afryanfernando/youtubecommentsdataset
Dataset updated
Jun 7, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Afryan Fernando
Description
This dataset comprises user comments collected from YouTube videos discussing Prabowo Subianto’s speech in relation to former U.S. President Donald Trump’s tariff policies. The data is organized into three separate Excel files, each representing a different sentiment distribution:

Balanced Dataset: Contains an equal number of comments across all three sentiment classes — positive, negative, and neutral — to support unbiased model training and evaluation.

Unbalanced Dataset: Reflects the natural distribution of sentiments as observed in the raw data, providing a realistic scenario for real-world sentiment analysis.

Neutral-Inclusive Dataset: A version of the dataset that includes comments labeled as neutral, in addition to positive and negative sentiments, offering a more comprehensive view of public opinion.

This dataset is suitable for sentiment classification tasks, public opinion mining, and research in political discourse analysis.
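
One common way to obtain a balanced set like the one described above is to downsample every class to the size of the smallest one. The sketch below illustrates that idea only; how the balanced file was actually built is not stated, and the function name, the `(text, label)` row layout, the seed, and the toy comments are assumptions.

```python
import random
from collections import defaultdict

def balance_by_downsampling(rows, seed=0):
    """Return a subset with the same number of rows for every label.

    `rows` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        # Sample without replacement down to the smallest class size.
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced

# Toy comments with an uneven label distribution (made-up data).
rows = ([("good", "positive")] * 5 + [("bad", "negative")] * 3
        + [("ok", "neutral")] * 8)
balanced = balance_by_downsampling(rows)
counts = {label: sum(1 for _, l in balanced if l == label)
          for label in ("positive", "negative", "neutral")}
```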
Tabular_5-folds
kaggle.com
Updated Aug 22, 2021
Cite
Proloy Pal (2021). Tabular_5-folds [Dataset]. https://www.kaggle.com/proloypal/tabular-5folds/code
Dataset updated
Aug 22, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Proloy Pal
Description
This dataset was created by Proloy Pal.
Loan Approval Dataset
kaggle.com
Updated Dec 7, 2024
Cite
Shrishti Tayde (2024). Loan Approval Dataset [Dataset]. https://www.kaggle.com/datasets/shrishtitayde/loan-approval-dataset/discussion
Dataset updated
Dec 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Shrishti Tayde
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset was created by Shrishti Tayde.

Released under MIT
