100+ datasets found
  1. Feature Engineering Data

    • kaggle.com
    Updated Jul 23, 2019
    Cite
    Mat Leonard (2019). Feature Engineering Data [Dataset]. https://www.kaggle.com/matleonard/feature-engineering-data/metadata
    Dataset updated
    Jul 23, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mat Leonard
    Description

    This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.
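A minimal sketch of the sampling described above, assuming each row is a dict with an `is_attributed` key. The function name, the `keep_fraction` parameter, and the seed are illustrative and are not taken from the dataset author's code. As a rough check on the figures: if dropping 99% of the negatives leaves about 20% positives, the full data would hold roughly 400 negatives per positive, or about 0.25% positives.

```python
import random


def downsample_negatives(rows, keep_fraction=0.01, seed=0):
    """Keep every positive row and a random fraction of the negative rows.

    `rows` is any iterable of dicts with an `is_attributed` key (1 or 0).
    `keep_fraction=0.01` mirrors "discarding 99% of the negative samples".
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = []
    for row in rows:
        # Positives are always kept; negatives survive with probability keep_fraction.
        if row["is_attributed"] == 1 or rng.random() < keep_fraction:
            sampled.append(row)
    return sampled
```

Because negatives are kept at random, the positive share of the result varies slightly between seeds.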

    For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.

    File descriptions

    train_sample.csv - Sampled data

    Data fields

    Each row of the training data contains a click record, with the following features.

    • ip: ip address of click.
    • app: app id for marketing.
    • device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
    • os: os version id of user mobile phone
    • channel: channel id of mobile ad publisher
    • click_time: timestamp of click (UTC)
    • attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
    • is_attributed: the target that is to be predicted, indicating the app was downloaded

    Note that ip, app, device, os, and channel are encoded.
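The fields above can be turned into simple time-based features. The sketch below is only an illustration: the two CSV rows are made up, the timestamp format `%Y-%m-%d %H:%M:%S` is an assumption about how `click_time` is stored, and the derived names (`click_hour`, `click_weekday`, `seconds_to_download`) are hypothetical.

```python
import csv
import io
from datetime import datetime

# Made-up rows for illustration only; not taken from train_sample.csv.
SAMPLE_CSV = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
87540,12,1,13,497,2017-11-07 09:30:38,,0
105560,25,1,17,259,2017-11-07 13:40:27,2017-11-07 14:05:01,1
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def parse_click(row):
    """Convert one raw CSV row (a dict of strings) into typed features."""
    click_time = datetime.strptime(row["click_time"], TIME_FORMAT)
    attributed = row["attributed_time"]
    seconds_to_download = None
    if attributed:  # empty when the click did not lead to a download
        delta = datetime.strptime(attributed, TIME_FORMAT) - click_time
        seconds_to_download = delta.total_seconds()
    return {
        # ip, app, device, os and channel are encoded ids, so keep them as integers.
        "ip": int(row["ip"]),
        "app": int(row["app"]),
        "device": int(row["device"]),
        "os": int(row["os"]),
        "channel": int(row["channel"]),
        "click_hour": click_time.hour,
        "click_weekday": click_time.weekday(),  # Monday == 0
        "seconds_to_download": seconds_to_download,
        "is_attributed": int(row["is_attributed"]),
    }


records = [parse_click(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

Since `attributed_time` is only filled in when the app was downloaded, anything derived from it reveals the target and is better kept for analysis than used as a model input.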

    I'm also including Parquet files with various features for use within the course.

  2. PS3E23 | EDA | Feature Engineering | Ensemble

    • kaggle.com
    Updated Nov 9, 2023
    Cite
    Chokaew Phonhan (2023). PS3E23 | EDA | Feature Engineering | Ensemble [Dataset]. https://www.kaggle.com/datasets/chokaewphonhan/ps3e23-eda-feature-engineering-ensemble
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chokaew Phonhan
    Description

    Dataset

    This dataset was created by Chokaew Phonhan

    Contents

  3. IceCube_FeatureEngineering

    • kaggle.com
    Updated Feb 6, 2023
    Cite
    utm529f (2023). IceCube_FeatureEngineering [Dataset]. https://www.kaggle.com/datasets/utm529fg/icecube-featureengineering
    Dataset updated
    Feb 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    utm529f
    Description
    • sensor_geometory_add_feature.csv
      Adds basic statistics and a flag indicating whether the sensor belongs to DeepCore, for each of the 5160 sensors.
    • train_meta_add_feature.parquet
      Added basic statistics for each event.
  4. MAG Papers

    • kaggle.com
    Updated Sep 24, 2021
    Cite
    Aleks Mashanski (2021). MAG Papers [Dataset]. https://www.kaggle.com/alexmaszanski/mag-papers/code
    Dataset updated
    Sep 24, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aleks Mashanski
    Description

    Context

    This is the data from Chapter 9 of the book Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari. It is suited to the last project of the book: the Academic Papers Recommendation System.

  5. [Otto]Feature-engineering

    • kaggle.com
    Updated Mar 21, 2023
    Cite
    furu-nag (2023). [Otto]Feature-engineering [Dataset]. https://www.kaggle.com/datasets/kunihikofurugori/ottofeatureengineering/discussion
    Dataset updated
    Mar 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    furu-nag
    Description

    Dataset

    This dataset was created by furu-nag

    Contents

  6. titanic_preprocess

    • kaggle.com
    Updated Dec 26, 2021
    Cite
    blue7red (2021). titanic_preprocess [Dataset]. https://www.kaggle.com/rhythmcam/titanic-preprocess/code
    Dataset updated
    Dec 26, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    blue7red
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This library loads the data and preprocesses it in a single call.

    Use the code below:

    train,test = loadAndPreprocess(train_path,test_path)

  7. mystery feature engineering

    • kaggle.com
    Updated May 6, 2025
    Cite
    Aleksandr Razin (2025). mystery feature engineering [Dataset]. https://www.kaggle.com/datasets/alndralndr/mystery-feature-engineering/code
    Dataset updated
    May 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aleksandr Razin
    Description

    Dataset

    This dataset was created by Aleksandr Razin

    Contents

  8. Null Data Feature Engineering Util

    • kaggle.com
    Updated Mar 20, 2022
    Cite
    blue7red (2022). Null Data Feature Engineering Util [Dataset]. https://www.kaggle.com/datasets/rhythmcam/null-data-feature-engineering-util
    Dataset updated
    Mar 20, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    blue7red
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by blue7red

    Released under CC0: Public Domain

    Contents

  9. House Prices With Advanced Feature Engineering

    • kaggle.com
    Updated Jan 16, 2022
    Cite
    ItaiHarpaz (2022). House Prices With Advanced Feature Engineering [Dataset]. https://www.kaggle.com/datasets/itai2468/house-prices-with-advanced-feature-engineering
    Dataset updated
    Jan 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ItaiHarpaz
    Description

    Dataset

    This dataset was created by ItaiHarpaz

    Contents

  10. Feature Extraction

    • kaggle.com
    Updated Sep 4, 2019
    Cite
    Jason (2019). Feature Extraction [Dataset]. https://www.kaggle.com/jclchan/feature-extraction/notebooks
    Dataset updated
    Sep 4, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jason
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.

    Unlike most participants, who took a deep learning approach to this classification problem, we used Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) to extract features from the images, as inputs to simpler ML algorithms such as SVM. This approach shows some promising results.

    There are three files in this dataset:

    1. Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.

    2. train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.

    Columns in train_features.RDS & test_features.RDS:

    1. id_code - image id

    2. diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS

    3. n - number of persistent homology components detected from the image

    4. fd1 to fd21 - proportion of sliding windows having a specific fractal dimension (FD): fd1 = proportion of windows with FD = 2; fd2 = proportion of windows with FD in (2, 2.05]; ...; fd21 = proportion of windows with FD in (2.95, 3.00]

    5. l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
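The fd1 to fd21 columns describe a histogram over 21 bins: one bin for FD exactly 2, followed by 20 half-open bins of width 0.05 up to 3.00. The sketch below shows one way such proportions could be computed from per-window FD values. It is an illustration in Python, not the author's R script, and the function name and rounding tolerance are assumptions.

```python
import math


def fd_proportions(fd_values):
    """Return [fd1, ..., fd21]: the share of windows falling in each FD bin.

    Bin 0 holds FD == 2; bin k (1..20) holds FD in (2 + 0.05*(k-1), 2 + 0.05*k].
    """
    if not fd_values:
        raise ValueError("need at least one window")
    counts = [0] * 21
    for fd in fd_values:
        if not 2 <= fd <= 3:
            raise ValueError(f"fractal dimension out of range: {fd}")
        if fd == 2:
            index = 0
        else:
            # Round before ceil so values on a bin edge (e.g. 2.05) are not
            # pushed into the next bin by floating-point error.
            index = max(1, math.ceil(round((fd - 2) / 0.05, 9)))
        counts[index] += 1
    total = len(fd_values)
    return [count / total for count in counts]
```

For example, `fd_proportions([2.0, 2.05, 2.06, 3.0])` puts one window each into the bins for fd1, fd2, fd3 and fd21.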

  11. amex-feature-engineering-dataset

    • kaggle.com
    Updated Jul 16, 2022
    Cite
    Hongyi Shao (2022). amex-feature-engineering-dataset [Dataset]. https://www.kaggle.com/datasets/hongyishao/amexfeatureengineeringdataset
    Dataset updated
    Jul 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hongyi Shao
    Description

    Dataset

    This dataset was created by Hongyi Shao

    Contents

  12. DimensionalityReduction&FeatureSelection

    • kaggle.com
    Updated Mar 16, 2022
    Cite
    Mukesh Manral (2022). DimensionalityReduction&FeatureSelection [Dataset]. https://www.kaggle.com/mukeshmanral/dimensionalityreductionfeatureselection/code
    Dataset updated
    Mar 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mukesh Manral
    Description

    Dataset

    This dataset was created by Mukesh Manral

    Contents

  13. Predicting Tweet Sentiments

    • kaggle.com
    Updated Jun 10, 2020
    Cite
    Bhuwanesh Tripathi (2020). Predicting Tweet Sentiments [Dataset]. https://www.kaggle.com/datasets/bhuwanesh340/predicting-tweet-sentiments
    Dataset updated
    Jun 10, 2020
    Dataset provided by
    Kaggle
    Authors
    Bhuwanesh Tripathi
    Description

    Dataset

    This dataset was created by Bhuwanesh Tripathi

    Released under Other (specified in description)

    Contents

  14. google_feature_engineering

    • kaggle.com
    Updated Oct 18, 2019
    + more versions
    Cite
    Zhou Hong (2019). google_feature_engineering [Dataset]. https://www.kaggle.com/zhouhong0/google-feature-engineering/discussion
    Dataset updated
    Oct 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zhou Hong
    Description

    Dataset

    This dataset was created by Zhou Hong

    Contents

  15. Feature Importance Analysis- Anagha Joshi

    • kaggle.com
    Updated Mar 22, 2019
    Cite
    Anagha Joshi (2019). Feature Importance Analysis- Anagha Joshi [Dataset]. https://www.kaggle.com/anajoshi/heartrate/code
    Dataset updated
    Mar 22, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anagha Joshi
    Description

    Dataset

    This dataset was created by Anagha Joshi

    Contents

  16. SmilesStrings

    • kaggle.com
    Updated Mar 20, 2025
    Cite
    D Friday (2025). SmilesStrings [Dataset]. https://www.kaggle.com/datasets/dfriday/smilesstrings
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    D Friday
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is just a list of SMILES strings for 9 molecules.

    The attached code is an example of how to generate new molecular features from these (or other) SMILES strings.

  17. pseudolabeling-features-engineering

    • kaggle.com
    Updated Dec 17, 2021
    Cite
    Mathurin Aché (2021). pseudolabeling-features-engineering [Dataset]. https://www.kaggle.com/datasets/mathurinache/pseudolabelingfeaturesengineering/code
    Dataset updated
    Dec 17, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mathurin Aché
    Description

    Dataset

    This dataset was created by Mathurin Aché

    Contents

  18. youtubecommentsdataset

    • kaggle.com
    Updated Jun 7, 2025
    Cite
    Afryan Fernando (2025). youtubecommentsdataset [Dataset]. https://www.kaggle.com/datasets/afryanfernando/youtubecommentsdataset
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Afryan Fernando
    Description

    This dataset comprises user comments collected from YouTube videos discussing Prabowo Subianto’s speech in relation to former U.S. President Donald Trump’s tariff policies. The data is organized into three separate Excel files, each representing a different sentiment distribution:

    1. Balanced Dataset: Contains an equal number of comments across all three sentiment classes — positive, negative, and neutral — to support unbiased model training and evaluation.

    2. Unbalanced Dataset: Reflects the natural distribution of sentiments as observed in the raw data, providing a realistic scenario for real-world sentiment analysis.

    3. Neutral-Inclusive Dataset: A version of the dataset that includes comments labeled as neutral, in addition to positive and negative sentiments, offering a more comprehensive view of public opinion.

    This dataset is suitable for sentiment classification tasks, public opinion mining, and research in political discourse analysis.
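The description does not say how the balanced file was derived from the raw comments. One common option, shown here purely as an assumption, is to downsample every sentiment class to the size of the smallest one. The function name and the `(text, label)` tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(comments, seed=0):
    """Downsample each sentiment class to the size of the smallest class.

    `comments` is a list of (text, label) tuples, with labels such as
    "positive", "negative" and "neutral".
    """
    by_label = defaultdict(list)
    for text, label in comments:
        by_label[label].append((text, label))
    # Every class is cut down to the size of the rarest one.
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)  # avoid leaving the classes grouped together
    return balanced
```

Downsampling discards data from the larger classes, which is the trade-off against the unbalanced file that keeps the natural distribution.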

  19. Tabular_5-folds

    • kaggle.com
    Updated Aug 22, 2021
    Cite
    Proloy Pal (2021). Tabular_5-folds [Dataset]. https://www.kaggle.com/proloypal/tabular-5folds/code
    Dataset updated
    Aug 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Proloy Pal
    Description

    Dataset

    This dataset was created by Proloy Pal

    Contents

  20. Loan Approval Dataset

    • kaggle.com
    Updated Dec 7, 2024
    Cite
    Shrishti Tayde (2024). Loan Approval Dataset [Dataset]. https://www.kaggle.com/datasets/shrishtitayde/loan-approval-dataset/discussion
    Dataset updated
    Dec 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrishti Tayde
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Shrishti Tayde

    Released under MIT

    Contents

Feature Engineering Data

Data for the Feature Engineering Mini-Course
