This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.
For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.
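The sampling arithmetic described above can be checked with a short sketch. The function names are illustrative and not part of the dataset, and the calculation treats "discarding 99%" as keeping exactly 1% of the negatives.

```python
def positive_rate_after_downsampling(original_rate, keep_negative=0.01):
    """Share of positives after keeping every positive and a fraction of negatives."""
    kept_positive = original_rate
    kept_negative = (1.0 - original_rate) * keep_negative
    return kept_positive / (kept_positive + kept_negative)


def implied_original_rate(sample_rate, keep_negative=0.01):
    """Invert the formula above: which original rate yields `sample_rate`?"""
    # From s = p / (p + k * (1 - p)) it follows that p = s * k / (1 - s + s * k).
    return sample_rate * keep_negative / (1.0 - sample_rate + sample_rate * keep_negative)


# A sample with roughly 20% positives implies that positives were
# roughly a quarter of a percent of the clicks before sampling.
original = implied_original_rate(0.20)
print(f"implied original positive rate: {original:.4%}")
print(f"round trip: {positive_rate_after_downsampling(original):.2%}")
```

The point of the check is that the sample is far more balanced than the data it was drawn from, so a model's predicted probabilities on this sample are not calibrated for the original distribution.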
train_sample.csv - Sampled data
Each row of the training data contains a click record, with the following features.
ip: IP address of the click
app: app id for marketing
device: device type id of the user's mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
os: OS version id of the user's mobile phone
channel: channel id of the mobile ad publisher
click_time: timestamp of the click (UTC)
attributed_time: if the user downloaded the app after clicking an ad, this is the time of the app download
is_attributed: the target to be predicted, indicating that the app was downloaded
Note that ip, app, device, os, and channel are encoded.
I'm also including Parquet files with various features for use within the course.
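As a minimal sketch of how the click records described above might be read, the following uses only the Python standard library. The two rows are made up for illustration, and both the timestamp format and the empty attributed_time for clicks without a download are assumptions rather than facts stated in this description.

```python
import csv
import io
from datetime import datetime

# Two made-up rows in the column layout described above (not real data).
SAMPLE_CSV = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
1001,3,1,13,379,2017-11-07 09:30:38,,0
1002,19,0,24,213,2017-11-07 10:05:12,2017-11-07 10:07:45,1
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
ID_COLUMNS = ("ip", "app", "device", "os", "channel")


def parse_click(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    record = {name: int(row[name]) for name in ID_COLUMNS}
    record["click_time"] = datetime.strptime(row["click_time"], TIME_FORMAT)
    # Assumption: attributed_time is empty when the app was not downloaded.
    attributed = row["attributed_time"]
    record["attributed_time"] = (
        datetime.strptime(attributed, TIME_FORMAT) if attributed else None
    )
    record["is_attributed"] = int(row["is_attributed"])
    return record


def load_clicks(file_obj):
    """Read all click records from an open CSV file object."""
    return [parse_click(row) for row in csv.DictReader(file_obj)]


clicks = load_clicks(io.StringIO(SAMPLE_CSV))
positive_share = sum(c["is_attributed"] for c in clicks) / len(clicks)
delay = clicks[1]["attributed_time"] - clicks[1]["click_time"]
print(positive_share, delay.total_seconds())
```

To read the real file, pass `open("train_sample.csv", newline="")` to `load_clicks` instead of the in-memory sample.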
This dataset was created by Chokaew Phonhan
This is the data from Chapter 9 of the book Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari. The data is suitable for the last project of the book: Academic Papers Recommendation System.
This dataset was created by furu-nag
https://creativecommons.org/publicdomain/zero/1.0/
How to use the library to load and preprocess the data in a single step.
Use the code below:
train, test = loadAndPreprocess(train_path, test_path)
This dataset was created by Aleksandr Razin
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by blue7red
Released under CC0: Public Domain
This dataset was created by ItaiHarpaz
https://creativecommons.org/publicdomain/zero/1.0/
The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.
Unlike most participants, who took a deep learning approach to this classification problem, we tried extracting features from the images with Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) and feeding them to simpler ML algorithms such as SVM. This approach shows some promising results.
There are three files in this dataset:
Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.
train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.
Columns in train_features.RDS & test_features.RDS:
id_code - image id
diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS
n - number of persistent homology components detected from the image
fd1 to fd21 - proportion of sliding windows having a fractal dimension (FD) in a specific range: fd1 = proportion of windows having FD=2; fd2 = proportion of windows having FD in (2, 2.05]; ... fd21 = proportion of windows having FD in (2.95, 3.00]
l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
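The fd1 to fd21 binning described above can be sketched as follows. This is an illustration written from the column description only, not the R code in Process_Images.html; the function names and the small tolerance used for floating-point comparisons are assumptions.

```python
import math

BIN_WIDTH = 0.05   # width of the bins behind fd2..fd21
N_COLUMNS = 21     # fd1..fd21


def fd_column(fd, eps=1e-9):
    """Return the column name (fd1..fd21) whose bin contains `fd`."""
    if fd < 2.0 - eps or fd > 3.0 + eps:
        raise ValueError("fractal dimension expected in [2, 3]")
    if fd - 2.0 <= eps:
        return "fd1"  # fd1 holds windows with FD exactly 2
    # fdk for k >= 2 covers (2 + 0.05 * (k - 2), 2 + 0.05 * (k - 1)].
    k = math.ceil((fd - 2.0) / BIN_WIDTH - eps) + 1
    return f"fd{max(2, min(N_COLUMNS, k))}"


def fd_proportions(window_fds):
    """Proportion of windows per column, given one FD value per window."""
    if not window_fds:
        raise ValueError("at least one window is required")
    counts = {f"fd{k}": 0 for k in range(1, N_COLUMNS + 1)}
    for fd in window_fds:
        counts[fd_column(fd)] += 1
    return {name: count / len(window_fds) for name, count in counts.items()}


# Four hypothetical windows: two flat ones, one slightly rough, one maximal.
proportions = fd_proportions([2.0, 2.0, 2.03, 3.0])
print(proportions["fd1"], proportions["fd2"], proportions["fd21"])
```

Because each window falls into exactly one bin, the 21 values for an image sum to 1.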
This dataset was created by Hongyi Shao
This dataset was created by Mukesh Manral
This dataset was created by Bhuwanesh Tripathi
Released under Other (specified in description)
This dataset was created by Zhou Hong
This dataset was created by Anagha Joshi
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is just a list of SMILES strings for 9 molecules.
The attached code is an example of how to generate new molecular features from these (or other) SMILES strings.
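The attached code is not reproduced here. As a rough, library-free sketch of what deriving features from a SMILES string can look like, the following counts atoms and a few structural symbols with regular expressions. It is a simplification (two-digit ring closures written with % are not handled), the feature names are illustrative, and the example molecules are common textbook ones rather than the 9 molecules in this dataset; a cheminformatics toolkit would normally be used for real work.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then the organic subset.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")
BRACKET_ATOM = re.compile(r"\[[^\]]+\]")


def smiles_features(smiles):
    """Return a dict of simple count features for one SMILES string."""
    counts = Counter()
    aromatic = 0
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_SYMBOL.match(token)
            symbol = match.group(1) if match else "?"
        else:
            symbol = token
        if symbol.islower():  # lowercase symbols mark aromatic atoms
            aromatic += 1
        counts[symbol.capitalize()] += 1
    # Ring-closure digits come in pairs; ignore digits inside bracket atoms.
    outside = BRACKET_ATOM.sub("", smiles)
    ring_digits = sum(ch.isdigit() for ch in outside)
    return {
        "atom_counts": dict(counts),
        "heavy_atoms": sum(n for s, n in counts.items() if s != "H"),
        "aromatic_atoms": aromatic,
        "ring_closures": ring_digits // 2,
        "double_bonds": outside.count("="),
        "branches": outside.count("("),
    }


for name, smi in [("ethanol", "CCO"), ("benzene", "c1ccccc1")]:
    print(name, smiles_features(smi))
```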
This dataset was created by Mathurin Aché
This dataset comprises user comments collected from YouTube videos discussing Prabowo Subianto’s speech in relation to former U.S. President Donald Trump’s tariff policies. The data is organized into three separate Excel files, each representing a different sentiment distribution:
Balanced Dataset: Contains an equal number of comments across all three sentiment classes — positive, negative, and neutral — to support unbiased model training and evaluation.
Unbalanced Dataset: Reflects the natural distribution of sentiments as observed in the raw data, providing a realistic scenario for real-world sentiment analysis.
Neutral-Inclusive Dataset: A version of the dataset that includes comments labeled as neutral, in addition to positive and negative sentiments, offering a more comprehensive view of public opinion.
This dataset is suitable for sentiment classification tasks, public opinion mining, and research in political discourse analysis.
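One common way to obtain a balanced set like the one described above is to downsample every class to the size of the smallest one. The sketch below illustrates that idea; it is not necessarily how the balanced file in this dataset was produced, and the `sentiment` column name and the example rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(rows, label_key="sentiment", seed=0):
    """Return a shuffled subset of `rows` with equally many rows per label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical comments: 5 negative, 3 neutral, 2 positive.
comments = (
    [{"text": f"neg {i}", "sentiment": "negative"} for i in range(5)]
    + [{"text": f"neu {i}", "sentiment": "neutral"} for i in range(3)]
    + [{"text": f"pos {i}", "sentiment": "positive"} for i in range(2)]
)
balanced = balance_by_downsampling(comments)
print(Counter(row["sentiment"] for row in balanced))
```

Downsampling discards data from the larger classes, which is why keeping the unbalanced file alongside the balanced one is useful for evaluating on a realistic distribution.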
This dataset was created by Proloy Pal
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Shrishti Tayde
Released under MIT