100+ datasets found

Car Price Prediction Challenge
kaggle.com
Updated Jul 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Deep Contractor (2022). Car Price Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 6, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Deep Contractor
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Assignment

Your notebooks must contain the following steps:

Perform data cleaning and pre-processing.

What steps did you use in this process and how did you clean your data.

Perform exploratory data analysis on the given dataset.

Explain each and every graphs that you make.

Train a ml-model and evaluate it using different metrics.

Why did you choose that particular model? What was the accuracy?

Hyperparameter optimization and feature selection is a plus.

Model deployment and use of ml-flow is a plus.

Perform model interpretation and show feature importance for your model.

Provide some explanation for the above point.

Future steps. Note: try to have your notebooks as presentable as possible.

Dataset Description

CSV file - 19237 rows x 18 columns (Includes Price Columns as Target)

Attributes

ID Price: price of the care(Target Column) Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags

Confused or have any doubts in the data column values? Check the dataset discussion tab!
ISBI Challenge Dataset
kaggle.com
zip
Updated Sep 6, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Soumik Rakshit (2019). ISBI Challenge Dataset [Dataset]. https://www.kaggle.com/datasets/soumikrakshit/isbi-challenge-dataset
Explore at:
zip(14971789 bytes)Available download formats
Dataset updated
Sep 6, 2019
Authors
Soumik Rakshit
Description
Dataset

This dataset was created by Soumik Rakshit

Contents
Z
Solution #4 for Predicting Molecular Properties Kaggle Competition
data.niaid.nih.gov
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Popovic, Milos (2020). Solution #4 for Predicting Molecular Properties Kaggle Competition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3406153
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Stojanovic, Luka
Tijanic, Nebojsa
Rakocevic, Goran
Popovic, Milos
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Code and additional data for solution #4 in Predicting Molecular Properties competition, described in #4 Solution [Hyperspatial Engineers].
DeepfakeArt Challenge
kaggle.com
Updated Sep 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel Mao (2023). DeepfakeArt Challenge [Dataset]. https://www.kaggle.com/datasets/danielmao2019/deepfakeart
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 9, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Daniel Mao
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
https://github.com/h-aboutalebi/DeepfakeArt/raw/main/images/all.jpg">

DeepfakeArt Challenge Benchmark Dataset for Generative AI Art Forgery and Data Poisoning Detection

The tremendous recent advances in generative artificial intelligence techniques have led to significant successes and promise in a wide range of different applications ranging from conversational agents and textual content generation to voice and visual synthesis. Amid the rise in generative AI and its increasing widespread adoption, there has been significant growing concern over the use of generative AI for malicious purposes. In the realm of visual content synthesis using generative AI, key areas of significant concern has been image forgery (e.g., generation of images containing or derived from copyright content), and data poisoning (i.e., generation of adversarially contaminated images). Motivated to address these key concerns to encourage responsible generative AI, we introduce the DeepfakeArt Challenge, a large-scale challenge benchmark dataset designed specifically to aid in the building of machine learning algorithms for generative AI art forgery and data poisoning detection. Comprising of over 32,000 records across a variety of generative forgery and data poisoning techniques, each entry consists of a pair of images that are either forgeries / adversarially contaminated or not. Each of the generated images in the DeepfakeArt Challenge benchmark dataset has been quality checked in a comprehensive manner. The DeepfakeArt Challenge is a core part of GenAI4Good, a global open source initiative for accelerating machine learning for promoting responsible creation and deployment of generative AI for good.

The generative forgery and data poisoning methods leveraged in the DeepfakeArt Challenge benchmark dataset include: - Inpainting - Style Transfer - Adversarial data poisoning - Cutmix

Team Members: - Hossein Aboutalebi - Dayou Mao - Carol Xu - Alexander Wong

The Github repo associated with the DeepfakeArt Challenge benchmark dataset is available here

The DeepfakeArt Challenge paper is available here

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F8198772%2Fa36c126fe0329c6478a3fce34ad6c138%2Flogo.jpg?generation=1683556962692093&alt=media" alt="logo"> Part of https://github.com/h-aboutalebi/DeepfakeArt/raw/main/images/genai4good.png">

Data distribution

The DeepfakeArt Challenge benchmark dataset, as proposed, encompasses over 32,000 records, incorporating a wide spectrum of generative forgery and data poisoning techniques. Each record is represented by a pair of images, which could be either forgeries, adversarially compromised, or not. Fig. 2 (a) depicts the overall distribution of data, differentiating between forgery/adversarially contaminated records and untainted ones. The dispersion of data across various generative forgery and data poisoning techniques is demonstrated in Fig. 2 (b). Notably, as presented Fig. in 2 (a), the dataset contains almost double the number of dissimilar pairs compared to similar pairs, making the identification of similar pairs substantially more challenging given that two-thirds of the dataset comprises dissimilar pairs.

https://raw.githubusercontent.com/h-aboutalebi/DeepfakeArt/main/images/dist.png">

Inpainting Category

The source dataset for the inpainting category is WikiArt (ref). Each image is sampled randomly from the dataset as the source image to generate forgery images. Each record in this category consists of three images:

source image: The source image used to create a forgery image from

inpainting image: The inpainting image generated by Stable Diffusion 2 model (ref)

masking image: black-white image which white parts depicts which parts of original image is inpainted by Stable Diffusion 2 to generate inpainting image

The prompt used for the generation of the inpainting image is: "generate a painting compatible with the rest of the image"

This category consists of more than 5063 records. The original images are masked between 40%-60%. We applied one of the followed masking schema randomly:

side masking: where the top side, bottom side, right side or left side of the source image is maked

diagonal masking: where the upper right, upper left, lower right, or lower left diagonal side of thw source image is masked

random masking: where randomly selected parts of the source image are masked

The code for the data generation in this category can be found here

Style Tran...
h
BirdCLEF-Challenge2023-Kaggle
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bernardo Cecchetto, BirdCLEF-Challenge2023-Kaggle [Dataset]. https://huggingface.co/datasets/bernardocecchetto/BirdCLEF-Challenge2023-Kaggle
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Bernardo Cecchetto
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains audios of 264 species of birds singing that were all processed. It was processed as follows:

Stereo to Mono Resampled 16kHz High Pass Filter (1500Hz and filter order of 16) Normalized

The raw dataset was provided by the BirdCLEF 2023 challenge from Kaggle. You can access it in https://www.kaggle.com/competitions/birdclef-2023/data
h
Eedi-competition-kaggle-prompt-formats
huggingface.co
Updated Sep 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
EVANGELOS PAPAMITSOS (2024). Eedi-competition-kaggle-prompt-formats [Dataset]. https://huggingface.co/datasets/VaggP/Eedi-competition-kaggle-prompt-formats
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 29, 2024
Authors
EVANGELOS PAPAMITSOS
Description
VaggP/Eedi-competition-kaggle-prompt-formats dataset hosted on Hugging Face and contributed by the HF Datasets community
Data characteristics for the Kaggle.com seizure forecasting contest.
plos.figshare.com
xls
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francisco Javier Muñoz-Almaraz; Francisco Zamora-Martínez; Paloma Botella-Rocamora; Juan Pardo (2023). Data characteristics for the Kaggle.com seizure forecasting contest. [Dataset]. http://doi.org/10.1371/journal.pone.0178808.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0178808.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Francisco Javier Muñoz-Almaraz; Francisco Zamora-Martínez; Paloma Botella-Rocamora; Juan Pardo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Source: [9].
DL challenge dataset
kaggle.com
Updated Dec 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Riccardo Inverardi Galli (2024). DL challenge dataset [Dataset]. https://www.kaggle.com/datasets/ricvigi/dl-challenge-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 2, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Riccardo Inverardi Galli
Description
Dataset

This dataset was created by Riccardo Inverardi Galli

Contents
FSDKaggle2019
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eduardo Fonseca; Eduardo Fonseca; Manoj Plakal; Frederic Font; Frederic Font; Daniel P. W. Ellis; Daniel P. W. Ellis; Xavier Serra; Xavier Serra; Manoj Plakal (2020). FSDKaggle2019 [Dataset]. http://doi.org/10.5281/zenodo.3612637
Explore at:
bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3612637
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Eduardo Fonseca; Eduardo Fonseca; Manoj Plakal; Frederic Font; Frederic Font; Daniel P. W. Ellis; Daniel P. W. Ellis; Xavier Serra; Xavier Serra; Manoj Plakal
Description
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

Citation

If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

Data curators

Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

Contact

You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

ABOUT FSDKaggle2019

Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

FSDKaggle2019 employs audio clips from the following sources:

Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

Ground Truth Labels

The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

curated train set: correct (but potentially incomplete) labels

noisy train set: noisy labels

test set: correct and complete labels

Further details can be found below in the sections for each set.

Format

All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

DATA SPLIT

FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

Curated train set

The curated train set consists of manually-labeled data from FSD.

Number of clips/class: 75 except in a few cases (where there are less)

Total number of clips: 4970

Avg number of labels/clip: 1.2

Total duration: 10.5 hours

The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

Noisy train set

The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

Number of clips/class: 300

Total number of clips: 19,815

Avg number of labels/clip: 1.2

Total duration: ~80 hours

The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

Test set

The test set is used for system evaluation and consists of manually-labeled data from FSD.

Number of clips/class: between 50 and 150

Total number of clips: 4481

Avg number of labels/clip: 1.4

Total duration: 12.9 hours

The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Except human error, the label(s) are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content out of the vocabulary.

During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public
Arm UNICEF Disaster Vulnerability Challenge
kaggle.com
Updated Mar 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Taki Hasan (2024). Arm UNICEF Disaster Vulnerability Challenge [Dataset]. https://www.kaggle.com/datasets/takihasan/arm-unicef-disaster-vulnerability-challenge
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 15, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Taki Hasan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This is a dataset for the competition named - "Arm UNICEF Disaster Vulnerability Challenge" hosted in Zindi. It starts from 15th Mar, 2024 and runs till 23rd June 2024. You can visit the official page of the challenge here. All rights belong to the original competition hosts. It has been submitted here so that anyone who wants to build their models or want to use the convenience of kaggle api to download the respective dataset can do so.
h
olympiad-math-contest-llama3-20k
huggingface.co
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kevin Amiri (2024). olympiad-math-contest-llama3-20k [Dataset]. https://huggingface.co/datasets/kevin009/olympiad-math-contest-llama3-20k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 1, 2024
Authors
Kevin Amiri
Description
AMC/AIME Mathematics Problem and Solution Dataset

Dataset Details

Dataset Name: AMC/AIME Mathematics Problem and Solution Dataset Version: 1.0 Release Date: 2024-06-1 Authors: Kevin Amiri

Intended Use

Primary Use: The dataset is created and intended for research and an AI Mathematical Olympiad Kaggle competition. Intended Users: Researchers in AI & mathematics or science.

Dataset Composition

Number of Examples: 20,300 problems and solution sets… See the full description on the dataset page: https://huggingface.co/datasets/kevin009/olympiad-math-contest-llama3-20k.
P
Criteo Dataset
paperswithcode.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Criteo Dataset [Dataset]. https://paperswithcode.com/dataset/criteo
Explore at:
Description
Criteo contains 7 days of click-through data, which is widely used for CTR prediction benchmarking. There are 26 anonymous categorical fields and 13 continuous fields in Criteo dataset.
ISIC-2017
kaggle.com
Updated Feb 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John C (2023). ISIC-2017 [Dataset]. https://www.kaggle.com/datasets/johnchfr/isic-2017
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 22, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
John C
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Source: https://challenge.isic-archive.com/data/#2017 License: CC0 (public domain)
Google Landmarks Dataset v2
github.com
paperswithcode.com
+1more
Updated Sep 27, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
Explore at:
Dataset updated
Sep 27, 2019
Dataset provided by
Googlehttp://google.com/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
o
Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Jun 13, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rakshitha Godahewa; Christoph Bergmeir; Geoff Webb (2020). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. http://doi.org/10.5281/zenodo.3898473
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3898473
Dataset updated
Jun 13, 2020
Authors
Rakshitha Godahewa; Christoph Bergmeir; Geoff Webb
Description
This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145063 daily time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2017-09-10. The original dataset contains missing values. They have been simply replaced by zeros. {"references": ["Google, 2017. Web traffic time series forecasting. URL https://www.kaggle.com/c/web-traffic-time-series-forecasting"]}
f
Comparison of GA-XGBoost with XGBoost and LightGBM test results.
figshare.com
xls
Updated Dec 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ke Peng; Yan Peng; Wenguang Li (2023). Comparison of GA-XGBoost with XGBoost and LightGBM test results. [Dataset]. http://doi.org/10.1371/journal.pone.0289724.t008
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0289724.t008
Dataset updated
Dec 8, 2023
Dataset provided by
PLOS ONE
Authors
Ke Peng; Yan Peng; Wenguang Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of GA-XGBoost with XGBoost and LightGBM test results.
CAFA Protein Function Annotation Challenges
kaggle.com
zip
Updated May 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Chervov (2023). CAFA Protein Function Annotation Challenges [Dataset]. https://www.kaggle.com/datasets/alexandervc/cafa-protein-function-annotation-challenges
Explore at:
zip(415515112 bytes)Available download formats
Dataset updated
May 29, 2023
Authors
Alexander Chervov
Description
Dataset

This dataset was created by Alexander Chervov

Contents
CT-FAN-21 corpus: A dataset for Fake News Detection
zenodo.org
Updated Oct 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4714517
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

Citation

Please cite our work as

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

Subtask 3A: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. The training data will be released in batches and roughly about 900 articles with the respective label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories like health, election, crime, climate, election, education. This task will be offered for a subset of the data of Subtask 3A.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3a

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Task 3b

public_id- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

domain - domain of the given news article(applicable only for task B)

Output data format

Task 3a

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating 1, false 2, true

Task 3b

public_id- Unique identifier of the news article

predicted_domain- predicted domain

Sample file

public_id, predicted_domain 1, health 2, crime

Additional data for Training

To train your model, the participant can use additional data with a similar format; some datasets are available over the web. We don't provide the background truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible source:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

Fake news article used for task 3b is a subset of task 3a.

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

Submission Link: https://competitions.codalab.org/competitions/31238

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingualcross-domain fact check news dataset for covid-19,” inWorkshop Proceedings of the 14th International AAAIConference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
From Scores to Seats: The Grad School ML Challenge
kaggle.com
Updated Apr 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ifeoluwa Adewumi (2025). From Scores to Seats: The Grad School ML Challenge [Dataset]. https://www.kaggle.com/datasets/ifeadewumi/from-scores-to-seats-the-grad-school-ml-challenge
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 4, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ifeoluwa Adewumi
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Ifeoluwa Adewumi

Released under MIT

Contents
ML Challenge Data
kaggle.com
Updated Jul 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ved Dubey (2021). ML Challenge Data [Dataset]. https://www.kaggle.com/datasets/veddubey/ml-challenge-data/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 30, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ved Dubey
Description
Dataset

This dataset was created by Ved Dubey

Contents

Facebook

Twitter

Click to copy link

Link copied

Cite

Deep Contractor (2022). Car Price Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge

Car Price Prediction Challenge

A dataset to practice regression by predicting the prices of different cars.

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 6, 2022

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Deep Contractor

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Assignment

Your notebooks must contain the following steps:

Perform data cleaning and pre-processing.
- What steps did you use in this process and how did you clean your data.
Perform exploratory data analysis on the given dataset.
- Explain each and every graphs that you make.
Train a ml-model and evaluate it using different metrics.
- Why did you choose that particular model? What was the accuracy?
Hyperparameter optimization and feature selection is a plus.
Model deployment and use of ml-flow is a plus.
Perform model interpretation and show feature importance for your model.
- Provide some explanation for the above point.
Future steps. Note: try to have your notebooks as presentable as possible.

Dataset Description

CSV file - 19237 rows x 18 columns (Includes Price Columns as Target)

Attributes

ID Price: price of the care(Target Column) Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags

Confused or have any doubts in the data column values? Check the dataset discussion tab!

Clear search

Close search

Google apps

Main menu

Car Price Prediction Challenge

Assignment

Dataset Description

Attributes

ISBI Challenge Dataset

Dataset

Contents

Solution #4 for Predicting Molecular Properties Kaggle Competition

DeepfakeArt Challenge

DeepfakeArt Challenge Benchmark Dataset for Generative AI Art Forgery and Data Poisoning Detection

Data distribution

Inpainting Category

Style Tran...

BirdCLEF-Challenge2023-Kaggle

Eedi-competition-kaggle-prompt-formats

Data characteristics for the Kaggle.com seizure forecasting contest.

DL challenge dataset

Dataset

Contents

FSDKaggle2019

Arm UNICEF Disaster Vulnerability Challenge

olympiad-math-contest-llama3-20k

Criteo Dataset

ISIC-2017

Google Landmarks Dataset v2

Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

Comparison of GA-XGBoost with XGBoost and LightGBM test results.

CAFA Protein Function Annotation Challenges

Dataset

Contents

CT-FAN-21 corpus: A dataset for Fake News Detection

From Scores to Seats: The Grad School ML Challenge

Dataset

Contents

ML Challenge Data

Dataset

Contents

Car Price Prediction Challenge

A dataset to practice regression by predicting the prices of different cars.

Assignment

Dataset Description

Attributes