100+ datasets found
  1. Fake News Challenge

    • kaggle.com
    zip
    Updated Apr 4, 2021
    Cite
    Abhinav Kumar Jha (2021). Fake News Challenge [Dataset]. https://www.kaggle.com/datasets/abhinavkrjha/fake-news-challenge
    Explore at:
    Available download formats: zip (5340415 bytes)
    Dataset updated
    Apr 4, 2021
    Authors
    Abhinav Kumar Jha
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The issue of “fake news” has arisen recently as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge was organized in early 2017 to encourage development of machine learning-based classification systems that perform “stance detection” -- i.e. identifying whether a particular news headline “agrees” with, “disagrees” with, “discusses,” or is unrelated to a particular news article -- in order to allow journalists and others to more easily find and investigate possible instances of “fake news.”

    Content

    The data consists of (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:

    train_bodies.csv

    This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)

    train_stances.csv

    This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).

    Distribution of the data

    The distribution of Stance classes in train_stances.csv is as follows:

    | rows | unrelated | discuss | agree | disagree |
    | --- | --- | --- | --- | --- |
    | 49972 | 0.73131 | 0.17828 | 0.0736012 | 0.0168094 |

    There are 4 possible classifications: 1. The article text agrees with the headline. 2. The article text disagrees with the headline. 3. The article text is a discussion of the headline, without taking a position on it. 4. The article text is unrelated to the headline (i.e. it doesn’t address the same topic).
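
    A minimal sketch of loading the two CSVs, joining them on Body ID, and reproducing the class distribution above (assuming pandas and the column names described here):

    import pandas as pd

    # Load the two provided CSVs.
    bodies = pd.read_csv("train_bodies.csv")    # columns: Body ID, articleBody
    stances = pd.read_csv("train_stances.csv")  # columns: Headline, Body ID, Stance

    # Attach each (Headline, Stance) pair to its article body.
    data = stances.merge(bodies, on="Body ID", how="left")

    # Number of instances and the Stance class proportions.
    print(len(data))
    print(data["Stance"].value_counts(normalize=True))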

    Acknowledgements

    For details of the task, see FakeNewsChallenge.org

  2. Kaggle Display Advertising Challenge dataset

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Dec 24, 2017
    Cite
    Jiang, Zilong (2017). Kaggle Display Advertising Challenge dataset [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001796020
    Explore at:
    Dataset updated
    Dec 24, 2017
    Authors
    Jiang, Zilong
    Description

    The Criteo Display Advertising Challenge dataset, provided by Criteo on the machine learning website Kaggle for advertising CTR (click-through rate) prediction.

  3. Google Universal Embedding Challenge Github Repo

    • kaggle.com
    zip
    Updated Jul 12, 2022
    Cite
    Darien Schettler (2022). Google Universal Embedding Challenge Github Repo [Dataset]. https://www.kaggle.com/datasets/dschettler8845/google-universal-embedding-challenge-github-repo
    Explore at:
    Available download formats: zip (13561 bytes)
    Dataset updated
    Jul 12, 2022
    Authors
    Darien Schettler
    Description

    Universal Embedding Challenge baseline model implementation.

    This folder contains the baseline model implementation for the Kaggle universal image embedding challenge based on

    Following the above ideas, we also add a 64-dimensional projection layer on top of the Vision Transformer base model as the final embedding, since the competition requires embeddings of at most 64 dimensions. Please find more details in image_classification.py.
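
    As a rough illustration of that idea (not the repository's actual code, which lives in image_classification.py), the sketch below assumes a generic Keras backbone that maps an image to a feature vector and simply adds a 64-dimensional dense projection; the input shape is an assumption:

    import tensorflow as tf

    def build_embedding_model(backbone: tf.keras.Model) -> tf.keras.Model:
        # The input shape here is illustrative; the real preprocessing is defined in the repository.
        inputs = tf.keras.Input(shape=(224, 224, 3))
        features = backbone(inputs)                        # ViT base model features
        embeddings = tf.keras.layers.Dense(64)(features)   # 64-dim projection, the competition's limit
        return tf.keras.Model(inputs, embeddings)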

    To use the code, first install the prerequisites:

    pip install -r universal_embedding_challenge/requirements.txt
    
    git clone https://github.com/tensorflow/models.git /tmp/models
    export PYTHONPATH=$PYTHONPATH:/tmp/models
    pip install --user -r /tmp/models/official/requirements.txt
    

    Second, download the ImageNet-1k data in TFRecord format from https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0 and https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-1, and merge them under the folder imagenet-2012-tfrecord/. As a result, the paths to the training and validation datasets should be imagenet-2012-tfrecord/train* and imagenet-2012-tfrecord/validation*, respectively.
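
    A quick way to check that the merged shards are picked up by those patterns (standard TensorFlow APIs; the repository's actual input pipeline handles parsing and preprocessing):

    import tensorflow as tf

    # The merged training and validation shards should match these patterns.
    train_files = tf.io.gfile.glob("imagenet-2012-tfrecord/train*")
    val_files = tf.io.gfile.glob("imagenet-2012-tfrecord/validation*")
    print(len(train_files), len(val_files))

    # Raw records can then be read with a TFRecordDataset.
    train_ds = tf.data.TFRecordDataset(train_files)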

    The trainer for the model is implemented in train.py, and the following example launches the training

    python -m universal_embedding_challenge.train \
     --experiment=vit_with_bottleneck_imagenet_pretrain \
     --mode=train_and_eval \
     --model_dir=/tmp/imagenet1k_test
    

    The trained model checkpoints can be further converted to SavedModel format using export_saved_model.py for Kaggle submission.

    The code to compute metrics for Universal Embedding Challenge is implemented in metrics.py and the code to read the solution file is implemented in read_retrieval_solution.py.

  4. 2018 Kaggle Machine Learning Challenge dataset

    • kaggle.com
    zip
    Updated Nov 28, 2021
    + more versions
    Cite
    Sreenanda Sai Dasari (2021). 2018 Kaggle Machine Learning Challenge dataset [Dataset]. https://www.kaggle.com/datasets/sreenandasaidasari/2021-kaggle-machine-learning-challenge
    Explore at:
    Available download formats: zip (4127154 bytes)
    Dataset updated
    Nov 28, 2021
    Authors
    Sreenanda Sai Dasari
    Description

    Dataset

    This dataset was created by Sreenanda Sai Dasari

    Contents

  5. CAFA Protein Function Annotation Challenges

    • kaggle.com
    zip
    Updated May 29, 2023
    Cite
    Alexander Chervov (2023). CAFA Protein Function Annotation Challenges [Dataset]. https://www.kaggle.com/datasets/alexandervc/cafa-protein-function-annotation-challenges
    Explore at:
    Available download formats: zip (415515112 bytes)
    Dataset updated
    May 29, 2023
    Authors
    Alexander Chervov
    Description

    Dataset

    This dataset was created by Alexander Chervov

    Contents

  6. issues-kaggle-notebooks

    • huggingface.co
    Updated Aug 12, 2025
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks

      Description
    

    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

  7. Competitions Shake-up

    • kaggle.com
    zip
    Updated Sep 27, 2020
    Cite
    Daniboy370 (2020). Competitions Shake-up [Dataset]. https://www.kaggle.com/daniboy370/competitions-shakeup
    Explore at:
    Available download formats: zip (388789 bytes)
    Dataset updated
    Sep 27, 2020
    Authors
    Daniboy370
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Shake-what ?!

    The Shake phenomenon occurs when the competition shifts between two different test sets:

    \[ \text{Public test set} \Rightarrow \text{Private test set} \quad \Leftrightarrow \quad \text{LB-public} \Rightarrow \text{LB-private} \]

    The private test set, which was previously unavailable, becomes available, and the models' scores are recalculated. This re-evaluation triggers a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset and act to improve their model until the deadline.

    Unable to find a uniform conventional term for this mechanism, I will use my common sense to define the following intuition:

    <img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">

    From the starter kernel:

    <img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">

    Content

    Seven competition datasets scraped from Kaggle:

    | Competition | Name of file |
    | --- | --- |
    | Elo Merchant Category Recommendation | df_{Elo} |
    | Human Protein Atlas Image Classification | df_{Protein} |
    | Humpback Whale Identification | df_{Humpback} |
    | Microsoft Malware Prediction | df_{Microsoft} |
    | Quora Insincere Questions Classification | df_{Quora} |
    | TGS Salt Identification Challenge | df_{TGS} |
    | VSB Power Line Fault Detection | df_{VSB} |

    As an example, consider the following dataframe from the Quora competition:

    | Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
    | --- | --- | --- | --- | --- | --- |
    | The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
    | ... | ... | ... | ... | ... | ... |
    | D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
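
    A minimal sketch of recomputing the shake for one of the scraped dataframes with pandas, assuming the column names shown above (the file name below is hypothetical; the actual names in the dataset may differ):

    import pandas as pd

    # Hypothetical file name; the dataset ships one dataframe per competition (df_{Quora}, df_{TGS}, ...).
    df = pd.read_csv("df_Quora.csv")

    # Shake is the rank movement from the public to the private leaderboard.
    df["Shake"] = df["Rank-public"] - df["Rank-private"]

    # Largest drops (overfitting to the public test set) and largest climbs.
    print(df.nsmallest(5, "Shake")[["Team Name", "Rank-public", "Rank-private", "Shake"]])
    print(df.nlargest(5, "Shake")[["Team Name", "Rank-public", "Rank-private", "Shake"]])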

    I encourage everybody to investigate the dataset thoroughly in search of interesting findings!

    \[ \text{Enjoy !}\]

  8. Kaggle display advertising challenge dataset - Dataset - LDM

    • service.tib.eu
    Updated Jan 3, 2025
    Cite
    (2025). Kaggle display advertising challenge dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/kaggle-display-advertising-challenge-dataset
    Explore at:
    Dataset updated
    Jan 3, 2025
    Description

    The Kaggle display advertising challenge dataset.

  9. kaggle-nlp-getting-start

    • huggingface.co
    Cite
    hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    hui
    Description

    Dataset Summary

    Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

    Columns

    id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.

  10. Kaggle Dataset

    • universe.roboflow.com
    zip
    Updated Oct 2, 2022
    Cite
    k (2022). Kaggle Dataset [Dataset]. https://universe.roboflow.com/k-5hqao/kaggle-wlshw
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2022
    Dataset authored and provided by
    k
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Variables measured
    K Bounding Boxes
    Description

    Kaggle

    ## Overview
    
    Kaggle is a dataset for object detection tasks - it contains K annotations for 779 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
    
  11. DSN Comedy Kaggle Challenge

    • kaggle.com
    zip
    Updated Sep 21, 2018
    Cite
    Gbolahan (2018). DSN Comedy Kaggle Challenge [Dataset]. https://www.kaggle.com/datasets/gbolahack/dsn-comedy-kaggle-challenge
    Explore at:
    Available download formats: zip (7811552 bytes)
    Dataset updated
    Sep 21, 2018
    Authors
    Gbolahan
    Description

    Dataset

    This dataset was created by Gbolahan

    Contents

  12. Kaggle Survey Challenge - All Kernels

    • kaggle.com
    zip
    Updated Nov 22, 2022
    Cite
    KlemenVodopivec (2022). Kaggle Survey Challenge - All Kernels [Dataset]. https://www.kaggle.com/datasets/klemenvodopivec/kaggle-survey-challenge-all-kernels/data
    Explore at:
    Available download formats: zip (206438 bytes)
    Dataset updated
    Nov 22, 2022
    Authors
    KlemenVodopivec
    Description

    Collection of kernel submissions for the Kaggle survey competitions from 2017 to 2022. As this data was collected during the 2022 survey competition, it does not contain all the kernels for 2022.

  13. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (167219625372 bytes)
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
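
    Based on that layout, a small sketch for locating the folder of a given KernelVersions id (the exact folder naming, e.g. whether numbers are zero-padded, and the file extension are assumptions to verify against the dataset itself):

    from pathlib import Path

    def kernel_version_dir(version_id: int, root: str = ".") -> Path:
        # Top-level folder groups one million versions, e.g. 123 -> 123,000,000..123,999,999.
        top = version_id // 1_000_000
        # Sub-folder groups one thousand versions, e.g. 123/456 -> 123,456,000..123,456,999.
        sub = (version_id // 1_000) % 1_000
        return Path(root) / str(top) / str(sub)

    # Example: version 123456789 should live under 123/456/; the file name matches the id,
    # with an extension that depends on the notebook language (.py, .r, .ipynb).
    print(kernel_version_dir(123_456_789))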

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  14. Deepfake Detection Challenge Dataset - Face Images

    • kaggle.com
    zip
    Updated Sep 11, 2024
    Cite
    VIJAY DEVANE (2024). Deepfake Detection Challenge Dataset - Face Images [Dataset]. https://www.kaggle.com/datasets/vijaydevane/deepfake-detection-challenge-dataset-face-images
    Explore at:
    Available download formats: zip (126132864 bytes)
    Dataset updated
    Sep 11, 2024
    Authors
    VIJAY DEVANE
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by VIJAY DEVANE

    Released under Apache 2.0

    Contents

  15. KaggleX skill assessment challenge

    • kaggle.com
    zip
    Updated Jun 5, 2024
    Cite
    khadijat agboola (2024). KaggleX skill assessment challenge [Dataset]. https://www.kaggle.com/datasets/khadijatagboola/kagglex-skill-assessment-challenge
    Explore at:
    Available download formats: zip (2272868 bytes)
    Dataset updated
    Jun 5, 2024
    Authors
    khadijat agboola
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset for this competition (both train and test) was generated from a deep learning model fine-tuned on the Used Car Price Prediction Dataset. While feature distributions are similar to the original, they are not identical. You are welcome to use the original dataset to explore differences and to see if incorporating it into your training improves model performance, though it is not mandatory.

    Files:

    train.csv: The training dataset; refer to the original dataset link above for column descriptions.
    test.csv: The test dataset; your objective is to predict the target value, Price.
    sample_submission.csv: A sample submission file in the correct format.
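
    A minimal sketch of producing a correctly formatted submission, assuming train.csv contains the Price target and sample_submission.csv already has the expected id and Price columns (a constant-prediction baseline, not a real model):

    import pandas as pd

    train = pd.read_csv("train.csv")
    submission = pd.read_csv("sample_submission.csv")

    # Naive baseline: predict the mean training price for every test row,
    # keeping whatever id column the sample file already provides.
    submission["Price"] = train["Price"].mean()
    submission.to_csv("submission.csv", index=False)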

  16. AI Trust Level Prediction Challenge

    • kaggle.com
    zip
    Updated Sep 14, 2025
    Cite
    Gaurav Dutta (2025). AI Trust Level Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/ai-trust-level-prediction-challenge
    Explore at:
    Available download formats: zip (290111 bytes)
    Dataset updated
    Sep 14, 2025
    Authors
    Gaurav Dutta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Challenge Details: In this data-driven hackathon, participants will develop machine learning models to predict the AI Trust Level (%) based on AI Perception Data.

    Submission and Evaluation

    Submission Format: Participants must submit their predictions in the format specified in submission.csv.

    Evaluation Metric: Submissions will be evaluated on the R2_Score, measuring how well the model predicts the AI Trust Level (%).

    Leaderboard: Track your progress and aim for the top spot on the leaderboard.
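
    For local validation, the same metric can be computed with scikit-learn's r2_score before submitting (the values below are toy numbers, not from the competition):

    from sklearn.metrics import r2_score

    # Toy example: held-out ground-truth trust levels vs. model predictions.
    y_true = [72.0, 55.5, 90.0, 34.0]
    y_pred = [70.1, 58.2, 86.5, 40.0]
    print(r2_score(y_true, y_pred))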

  17. Top 1000 Kaggle Datasets

    • kaggle.com
    zip
    Updated Jan 3, 2022
    Cite
    Trrishan (2022). Top 1000 Kaggle Datasets [Dataset]. https://www.kaggle.com/datasets/notkrishna/top-1000-kaggle-datasets
    Explore at:
    Available download formats: zip (34269 bytes)
    Dataset updated
    Jan 3, 2022
    Authors
    Trrishan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that it was acquiring Kaggle.[1][2]

    Source: Kaggle

  18. Open AI Caribbean Challenge

    • kaggle.com
    zip
    Updated Nov 15, 2019
    Cite
    Sayantan Das (2019). Open AI Caribbean Challenge [Dataset]. https://www.kaggle.com/datasets/sayantandas30011998/open-ai-caribbean-challenge
    Explore at:
    Available download formats: zip (133121 bytes)
    Dataset updated
    Nov 15, 2019
    Authors
    Sayantan Das
    Area covered
    Caribbean
    Description

    Dataset

    This dataset was created by Sayantan Das

    Contents

  19. Arcade Natural Language to Code Challenge

    • kaggle.com
    zip
    Updated Feb 22, 2023
    Cite
    Google AI (2023). Arcade Natural Language to Code Challenge [Dataset]. https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset
    Explore at:
    Available download formats: zip (3921922 bytes)
    Dataset updated
    Feb 22, 2023
    Dataset authored and provided by
    Google AI
    Description

    Arcade: Natural Language to Code Generation in Interactive Computing Notebooks

    Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracies of code large language models in generating data science programs given natural language instructions. Please read our paper for more details.

    Note👉 This Kaggle dataset only contains the dataset files of Arcade. Refer to our main Github repository for detailed instructions to use this dataset.

    Folder Structure

    Below is the structure of its content:

    └── ./
      ├── existing_tasks/ # Problems derived from existing data science notebooks on Github
      │  ├── metadata.json # Metadata by `build_existing_tasks_split.py` to reproduce this split.
      │  ├── artifacts/ # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
      │  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      ├── new_tasks/
      │  ├── dataset.json # Original, unpreprocessed dataset
      │  ├── kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
      │  ├── artifacts/ # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
      │  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      └── checksums.txt # Table of MD5 checksums of dataset files.
    

    Dataset File Structure

    All the dataset '*.json' files follow the same structure. Each dataset file is a Json-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:

    {
      "notebook_name": "Name of the notebook.",
      "work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
      "annotator": "Anonymized annotator Id.",
      "turns": [
        # A list of natural language to code examples using the current notebook context.
        {
          "input": "Prompt to a code generation model.",
          "turn": {
            "intent": {
              "value": "Annotated NL intent for the current turn.",
              "is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
              "cell_idx": "Index of the intent Markdown cell.",
              "line_span": "Line span of the intent.",
              "not_sure": "Annotation confidence.",
              "output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
            },
            "code": {
              "value": "Reference code solution.",
              "cell_idx": "Cell index of the code cell containing the solution.",
              "num_lines": "Number of lines in the reference solution.",
              "line_span": "Line span.",
            },
            "code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
            "delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
            "metadata": {
              "annotator_id": "Annotator Id",
              "num_code_lines": "Metadata, please ignore.",
              "utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
            },
          },
          "notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
          "metadata": {
            # A dict of metadata of this turn.
            "context_cells": [ # A list of context cells before the problem.
              {
                "cell_type": "code|markdown",
                "source": "Cell content."
              },
            ],
            "delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
            # The following fields only occur in datasets inlined with schema descriptions.
            "context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
            "inten...
    
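    A minimal sketch of reading one of the dataset JSON files and walking the episode and turn structure documented above (field names are taken from that documentation; the file path is one of the files listed in the folder structure):

    import json

    # Each dataset file is a JSON-serialized list of episodes (annotated notebooks).
    with open("new_tasks/dataset.json") as f:
        episodes = json.load(f)

    for episode in episodes:
        print(episode["notebook_name"], episode["work_dir"])
        for example in episode["turns"]:
            prompt = example["input"]                    # prompt to a code generation model
            intent = example["turn"]["intent"]["value"]  # annotated NL intent
            code = example["turn"]["code"]["value"]      # reference code solution
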
  20. Mini RF Challenge Evaluation

    • kaggle.com
    zip
    Updated Aug 25, 2023
    Cite
    Gary CF Lee (2023). Mini RF Challenge Evaluation [Dataset]. https://www.kaggle.com/datasets/garycflee/mini-rf-challenge-evaluation
    Explore at:
    Available download formats: zip (1522095544 bytes)
    Dataset updated
    Aug 25, 2023
    Authors
    Gary CF Lee
    Description

    Dataset

    This dataset was created by Gary CF Lee

    Contents
