100+ datasets found
  1. Fake News Challenge

    • kaggle.com
    zip
    Updated Apr 4, 2021
    Cite
    Abhinav Kumar Jha (2021). Fake News Challenge [Dataset]. https://www.kaggle.com/datasets/abhinavkrjha/fake-news-challenge
    Explore at:
    Available download formats: zip (5340415 bytes)
    Dataset updated
    Apr 4, 2021
    Authors
    Abhinav Kumar Jha
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The issue of “fake news” has arisen recently as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge was organized in early 2017 to encourage development of machine learning-based classification systems that perform “stance detection” -- i.e. identifying whether a particular news headline “agrees” with, “disagrees” with, “discusses,” or is unrelated to a particular news article -- in order to allow journalists and others to more easily find and investigate possible instances of “fake news.”

    Content

    The data provided is (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:

    train_bodies.csv

    This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)

    train_stances.csv

    This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).

    Distribution of the data

    The distribution of Stance classes in train_stances.csv is as follows:

    rows    unrelated  discuss  agree      disagree
    49972   0.73131    0.17828  0.0736012  0.0168094

    There are 4 possible classifications:

    1. The article text agrees with the headline.
    2. The article text disagrees with the headline.
    3. The article text is a discussion of the headline, without taking a position on it.
    4. The article text is unrelated to the headline (i.e. it doesn’t address the same topic).
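    The two-file layout described above can be joined on Body ID to recover (headline, body, stance) instances. The following is a minimal sketch using only the Python standard library; the column names come from the description, while the sample rows and the helper names (load_rows, join_stances_to_bodies, stance_distribution) are made up for illustration.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-ins for train_bodies.csv and train_stances.csv.
# Column names follow the dataset description; the rows are invented.
BODIES_CSV = '''Body ID,articleBody
0,"Officials confirmed the report on Monday."
1,"A new phone model was announced today."
'''

STANCES_CSV = '''Headline,Body ID,Stance
Report confirmed by officials,0,agree
Officials deny report,0,disagree
Questions raised over report,0,discuss
Local team wins final,1,unrelated
'''


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_stances_to_bodies(stances, bodies):
    """Attach the article body to each (headline, stance) row via Body ID."""
    body_by_id = {row["Body ID"]: row["articleBody"] for row in bodies}
    return [
        {
            "headline": s["Headline"],
            "body": body_by_id[s["Body ID"]],
            "stance": s["Stance"],
        }
        for s in stances
    ]


def stance_distribution(instances):
    """Return the fraction of instances per stance label."""
    counts = Counter(item["stance"] for item in instances)
    total = sum(counts.values())
    return {stance: n / total for stance, n in counts.items()}


instances = join_stances_to_bodies(load_rows(STANCES_CSV), load_rows(BODIES_CSV))
distribution = stance_distribution(instances)
```

    For the real files, the same functions could be fed from open("train_stances.csv", newline="") and open("train_bodies.csv", newline="") instead of the in-memory strings.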

    Acknowledgements

    For details of the task, see FakeNewsChallenge.org

  2. Competition on Kaggle

    • kaggle.com
    zip
    Updated Jun 14, 2024
    Cite
    Satyam Kr (2024). Competition on Kaggle [Dataset]. https://www.kaggle.com/datasets/sarty077/competition-on-kaggle
    Explore at:
    Available download formats: zip (2272868 bytes)
    Dataset updated
    Jun 14, 2024
    Authors
    Satyam Kr
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Satyam Kr

    Released under MIT

    Contents

  3. Kaggle Display Advertising Challenge dataset

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Dec 24, 2017
    Cite
    Jiang, Zilong (2017). Kaggle Display Advertising Challenge dataset [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001796020
    Explore at:
    Dataset updated
    Dec 24, 2017
    Authors
    Jiang, Zilong
    Description

    The Criteo Display Advertising Challenge dataset, provided by Criteo on Kaggle for advertising click-through rate (CTR) prediction.

  4. Google Universal Embedding Challenge Github Repo

    • kaggle.com
    zip
    Updated Jul 12, 2022
    Cite
    Darien Schettler (2022). Google Universal Embedding Challenge Github Repo [Dataset]. https://www.kaggle.com/datasets/dschettler8845/google-universal-embedding-challenge-github-repo
    Explore at:
    Available download formats: zip (13561 bytes)
    Dataset updated
    Jul 12, 2022
    Authors
    Darien Schettler
    Description

    Universal Embedding Challenge baseline model implementation.

    This folder contains the baseline model implementation for the Kaggle universal image embedding challenge based on

    Following the above ideas, we also add a 64-dimensional projection layer on top of the Vision Transformer base model as the final embedding, since the competition requires embeddings of at most 64 dimensions. Please find more details in image_classification.py.

    To use the code, first install the prerequisites:

    pip install -r universal_embedding_challenge/requirements.txt
    
    git clone https://github.com/tensorflow/models.git /tmp/models
    export PYTHONPATH=$PYTHONPATH:/tmp/models
    pip install --user -r /tmp/models/official/requirements.txt
    

    Secondly, please download the imagenet1k data in TFRecord format from https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0 and https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-1, and merge them together under folder imagenet-2012-tfrecord/. As a result, the paths to the training datasets and the validation datasets should be imagenet-2012-tfrecord/train* and imagenet-2012-tfrecord/validation*, respectively.

    The trainer for the model is implemented in train.py, and the following example launches training:

    python -m universal_embedding_challenge.train \
     --experiment=vit_with_bottleneck_imagenet_pretrain \
     --mode=train_and_eval \
     --model_dir=/tmp/imagenet1k_test
    

    The trained model checkpoints can then be converted to SavedModel format using export_saved_model.py for Kaggle submission.

    The code to compute metrics for Universal Embedding Challenge is implemented in metrics.py and the code to read the solution file is implemented in read_retrieval_solution.py.

  5. Online Kaggle Competition Points Calculator

    • kaggle.com
    zip
    Updated Oct 5, 2020
    Cite
    Muhammad Ahmed (2020). Online Kaggle Competition Points Calculator [Dataset]. https://www.kaggle.com/datasets/muhammad4hmed/online-kaggle-competition-points-calculator
    Explore at:
    Available download formats: zip (30400 bytes)
    Dataset updated
    Oct 5, 2020
    Authors
    Muhammad Ahmed
    Description

    Dataset

    This dataset was created by Muhammad Ahmed

    Contents

  6. 2021 Kaggle Machine Learning Challenge dataset

    • kaggle.com
    zip
    Updated Nov 28, 2021
    + more versions
    Cite
    Sreenanda Sai Dasari (2021). 2021 Kaggle Machine Learning Challenge dataset [Dataset]. https://www.kaggle.com/sreenandasaidasari/2021-kaggle-machine-learning-challenge-dataset
    Explore at:
    Available download formats: zip (2999272 bytes)
    Dataset updated
    Nov 28, 2021
    Authors
    Sreenanda Sai Dasari
    Description

    Dataset

    This dataset was created by Sreenanda Sai Dasari

    Contents

  7. Eedi-competition-kaggle-prompt-formats

    • huggingface.co
    Updated Sep 29, 2024
    + more versions
    Cite
    EVANGELOS PAPAMITSOS (2024). Eedi-competition-kaggle-prompt-formats [Dataset]. https://huggingface.co/datasets/VaggP/Eedi-competition-kaggle-prompt-formats
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2024
    Authors
    EVANGELOS PAPAMITSOS
    Description

    VaggP/Eedi-competition-kaggle-prompt-formats dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. Kaggle display advertising challenge dataset - Dataset - LDM

    • service.tib.eu
    Updated Jan 3, 2025
    Cite
    (2025). Kaggle display advertising challenge dataset - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/kaggle-display-advertising-challenge-dataset
    Explore at:
    Dataset updated
    Jan 3, 2025
    Description

    The Kaggle display advertising challenge dataset.

  9. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonimous authors; Anonimous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    Column             Description
    code_blocks_index  Global index linking code blocks to markup_data.csv.
    kernel_id          Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    code_block_id      Position of the code block within the notebook.
    code_block         The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    Column           Description
    kernel_id        Identifier for the Kaggle Jupyter notebook.
    kaggle_score     Performance metric of the notebook.
    kaggle_comments  Number of comments on the notebook.
    kaggle_upvotes   Number of upvotes the notebook received.
    kernel_link      URL to the notebook.
    comp_name        Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    Column                           Description
    comp_name                        Name of the Kaggle competition.
    description                      Overview of the competition task.
    data_type                        Type of data used in the competition.
    comp_type                        Classification of the competition.
    subtitle                         Short description of the task.
    EvaluationAlgorithmAbbreviation  Metric used for assessing competition submissions.
    data_sources                     Links to datasets used.
    metric type                      Class label for the assessment metric.

    Table 4. markup_data.csv structure

    Column           Description
    code_block       Machine learning code block.
    too_long         Flag indicating whether the block spans multiple semantic types.
    marks            Confidence level of the annotation.
    graph_vertex_id  ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
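    The table links described above (code_blocks.csv to kernels_meta.csv via kernel_id, then to competitions_meta.csv via comp_name) can be sketched as a pair of dictionary lookups. This is only an illustration: the column names come from Tables 1 to 3, but the rows, the competition names, and the link_tables helper are hypothetical.

```python
# Hypothetical miniature rows; only the column names come from the tables above.
code_blocks = [
    {"code_blocks_index": 0, "kernel_id": 11, "code_block_id": 0,
     "code_block": "import pandas as pd"},
    {"code_blocks_index": 1, "kernel_id": 11, "code_block_id": 1,
     "code_block": "df = pd.read_csv('train.csv')"},
    {"code_blocks_index": 2, "kernel_id": 12, "code_block_id": 0,
     "code_block": "model.fit(X, y)"},
    {"code_blocks_index": 3, "kernel_id": 99, "code_block_id": 0,
     "code_block": "print('no metadata for this kernel')"},
]
kernels_meta = [
    {"kernel_id": 11, "kaggle_score": 0.81, "comp_name": "example-comp-a"},
    {"kernel_id": 12, "kaggle_score": 0.77, "comp_name": "example-comp-b"},
]
competitions_meta = [
    {"comp_name": "example-comp-a", "data_type": "tabular"},
    {"comp_name": "example-comp-b", "data_type": "image"},
]


def link_tables(code_blocks, kernels_meta, competitions_meta):
    """Join code blocks to kernel metadata (kernel_id) and competition metadata (comp_name)."""
    kernels = {k["kernel_id"]: k for k in kernels_meta}
    comps = {c["comp_name"]: c for c in competitions_meta}
    linked = []
    for block in code_blocks:
        kernel = kernels.get(block["kernel_id"])
        if kernel is None:
            # kernels_meta.csv only lists notebooks with a Kaggle score,
            # so some code blocks may have no matching kernel row.
            continue
        comp = comps.get(kernel["comp_name"], {})
        linked.append({
            **block,
            "kaggle_score": kernel["kaggle_score"],
            "comp_name": kernel["comp_name"],
            "data_type": comp.get("data_type"),
        })
    return linked


linked = link_tables(code_blocks, kernels_meta, competitions_meta)
```

    Blocks whose kernel_id has no row in the kernel metadata are skipped here; whether to drop or keep such rows is a design choice that depends on the analysis.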

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  10. kaggle-nlp-getting-start

    • huggingface.co
    Cite
    hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    hui
    Description

    Dataset Summary

    Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

    Columns

    id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.

  11. EC class prediction dataset

    • kaggle.com
    zip
    Updated Jul 10, 2023
    Cite
    John Mitchell (2023). EC class prediction dataset [Dataset]. https://www.kaggle.com/datasets/jbomitchell/ec-class-prediction-dataset
    Explore at:
    Available download formats: zip (8106829 bytes)
    Dataset updated
    Jul 10, 2023
    Authors
    John Mitchell
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains relevant notebook submission files and papers:

    Notebook submission files from:

    PS S3E18 EDA + Ensembles by @zhukovoleksiy v8 0.65031.

    PS_3.18_LGBM_bin by @akioonodera v9 0.64706.

    PS3E18 EDA| Ensemble ML Pipeline |BinaryPredictict by @tetsutani v37 0.65540.

    0.65447 | Ensemble | AutoML | Enzyme Classify by @utisop v10 0.65447.

    pyBoost baseline by @l0glikelihood v4 0.65446.

    Random Forest EC classification by @jbomitchell RF62853_submission.csv 0.62853.

    Overfit Champion by @onurkoc83 v1 0.65810.

    Playground Series S3E18 - EDA & Separate Learning by @mateuszk013 v1 0.64933.

    Ensemble ML Pipeline + Bagging = 0.65557 by @chingiznurzhanov v7 0.65557.

    PS3E18| FeatureEnginering+Stacking by @jaygun84 v5 0.64845.

    S03E18 EDA | VotingClassifier | Optuna v15 0.64776.

    PS3E18 - GaussianNB by @mehrankazeminia v1 0.65898, v2 0.66009 & v3 0.66117.

    Enzyme Weighted Voting by @nivedithavudayagiri v2 0.65028.

    Multi-label With TF-Decision Forests by @gusthema v6 0.63374.

    S3E18 Target_Encoding LB 0.65947 by @meisa0 v1 0.65947.

    Boost Classifier Model by @satyaprakashshukl v7 0.64965.

    PS3E18: Multiple lightgbm models + Optuna by syerramilli v4 0.64982.

    s3e18_solution for overfitting public :0.64785 by @onurkoc83 v1 0.64785.

    PSS3E18 : FLAML : roc_auc_weighted by @gauravduttakiit v2 0.64732.

    PGS318: combiner by @kdmitrie v4 0.65350.

    averaging best solutions mean vs Weighted mean by @omarrajaa v5 0.66106.

    Papers

    N Nath & JBO Mitchell, Is EC class predictable from reaction mechanism? BMC Bioinformatics, 13:60 (2012) doi: 10.1186/1471-2105-13-60

    L De Ferrari & JBO Mitchell, From sequence to enzyme mechanism using multi-label machine learning, BMC Bioinformatics, 15:150 (2014) doi: 10.1186/1471-2105-15-150

    N Nath, JBO Mitchell & G Caetano-Anollés, The Natural History of Biocatalytic Mechanisms, PLoS Computational Biology, 10, e1003642 (2014) doi: 10.1371/journal.pcbi.1003642

    KE Beattie, L De Ferrari & JBO Mitchell, Why do sequence signatures predict enzyme mechanism? Homology versus Chemistry, Evolutionary Bioinformatics, 11: 267-274 (2015) doi: 10.4137/EBO.S31482

    HY Mussa, L De Ferrari & JBO Mitchell, Enzyme Mechanism Prediction: A Template Matching Problem on InterPro Signature Subspaces, BMC Research Notes, 8:744 (2015) doi: 10.1186/s13104-015-1730-7

  12. CAFA Protein Function Annotation Challenges

    • kaggle.com
    zip
    Updated May 29, 2023
    Cite
    Alexander Chervov (2023). CAFA Protein Function Annotation Challenges [Dataset]. https://www.kaggle.com/datasets/alexandervc/cafa-protein-function-annotation-challenges
    Explore at:
    Available download formats: zip (415515112 bytes)
    Dataset updated
    May 29, 2023
    Authors
    Alexander Chervov
    Description

    Dataset

    This dataset was created by Alexander Chervov

    Contents

  13. CoNIC Challenge Dataset

    • kaggle.com
    zip
    Updated Jan 6, 2022
    Cite
    Aadam (2022). CoNIC Challenge Dataset [Dataset]. https://www.kaggle.com/datasets/aadimator/conic-challenge-dataset
    Explore at:
    Available download formats: zip (985929496 bytes)
    Dataset updated
    Jan 6, 2022
    Authors
    Aadam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The dataset consists of Haematoxylin and Eosin stained histology images at 20x objective magnification (~0.5 microns/pixel) from 6 different data sources. For each image, an instance segmentation and a classification mask is provided. Within the dataset, each nucleus is assigned to one of the following categories:

    • Epithelial
    • Lymphocyte
    • Plasma
    • Eosinophil
    • Neutrophil
    • Connective tissue

    For more information on the dataset and the associated categories, we encourage participants to read the original dataset paper.

    Data Format

    Our provided patch-level dataset contains 4,981 non-overlapping images of size 256x256 provided in the following format:

    • RGB images
    • Segmentation & classification maps
    • Nuclei counts

    The RGB images and segmentation/classification maps are each stored as a single NumPy array. The RGB image array has dimensions 4981x256x256x3, whereas the segmentation & classification map array has dimensions 4981x256x256x2. Here, the first channel is the instance segmentation map and the second channel is the classification map. For the nuclei counts, we provide a single csv file, where each row corresponds to a given patch and the columns determine the counts for each type of nucleus. The row ordering is in line with the order of patches within the numpy files.

    Sample image: https://grand-challenge-public-prod.s3.amazonaws.com/i/2021/11/20/sample.png

    A given nucleus is considered present in the image if any part of it is within the central 224x224 region within the patch. This ensures that a nucleus is only considered for counting if it lies completely within the original 256x256 image.
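    The counting rule above (a nucleus counts if any part of it falls inside the central region) can be sketched as follows. This is a pure-Python illustration on a toy 6x6 patch with a 4x4 central region, not the organizers' code; it assumes that label 0 marks background in the instance map, that class IDs are integers, and that each instance carries a single class. The function name count_nuclei_in_center and the paint helper are hypothetical.

```python
from collections import Counter


def count_nuclei_in_center(instance_map, class_map, center_size):
    """Count nuclei per class among instances that touch the central square.

    instance_map and class_map are 2D lists of equal shape. Each nonzero
    instance ID is counted once if any of its pixels lies in the central
    center_size x center_size region (assumption: 0 is background).
    """
    height = len(instance_map)
    width = len(instance_map[0])
    top = (height - center_size) // 2
    left = (width - center_size) // 2

    class_of_instance = {}
    for r in range(top, top + center_size):
        for c in range(left, left + center_size):
            inst = instance_map[r][c]
            if inst != 0:
                class_of_instance[inst] = class_map[r][c]
    return Counter(class_of_instance.values())


def paint(instance_map, class_map, pixels, inst, cls):
    """Mark the given (row, col) pixels as belonging to one nucleus."""
    for r, c in pixels:
        instance_map[r][c] = inst
        class_map[r][c] = cls


# Toy patch: 6x6 with a 4x4 central region (rows and columns 1 to 4).
instance_map = [[0] * 6 for _ in range(6)]
class_map = [[0] * 6 for _ in range(6)]
paint(instance_map, class_map, [(0, 0), (1, 1)], inst=1, cls=2)  # partly inside the center
paint(instance_map, class_map, [(0, 5)], inst=2, cls=3)          # entirely in the border
paint(instance_map, class_map, [(3, 3)], inst=3, cls=2)          # fully inside the center

counts = count_nuclei_in_center(instance_map, class_map, center_size=4)
```

    For a real 256x256 patch the same function would be called with center_size=224; with NumPy arrays, slicing the central window would be the more idiomatic route.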


    Acknowledgements

    This dataset was provided by the Organizers of the CoNIC Challenge:

    • Simon Graham (TIA, PathLAKE)
    • Mostafa Jahanifar (TIA, PathLAKE)
    • Dang Vu (TIA)
    • Giorgos Hadjigeorghiou (TIA, PathLAKE)
    • Thomas Leech (TIA, PathLAKE)
    • David Snead (UHCW, PathLAKE)
    • Shan Raza (TIA, PathLAKE)
    • Fayyaz Minhas (TIA, PathLAKE)
    • Nasir Rajpoot (TIA, PathLAKE)

    TIA: Tissue Image Analytics Centre, Department of Computer Science, University of Warwick, United Kingdom

    UHCW: Department of Pathology, University Hospitals Coventry and Warwickshire, United Kingdom

    PathLAKE: Pathology Image Data Lake for Analytics Knowledge & Education, University Hospitals Coventry and Warwickshire, United Kingdom

  14. DSN Comedy Kaggle Challenge

    • kaggle.com
    zip
    Updated Sep 21, 2018
    Cite
    Gbolahan (2018). DSN Comedy Kaggle Challenge [Dataset]. https://www.kaggle.com/datasets/gbolahack/dsn-comedy-kaggle-challenge
    Explore at:
    Available download formats: zip (7811552 bytes)
    Dataset updated
    Sep 21, 2018
    Authors
    Gbolahan
    Description

    Dataset

    This dataset was created by Gbolahan

    Contents

  15. CrunchDAO Competition Unified Dataset

    • kaggle.com
    zip
    Updated Jun 15, 2023
    Cite
    Joakim Arvidsson (2023). CrunchDAO Competition Unified Dataset [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/crunchdao-competition-unified-dataset
    Explore at:
    Available download formats: zip (183163058 bytes)
    Dataset updated
    Jun 15, 2023
    Authors
    Joakim Arvidsson
    Description

    This data set is for creating predictive models for the CrunchDAO tournament. Registration is required in order to participate in the competition, and to be eligible to earn $CRUNCH tokens.

    See notebooks (Code tab) for how to import and explore the data, and build predictive models.

    See Terms of Use for data license.

  16. Deepfake Detection Challenge Dataset - Face Images

    • kaggle.com
    zip
    Updated Sep 11, 2024
    Cite
    VIJAY DEVANE (2024). Deepfake Detection Challenge Dataset - Face Images [Dataset]. https://www.kaggle.com/datasets/vijaydevane/deepfake-detection-challenge-dataset-face-images
    Explore at:
    Available download formats: zip (126132864 bytes)
    Dataset updated
    Sep 11, 2024
    Authors
    VIJAY DEVANE
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by VIJAY DEVANE

    Released under Apache 2.0

    Contents

  17. Competitions Shake-up

    • kaggle.com
    zip
    Updated Sep 27, 2020
    Cite
    Daniboy370 (2020). Competitions Shake-up [Dataset]. https://www.kaggle.com/daniboy370/competitions-shakeup
    Explore at:
    Available download formats: zip (388789 bytes)
    Dataset updated
    Sep 27, 2020
    Authors
    Daniboy370
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Shake-what ?!

    The shake phenomenon occurs when a competition switches between two different datasets:

    \[ \text{Public test set} \ \Rightarrow \ \text{Private test set} \quad \Leftrightarrow \quad LB-\text{public} \ \Rightarrow \ LB-\text{private} \]

    The private test set, which was unavailable until then, becomes available, and the models' scores are recalculated. This re-evaluation causes a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset, and act to improve their model until the deadline.

    Unable to find a uniform conventional term for this mechanism, I will use my common sense to define the following intuition :

    [Image: https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true]
    

    From the starter kernel :

    [Animation: https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true]
    

    Content

    Seven datasets of competitions which were scraped from Kaggle :

    Competition                               Name of file
    Elo Merchant Category Recommendation      df_{Elo}
    Human Protein Atlas Image Classification  df_{Protein}
    Humpback Whale Identification             df_{Humpback}
    Microsoft Malware Prediction              df_{Microsoft}
    Quora Insincere Questions Classification  df_{Quora}
    TGS Salt Identification Challenge         df_{TGS}
    VSB Power Line Fault Detection            df_{VSB}

    As an example, consider the following dataframe from the Quora competition:

    Team Name   Rank-private  Rank-public  Shake  Score-private  Score-public
    The Zoo     1             7            6      0.71323        0.71123
    ...         ...           ...          ...    ...            ...
    D.J. Trump  1401          65           -1336  0.000          0.70573
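    Judging only from the two example rows above, the Shake column appears to equal Rank-public minus Rank-private (7 - 1 = 6 and 65 - 1401 = -1336). The sketch below encodes that inferred relationship; it is an assumption drawn from the sample, not a definition confirmed by the dataset author, and the shake function name is made up.

```python
def shake(rank_public, rank_private):
    """Positions gained (positive) or lost (negative) from public to private leaderboard.

    Assumption inferred from the example rows: Shake = Rank-public - Rank-private.
    """
    return rank_public - rank_private


# The two example rows from the Quora dataframe.
rows = [
    {"team": "The Zoo", "rank_private": 1, "rank_public": 7},
    {"team": "D.J. Trump", "rank_private": 1401, "rank_public": 65},
]

shakes = {row["team"]: shake(row["rank_public"], row["rank_private"]) for row in rows}
# shakes == {"The Zoo": 6, "D.J. Trump": -1336}
```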

    I encourage everybody to investigate the dataset thoroughly in search of interesting findings!

    \[ \text{Enjoy !}\]

  18. Kaggle Survey Challenge - All Kernels

    • kaggle.com
    zip
    Updated Nov 22, 2022
    Cite
    KlemenVodopivec (2022). Kaggle Survey Challenge - All Kernels [Dataset]. https://www.kaggle.com/datasets/klemenvodopivec/kaggle-survey-challenge-all-kernels/data
    Explore at:
    Available download formats: zip (206438 bytes)
    Dataset updated
    Nov 22, 2022
    Authors
    KlemenVodopivec
    Description

    A collection of kernel submissions for the Kaggle survey competitions from 2017 to 2022. Because this data was collected during the 2022 survey competition, it does not contain all the kernels for 2022.

  19. KaggleX skill assessment challenge

    • kaggle.com
    zip
    Updated Jun 5, 2024
    Cite
    khadijat agboola (2024). KaggleX skill assessment challenge [Dataset]. https://www.kaggle.com/datasets/khadijatagboola/kagglex-skill-assessment-challenge
    Explore at:
    Available download formats: zip (2272868 bytes)
    Dataset updated
    Jun 5, 2024
    Authors
    khadijat agboola
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset for this competition (both train and test) was generated from a deep learning model fine-tuned on the Used Car Price Prediction Dataset. While feature distributions are similar to the original, they are not identical. You are welcome to use the original dataset to explore differences and to see if incorporating it into your training improves model performance, though it is not mandatory.

    Files:

    • train.csv: The training dataset; refer to the original dataset link above for column descriptions.
    • test.csv: The test dataset; your objective is to predict the target value, Price.
    • sample_submission.csv: A sample submission file in the correct format.

  20. Meta_Kaggle_Competitions_cleaned_dataset

    • kaggle.com
    zip
    Updated Jul 17, 2025
    Cite
    Sarvpreet Kaur (2025). Meta_Kaggle_Competitions_cleaned_dataset [Dataset]. https://www.kaggle.com/datasets/sarvpreetkaur22/meta-kaggle-competitions-cleaned-dataset/data
    Explore at:
    Available download formats: zip (339979 bytes)
    Dataset updated
    Jul 17, 2025
    Authors
    Sarvpreet Kaur
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📝 Description:

    A cleaned version of Competitions.csv focused on timeline analysis.

    ✅ Includes: CompetitionId, Title, Deadline, EnabledDate, HostSegmentTitle
    ✅ Helps understand growth over time and regional hosting focus
    ✅ Can be joined with teams_clean.csv and user_achievements_clean.csv

Fake News Challenge

Detecting abnormal news articles

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)