http://opendatacommons.org/licenses/dbcl/1.0/
The issue of “fake news” has arisen recently as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge was organized in early 2017 to encourage development of machine learning-based classification systems that perform “stance detection” -- i.e. identifying whether a particular news headline “agrees” with, “disagrees” with, “discusses,” or is unrelated to a particular news article -- in order to allow journalists and others to more easily find and investigate possible instances of “fake news.”
The data provided consists of (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:
train_bodies.csv: contains the body text of articles (the articleBody column) with corresponding IDs (Body ID).
train_stances.csv: contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).
The distribution of Stance classes in train_stances.csv is as follows:
| rows | unrelated | discuss | agree | disagree |
|---|---|---|---|---|
| 49972 | 0.73131 | 0.17828 | 0.0736012 | 0.0168094 |
There are 4 possible classifications:
1. The article text agrees with the headline.
2. The article text disagrees with the headline.
3. The article text is a discussion of the headline, without taking a position on it.
4. The article text is unrelated to the headline (i.e. it doesn’t address the same topic).
For details of the task, see FakeNewsChallenge.org
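As a quick orientation, below is a minimal sketch (assuming pandas and the column names listed above) that joins the two CSVs so each row pairs a headline with its article body and stance label:

```python
import pandas as pd

# Load the two provided CSVs (column names as described above).
bodies = pd.read_csv("train_bodies.csv")    # Body ID, articleBody
stances = pd.read_csv("train_stances.csv")  # Headline, Body ID, Stance

# Join on Body ID so each row has headline, body text, and stance label.
data = stances.merge(bodies, on="Body ID", how="left")

# Class proportions; these should roughly match the distribution table above.
print(data["Stance"].value_counts(normalize=True))
```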
The Criteo Display Advertising Challenge dataset, provided by the Criteo company on Kaggle for advertising click-through rate (CTR) prediction.
This folder contains the baseline model implementation for the Kaggle Universal Image Embedding Challenge, based on a Vision Transformer. Since the competition requires embeddings of at most 64 dimensions, we add a 64-dimensional projection layer on top of the Vision Transformer base model as the final embedding. Please find more details in image_classification.py.
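For intuition, here is a minimal sketch of such a projection head in TensorFlow/Keras; this is not the code in image_classification.py, the backbone is a placeholder, and the L2 normalization at the end is an assumption commonly used for retrieval embeddings:

```python
import tensorflow as tf

def add_projection_head(backbone: tf.keras.Model,
                        embedding_dim: int = 64) -> tf.keras.Model:
    """Append a linear projection so the final embedding fits the 64-dim limit."""
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs)                    # e.g. ViT [CLS]-token features
    embedding = tf.keras.layers.Dense(embedding_dim)(features)
    # Assumed extra step: L2-normalize so retrieval can use dot-product similarity.
    embedding = tf.keras.layers.Lambda(
        lambda x: tf.math.l2_normalize(x, axis=-1))(embedding)
    return tf.keras.Model(inputs, embedding)
```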
To use the code, please first install the prerequisites:
pip install -r universal_embedding_challenge/requirements.txt
git clone https://github.com/tensorflow/models.git /tmp/models
export PYTHONPATH=$PYTHONPATH:/tmp/models
pip install --user -r /tmp/models/official/requirements.txt
Second, please download the ImageNet-1k data in TFRecord format from https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0 and https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-1, and merge them together under the folder imagenet-2012-tfrecord/. As a result, the paths to the training and validation datasets should be imagenet-2012-tfrecord/train* and imagenet-2012-tfrecord/validation*, respectively.
The trainer for the model is implemented in train.py, and the following example launches training:
python -m universal_embedding_challenge.train \
--experiment=vit_with_bottleneck_imagenet_pretrain \
--mode=train_and_eval \
--model_dir=/tmp/imagenet1k_test
The trained model checkpoints can then be converted to SavedModel format using export_saved_model.py for Kaggle submission.
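Conceptually, the conversion amounts to restoring a checkpoint and writing a SavedModel directory; below is a hedged sketch with a stand-in model and hypothetical paths, not the actual export_saved_model.py logic:

```python
import tensorflow as tf

# Stand-in for the real embedding model; in practice, rebuild the trained architecture.
model = tf.keras.Sequential([tf.keras.layers.Dense(64, input_shape=(768,))])

# Restore the latest training checkpoint (hypothetical directory from the example above).
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint("/tmp/imagenet1k_test")).expect_partial()

# Write a SavedModel directory that can be packaged for Kaggle submission.
tf.saved_model.save(model, "/tmp/imagenet1k_test/saved_model")
```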
The code to compute metrics for the Universal Embedding Challenge is implemented in metrics.py, and the code to read the solution file is implemented in read_retrieval_solution.py.
This dataset was created by Sreenanda Sai Dasari
This dataset was created by Alexander Chervov
GitHub Issues & Kaggle Notebooks
Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
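A hedged sketch of loading it with the Hugging Face datasets library (config names are discovered at runtime rather than assumed; streaming avoids a full download):

```python
from datasets import get_dataset_config_names, load_dataset

repo = "HuggingFaceTB/issues-kaggle-notebooks"
configs = get_dataset_config_names(repo)    # e.g. one config per source (issues vs. notebooks)
print(configs)

# Stream the first config's training split and peek at one sample.
ds = load_dataset(repo, configs[0], split="train", streaming=True)
print(next(iter(ds)))
```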
https://creativecommons.org/publicdomain/zero/1.0/
The shake phenomenon occurs when the competition shifts between two different test sets:
\[ \text{Public test set} \ \Rightarrow \ \text{Private test set} \quad \Leftrightarrow \quad \text{LB}_{\text{public}} \ \Rightarrow \ \text{LB}_{\text{private}} \]
The private test set, which was unavailable until then, becomes available, and the models' scores are recalculated. This re-evaluation causes a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset, and to act to improve their model until the deadline.
Unable to find a uniform conventional term for this mechanism, I define the following intuition:
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">
From the starter kernel:
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">
Seven competition datasets were scraped from Kaggle:
| Competition | Name of file |
|---|---|
| Elo Merchant Category Recommendation | df_{Elo} |
| Human Protein Atlas Image Classification | df_{Protein} |
| Humpback Whale Identification | df_{Humpback} |
| Microsoft Malware Prediction | df_{Microsoft} |
| Quora Insincere Questions Classification | df_{Quora} |
| TGS Salt Identification Challenge | df_{TGS} |
| VSB Power Line Fault Detection | df_{VSB} |
As an example, consider the following dataframe from the Quora competition:
| Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
|---|---|---|---|---|---|
| The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
| ... | ... | ... | ... | ... | ... |
| D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
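A minimal sketch (assuming the column names above and a hypothetical CSV file name for the Quora table) of recomputing the shake column, where a positive value means a team climbed between the public and private leaderboards:

```python
import pandas as pd

df_quora = pd.read_csv("df_Quora.csv")  # hypothetical file name for the scraped Quora table

# Shake = public rank minus private rank (e.g. The Zoo: 7 - 1 = 6).
df_quora["Shake"] = df_quora["Rank-public"] - df_quora["Rank-private"]

# Biggest climbers and biggest drops.
print(df_quora.nlargest(5, "Shake")[["Team Name", "Shake"]])
print(df_quora.nsmallest(5, "Shake")[["Team Name", "Shake"]])
```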
I encourage everybody to investigate the dataset thoroughly in search of interesting findings!
\[ \text{Enjoy!} \]
The Kaggle display advertising challenge dataset.
Dataset Summary
Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.
Columns
id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
## Overview
Kaggle is a dataset for object detection tasks - it contains K annotations for 779 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
This dataset was created by Gbolahan
A collection of kernel submissions for the Kaggle survey competitions from 2017 to 2022. As this data was collected during the 2022 survey competition, it does not contain all the kernels for the year 2022.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0-licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
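As an illustration of this layout, here is a small sketch that maps a KernelVersions id to its expected path; the unpadded folder names follow the examples above, and the file extension is an assumption (scripts vs. notebooks differ):

```python
def kernel_version_path(kernel_version_id: int, extension: str = "ipynb") -> str:
    """Map a KernelVersions id to its folder path in Meta Kaggle Code."""
    top = kernel_version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
    sub = (kernel_version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
    return f"{top}/{sub}/{kernel_version_id}.{extension}"

# Example: a notebook version with id 123456789 would live under 123/456/.
print(kernel_version_path(123456789))  # -> 123/456/123456789.ipynb
```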
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by VIJAY DEVANE
Released under Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset for this competition (both train and test) was generated from a deep learning model fine-tuned on the Used Car Price Prediction Dataset. While feature distributions are similar to the original, they are not identical. You are welcome to use the original dataset to explore differences and to see if incorporating it into your training improves model performance, though it is not mandatory.
Files:
train.csv: The training dataset; refer to the original dataset link above for column descriptions.
test.csv: The test dataset; your objective is to predict the target value, Price.
sample_submission.csv: A sample submission file in the correct format.
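A hedged sketch of the end-to-end workflow (the exact column layout of sample_submission.csv is an assumption; check the file itself):

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission.csv")

# Trivial constant baseline: predict the mean training price for every test row.
submission["Price"] = train["Price"].mean()
submission.to_csv("submission.csv", index=False)
```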
https://creativecommons.org/publicdomain/zero/1.0/
Challenge Details: In this data-driven hackathon, participants will develop machine learning models to predict the AI Trust Level (%) based on AI Perception Data.
Submission and Evaluation
Submission Format: Participants must submit their predictions in the format specified in submission.csv.
Evaluation Metric: Submissions will be evaluated based on the R2 score, measuring how well the model predicts the AI Trust Level (%).
Leaderboard: Track your progress and aim for the top spot on the leaderboard.
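A minimal sketch of checking a model locally against the competition metric before writing a submission; the column names "id" and "AI Trust Level (%)" are assumptions, so follow the provided submission.csv exactly:

```python
import pandas as pd
from sklearn.metrics import r2_score

# Hypothetical held-out targets and model predictions for the AI Trust Level (%).
y_true = [62.0, 48.5, 75.0, 55.0]
y_pred = [60.0, 50.0, 73.5, 57.0]
print("R2 score:", r2_score(y_true, y_pred))

# Write predictions in the required submission format (assumed columns).
submission = pd.DataFrame({"id": range(len(y_pred)), "AI Trust Level (%)": y_pred})
submission.to_csv("submission.csv", index=False)
```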
https://creativecommons.org/publicdomain/zero/1.0/
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle
This dataset was created by Sayantan Das
Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as the problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of code large language models in generating data science programs given natural language instructions. Please read our paper for more details.
Note👉 This Kaggle dataset only contains the dataset files of Arcade. Refer to our main Github repository for detailed instructions to use this dataset.
Below is the structure of its content:
└── ./
├── existing_tasks/ # Problems derived from existing data science notebooks on Github
│ ├── metadata.json # Metadata by `build_existing_tasks_split.py` to reproduce this split.
│ ├── artifacts/ # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
│ └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
├── new_tasks/
│   ├── dataset.json # Original, unpreprocessed dataset
│ ├── kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
│ ├── artifacts/ # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
│ └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
└── checksums.txt # Table of MD5 checksums of dataset files.
All the dataset '*.json' files follow the same structure. Each dataset file is a Json-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:
{
"notebook_name": "Name of the notebook.",
"work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
"annotator": "Anonymized annotator Id."
"turns": [
# A list of natural language to code examples using the current notebook context.
{
"input": "Prompt to a code generation model.",
"turn": {
"intent": {
"value": "Annotated NL intent for the current turn.",
"is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
"cell_idx": "Index of the intent Markdown cell.",
"line_span": "Line span of the intent.",
"not_sure": "Annotation confidence.",
"output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
},
"code": {
"value": "Reference code solution.",
"cell_idx": "Cell index of the code cell containing the solution.",
"num_lines": "Number of lines in the reference solution.",
"line_span": "Line span.",
},
"code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
"delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
"metadata": {
"annotator_id": "Annotator Id",
"num_code_lines": "Metadata, please ignore.",
"utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
},
},
"notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
"metadata": {
# A dict of metadata of this turn.
"context_cells": [ # A list of context cells before the problem.
{
"cell_type": "code|markdown",
"source": "Cell content."
},
],
"delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
# The following fields only occur in datasets inlined with schema descriptions.
"context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
"inten...
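A small sketch (field names taken from the schema above; the path assumes the new_tasks split) of iterating over episodes and turns in one of the dataset JSON files:

```python
import json

# Each dataset file is a JSON-serialized list of episodes (one per annotated notebook).
with open("new_tasks/dataset.json") as f:
    episodes = json.load(f)

for episode in episodes:
    print(episode["notebook_name"])
    for record in episode["turns"]:
        turn = record["turn"]
        intent = turn["intent"]["value"]   # annotated NL intent for this turn
        code = turn["code"]["value"]       # reference code solution
        print(f"  {intent[:60]!r} -> {len(code.splitlines())} line(s) of reference code")
```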
This dataset was created by Gary CF Lee