Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Jørgen Sandhaug
Released under Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Lisa Sharapova
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the set of Kaggle competitions that are pertinent to healthcare. The dataset was created following the analysis of the Competitions.csv file which is available at https://www.kaggle.com/datasets/kaggle/meta-kaggle
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Satyam Kr
Released under MIT
VaggP/Eedi-competition-kaggle-prompt-formats dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
|---|---|
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
|---|---|
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
|---|---|
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
|---|---|
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example:
code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
kernels_meta.csv can be linked to competitions_meta.csv via comp_name.
To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.
In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
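As a rough sketch of how such a mapping could be followed in practice, the snippet below joins the three tables on the columns they share according to Tables 1-3 (kernel_id between code_blocks.csv and kernels_meta.csv, comp_name between kernels_meta.csv and competitions_meta.csv). The CSV contents are invented stand-ins, and the helper names `read_rows` and `join_tables` are hypothetical. It uses only the standard library; with pandas, the same join would typically be two `DataFrame.merge` calls.

```python
import csv
import io

# Invented stand-ins for the three CSV files; the real files have more columns and rows.
CODE_BLOCKS = """code_blocks_index,kernel_id,code_block_id,code_block
0,101,0,import pandas as pd
1,101,1,df = pd.read_csv('train.csv')
2,202,0,"model.fit(X, y)"
"""
KERNELS_META = """kernel_id,kaggle_score,comp_name
101,0.81,comp-a
202,0.77,comp-b
"""
COMPETITIONS_META = """comp_name,data_type
comp-a,tabular
comp-b,text
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_tables(code_blocks, kernels_meta, competitions_meta):
    """Attach kernel and competition metadata to every code block."""
    kernels = {row["kernel_id"]: row for row in kernels_meta}
    comps = {row["comp_name"]: row for row in competitions_meta}
    joined = []
    for block in code_blocks:
        kernel = kernels.get(block["kernel_id"])
        if kernel is None:
            continue  # notebook absent from kernels_meta, so the block is skipped
        comp = comps.get(kernel["comp_name"], {})
        joined.append({**block, **kernel, **comp})
    return joined


rows = join_tables(
    read_rows(CODE_BLOCKS), read_rows(KERNELS_META), read_rows(COMPETITIONS_META)
)
for row in rows:
    print(row["code_blocks_index"], row["kernel_id"], row["comp_name"], row["data_type"])
```

Blocks whose notebook is missing from the kernel metadata are dropped here, which mirrors an inner join; a left join would keep them with empty metadata instead.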
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for [LLM Science Exam Kaggle Competition]
Dataset Summary
https://www.kaggle.com/competitions/kaggle-llm-science-exam/data
Languages
[en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]
Dataset Structure
Columns
prompt - the text of the question being asked
A - option A; if this option is correct, then answer will be A
B - option B; if this option is correct, then answer will be B
C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.
VaggP/Eedi-competition-kaggle-llama-fine-tune dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Muhammad Ahmed
Dataset Summary
Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.
Columns
id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to gain insights into the Kaggle competitions currently listed on the competitions page of the Kaggle platform.
I've included 3 files and explained below what each of them contains.
License: https://creativecommons.org/publicdomain/zero/1.0/
The Shake phenomenon occurs when the competition shifts between two different datasets :
\[ \text{Public test set} \ \Rightarrow \ \text{Private test set} \quad \Leftrightarrow \quad LB-\text{public} \ \Rightarrow \ LB-\text{private} \]
The private test set, unavailable until then, becomes available, and the models' scores are recalculated. This re-evaluation triggers a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset, and act to improve their model until the deadline.
Unable to find a uniform conventional term for this mechanism, I will use my common sense to define the following intuition :
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">
From the starter kernel :
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">
Seven datasets of competitions which were scraped from Kaggle :
| Competition | Name of file |
|---|---|
| Elo Merchant Category Recommendation | df_{Elo} |
| Human Protein Atlas Image Classification | df_{Protein} |
| Humpback Whale Identification | df_{Humpback} |
| Microsoft Malware Prediction | df_{Microsoft} |
| Quora Insincere Questions Classification | df_{Quora} |
| TGS Salt Identification Challenge | df_{TGS} |
| VSB Power Line Fault Detection | df_{VSB} |
As an example, consider the following dataframe from the Quora competition :
| Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
|---|---|---|---|---|---|
| The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
| ... | ... | ... | ... | ... | ... |
| D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
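In the two rows shown, the Shake value equals the public rank minus the private rank. This relation is inferred from those rows only (the author's own definition sits in an image), so the sketch below is an assumption about how the column could be reproduced. The function name `shake` and the dictionary keys are hypothetical.

```python
def shake(rank_public, rank_private):
    """Places gained (positive) or lost (negative) between public and private leaderboard."""
    return rank_public - rank_private


# The two fully visible rows of the Quora example table.
teams = [
    {"team": "The Zoo", "rank_private": 1, "rank_public": 7},
    {"team": "D.J. Trump", "rank_private": 1401, "rank_public": 65},
]

for entry in teams:
    entry["shake"] = shake(entry["rank_public"], entry["rank_private"])
    print(entry["team"], entry["shake"])
```

Under this reading, a positive value means the team climbed once the private test set was used, and a large negative value suggests overfitting to the public leaderboard.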
I encourage everybody to investigate the dataset thoroughly in search of interesting findings !
\[ \text{Enjoy !}\]
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A cleaned version of Competitions.csv focused on timeline analysis.
✅ Includes: CompetitionId, Title, Deadline, EnabledDate, HostSegmentTitle
✅ Helps understand growth over time, and regional hosting focus
✅ Can be joined with teams_clean.csv and user_achievements_clean.csv
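One simple timeline analysis is to count how many competitions were enabled per year. The sketch below does this with the standard library on invented rows. The date format (`%Y-%m-%d`) is an assumption and may differ from the real file, and `competitions_per_year` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Invented sample rows using the listed column names.
SAMPLE = """CompetitionId,Title,Deadline,EnabledDate,HostSegmentTitle
1,Comp A,2019-03-01,2019-01-10,Featured
2,Comp B,2020-06-15,2020-04-01,Research
3,Comp C,2020-12-31,2020-10-05,Featured
"""

DATE_FORMAT = "%Y-%m-%d"  # assumption: adjust to the format used in the actual CSV


def competitions_per_year(csv_text, date_column="EnabledDate"):
    """Count competitions by the year found in the given date column."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = datetime.strptime(row[date_column], DATE_FORMAT).year
        counts[year] += 1
    return counts


per_year = competitions_per_year(SAMPLE)
for year in sorted(per_year):
    print(year, per_year[year])
```

Passing `date_column="Deadline"` would count by closing date instead.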
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Krishna Harsha M
Released under MIT
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.
100 rows and 13 columns. Column descriptions are listed below.
Data from Kaggle. Image from Smartcat.
If you're reading this, please upvote.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I produced the dataset whilst working on the 2023 Kaggle AI report. The Meta Kaggle dataset provides helpful information about the Kaggle competitions but not the original descriptive text from the Kaggle web pages for each competition. We have information about the solutions but not the original problem. So, I wrote some web scraping scripts to collect and store that information.
Not all Kaggle web pages have that information available; some are missing or broken. Hence the nulls in the data. Secondly, note that not all previous Kaggle competitions exist in the Meta Kaggle data, which was used to collect the webpage slugs.
The scraping scripts iterate over the IDs in the Meta Kaggle competitions.csv data and attempt to collect the webpage data for a competition if it is currently null in the database. New IDs therefore cause the scripts to collect their data, and each week the scripts try to fill in any links that were not working previously.
I have recently converted the original local scraping scripts on my machine into a Kaggle notebook that now updates this dataset weekly on Mondays. The notebook also explains the scraping procedure and its automation to keep this dataset up-to-date.
Note that the CompetitionId field joins to the Id of the competitions.csv of the Meta Kaggle dataset so that this information can be combined with the rest of Meta Kaggle.
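A minimal sketch of that join, using invented rows: only `CompetitionId` and `Id` are column names taken from the description above, while `Title`, `Description`, and the helper `attach_descriptions` are hypothetical placeholders. Competitions without scraped text keep a `None` description, matching the nulls mentioned earlier.

```python
# Invented rows standing in for Meta Kaggle's competitions.csv and the scraped dataset.
meta_competitions = [
    {"Id": 1, "Title": "Comp A"},
    {"Id": 2, "Title": "Comp B"},
    {"Id": 3, "Title": "Comp C"},
]
scraped_pages = [
    {"CompetitionId": 1, "Description": "Predict A."},
    {"CompetitionId": 2, "Description": None},  # page missing or broken
]


def attach_descriptions(competitions, pages):
    """Left-join scraped page text onto the competitions by CompetitionId == Id."""
    by_id = {page["CompetitionId"]: page for page in pages}
    combined = []
    for comp in competitions:
        page = by_id.get(comp["Id"], {})
        combined.append({**comp, "Description": page.get("Description")})
    return combined


combined = attach_descriptions(meta_competitions, scraped_pages)
missing = [row["Id"] for row in combined if row["Description"] is None]
print(missing)
```

Listing the IDs that still lack a description is one way to see which pages a later scraping run would need to retry.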
My primary reason for collecting the data was for some text classification work I wanted to do, and I will publish it here soon. I hope that the data is useful to some other projects as well :-)
This data set is for creating predictive models for the CrunchDAO tournament. Registration is required in order to participate in the competition, and to be eligible to earn $CRUNCH tokens.
See notebooks (Code tab) for how to import and explore the data, and build predictive models.
See Terms of Use for data license.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains relevant notebook submission files and papers:
Notebook submission files from:
PS S3E18 EDA + Ensembles by @zhukovoleksiy v8 0.65031.
PS_3.18_LGBM_bin by @akioonodera v9 0.64706.
PS3E18 EDA| Ensemble ML Pipeline |BinaryPredictict by @tetsutani v37 0.65540.
0.65447 | Ensemble | AutoML | Enzyme Classify by @utisop v10 0.65447.
pyBoost baseline by @l0glikelihood v4 0.65446.
Random Forest EC classification by @jbomitchell RF62853_submission.csv 0.62853.
Overfit Champion by @onurkoc83 v1 0.65810.
Playground Series S3E18 - EDA & Separate Learning by @mateuszk013 v1 0.64933.
Ensemble ML Pipeline + Bagging = 0.65557 by @chingiznurzhanov v7 0.65557.
PS3E18| FeatureEnginering+Stacking by @jaygun84 v5 0.64845.
S03E18 EDA | VotingClassifier | Optuna v15 0.64776.
PS3E18 - GaussianNB by @mehrankazeminia v1 0.65898, v2 0.66009 & v3 0.66117.
Enzyme Weighted Voting by @nivedithavudayagiri v2 0.65028.
Multi-label With TF-Decision Forests by @gusthema v6 0.63374.
S3E18 Target_Encoding LB 0.65947 by @meisa0 v1 0.65947.
Boost Classifier Model by @satyaprakashshukl v7 0.64965.
PS3E18: Multiple lightgbm models + Optuna by @syerramilli v4 0.64982.
s3e18_solution for overfitting public :0.64785 by @onurkoc83 v1 0.64785.
PSS3E18 : FLAML : roc_auc_weighted by @gauravduttakiit v2 0.64732.
PGS318: combiner by @kdmitrie v4 0.65350.
averaging best solutions mean vs Weighted mean by @omarrajaa v5 0.66106.
Papers
N Nath & JBO Mitchell, Is EC class predictable from reaction mechanism? BMC Bioinformatics, 13:60 (2012) doi: 10.1186/1471-2105-13-60
L De Ferrari & JBO Mitchell, From sequence to enzyme mechanism using multi-label machine learning, BMC Bioinformatics, 15:150 (2014) doi: 10.1186/1471-2105-15-150
N Nath, JBO Mitchell & G Caetano-Anollés, The Natural History of Biocatalytic Mechanisms, PLoS Computational Biology, 10, e1003642 (2014) doi: 10.1371/journal.pcbi.1003642
KE Beattie, L De Ferrari & JBO Mitchell, Why do sequence signatures predict enzyme mechanism? Homology versus Chemistry, Evolutionary Bioinformatics, 11: 267-274 (2015) doi: 10.4137/EBO.S31482
HY Mussa, L De Ferrari & JBO Mitchell, Enzyme Mechanism Prediction: A Template Matching Problem on InterPro Signature Subspaces, BMC Research Notes, 8:744 (2015) doi: 10.1186/s13104-015-1730-7
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the stats of all completed competitions organized on Kaggle. It contains 15 columns.
1. Comp_name - Name of competition
2. comp_ Reward - Type of reward
3. comp_link - Link to the competition
4. teams - Number of participating teams
5. Entries - Number of entries
6. Competitors - Number of competitors
7. start_date - Starting date
8. start_month - Starting month
9. start_year - Starting year
10. Final_date - Ending date
11. Final_month - Ending month
12. Final_year - Ending year
13. code_link - Link to one notebook for each competition
14. Desc - Description of competition
This dataset has been scraped from link
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created to provide a stable, reliable data source for notebooks, avoiding the 'deleted-dataset' errors that can occur with the frequently-updated official Meta Kaggle dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Jørgen Sandhaug
Released under Apache 2.0