Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
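For example, a KernelVersions id can be mapped to its folder with integer arithmetic. This is a minimal sketch; the exact folder-name padding and file extension are assumptions:

import os

def code_folder(kernel_version_id: int) -> str:
    # Top-level folder holds up to 1 million files, subfolder up to 1 thousand.
    top = kernel_version_id // 1_000_000
    sub = (kernel_version_id // 1_000) % 1_000
    return os.path.join(str(top), str(sub))

print(code_folder(123_456_789))  # -> 123/456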
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
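For example, with the google-cloud-storage Python client the billing project is passed via user_project. This is a hedged sketch; 'my-gcp-project' and the object path are placeholders:

from google.cloud import storage

client = storage.Client()
# Requester pays: downloads are billed to the project named here.
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project="my-gcp-project")
blob = bucket.blob("123/456/123456789.ipynb")  # placeholder object path
blob.download_to_filename("123456789.ipynb")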
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This is a dataset about all the notebooks in the Meta Kaggle Code dataset. The original dataset is owned by the Kaggle team; I am simply extracting metadata about Meta Kaggle Code. My dataset contains the following columns, with their descriptions given below. If you have feedback, you can browse the Discussions or create a new topic. I hope you like the dataset and will use it for the Meta Kaggle Hackathon.
Cheers, ayush
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides a categorized list of script file paths from Kaggle's Meta Kaggle Code (MKC) repository, organized by programming language and file type. It enables detailed exploration of how data scientists use different environments for notebooks and scripts on Meta Kaggle Code.
ipynb_file_list.txt – paths to Jupyter notebooks written in Python in MKC
py_file_list.txt – paths to standalone Python scripts in MKC
r_file_list.txt – paths to R scripts in MKC
rmd_file_list.txt – paths to R Markdown notebooks in MKC
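As a quick illustration, the lists can be loaded and counted per language. A minimal sketch, assuming one path per line:

from pathlib import Path

file_lists = ["ipynb_file_list.txt", "py_file_list.txt",
              "r_file_list.txt", "rmd_file_list.txt"]
for name in file_lists:
    # Each list holds one MKC file path per line.
    n_paths = len(Path(name).read_text().splitlines())
    print(name, n_paths)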
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
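As an example of such a join, the sketch below merges KernelVersions with Users in pandas; the column names (Title, AuthorUserId, PerformanceTier) are assumptions based on the published Meta Kaggle schema:

import pandas as pd

versions = pd.read_csv("KernelVersions.csv")
users = pd.read_csv("Users.csv")

# Attach each notebook version to its author's profile and progression tier.
joined = versions.merge(users, left_on="AuthorUserId", right_on="Id",
                        suffixes=("", "_user"))
print(joined[["Title", "UserName", "PerformanceTier"]].head())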
Known quirks: the UserId column in the ForumMessages table has values that do not exist in the Users table, some columns are True or False flags, and the tables include aggregate Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
The database tables are created with the db_abd_create_tables.sql script, and the data is cleaned with the clean_data.py script, which for each table handles steps such as filling NULL values. Foreign keys are added with the add_foreign_keys.sql script, and the Total columns in the database tables are updated by running the update_totals.sql script.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚
datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.
ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.
ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.
ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName 👩💻: The name of the dataset creator, which could be different from the owner.
creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.
creatorUserId 💼: The unique user ID of the dataset creator.
scriptCount 📜: The number of scripts (kernels) associated with this dataset.
scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount 👀: The number of views the dataset page has received on Kaggle.
downloadCount ⬇️: The number of times the dataset has been downloaded by users.
dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated 🔄: The date when the dataset was last updated or modified.
voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl ⬇️: A direct link to download the dataset files.
newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.
usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.
datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.
rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names 📑: A comma-separated string of category names that represent the dataset’s classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊
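For instance, the most-voted datasets can be pulled out with pandas. A small sketch, where the CSV file name is a placeholder for the actual file in this dataset:

import pandas as pd

meta = pd.read_csv("kaggle_datasets_metadata.csv")  # placeholder file name
top10 = meta.sort_values("totalVotes", ascending=False).head(10)
print(top10[["ownerName", "datasetUrl", "totalVotes", "usabilityRating"]])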
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
This dataset contains the Python files with the snippets required for the Kaggle kernel - https://www.kaggle.com/code/adeepak7/tensorflow-s-global-and-operation-level-seeds/
Since the kernel is about setting and re-setting global- and operation-level seeds, the effect of these seeds could not be nullified in subsequent cells. Hence, the snippets are provided as separate Python files, each executed independently in a separate cell.
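For background, TensorFlow combines a global seed with per-operation seeds, so a seed's effect persists across cells within the same process. A minimal sketch of the two levels:

import tensorflow as tf

tf.random.set_seed(42)               # global seed
a = tf.random.uniform([1], seed=10)  # operation-level seed
b = tf.random.uniform([1], seed=10)
# a and b differ within a run, but the whole sequence is reproduced when the
# program restarts, which is why each snippet runs as its own Python file.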
This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by the language so you can jump right in to using machine learning methods that assume image input.
Included are .tar.gz files, each named based on a file extension; when extracted, each will produce a folder of the same name.
tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
And we can peep inside a (somewhat smaller) member of the set to see that the subfolders are zenodo identifiers. A zenodo identifier corresponds to a single Github repository, so the png files produced are chunks of code of that extension type from a particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
So what's the difference?
The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy arrays, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find "." -type f -name *.png | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organizing by extension then zenodo id (as shown above).
I tested a few methods to write the single-channel 80x80 arrays as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.
import cv2
cv2.imwrite(image_path, image)
Given the above, it's pretty easy to load an image! Here is an example using imageio, followed by the older scipy approach (which is deprecated in newer Python).
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
image
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
' '
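Building on the image loaded above, a whole row can be decoded back into its original line of source code:

# Decode the first row of ordinals back into text; 116, 105, 109 -> 't', 'i', 'm'.
line = ''.join(chr(v) for v in image[0])
print(line)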
So how t...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
✅ Step 1: Mount the dataset
Search for my dataset pytorch-models and add it — this will mount it at:
/kaggle/input/pytorch-models/
✅ Step 2: Check file paths
Once mounted, the four files will be available at:
/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py
✅ Step 3: Copy files to working directory
To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):
import shutil
shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
✅ Step 4: Import your modules
Now that they are in the working directory, you can import them like normal:
import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models
Or, if you only want to import specific classes or functions:
from base_models import YourModelClass
from ext_base_models import AnotherModelClass
✅ Step 5: Use the models
You can now initialize and use the models/classes/functions defined inside each file:
model = base_models.YourModelClass()
output = model(input_data)
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
License information was derived automatically
This dataset contains relevant notebook submission files and papers:
Notebook submission files from:
PS S3E18 EDA + Ensembles by @zhukovoleksiy v8 0.65031.
PS_3.18_LGBM_bin by @akioonodera v9 0.64706.
PS3E18 EDA| Ensemble ML Pipeline |BinaryPredictict by @tetsutani v37 0.65540.
0.65447 | Ensemble | AutoML | Enzyme Classify by @utisop v10 0.65447.
pyBoost baseline by @l0glikelihood v4 0.65446.
Random Forest EC classification by @jbomitchell RF62853_submission.csv 0.62853.
Overfit Champion by @onurkoc83 v1 0.65810.
Playground Series S3E18 - EDA & Separate Learning by @mateuszk013 v1 0.64933.
Ensemble ML Pipeline + Bagging = 0.65557 by @chingiznurzhanov v7 0.65557.
PS3E18| FeatureEnginering+Stacking by @jaygun84 v5 0.64845.
S03E18 EDA | VotingClassifier | Optuna v15 0.64776.
PS3E18 - GaussianNB by @mehrankazeminia v1 0.65898, v2 0.66009 & v3 0.66117.
Enzyme Weighted Voting by @nivedithavudayagiri v2 0.65028.
Multi-label With TF-Decision Forests by @gusthema v6 0.63374.
S3E18 Target_Encoding LB 0.65947 by @meisa0 v1 0.65947.
Boost Classifier Model by @satyaprakashshukl v7 0.64965.
PS3E18: Multiple lightgbm models + Optuna by @syerramilli v4 0.64982.
s3e18_solution for overfitting public :0.64785 by @onurkoc83 v1 0.64785.
PSS3E18 : FLAML : roc_auc_weighted by @gauravduttakiit v2 0.64732.
PGS318: combiner by @kdmitrie v4 0.65350.
averaging best solutions mean vs Weighted mean by @omarrajaa v5 0.66106.
Papers
N Nath & JBO Mitchell, Is EC class predictable from reaction mechanism? BMC Bioinformatics, 13:60 (2012) doi: 10.1186/1471-2105-13-60
L De Ferrari & JBO Mitchell, From sequence to enzyme mechanism using multi-label machine learning, BMC Bioinformatics, 15:150 (2014) doi: 10.1186/1471-2105-15-150
N Nath, JBO Mitchell & G Caetano-Anollés, The Natural History of Biocatalytic Mechanisms, PLoS Computational Biology, 10, e1003642 (2014) doi: 10.1371/journal.pcbi.1003642
KE Beattie, L De Ferrari & JBO Mitchell, Why do sequence signatures predict enzyme mechanism? Homology versus Chemistry, Evolutionary Bioinformatics, 11: 267-274 (2015) doi: 10.4137/EBO.S31482
HY Mussa, L De Ferrari & JBO Mitchell, Enzyme Mechanism Prediction: A Template Matching Problem on InterPro Signature Subspaces, BMC Research Reports, 8:744 (2015) doi: 10.1186/s13104-015-1730-7
This dataset is re-created from the "ISO 3166 Countries with Regional Codes" dataset for specific cases.
"ISO 3166 Countries with Regional Codes" dataset: https://www.kaggle.com/datasets/aungdev/iso-3166-countries-with-regional-codes
Code used to create the country_codes_and_continents.csv file: https://www.kaggle.com/code/aungdev/create-country-codes-and-continents-csv-file
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R; here is the code for how you do this:
df<-read.csv('uber.csv')
df_black<-subset(uber_df, uber_df$name == 'Black')
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()
Here's what the dataset contains
filepath – contains the filepath to the images.
prompt_{x/y/chart_type} – contains the label for the images.
The cleaning steps taken:
Notebook used to create the dataset: https://www.kaggle.com/code/pragyanbo/cleaned-dataset-creator/notebook
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Note: This is a work in progress, and not all the Kaggle forums are included in this dataset. The remaining forums will be added once I finish solving some issues with the data generators related to these forums.
Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping with the selenium library, converting text data into markdown style using the markdownify package.
This dataset contains information about the discussion main topic, topic title, comments, votes, medals and more, and is designed to serve as a complement to the data available on the Kaggle meta dataset, specifically for recent discussions. Keep reading to see the details.
Since Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping techniques using the selenium library.
The functions and classes used to scrape the data on Kaggle were stored in a utility script publicly available here. As JS-generated pages like Kaggle are unstable when scraping, the mentioned script implements capabilities for retrying connections and awaiting elements to appear.
Each forum was scraped using its own notebook, and those notebooks were then connected to a central notebook that generates this dataset. The discussions are also scraped in parallel to enhance speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent to the oldest.
If you need more control on the data you want to research, feel free to import all you need from the utility script mentioned before.
This dataset contains several folders, each named as the discussion forum they contain data about. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder, you'll find two files: one is a csv file and the other a json file.
The json file (in Python, represented as a dictionary) is indexed with the ID that Kaggle assigns to each discussion. Each ID is paired with its corresponding discussion, represented as a nested dictionary (the discussion dict) with the following fields:
- title: The title of the main topic.
- content: Content of the main topic.
- tags: List containing the discussion's tags.
- datetime: Date and time at which the discussion was published (in ISO 8601 format).
- votes: Number of votes received by the discussion.
- medal: Medal awarded to the main topic (if any).
- user: User that published the main topic.
- expertise: Publisher's expertise, measured by the Kaggle progression system.
- n_comments: Total number of comments in the current discussion.
- n_appreciation_comments: Total number of appreciation comments in the current discussion.
- comments: Dictionary containing data about the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
  - content: Comment's content.
  - is_appreciation: Whether the comment is an appreciation comment.
  - is_deleted: Whether the comment was deleted.
  - n_replies: Number of replies to the comment.
  - datetime: Date and time at which the comment was published (in ISO 8601 format).
  - votes: Number of votes received by the comment.
  - medal: Medal awarded to the comment (if any).
  - user: User that published the comment.
  - expertise: Publisher's expertise, measured by the Kaggle progression system.
  - n_deleted: Total number of deleted replies (including self).
  - replies: A dict following this same format.
The csv file, on the other hand, serves as a summary of the json file, with comment information limited to the hottest and most-voted comments.
Note: Only the 'content' field is mandatory for each discussion. The availability of the other fields is subject to the stability of the scraping tasks, which may also affect the update frequency.
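For example, the nested comment tree can be walked recursively. A minimal sketch, where the json file path is a placeholder for whichever forum folder you open:

import json

with open("competition-hosting/discussions.json") as f:  # placeholder path
    discussions = json.load(f)

def count_comments(comments):
    # Count each comment plus all of its nested replies.
    return sum(1 + count_comments(c.get("replies", {})) for c in comments.values())

for disc_id, disc in discussions.items():
    print(disc_id, disc.get("title"), count_comments(disc.get("comments", {})))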
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.
Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
And a python3 version:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
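For instance, a row can be reshaped back into a 32x32 RGB image with numpy, following the channel layout above (note that with encoding='bytes' the python3 dictionary keys are bytes):

import numpy as np

batch = unpickle("data_batch_1")
row = batch[b"data"][0]  # 3072 uint8 values: red, green, then blue planes
image = row.reshape(3, 32, 32).transpose(1, 2, 0)  # -> (32, 32, 3), HWC order
label = batch[b"labels"][0]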
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>

In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Source-code-related tasks for machine learning have become important with the growing demand for software production. Our main goal with this dataset is to support bug detection and repair.
The dataset is based on the CodeNet project and contains python code submissions for online coding competitions. The data is obtained by selecting consecutive attempts of a single user that resulted in fixing a buggy submission. Thus the data is represented by code pairs, annotated with the diff and error of each changed instruction. We have already tokenized all the source code files and kept the same format as in the original dataset.
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Our goal is to create a bug detection and repair pipeline for online coding competition problems.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This AI-Kaggle-Assistant-File dataset is part of a notebook that has been specially prepared for use in the competition task Google - Gemini Long Context.
The following files can be found here:
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
!cp -r /kaggle/input/rouge-score/rouge_score-0.1.2 /kaggle/working/
!pip install /kaggle/working/rouge_score-0.1.2/

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
'The quick brown dog jumps on the log.')
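Each value in the returned dictionary is a Score tuple with precision, recall, and fmeasure fields, so the results can be printed like this:

for metric, score in scores.items():
    print(metric, round(score.fmeasure, 3))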
This is the whl file for version 0.1.9 of TabPFN.
!pip install /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl
followed by:
!mkdir /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
!cp /kaggle/input/tabpfn-019-whl/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/
This dataset includes the files:
* prior_diff_real_checkpoint_n_0_epoch_42.cpkt from https://github.com/automl/TabPFN/tree/main/tabpfn/models_diff
* prior_diff_real_checkpoint_n_0_epoch_100.cpkt which seems to be the model file required.
Here is a use case demonstration notebook: "TabPFN test with notebook in 'Internet off' mode"
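A minimal usage sketch, assuming the 0.1.x API and placeholder arrays X_train, y_train, X_test:

from tabpfn import TabPFNClassifier

# device='cpu' works on Kaggle without a GPU; the checkpoint copied above is
# loaded from the models_diff folder created earlier.
classifier = TabPFNClassifier(device='cpu', N_ensemble_configurations=32)
classifier.fit(X_train, y_train)  # X_train, y_train: your own small tabular data
y_pred = classifier.predict(X_test)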
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Here is a description of how the datasets for a training notebook used in the Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Here are the statistics about the classes of programming languages from the Github Code Snippets database:
[Chart: distribution of programming-language classes in the Github Code Snippets database]
From this database, two csv files were created, with 50,000 code samples for the 20 programming languages included, using equal-number and stratified sampling, respectively. The related files are sample_equal_prop_50000.csv and sample_stratified_50000.csv.
A second option for capturing additional examples was to run this notebook with a larger number of queries, 10,000.
The resulting file is dataset-10000.csv, included in the data card.
The statistics for the programming languages are as on the next chart; this file has 32 labeled classes.
[Chart: distribution of the 32 labeled classes in dataset-10000.csv]
To make the model more robust, code samples for 20 additional languages were collected, 10 to 15 samples each, covering more or less popular use cases. Also, for the class "OTHER" (regular-language text examples, per the task of the competition), text examples from this dataset of prompts on Huggingface were added to the file. The resulting file here is rare_languages.csv, also in the data card.
The statistics for the rare-language code snippets are as follows:
[Chart: class distribution of the rare-language code snippets]
For this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were cut down to just two, "snippet" and "language"; the version of the file with equal numbers is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was cut out, and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
The prepared files took too much RAM to be read by the pandas library, which is why additional preprocessing was done - symbols like quotes, commas, ampersands, newlines and tab characters were cleaned out. After cleaning, the files were merged with the rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
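The cleaning and merging step might look roughly like the following pandas sketch; the exact characters removed follow the description above:

import pandas as pd

df = pd.read_csv("sample_equal_prop_50000_clean.csv")
# Strip quotes, commas, ampersands, newlines and tab characters.
df["snippet"] = df["snippet"].str.replace(r"['\",&\n\t]", " ", regex=True)

rare = pd.read_csv("rare_languages.csv")
combined = pd.concat([df, rare], ignore_index=True)
combined.to_csv("sample_equal_prop_50000_-no-symbols-rare-clean.csv", index=False)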
The final distribution of classes turned out as follows:
[Chart: final class distribution]
To be suitable for the TF-DF format, each programming language was also assigned a label. The final labels are in the data card.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset was created by Darren Chahal
Released under MIT