MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset enriches the Meta Kaggle dataset using Meta Kaggle Code to extract all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.
| Most Imported R Packages | Most Imported Python Packages |
|---|---|
We perform this extraction using the following three regex patterns:
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
PYTHON_METHOD_REGEX = *I wish I could add the regex here but kaggle kinda breaks if I do lol*
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
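For illustration, here is a minimal sketch of how the import patterns above might be applied to a kernel's source code to build the Imports list (the sample source strings are made up):
import re

PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')

# Hypothetical kernel sources, just for illustration
python_source = "import numpy as np\nfrom sklearn.model_selection import train_test_split"
r_source = 'library(ggplot2)\nrequire("dplyr")'

# Each Python match is a tuple of (from-group, import-group); keep whichever is non-empty
python_imports = [a or b for a, b in PYTHON_IMPORT_REGEX.findall(python_source)]
r_imports = R_IMPORT_REGEX.findall(r_source)

print(python_imports)  # ['numpy', 'sklearn.model_selection']
print(r_imports)       # ['ggplot2', 'dplyr']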
This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run on a Kaggle kernel, it is not scheduled. A notebook demonstrating how to create this dataset and what insights it provides can be found here.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
To install pyspark when running a notebook without internet access:
(1) Attach the pyspark-package dataset to your notebook.
(2) Install pyspark with the following code:
import shutil
src_path = r"/kaggle/input/pyspark-package/pyspark-latest.tar.gz.mp4"
dst_path = r"/kaggle/working/pyspark-latest.tar.gz"
shutil.copy(src_path, dst_path)
!pip install /kaggle/working/pyspark-latest.tar.gz
Or, for a specific version, check whether that version is available in the dataset; for example, for 3.5.0:
import shutil
src_path = r"/kaggle/input/pyspark-package/pyspark-3.5.0.tar.gz.mp4"
dst_path = r"/kaggle/working/pyspark-3.5.0.tar.gz"
shutil.copy(src_path, dst_path)
!pip install /kaggle/working/pyspark-3.5.0.tar.gz
(3) Then you can use:
import pyspark
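As a quick sanity check that the offline install works, you could start a local Spark session (a minimal sketch; no cluster is assumed):
from pyspark.sql import SparkSession

# Local mode only, just to confirm pyspark is importable and functional
spark = SparkSession.builder.master("local[*]").appName("offline-check").getOrCreate()
print(spark.version)
spark.stop()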
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
https://www.kaggle.com/datasets/muzammilaliveltech/farm-harmful-animals-dataset
This dataset is not mine; it was uploaded to Kaggle by MUZAMMIL ALI VELTECH under CC0: Public Domain. This Roboflow project was created as an attempt to use the dataset after having issues importing it from Kaggle into a Jupyter Notebook.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
✅ Step 1: Mount the dataset
Search for my dataset pytorch-models and add it — this will mount it at:
/kaggle/input/pytorch-models/
✅ Step 2: Check file paths Once mounted, the four files will be available at:
/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py
✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):
import shutil
shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:
import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models
Or, if you only want to import specific classes or functions:
from base_models import YourModelClass
from ext_base_models import AnotherModelClass
✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:
model = base_models.YourModelClass()
output = model(input_data)
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Source repo is google/flan-t5-large.
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
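For text generation specifically, a minimal sketch would load the seq2seq head instead (AutoModelForSeq2SeqLM reads the same local files; the example prompt is arbitrary):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = '/kaggle/input/googleflan-t5-large/flan-t5-large'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))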
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Pre-trained weights of Deplot (source: https://huggingface.co/google/deplot) to import directly to the notebook when the internet is off.
First, run this to install the latest version of the library transformers (reason)
!pip install git+https://github.com/huggingface/transformers
Usage:
from transformers import Pix2StructForConditionalGeneration, AutoProcessor
model = Pix2StructForConditionalGeneration.from_pretrained('/kaggle/input/google-deplot-model')
processor = AutoProcessor.from_pretrained('/kaggle/input/google-deplot-model')
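From there, a minimal inference sketch might look like this (the image path is a hypothetical chart attached to the notebook; the prompt follows the DePlot model card):
from PIL import Image

# Hypothetical path to a chart image attached to the notebook
image = Image.open('/kaggle/input/your-chart-images/chart.png')
inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))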
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By [source]
This dataset allows readers to unlock hidden insights into contemporary literature and the books that people are choosing to purchase. It provides comprehensive and powerful data related to a web books retailer, books.toscrape.com, featuring 12 columns of crucial book metadata gathered through web scraping methods in November 2020. Researching publications through this information provides a great sense of insight and understanding into the current reading climate: uncovering emerging trends in what people are buying, reading, rating, and loving worldwide. With this dataset at your disposal you can explore book popularity from a commercial standpoint as well as a creative one; examining publishing preferences from authors' points of view across reviews and genres alike. Dive into discovering the secrets behind book selection habits by delving into topics ranging from rating systems for certain works to pricing analysis for publishers- all fuelled by this carefully organised streamline of data at play here today!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
To get started analyzing this dataset with Kaggle notebooks or other tools:
- Open your tool of choice (a Kaggle notebook or another environment) that supports reading CSV files.
- Import the dataset.csv file into your chosen program.
- Explore each column individually to better understand what type of book metadata exists within each category – descriptors such as title, image URL, rating, number of reviews, description and more can be found here.
- Once familiar with the metadata in each column, begin exploring correlations between them to deepen your understanding of trends among different types of books over time, broken down by category.
- Lastly, use third-party packages available in your chosen programming language (e.g., Pandas) to continue exploring deeper analysis possibilities; a small starter sketch is shown below.
By following these steps you are ready to start exploring powerful literature insights into contemporary reading material! Enjoy discovering hidden insights within this book metadata that may otherwise have gone undiscovered!
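For instance, a minimal pandas starter (the grouping example is commented out because the exact column names should be checked against the file first):
import pandas as pd

# Load the scraped book metadata
df = pd.read_csv('dataset.csv')

# Inspect the columns and a few rows
print(df.columns.tolist())
print(df.head())

# Example follow-up once you know the real column names (assumed names, shown commented out):
# print(df.groupby('Category')['Price Excluding Tax'].mean())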
- Generating recommendations of books based on popularity, price point and/or rating.
- Tracking the success of certain authors/publishers in the long term and understanding their audience preferences.
- Analysing which types of books consumers prefer (genre, age group targeting) over time to provide useful data to new authors to increase their chances of success
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dataset.csv

| Column name | Description |
|:---|:---|
| Logan Kade (Fallen Crest High #5.5) | Title of the book. (String) |
| https | Image URL of the book. (String) |
| Two | Rating of the book. (Integer) |
| Academic | Description Category of the book. (String) |
| 7093cf549cd2e7de | Universal Product Code (UPC) of the book. (String) |
| Books | Product Type of the book. (String) |
| £13.12 | Price Excluding Tax of the book. (Float) |
| £13.12.1 | Price Including Tax of the book. (Float) |
| £0.00 | Tax Amount of the book. (Float) |
| In stock (5 available) | Availability of the book. (String) |
If you use this dataset in your research, please credit the original authors.
The CLIP (Contrastive Language–Image Pre-training) model is an innovative approach developed by OpenAI, designed to enhance the robustness of computer vision tasks. It leverages a unique training regimen that aligns images with textual descriptions using a contrastive loss, enabling it to perform image classification tasks in a zero-shot manner. This means CLIP can generalize to classify images it has never seen before based solely on textual descriptions, without the need for further training specific to those tasks.
To use the CLIP model in your Kaggle notebooks, follow these simple steps:
Add the Model as a Kaggle Dataset: Ensure that the dataset containing the CLIP model files is attached to your Kaggle notebook. This dataset includes the necessary model and processor files.
Initialize the Model and Processor: You can load the model and processor directly from the path where the dataset files are stored using the following code snippet:
# Import CLIP model from transformers
from transformers import CLIPModel, CLIPProcessor
# Set the path to the model files
model_path = '/kaggle/input/openaiclip-vit-base-patch32'
# Load the CLIP model
clip_model = CLIPModel.from_pretrained(model_path)
# Load the CLIP processor
clip_processor = CLIPProcessor.from_pretrained(model_path)
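Run zero-shot classification: once the model and processor are loaded, you can score an image against a set of text prompts (a minimal sketch; the image path and candidate labels are placeholders):
import torch
from PIL import Image

# Hypothetical image attached to the notebook
image = Image.open('/kaggle/input/your-images/example.jpg')
labels = ["a photo of a cat", "a photo of a dog"]

# Preprocess the text prompts and the image together
inputs = clip_processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = clip_model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))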
This is an example of how to keep a .py file and import it into a notebook. It is useful for code competitions where you want to integrate utility code without duplication across notebooks.
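A minimal sketch of the pattern (the dataset slug and module name are hypothetical):
import sys

# Make the attached utility-script dataset importable without copying files around
sys.path.append('/kaggle/input/my-utility-scripts')

import my_utils  # i.e. my_utils.py stored in the attached dataset
print(my_utils.__file__)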
from flash_attn.flash_attn_interface import flash_attn_unpadded_func as flash_attention_func

def patch_model_with_flash_attn(model):
    # Navigating through the 'model' attribute and then accessing 'layers'
    for layer in model.model.layers:
        # Assuming 'self_attn' is the correct component to modify
        layer.self_attn.attention_module = flash_attention_func
patch_model_with_flash_attn(model)
GNU GPL v2.0 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
Run code server on Google Colab or Kaggle Notebooks
Quickstart:
- Install colabcode: pip install colabcode
- Import colabcode: from colabcode import ColabCode
- Run: ColabCode(port=10000, password="abhishek")
You can also run it with any password or port :)
Colab starter notebook: Open In Colab
ColabCode has the following arguments:
- port: the port you want to run code-server on, default 10000
- password: password to protect your code server from being accessed by someone else. Note that there is no password by default!
- mount_drive: True or False to mount your Google Drive
ColabCode comes pre-installed with some VS Code extensions.
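Putting those arguments together, a typical call might look like this (the password value is just an example):
from colabcode import ColabCode

# Start code-server on port 10000, password-protected, with Google Drive mounted
ColabCode(port=10000, password="my-secret-password", mount_drive=True)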
See an example in this video tutorial: https://www.youtube.com/watch?v=7kTbM3D02jU
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Amazon Scraping Dataset. The workflow covers:
1. Import libraries
2. Connect to the website
3. Import CSV and datetime
4. Import pandas
5. Append data to the CSV
6. Automate dataset updates
7. Set up timers
8. Email notification
A sketch of the core scraping-and-append step is shown below.
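For illustration, a minimal version of steps 1–5 might look like this (the URL and element IDs are hypothetical and need to be adapted to the actual product page):
import csv
from datetime import datetime

import requests
from bs4 import BeautifulSoup

# Hypothetical product page URL and a browser-like User-Agent header
url = "https://www.amazon.com/dp/EXAMPLE-ASIN"
headers = {"User-Agent": "Mozilla/5.0"}

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# The element IDs below are assumptions; inspect the page to find the real ones
title = soup.find(id="productTitle").get_text(strip=True)
price = soup.find(id="priceblock_ourprice").get_text(strip=True)

# Append one row per run so the CSV grows over time
with open("amazon_dataset.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([title, price, datetime.now().isoformat()])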
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the mobilebert hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining
MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
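A short sanity check once the model is loaded (the input sentence is arbitrary):
import torch

# Tokenize a sample sentence and run a forward pass
inputs = tokenizer("Kaggle datasets make offline model loading easy.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# For the pre-training head, prediction_logits has shape (batch, seq_len, vocab_size)
print(outputs.prediction_logits.shape)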
Acknowledgements All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
If you want to import Unsloth while turning off the internet:
!pip install --no-index --find-links=/kaggle/input/unsloth-for-offline torch torchvision torchaudio
!pip install --no-index --find-links=/kaggle/input/unsloth-for-offline xformers
!pip install --no-index --find-links=/kaggle/input/unsloth-for-offline unsloth
!pip install --no-index --find-links=/kaggle/input/unsloth-for-offline bitsandbytes
Then you can follow the standard notebooks in the Unsloth documentation to fine-tune your model.
Pipeline / model splitting loading is also allowed, so if you do not have enough VRAM for 1 GPU to load say Llama 70B, no worries - we will split the model for you on each GPU! To enable this, use the device_map = "balanced" flag:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/Llama-3.3-70B-Instruct",
load_in_4bit = True,
device_map = "balanced",
)
Contributors have also created repos to enable or improve multi-GPU support with Unsloth. If you want to use opensloth with the internet turned off, run the following code step by step:
```
import tarfile
import os
source_dir = "/kaggle/input/unsloth-for-offline/fire-0.7.0/fire-0.7.0"
output_path = "/kaggle/working/fire-0.7.0.tar.gz"  # You can change this path

with tarfile.open(output_path, "w:gz") as tar:
    tar.add(source_dir, arcname=os.path.basename(source_dir))

print(f"Created: {output_path}")
!pip install --no-index --find-links=/kaggle/working/ fire
!pip install --no-index --find-links=/kaggle/input/unsloth-for-offline opensloth==0.1.7
```
Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
This is a news classifier dataset. It has two columns: the first column contains different types of news and the second column contains the news category. This is a multi-class text classification problem.
import pandas as pd

encodings_to_try = ['utf-8', 'Latin-1', 'ISO-8859-1']
for encoding in encodings_to_try:
    try:
        df = pd.read_csv('/kaggle/input/classify-news-into-category/News Categoires.csv', encoding=encoding)
        print("File read successfully with encoding:", encoding)
        print(df.head())
        break
    except UnicodeDecodeError:
        pass

df.head()
This dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining
MODEL_DIR = "/kaggle/input/huggingface-roberta/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Context & Motivation
This dataset provides a comprehensive, self-contained offline installer for the vllm library, a high-throughput engine for LLM inference. It is specifically designed to solve the common "no internet access" problem in Kaggle competitions like the ARC Prize, where packages must be installed from local files. Using this dataset eliminates pip install failures and ensures a consistent, reproducible environment for your submission notebook.
Content The dataset contains a single directory, vllm_wheels, which includes the Python wheel file for vllm==0.9.2 and all of its required dependencies. These files were downloaded and packaged in a standard Kaggle environment to ensure maximum compatibility with the competition's execution environment (Python 3.10, CUDA 12.x).
Usage To use this dataset in your Kaggle notebook (with internet turned OFF):
import os
# --- vLLM Offline Installation ---
# Path to the directory containing the wheel files
WHEELS_PATH = "/kaggle/input/vllm-0-9-2-offline-installer/vllm_wheels"
print("Starting offline installation of vLLM...")
!pip install --no-index --find-links={WHEELS_PATH} vllm
print("Installation complete.")
# Verify the installation
import vllm
print(f"vLLM version {vllm._version_} successfully installed.")
Use this dataset when submitting code offline for competitions; otherwise, just use !pip install tabpfn for online use. Usage for offline code submissions within Kaggle notebooks is as follows:
1. First, add the dataset by selecting "Add Data", searching for this dataset, and adding it to your input.
2. Next, add the following code to a code block in your notebook:
!pip install tabpfn --no-index --find-links=file:///kaggle/input/tabpfn
!mkdir -p /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
!cp /kaggle/input/tabpfn/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/
3. Import:
from tabpfn import TabPFNClassifier
4. Now you are all set: you can create a classifier and run it offline for submission in offline Kaggle code competitions:
classifier = TabPFNClassifier(device='cpu',N_ensemble_configurations=64)
classifier.fit(X_train, Y_train)
y_eval, p_eval = classifier.predict(X_cv, return_winning_probability=True)
If you want to use TabPFN with GPU use the following code when you make the model:
classifier = TabPFNClassifier(device='cuda',N_ensemble_configurations=32)
You can find documentation for this package on GitHub: https://github.com/automl/TabPFN.git
The original paper on TabPFN can be found at: https://arxiv.org/abs/2207.01848
License: Copyright 2022 Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Note: This is a work in progress, and not all the Kaggle forums are included in this dataset. The remaining forums will be added once I finish resolving some issues with the data generators related to these forums.
Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping with the selenium library, and the text data is converted to Markdown using the markdownify package.
This dataset contains information about the discussion main topic, topic title, comments, votes, medals and more, and is designed to serve as a complement to the data available on the Kaggle meta dataset, specifically for recent discussions. Keep reading to see the details.
Because Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping with the selenium library.
The functions and classes used to scrape the data on Kaggle are stored in a utility script publicly available here. As JS-generated pages like Kaggle are unstable when scraped, the script implements retrying connections and waiting for elements to appear.
Each forum was scraped with its own notebook; those notebooks were then connected to a central notebook that generates this dataset. The discussions are also scraped in parallel to improve speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent to the oldest.
If you need more control on the data you want to research, feel free to import all you need from the utility script mentioned before.
This dataset contains several folders, each named after the discussion forum it contains data about. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder, you'll find two files: a CSV file and a JSON file.
The JSON file (in Python, represented as a dictionary) is indexed by the ID that Kaggle assigns to each discussion. Each ID is paired with its corresponding discussion, represented as a nested dictionary (the discussion dict) with the following fields:
- title: The title of the main topic.
- content: Content of the main topic.
- tags: List containing the discussion's tags.
- datetime: Date and time at which the discussion was published (in ISO 8601 format).
- votes: Number of votes received by the discussion.
- medal: Medal awarded to the main topic (if any).
- user: User that published the main topic.
- expertise: Publisher's expertise, measured by the Kaggle progression system.
- n_comments: Total number of comments in the current discussion.
- n_appreciation_comments: Total number of appreciation comments in the current discussion.
- comments: Dictionary containing data about the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
  - content: Comment's content.
  - is_appreciation: Whether the comment is an appreciation comment.
  - is_deleted: Whether the comment was deleted.
  - n_replies: Number of replies to the comment.
  - datetime: Date and time at which the comment was published (in ISO 8601 format).
  - votes: Number of votes received by the current comment.
  - medal: Medal awarded to the comment (if any).
  - user: User that published the comment.
  - expertise: Publisher's expertise, measured by the Kaggle progression system.
  - n_deleted: Total number of deleted replies (including self).
  - replies: A dict following this same format.
The CSV file, on the other hand, serves as a summary of the JSON file, with comment information limited to the hottest and most voted comments.
Note: Only the 'content' field is mandatory for each discussion. The availability of the other fields is subject to the stability of the scraping tasks, which may also affect the update frequency.
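A minimal sketch for loading one forum's files and walking the structure described above (the dataset slug and exact file names are assumptions; check the folder contents first):
import json
import pandas as pd
from pathlib import Path

# The dataset slug below is an assumption; adjust it to the actual mount path
forum_dir = Path("/kaggle/input/kaggle-forum-discussions/competition-hosting")

# Each folder contains one CSV and one JSON file; pick them up without hard-coding names
csv_path = next(forum_dir.glob("*.csv"))
json_path = next(forum_dir.glob("*.json"))

summary = pd.read_csv(csv_path)
with open(json_path, "r", encoding="utf-8") as f:
    discussions = json.load(f)

# Each top-level key is a Kaggle discussion ID; 'content' is the only guaranteed field
for disc_id, disc in list(discussions.items())[:3]:
    print(disc_id, disc.get("title"), disc.get("votes"), len(disc.get("comments", {})))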