MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset enriches the Meta Kaggle dataset using Meta Kaggle Code to extract all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.
[Figures: Most Imported R Packages | Most Imported Python Packages]
We perform this extraction using the following three regex patterns:
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
PYTHON_METHOD_REGEX = *I wish I could add the regex here but kaggle kinda breaks if I do lol*
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
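As a minimal sketch of how these patterns can be applied (the helper below is illustrative, not the exact extraction code used to build the dataset):

```python
import re

# Same pattern as PYTHON_IMPORT_REGEX above.
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')

def extract_python_imports(source):
    # Each match yields two groups (from-import vs. plain import); keep whichever is non-empty
    # and reduce dotted modules such as "pandas.io" to the top-level package "pandas".
    imports = set()
    for from_mod, plain_mod in PYTHON_IMPORT_REGEX.findall(source):
        imports.add((from_mod or plain_mod).split('.')[0])
    return sorted(imports)

print(extract_python_imports("import numpy as np\nfrom pandas.io import json"))
# ['numpy', 'pandas']
```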
This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run on a Kaggle kernel, it is not scheduled. A notebook demonstrating how to create this dataset and what insights it provides can be found here.
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage to Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the data directory:
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
You can directly load LSTM and LSTM-SA with torch.hub:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub:
```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load them directly from the Hugging Face repository:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
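A short usage sketch once the model and tokenizer are loaded (the example sentence is arbitrary; the model returns standard Hugging Face encoder outputs):

```python
import torch

inputs = tokenizer("The patient presented with AMI and elevated troponin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape [batch, sequence_length, hidden_size]: contextual token embeddings.
print(outputs.last_hidden_state.shape)
```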
Download the bibtex here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, PyTorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.
Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
```python
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
```
And a python3 version:
```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
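As a short sketch of turning one batch into images (with the python3 loader above, the dictionary keys are bytes, e.g. b'data'):

```python
import numpy as np

batch = unpickle("data_batch_1")   # using the python3 unpickle() above
data = batch[b'data']              # shape (10000, 3072), dtype uint8
labels = batch[b'labels']          # list of 10000 ints in 0-9

# Reshape each 3072-long row into a 3x32x32 image (channels, rows, cols),
# then move channels last for plotting: (32, 32, 3).
images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
print(images.shape, labels[0])
```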
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>

In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
📦 Ecommerce Dataset (Products & Sizes Included)
🛍️ Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends

📌 Overview
This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.
🔹 Dataset Features
| Column Name | Description |
|---|---|
| product_id | Unique identifier for the product |
| title | Product name/title |
| product_description | Detailed product description |
| rating | Average customer rating (0-5) |
| ratings_count | Number of ratings received |
| initial_price | Original product price |
| discount | Discount percentage (%) |
| final_price | Discounted price |
| currency | Currency of the price (e.g., USD, INR) |
| images | URL(s) of product images |
| delivery_options | Available delivery methods (e.g., standard, express) |
| product_details | Additional product attributes |
| breadcrumbs | Category path (e.g., Electronics > Smartphones) |
| product_specifications | Technical specifications of the product |
| amount_of_stars | Distribution of star ratings (1-5 stars) |
| what_customers_said | Customer reviews (sentiments) |
| seller_name | Name of the product seller |
| sizes | Available sizes (for clothing, shoes, etc.) |
| videos | Product video links (if available) |
| seller_information | Seller details, such as location and rating |
| variations | Different variants of the product (e.g., color, size) |
| best_offer | Best available deal for the product |
| more_offers | Other available deals/offers |
| category | Product category |
📊 Potential Use Cases
📌 Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
🔍 Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
🎯 Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
🗣 Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
📈 Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.

🔍 Why Use This Dataset?
✅ Rich Feature Set: Includes all necessary ecommerce attributes.
✅ Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
✅ Multi-Purpose: Suitable for machine learning, web development, and data visualization.
✅ Structured Format: Easy-to-use CSV format for quick integration.
📂 Dataset Format
CSV file (ecommerce_dataset.csv)
1000+ samples
Multi-category coverage
🔗 How to Use?
Download the dataset from Kaggle.
Load it in Python using Pandas:
```python
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv")
df.head()
```
Explore trends & patterns using visualization tools (Seaborn, Matplotlib); a short sketch follows below.
Build models & applications based on the dataset!
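A minimal exploration sketch, assuming the column names from the feature table above (rating, final_price, category) and that final_price is stored as a numeric value:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("ecommerce_dataset.csv")

# Distribution of customer ratings (column name taken from the feature table above).
sns.histplot(df["rating"].dropna(), bins=20)
plt.title("Distribution of product ratings")
plt.show()

# Average final price per category, highest first.
print(df.groupby("category")["final_price"].mean().sort_values(ascending=False).head(10))
```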
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains 2300 multimodal IoT sensor recordings collected from athletes during traditional sports training sessions, including basketball, soccer, running, and other athletic activities. The dataset includes heart rate, acceleration (X, Y, Z), gyroscope readings (X, Y, Z), speed, step count, jump height, and training load. It is designed to facilitate analysis of athlete performance, training load monitoring, and predictive modeling for sports science applications.
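A hedged loading sketch; the file name and column names below (heart_rate, acc_x/acc_y/acc_z, activity, training_load) are placeholders and should be checked against the actual files:

```python
import pandas as pd
import numpy as np

# Hypothetical file and column names; adjust to the dataset's actual schema.
df = pd.read_csv("athlete_iot_sensors.csv")

# Resultant acceleration magnitude from the three accelerometer axes.
df["acc_magnitude"] = np.sqrt(df["acc_x"]**2 + df["acc_y"]**2 + df["acc_z"]**2)

# Simple per-activity summary of heart rate and training load.
print(df.groupby("activity")[["heart_rate", "training_load"]].describe())
```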
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Instruction/Response-format Python code questions and answers. The OASST files are split into 3,000 lines each to prevent OOM errors when loading. The format of the files can easily be changed with a simple Python script:
```python
import pandas as pd
import json

input_file_path = 'output_file.jsonl'
output_file_path = 'output_file2.csv'

processed_records = []
with open(input_file_path, 'r') as file:
    for line in file:
        # Parse the JSON object from the current line
        record = json.loads(line)
        # Rename, filter the desired keys, and replace newline characters
        processed_record = {
            "prompt": record.get("INSTRUCTION", "").replace('\n', ' ').strip(),
            "response": record.get("RESPONSE", "").replace('\n', ' ').strip()
        }
        # Add the processed record to the list
        processed_records.append(processed_record)

df = pd.DataFrame(processed_records)
df.to_csv(output_file_path, index=False, quoting=2)  # quoting=2 (QUOTE_ALL) quotes all fields

print(f"Conversion complete. The output is saved to '{output_file_path}'")
```
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The MNIST dataset in HDF5 format.
Data can be loaded with the h5py package (pip install h5py); see the demo.
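Since the internal layout of the HDF5 file is not documented here, a hedged first step is to open it with h5py and list its contents (the file name and keys below are assumptions):

```python
import h5py

# Hypothetical file name; inspect the attached dataset for the actual one.
with h5py.File("mnist.h5", "r") as f:
    f.visit(print)  # list the groups/datasets stored in the file
    # Typical layouts expose something like train/test images and labels, e.g.:
    # x_train = f["train/images"][:]
    # y_train = f["train/labels"][:]
```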
The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.
The preprocessing script performs several key steps:
Text Cleaning:
- fix_encoding_with_bytes(text): Decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.
- replace_double_encoding(text): Fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).

Audio Processing:
- torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).

Video Processing:
- A face image is extracted from each video clip (see the face key below).

Saving Processed Samples:
- Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).
- Files keep the original clip name (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).

Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:
- utterance (str): The cleaned textual utterance.
- emotion (str/int): The corresponding emotion label.
- video_path (str): Original path to the video file from which the sample was extracted.
- audio (Tensor): Raw audio waveform tensor of shape [channels, time].
- audio_sample_rate (int): The sampling rate of the audio waveform.
- audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].
- face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.

The preprocessed files are organized into splits:
preprocessed_data/
├── train/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
├── dev/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
└── test/
    ├── dia0_utt0.pt
    ├── dia1_utt1.pt
    └── ...
A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:
from torch.utils.data import Dataset
import os
import torch
class PreprocessedMELDDataset(Dataset):
    def __init__(self, data_dir):
        """
        Args:
            data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample_path = self.files[idx]
        sample = torch.load(sample_path)
        return sample
def preprocessed_collate_fn(batch):
"""
Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
Modify this function to pad or stack tensor data if needed.
"""
collated = {}
collated['utterance'] = [sample['utterance'] for sample in batch]
collated['emotion'] = [sample['emotion'] for sample in batch]
collated['video_path'] = [sample['video_path'] for sample in batch]
collated['audio'] = [sample['audio'] for sample in batch]
collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
collated['face'] = [sample['face'] for sample in batch]
return collated
from torch.utils.data import DataLoader
# Define paths for each split
train_data_dir = "preprocessed_data/train"
dev_data_dir = "preproces...
This model was trained using a spaCy pipeline and data from job_description.
The method is based on NER to recognize job skills. In this model, I mostly focus on technical skills with the tag "SKILL".
The training source can be found here.
import spacy
from spacy.training.example import Example
import json
import random
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="spacy")
warnings.filterwarnings("ignore", category=FutureWarning, module="tensorflow")
path = "/kaggle/input/job_skills_extractor/scikitlearn/job_skill_extractor/1/job_skills_ner_model"
loaded_nlp = spacy.load(path)
# Test the loaded model with some example texts
test_texts = [
"I am skilled in Python and Java programming.",
"My experience includes using TensorFlow for machine learning.",
"I have hands-on experience with MongoDB and MySQL.",
"Build machine learning",
]
for text in test_texts:
doc = loaded_nlp(text)
print("Input Text:", text)
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
Input Text: I am skilled in Python and Java programming.
Entities: [('Python', "['SKILL']"), ('Java', "['SKILL']")]
Input Text: My experience includes using TensorFlow for machine learning.
Entities: [('TensorFlow', "['SKILL']"), ('machine learning.', "['SKILL']")]
Input Text: I have hands-on experience with MongoDB and MySQL.
Entities: [('MongoDB', "['SKILL']"), ('MySQL', "['SKILL']")]
Input Text: Build machine learning
Entities: [('machine learning', "['SKILL']")]
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
This dataset is derived from the ISIC Archive with the following changes:
If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know whether my approach to setting the target value is acceptable for the ISIC competition. Use at your own risk.
import os
import multiprocessing as mp
from PIL import Image, ImageOps
import glob
from functools import partial
def list_jpg_files(folder_path):
# Ensure the folder path ends with a slash
if not folder_path.endswith('/'):
folder_path += '/'
# Use glob to find all .jpg files in the specified folder (non-recursive)
jpg_files = glob.glob(folder_path + '*.jpg')
return jpg_files
def resize_image(image_path, destination_folder):
# Open the image file
with Image.open(image_path) as img:
# Get the original dimensions
original_width, original_height = img.size
# Calculate the aspect ratio
aspect_ratio = original_width / original_height
# Determine the new dimensions based on the aspect ratio
if aspect_ratio > 1:
# Width is larger, so we will crop the width
new_width = int(256 * aspect_ratio)
new_height = 256
else:
# Height is larger, so we will crop the height
new_width = 256
new_height = int(256 / aspect_ratio)
# Resize the image while maintaining the aspect ratio
img = img.resize((new_width, new_height))
# Calculate the crop box to center the image
left = (new_width - 256) / 2
top = (new_height - 256) / 2
right = (new_width + 256) / 2
bottom = (new_height + 256) / 2
# Crop the image if it results in shrinking
if new_width > 256 or new_height > 256:
img = img.crop((left, top, right, bottom))
else:
# Add black edges if it results in scaling up
img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
# Resize the image to the final dimensions
img = img.resize((256, 256))
img.save(os.path.join(destination_folder, os.path.basename(image_path)))
source_folder = ""
destination_folder = ""
images = list_jpg_files(source_folder)
with mp.Pool(processes=12) as pool:
images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
print("All images resized")
This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.
The HDF5 file is created using the following code:
import os
import pandas as pd
from PIL import Image
import h5py
import io
import numpy as np
# File paths
base_folder = "./isic-2020-256x256"
csv_file_path = 'train-metadata.csv'
image_folder_path = 'train-image/image'
hdf5_file_path = 'train-image.hdf5'
# Read the CSV file
df = pd.read_csv(os.path.join(base_folder, csv_file_path))
# Open an HDF5 file
with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
for index, row in df.iterrows():
isic_id = row['isic_id']
image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
if os.path.exists(image_file_path):
# Open the image file
with Image.open(image_file_path) as img:
# Convert the image to a byte buffer
img_byte_arr = io.BytesIO()
img.save(img_byte_arr, format=img.format)
img_byte_arr = img_byte_arr.getvalue()
hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
else:
print(f"Image file for {isic_id} not found.")
print("HDF5 file created successfully.")
To read the hdf5 file, use the following code:
import h5py
from PIL import Image
with h...
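The snippet above is truncated; a minimal sketch of the reading side, mirroring how the images are written as raw bytes in the code above, could look like this:

```python
import io
import h5py
from PIL import Image

with h5py.File("train-image.hdf5", "r") as hdf5_file:
    isic_id = next(iter(hdf5_file.keys()))        # any isic_id stored in the file
    img_bytes = hdf5_file[isic_id][()].tobytes()  # np.void scalar back to raw JPEG bytes
    img = Image.open(io.BytesIO(img_bytes))
    print(isic_id, img.size)
```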
A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in Lance file format for blazing-fast and memory-efficient I/O.
The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
The script used for creating the dataset can be found here.
This dataset is not meant to be used in Kaggle Kernels: Lance requires the dataset's input directory to be writable, Kaggle's input directory is read-only, and the dataset size prohibits moving it to /kaggle/working. Hence, to use this dataset, download it via the Kaggle API or through this page, then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.
First, download and unzip the dataset from your terminal (make sure you have your Kaggle API key at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/codeparrot-1m
$ mkdir codeparrot_1M.lance/
$ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
$ rm codeparrot-1m.zip
Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.
import lance
dataset = lance.dataset('codeparrot_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
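Beyond counting rows, here is a hedged sketch for peeking at the data; the column layout is not documented here, so the code inspects the schema and assumes each row holds one token id (consistent with count_rows() returning the token count):

```python
import lance
from transformers import AutoTokenizer

dataset = lance.dataset("codeparrot_1M.lance/")
print(dataset.schema)  # inspect the actual column name(s) and types first

# Assumption: the first (only) column stores one token id per row; adjust if the schema differs.
first_rows = dataset.take(list(range(256)))
token_ids = first_rows.column(0).to_pylist()

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(tokenizer.decode(token_ids))
```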
Considerations for Using the Data: The dataset consists of source code from a wide range of repositories. As such, it can potentially include harmful or biased code, as well as sensitive information like passwords or usernames.
This dataset contains different variants of the RoBERTa and XLM-RoBERTa models by Meta AI, available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining
MODEL_DIR = "/kaggle/input/huggingface-roberta/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
To install pyspark when running a notebook without internet access:
(1) Attach the pyspark-package dataset to your notebook.
(2) Install pyspark with the following code:
import shutil
src_path = r"/kaggle/input/pyspark-package/pyspark-latest.tar.gz.mp4"
dst_path = r"/kaggle/working/pyspark-latest.tar.gz"
shutil.copy(src_path, dst_path)
!pip install /kaggle/working/pyspark-latest.tar.gz
Or, for a specific version, check whether that version is available in the dataset; then you can use, e.g. for 3.5.0:
import shutil
src_path = r"/kaggle/input/pyspark-package/pyspark-3.5.0.tar.gz.mp4"
dst_path = r"/kaggle/working/pyspark-3.5.0.tar.gz"
shutil.copy(src_path, dst_path)
!pip install /kaggle/working/pyspark-3.5.0.tar.gz
(3) Then you can use:
```python
import pyspark
```
MNIST is a subset of a larger set available from NIST (it is copied from http://yann.lecun.com/exdb/mnist/).
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Four files are available:
Many methods have been tested with this training set and test set (see http://yann.lecun.com/exdb/mnist/ for more details)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
✅ Step 1: Mount to Dataset
Search for my dataset pytorch-models and add it — this will mount it at:
/kaggle/input/pytorch-models/
✅ Step 2: Check file paths Once mounted, the four files will be available at:
/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py
✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):
import shutil
shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:
import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models
Or, if you only want to import specific classes or functions:
from base_models import YourModelClass
from ext_base_models import AnotherModelClass
✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:
model = base_models.YourModelClass()
output = model(input_data)
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains 3,234 records of reinforced concrete (RC) beam design parameters and their corresponding load-bearing capacities. The data is based on realistic construction standards and includes geometric, material, and reinforcement details such as beam dimensions, concrete grade, reinforcement ratios, and stirrup specifications.
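A hedged sketch of the predictive-modeling use case; the file name and column names below are placeholders, not the dataset's documented schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("rc_beam_dataset.csv")  # hypothetical file name

# Placeholder feature/target names; replace with the dataset's actual columns.
features = ["beam_width", "beam_depth", "concrete_grade", "reinforcement_ratio"]
target = "load_capacity"

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out beams:", r2_score(y_test, model.predict(X_test)))
```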
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset describe the sepal length, sepal width, petal length, petal width, and species of each flower.
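To see the separability claim in code, here is a minimal sketch using scikit-learn's built-in copy of the same Fisher Iris data (so no assumptions are needed about this CSV's exact file or column names):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Setosa vs. the rest is linearly separable; cross-validated accuracy is essentially 1.0.
print(cross_val_score(clf, X, (y == 0).astype(int), cv=5).mean())

# Versicolor vs. virginica overlap, so accuracy stays below 1.0.
print(cross_val_score(clf, X[y != 0], y[y != 0], cv=5).mean())
```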
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.
Perfect for: - Object detection model training - Computer vision research - Retro gaming AI projects - YOLO algorithm benchmarking - Educational purposes
| Metric | Value |
|---|---|
| Total Images | 1,004 |
| Dataset Size | 12 MB |
| Image Format | PNG |
| Annotation Format | YOLO (.txt) |
| Classes | 4 |
| Train/Val Split | 711/260 (73%/27%) |
| Class ID | Class Name | Count | Description |
|---|---|---|---|
| 0 | dog | 252 | The hunting dog in various poses (jumping, laughing, sniffing, etc.) |
| 1 | duck_dead | 256 | Dead ducks (both black and red variants) |
| 2 | duck_shot | 248 | Ducks in the moment of being shot |
| 3 | duck_flying | 248 | Flying ducks in all directions (left, right, diagonal) |
yolo_dataset_augmented/
├── images/
│ ├── train/ # 711 training images
│ └── val/ # 260 validation images
├── labels/
│ ├── train/ # 711 YOLO annotation files
│ └── val/ # 260 YOLO annotation files
├── classes.txt # Class names mapping
├── dataset.yaml # YOLO configuration file
└── augmented_dataset_stats.json # Detailed statistics
The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:
{
'rotation_range': (-15, 15), # Small rotations for game sprites
'brightness_range': (0.7, 1.3), # Brightness variations
'contrast_range': (0.8, 1.2), # Contrast adjustments
'saturation_range': (0.8, 1.2), # Color saturation
'noise_intensity': 0.02, # Gaussian noise
'horizontal_flip_prob': 0.5, # 50% chance horizontal flip
'scaling_range': (0.8, 1.2), # Scale variations
}
from ultralytics import YOLO
# Load and train
model = YOLO('yolov8n.pt') # Load pretrained model
results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
# Validate
metrics = model.val()
# Predict
results = model('path/to/test/image.png')
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
class DuckHuntDataset(Dataset):
    def __init__(self, images_dir, labels_dir, transform=None):
        self.images_dir = images_dir
        self.labels_dir = labels_dir
        self.transform = transform
        self.images = os.listdir(images_dir)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.images_dir, self.images[idx])
        label_path = os.path.join(self.labels_dir,
                                  self.images[idx].replace('.png', '.txt'))
        image = Image.open(img_path)

        # Load YOLO annotations
        with open(label_path, 'r') as f:
            labels = f.readlines()

        if self.transform:
            image = self.transform(image)

        return image, labels
# Usage
dataset = DuckHuntDataset('images/train', 'labels/train')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
Each .txt file contains one line per object:
class_id center_x center_y width height
Example annotation:
0 0.492 0.403 0.212 0.315
Where values are normalized (0-1) relative to image dimensions.
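A small sketch of converting one of these normalized annotation lines back into pixel coordinates (the 256x240 image size is only an example; check each image's actual dimensions):

```python
def yolo_to_pixels(line, img_width, img_height):
    """Convert one 'class cx cy w h' annotation line (normalized 0-1) to a pixel box."""
    class_id, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_width, float(cy) * img_height
    w, h = float(w) * img_width, float(h) * img_height
    x_min, y_min = cx - w / 2, cy - h / 2
    return int(class_id), (round(x_min), round(y_min), round(x_min + w), round(y_min + h))

print(yolo_to_pixels("0 0.492 0.403 0.212 0.315", 256, 240))
# -> class 0 ("dog") with an (x_min, y_min, x_max, y_max) pixel box
```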
This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is a collection of question-answer pairs about the photo album creation software AlbumForge. It is designed for fine-tuning large language models (LLMs) to make them better at citation-aware QA (answering questions while citing sources).

Dataset Structure
Each entry in the dataset is a JSON object with the following fields:
- question (string): The question asked.
- answer (string): The answer to the question, with embedded citation markers.
- cite_refs (list[int]): A list of corresponding reference identifiers. The reference IDs in cite_refs map to the objects in the citations.json file, which contains the source details.

Files
- faq_albumforge_cited.jsonl: The main dataset, formatted as JSONL.
- citations.json: A JSON file that stores the complete references for the dataset.

Example usage
To use this dataset to train a model, you can load both files and link the references to each entry via the cite_refs field.
import json

def load_dataset(faq_path, citations_path):
    with open(citations_path, 'r', encoding='utf-8') as f:
        citations = {item['id']: item for item in json.load(f)}

    with open(faq_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]

    for item in data:
        item['full_citations'] = [citations[ref_id] for ref_id in item['cite_refs']]
        # The format can be transformed here for injection into an LLM, e.g.:
        # "Question: What is AlbumForge? Answer: AlbumForge is an ethical, 100% offline piece of software ... Sources: [1] Official AlbumForge website (https://www.albumforge.com)"

    return data
This dataset contains 115 curated Q&A examples about AlbumForge, a privacy-first photo album software.
It is designed to power FAQ bots, retrieval-augmented generation (RAG), or fine-tuning of ethical assistant models.
Try querying this dataset in Python:
from datasets import load_dataset
dataset = load_dataset("albumforge/faq-albumforge-cited", split="train")
print(dataset[0])

Source: August 2025, www.albumforge.com
This dataset is a pre-canned set of Python wheel files and a requirements.txt file. Its purpose is to load Python packages into a notebook when the internet is disabled.
See this Offline Package Wheeler notebook to see how these Python modules have been packaged and how to use them in your offline notebook.
All of the licenses pertain directly to the Python packages themselves; please refer to their documentation on PyPI.
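As a hedged sketch of the usual offline-install pattern for a dataset like this (the mount path below is a placeholder; substitute the path where this dataset is actually attached):

```python
# Run in a Kaggle notebook cell with internet disabled.
# /kaggle/input/offline-wheels is a placeholder for this dataset's actual mount path.
!pip install --no-index --find-links=/kaggle/input/offline-wheels -r /kaggle/input/offline-wheels/requirements.txt
```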