MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset enriches the Meta Kaggle dataset using Meta Kaggle Code to extract all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.
[Figures: Most Imported R Packages | Most Imported Python Packages]
We perform this extraction using the following three regex patterns:
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
PYTHON_METHOD_REGEX = *I wish I could add the regex here but kaggle kinda breaks if I do lol*
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
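As a minimal sketch of how these patterns can be applied (the helper below is illustrative, not the exact extraction code used to build the dataset):

```python
import re

# Same pattern as PYTHON_IMPORT_REGEX above.
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')

def extract_python_imports(source):
    # Each match yields two groups (from-import vs. plain import); keep whichever is non-empty
    # and reduce dotted modules such as "pandas.io" to the top-level package "pandas".
    imports = set()
    for from_mod, plain_mod in PYTHON_IMPORT_REGEX.findall(source):
        imports.add((from_mod or plain_mod).split('.')[0])
    return sorted(imports)

print(extract_python_imports("import numpy as np\nfrom pandas.io import json"))
# ['numpy', 'pandas']
```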
This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run on a Kaggle kernel, it is not scheduled. A notebook demonstrating how to create this dataset and what insights it provides can be found here.
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage to Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the data directory:
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
You can directly load LSTM and LSTM-SA with torch.hub:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub:
```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load them directly from the Hugging Face repository:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
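A short usage sketch once the model and tokenizer are loaded (the example sentence is arbitrary; the model returns standard Hugging Face encoder outputs):

```python
import torch

inputs = tokenizer("The patient presented with AMI and elevated troponin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape [batch, sequence_length, hidden_size]: contextual token embeddings.
print(outputs.last_hidden_state.shape)
```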
Download the bibtex here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, PyTorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.
Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
```python
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
```
And a python3 version:
```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
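As a short sketch of turning one batch into images (with the python3 loader above, the dictionary keys are bytes, e.g. b'data'):

```python
import numpy as np

batch = unpickle("data_batch_1")   # using the python3 unpickle() above
data = batch[b'data']              # shape (10000, 3072), dtype uint8
labels = batch[b'labels']          # list of 10000 ints in 0-9

# Reshape each 3072-long row into a 3x32x32 image (channels, rows, cols),
# then move channels last for plotting: (32, 32, 3).
images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
print(images.shape, labels[0])
```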
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>

In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
📦 Ecommerce Dataset (Products & Sizes Included)
🛍️ Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends

📌 Overview
This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.
🔹 Dataset Features
| Column Name | Description |
|---|---|
| product_id | Unique identifier for the product |
| title | Product name/title |
| product_description | Detailed product description |
| rating | Average customer rating (0-5) |
| ratings_count | Number of ratings received |
| initial_price | Original product price |
| discount | Discount percentage (%) |
| final_price | Discounted price |
| currency | Currency of the price (e.g., USD, INR) |
| images | URL(s) of product images |
| delivery_options | Available delivery methods (e.g., standard, express) |
| product_details | Additional product attributes |
| breadcrumbs | Category path (e.g., Electronics > Smartphones) |
| product_specifications | Technical specifications of the product |
| amount_of_stars | Distribution of star ratings (1-5 stars) |
| what_customers_said | Customer reviews (sentiments) |
| seller_name | Name of the product seller |
| sizes | Available sizes (for clothing, shoes, etc.) |
| videos | Product video links (if available) |
| seller_information | Seller details, such as location and rating |
| variations | Different variants of the product (e.g., color, size) |
| best_offer | Best available deal for the product |
| more_offers | Other available deals/offers |
| category | Product category |
📊 Potential Use Cases
📌 Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
🔍 Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
🎯 Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
🗣 Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
📈 Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.

🔍 Why Use This Dataset?
✅ Rich Feature Set: Includes all necessary ecommerce attributes.
✅ Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
✅ Multi-Purpose: Suitable for machine learning, web development, and data visualization.
✅ Structured Format: Easy-to-use CSV format for quick integration.
📂 Dataset Format
CSV file (ecommerce_dataset.csv)
1000+ samples
Multi-category coverage
🔗 How to Use?
Download the dataset from Kaggle.
Load it in Python using Pandas:
```python
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv")
df.head()
```
Explore trends & patterns using visualization tools (Seaborn, Matplotlib); a short sketch follows below.
Build models & applications based on the dataset!
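A minimal exploration sketch, assuming the column names from the feature table above (rating, final_price, category) and that final_price is stored as a numeric value:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("ecommerce_dataset.csv")

# Distribution of customer ratings (column name taken from the feature table above).
sns.histplot(df["rating"].dropna(), bins=20)
plt.title("Distribution of product ratings")
plt.show()

# Average final price per category, highest first.
print(df.groupby("category")["final_price"].mean().sort_values(ascending=False).head(10))
```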
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains 2300 multimodal IoT sensor recordings collected from athletes during traditional sports training sessions, including basketball, soccer, running, and other athletic activities. The dataset includes heart rate, acceleration (X, Y, Z), gyroscope readings (X, Y, Z), speed, step count, jump height, and training load. It is designed to facilitate analysis of athlete performance, training load monitoring, and predictive modeling for sports science applications.
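A hedged loading sketch; the file name and column names below (heart_rate, acc_x/acc_y/acc_z, activity, training_load) are placeholders and should be checked against the actual files:

```python
import pandas as pd
import numpy as np

# Hypothetical file and column names; adjust to the dataset's actual schema.
df = pd.read_csv("athlete_iot_sensors.csv")

# Resultant acceleration magnitude from the three accelerometer axes.
df["acc_magnitude"] = np.sqrt(df["acc_x"]**2 + df["acc_y"]**2 + df["acc_z"]**2)

# Simple per-activity summary of heart rate and training load.
print(df.groupby("activity")[["heart_rate", "training_load"]].describe())
```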
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Instruction/Response-format Python code questions and answers. The OASST files are split into 3,000 lines each to prevent OOM errors when loading. The format of the files can easily be changed with a simple Python script:
```python
import pandas as pd
import json

input_file_path = 'output_file.jsonl'
output_file_path = 'output_file2.csv'

processed_records = []
with open(input_file_path, 'r') as file:
    for line in file:
        # Parse the JSON object from the current line
        record = json.loads(line)
        # Rename, filter the desired keys, and replace newline characters
        processed_record = {
            "prompt": record.get("INSTRUCTION", "").replace('\n', ' ').strip(),
            "response": record.get("RESPONSE", "").replace('\n', ' ').strip()
        }
        # Add the processed record to the list
        processed_records.append(processed_record)

df = pd.DataFrame(processed_records)
df.to_csv(output_file_path, index=False, quoting=2)  # quoting=2 (QUOTE_ALL) quotes all fields

print(f"Conversion complete. The output is saved to '{output_file_path}'")
```
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The MNIST dataset in HDF5 format.
Data can be loaded with the h5py package (pip install h5py); see the demo.
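Since the internal layout of the HDF5 file is not documented here, a hedged first step is to open it with h5py and list its contents (the file name and keys below are assumptions):

```python
import h5py

# Hypothetical file name; inspect the attached dataset for the actual one.
with h5py.File("mnist.h5", "r") as f:
    f.visit(print)  # list the groups/datasets stored in the file
    # Typical layouts expose something like train/test images and labels, e.g.:
    # x_train = f["train/images"][:]
    # y_train = f["train/labels"][:]
```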
The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.
The preprocessing script performs several key steps:
Text Cleaning:
- fix_encoding_with_bytes(text): Decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.
- replace_double_encoding(text): Fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).

Audio Processing:
- torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).

Video Processing:
- A face image is extracted from each video clip (see the face key below).

Saving Processed Samples:
- Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).
- Files keep the original clip name (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).

Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:
- utterance (str): The cleaned textual utterance.
- emotion (str/int): The corresponding emotion label.
- video_path (str): Original path to the video file from which the sample was extracted.
- audio (Tensor): Raw audio waveform tensor of shape [channels, time].
- audio_sample_rate (int): The sampling rate of the audio waveform.
- audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].
- face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.

The preprocessed files are organized into splits:
preprocessed_data/
├── train/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
├── dev/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
└── test/
    ├── dia0_utt0.pt
    ├── dia1_utt1.pt
    └── ...
A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:
from torch.utils.data import Dataset
import os
import torch
class PreprocessedMELDDataset(Dataset):
    def __init__(self, data_dir):
        """
        Args:
            data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample_path = self.files[idx]
        sample = torch.load(sample_path)
        return sample
def preprocessed_collate_fn(batch):
"""
Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
Modify this function to pad or stack tensor data if needed.
"""
collated = {}
collated['utterance'] = [sample['utterance'] for sample in batch]
collated['emotion'] = [sample['emotion'] for sample in batch]
collated['video_path'] = [sample['video_path'] for sample in batch]
collated['audio'] = [sample['audio'] for sample in batch]
collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
collated['face'] = [sample['face'] for sample in batch]
return collated
from torch.utils.data import DataLoader
# Define paths for each split
train_data_dir = "preprocessed_data/train"
dev_data_dir = "preproces...
This model was trained using a spaCy pipeline and data from job_description.
The method is based on NER to recognize job skills. In this model, I mostly focus on technical skills with the tag "SKILL".
The training source can be found here.
import spacy
from spacy.training.example import Example
import json
import random
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="spacy")
warnings.filterwarnings("ignore", category=FutureWarning, module="tensorflow")
path = "/kaggle/input/job_skills_extractor/scikitlearn/job_skill_extractor/1/job_skills_ner_model"
loaded_nlp = spacy.load(path)
# Test the loaded model with some example texts
test_texts = [
"I am skilled in Python and Java programming.",
"My experience includes using TensorFlow for machine learning.",
"I have hands-on experience with MongoDB and MySQL.",
"Build machine learning",
]
for text in test_texts:
doc = loaded_nlp(text)
print("Input Text:", text)
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
Input Text: I am skilled in Python and Java programming.
Entities: [('Python', "['SKILL']"), ('Java', "['SKILL']")]
Input Text: My experience includes using TensorFlow for machine learning.
Entities: [('TensorFlow', "['SKILL']"), ('machine learning.', "['SKILL']")]
Input Text: I have hands-on experience with MongoDB and MySQL.
Entities: [('MongoDB', "['SKILL']"), ('MySQL', "['SKILL']")]
Input Text: Build machine learning
Entities: [('machine learning', "['SKILL']")]
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
This dataset is derived from the ISIC Archive with the following changes:
If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.
DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know whether my approach to setting the target value is acceptable for the ISIC competition. Use at your own risk.
import os
import multiprocessing as mp
from PIL import Image, ImageOps
import glob
from functools import partial
def list_jpg_files(folder_path):
# Ensure the folder path ends with a slash
if not folder_path.endswith('/'):
folder_path += '/'
# Use glob to find all .jpg files in the specified folder (non-recursive)
jpg_files = glob.glob(folder_path + '*.jpg')
return jpg_files
def resize_image(image_path, destination_folder):
# Open the image file
with Image.open(image_path) as img:
# Get the original dimensions
original_width, original_height = img.size
# Calculate the aspect ratio
aspect_ratio = original_width / original_height
# Determine the new dimensions based on the aspect ratio
if aspect_ratio > 1:
# Width is larger, so we will crop the width
new_width = int(256 * aspect_ratio)
new_height = 256
else:
# Height is larger, so we will crop the height
new_width = 256
new_height = int(256 / aspect_ratio)
# Resize the image while maintaining the aspect ratio
img = img.resize((new_width, new_height))
# Calculate the crop box to center the image
left = (new_width - 256) / 2
top = (new_height - 256) / 2
right = (new_width + 256) / 2
bottom = (new_height + 256) / 2
# Crop the image if it results in shrinking
if new_width > 256 or new_height > 256:
img = img.crop((left, top, right, bottom))
else:
# Add black edges if it results in scaling up
img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
# Resize the image to the final dimensions
img = img.resize((256, 256))
img.save(os.path.join(destination_folder, os.path.basename(image_path)))
source_folder = ""
destination_folder = ""
images = list_jpg_files(source_folder)
with mp.Pool(processes=12) as pool:
images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
print("All images resized")
This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.
The HDF5 file is created using the following code:
import os
import pandas as pd
from PIL import Image
import h5py
import io
import numpy as np
# File paths
base_folder = "./isic-2020-256x256"
csv_file_path = 'train-metadata.csv'
image_folder_path = 'train-image/image'
hdf5_file_path = 'train-image.hdf5'
# Read the CSV file
df = pd.read_csv(os.path.join(base_folder, csv_file_path))
# Open an HDF5 file
with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
for index, row in df.iterrows():
isic_id = row['isic_id']
image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
if os.path.exists(image_file_path):
# Open the image file
with Image.open(image_file_path) as img:
# Convert the image to a byte buffer
img_byte_arr = io.BytesIO()
img.save(img_byte_arr, format=img.format)
img_byte_arr = img_byte_arr.getvalue()
hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
else:
print(f"Image file for {isic_id} not found.")
print("HDF5 file created successfully.")
To read the hdf5 file, use the following code:
import h5py
from PIL import Image
with h...
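The snippet above is truncated; a minimal sketch of the reading side, mirroring how the images are written as raw bytes in the code above, could look like this:

```python
import io
import h5py
from PIL import Image

with h5py.File("train-image.hdf5", "r") as hdf5_file:
    isic_id = next(iter(hdf5_file.keys()))        # any isic_id stored in the file
    img_bytes = hdf5_file[isic_id][()].tobytes()  # np.void scalar back to raw JPEG bytes
    img = Image.open(io.BytesIO(img_bytes))
    print(isic_id, img.size)
```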
A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in Lance file format for blazing-fast and memory-efficient I/O.
The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
The script used for creating the dataset can be found here.
This dataset is not meant to be used in Kaggle Kernels: Lance requires the dataset's input directory to be writable, Kaggle's input directory is read-only, and the dataset size prohibits moving it to /kaggle/working. Hence, to use this dataset, download it via the Kaggle API or through this page, then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.
First, download and unzip the dataset from your terminal (make sure you have your Kaggle API key at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/codeparrot-1m
$ mkdir codeparrot_1M.lance/
$ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
$ rm codeparrot-1m.zip
Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.
import lance
dataset = lance.dataset('codeparrot_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
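Beyond counting rows, here is a hedged sketch for peeking at the data; the column layout is not documented here, so the code inspects the schema and assumes each row holds one token id (consistent with count_rows() returning the token count):

```python
import lance
from transformers import AutoTokenizer

dataset = lance.dataset("codeparrot_1M.lance/")
print(dataset.schema)  # inspect the actual column name(s) and types first

# Assumption: the first (only) column stores one token id per row; adjust if the schema differs.
first_rows = dataset.take(list(range(256)))
token_ids = first_rows.column(0).to_pylist()

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(tokenizer.decode(token_ids))
```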
Considerations for Using the Data: The dataset consists of source code from a wide range of repositories. As such, it can potentially include harmful or biased code, as well as sensitive information like passwords or usernames.
This dataset contains different variants of the RoBERTa and XLM-RoBERTa models by Meta AI, available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining
MODEL_DIR = "/kaggle/input/huggingface-roberta/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
To install pyspark when running a notebook without internet access:
(1) Attach the pyspark-package dataset to your notebook.
(2) Install pyspark with the following code:
import shutil
src_path = r"/kaggle/input/pyspark-package/pyspark-latest.tar.gz.mp4"
dst_path = r"/kaggle/working/pyspark-latest.tar.gz"
shutil.copy(src_path, dst_path)
!pip install /kaggle/working/pyspark-latest.tar.gz
Or, for a specific version, check whether that version is available in the dataset; then you can use, e.g. for 3.5.0:
import shutil
src_path = r"/kaggle/input/pyspark-package/pyspark-3.5.0.tar.gz.mp4"
dst_path = r"/kaggle/working/pyspark-3.5.0.tar.gz"
shutil.copy(src_path, dst_path)
!pip install /kaggle/working/pyspark-3.5.0.tar.gz
(3) Then you can use:
```python
import pyspark
```
MNIST is a subset of a larger set available from NIST (it is copied from http://yann.lecun.com/exdb/mnist/).
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Four files are available:
Many methods have been tested with this training set and test set (see http://yann.lecun.com/exdb/mnist/ for more details)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
✅ Step 1: Mount to Dataset
Search for my dataset pytorch-models and add it — this will mount it at:
/kaggle/input/pytorch-models/
✅ Step 2: Check file paths Once mounted, the four files will be available at:
/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py
✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):
import shutil
shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:
import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models
Or, if you only want to import specific classes or functions:
from base_models import YourModelClass
from ext_base_models import AnotherModelClass
✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:
model = base_models.YourModelClass()
output = model(input_data)
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains 3,234 records of reinforced concrete (RC) beam design parameters and their corresponding load-bearing capacities. The data is based on realistic construction standards and includes geometric, material, and reinforcement details such as beam dimensions, concrete grade, reinforcement ratios, and stirrup specifications.
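A hedged sketch of the predictive-modeling use case; the file name and column names below are placeholders, not the dataset's documented schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("rc_beam_dataset.csv")  # hypothetical file name

# Placeholder feature/target names; replace with the dataset's actual columns.
features = ["beam_width", "beam_depth", "concrete_grade", "reinforcement_ratio"]
target = "load_capacity"

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out beams:", r2_score(y_test, model.predict(X_test)))
```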
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset describe the sepal length, sepal width, petal length, petal width, and species of each flower.
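To see the separability claim in code, here is a minimal sketch using scikit-learn's built-in copy of the same Fisher Iris data (so no assumptions are needed about this CSV's exact file or column names):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Setosa vs. the rest is linearly separable; cross-validated accuracy is essentially 1.0.
print(cross_val_score(clf, X, (y == 0).astype(int), cv=5).mean())

# Versicolor vs. virginica overlap, so accuracy stays below 1.0.
print(cross_val_score(clf, X[y != 0], y[y != 0], cv=5).mean())
```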
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.
Perfect for: - Object detection model training - Computer vision research - Retro gaming AI projects - YOLO algorithm benchmarking - Educational purposes
| Metric | Value |
|---|---|
| Total Images | 1,004 |
| Dataset Size | 12 MB |
| Image Format | PNG |
| Annotation Format | YOLO (.txt) |
| Classes | 4 |
| Train/Val Split | 711/260 (73%/27%) |
| Class ID | Class Name | Count | Description |
|---|---|---|---|
| 0 | dog | 252 | The hunting dog in various poses (jumping, laughing, sniffing, etc.) |
| 1 | duck_dead | 256 | Dead ducks (both black and red variants) |
| 2 | duck_shot | 248 | Ducks in the moment of being shot |
| 3 | duck_flying | 248 | Flying ducks in all directions (left, right, diagonal) |
yolo_dataset_augmented/
├── images/
│ ├── train/ # 711 training images
│ └── val/ # 260 validation images
├── labels/
│ ├── train/ # 711 YOLO annotation files
│ └── val/ # 260 YOLO annotation files
├── classes.txt # Class names mapping
├── dataset.yaml # YOLO configuration file
└── augmented_dataset_stats.json # Detailed statistics
The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:
{
'rotation_range': (-15, 15), # Small rotations for game sprites
'brightness_range': (0.7, 1.3), # Brightness variations
'contrast_range': (0.8, 1.2), # Contrast adjustments
'saturation_range': (0.8, 1.2), # Color saturation
'noise_intensity': 0.02, # Gaussian noise
'horizontal_flip_prob': 0.5, # 50% chance horizontal flip
'scaling_range': (0.8, 1.2), # Scale variations
}
from ultralytics import YOLO
# Load and train
model = YOLO('yolov8n.pt') # Load pretrained model
results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
# Validate
metrics = model.val()
# Predict
results = model('path/to/test/image.png')
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
class DuckHuntDataset(Dataset):
    def __init__(self, images_dir, labels_dir, transform=None):
        self.images_dir = images_dir
        self.labels_dir = labels_dir
        self.transform = transform
        self.images = os.listdir(images_dir)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.images_dir, self.images[idx])
        label_path = os.path.join(self.labels_dir,
                                  self.images[idx].replace('.png', '.txt'))
        image = Image.open(img_path)

        # Load YOLO annotations
        with open(label_path, 'r') as f:
            labels = f.readlines()

        if self.transform:
            image = self.transform(image)

        return image, labels
# Usage
dataset = DuckHuntDataset('images/train', 'labels/train')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
Each .txt file contains one line per object:
class_id center_x center_y width height
Example annotation:
0 0.492 0.403 0.212 0.315
Where values are normalized (0-1) relative to image dimensions.
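A small sketch of converting one of these normalized annotation lines back into pixel coordinates (the 256x240 image size is only an example; check each image's actual dimensions):

```python
def yolo_to_pixels(line, img_width, img_height):
    """Convert one 'class cx cy w h' annotation line (normalized 0-1) to a pixel box."""
    class_id, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_width, float(cy) * img_height
    w, h = float(w) * img_width, float(h) * img_height
    x_min, y_min = cx - w / 2, cy - h / 2
    return int(class_id), (round(x_min), round(y_min), round(x_min + w), round(y_min + h))

print(yolo_to_pixels("0 0.492 0.403 0.212 0.315", 256, 240))
# -> class 0 ("dog") with an (x_min, y_min, x_max, y_max) pixel box
```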
This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is a collection of question-answer pairs about the photo album creation software AlbumForge. It is designed for fine-tuning large language models (LLMs) to make them better at citation-aware QA (answering questions while citing sources).

Dataset Structure
Each entry in the dataset is a JSON object with the following fields:
- question (string): The question asked.
- answer (string): The answer to the question, with embedded citation markers.
- cite_refs (list[int]): A list of corresponding reference identifiers. The reference IDs in cite_refs map to the objects in the citations.json file, which contains the source details.

Files
- faq_albumforge_cited.jsonl: The main dataset, formatted as JSONL.
- citations.json: A JSON file that stores the complete references for the dataset.

Example usage
To use this dataset to train a model, you can load both files and link the references to each entry via the cite_refs field.
import json

def load_dataset(faq_path, citations_path):
    with open(citations_path, 'r', encoding='utf-8') as f:
        citations = {item['id']: item for item in json.load(f)}

    with open(faq_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]

    for item in data:
        item['full_citations'] = [citations[ref_id] for ref_id in item['cite_refs']]
        # The format can be transformed here for injection into an LLM, e.g.:
        # "Question: What is AlbumForge? Answer: AlbumForge is an ethical, 100% offline piece of software ... Sources: [1] Official AlbumForge website (https://www.albumforge.com)"

    return data
This dataset contains 115 curated Q&A examples about AlbumForge, a privacy-first photo album software.
It is designed to power FAQ bots, retrieval-augmented generation (RAG), or fine-tuning of ethical assistant models.
Try querying this dataset in Python:
from datasets import load_dataset
dataset = load_dataset("albumforge/faq-albumforge-cited", split="train")
print(dataset[0])

Source: August 2025, www.albumforge.com
This dataset is a pre-canned set of Python wheel files and a requirements.txt file. Its purpose is to load Python packages into a notebook when the internet is disabled.
See this Offline Package Wheeler notebook to see how these Python modules have been packaged and how to use them in your offline notebook.
All of the licenses pertain directly to the Python packages themselves; please refer to their documentation on PyPI.
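As a hedged sketch of the usual offline-install pattern for a dataset like this (the mount path below is a placeholder; substitute the path where this dataset is actually attached):

```python
# Run in a Kaggle notebook cell with internet disabled.
# /kaggle/input/offline-wheels is a placeholder for this dataset's actual mount path.
!pip install --no-index --find-links=/kaggle/input/offline-wheels -r /kaggle/input/offline-wheels/requirements.txt
```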