100+ datasets found
  1. Kaggle's Most Used Packages & Method Calls

    • kaggle.com
    zip
    Updated Jun 13, 2025
    Cite
    TheItCrow (2025). Kaggle's Most Used Packages & Method Calls [Dataset]. https://www.kaggle.com/datasets/kevinbnisch/kaggles-most-used-packages-and-method-calls
    Explore at:
zip (2405388375 bytes)
    Dataset updated
    Jun 13, 2025
    Authors
    TheItCrow
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

This dataset enriches the Meta-Kaggle dataset using Meta Kaggle Code to extract all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.

[Figures: Most Imported R Packages | Most Imported Python Packages]


    We perform this extraction using the following three regex patterns:

    PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
    PYTHON_METHOD_REGEX = *I wish I could add the regex here but kaggle kinda breaks if I do lol*
    R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
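
As a quick illustration, the Python import pattern can be applied to a source string like this (a minimal sketch; the sample code string is made up):

import re

PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')

# Exactly one capture group fires per match, depending on the import style
source = "import numpy as np\nfrom sklearn.model_selection import train_test_split"
imports = [m.group(1) or m.group(2) for m in PYTHON_IMPORT_REGEX.finditer(source)]
print(imports)  # ['numpy', 'sklearn.model_selection']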
    

    This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run on a Kaggle kernel, it is not scheduled. A notebook demonstrating how to create this dataset and what insights it provides can be found here.

  2. MeDAL Dataset

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 16, 2020
    Cite
    xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
    Explore at:
zip (7324382521 bytes)
    Dataset updated
    Nov 16, 2020
    Authors
    xhlulu
    Description


    Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

💻 Code · 🤗 Dataset (Hugging Face) · 💾 Dataset (Kaggle) · 💽 Dataset (Zenodo) · 📜 Paper (ACL) · 📝 Paper (ArXiv) · Pre-trained ELECTRA (Hugging Face)

    Downloading the data

We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:

pip install kaggle

Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:

kaggle datasets download xhlulu/medal-emnlp

Now, unzip everything and place it inside the data directory:

unzip -nq crawl-300d-2M-subword.zip -d data
mv data/pretrain_sample/* data/

    Loading FastText Embeddings

For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:

wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
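
One way to load the extracted vectors in Python (a sketch; gensim is my choice here and is not part of the original instructions):

from gensim.models import KeyedVectors

# Load the .vec text file extracted from the zip above
vectors = KeyedVectors.load_word2vec_format('data/crawl-300d-2M-subword.vec')
print(vectors['medicine'].shape)  # (300,)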

    Model Quickstart

    Using Torch Hub

You can directly load LSTM and LSTM-SA with torch.hub:

import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")

If you want to use the Electra model, you need to first install transformers:

pip install transformers

Then, you can load it with torch.hub:

import torch

electra = torch.hub.load("BruceWen120/medal", "electra")

    Using Huggingface transformers

    If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
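
A short usage sketch (the example sentence is arbitrary):

import torch

inputs = tokenizer("The patient's ECG showed no abnormality.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, sequence_length, hidden_size]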
    

    Citation

Download the bibtex here, or copy the text below:

@inproceedings{wen-etal-2020-medal,
    title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
    author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
    booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
    pages = "130--135",
}

    License, Terms and Conditions

The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

    The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

    INTRODUCTION

    Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

    MEDLINE/PUBMED SPECIFIC TERMS

    NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

    GENERAL TERMS AND CONDITIONS

    • Users of the data agree to:

      • acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
      • properly use registration and/or trademark symbols when referring to NLM products, and
      • not indicate or imply that NLM has endorsed its products/services/applications.
    • Users who republish or redistribute the data (services, products or raw data) agree to:

      • maintain the most current version of all distributed data, or
      • make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
    • These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

    • NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

    • NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

  3. original : CIFAR 100

    • kaggle.com
    zip
    Updated Dec 28, 2024
    Cite
    Shashwat Pandey (2024). original : CIFAR 100 [Dataset]. https://www.kaggle.com/datasets/shashwat90/original-cifar-100
    Explore at:
zip (168517945 bytes)
    Dataset updated
    Dec 28, 2024
    Authors
    Shashwat Pandey
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

    The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

Baseline results

You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.

Other results

Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.

Dataset layout

Python / Matlab versions

I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.

The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:

def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict

And a python3 version:

def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict

Loaded in this way, each of the batch files contains a dictionary with the following elements:

• data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
• labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.

The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

• label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

Binary version

The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>

In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.

    Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
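
Following that 3073-byte layout, a minimal numpy reader might look like this (a sketch, not part of the original distribution):

import numpy as np

def read_cifar10_bin(path):
    # Each record: 1 label byte followed by 3072 pixel bytes (R, G, B planes)
    records = np.fromfile(path, dtype=np.uint8).reshape(-1, 3073)
    labels = records[:, 0]
    images = records[:, 1:].reshape(-1, 3, 32, 32)
    return images, labels

images, labels = read_cifar10_bin('data_batch_1.bin')
print(images.shape, labels.shape)  # (10000, 3, 32, 32) (10000,)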

    There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.

The CIFAR-100 dataset

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...

  4. Ecommerce Dataset (Products & Sizes Included)

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Anvit kumar (2025). Ecommerce Dataset (Products & Sizes Included) [Dataset]. https://www.kaggle.com/datasets/anvitkumar/shopping-dataset
    Explore at:
zip (1274856 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Anvit kumar
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📦 Ecommerce Dataset (Products & Sizes Included)

🛍️ Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends

📌 Overview

This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.

    🔹 Dataset Features

• product_id: Unique identifier for the product
• title: Product name/title
• product_description: Detailed product description
• rating: Average customer rating (0-5)
• ratings_count: Number of ratings received
• initial_price: Original product price
• discount: Discount percentage (%)
• final_price: Discounted price
• currency: Currency of the price (e.g., USD, INR)
• images: URL(s) of product images
• delivery_options: Available delivery methods (e.g., standard, express)
• product_details: Additional product attributes
• breadcrumbs: Category path (e.g., Electronics > Smartphones)
• product_specifications: Technical specifications of the product
• amount_of_stars: Distribution of star ratings (1-5 stars)
• what_customers_said: Customer reviews (sentiments)
• seller_name: Name of the product seller
• sizes: Available sizes (for clothing, shoes, etc.)
• videos: Product video links (if available)
• seller_information: Seller details, such as location and rating
• variations: Different variants of the product (e.g., color, size)
• best_offer: Best available deal for the product
• more_offers: Other available deals/offers
• category: Product category

    📊 Potential Use Cases

📌 Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
🔍 Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
🎯 Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
🗣 Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
📈 Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.

🔍 Why Use This Dataset?

✅ Rich Feature Set: Includes all necessary ecommerce attributes.
✅ Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
✅ Multi-Purpose: Suitable for machine learning, web development, and data visualization.
✅ Structured Format: Easy-to-use CSV format for quick integration.

📂 Dataset Format

CSV file (ecommerce_dataset.csv), 1000+ samples, multi-category coverage.

🔗 How to Use?

Download the dataset from Kaggle, then load it in Python using Pandas:

import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv")
df.head()

Explore trends & patterns using visualization tools (Seaborn, Matplotlib), and build models & applications based on the dataset!

  5. IoT Sports Training Load Dataset

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Python Developer (2025). IoT Sports Training Load Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/iot-sports-training-load-dataset
    Explore at:
zip (226978 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Python Developer
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 2300 multimodal IoT sensor recordings collected from athletes during traditional sports training sessions, including basketball, soccer, running, and other athletic activities. The dataset includes heart rate, acceleration (X, Y, Z), gyroscope readings (X, Y, Z), speed, step count, jump height, and training load. It is designed to facilitate analysis of athlete performance, training load monitoring, and predictive modeling for sports science applications.
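
A first look in pandas might resemble the sketch below; the filename and the training_load column name are assumptions, so check the actual schema after downloading.

import pandas as pd

# Hypothetical filename; substitute the CSV shipped in the zip
df = pd.read_csv('iot_sports_training_load.csv')
print(df.describe())

# Hypothetical column name: how strongly does each sensor track training load?
print(df.corr(numeric_only=True)['training_load'].sort_values(ascending=False))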

  6. Gemma-Python Training Dataset

    • kaggle.com
    zip
    Updated Mar 17, 2024
    Cite
    David Mairs (2024). Gemma-Python Training Dataset [Dataset]. https://www.kaggle.com/datasets/dmcstllc/gemma-python-training-dataset
    Explore at:
zip (102676250 bytes)
    Dataset updated
    Mar 17, 2024
    Authors
    David Mairs
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

Instruction/response-format Python code questions and answers. The Oasst files are split into 3,000 lines each to prevent OOM when loading. The format of the files can easily be changed with a simple Python script:

import pandas as pd
import json

input_file_path = 'output_file.jsonl'
output_file_path = 'output_file2.csv'

# Prepare an empty list to hold the processed records
processed_records = []

# Open the .jsonl file and process each line
with open(input_file_path, 'r') as file:
    for line in file:
        # Parse the JSON object from the current line
        record = json.loads(line)

        # Rename, filter the desired keys, and replace newline characters
        processed_record = {
            "prompt": record.get("INSTRUCTION", "").replace('\n', ' ').strip(),
            "response": record.get("RESPONSE", "").replace('\n', ' ').strip()
        }

        # Add the processed record to the list
        processed_records.append(processed_record)

# Convert the list of processed records to a DataFrame
df = pd.DataFrame(processed_records)

# Writing the DataFrame to a .csv file
df.to_csv(output_file_path, index=False, quoting=2)  # quoting=2 quotes all fields

print(f"Conversion complete. The output is saved to '{output_file_path}'")

  7. MNIST - HDF5

    • kaggle.com
    zip
    Updated Feb 28, 2020
    Cite
    Benedict Wilkins (2020). MNIST - HDF5 [Dataset]. https://www.kaggle.com/benedictwilkinsai/mnist-hd5f
    Explore at:
zip (12272528 bytes)
    Dataset updated
    Feb 28, 2020
    Authors
    Benedict Wilkins
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The MNIST dataset in HDF5 format.

Data can be loaded with the h5py package (pip install h5py); see the demo.
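
A minimal sketch for inspecting the file with h5py (the filename and internal layout are assumptions; list the keys before relying on any structure):

import h5py

with h5py.File('mnist.h5', 'r') as f:
    # Print the top-level dataset names to discover the real layout
    print(list(f.keys()))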

  8. MELD Preprocessed

    • kaggle.com
    zip
    Updated Mar 1, 2025
    Cite
    Argish Abhangi (2025). MELD Preprocessed [Dataset]. https://www.kaggle.com/datasets/argish/meld-preprocessed
    Explore at:
zip (3527202381 bytes)
    Dataset updated
    Mar 1, 2025
    Authors
    Argish Abhangi
    Description

    The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.

    Data Sources

    • Audio: Waveforms extracted from the original video files.
    • Video: Video files are processed to sample frames at a target frame rate (default: 2 fps) and to detect faces using a Haar Cascade classifier.
    • Text: Utterances from the dialogue, which are cleaned using custom encoding functions to fix potential byte encoding issues.
    • Emotion Labels: Each sample is associated with an emotion label.

    Preprocessing Pipeline

    The preprocessing script performs several key steps:

    1. Text Cleaning:

      • fix_encoding_with_bytes(text): Decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.
      • replace_double_encoding(text): Fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).
    2. Audio Processing:

      • Extracts raw audio waveform from each sample.
      • Computes a Mel-spectrogram using torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).
  • Converts the spectrogram to a logarithmic scale for numerical stability (see the sketch after this list).
    3. Video Processing:

      • Reads video frames at a specified target FPS (default: 2 fps) using OpenCV.
      • For each video, samples frames evenly based on the original video's FPS.
      • Applies Haar Cascade face detection on the frames to extract the first detected face.
      • Resizes the detected face to 224x224 and converts it to RGB. If no face is detected, a default black image (224x224x3) is returned.
    4. Saving Processed Samples:

      • Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).
      • The filename is derived from the original video filename (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).
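
In isolation, the audio step (step 2) looks roughly like this sketch; the input filename is hypothetical, and any MelSpectrogram arguments beyond the 64 mel bins are assumptions:

import torch
import torchaudio

waveform, sample_rate = torchaudio.load('dia0_utt0.wav')  # hypothetical extracted audio
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)(waveform)
log_mel = torch.log(mel + 1e-6)  # logarithmic scale for numerical stability
print(log_mel.shape)  # [channels, 64, time]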

    Data Format

    Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:

    • utterance (str): The cleaned textual utterance.
    • emotion (str/int): The corresponding emotion label.
    • video_path (str): Original path to the video file from which the sample was extracted.
    • audio (Tensor): Raw audio waveform tensor of shape [channels, time].
    • audio_sample_rate (int): The sampling rate of the audio waveform.
    • audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].
    • face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.

    Directory Structure

The preprocessed files are organized into splits:

preprocessed_data/
├── train/
│   ├── dia0_utt0.pt
│   ├── dia1_utt1.pt
│   └── ...
├── dev/
│   ├── dia0_utt0.pt
│   ├── dia1_utt1.pt
│   └── ...
└── test/
    ├── dia0_utt0.pt
    ├── dia1_utt1.pt
    └── ...

    Loading and Using the Dataset

    A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:

    Dataset Class

    from torch.utils.data import Dataset
    import os
    import torch
    
    class PreprocessedMELDDataset(Dataset):
  def __init__(self, data_dir):
        """
        Args:
          data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]
        
  def __len__(self):
        return len(self.files)
      
  def __getitem__(self, idx):
        sample_path = self.files[idx]
        sample = torch.load(sample_path)
        return sample
    

    Custom Collate Function

    def preprocessed_collate_fn(batch):
      """
      Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
      Modify this function to pad or stack tensor data if needed.
      """
      collated = {}
      collated['utterance'] = [sample['utterance'] for sample in batch]
      collated['emotion'] = [sample['emotion'] for sample in batch]
      collated['video_path'] = [sample['video_path'] for sample in batch]
      collated['audio'] = [sample['audio'] for sample in batch]
      collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
      collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
      collated['face'] = [sample['face'] for sample in batch]
      return collated
    

    Creating DataLoaders

    from torch.utils.data import DataLoader
    
    # Define paths for each split
    train_data_dir = "preprocessed_data/train"
    dev_data_dir = "preproces...
    
  9. Job_skill_extractor_NER

    • kaggle.com
    zip
    Updated Jan 16, 2024
    Cite
    LeewanHung (2024). Job_skill_extractor_NER [Dataset]. https://www.kaggle.com/datasets/leewanhung/job-skill-extractor-ner
    Explore at:
zip (3587456 bytes)
    Dataset updated
    Jan 16, 2024
    Authors
    LeewanHung
    Description

Introduction

This model was trained using the spaCy pipeline and data from job_description.

The method is based on NER to recognize job skills. In this model, I mostly focus on technical skills with the tag "SKILL".

The training source can be found here.

How to use:

    import spacy
    from spacy.training.example import Example
    import json
    import random
    import warnings
    
    warnings.filterwarnings("ignore", category=UserWarning, module="spacy")
    warnings.filterwarnings("ignore", category=FutureWarning, module="tensorflow")
    
    path = "/kaggle/input/job_skills_extractor/scikitlearn/job_skill_extractor/1/job_skills_ner_model"
    loaded_nlp = spacy.load(path)
    
    # Test the loaded model with some example texts
    test_texts = [
      "I am skilled in Python and Java programming.",
      "My experience includes using TensorFlow for machine learning.",
      "I have hands-on experience with MongoDB and MySQL.",
      "Build machine learning",
    ]
    for text in test_texts:
      doc = loaded_nlp(text)
      print("Input Text:", text)
      print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
    

Output:

    Input Text: I am skilled in Python and Java programming.
    Entities: [('Python', "['SKILL']"), ('Java', "['SKILL']")]
    Input Text: My experience includes using TensorFlow for machine learning.
    Entities: [('TensorFlow', "['SKILL']"), ('machine learning.', "['SKILL']")]
    Input Text: I have hands-on experience with MongoDB and MySQL.
    Entities: [('MongoDB', "['SKILL']"), ('MySQL', "['SKILL']")]
    Input Text: Build machine learning
    Entities: [('machine learning', "['SKILL']")]
    
  10. ISIC 2020 - 256x256

    • kaggle.com
    zip
    Updated Aug 6, 2024
    + more versions
    Cite
    Mehran Ziadloo (2024). ISIC 2020 - 256x256 [Dataset]. https://www.kaggle.com/datasets/ziadloo/isic-2020-256x256
    Explore at:
zip (442027409 bytes)
    Dataset updated
    Aug 6, 2024
    Authors
    Mehran Ziadloo
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the ISIC Archive with the following changes:

1. A new integer column is added named "target" with values 0, 1, null. This column is populated using two other columns: "benign_malignant" and "diagnosis". If the first column explicitly confirms that the record is either "benign" or "malignant", the target is set to "0" or "1" respectively. If the "benign_malignant" column is null, then the value of the "diagnosis" column is used to determine the value for "target". The following diagnosis values are considered cancerous and, as a result, set "target" to "1":
• squamous cell carcinoma
• basal cell carcinoma
• melanoma

    If the "benign_malignant" column is null and the "diagnosis" column is "vascular lesion", the target is set to null.

DISCLAIMER: I'm not a dermatologist and I'm not affiliated with ISIC in any way. I don't know whether my approach to setting the target value is acceptable for the ISIC competition. Use at your own risk.
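
In pandas, the rule described above could be sketched as follows (column names follow the ISIC metadata; treating all remaining diagnoses as benign is my assumption):

import numpy as np
import pandas as pd

CANCEROUS = {'squamous cell carcinoma', 'basal cell carcinoma', 'melanoma'}

def derive_target(row):
    if row['benign_malignant'] == 'benign':
        return 0.0
    if row['benign_malignant'] == 'malignant':
        return 1.0
    # benign_malignant is null: fall back to the diagnosis column
    if row['diagnosis'] in CANCEROUS:
        return 1.0
    if row['diagnosis'] == 'vascular lesion':
        return np.nan
    return 0.0  # assumption: everything else is treated as benign

# metadata['target'] = metadata.apply(derive_target, axis=1)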

2. All the images are resized to 256x256 using the following Python code:
    import os
    import multiprocessing as mp
    from PIL import Image, ImageOps
    import glob
    from functools import partial
    
    
    def list_jpg_files(folder_path):
      # Ensure the folder path ends with a slash
      if not folder_path.endswith('/'):
        folder_path += '/'
    
      # Use glob to find all .jpg files in the specified folder (non-recursive)
      jpg_files = glob.glob(folder_path + '*.jpg')
    
      return jpg_files
    
    
    
    def resize_image(image_path, destination_folder):
      # Open the image file
      with Image.open(image_path) as img:
        # Get the original dimensions
        original_width, original_height = img.size
    
        # Calculate the aspect ratio
        aspect_ratio = original_width / original_height
    
        # Determine the new dimensions based on the aspect ratio
        if aspect_ratio > 1:
          # Width is larger, so we will crop the width
          new_width = int(256 * aspect_ratio)
          new_height = 256
        else:
          # Height is larger, so we will crop the height
          new_width = 256
          new_height = int(256 / aspect_ratio)
    
        # Resize the image while maintaining the aspect ratio
        img = img.resize((new_width, new_height))
    
        # Calculate the crop box to center the image
        left = (new_width - 256) / 2
        top = (new_height - 256) / 2
        right = (new_width + 256) / 2
        bottom = (new_height + 256) / 2
    
        # Crop the image if it results in shrinking
        if new_width > 256 or new_height > 256:
          img = img.crop((left, top, right, bottom))
        else:
          # Add black edges if it results in scaling up
          img = ImageOps.expand(img, border=(int(left), int(top), int(left), int(top)), fill='black')
    
        # Resize the image to the final dimensions
        img = img.resize((256, 256))
    
      img.save(os.path.join(destination_folder, os.path.basename(image_path)))
    
    
    source_folder = ""
    destination_folder = ""
    
    images = list_jpg_files(source_folder)
    
    with mp.Pool(processes=12) as pool:
      images = pool.map(partial(resize_image, destination_folder=destination_folder), images)
    print("All images resized")
    

    This code will shrink (down-sample) the image if it is larger than 256x256. But if the image is smaller than 256x256, it will add either vertical or horizontal black edges after scaling up the image. In both scenarios, it will keep the center of the input image in the center of the output image.

    The HDF5 file is created using the following code:

    import os
    import pandas as pd
    from PIL import Image
    import h5py
    import io
    import numpy as np
    
    # File paths
    base_folder = "./isic-2020-256x256"
    csv_file_path = 'train-metadata.csv'
    image_folder_path = 'train-image/image'
    hdf5_file_path = 'train-image.hdf5'
    
    # Read the CSV file
    df = pd.read_csv(os.path.join(base_folder, csv_file_path))
    
    # Open an HDF5 file
    with h5py.File(os.path.join(base_folder, hdf5_file_path), 'w') as hdf5_file:
      for index, row in df.iterrows():
        isic_id = row['isic_id']
        image_file_path = os.path.join(base_folder, image_folder_path, f'{isic_id}.jpg')
        
        if os.path.exists(image_file_path):
          # Open the image file
          with Image.open(image_file_path) as img:
            # Convert the image to a byte buffer
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format=img.format)
            img_byte_arr = img_byte_arr.getvalue()
            hdf5_file.create_dataset(isic_id, data=np.void(img_byte_arr))
        else:
          print(f"Image file for {isic_id} not found.")
    
    print("HDF5 file created successfully.")
    

    To read the hdf5 file, use the following code:

    import h5py
    from PIL import Image
    
    
    with h...
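
The reading snippet is truncated in this listing; the idea, recovering one JPEG from the np.void byte buffers written above, is roughly this sketch:

import io

import h5py
from PIL import Image

with h5py.File('train-image.hdf5', 'r') as hdf5_file:
    isic_id = list(hdf5_file.keys())[0]
    img_bytes = hdf5_file[isic_id][()].tobytes()  # np.void scalar back to raw bytes
    img = Image.open(io.BytesIO(img_bytes))
    print(isic_id, img.size)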
    
  11. codeparrot_1M

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
Tanay Mehta (2024). codeparrot_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/codeparrot-1m
    Explore at:
zip (2368083124 bytes)
    Dataset updated
    Feb 25, 2024
    Authors
    Tanay Mehta
    Description

    A subset of codeparrot/github-code dataset consisting of 1 Million tokenized Python files in Lance file format for blazing fast and memory efficient I/O.

    The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    The script used for creating the dataset can be found here.

    Instructions for using this dataset

This dataset is not meant to be used in Kaggle Kernels: Lance needs write access to the dataset directory, which Kaggle's input directory doesn't allow, and the dataset is too large to move to /kaggle/working. Hence, to use this dataset you must download it with the Kaggle API or through this page, then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.

First download and unzip the dataset from your terminal (make sure your kaggle API key is at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/codeparrot-1m
    $ mkdir codeparrot_1M.lance/
    $ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
    $ rm codeparrot-1m.zip
    

    Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.

    import lance
    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
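
To peek at actual rows rather than just the count, something like the following should work (a sketch; take returns a pyarrow Table):

# Fetch the first row and inspect the schema
table = dataset.take([0])
print(table.schema)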

    Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

  12. Huggingface RoBERTa

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Darius Singh (2023). Huggingface RoBERTa [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-roberta
    Explore at:
zip (34531447596 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-roberta/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
    

    Acknowledgements All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  13. pyspark-package

    • kaggle.com
    zip
    Updated Sep 26, 2024
    Cite
    Iulian Cozma (2024). pyspark-package [Dataset]. https://www.kaggle.com/datasets/icozma/pyspark-package
    Explore at:
zip (1586185224 bytes)
    Dataset updated
    Sep 26, 2024
    Authors
    Iulian Cozma
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

For installing pyspark when running a notebook without internet access:

    (1) Attach the pyspark-package dataset to your notebook.
    (2) Install pyspark with the following code:

    import shutil
    src_path = r"/kaggle/input/pyspark-package/pyspark-latest.tar.gz.mp4"
    dst_path = r"/kaggle/working/pyspark-latest.tar.gz"
    shutil.copy(src_path, dst_path)
    
    !pip install /kaggle/working/pyspark-latest.tar.gz
    

or, for a specific version, first check that it is available in the dataset; e.g. for 3.5.0:

    import shutil
    src_path = r"/kaggle/input/pyspark-package/pyspark-3.5.0.tar.gz.mp4"
    dst_path = r"/kaggle/working/pyspark-3.5.0.tar.gz"
    shutil.copy(src_path, dst_path)
    
    !pip install /kaggle/working/pyspark-3.5.0.tar.gz
    

(3) Then you can use it as usual:

import pyspark
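
After installation, a quick smoke test (a minimal sketch):

from pyspark.sql import SparkSession

# Start a local session and confirm the installed version
spark = SparkSession.builder.master("local[*]").appName("check").getOrCreate()
print(spark.version)
spark.stop()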

  14. MNIST Dataset

    • kaggle.com
    • opendatalab.com
    • +4more
    zip
    Updated Jan 8, 2019
    Cite
    Hojjat Khodabakhsh (2019). MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/hojjatk/mnist-dataset
    Explore at:
zip (23112702 bytes)
    Dataset updated
    Jan 8, 2019
    Authors
    Hojjat Khodabakhsh
    Description

    Context

    MNIST is a subset of a larger set available from NIST (it's copied from http://yann.lecun.com/exdb/mnist/)

    Content

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Four files are available:

    • train-images-idx3-ubyte.gz: training set images (9912422 bytes)
    • train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
    • t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
    • t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

    How to read

    See sample MNIST reader
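
If you would rather not depend on the linked reader, the IDX format is simple enough to parse directly; a self-contained sketch (big-endian header, then a raw uint8 payload):

import gzip
import struct

import numpy as np

def read_idx_images(path):
    with gzip.open(path, 'rb') as f:
        # Magic number, image count, rows, cols: four big-endian uint32s
        _, n, rows, cols = struct.unpack('>IIII', f.read(16))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

def read_idx_labels(path):
    with gzip.open(path, 'rb') as f:
        # Magic number and label count: two big-endian uint32s
        _, n = struct.unpack('>II', f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

images = read_idx_images('train-images-idx3-ubyte.gz')
labels = read_idx_labels('train-labels-idx1-ubyte.gz')
print(images.shape, labels.shape)  # (60000, 28, 28) (60000,)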

    Acknowledgements

    • Yann LeCun, Courant Institute, NYU
    • Corinna Cortes, Google Labs, New York
    • Christopher J.C. Burges, Microsoft Research, Redmond

    Inspiration

    Many methods have been tested with this training set and test set (see http://yann.lecun.com/exdb/mnist/ for more details)

  15. Pytorch Models

    • kaggle.com
    zip
    Updated May 10, 2025
    Cite
    Sufian Othman (2025). Pytorch Models [Dataset]. https://www.kaggle.com/datasets/mohdsufianbinothman/pytorch-models/data
    Explore at:
zip (21493 bytes)
    Dataset updated
    May 10, 2025
    Authors
    Sufian Othman
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ✅ Step 1: Mount to Dataset

    Search for my dataset pytorch-models and add it — this will mount it at:

    /kaggle/input/pytorch-models/

✅ Step 2: Check file paths

Once mounted, the four files will be available at:

    /kaggle/input/pytorch-models/base_models.py
    /kaggle/input/pytorch-models/ext_base_models.py
    /kaggle/input/pytorch-models/ext_hybrid_models.py
    /kaggle/input/pytorch-models/hybrid_models.py
    

✅ Step 3: Copy files to working directory

To make them importable, copy the .py files to your notebook's working directory (/kaggle/working/):

    import shutil
    
    shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
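
An alternative to copying is extending sys.path (either approach works; copying keeps /kaggle/working self-contained):

import sys

# Make the mounted dataset importable in place
sys.path.append('/kaggle/input/pytorch-models')
import base_models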
    

✅ Step 4: Import your modules

Now that they are in the working directory, you can import them like normal:

    import base_models
    import ext_base_models
    import ext_hybrid_models
    import hybrid_models
    

    Or, if you only want to import specific classes or functions:

    from base_models import YourModelClass
    from ext_base_models import AnotherModelClass
    

✅ Step 5: Use the models

You can now initialize and use the models/classes/functions defined inside each file:

    model = base_models.YourModelClass()
    output = model(input_data)
    
  16. RC Beam Capacity Optimization Dataset

    • kaggle.com
    Updated May 23, 2025
    Cite
    Python Developer (2025). RC Beam Capacity Optimization Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/rc-beam-capacity-optimization-dataset
    Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 23, 2025
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Python Developer
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

This dataset contains 3,234 records of reinforced concrete (RC) beam design parameters and their corresponding load-bearing capacities. The data is based on realistic construction standards and includes geometric, material, and reinforcement details such as beam dimensions, concrete grade, reinforcement ratios, and stirrup specifications.

  17. Iris Species

    • kaggle.com
    zip
    Updated Sep 27, 2016
    + more versions
    Cite
    UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
    Explore at:
zip (3687 bytes)
    Dataset updated
    Sep 27, 2016
    Dataset authored and provided by
    UCI Machine Learning
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

    It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

    The columns in this dataset are:

    • Id
    • SepalLengthCm
    • SepalWidthCm
    • PetalLengthCm
    • PetalWidthCm
    • Species

    Sepal Width vs. Sepal Length
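
As a quick baseline using the columns above (a sketch; assumes the CSV is named Iris.csv as distributed on Kaggle):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('Iris.csv')
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(clf.score(X_test, y_test))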

  18. Data from: Duck Hunt

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    Hugo Zanini (2025). Duck Hunt [Dataset]. https://www.kaggle.com/datasets/hugozanini1/duck-hunt
    Explore at:
zip (7379197 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    Hugo Zanini
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Duck Hunt Object Detection Dataset

    This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.

    Perfect for: - Object detection model training - Computer vision research - Retro gaming AI projects - YOLO algorithm benchmarking - Educational purposes

    🎯 Dataset Statistics

• Total Images: 1,004
• Dataset Size: 12 MB
• Image Format: PNG
• Annotation Format: YOLO (.txt)
• Classes: 4
• Train/Val Split: 711/260 (73%/27%)

    Class Distribution

• 0 (dog), 252 images: The hunting dog in various poses (jumping, laughing, sniffing, etc.)
• 1 (duck_dead), 256 images: Dead ducks (both black and red variants)
• 2 (duck_shot), 248 images: Ducks in the moment of being shot
• 3 (duck_flying), 248 images: Flying ducks in all directions (left, right, diagonal)

    📁 Dataset Structure

    yolo_dataset_augmented/
    ├── images/
    │  ├── train/      # 711 training images
    │  └── val/       # 260 validation images
    ├── labels/
    │  ├── train/      # 711 YOLO annotation files
    │  └── val/       # 260 YOLO annotation files
    ├── classes.txt     # Class names mapping
    ├── dataset.yaml     # YOLO configuration file
    └── augmented_dataset_stats.json # Detailed statistics
    

    🔧 Data Augmentation Details

    The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:

    Augmentation Techniques Applied:

    • Geometric Transformations: Rotation (±15°), horizontal/vertical flipping, scaling (0.8-1.2x), translation
    • Color Adjustments: Brightness (0.7-1.3x), contrast (0.8-1.2x), saturation (0.8-1.2x)
    • Quality Variations: Gaussian noise, slight blur for robustness
    • Advanced Techniques: Mosaic augmentation (YOLO-style 4-image combination)

    Augmentation Parameters:

    {
      'rotation_range': (-15, 15),    # Small rotations for game sprites
      'brightness_range': (0.7, 1.3),  # Brightness variations
      'contrast_range': (0.8, 1.2),   # Contrast adjustments
      'saturation_range': (0.8, 1.2),  # Color saturation
      'noise_intensity': 0.02,      # Gaussian noise
      'horizontal_flip_prob': 0.5,    # 50% chance horizontal flip
      'scaling_range': (0.8, 1.2),    # Scale variations
    }
    

    🚀 Usage Examples

    Loading with YOLOv8 (Ultralytics)

    from ultralytics import YOLO
    
    # Load and train
    model = YOLO('yolov8n.pt') # Load pretrained model
    results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
    
    # Validate
    metrics = model.val()
    
    # Predict
    results = model('path/to/test/image.png')
    

    Loading with PyTorch

    import torch
    from torch.utils.data import Dataset, DataLoader
    from PIL import Image
    import os
    
    class DuckHuntDataset(Dataset):
  def __init__(self, images_dir, labels_dir, transform=None):
        self.images_dir = images_dir
        self.labels_dir = labels_dir
        self.transform = transform
        self.images = os.listdir(images_dir)
      
  def __len__(self):
        return len(self.images)
      
  def __getitem__(self, idx):
        img_path = os.path.join(self.images_dir, self.images[idx])
        label_path = os.path.join(self.labels_dir, 
                     self.images[idx].replace('.png', '.txt'))
        
        image = Image.open(img_path)
        # Load YOLO annotations
        with open(label_path, 'r') as f:
          labels = f.readlines()
        
        if self.transform:
          image = self.transform(image)
          
        return image, labels
    
    # Usage
    dataset = DuckHuntDataset('images/train', 'labels/train')
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    YOLO Annotation Format

    Each .txt file contains one line per object: class_id center_x center_y width height

Example annotation:

0 0.492 0.403 0.212 0.315

where values are normalized (0-1) relative to image dimensions.
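
To convert those normalized values back to pixel coordinates (a sketch; the 256x240 frame size is an assumption, substitute the real image dimensions):

# Parse one annotation line and recover the pixel-space box
line = "0 0.492 0.403 0.212 0.315"
class_id, cx, cy, w, h = line.split()
cx, cy, w, h = float(cx), float(cy), float(w), float(h)

img_w, img_h = 256, 240  # assumption: NES-style frame size
left = (cx - w / 2) * img_w
top = (cy - h / 2) * img_h
print(int(class_id), left, top, w * img_w, h * img_h)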

    📊 Technical Specifications

    • Image Dimensions: Variable (original sprite sizes preserved)
    • Color Channels: RGB (3 channels)
    • Annotation Precision: Float32 (normalized coordinates)
    • File Naming: Descriptive names indicating class and augmentation type
    • Quality: High-resolution pixel art sprites

    🎮 Dataset Context

    This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:

    • The Dog: Your hunting companion who retrieves ducks and ...
  19. ALBUMFORGE DATASET FAQ

    • kaggle.com
    zip
    Updated Aug 18, 2025
    + more versions
    Cite
    ALBUMFORGE (2025). ALBUMFORGE DATASET FAQ [Dataset]. https://www.kaggle.com/datasets/albumforge/albumforge-dataset-faq
    Explore at:
zip (95100 bytes)
    Dataset updated
    Aug 18, 2025
    Authors
    ALBUMFORGE
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

This dataset is a collection of question-answer pairs about the AlbumForge photo album creation software. It is designed for fine-tuning large language models (LLMs) to make them better at citation-aware QA tasks (answering questions while citing sources).

Dataset Structure

Each entry in the dataset is a JSON object with the following fields:

• question (string): The question asked.
• answer (string): The answer to the question, with embedded citation markers.
• cite_refs (list[int]): A list of corresponding reference identifiers. The reference IDs in cite_refs map to the objects in citations.json, which holds the source details.

Files

• faq_albumforge_cited.jsonl: The main dataset, formatted as JSONL.
• citations.json: A JSON file that stores the full references for the dataset.

Example usage

To use this dataset to train a model, load both files and attach the full references to each entry via the cite_refs field:

import json

def load_dataset(faq_path, citations_path):
    with open(citations_path, 'r', encoding='utf-8') as f:
        citations = {item['id']: item for item in json.load(f)}

    with open(faq_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]

    for item in data:
        item['full_citations'] = [citations[ref_id] for ref_id in item['cite_refs']]
        # The records can be reformatted here for injection into an LLM, e.g.:
        # "Question: What is AlbumForge? Answer: AlbumForge is an ethical, 100% offline software ... Sources: [1] Official AlbumForge site (https://www.albumforge.com)"

    return data
    

Example

    dataset = load_dataset('faq_albumforge_cited.jsonl', 'citations.json')

    👋 About This Dataset

    This dataset contains 115 curated Q&A examples about AlbumForge, a privacy-first photo album software.
    It is designed to power FAQ bots, retrieval-augmented generation (RAG), or fine-tuning of ethical assistant models.

    • 🧠 Format: JSONL + Parquet
    • 📊 Split: Train (1083 rows)
    • 💡 Use case: Privacy-focused chatbot, ethical software agents
    • 🌍 Available in: 44 languages

    Try querying this dataset in Python:

    from datasets import load_dataset
    
    dataset = load_dataset("albumforge/faq-albumforge-cited", split="train")
print(dataset[0])

Source: August 2025: www.albumforge.com
    
  20. HPA21 Extra Packages

    • kaggle.com
    zip
    Updated Mar 7, 2021
    Cite
    Andrew Scholan (2021). HPA21 Extra Packages [Dataset]. https://www.kaggle.com/andrewscholan/hpa21-extra-packages
    Explore at:
zip (1265676527 bytes)
    Dataset updated
    Mar 7, 2021
    Authors
    Andrew Scholan
    Description

This dataset is a pre-canned set of Python wheel files and a requirements.txt file. Its purpose is to load Python packages into a notebook when the internet is disabled.

    See this Offline Package Wheeler to see how these python modules have been packaged up and how to use them in your offline notebook.
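
The linked notebook covers the workflow; the usual pip invocation for an offline wheel directory looks like this (the mount path is an assumption):

# --no-index keeps pip off the network; --find-links points at the bundled wheels
!pip install --no-index --find-links=/kaggle/input/hpa21-extra-packages -r /kaggle/input/hpa21-extra-packages/requirements.txt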

    LICENSES

    All of the licenses pertain directly to the python packages themselves; please refer to their documentation on PyPi.
