100+ datasets found
  1. pyVips: python & deb šŸ“¦package

    • kaggle.com
    Updated Oct 23, 2023
    Cite
    Jirka Borovec (2023). pyVips: python & deb šŸ“¦package [Dataset]. https://www.kaggle.com/datasets/jirkaborovec/pyvips-python-and-deb-package
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors
    Jirka Borovec
    Description

    Downloaded both Python and Debian packages for offline use. The creation and usage are described in https://www.kaggle.com/code/jirkaborovec/pip-pkg-pyvips-download-4-offline

    How to use:

    1. Click "**Add Data**" on your own notebook
    2. Search for dataset pyVips: python & deb package
    3. Run the installation lines below:
    !ls /kaggle/input/pyvips-python-and-deb-package
    # install the deb packages
    !dpkg -i --force-depends /kaggle/input/pyvips-python-and-deb-package/linux_packages/archives/*.deb
    # install the python wrapper
    !pip install pyvips -f /kaggle/input/pyvips-python-and-deb-package/python_packages/ --no-index
    !pip list | grep pyvips
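
    After installation, a quick import check (a minimal sketch; the commented lines show typical pyvips usage with a hypothetical image path):

    import pyvips

    print(pyvips.__version__)
    # image = pyvips.Image.new_from_file("/kaggle/input/some-image-dataset/sample.jpg")  # hypothetical path
    # print(image.width, image.height)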
    
  2. TabPFN (0.1.9) whl

    • kaggle.com
    zip
    Updated Jan 9, 2025
    Cite
    Carl McBride Ellis (2025). TabPFN (0.1.9) whl [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/tabpfn-019-whl
    Available download formats: zip (232721099 bytes)
    Authors
    Carl McBride Ellis
    Description

    This is the whl file for version 0.1.9 of TabPFN.

    1. Add the following dataset to your notebook: TabPFN (0.1.9) whl, using the "+ Add Data" button located on the right side of your notebook
    2. then simply install via:
    !pip install /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl
    

    followed by:

    !mkdir /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
    !cp /kaggle/input/tabpfn-019-whl/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/
    

    This dataset includes the files:

    • prior_diff_real_checkpoint_n_0_epoch_42.cpkt from https://github.com/automl/TabPFN/tree/main/tabpfn/models_diff
    • prior_diff_real_checkpoint_n_0_epoch_100.cpkt, which seems to be the model file required.

    Here is a use case demonstration notebook: "TabPFN test with notebook in "Internet off" mode"
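
    Once installed, a minimal usage sketch on scikit-learn's iris data (assuming the 0.1.x API, which exposes TabPFNClassifier with a scikit-learn-style fit/predict interface):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    # Small demo problem; TabPFN targets small tabular classification tasks
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    classifier = TabPFNClassifier(device="cpu")
    classifier.fit(X_train, y_train)
    print(classifier.predict(X_test)[:5])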

  3. Pytorch Models

    • kaggle.com
    zip
    Updated May 10, 2025
    Cite
    Sufian Othman (2025). Pytorch Models [Dataset]. https://www.kaggle.com/datasets/mohdsufianbinothman/pytorch-models/data
    Available download formats: zip (21493 bytes)
    Authors
    Sufian Othman
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    āœ… Step 1: Mount the dataset

    Search for my dataset pytorch-models and add it — this will mount it at:

    /kaggle/input/pytorch-models/

    āœ… Step 2: Check file paths

    Once mounted, the four files will be available at:

    /kaggle/input/pytorch-models/base_models.py
    /kaggle/input/pytorch-models/ext_base_models.py
    /kaggle/input/pytorch-models/ext_hybrid_models.py
    /kaggle/input/pytorch-models/hybrid_models.py
    

    āœ… Step 3: Copy files to working directory

    To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):

    import shutil
    
    shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
    

    āœ… Step 4: Import your modules

    Now that they are in the working directory, you can import them like normal:

    import base_models
    import ext_base_models
    import ext_hybrid_models
    import hybrid_models
    

    Or, if you only want to import specific classes or functions:

    from base_models import YourModelClass
    from ext_base_models import AnotherModelClass
    

    āœ… Step 5: Use the models

    You can now initialize and use the models/classes/functions defined inside each file:

    model = base_models.YourModelClass()
    output = model(input_data)
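
    As an alternative to copying the files, a sketch that adds the mounted dataset directory to Python's module search path (same paths as above):

    import sys

    # Make the mounted .py files importable without copying them
    sys.path.append('/kaggle/input/pytorch-models')

    import base_models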
    
  4. Job_skill_extractor_NER

    • kaggle.com
    zip
    Updated Jan 16, 2024
    Cite
    LeewanHung (2024). Job_skill_extractor_NER [Dataset]. https://www.kaggle.com/datasets/leewanhung/job-skill-extractor-ner
    Available download formats: zip (3587456 bytes)
    Authors
    LeewanHung
    Description

    Introduction

    This model was trained using a spaCy pipeline and data from job_description.

    The method is based on NER to recognize job skills. In this model, I mostly focus on technical skills with the tag "SKILL".

    The training source can be found here.

    How to use:

    import spacy
    from spacy.training.example import Example
    import json
    import random
    import warnings
    
    warnings.filterwarnings("ignore", category=UserWarning, module="spacy")
    warnings.filterwarnings("ignore", category=FutureWarning, module="tensorflow")
    
    path = "/kaggle/input/job_skills_extractor/scikitlearn/job_skill_extractor/1/job_skills_ner_model"
    loaded_nlp = spacy.load(path)
    
    # Test the loaded model with some example texts
    test_texts = [
      "I am skilled in Python and Java programming.",
      "My experience includes using TensorFlow for machine learning.",
      "I have hands-on experience with MongoDB and MySQL.",
      "Build machine learning",
    ]
    for text in test_texts:
      doc = loaded_nlp(text)
      print("Input Text:", text)
      print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
    

    output

    Input Text: I am skilled in Python and Java programming.
    Entities: [('Python', "['SKILL']"), ('Java', "['SKILL']")]
    Input Text: My experience includes using TensorFlow for machine learning.
    Entities: [('TensorFlow', "['SKILL']"), ('machine learning.', "['SKILL']")]
    Input Text: I have hands-on experience with MongoDB and MySQL.
    Entities: [('MongoDB', "['SKILL']"), ('MySQL', "['SKILL']")]
    Input Text: Build machine learning
    Entities: [('machine learning', "['SKILL']")]
    
  5. rouge-score

    • kaggle.com
    zip
    Updated Sep 3, 2023
    Cite
    bytestorm (2023). rouge-score [Dataset]. https://www.kaggle.com/datasets/bytestorm/rouge-score/code
    Available download formats: zip (30793 bytes)
    Authors
    bytestorm
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Steps for installation

    1. Add the dataset to your notebook.
    2. Then run the following two bash commands from a notebook cell:
    !cp -r /kaggle/input/rouge-score/rouge_score-0.1.2 /kaggle/working/
    !pip install /kaggle/working/rouge_score-0.1.2/

    Usage in python

    from rouge_score import rouge_scorer
    
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score('The quick brown fox jumps over the lazy dog',
               'The quick brown dog jumps on the log.')
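
    Each entry in scores is a Score tuple with precision, recall, and fmeasure fields, so individual values can be read like this:

    print(scores['rouge1'].fmeasure)
    print(scores['rougeL'].precision, scores['rougeL'].recall)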
    
  6. Vezora/Tested-188k-Python-Alpaca: Functional

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth/discussion
    Available download formats: zip (12200606 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  7. pytorch_tabularšŸ”„: python šŸ“¦package

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    Jirka Borovec (2023). pytorch_tabularšŸ”„: python šŸ“¦package [Dataset]. https://www.kaggle.com/datasets/jirkaborovec/pytorch-tabular-python-package
    Available download formats: zip (4699016393 bytes)
    Authors
    Jirka Borovec
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    How to install this package | step by step:

    1. Click "**Add Data**" on your own notebook
    2. Search for "**pytorch_tabularšŸ”„: python šŸ“¦package**" and add this dataset as a data source
    3. Run the installation line below:
    !pip install pytorch_tabular -f /kaggle/input/pytorch-tabular-python-package/ --no-index
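
    A quick check that the offline install worked (a minimal sketch):

    import pytorch_tabular
    from importlib.metadata import version

    print(version("pytorch_tabular"))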
    
  8. AI4Code Train Dataframe

    • kaggle.com
    zip
    Updated May 12, 2022
    Cite
    Darien Schettler (2022). AI4Code Train Dataframe [Dataset]. https://www.kaggle.com/datasets/dschettler8845/ai4code-train-dataframe
    Available download formats: zip (622120487 bytes)
    Authors
    Darien Schettler
    Description

    [EDIT/UPDATE]

    There are a few important updates.

    1. When SAVING the pd.DataFrame as a .csv, the following command should be used to avoid improper interpretation of newline character(s).
    import csv
    import pandas as pd

    train_df.to_csv(
      "train.csv", index=False, 
      encoding='utf-8', 
      quoting=csv.QUOTE_NONNUMERIC  # <== THIS IS REQUIRED
    )
    
    2. When LOADING the .csv as a pd.DataFrame, the following command must be used to avoid misinterpretation of NaN-like strings (null, nan, ...) as NaN values.
    train_df = pd.read_csv(
      "/kaggle/input/ai4code-train-dataframe/train.csv", 
      keep_default_na=False  # <== THIS IS REQUIRED
    )
    
  9. MSU Prac Denoising THICK Dataset

    • kaggle.com
    zip
    Updated Apr 30, 2024
    Cite
    Nikita Breskanu (2024). MSU Prac Denoising THICK Dataset [Dataset]. https://www.kaggle.com/datasets/nikitabreskanu/msu-prac-denoising-thick-dataset
    Available download formats: zip (26572934991 bytes)
    Authors
    Nikita Breskanu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a dataset for MSU practicum homework on denoising audio. The homework can be found here: https://github.com/mmp-practicum-team/mmp_practicum_spring_2024/blob/main/Tasks/Task%2004/task_04.ipynb

    After adding this dataset to your kaggle kernel, change the templates cell to the following:

    noise_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/musan/musan/noise/*/*.wav'
    audio_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/train-clean-100/LibriSpeech/train-clean-100/*/*/*.flac'
    ru_audio_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/ruls_data/dev/audio/*/*/*.wav'
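
    The templates are plain glob patterns, so the file lists can be collected directly (a minimal sketch):

    import glob

    noise_files = sorted(glob.glob(noise_files_template))
    audio_files = sorted(glob.glob(audio_files_template))
    ru_audio_files = sorted(glob.glob(ru_audio_files_template))
    print(len(noise_files), len(audio_files), len(ru_audio_files))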
    

    The purpose of this dataset is to save students time on uploading 30 GB of audio data to kaggle.

  10. CIFAR-10 Python

    • kaggle.com
    zip
    Updated Jan 27, 2018
    Cite
    Kris (2018). CIFAR-10 Python [Dataset]. https://www.kaggle.com/datasets/pankrzysiu/cifar10-python/code
    Available download formats: zip (340613496 bytes)
    Authors
    Kris
    Description

    Context

    CIFAR-10 is an excellent dataset for many image-processing experiments.

    Content

    Usage instructions

    in Keras

    from os import listdir, makedirs
    from os.path import join, exists, expanduser
    
    cache_dir = expanduser(join('~', '.keras'))
    if not exists(cache_dir):
      makedirs(cache_dir)
    datasets_dir = join(cache_dir, 'datasets') # /cifar-10-batches-py
    if not exists(datasets_dir):
      makedirs(datasets_dir)
    
    # If you have multiple input datasets, change the below cp command accordingly, typically:
    # !cp ../input/cifar10-python/cifar-10-python.tar.gz ~/.keras/datasets/
    !cp ../input/cifar-10-python.tar.gz ~/.keras/datasets/
    !ln -s ~/.keras/datasets/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
    !tar xzvf ~/.keras/datasets/cifar-10-python.tar.gz -C ~/.keras/datasets/
    

    general Python 3

    def unpickle(file):
      # Each CIFAR-10 batch is a pickled dict with byte-string keys (b'data', b'labels', ...)
      import pickle
      with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
      return batch
    
    !tar xzvf ../input/cifar-10-python.tar.gz
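
    A quick look at one extracted batch (a sketch; the pickled batches use byte-string keys):

    batch = unpickle('cifar-10-batches-py/data_batch_1')
    print(batch[b'data'].shape)    # (10000, 3072): 10000 images, 32x32x3 flattened
    print(batch[b'labels'][:10])   # integer class labels in 0-9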
    

    then see section "Dataset layout" in https://www.cs.toronto.edu/~kriz/cifar.html for details

    Acknowledgements

    Downloaded directly from here:

    https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

    See description: https://www.cs.toronto.edu/~kriz/cifar.html

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  11. pip: sktime 0.19.1

    • kaggle.com
    zip
    Updated Jun 15, 2023
    Cite
    Anthony Panozzo (2023). pip: sktime 0.19.1 [Dataset]. https://www.kaggle.com/datasets/panozzaj/pip-sktime-0-19-1
    Available download formats: zip (92945770 bytes)
    Authors
    Anthony Panozzo
    Description

    This dataset contains the dependencies for the sktime package, version 0.19.1. You can use this to install sktime on Kaggle without needing to download the dependencies. This can be useful if you are working on a competition that prohibits internet access in submission notebooks.

    To use, add this dataset to your notebook and then install the dependencies by executing a cell with the following code:

    
    deps_path = '/kaggle/input/pip-sktime-0-19-1'
    ! pip install --no-index --find-links {deps_path}/deps --requirement {deps_path}/requirements.txt
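
    A quick check that the offline install succeeded (a minimal sketch):

    import sktime

    print(sktime.__version__)  # expected: 0.19.1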
    

    License: whatever the underlying dependencies' licenses are. I claim no ownership of or responsibility for the dependencies.

    Feedback? Additional packages you'd like? Run this Python code to find my email address:

    import base64; print(base64.b64decode('cGFub3p6YWpAZ21haWwuY29t'.encode()).decode())

    If you end up using this package, an upvote or note would be helpful as it would let me know that it's useful to upload these kinds of datasets. Thanks!

  12. Huggingface RoBERTa

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Darius Singh (2023). Huggingface RoBERTa [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-roberta
    Available download formats: zip (34531447596 bytes)
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-roberta/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
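
    A short follow-on sketch of running the loaded checkpoint on a sentence (a sketch; for RoBERTa the pretraining head is a masked-language-model head, so the output carries token logits):

    import torch

    inputs = tokenizer("Kaggle notebooks are great.", return_tensors="pt")
    with torch.no_grad():
      outputs = model(**inputs)
    print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)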
    

    Acknowledgements

    All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  13. kaggle-notebooks-edu-v0

    • huggingface.co
    Updated May 31, 2025
    Cite
    Jupyter Agent (2025). kaggle-notebooks-edu-v0 [Dataset]. https://huggingface.co/datasets/jupyter-agent/kaggle-notebooks-edu-v0
    Dataset provided by: Project Jupyter (https://jupyter.org/)
    Authors
    Jupyter Agent
    Description

    Kaggle Notebooks LLM Filtered

    Model: meta-llama/Meta-Llama-3.1-70B-Instruct
    Sample: 12,400
    Source dataset: data-agents/kaggle-notebooks
    Prompt:

    Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.

    The notebooks are formatted with the following tokens:

    START

    Here comes markdown content

    Here comes python code

    Here comes code output

    More… See the full description on the dataset page: https://huggingface.co/datasets/jupyter-agent/kaggle-notebooks-edu-v0.

  14. codeparrot_1M

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    Tanay Mehta (2024). codeparrot_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/codeparrot-1m
    Available download formats: zip (2368083124 bytes)
    Authors
    Tanay Mehta
    Description

    A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in the Lance file format for blazing-fast and memory-efficient I/O.

    The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    The script used for creating the dataset can be found here.

    Instructions for using this dataset

    This dataset is not supposed to be used in Kaggle Kernels: Lance requires the dataset's input directory to have write access, Kaggle Kernels' input directory does not, and the dataset size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it via the Kaggle API or through this page and then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/codeparrot-1m
    $ mkdir codeparrot_1M.lance/
    $ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
    $ rm codeparrot-1m.zip
    

    Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.

    import lance
    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
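
    To peek at the actual contents, a hedged sketch (assuming the pylance API, where take() returns a pyarrow Table of the requested rows):

    import lance

    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.schema)        # column names and types
    first_rows = dataset.take([0, 1])
    print(first_rows.num_rows)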

    Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

  15. bitsandbytes

    • kaggle.com
    zip
    Updated Oct 2, 2025
    Cite
    Yudai Hayashi (2025). bitsandbytes [Dataset]. https://www.kaggle.com/datasets/yuhaya9/bitsandbytes/data
    Available download formats: zip (3982501081 bytes)
    Authors
    Yudai Hayashi
    Description

    How to use

    In your notebook, execute the following command.

    !pip install --no-deps --no-index --find-links=/kaggle/input/bitsandbytes bitsandbytes
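
    A quick import check after installing (a minimal sketch; most bitsandbytes features need a CUDA GPU at runtime):

    import bitsandbytes as bnb

    print(bnb.__version__)  # expected: 0.48.0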
    

    Version

    bitsandbytes 0.48.0

    How to make this dataset

    !pip download bitsandbytes
    
  16. r/cosplay hot top images with titles

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    dinhanhx (2023). r/cosplay hot top images with titles [Dataset]. https://www.kaggle.com/datasets/inhanhv/rcosplay-hot-top-images-with-titles
    Available download formats: zip (1251562500 bytes)
    Authors
    dinhanhx
    License

    https://www.reddit.com/wiki/api

    Description

    Please visit dinhanhx/rct

    Sauce for the thumbnail

    r/cosplay title crawler

    Available on Kaggle

    Please take time to read all of this readme before using the dataset. Yes, I'm serious!

    Setup

    pip install -e .
    

    Go to this PRAW doc page, follow the instructions to get your client id, client secret, and user agent.

    Then store them in confidential/reddit.json like this (don't actually write "spooky"):

    {
      "id": "spooky",
      "secret": "spooky",
      "user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
    }

    Run

    Download all posts in top and hot

    (the number in each category is limited by Reddit)
    • Output file: data/cosplay.jsonl
    • 2161 posts (on 01/03/2023)

    python rct/crawl.py

    Clean text

    Removes text in post titles enclosed by square brackets, such as [self], [found], ...
    • Input file: data/cosplay.jsonl
    • Output file: data/clean_cosplay.jsonl

    python rct/clean.py

    Download images

    • Input file: data/clean_cosplay.jsonl
    • Output files: data/map_cosplay.jsonl, data/bad_response.jsonl
    • 2160 downloaded images, 1 bad/deleted/deprecated image (on 02/03/2023)

    python rct/download.py

    ⚠ The image_id and image_path attributes' values are NOT linearly continuous. For example,

    in data/bad_response.jsonl:

    {"image_id": "001912", "image_path": "data/image/001912.jpg"}

    and in data/map_cosplay.jsonl:

    # ... omit other json objects ...
    {"image_id": "001911", "image_path": "data/image/001911.jpg"}
    {"image_id": "001913", "image_path": "data/image/001913.jpg"}
    # ... omit other json objects ...

    ⚠ `image_path` attribute's values are `data/image/*.jpg`. They are relative to the folder `data` containing all `.jsonl` files and the `image` folder. The folder `data` is produced by the Python scripts.

    ⚠ `image_path` attribute's values MISMATCH the name of the folder containing all `.jsonl` files and the `image` folder on Kaggle. When you load the data from the Kaggle dataset, the `data` prefix in `data/image/000000.jpg` should be replaced with the Kaggle path (see [this notebook](https://www.kaggle.com/code/inhanhv/rct-demo)). It becomes `/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg`.
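
    A small path-fixup sketch for reading the mapping file on Kaggle (assuming map_cosplay.jsonl sits at the dataset root; adjust if the layout differs):

    import json

    KAGGLE_ROOT = "/kaggle/input/rcosplay-hot-top-images-with-titles"

    with open(f"{KAGGLE_ROOT}/map_cosplay.jsonl") as f:
      records = [json.loads(line) for line in f]

    # Rewrite the local 'data/' prefix to the Kaggle mount point
    for rec in records:
      rec["image_path"] = rec["image_path"].replace("data/", f"{KAGGLE_ROOT}/", 1)

    print(records[0]["image_path"])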
    
  17. working with pipeline

    • kaggle.com
    Updated Sep 2, 2025
    Cite
    Fiza Aslam1 (2025). working with pipeline [Dataset]. https://www.kaggle.com/datasets/fizaaslam12/working-with-pipeline
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors
    Fiza Aslam1
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    šŸš€ Feature Engineering with Scikit-Learn (Titanic Case Study)

    This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
    It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.

    šŸ“Œ About

    Feature Engineering is a crucial step in Machine Learning.
    In this project, I show:
    • Handling missing values with SimpleImputer
    • Encoding categorical variables with OneHotEncoder
    • Building models manually vs using Pipeline
    • Saving models and pipelines with pickle
    • Making predictions with and without pipelines

    šŸ“‚ Content

    • train.csv → Titanic dataset
    • withpipeline.ipynb → End-to-end pipeline workflow
    • withoutpipeline.ipynb → Manual preprocessing workflow
    • predictusingpipeline.ipynb → Predictions with saved pipeline (pipe.pkl)
    • predictwithoutpipeline.ipynb → Predictions with classifier + encoders
    • models/
      • pipe.pkl → Complete ML pipeline (recommended for predictions)
      • clf.pkl → Classifier without pipeline
      • ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features

    ⚔ Usage

    1ļøāƒ£ Load and Use Pipeline

    import pickle
    
    pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
    sample = [[22, 1, 0, 7.25, 'male', 'S']]
    print(pipe.predict(sample))
    2ļøāƒ£ Predict without Pipeline
    import pickle
    
    clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
    ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
    ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
    
    # Preprocess input manually using the encoders, then predict with clf
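
    A hedged sketch of that manual path (assuming the same column order as the pipeline sample above, i.e. age, sibsp, parch, fare, sex, embarked, and dense encoder output; the exact layout depends on how the notebooks built the encoders):

    import numpy as np

    age, sibsp, parch, fare, sex, embarked = 22, 1, 0, 7.25, 'male', 'S'

    # One-hot encode the categorical columns with the saved encoders
    # (add .toarray() if the encoders return sparse matrices)
    sex_enc = ohe_sex.transform([[sex]])
    embarked_enc = ohe_embarked.transform([[embarked]])

    # Reassemble the feature row in the order the classifier was trained on
    X = np.concatenate([np.array([[age, sibsp, parch, fare]]), sex_enc, embarked_enc], axis=1)
    print(clf.predict(X))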
    šŸŽÆ Inspiration
    
    Learn difference between manual feature engineering and pipeline-based workflows
    
    Understand how to avoid data leakage using Pipeline
    
    Explore cross-validation with pipelines
    
    Practice model persistence and deployment strategies
    
    āœ… Best Practice: Use pipe.pkl (pipeline) for predictions — it automatically handles preprocessing + modeling in one step!
    
    
    
  18. BYU 2025 | CryoET Dataset (Part 1)

    • kaggle.com
    zip
    Updated Apr 17, 2025
    Cite
    Mahdi Ravaghi (2025). BYU 2025 | CryoET Dataset (Part 1) [Dataset]. https://www.kaggle.com/datasets/ravaghi/byu-2025-cryoet-dataset-part-1
    Available download formats: zip (118559380441 bytes)
    Authors
    Mahdi Ravaghi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is based on the work by @brendanartley. The images are kept in their original size, and no preprocessing has been done to maintain flexibility. This makes the dataset larger than the allowed size on Kaggle, so it has been split into two parts. This is the first part; the second part can be found here.

    Here is the full script used to collect the data:

    from cryoet_data_portal import Client, Dataset
    import pandas as pd
    import numpy as np
    import shutil
    import zarr
    import cv2
    import os
    import gc
    
    datasets = Dataset.find(Client(), [Dataset.authors.name == "Morgan Beeby"])
    datasets.extend(Dataset.find(Client(), [Dataset.authors.name == "Yi-Wei Chang"]))
    datasets.extend(Dataset.find(Client(), [Dataset.authors.name == "Ariane Briegel"]))
    
    new_labels = pd.read_csv("/kaggle/input/byu-locating-bacterial-flagellar-motors-2025/train_labels.csv")[0:0]
    annotations = pd.read_csv("/kaggle/input/cryoet-flagellar-motors-dataset/labels.csv")
    
    row_id = len(new_labels)
    D, H, W = 128, 512, 512
    
    tmp_dir = "/temp"
    for dataset_idx, dataset in enumerate(datasets):
      print(f"Processing {dataset_idx+1}/{len(datasets)}: {dataset.title} ({len(dataset.runs)})")
    
      for run in dataset.runs:
        if run.name not in annotations.tomo_id.values:
          continue
    
        os.makedirs(tmp_dir, exist_ok=True)
        try:
          out_dir = f"dataset/{run.name}"
          if not os.path.exists(out_dir):
            os.makedirs(out_dir)
    
          tomo = run.tomograms[0]
          zarr_path = f"{tmp_dir}/{run.name}.zarr"
          tomo.download_omezarr(dest_path=tmp_dir)
    
          arr = zarr.open(zarr_path, mode='r')[0]
    
          batch_size = 32
          for i in range(0, arr.shape[0], batch_size):
            end_idx = min(i + batch_size, arr.shape[0])
            batch = arr[i:end_idx]
    
            for j, img in enumerate(batch):
              slice_idx = i + j
              cv2.imwrite(f"{out_dir}/slice_{str(slice_idx).zfill(4)}.jpg", (img*255).astype(np.uint8))
    
            del batch
            gc.collect()
    
          shape = arr.shape
          annotation = annotations[annotations.tomo_id == run.name]
          for i, row in annotation.iterrows():
            new_labels.loc[len(new_labels)] = {
              "row_id": row_id,
              "tomo_id": run.name,
              "Motor axis 0": row.z * (shape[0]/D),
              "Motor axis 1": row.y * (shape[1]/H),
              "Motor axis 2": row.x * (shape[2]/W),
              "Array shape (axis 0)": shape[0],
              "Array shape (axis 1)": shape[1],
              "Array shape (axis 2)": shape[2],
              "Voxel spacing": tomo.voxel_spacing,
              "Number of motors": len(annotation)
            }
            row_id += 1
    
        except Exception as e:
          print(e)
          
        shutil.rmtree(tmp_dir)
    
    new_labels.to_csv("labels.csv", index=False)
    
  19. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
    Available download formats: zip
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 šŸ”„

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training the model and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 šŸ”„

    In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into the required format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    šŸ“¦ YOLOv7

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG]

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    [Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67]

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except:
      wandb.login(anonymous='must')
      print('To use your W&B account,
    Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. 
    Get your W&B access token from here: https://wandb.ai/authorize')
      
      
      
    wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    [Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png]

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG]

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:
    • img: define input image size
    • batch: determine

  20. Huggingface Google MobileBERT

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    Darius Singh (2023). Huggingface Google MobileBERT [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-google-mobilebert
    Available download formats: zip (875319161 bytes)
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the mobilebert hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
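
    A short follow-on sketch of running the loaded checkpoint (a sketch; the pretraining head returns one or more logits tensors, so the shapes are printed generically):

    import torch

    inputs = tokenizer("Kaggle notebooks are great.", return_tensors="pt")
    with torch.no_grad():
      outputs = model(**inputs)
    print({k: tuple(v.shape) for k, v in outputs.items() if hasattr(v, "shape")})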
    

    Acknowledgements

    All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

Main menu