100+ datasets found
  1. pyVips: python & deb šŸ“¦package

    • kaggle.com
    Updated Oct 23, 2023
    Cite
    Jirka Borovec (2023). pyVips: python & deb šŸ“¦package [Dataset]. https://www.kaggle.com/datasets/jirkaborovec/pyvips-python-and-deb-package
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors
    Jirka Borovec
    Description

    Downloaded both Python and Debian packages for offline use. The creation and usage are described in https://www.kaggle.com/code/jirkaborovec/pip-pkg-pyvips-download-4-offline

    How to use:

    1. Click "**Add Data**" on your own notebook
    2. Search for dataset pyVips: python & deb package
    3. Run the installation lines below:
    !ls /kaggle/input/pyvips-python-and-deb-package
    # install the deb packages
    !dpkg -i --force-depends /kaggle/input/pyvips-python-and-deb-package/linux_packages/archives/*.deb
    # install the python wrapper
    !pip install pyvips -f /kaggle/input/pyvips-python-and-deb-package/python_packages/ --no-index
    !pip list | grep pyvips
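
    After installation, a quick import check (a minimal sketch; the commented lines show typical pyvips usage with a hypothetical image path):

    import pyvips

    print(pyvips.__version__)
    # image = pyvips.Image.new_from_file("/kaggle/input/some-image-dataset/sample.jpg")  # hypothetical path
    # print(image.width, image.height)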
    
  2. TabPFN (0.1.9) whl

    • kaggle.com
    zip
    Updated Jan 9, 2025
    Cite
    Carl McBride Ellis (2025). TabPFN (0.1.9) whl [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/tabpfn-019-whl
    Available download formats: zip (232721099 bytes)
    Authors
    Carl McBride Ellis
    Description

    This is the whl file for version 0.1.9 of TabPFN.

    1. Add the following dataset to your notebook: TabPFN (0.1.9) whl, using the "+ Add Data" button located on the right side of your notebook
    2. then simply install via:
    !pip install /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl
    

    followed by:

    !mkdir /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
    !cp /kaggle/input/tabpfn-019-whl/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/
    

    This dataset includes the files:

    • prior_diff_real_checkpoint_n_0_epoch_42.cpkt from https://github.com/automl/TabPFN/tree/main/tabpfn/models_diff
    • prior_diff_real_checkpoint_n_0_epoch_100.cpkt, which seems to be the model file required.

    Here is a use case demonstration notebook: "TabPFN test with notebook in "Internet off" mode"
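
    Once installed, a minimal usage sketch on scikit-learn's iris data (assuming the 0.1.x API, which exposes TabPFNClassifier with a scikit-learn-style fit/predict interface):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    # Small demo problem; TabPFN targets small tabular classification tasks
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    classifier = TabPFNClassifier(device="cpu")
    classifier.fit(X_train, y_train)
    print(classifier.predict(X_test)[:5])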

  3. Pytorch Models

    • kaggle.com
    zip
    Updated May 10, 2025
    Cite
    Sufian Othman (2025). Pytorch Models [Dataset]. https://www.kaggle.com/datasets/mohdsufianbinothman/pytorch-models/data
    Available download formats: zip (21493 bytes)
    Authors
    Sufian Othman
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    āœ… Step 1: Mount the dataset

    Search for my dataset pytorch-models and add it — this will mount it at:

    /kaggle/input/pytorch-models/

    āœ… Step 2: Check file paths

    Once mounted, the four files will be available at:

    /kaggle/input/pytorch-models/base_models.py
    /kaggle/input/pytorch-models/ext_base_models.py
    /kaggle/input/pytorch-models/ext_hybrid_models.py
    /kaggle/input/pytorch-models/hybrid_models.py
    

    āœ… Step 3: Copy files to working directory

    To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):

    import shutil
    
    shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
    shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
    

    āœ… Step 4: Import your modules

    Now that they are in the working directory, you can import them like normal:

    import base_models
    import ext_base_models
    import ext_hybrid_models
    import hybrid_models
    

    Or, if you only want to import specific classes or functions:

    from base_models import YourModelClass
    from ext_base_models import AnotherModelClass
    

    āœ… Step 5: Use the models

    You can now initialize and use the models/classes/functions defined inside each file:

    model = base_models.YourModelClass()
    output = model(input_data)
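
    As an alternative to copying the files, a sketch that adds the mounted dataset directory to Python's module search path (same paths as above):

    import sys

    # Make the mounted .py files importable without copying them
    sys.path.append('/kaggle/input/pytorch-models')

    import base_models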
    
  4. Job_skill_extractor_NER

    • kaggle.com
    zip
    Updated Jan 16, 2024
    Cite
    LeewanHung (2024). Job_skill_extractor_NER [Dataset]. https://www.kaggle.com/datasets/leewanhung/job-skill-extractor-ner
    Available download formats: zip (3587456 bytes)
    Authors
    LeewanHung
    Description

    Introduction

    This model was trained using a spaCy pipeline and data from job_description.

    The method is based on NER to recognize job skills. In this model, I mostly focus on technical skills with the tag "SKILL".

    The training source can be found here.

    How to use:

    import spacy
    from spacy.training.example import Example
    import json
    import random
    import warnings
    
    warnings.filterwarnings("ignore", category=UserWarning, module="spacy")
    warnings.filterwarnings("ignore", category=FutureWarning, module="tensorflow")
    
    path = "/kaggle/input/job_skills_extractor/scikitlearn/job_skill_extractor/1/job_skills_ner_model"
    loaded_nlp = spacy.load(path)
    
    # Test the loaded model with some example texts
    test_texts = [
      "I am skilled in Python and Java programming.",
      "My experience includes using TensorFlow for machine learning.",
      "I have hands-on experience with MongoDB and MySQL.",
      "Build machine learning",
    ]
    for text in test_texts:
      doc = loaded_nlp(text)
      print("Input Text:", text)
      print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
    

    output

    Input Text: I am skilled in Python and Java programming.
    Entities: [('Python', "['SKILL']"), ('Java', "['SKILL']")]
    Input Text: My experience includes using TensorFlow for machine learning.
    Entities: [('TensorFlow', "['SKILL']"), ('machine learning.', "['SKILL']")]
    Input Text: I have hands-on experience with MongoDB and MySQL.
    Entities: [('MongoDB', "['SKILL']"), ('MySQL', "['SKILL']")]
    Input Text: Build machine learning
    Entities: [('machine learning', "['SKILL']")]
    
  5. rouge-score

    • kaggle.com
    zip
    Updated Sep 3, 2023
    Cite
    bytestorm (2023). rouge-score [Dataset]. https://www.kaggle.com/datasets/bytestorm/rouge-score/code
    Available download formats: zip (30793 bytes)
    Authors
    bytestorm
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Steps for installation

    1. Add the dataset to your notebook.
    2. Then run the following two bash commands from a notebook cell:
    !cp -r /kaggle/input/rouge-score/rouge_score-0.1.2 /kaggle/working/
    !pip install /kaggle/working/rouge_score-0.1.2/

    Usage in python

    from rouge_score import rouge_scorer
    
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score('The quick brown fox jumps over the lazy dog',
               'The quick brown dog jumps on the log.')
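
    Each entry in scores is a Score tuple with precision, recall, and fmeasure fields, so individual values can be read like this:

    print(scores['rouge1'].fmeasure)
    print(scores['rougeL'].precision, scores['rougeL'].recall)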
    
  6. Vezora/Tested-188k-Python-Alpaca: Functional

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth/discussion
    Available download formats: zip (12200606 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  7. pytorch_tabularšŸ”„: python šŸ“¦package

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    Jirka Borovec (2023). pytorch_tabularšŸ”„: python šŸ“¦package [Dataset]. https://www.kaggle.com/datasets/jirkaborovec/pytorch-tabular-python-package
    Available download formats: zip (4699016393 bytes)
    Authors
    Jirka Borovec
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    How to install this package | step by step:

    1. Click "**Add Data**" on your own notebook
    2. Search for "**pytorch_tabularšŸ”„: python šŸ“¦package**" and add this dataset as a data source
    3. Run the installation line below:
    !pip install pytorch_tabular -f /kaggle/input/pytorch-tabular-python-package/ --no-index
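
    A quick check that the offline install worked (a minimal sketch):

    import pytorch_tabular
    from importlib.metadata import version

    print(version("pytorch_tabular"))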
    
  8. AI4Code Train Dataframe

    • kaggle.com
    zip
    Updated May 12, 2022
    Cite
    Darien Schettler (2022). AI4Code Train Dataframe [Dataset]. https://www.kaggle.com/datasets/dschettler8845/ai4code-train-dataframe
    Available download formats: zip (622120487 bytes)
    Authors
    Darien Schettler
    Description

    [EDIT/UPDATE]

    There are a few important updates.

    1. When SAVING the pd.DataFrame as a .csv, the following command should be used to avoid improper interpretation of newline character(s).
    import csv
    import pandas as pd

    train_df.to_csv(
      "train.csv", index=False, 
      encoding='utf-8', 
      quoting=csv.QUOTE_NONNUMERIC  # <== THIS IS REQUIRED
    )
    
    2. When LOADING the .csv as a pd.DataFrame, the following command must be used to avoid misinterpretation of NaN-like strings (null, nan, ...) as NaN values.
    train_df = pd.read_csv(
      "/kaggle/input/ai4code-train-dataframe/train.csv", 
      keep_default_na=False  # <== THIS IS REQUIRED
    )
    
  9. MSU Prac Denoising THICK Dataset

    • kaggle.com
    zip
    Updated Apr 30, 2024
    Cite
    Nikita Breskanu (2024). MSU Prac Denoising THICK Dataset [Dataset]. https://www.kaggle.com/datasets/nikitabreskanu/msu-prac-denoising-thick-dataset
    Available download formats: zip (26572934991 bytes)
    Authors
    Nikita Breskanu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a dataset for MSU practicum homework on denoising audio. The homework can be found here: https://github.com/mmp-practicum-team/mmp_practicum_spring_2024/blob/main/Tasks/Task%2004/task_04.ipynb

    After adding this dataset to your kaggle kernel, change the templates cell to the following:

    noise_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/musan/musan/noise/*/*.wav'
    audio_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/train-clean-100/LibriSpeech/train-clean-100/*/*/*.flac'
    ru_audio_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/ruls_data/dev/audio/*/*/*.wav'
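
    The templates are plain glob patterns, so the file lists can be collected directly (a minimal sketch):

    import glob

    noise_files = sorted(glob.glob(noise_files_template))
    audio_files = sorted(glob.glob(audio_files_template))
    ru_audio_files = sorted(glob.glob(ru_audio_files_template))
    print(len(noise_files), len(audio_files), len(ru_audio_files))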
    

    The purpose of this dataset is to save students time on uploading 30 GB of audio data to kaggle.

  10. CIFAR-10 Python

    • kaggle.com
    zip
    Updated Jan 27, 2018
    Cite
    Kris (2018). CIFAR-10 Python [Dataset]. https://www.kaggle.com/datasets/pankrzysiu/cifar10-python/code
    Available download formats: zip (340613496 bytes)
    Authors
    Kris
    Description

    Context

    CIFAR-10 is an excellent dataset for many image-processing experiments.

    Content

    Usage instructions

    in Keras

    from os import listdir, makedirs
    from os.path import join, exists, expanduser
    
    cache_dir = expanduser(join('~', '.keras'))
    if not exists(cache_dir):
      makedirs(cache_dir)
    datasets_dir = join(cache_dir, 'datasets') # /cifar-10-batches-py
    if not exists(datasets_dir):
      makedirs(datasets_dir)
    
    # If you have multiple input datasets, change the below cp command accordingly, typically:
    # !cp ../input/cifar10-python/cifar-10-python.tar.gz ~/.keras/datasets/
    !cp ../input/cifar-10-python.tar.gz ~/.keras/datasets/
    !ln -s ~/.keras/datasets/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
    !tar xzvf ~/.keras/datasets/cifar-10-python.tar.gz -C ~/.keras/datasets/
    

    general Python 3

    def unpickle(file):
      # Each CIFAR-10 batch is a pickled dict with byte-string keys (b'data', b'labels', ...)
      import pickle
      with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
      return batch
    
    !tar xzvf ../input/cifar-10-python.tar.gz
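
    A quick look at one extracted batch (a sketch; the pickled batches use byte-string keys):

    batch = unpickle('cifar-10-batches-py/data_batch_1')
    print(batch[b'data'].shape)    # (10000, 3072): 10000 images, 32x32x3 flattened
    print(batch[b'labels'][:10])   # integer class labels in 0-9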
    

    then see section "Dataset layout" in https://www.cs.toronto.edu/~kriz/cifar.html for details

    Acknowledgements

    Downloaded directly from here:

    https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

    See description: https://www.cs.toronto.edu/~kriz/cifar.html

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  11. pip: sktime 0.19.1

    • kaggle.com
    zip
    Updated Jun 15, 2023
    Cite
    Anthony Panozzo (2023). pip: sktime 0.19.1 [Dataset]. https://www.kaggle.com/datasets/panozzaj/pip-sktime-0-19-1
    Available download formats: zip (92945770 bytes)
    Authors
    Anthony Panozzo
    Description

    This dataset contains the dependencies for the sktime package, version 0.19.1. You can use this to install sktime on Kaggle without needing to download the dependencies. This can be useful if you are working on a competition that prohibits internet access in submission notebooks.

    To use, add this dataset to your notebook and then install the dependencies by executing a cell with the following code:

    
    deps_path = '/kaggle/input/pip-sktime-0-19-1'
    ! pip install --no-index --find-links {deps_path}/deps --requirement {deps_path}/requirements.txt
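
    A quick check that the offline install succeeded (a minimal sketch):

    import sktime

    print(sktime.__version__)  # expected: 0.19.1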
    

    License: whatever the underlying dependencies' licenses are. I claim no ownership of or responsibility for the dependencies.

    Feedback? Additional packages you'd like? Run this Python code to find my email address:

    import base64; print(base64.b64decode('cGFub3p6YWpAZ21haWwuY29t'.encode()).decode())

    If you end up using this package, an upvote or note would be helpful as it would let me know that it's useful to upload these kinds of datasets. Thanks!

  12. Huggingface RoBERTa

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Darius Singh (2023). Huggingface RoBERTa [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-roberta
    Available download formats: zip (34531447596 bytes)
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-roberta/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
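
    A short follow-on sketch of running the loaded checkpoint on a sentence (a sketch; for RoBERTa the pretraining head is a masked-language-model head, so the output carries token logits):

    import torch

    inputs = tokenizer("Kaggle notebooks are great.", return_tensors="pt")
    with torch.no_grad():
      outputs = model(**inputs)
    print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)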
    

    Acknowledgements

    All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  13. kaggle-notebooks-edu-v0

    • huggingface.co
    Updated May 31, 2025
    Cite
    Jupyter Agent (2025). kaggle-notebooks-edu-v0 [Dataset]. https://huggingface.co/datasets/jupyter-agent/kaggle-notebooks-edu-v0
    Dataset provided by: Project Jupyter (https://jupyter.org/)
    Authors
    Jupyter Agent
    Description

    Kaggle Notebooks LLM Filtered

    Model: meta-llama/Meta-Llama-3.1-70B-Instruct
    Sample: 12,400
    Source dataset: data-agents/kaggle-notebooks
    Prompt:

    Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.

    The notebooks are formatted with the following tokens:

    START

    Here comes markdown content

    Here comes python code

    Here comes code output

    More… See the full description on the dataset page: https://huggingface.co/datasets/jupyter-agent/kaggle-notebooks-edu-v0.

  14. codeparrot_1M

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    Tanay Mehta (2024). codeparrot_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/codeparrot-1m
    Available download formats: zip (2368083124 bytes)
    Authors
    Tanay Mehta
    Description

    A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in the Lance file format for blazing-fast and memory-efficient I/O.

    The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    The script used for creating the dataset can be found here.

    Instructions for using this dataset

    This dataset is not supposed to be used in Kaggle Kernels: Lance requires the dataset's input directory to have write access, Kaggle Kernels' input directory does not, and the dataset size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it via the Kaggle API or through this page and then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/codeparrot-1m
    $ mkdir codeparrot_1M.lance/
    $ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
    $ rm codeparrot-1m.zip
    

    Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.

    import lance
    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
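
    To peek at the actual contents, a hedged sketch (assuming the pylance API, where take() returns a pyarrow Table of the requested rows):

    import lance

    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.schema)        # column names and types
    first_rows = dataset.take([0, 1])
    print(first_rows.num_rows)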

    Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

  15. bitsandbytes

    • kaggle.com
    zip
    Updated Oct 2, 2025
    Cite
    Yudai Hayashi (2025). bitsandbytes [Dataset]. https://www.kaggle.com/datasets/yuhaya9/bitsandbytes/data
    Available download formats: zip (3982501081 bytes)
    Authors
    Yudai Hayashi
    Description

    How to use

    In your notebook, execute the following command.

    !pip install --no-deps --no-index --find-links=/kaggle/input/bitsandbytes bitsandbytes
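
    A quick import check after installing (a minimal sketch; most bitsandbytes features need a CUDA GPU at runtime):

    import bitsandbytes as bnb

    print(bnb.__version__)  # expected: 0.48.0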
    

    Version

    bitsandbytes 0.48.0

    How to make this dataset

    !pip download bitsandbytes
    
  16. r/cosplay hot top images with titles

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    dinhanhx (2023). r/cosplay hot top images with titles [Dataset]. https://www.kaggle.com/datasets/inhanhv/rcosplay-hot-top-images-with-titles
    Available download formats: zip (1251562500 bytes)
    Authors
    dinhanhx
    License

    https://www.reddit.com/wiki/api

    Description

    Please visit dinhanhx/rct

    Sauce for the thumbnail

    r/cosplay title crawler

    Available on Kaggle

    Please take time to read all of this readme before using the dataset. Yes, I'm serious!

    Setup

    pip install -e .
    

    Go to this PRAW doc page, follow the instructions to get your client id, client secret, and user agent.

    Then store them in confidential/reddit.json like this (don't actually write "spooky"):

    {
      "id": "spooky",
      "secret": "spooky",
      "user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
    }

    Run

    Download all posts in top and hot

    (the number in each category is limited by Reddit)
    • Output file: data/cosplay.jsonl
    • 2161 posts (on 01/03/2023)

    python rct/crawl.py

    Clean text

    Removes text in post titles enclosed by square brackets, such as [self], [found], ...
    • Input file: data/cosplay.jsonl
    • Output file: data/clean_cosplay.jsonl

    python rct/clean.py

    Download images

    • Input file: data/clean_cosplay.jsonl
    • Output files: data/map_cosplay.jsonl, data/bad_response.jsonl
    • 2160 downloaded images, 1 bad/deleted/deprecated image (on 02/03/2023)

    python rct/download.py

    ⚠ The image_id and image_path attributes' values are NOT linearly continuous. For example,

    in data/bad_response.jsonl:

    {"image_id": "001912", "image_path": "data/image/001912.jpg"}

    and in data/map_cosplay.jsonl:

    # ... omit other json objects ...
    {"image_id": "001911", "image_path": "data/image/001911.jpg"}
    {"image_id": "001913", "image_path": "data/image/001913.jpg"}
    # ... omit other json objects ...

    ⚠ `image_path` attribute's values are `data/image/*.jpg`. They are relative to the folder `data` containing all `.jsonl` files and the `image` folder. The folder `data` is produced by the Python scripts.

    ⚠ `image_path` attribute's values MISMATCH the name of the folder containing all `.jsonl` files and the `image` folder on Kaggle. When you load the data from the Kaggle dataset, the `data` prefix in `data/image/000000.jpg` should be replaced with the Kaggle path (see [this notebook](https://www.kaggle.com/code/inhanhv/rct-demo)). It becomes `/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg`.
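
    A small path-fixup sketch for reading the mapping file on Kaggle (assuming map_cosplay.jsonl sits at the dataset root; adjust if the layout differs):

    import json

    KAGGLE_ROOT = "/kaggle/input/rcosplay-hot-top-images-with-titles"

    with open(f"{KAGGLE_ROOT}/map_cosplay.jsonl") as f:
      records = [json.loads(line) for line in f]

    # Rewrite the local 'data/' prefix to the Kaggle mount point
    for rec in records:
      rec["image_path"] = rec["image_path"].replace("data/", f"{KAGGLE_ROOT}/", 1)

    print(records[0]["image_path"])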
    
  17. working with pipeline

    • kaggle.com
    Updated Sep 2, 2025
    Cite
    Fiza Aslam1 (2025). working with pipeline [Dataset]. https://www.kaggle.com/datasets/fizaaslam12/working-with-pipeline
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors
    Fiza Aslam1
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    šŸš€ Feature Engineering with Scikit-Learn (Titanic Case Study)

    This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
    It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.

    šŸ“Œ About

    Feature Engineering is a crucial step in Machine Learning.
    In this project, I show:
    • Handling missing values with SimpleImputer
    • Encoding categorical variables with OneHotEncoder
    • Building models manually vs using Pipeline
    • Saving models and pipelines with pickle
    • Making predictions with and without pipelines

    šŸ“‚ Content

    • train.csv → Titanic dataset
    • withpipeline.ipynb → End-to-end pipeline workflow
    • withoutpipeline.ipynb → Manual preprocessing workflow
    • predictusingpipeline.ipynb → Predictions with saved pipeline (pipe.pkl)
    • predictwithoutpipeline.ipynb → Predictions with classifier + encoders
    • models/
      • pipe.pkl → Complete ML pipeline (recommended for predictions)
      • clf.pkl → Classifier without pipeline
      • ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features

    ⚔ Usage

    1ļøāƒ£ Load and Use Pipeline

    import pickle
    
    pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
    sample = [[22, 1, 0, 7.25, 'male', 'S']]
    print(pipe.predict(sample))
    2ļøāƒ£ Predict without Pipeline
    import pickle
    
    clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
    ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
    ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
    
    # Preprocess input manually using the encoders, then predict with clf
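
    A hedged sketch of that manual path (assuming the same column order as the pipeline sample above, i.e. age, sibsp, parch, fare, sex, embarked, and dense encoder output; the exact layout depends on how the notebooks built the encoders):

    import numpy as np

    age, sibsp, parch, fare, sex, embarked = 22, 1, 0, 7.25, 'male', 'S'

    # One-hot encode the categorical columns with the saved encoders
    # (add .toarray() if the encoders return sparse matrices)
    sex_enc = ohe_sex.transform([[sex]])
    embarked_enc = ohe_embarked.transform([[embarked]])

    # Reassemble the feature row in the order the classifier was trained on
    X = np.concatenate([np.array([[age, sibsp, parch, fare]]), sex_enc, embarked_enc], axis=1)
    print(clf.predict(X))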
    šŸŽÆ Inspiration
    
    Learn difference between manual feature engineering and pipeline-based workflows
    
    Understand how to avoid data leakage using Pipeline
    
    Explore cross-validation with pipelines
    
    Practice model persistence and deployment strategies
    
    āœ… Best Practice: Use pipe.pkl (pipeline) for predictions — it automatically handles preprocessing + modeling in one step!
    
    
    
  18. BYU 2025 | CryoET Dataset (Part 1)

    • kaggle.com
    zip
    Updated Apr 17, 2025
    Cite
    Mahdi Ravaghi (2025). BYU 2025 | CryoET Dataset (Part 1) [Dataset]. https://www.kaggle.com/datasets/ravaghi/byu-2025-cryoet-dataset-part-1
    Available download formats: zip (118559380441 bytes)
    Authors
    Mahdi Ravaghi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is based on the work by @brendanartley. The images are kept in their original size, and no preprocessing has been done to maintain flexibility. This makes the dataset larger than the allowed size on Kaggle, so it has been split into two parts. This is the first part; the second part can be found here.

    Here is the full script used to collect the data:

    from cryoet_data_portal import Client, Dataset
    import pandas as pd
    import numpy as np
    import shutil
    import zarr
    import cv2
    import os
    import gc
    
    datasets = Dataset.find(Client(), [Dataset.authors.name == "Morgan Beeby"])
    datasets.extend(Dataset.find(Client(), [Dataset.authors.name == "Yi-Wei Chang"]))
    datasets.extend(Dataset.find(Client(), [Dataset.authors.name == "Ariane Briegel"]))
    
    new_labels = pd.read_csv("/kaggle/input/byu-locating-bacterial-flagellar-motors-2025/train_labels.csv")[0:0]
    annotations = pd.read_csv("/kaggle/input/cryoet-flagellar-motors-dataset/labels.csv")
    
    row_id = len(new_labels)
    D, H, W = 128, 512, 512
    
    tmp_dir = "/temp"
    for dataset_idx, dataset in enumerate(datasets):
      print(f"Processing {dataset_idx+1}/{len(datasets)}: {dataset.title} ({len(dataset.runs)})")
    
      for run in dataset.runs:
        if run.name not in annotations.tomo_id.values:
          continue
    
        os.makedirs(tmp_dir, exist_ok=True)
        try:
          out_dir = f"dataset/{run.name}"
          if not os.path.exists(out_dir):
            os.makedirs(out_dir)
    
          tomo = run.tomograms[0]
          zarr_path = f"{tmp_dir}/{run.name}.zarr"
          tomo.download_omezarr(dest_path=tmp_dir)
    
          arr = zarr.open(zarr_path, mode='r')[0]
    
          batch_size = 32
          for i in range(0, arr.shape[0], batch_size):
            end_idx = min(i + batch_size, arr.shape[0])
            batch = arr[i:end_idx]
    
            for j, img in enumerate(batch):
              slice_idx = i + j
              cv2.imwrite(f"{out_dir}/slice_{str(slice_idx).zfill(4)}.jpg", (img*255).astype(np.uint8))
    
            del batch
            gc.collect()
    
          shape = arr.shape
          annotation = annotations[annotations.tomo_id == run.name]
          for i, row in annotation.iterrows():
            new_labels.loc[len(new_labels)] = {
              "row_id": row_id,
              "tomo_id": run.name,
              "Motor axis 0": row.z * (shape[0]/D),
              "Motor axis 1": row.y * (shape[1]/H),
              "Motor axis 2": row.x * (shape[2]/W),
              "Array shape (axis 0)": shape[0],
              "Array shape (axis 1)": shape[1],
              "Array shape (axis 2)": shape[2],
              "Voxel spacing": tomo.voxel_spacing,
              "Number of motors": len(annotation)
            }
            row_id += 1
    
        except Exception as e:
          print(e)
          
        shutil.rmtree(tmp_dir)
    
    new_labels.to_csv("labels.csv", index=False)
    
  19. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
    Available download formats: zip
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 šŸ”„

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training the model and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 šŸ”„

    In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into the required format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    šŸ“¦ YOLOv7

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG]

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    [Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67]

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except:
      wandb.login(anonymous='must')
      print('To use your W&B account,
    Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. 
    Get your W&B access token from here: https://wandb.ai/authorize')
      
      
      
    wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    [Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png]

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG]

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:
    • img: define input image size
    • batch: determine

  20. Huggingface Google MobileBERT

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    Darius Singh (2023). Huggingface Google MobileBERT [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-google-mobilebert
    Available download formats: zip (875319161 bytes)
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the mobilebert hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
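
    A short follow-on sketch of running the loaded checkpoint (a sketch; the pretraining head returns one or more logits tensors, so the shapes are printed generically):

    import torch

    inputs = tokenizer("Kaggle notebooks are great.", return_tensors="pt")
    with torch.no_grad():
      outputs = model(**inputs)
    print({k: tuple(v.shape) for k, v in outputs.items() if hasattr(v, "shape")})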
    

    Acknowledgements

    All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

Main menu