Downloaded both Python and Debian packages for offline use. The creation and usage are described in https://www.kaggle.com/code/jirkaborovec/pip-pkg-pyvips-download-4-offline
!ls /kaggle/input/pyvips-python-and-deb-package
# install the deb packages
!dpkg -i --force-depends /kaggle/input/pyvips-python-and-deb-package/linux_packages/archives/*.deb
# install the python wrapper
!pip install pyvips -f /kaggle/input/pyvips-python-and-deb-package/python_packages/ --no-index
!pip list | grep pyvips
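A quick smoke test after installation, as a minimal sketch (the image path below is hypothetical; point it at a file from one of your attached datasets):
import pyvips
# Hypothetical path - replace with an image from an attached dataset
image = pyvips.Image.new_from_file("/kaggle/input/some-dataset/sample.tif", access="sequential")
print(image.width, image.height, image.bands)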
This is the whl file for version 0.1.9 of TabPFN.
!pip install /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl
followed by:
!mkdir /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
!cp /kaggle/input/tabpfn-019-whl/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/
This dataset includes the files:
* prior_diff_real_checkpoint_n_0_epoch_42.cpkt from https://github.com/automl/TabPFN/tree/main/tabpfn/models_diff
* prior_diff_real_checkpoint_n_0_epoch_100.cpkt which seems to be the model file required.
Here is a use case demonstration notebook: "TabPFN test with notebook in "Internet off" mode"
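For a quick offline sanity check after installation, here is a minimal sketch assuming the TabPFN 0.1.x scikit-learn-style API (the toy dataset is only an example; TabPFN expects small tabular problems):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# N_ensemble_configurations trades accuracy for speed; use device='cuda' if a GPU is available
clf = TabPFNClassifier(device='cpu', N_ensemble_configurations=8)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))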
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Step 1: Mount the dataset
Search for my dataset pytorch-models and add it; this will mount it at:
/kaggle/input/pytorch-models/
Step 2: Check file paths. Once mounted, the four files will be available at:
/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py
Step 3: Copy files to the working directory. To make them importable, copy the .py files to your notebook's working directory (/kaggle/working/):
import shutil
shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
Step 4: Import your modules. Now that they are in the working directory, you can import them as usual:
import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models
Or, if you only want to import specific classes or functions:
from base_models import YourModelClass
from ext_base_models import AnotherModelClass
Step 5: Use the models. You can now initialize and use the models/classes/functions defined inside each file:
model = base_models.YourModelClass()
output = model(input_data)
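If you prefer not to copy files, a minimal alternative sketch is to add the dataset folder to sys.path and import directly from the input directory:
import sys
sys.path.append('/kaggle/input/pytorch-models')
import base_models  # resolvable now without copying to /kaggle/working/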
This model was trained using a spaCy pipeline and data from job_description.
The method is based on NER to recognize job skills. In this model, I mostly focus on technical skills with the tag "SKILL".
The training source can be found here.
import spacy
from spacy.training.example import Example
import json
import random
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="spacy")
warnings.filterwarnings("ignore", category=FutureWarning, module="tensorflow")
path = "/kaggle/input/job_skills_extractor/scikitlearn/job_skill_extractor/1/job_skills_ner_model"
loaded_nlp = spacy.load(path)
# Test the loaded model with some example texts
test_texts = [
"I am skilled in Python and Java programming.",
"My experience includes using TensorFlow for machine learning.",
"I have hands-on experience with MongoDB and MySQL.",
"Build machine learning",
]
for text in test_texts:
    doc = loaded_nlp(text)
    print("Input Text:", text)
    print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
Input Text: I am skilled in Python and Java programming.
Entities: [('Python', "['SKILL']"), ('Java', "['SKILL']")]
Input Text: My experience includes using TensorFlow for machine learning.
Entities: [('TensorFlow', "['SKILL']"), ('machine learning.', "['SKILL']")]
Input Text: I have hands-on experience with MongoDB and MySQL.
Entities: [('MongoDB', "['SKILL']"), ('MySQL', "['SKILL']")]
Input Text: Build machine learning
Entities: [('machine learning', "['SKILL']")]
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
!cp -r /kaggle/input/rouge-score/rouge_score-0.1.2 /kaggle/working/
!pip install /kaggle/working/rouge_score-0.1.2/
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
'The quick brown dog jumps on the log.')
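Each entry in scores is a Score tuple with precision, recall, and F-measure, so the result can be inspected like this:
for key, value in scores.items():
    print(key, round(value.precision, 3), round(value.recall, 3), round(value.fmeasure, 3))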
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
By Vezora (From Huggingface) [source]
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.
This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.
By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.
Contents of the Dataset
The dataset consists of several columns:
- output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
- instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
- input: The input parameters or values required to execute each Python code sample.
Exploring the Dataset
To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:
- Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
import pandas as pd
# Load the dataset
df = pd.read_csv('train.csv')
- Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
# Display column names
print(df.columns)
- Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
# Display random samples from the 'output' column
print(df['output'].sample(5))
- Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
# Count unique instructions and display the ones with the highest occurrences
instruction_counts = df['instruction'].value_counts()
print(instruction_counts.head(10))
Potential Use Cases
The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:
- Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
- Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
- Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
- Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.
Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different applications, including:
- Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
- Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
- Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
MIT License: https://opensource.org/licenses/MIT
!pip install pytorch_tabular -f /kaggle/input/pytorch-tabular-python-package/ --no-index
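A quick check that the offline install worked (sketch):
!pip list | grep -i tabular
import pytorch_tabular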
[EDIT/UPDATE]
There are a few important updates.
When saving the pd.DataFrame as a .csv, the following command should be used to avoid improper interpretation of newline characters:
import csv
train_df.to_csv(
    "train.csv", index=False,
    encoding='utf-8',
    quoting=csv.QUOTE_NONNUMERIC  # <== THIS IS REQUIRED
)
When reading the .csv back into a pd.DataFrame, the following command must be used to avoid misinterpretation of NaN-like strings (null, nan, ...) as pd.NaN values:
train_df = pd.read_csv(
    "/kaggle/input/ai4code-train-dataframe/train.csv",
    keep_default_na=False  # <== THIS IS REQUIRED
)
MIT License: https://opensource.org/licenses/MIT
This is a dataset for MSU practicum homework on denoising audio. The homework can be found here: https://github.com/mmp-practicum-team/mmp_practicum_spring_2024/blob/main/Tasks/Task%2004/task_04.ipynb
After adding this dataset to your Kaggle kernel, change the template cell to the following:
noise_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/musan/musan/noise/*/*.wav'
audio_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/train-clean-100/LibriSpeech/train-clean-100/*/*/*.flac'
ru_audio_files_template = '/kaggle/input/msu-prac-denoising-thick-dataset/ruls_data/dev/audio/*/*/*.wav'
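A quick sanity check that the templates resolve to files (sketch):
import glob
print(len(glob.glob(noise_files_template)),
      len(glob.glob(audio_files_template)),
      len(glob.glob(ru_audio_files_template)))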
The purpose of this dataset is to save students the time of uploading 30 GB of audio data to Kaggle.
CIFAR-10 is an excellent dataset for many image-processing experiments.
Usage instructions
from os import listdir, makedirs
from os.path import join, exists, expanduser
cache_dir = expanduser(join('~', '.keras'))
if not exists(cache_dir):
    makedirs(cache_dir)
datasets_dir = join(cache_dir, 'datasets')  # /cifar-10-batches-py
if not exists(datasets_dir):
    makedirs(datasets_dir)
# If you have multiple input datasets, change the below cp command accordingly, typically:
# !cp ../input/cifar10-python/cifar-10-python.tar.gz ~/.keras/datasets/
!cp ../input/cifar-10-python.tar.gz ~/.keras/datasets/
!ln -s ~/.keras/datasets/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
!tar xzvf ~/.keras/datasets/cifar-10-python.tar.gz -C ~/.keras/datasets/
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data
!tar xzvf ../input/cifar-10-python.tar.gz
then see section "Dataset layout" in https://www.cs.toronto.edu/~kriz/cifar.html for details
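For example, after extracting the archive, a single training batch can be loaded with the unpickle helper above (the byte-string keys follow that layout description):
batch = unpickle('cifar-10-batches-py/data_batch_1')
print(batch[b'data'].shape)   # (10000, 3072): rows of flattened 32x32x3 images
print(len(batch[b'labels']))  # 10000 class indices in the range 0-9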
Downloaded directly from here:
https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
See description: https://www.cs.toronto.edu/~kriz/cifar.html
This dataset contains the dependencies for the sktime package, version 0.19.1. You can use this to install sktime on Kaggle without needing to download the dependencies. This can be useful if you are working on a competition that prohibits internet access in submission notebooks.
To use, add this dataset to your notebook and then install the dependencies by executing a cell with the following code:
deps_path = '/kaggle/input/pip-sktime-0-19-1'
! pip install --no-index --find-links {deps_path}/deps --requirement {deps_path}/requirements.txt
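To confirm the expected version was picked up from the local wheels (sketch):
import sktime
print(sktime.__version__)  # expected: 0.19.1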
License: whatever the underlying dependencies' licenses are. I claim no ownership of or responsibility for the dependencies.
Feedback? Additional packages you'd like? Run this Python code to find my email address:
import base64; print(base64.b64decode('cGFub3p6YWpAZ21haWwuY29t'.encode()).decode())
If you end up using this package, an upvote or note would be helpful as it would let me know that it's useful to upload these kinds of datasets. Thanks!
This dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-roberta/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
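A quick smoke test after loading, as a sketch (the exact output class depends on which head AutoModelForPreTraining resolves to for RoBERTa):
inputs = tokenizer("Kaggle notebooks with internet access off", return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape, type(outputs).__name__)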
Acknowledgements: All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
Kaggle Notebooks LLM Filtered
Model: meta-llama/Meta-Llama-3.1-70B-Instruct
Sample: 12,400
Source dataset: data-agents/kaggle-notebooks
Prompt:
Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.
The notebooks are formatted with the following tokens:
START
A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in the Lance file format for blazing-fast and memory-efficient I/O.
The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
The script used for creating the dataset can be found here.
This dataset is not supposed to be used on Kaggle Kernels since Lance requires the input directory of the dataset to have write access but Kaggle Kernel's input directory doesn't have it and the dataset size prohibits one from moving it to /kaggle/working. Hence, to use this dataset, you must download it by using the Kaggle API or through this page and then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.
First download and unzip the dataset from your terminal (make sure you have your Kaggle API key at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/codeparrot-1m
$ mkdir codeparrot_1M.lance/
$ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
$ rm codeparrot-1m.zip
Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.
import lance
dataset = lance.dataset('codeparrot_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
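To peek at the first record, a sketch assuming the pylance take/to_pydict API:
first = dataset.take([0]).to_pydict()  # returns a dict of column name -> list of values
print(list(first.keys()))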
Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.
In your notebook, execute the following command.
!pip install --no-deps --no-index --find-links=/kaggle/input/bitsandbytes bitsandbytes
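To verify the offline install (sketch):
import bitsandbytes as bnb
print(bnb.__version__)  # expected: 0.48.0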
The dataset contains bitsandbytes 0.48.0, downloaded with:
!pip download bitsandbytes
https://www.reddit.com/wiki/api
Please visit dinhanhx/rct
Please take time to read all this readme before using the dataset. Yes I'm serious!
pip install -e .
Go to this PRAW doc page, follow the instructions to get your client id, client secret, and user agent.
Then store them in confidential/reddit.json like this (don't actually write "spooky"):
{
"id": "spooky",
"secret": "spooky",
"user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
}
The crawler collects posts by category, such as hot and top (but the number in each category is limited by Reddit):
- Output file: data/cosplay.jsonl
- 2161 posts (on 01/03/2023)
python rct/crawl.py
The cleaning step removes tags (in the post's title) enclosed by square brackets, such as [self], [found], ...:
- Input file: data/cosplay.jsonl
- Output file: data/clean_cosplay.jsonl
python rct/clean.py
- Input file: data/clean_cosplay.jsonl
- Output files: data/map_cosplay.jsonl, data/bad_response.jsonl
python rct/download.py
The image_id and image_path attributes' values are NOT linearly continuous. For example, in data/bad_response.jsonl:
{"image_id": "001912", "image_path": "data/image/001912.jpg"}
and in data/map_cosplay.jsonl:
{"image_id": "001911", "image_path": "data/image/001911.jpg"}
{"image_id": "001913", "image_path": "data/image/001913.jpg"}
- `image_path` attribute's values are `data/image/*.jpg`. They are relative to the folder `data` containing all `.jsonl` files and the `image` folder. The folder `data` is produced by the Python scripts.
- `image_path` attribute's values MISMATCH the name of the folder containing all `.jsonl` files and the `image` folder on Kaggle. When you load the data from the Kaggle dataset, the `data` prefix of `data/image/000000.jpg` should be replaced with the Kaggle path (see [this notebook](https://www.kaggle.com/code/inhanhv/rct-demo)); it becomes `/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg`.
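A minimal sketch of that remapping (the Kaggle root below is taken from the example path above):
KAGGLE_ROOT = "/kaggle/input/rcosplay-hot-top-images-with-titles"

def to_kaggle_path(image_path: str) -> str:
    # "data/image/000000.jpg" -> "/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg"
    return image_path.replace("data", KAGGLE_ROOT, 1)

print(to_kaggle_path("data/image/000000.jpg"))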
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.
Feature Engineering is a crucial step in Machine Learning.
In this project, I show:
- Handling missing values with SimpleImputer
- Encoding categorical variables with OneHotEncoder
- Building models manually vs using Pipeline
- Saving models and pipelines with pickle
- Making predictions with and without pipelines
Files included:
- pipe.pkl: Complete ML pipeline (recommended for predictions)
- clf.pkl: Classifier without pipeline
- ohe_sex.pkl, ohe_embarked.pkl: Encoders for categorical features
Predict with the pipeline:
import pickle
pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
sample = [[22, 1, 0, 7.25, 'male', 'S']]
print(pipe.predict(sample))
Predict without the pipeline:
import pickle
clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
# Preprocess input manually using the encoders, then predict with clf
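A minimal sketch of that manual route (the column order age, sibsp, parch, fare, sex, embarked and the exact feature layout expected by clf are assumptions; check how the classifier was trained):
import numpy as np

sample = [[22, 1, 0, 7.25, 'male', 'S']]  # assumed order: age, sibsp, parch, fare, sex, embarked
num = np.array([row[:4] for row in sample], dtype=float)
sex_enc = ohe_sex.transform([[row[4]] for row in sample])
emb_enc = ohe_embarked.transform([[row[5]] for row in sample])
# If the encoders were fitted with sparse output, densify before stacking
sex_enc = sex_enc.toarray() if hasattr(sex_enc, "toarray") else np.asarray(sex_enc)
emb_enc = emb_enc.toarray() if hasattr(emb_enc, "toarray") else np.asarray(emb_enc)
X = np.hstack([num, sex_enc, emb_enc])
print(clf.predict(X))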
Inspiration
Learn difference between manual feature engineering and pipeline-based workflows
Understand how to avoid data leakage using Pipeline
Explore cross-validation with pipelines
Practice model persistence and deployment strategies
Best Practice: Use pipe.pkl (the pipeline) for predictions; it automatically handles preprocessing + modeling in one step!
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is based on the work by @brendanartley. The images are kept in their original size, and no preprocessing has been done to maintain flexibility. This makes the dataset larger than the allowed size on Kaggle, so it has been split into two parts. This is the first part; the second part can be found here.
Here is the full script used to collect the data:
from cryoet_data_portal import Client, Dataset
import pandas as pd
import numpy as np
import shutil
import zarr
import cv2
import os
import gc
datasets = Dataset.find(Client(), [Dataset.authors.name == "Morgan Beeby"])
datasets.extend(Dataset.find(Client(), [Dataset.authors.name == "Yi-Wei Chang"]))
datasets.extend(Dataset.find(Client(), [Dataset.authors.name == "Ariane Briegel"]))
new_labels = pd.read_csv("/kaggle/input/byu-locating-bacterial-flagellar-motors-2025/train_labels.csv")[0:0]
annotations = pd.read_csv("/kaggle/input/cryoet-flagellar-motors-dataset/labels.csv")
row_id = len(new_labels)
D, H, W = 128, 512, 512
tmp_dir = "/temp"
for dataset_idx, dataset in enumerate(datasets):
    print(f"Processing {dataset_idx+1}/{len(datasets)}: {dataset.title} ({len(dataset.runs)})")
    for run in dataset.runs:
        if run.name not in annotations.tomo_id.values:
            continue
        os.makedirs(tmp_dir, exist_ok=True)
        try:
            out_dir = f"dataset/{run.name}"
            if not os.path.exists(out_dir):
                os.makedirs(out_dir)
            tomo = run.tomograms[0]
            zarr_path = f"{tmp_dir}/{run.name}.zarr"
            tomo.download_omezarr(dest_path=tmp_dir)
            arr = zarr.open(zarr_path, mode='r')[0]
            batch_size = 32
            # Write slices to JPEG in batches to limit memory usage
            for i in range(0, arr.shape[0], batch_size):
                end_idx = min(i + batch_size, arr.shape[0])
                batch = arr[i:end_idx]
                for j, img in enumerate(batch):
                    slice_idx = i + j
                    cv2.imwrite(f"{out_dir}/slice_{str(slice_idx).zfill(4)}.jpg", (img*255).astype(np.uint8))
                del batch
                gc.collect()
            # Rescale annotation coordinates from the (D, H, W) grid to the full array shape
            shape = arr.shape
            annotation = annotations[annotations.tomo_id == run.name]
            for i, row in annotation.iterrows():
                new_labels.loc[len(new_labels)] = {
                    "row_id": row_id,
                    "tomo_id": run.name,
                    "Motor axis 0": row.z * (shape[0]/D),
                    "Motor axis 1": row.y * (shape[1]/H),
                    "Motor axis 2": row.x * (shape[2]/W),
                    "Array shape (axis 0)": shape[0],
                    "Array shape (axis 1)": shape[1],
                    "Array shape (axis 2)": shape[2],
                    "Voxel spacing": tomo.voxel_spacing,
                    "Number of motors": len(annotation)
                }
                row_id += 1
        except Exception as e:
            print(e)
        shutil.rmtree(tmp_dir)
new_labels.to_csv("labels.csv", index=False)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
In this notebook, I have processed the images with Roboflow, because the COCO-formatted dataset had images of different dimensions and was not split into the required format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so, I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except:
    wandb.login(anonymous='must')
    print('To use your W&B account,\n'
          'go to Add-ons -> Secrets and provide your W&B access token. Use the label name WANDB.\n'
          'Get your W&B access token from here: https://wandb.ai/authorize')
wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
(Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png)
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
(Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG)
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine
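The argument list above is truncated; for reference, a typical training invocation for the cloned YOLOv7 repo looks roughly like the sketch below (the values are placeholders, and flag names should be confirmed against train.py in the repository):
!python train.py --data {dataset.location}/data.yaml --weights yolov7.pt --epochs 55 --batch-size 16 --img-size 640 640 --device 0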
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the mobilebert hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
Acknowledgements: All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.