Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. The methods are extracted from a total of 13,587 different repositories among the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. For ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures the code's semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
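For intuition, here is a minimal set-based sketch of per-method sub-token F1 (an illustrative reimplementation, not the official OGB evaluator, which may differ in details such as duplicate handling):

```python
# Illustrative sub-token F1; pred and target are lists of sub-tokens.
def subtoken_f1(pred, target):
    tp = len(set(pred) & set(target))  # sub-tokens present on both sides
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(target)
    return 2 * precision * recall / (precision + recall)

print(subtoken_f1(['get', 'user', 'name'], ['get', 'name']))  # 0.8
```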
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the training set are obtained from GitHub projects that do not appear in the validation and test sets. This split reflects the practical scenario of training a model on a large collection of source code (obtained, for instance, from popular GitHub projects) and then using it to predict method names on a separate code base. The project split stress-tests the model's ability to capture code semantics, and prevents a model from achieving a high test score by trivially memorizing the idiosyncrasies of the training projects (such as the naming conventions and coding style of a specific developer).
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
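As an example, with the ogb version pinned by this card, evaluation follows the same pattern across datasets; below is a minimal sketch for ogbg-code (reading seq_ref/seq_pred as lists of sub-token lists is my interpretation of the evaluator's documented input format):

```python
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name='ogbg-code')
print(evaluator.expected_input_format)   # the evaluator documents its own input dict
print(evaluator.expected_output_format)  # ... and its output metric

result = evaluator.eval({
    'seq_ref':  [['get', 'name'], ['set', 'value']],          # ground-truth sub-tokens
    'seq_pred': [['get', 'user', 'name'], ['set', 'value']],  # predicted sub-tokens
})
print(result)  # e.g. {'F1': 0.9}
```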
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for its content. For any questions, problems, or issues, please contact the original authors via their website or GitHub repo.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset comprises 3,000 (1.5k perfect and 1.5k imperfect) randomly generated mazes, each represented in a structured format suitable for algorithmic pathfinding, reinforcement learning, or maze-solving research. The mazes vary in dimensions from 10×10 to 150×150, offering a diverse range of complexity and size to support various levels of algorithmic challenge and scalability testing.
import numpy as np
import random

def create_maze(dim, i):
# Create a grid filled with walls
maze = np.ones((dim * 2 + 1, dim * 2 + 1))
# Define the starting point
x, y = (0, 0)
maze[2 * x + 1, 2 * y + 1] = 0
# Initialize the stack with the starting point
stack = [(x, y)]
while len(stack) > 0:
x, y = stack[-1]
# Define possible directions
directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
random.shuffle(directions)
for dx, dy in directions:
nx, ny = x + dx, y + dy
if nx >= 0 and ny >= 0 and nx < dim and ny < dim and maze[2 * nx + 1, 2 * ny + 1] == 1:
maze[2 * nx + 1, 2 * ny + 1] = 0
maze[2 * x + 1 + dx, 2 * y + 1 + dy] = 0
stack.append((nx, ny))
break
else:
stack.pop()
# Create an entrance and an exit
maze[1, 0] = 0
    maze[-2, -1] = 0
    return maze
The function above was originally written by Michael Gold; see his excellent article "Python's Path Through Mazes: A Journey of Creation and Solution" on Medium.
def create_imperfect_maze(dim, extra_wall_removals=0.05):
def create_perfect_maze(dim):
maze = np.ones((dim * 2 + 1, dim * 2 + 1), dtype=int)
x, y = (0, 0)
maze[2 * x + 1, 2 * y + 1] = 0
stack = [(x, y)]
while stack:
x, y = stack[-1]
directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
random.shuffle(directions)
for dx, dy in directions:
nx, ny = x + dx, y + dy
if 0 <= nx < dim and 0 <= ny < dim and maze[2 * nx + 1, 2 * ny + 1] == 1:
maze[2 * nx + 1, 2 * ny + 1] = 0
maze[2 * x + 1 + dx, 2 * y + 1 + dy] = 0
stack.append((nx, ny))
break
else:
stack.pop()
maze[1, 0] = 0 # entrance
maze[-2, -1] = 0 # exit
return maze
maze = create_perfect_maze(dim)
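    # Collect interior walls that separate two open cells; removing one creates a loop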
wall_candidates = []
for i in range(1, maze.shape[0] - 1):
for j in range(1, maze.shape[1] - 1):
if maze[i, j] == 1:
if maze[i - 1, j] == 0 and maze[i + 1, j] == 0:
wall_candidates.append((i, j))
elif maze[i, j - 1] == 0 and maze[i, j + 1] == 0:
wall_candidates.append((i, j))
num_to_remove = int(len(wall_candidates) * extra_wall_removals)
walls_to_remove = random.sample(wall_candidates, num_to_remove)
for i, j in walls_to_remove:
maze[i, j] = 0
return maze
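A minimal usage sketch for both generators, assuming numpy, random, and matplotlib are available:

```python
import matplotlib.pyplot as plt

perfect = create_maze(20, 0)           # 20 cells per side -> a 41x41 grid
imperfect = create_imperfect_maze(20)  # same size, with ~5% of removable walls opened

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, maze, title in zip(axes, [perfect, imperfect], ["perfect", "imperfect"]):
    ax.imshow(maze, cmap="binary")  # walls (1) in black, passages (0) in white
    ax.set_title(title)
    ax.axis("off")
plt.show()
```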
Created by myself.
Photo by Mitchell Luo on Unsplash
YALM Instruct Data - 1
YALM Instruct Data - 1 is a mix of instruction-tuning data in English, Hindi, Math, and Python code, taken from various sources for the supervised fine-tuning task of YALM (Yet Another Language Model).

Total Samples: 1.31M
Shuffle Seed: 101
Datasets:
HuggingFaceTB/smoltalk
Language: English, Math, Python
damerajee/Instruct-hindi
Language: Hindi, Hinglish
smangrul/hindi_instruct_v1
Language: Hindi, Hinglish
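A hedged sketch of how such a mix could be reproduced with the Hugging Face datasets library (config and split names are assumptions; only the sources, total size, and shuffle seed come from this card):

```python
from datasets import load_dataset, concatenate_datasets

# Source datasets listed above (the "all" config and "train" split are assumptions).
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
hindi_a = load_dataset("damerajee/Instruct-hindi", split="train")
hindi_b = load_dataset("smangrul/hindi_instruct_v1", split="train")

# concatenate_datasets requires identical column schemas, so in practice each
# source would first be mapped to a shared layout (not documented on this card).
mixed = concatenate_datasets([smoltalk, hindi_a, hindi_b]).shuffle(seed=101)
print(len(mixed))  # the card reports 1.31M samples in total
```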
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This item provides a self-contained, runnable package that reproduces the statistical validation of rung alignment used in the MEG harmonic ladder manuscript. It includes the pre-specified 12-feature dataset (in rung units) and Python code to run three null models, compute effect-size metrics, and test robustness, so referees (and readers) can verify all reported results with a single command.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.
We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1,000 subfolders corresponding to the ImageNet output classes.
An example showing how to use the dataset is shown below.
# code for testing robustness of a model
import os.path
from torchvision import datasets, transforms, models
import torch.utils.data
class ImageFolderWithEmptyDirs(datasets.ImageFolder):
"""
This is required for handling empty folders from the ImageFolder Class.
"""
def find_classes(self, directory):
classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
if not classes:
raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
len(os.listdir(os.path.join(directory, cls_name))) > 0}
return classes, class_to_idx
# extract and unzip the dataset, then write top folder here
dataset_folder = 'data/ImageNet-Patch'
available_labels = {
487: 'cellular telephone',
513: 'cornet',
546: 'electric guitar',
585: 'hair spray',
804: 'soap dispenser',
806: 'sock',
878: 'typewriter keyboard',
923: 'plate',
954: 'banana',
968: 'cup'
}
# select folder with specific target
target_label = 954
dataset_folder = os.path.join(dataset_folder, str(target_label))
normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
transform = transforms.Compose([
    transforms.ToTensor(),
    normalizer
])
dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transform)
model = models.resnet50(pretrained=True)
loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
model.eval()
batches = 10
correct, attack_success, total = 0, 0, 0
for batch_idx, (images, labels) in enumerate(loader):
if batch_idx == batches:
break
pred = model(images).argmax(dim=1)
correct += (pred == labels).sum()
attack_success += sum(pred == target_label)
total += pred.shape[0]
accuracy = correct / total
attack_sr = attack_success / total
print("Robust Accuracy: ", accuracy)
print("Attack Success: ", attack_sr)
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Shuffled Car Dataset

This project contains a shuffled dataset of cars, derived from the original dataset "CARS_1.csv." The data includes information about various car models, their specifications, and market information.

Dataset Overview
The dataset provides detailed attributes for multiple car models, including engine specifications, body type, pricing, fuel type, and user reviews. The rows have been randomly shuffled to ensure data randomness.

Dataset Columns
- car_name: The name of the car model.
- reviews_count: The number of reviews the car has received.
- fuel_type: The type of fuel the car uses (Petrol, Diesel, Electric, etc.).
- engine_displacement: The engine displacement volume in cubic centimeters (cc).
- no_cylinder: The number of cylinders in the engine.
- seating_capacity: The seating capacity of the car (number of passengers).
- transmission_type: The type of transmission (Automatic, Manual, Electric).
- fuel_tank_capacity: The capacity of the fuel tank in liters.
- body_type: The classification of the car based on its shape and design (SUV, Sedan, Hatchback, etc.).
- rating: The user rating of the car, typically out of 5.
- starting_price: The starting price of the car in local currency.
- ending_price: The highest price of the car in local currency.
- max_torque_nm: The maximum torque the engine can produce (in Newton meters).
- max_torque_rpm: The RPM (revolutions per minute) at which the engine delivers its maximum torque.
- max_power_bhp: The maximum power the engine can produce (in brake horsepower).
- max_power_rp: The RPM at which the engine delivers its maximum power.

Usage
This dataset can be used for various data analysis and machine learning tasks, including:
- Predicting car prices based on engine specifications and other attributes.
- Clustering cars by their specifications (e.g., body type, fuel type).
- Analyzing customer preferences based on review counts and ratings.

How to Use
1. Load the dataset into your environment (e.g., Python, R, Excel).
2. Use appropriate data analysis and visualization tools to gain insights.
3. Perform machine learning tasks such as regression or classification using the car specifications.

File Information
- Source File: CARS_1.csv
- Shuffled File: You may shuffle the dataset yourself or access the already shuffled dataset for analysis.
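As a quick-start, here is a hedged sketch of loading the shuffled CSV and fitting a simple price baseline (column names as listed above; numeric parsing is an assumption, since such files often need unit stripping first):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("CARS_1.csv")

# Baseline: predict the starting price from a few engine-spec columns.
features = ["engine_displacement", "no_cylinder", "max_torque_nm", "max_power_bhp"]
data = df[features + ["starting_price"]].apply(pd.to_numeric, errors="coerce").dropna()

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["starting_price"], test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))
```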
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Description
Out of 20,577 human proteins (from the UniProt human proteome), sequences shorter than 20 amino acids or longer than 512 amino acids were removed, resulting in a set of 12,703 proteins. The uShuffle algorithm (Python package) was then used to shuffle these protein sequences while maintaining their doublet distribution. The very few sequences for which uShuffle failed to create a shuffled version were eliminated. Afterwards, the h-CD-HIT algorithm (web server) was used… See the full description on the dataset page: https://huggingface.co/datasets/yarongef/human_proteome_doublets.
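For reference, a k-let-preserving shuffle like the one described can be reproduced with the ushuffle Python package (whether this is the exact binding the dataset authors used is an assumption):

```python
from ushuffle import shuffle  # pip install ushuffle

seq = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
shuffled = shuffle(seq, 2)  # l=2 preserves the doublet (pair) distribution
print(shuffled.decode())
```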
License: GPL-3.0 (https://www.gnu.org/licenses/gpl-3.0.html)
This folder contains the code and data used to compare the performance of two algorithms for generating random hypergraphs with prescribed degree sequences. The comparison is conducted by simulating and analyzing the mixing time of each algorithm.
Specifically, the folder includes:
The folder accompanies these papers:
- Yanna J. Kraakman and Clara Stegehuis (2024). Configuration models for random directed hypergraphs. arXiv:2402.06466.
- Yanna J. Kraakman and Clara Stegehuis (2024). Hypercurveball algorithm for sampling hypergraphs with fixed degrees. arXiv:2412.05100.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol
import os
import os.path as osp
import pandas as pd
import torch
from ogb.graphproppred import PygGraphPropPredDataset
class PygOgbgMol(PygGraphPropPredDataset):
    def __init__(self, name, transform = None, pre_transform = None, meta_csv = None):
root = '../input'
if meta_csv is None:
meta_csv = osp.join(root, name, 'ogbg-master.csv')
master = pd.read_csv(meta_csv, index_col = 0)
meta_dict = master[name]
meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, pre_transform = pre_transform, meta_dict = meta_dict)
def get_idx_split(self, split_type = None):
if split_type is None:
split_type = self.meta_info['split']
path = osp.join(self.root, 'split', split_type)
# short-cut if split_dict.pt exists
if os.path.isfile(os.path.join(path, 'split_dict.pt')):
return torch.load(os.path.join(path, 'split_dict.pt'))
train_idx = pd.read_csv(osp.join(path, 'train.csv'), header = None).values.T[0]
valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header = None).values.T[0]
test_idx = pd.read_csv(osp.join(path, 'test.csv'), header = None).values.T[0]
return {'train': torch.tensor(train_idx, dtype = torch.long), 'valid': torch.tensor(valid_idx, dtype = torch.long), 'test': torch.tensor(test_idx, dtype = torch.long)}
dataset = PygOgbgMol('ogbg-molclintox')
from torch_geometric.data import DataLoader
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from MoleculeNet [1] and are among the largest of the MoleculeNet datasets. All molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The full description of the features is provided in code. The script to convert a SMILES string [3] into the above graph object can be found here. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that they share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].
Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet: ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].
For encoding these raw input features, the dataset authors provide simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features and obtain atom_emb and bond_emb.
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)
atom_emb = atom_encoder(x) # x is the input atom feature
bond_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks and can contain nan values, indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC...
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Description
Out of 20,577 human proteins (from the UniProt human proteome), sequences shorter than 20 amino acids or longer than 512 amino acids were removed, resulting in a set of 12,703 proteins. The uShuffle algorithm (Python package) was then used to shuffle these protein sequences while maintaining their triplet distribution. The sequences for which uShuffle failed to create a shuffled version were eliminated. Afterwards, the h-CD-HIT algorithm (web server) was used with three… See the full description on the dataset page: https://huggingface.co/datasets/yarongef/human_proteome_triplets.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol
import os
import os.path as osp
import pandas as pd
import torch
from ogb.graphproppred import PygGraphPropPredDataset
class PygOgbgMol(PygGraphPropPredDataset):
    def __init__(self, name, transform = None, pre_transform = None, meta_csv = None):
root = '../input'
if meta_csv is None:
meta_csv = osp.join(root, name, 'ogbg-master.csv')
master = pd.read_csv(meta_csv, index_col = 0)
meta_dict = master[name]
meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, pre_transform = pre_transform, meta_dict = meta_dict)
def get_idx_split(self, split_type = None):
if split_type is None:
split_type = self.meta_info['split']
path = osp.join(self.root, 'split', split_type)
# short-cut if split_dict.pt exists
if os.path.isfile(os.path.join(path, 'split_dict.pt')):
return torch.load(os.path.join(path, 'split_dict.pt'))
train_idx = pd.read_csv(osp.join(path, 'train.csv'), header = None).values.T[0]
valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header = None).values.T[0]
test_idx = pd.read_csv(osp.join(path, 'test.csv'), header = None).values.T[0]
return {'train': torch.tensor(train_idx, dtype = torch.long), 'valid': torch.tensor(valid_idx, dtype = torch.long), 'test': torch.tensor(test_idx, dtype = torch.long)}
dataset = PygOgbgMol('ogbg-molbbbp')
from torch_geometric.data import DataLoader
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from MoleculeNet [1] and are among the largest of the MoleculeNet datasets. All molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The full description of the features is provided in code. The script to convert a SMILES string [3] into the above graph object can be found here. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that they share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].
Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet: ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].
For encoding these raw input features, the dataset authors provide simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features and obtain atom_emb and bond_emb.
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)
atom_emb = atom_encoder(x) # x is the input atom feature
bond_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks and can contain nan values, indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC-AUC f...
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.
Perfect for: - Object detection model training - Computer vision research - Retro gaming AI projects - YOLO algorithm benchmarking - Educational purposes
| Metric | Value |
|---|---|
| Total Images | 1,004 |
| Dataset Size | 12 MB |
| Image Format | PNG |
| Annotation Format | YOLO (.txt) |
| Classes | 4 |
| Train/Val Split | 711/260 (73%/27%) |
| Class ID | Class Name | Count | Description |
|---|---|---|---|
| 0 | dog | 252 | The hunting dog in various poses (jumping, laughing, sniffing, etc.) |
| 1 | duck_dead | 256 | Dead ducks (both black and red variants) |
| 2 | duck_shot | 248 | Ducks in the moment of being shot |
| 3 | duck_flying | 248 | Flying ducks in all directions (left, right, diagonal) |
yolo_dataset_augmented/
├── images/
│ ├── train/ # 711 training images
│ └── val/ # 260 validation images
├── labels/
│ ├── train/ # 711 YOLO annotation files
│ └── val/ # 260 YOLO annotation files
├── classes.txt # Class names mapping
├── dataset.yaml # YOLO configuration file
└── augmented_dataset_stats.json # Detailed statistics
The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:
{
'rotation_range': (-15, 15), # Small rotations for game sprites
'brightness_range': (0.7, 1.3), # Brightness variations
'contrast_range': (0.8, 1.2), # Contrast adjustments
'saturation_range': (0.8, 1.2), # Color saturation
'noise_intensity': 0.02, # Gaussian noise
'horizontal_flip_prob': 0.5, # 50% chance horizontal flip
'scaling_range': (0.8, 1.2), # Scale variations
}
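The card does not say which library produced these augmentations; below is a rough torchvision equivalent for the photometric parts only (rotation, flipping, and scaling would also require updating the YOLO boxes, which is omitted here):

```python
import torch
from torchvision import transforms

# Approximate photometric augmentations from the config above (mapping the
# listed ranges onto torchvision's ColorJitter semantics is an assumption).
augment = transforms.Compose([
    transforms.ColorJitter(brightness=(0.7, 1.3),
                           contrast=(0.8, 1.2),
                           saturation=(0.8, 1.2)),
    transforms.ToTensor(),
    # Additive Gaussian noise with the listed intensity.
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0, 1)),
])
```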
from ultralytics import YOLO
# Load and train
model = YOLO('yolov8n.pt') # Load pretrained model
results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
# Validate
metrics = model.val()
# Predict
results = model('path/to/test/image.png')
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
class DuckHuntDataset(Dataset):
    def __init__(self, images_dir, labels_dir, transform=None):
self.images_dir = images_dir
self.labels_dir = labels_dir
self.transform = transform
self.images = os.listdir(images_dir)
    def __len__(self):
return len(self.images)
    def __getitem__(self, idx):
img_path = os.path.join(self.images_dir, self.images[idx])
label_path = os.path.join(self.labels_dir,
self.images[idx].replace('.png', '.txt'))
image = Image.open(img_path)
# Load YOLO annotations
with open(label_path, 'r') as f:
labels = f.readlines()
if self.transform:
image = self.transform(image)
return image, labels
# Usage
dataset = DuckHuntDataset('images/train', 'labels/train')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
Each .txt file contains one line per object:
class_id center_x center_y width height
Example annotation:
0 0.492 0.403 0.212 0.315
Where values are normalized (0-1) relative to image dimensions.
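A small helper (not part of the dataset) for converting one annotation line back to pixel coordinates:

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert 'class_id cx cy w h' (normalized) to (class_id, x1, y1, x2, y2) in pixels."""
    cls, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return int(cls), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

# The example annotation above, on a hypothetical 640x480 image:
print(yolo_to_pixel_box("0 0.492 0.403 0.212 0.315", 640, 480))
```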
This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:
License: GPL-3.0 (https://www.gnu.org/licenses/gpl-3.0.html)
This dataset was used to investigate the influence of the number of unique 3D models (shapes) and materials (textures) on the shape-texture bias, performance, and generalization of deep neural network instance segmentation in my bachelor's thesis.
You can load the images like this:
import cv2
image = cv2.imread(img_path)
if image is None:
raise FileNotFoundError(f"Error during data loading: there is no '{img_path}'")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)
if len(depth.shape) > 2:
_, depth, _, _ = cv2.split(depth)
mask = cv2.imread(mask_path, cv2.IMREAD_UNCHANGED)  # alternatively cv2.IMREAD_GRAYSCALE
For easy use, I recommend using my own code. You can use it directly to train Mask R-CNN or just use the dataloader; both are shown below.
First: clone my torch GitHub project into your project:
```terminal
cd ./path/to/your/project
git clone https://github.com/xXAI-botXx/torch-mask-rcnn-instance-segmentation.git
```
Second: install the Anaconda env (optional):
```terminal
cd ./path/to/your/project
cd ./torch-mask-rcnn-instance-segmentation
conda env create -f conda_env.yml
```
Third: you are ready to use it.
Using only the dataloader for your custom project:
```python
import os
import numpy as np
import matplotlib.pyplot as plt
import cv2
from torch.utils.data import DataLoader

import sys
sys.path.append("./torch-mask-rcnn-instance-segmentation")

from maskrcnn_toolkit import DATA_LOADING_MODE, Dual_Dir_Dataset, collate_fn, extract_and_visualize_mask

data_mode = DATA_LOADING_MODE.ALL

dataset = Dual_Dir_Dataset(img_dir="/path/to/rgb-folder", depth_dir="/path/to/depth-folder",
                           mask_dir="/path/to/mask-folder", transform=None, amount=1,
                           start_idx=0, end_idx=0, image_name="...", data_mode=data_mode,
                           use_mask=True, use_depth=False, log_path="./logs",
                           width=1920, height=1080, should_log=True, should_print=True,
                           should_verify=False)
data_loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=4, collate_fn=collate_fn)

for data in data_loader:
    for batch_idx in range(len(data[0])):
        if len(data) == 3:
            image = data[0][batch_idx].cpu().unsqueeze(0)
            masks = data[1][batch_idx]["masks"]
            masks = masks.cpu()
            name = data[2][batch_idx]
        else:
            image = data[0][batch_idx].cpu().unsqueeze(0)
            name = data[1][batch_idx]

        image = image.cpu().numpy().squeeze(0)
        image = np.transpose(image, (1, 2, 0))  # Convert to HWC

        # Remove 4th channel if existing
        if image.shape[2] == 4:
            depth = image[:, :, 3]
            image = image[:, :, :3]
        else:
            depth = None

        masks_gt = masks.cpu().numpy()
        masks_gt = np.transpose(masks_gt, (1, 2, 0))
        mask = extract_and_visualize_mask(masks_gt, image=None, ax=None, visualize=False,
                                          color_map=None, soft_join=False)

        # plot
        cols = 1
        if depth is not None:
            cols += 1
        if mask is not None:
            cols += 1

        fig, ax = plt.subplots(nrows=1, ncols=cols, figsize=(20, 15 * cols))
        fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.05, hspace=0.05)

        plot_idx = 0
        ax[plot_idx].imshow(image)
        ax[plot_idx].set_title("RGB Input Image")
        ax[plot_idx].axis("off")

        if depth is not None:
            plot_idx += 1
            ax[plot_idx].imshow(depth, cmap="gray")
            ax[plot_idx].set_title("Depth Input Image")
            ax[plot_idx].axis("off")

        if mask is not None:
            plot_idx += 1
            ax[plot_idx].imshow(mask)
            ax[plot_idx].set_title("Mask Ground Truth")
            ax[plot_idx].axis("off")

        plt.show()
```
**Using the whole Mask R-CNN training pipeline:**
```python
import os
import sys
sys.path.append("./torch-mask-rcnn-instance-segmentation")
from maskrcnn_toolkit import DATA_LOADING_MODE, train
# set the vars as you need
WEIGHTS_PATH = None # Path to the model weights file
USE_DEPTH = False # Whether to include depth information -> as rgb and depth on green channel
VERIFY_DATA = False # True is recommended
GROUND_PATH = "D:/3xM"
DATASET_NAME = "3xM_Dataset_10_10"
IMG_DIR = os.path.join(GR...
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.
The dataset is organized into three standard splits:
- Train set
- Validation set
- Test set

Each split contains data in multiple formats:
1. Original JPG images
2. Segmentation mask JPG images
3. Parquet files containing flattened image and mask data
4. Pickle files containing serialized image and mask data
Directory and file layout:
- train/: Original training images
- valid/: Original validation images
- test/: Original test images
- train_mask/: Corresponding segmentation masks for training
- valid_mask/: Corresponding segmentation masks for validation
- test_mask/: Corresponding segmentation masks for testing
- train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet: flattened image and mask data; each row splits at split_at = image_size[0] * image_size[1] * image_channels, with images reshaped to [-1, 224, 224, 3] and masks to [-1, 224, 224, 1]
- train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl: serialized image and mask data, using the same split_at convention
- train_dataset.csv, valid_dataset.csv, test_dataset.csv

All images were preprocessed with the following operations:
- Resized to 224×224 pixels using bilinear interpolation
- Segmentation masks were also resized to match the images, using nearest-neighbor interpolation
- Original RLE (Run-Length Encoding) segmentation data converted to binary masks
When used with the provided PyTorch dataset class, images are normalized with: - Mean: [0.48235, 0.45882, 0.40784] - Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098]
A custom CatDataset class is included for easy integration with PyTorch:
from cat_dataset import CatDataset
# Load from parquet format
dataset = CatDataset(
root="path/to/dataset",
split="train", # Options: "train", "valid", "test"
format="parquet", # Options: "parquet", "pkl"
image_size=[224, 224],
image_channels=3,
mask_channels=1
)
# Use with PyTorch DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
Loading time benchmarks from the original implementation: - Parquet format: ~1.29 seconds per iteration - Pickle format: ~0.71 seconds per iteration
The pickle format provides the fastest loading times and is recommended for most use cases.
If you use this dataset in your research or projects, please cite:
@misc{feral-cat-segmentation_dataset,
title = {feral-cat-segmentation Dataset},
type = {Open Source Dataset},
author = {Paul Cashman},
howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
journal = {Roboflow Universe},
publisher = {Roboflow},
year = {2025},
month = {mar},
note = {visited on 2025-03-19},
}
from ca...
The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.
| Class | Number of Images | Description |
|---|---|---|
| Cats | 1,000 | Includes multiple breeds and poses |
| Dogs | 1,000 | Covers various breeds and backgrounds |
| Snakes | 1,000 | Includes multiple species and natural settings |
Total Images: 3,000
Image Properties:
| Set | Percentage | Number of Images |
|---|---|---|
| Training | 70% | 2,100 |
| Validation | 15% | 450 |
| Test | 15% | 450 |
Images in the dataset have been standardized to support machine learning pipelines:
import os
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Path to dataset
dataset_path = "path/to/dataset"
# ImageDataGenerator for preprocessing
datagen = ImageDataGenerator(
rescale=1./255,
validation_split=0.15 # 15% for validation
)
# Load training data
train_generator = datagen.flow_from_directory(
dataset_path,
target_size=(224, 224),
batch_size=32,
class_mode='categorical',
subset='training',
shuffle=True
)
# Load validation data
validation_generator = datagen.flow_from_directory(
dataset_path,
target_size=(224, 224),
batch_size=32,
class_mode='categorical',
subset='validation',
shuffle=False
)
# Example: Iterate over one batch
images, labels = next(train_generator)
print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
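From here, a minimal transfer-learning baseline can be built directly on the generators (a sketch, not a tuned model; the backbone choice and epoch count are arbitrary, and MobileNetV2's canonical preprocess_input scaling is skipped in favor of the rescale=1./255 above):

```python
# Minimal transfer-learning sketch on top of the generators above.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # Cats / Dogs / Snakes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_generator, validation_data=validation_generator, epochs=5)
```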
License: LGPL-3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)
This dataset was created for the Computer Vision & Deep Learning Applications course of OpenCV University.
It was created for Project 4, a YOLO face mask detector, and can be used for training your own YOLO model.
The dataset contains 2,712 image and caption files, i.e., 1,356 samples (*.jpg, *.txt pairs).
It works for YOLOv5 and YOLOv8 through YOLOv10 (PyTorch versions). For Colab, use this prebuilt code:
import os
import zipfile
import random
import shutil

if not os.path.exists('face_mask_dataset.zip'):
!curl -L "https://storage.googleapis.com/kaggle-data-sets/5418712/8996055/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240721%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240721T204520Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=65e5e938f51c2f80f23fbed8b4d5460669729108f266f1977a3d3af260eeea4b7a413c4bd8050e5bba034dde8a7f7fc2fc06a43e48b3e4a43c9dd6a6f0747739e338b3ca89db762dd1797afa4cccc78bead9d39bb85bd86720cbb8d33628b37aeadda551e1394b45faaa93288d385bfbbc9b0b57ac793ed5a53917c1ba1303238a40b599abb9f3063d3a3d34bd289992d58cbf10ecf836242767ec139d24a1e78b9f11d6e897d245163fa1d5d555bffbc06eb60411dcdd28594dd0582bbe09add0fb269565a2f4a714f285ec018c463e01179794185cf5010cba2974fa3cf58ccaa1513c619b0a434707c9b22c958e61b71633540935ee6c1b804d5831002a9a" > face_mask_dataset.zip;
# Function to unzip a ZIP file
def unzip_file(zip_path, extract_to):
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
zip_ref.extractall(extract_to)
zip_path = './face_mask_dataset.zip'
os.makedirs('face_mask_dataset', exist_ok=True)
extract_to = './face_mask_dataset/'
unzip_file(zip_path, extract_to)
!rm face_mask_dataset.zip
!mkdir train
!mkdir valid
!mkdir test
dataset_dir = './face_mask_dataset/'
train_dir = './train/'
test_dir = './test/'
valid_dir = './valid/'
# create subfolder
for folder in [train_dir, test_dir, valid_dir]:
os.makedirs(os.path.join(folder, 'images'), exist_ok=True)
os.makedirs(os.path.join(folder, 'labels'), exist_ok=True)
# Get all image and corresponding description files
image_files = [f for f in os.listdir(dataset_dir) if f.endswith('.jpg')]
label_files = [f.replace('.jpg', '.txt') for f in image_files]
# Checking that a corresponding description file exists for each image
assert all(os.path.isfile(os.path.join(dataset_dir, lbl)) for lbl in label_files), "Some matching descriptor files are missing!"
# Shuffle list of files
combined = list(zip(image_files, label_files))
random.shuffle(combined)
image_files[:], label_files[:] = zip(*combined)
# Split files according to the specified ratio
total_files = len(image_files)
train_split = int(0.6 * total_files)
test_split = int(0.2 * total_files)
train_files = image_files[:train_split]
train_labels = label_files[:train_split]
test_files = image_files[train_split:train_split + test_split]
test_labels = label_files[train_split:train_split + test_split]
valid_files = image_files[train_split + test_split:]
valid_labels = label_files[train_split + test_split:]
# Function to move files to their respective folders
def move_files(files, labels, dest_dir):
for img_file, lbl_file in zip(files, labels):
shutil.move(os.path.join(dataset_dir, img_file), os.path.join(dest_dir, 'images', img_file))
shutil.move(os.path.join(dataset_dir, lbl_file), os.path.join(dest_dir, 'labels', lbl_file))
# Move files to their destination folders
move_files(train_files, train_labels, train_dir)
move_files(test_files, test_labels, test_dir)
move_files(valid_files, valid_labels, valid_dir)
print(f"Split complete: {len(train_files)} train, {len(test_files)} test, {len(valid_files)} valid.")
!rm -r face_mask_dataset
On Kaggle, you can use the prebuilt Face Mask Detection notebook from OpenCV University.
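After the split, training can proceed with the usual Ultralytics workflow; the data config below is hypothetical (this card does not list the class names), so adjust names to the real label set:

```python
from ultralytics import YOLO

# Hypothetical data config; replace the class names with the dataset's real ones.
config = """
train: ./train/images
val: ./valid/images
test: ./test/images
names:
  0: mask
  1: no_mask
"""
with open("face_mask.yaml", "w") as f:
    f.write(config)

model = YOLO("yolov8n.pt")
model.train(data="face_mask.yaml", epochs=50, imgsz=640)
```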