16 datasets found
  1. OGBG-Code (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-Code (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbg-code/code
    Explore at:
zip (1314604183 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

OGBG-Code

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code

    Usage in Python

from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset

# Load the pre-processed dataset from the Kaggle input directory
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')

# Build loaders from the official project split
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes: AST edges, AST nodes, and the tokenized method name. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.

Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures the code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
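To make the metric concrete, here is a minimal sketch of per-sample sub-token F1 (my own illustration of the standard precision/recall definition; for reported numbers, use the official OGB Evaluator):

from collections import Counter

def subtoken_f1(pred_tokens, true_tokens):
  # Overlap is counted with multiplicity between predicted and ground-truth sub-tokens
  overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
  if overlap == 0:
    return 0.0
  precision = overlap / len(pred_tokens)
  recall = overlap / len(true_tokens)
  return 2 * precision * recall / (precision + recall)

print(subtoken_f1(['get', 'item'], ['get', 'item', 'by', 'id'])) # 0.666...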

    Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.

    Summary

Package | #Graphs | Avg. #Nodes per Graph | Avg. #Edges per Graph | Split Type | Task Type | Metric
ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score

    License: MIT License

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
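As a sketch of that unified flow for ogbg-code (the input format follows the OGB documentation; the token lists are toy values):

from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-code')
print(evaluator.expected_input_format) # describes the required input dict

# one list of sub-tokens per graph (toy values)
result = evaluator.eval({
  'seq_ref': [['get', 'item'], ['parse', 'args']],
  'seq_pred': [['get', 'item', 'by', 'id'], ['parse', 'args']],
})
print(result) # e.g. {'precision': ..., 'recall': ..., 'F1': ...}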

    References

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors at their website or their GitHub repo.

  2. 🧩 Maze Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2025
    Cite
    mexwell (2025). 🧩 Maze Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/maze-dataset
    Explore at:
zip (16391387 bytes)
    Dataset updated
    Apr 18, 2025
    Authors
    mexwell
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About

This dataset comprises 3,000 randomly generated mazes (1.5k perfect and 1.5k imperfect), each represented in a structured format suitable for algorithmic pathfinding, reinforcement learning, or maze-solving research. The mazes vary in dimensions from 10×10 to 150×150, offering a diverse range of complexity and size to support various levels of algorithmic challenge and scalability testing.

    Python Code (Perfect mazes)

import random
import numpy as np

def create_maze(dim, i): # `i` is unused here (kept from the original signature)
  # Create a grid filled with walls
  maze = np.ones((dim * 2 + 1, dim * 2 + 1))

  # Define the starting point
  x, y = (0, 0)
  maze[2 * x + 1, 2 * y + 1] = 0

  # Initialize the stack with the starting point
  stack = [(x, y)]
  while len(stack) > 0:
    x, y = stack[-1]

    # Define possible directions
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    random.shuffle(directions)

    for dx, dy in directions:
      nx, ny = x + dx, y + dy
      if nx >= 0 and ny >= 0 and nx < dim and ny < dim and maze[2 * nx + 1, 2 * ny + 1] == 1:
        maze[2 * nx + 1, 2 * ny + 1] = 0
        maze[2 * x + 1 + dx, 2 * y + 1 + dy] = 0
        stack.append((nx, ny))
        break
    else:
      stack.pop()

  # Create an entrance and an exit
  maze[1, 0] = 0
  maze[-2, -1] = 0

  return maze
    

The function above was originally written by Michael Gold. You can read his article Python's Path Through Mazes: A Journey of Creation and Solution on Medium.

    Python Code (Imperfect mazes)

&#35; assumes `import numpy as np` and `import random`, as in the snippet above
def create_imperfect_maze(dim, extra_wall_removals=0.05):
      def create_perfect_maze(dim):
        maze = np.ones((dim * 2 + 1, dim * 2 + 1), dtype=int)
        x, y = (0, 0)
        maze[2 * x + 1, 2 * y + 1] = 0
        stack = [(x, y)]
        while stack:
          x, y = stack[-1]
          directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
          random.shuffle(directions)
          for dx, dy in directions:
            nx, ny = x + dx, y + dy
            if 0 <= nx < dim and 0 <= ny < dim and maze[2 * nx + 1, 2 * ny + 1] == 1:
              maze[2 * nx + 1, 2 * ny + 1] = 0
              maze[2 * x + 1 + dx, 2 * y + 1 + dy] = 0
              stack.append((nx, ny))
              break
          else:
            stack.pop()
        maze[1, 0] = 0 # entrance
        maze[-2, -1] = 0 # exit
        return maze
    
      maze = create_perfect_maze(dim)
    
      wall_candidates = []
      for i in range(1, maze.shape[0] - 1):
        for j in range(1, maze.shape[1] - 1):
          if maze[i, j] == 1:
            if maze[i - 1, j] == 0 and maze[i + 1, j] == 0:
              wall_candidates.append((i, j))
            elif maze[i, j - 1] == 0 and maze[i, j + 1] == 0:
              wall_candidates.append((i, j))
    
      num_to_remove = int(len(wall_candidates) * extra_wall_removals)
      walls_to_remove = random.sample(wall_candidates, num_to_remove)
      for i, j in walls_to_remove:
        maze[i, j] = 0
    
      return maze
    

    Created by myself.
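Since the generated mazes are plain 0/1 grids (0 = corridor, 1 = wall), a breadth-first-search solver makes a natural companion to the generators above. A minimal sketch (my own addition; entrance (1, 0) and exit (-2, -1) match the generator code):

from collections import deque

def solve_maze(maze):
  # BFS from entrance to exit over corridor cells (value 0)
  rows, cols = maze.shape
  start, goal = (1, 0), (rows - 2, cols - 1)
  queue = deque([start])
  parent = {start: None}
  while queue:
    cell = queue.popleft()
    if cell == goal:
      path = []
      while cell is not None: # walk the parent chain back to the entrance
        path.append(cell)
        cell = parent[cell]
      return path[::-1]
    x, y = cell
    for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
      if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and maze[nxt] == 0 and nxt not in parent:
        parent[nxt] = cell
        queue.append(nxt)
  return None # no path found

path = solve_maze(create_maze(10, 0))
print(len(path)) # number of cells on the entrance-to-exit path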

Photo by Mitchell Luo on Unsplash

3. YALM-instruct1-1M

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Kuldip Patel (2025). YALM-instruct1-1M [Dataset]. https://huggingface.co/datasets/kp7742/YALM-instruct1-1M
    Explore at:
    Dataset updated
    Jul 8, 2025
    Authors
    Kuldip Patel
    Description

    YALM Instruct Data - 1

The YALM Instruct Data - 1 is a mix of instruction-tuning data in English, Hindi, math, and Python code, taken from various sources for the supervised fine-tuning task of YALM (Yet Another Language Model).

Total Samples: 1.31M
Shuffle Seed: 101
Datasets:

    HuggingFaceTB/smoltalk

    Language: English, Math, Python

    damerajee/Instruct-hindi

    Language: Hindi, Hinglish

    smangrul/hindi_instruct_v1

    Language: Hindi, Hinglish
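A minimal loading sketch (assuming the standard Hugging Face datasets API; the split name and columns are as given on the dataset card):

from datasets import load_dataset

ds = load_dataset("kp7742/YALM-instruct1-1M", split = "train")
print(ds)   # number of rows and column names
print(ds[0]) # first sample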

  4. MEG Ladder Stats: Reproducible Shuffle Tests for Quantized Steepness...

    • figshare.com
    zip
    Updated Oct 4, 2025
    Cite
    Chris Shreenan-Dyck (2025). MEG Ladder Stats: Reproducible Shuffle Tests for Quantized Steepness Alignments (v0.1.0) [Dataset]. http://doi.org/10.6084/m9.figshare.30276487.v1
    Explore at:
zip
    Dataset updated
    Oct 4, 2025
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Chris Shreenan-Dyck
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This item provides a self-contained, runnable package that reproduces the statistical validation of rung alignment used in the MEG harmonic ladder manuscript. It includes the pre-specified 12-feature dataset (in rung units) and Python code to run three null models, compute effect-size metrics, and test robustness, so referees (and readers) can verify all reported results with a single command.
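The packaged code should be consulted for the exact pre-specified tests; purely as an illustration of a shuffle test of this kind, here is a sketch with a made-up alignment statistic and null model (alignment_stat and the uniform null are stand-ins, not the manuscript's):

import numpy as np

rng = np.random.default_rng(42)

def alignment_stat(values):
  # stand-in statistic: mean squared distance to the nearest integer rung
  return np.mean((values - np.round(values)) ** 2)

features = rng.uniform(0, 12, size = 12) # 12 features in rung units (toy data)
observed = alignment_stat(features)

# null model: rung positions drawn uniformly at random
null = np.array([alignment_stat(rng.uniform(0, 12, size = 12)) for _ in range(10_000)])
p_value = (1 + np.sum(null <= observed)) / (1 + len(null)) # smaller stat = tighter alignment
print(f"observed = {observed:.4f}, p = {p_value:.4f}")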

  5. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jun 30, 2022
    Cite
Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. http://doi.org/10.5281/zenodo.6568778
    Explore at:
application/gzip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders as the ImageNet output classes.

    An example showing how to use the dataset is shown below.

    # code for testing robustness of a model
    import os.path
    
    from torchvision import datasets, transforms, models
    import torch.utils.data
    
    
    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      This is required for handling empty folders from the ImageFolder Class.
      """
    
      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx
    
    
    # extract and unzip the dataset, then write top folder here
    dataset_folder = 'data/ImageNet-Patch'
    
    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup'
    }
    
    # select folder with specific target
    target_label = 954
    
    dataset_folder = os.path.join(dataset_folder, str(target_label))
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
transform = transforms.Compose([   # renamed to avoid shadowing the torchvision transforms module
  transforms.ToTensor(),
  normalizer
])

dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transform)
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()
    
    batches = 10
    correct, attack_success, total = 0, 0, 0
    for batch_idx, (images, labels) in enumerate(loader):
      if batch_idx == batches:
        break
      pred = model(images).argmax(dim=1)
      correct += (pred == labels).sum()
      attack_success += sum(pred == target_label)
      total += pred.shape[0]
    
    accuracy = correct / total
    attack_sr = attack_success / total
    
    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)
    

  6. new cars prices

    • kaggle.com
    zip
    Updated Oct 25, 2024
    Cite
    Jeevan Sai123 (2024). new cars prices [Dataset]. https://www.kaggle.com/datasets/jeevansai123/new-cars-prices
    Explore at:
zip (37952 bytes)
    Dataset updated
    Oct 25, 2024
    Authors
    Jeevan Sai123
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

Shuffled Car Dataset

This project contains a shuffled dataset of cars, derived from the original dataset CARS_1.csv. The data includes information about various car models, their specifications, and market information.

Dataset Overview

The dataset provides detailed attributes for multiple car models, including engine specifications, body type, pricing, fuel type, and user reviews. The rows have been randomly shuffled to ensure data randomness.

Dataset Columns

• car_name: The name of the car model.
• reviews_count: The number of reviews the car has received.
• fuel_type: The type of fuel the car uses (Petrol, Diesel, Electric, etc.).
• engine_displacement: The engine displacement volume in cubic centimeters (cc).
• no_cylinder: The number of cylinders in the engine.
• seating_capacity: The seating capacity of the car (number of passengers).
• transmission_type: The type of transmission (Automatic, Manual, Electric).
• fuel_tank_capacity: The capacity of the fuel tank in liters.
• body_type: The classification of the car based on its shape and design (SUV, Sedan, Hatchback, etc.).
• rating: The user rating of the car, typically out of 5.
• starting_price: The starting price of the car in local currency.
• ending_price: The highest price of the car in local currency.
• max_torque_nm: The maximum torque the engine can produce (in Newton meters).
• max_torque_rpm: The RPM (revolutions per minute) at which the car delivers its maximum torque.
• max_power_bhp: The maximum power the engine can produce (in brake horsepower).
• max_power_rp: The RPM at which the car delivers its maximum power.

Usage

This dataset can be used for various data analysis and machine learning tasks, including:

• Predicting car prices based on engine specifications and other attributes.
• Clustering cars by their specifications (e.g., body type, fuel type).
• Analyzing customer preferences based on review counts and ratings.

How to Use

1. Load the dataset into your environment (e.g., Python, R, Excel).
2. Use appropriate data analysis and visualization tools to gain insights.
3. Perform machine learning tasks such as regression or classification using the car specifications; see the sketch below.

File Information

• Source File: CARS_1.csv
• Shuffled File: You may shuffle the dataset yourself or access the already shuffled dataset for analysis.
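As a concrete starting point, a quick price-regression baseline (a sketch; the file name is illustrative, and the columns are those listed above):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("CARS_1_shuffled.csv") # illustrative file name

features = ["engine_displacement", "no_cylinder", "seating_capacity",
      "fuel_tank_capacity", "max_torque_nm", "max_power_bhp"]
df = df.dropna(subset = features + ["starting_price"])

X_train, X_test, y_train, y_test = train_test_split(
  df[features], df["starting_price"], test_size = 0.2, random_state = 0)

model = RandomForestRegressor(n_estimators = 200, random_state = 0)
model.fit(X_train, y_train)
print("R^2 on held-out cars:", model.score(X_test, y_test))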

7. human_proteome_doublets

    • huggingface.co
    Updated Aug 25, 2022
    Cite
    Yaron Geffen (2022). human_proteome_doublets [Dataset]. https://huggingface.co/datasets/yarongef/human_proteome_doublets
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2022
    Authors
    Yaron Geffen
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

Out of 20,577 human proteins (from the UniProt human proteome), sequences shorter than 20 amino acids or longer than 512 amino acids were removed, resulting in a set of 12,703 proteins. The uShuffle algorithm (Python package) was then used to shuffle these protein sequences while maintaining their doublet distribution. The very few sequences for which uShuffle failed to create a shuffled version were eliminated. Afterwards, the h-CD-HIT algorithm (web server) was used… See the full description on the dataset page: https://huggingface.co/datasets/yarongef/human_proteome_doublets.
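A minimal sketch of the k-let shuffling step described above, assuming the ushuffle Python package (the sequence is a toy example; a let size of 2 preserves doublet counts):

from ushuffle import shuffle

seq = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" # toy protein sequence
shuffled = shuffle(seq, 2) # let size 2: the shuffled sequence keeps the same doublet (2-mer) counts
print(shuffled.decode())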

8. Data and code underlying the publications: 'Configuration models for random...

    • data.4tu.nl
    zip
    Updated Apr 17, 2025
    Cite
    ir. Y.J. Kraakman; Clara Stegehuis (2025). Data and code underlying the publications: 'Configuration models for random directed hypergraphs' and 'Hypercurveball algorithm for sampling hypergraphs with fixed degrees' [Dataset]. http://doi.org/10.4121/9beea11f-2e93-473d-9d22-8d8a6bec9d5a.v1
    Explore at:
zip
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    ir. Y.J. Kraakman; Clara Stegehuis
    License

https://www.gnu.org/licenses/gpl-3.0.html

    Dataset funded by
    Nederlandse Organisatie voor Wetenschappelijk Onderzoek
    Description

    This folder contains the code and data used to compare the performance of two algorithms for generating random hypergraphs with prescribed degree sequences. The comparison is conducted by simulating and analyzing the mixing time of each algorithm.

    Specifically, the folder includes:

    1. Python scripts (.py) for generating random directed or undirected hypergraphs using either the Hypercurveball algorithm or the Hyperedge-shuffle algorithm.
    2. A Python script (.py) for analyzing the mixing time of each algorithm. For each algorithm, this script outputs a .csv file that contains the perturbation degree at each step of the simulation.
    3. 25 hypergraph datasets (.csv), containing both undirected and directed hypergraphs.
    4. For each dataset: perturbation degree files (.csv), containing the perturbation degree value at each step of a simulation, for both algorithms. Each algorithm is simulated either 10 or 100 times per dataset.
    5. A Python script (.py) for computing various statistics of a hypergraph.


    The folder accompanies these papers:

    - Yanna J. Kraakman and Clara Stegehuis (2024). Configuration models for random directed hypergraphs. arXiv:2402.06466.

    - Yanna J. Kraakman and Clara Stegehuis (2024). Hypercurveball algorithm for sampling hypergraphs with fixed degrees. arXiv:2412.05100
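For intuition, the degree-preserving move at the heart of a hyperedge-shuffle chain can be sketched as follows (a simplified illustration, not the packaged implementation): pick two hyperedges at random and swap one vertex between them, rejecting swaps that would duplicate a vertex inside a hyperedge.

import random

def hyperedge_shuffle_step(hyperedges):
  # hyperedges: list of vertex sets; vertex degrees and hyperedge sizes are preserved
  e1, e2 = random.sample(range(len(hyperedges)), 2)
  only1 = tuple(hyperedges[e1] - hyperedges[e2])
  only2 = tuple(hyperedges[e2] - hyperedges[e1])
  if not only1 or not only2:
    return # reject: no swap possible without duplicating a vertex
  a, b = random.choice(only1), random.choice(only2)
  hyperedges[e1] = (hyperedges[e1] - {a}) | {b}
  hyperedges[e2] = (hyperedges[e2] - {b}) | {a}

random.seed(0)
H = [{1, 2, 3}, {2, 4}, {3, 4, 5}]
for _ in range(100):
  hyperedge_shuffle_step(H)
print(H) # same vertex degrees and hyperedge sizes as the input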

  9. OGBG-MolClinTox (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    + more versions
    Cite
    Redao da Taupl (2021). OGBG-MolClinTox (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbg-molclintox
    Explore at:
zip (325594 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

OGBG-MolClinTox

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol

    Usage in Python

    import os
    import os.path as osp
    import pandas as pd
    import torch
    from ogb.graphproppred import PygGraphPropPredDataset
    
    class PygOgbgMol(PygGraphPropPredDataset):
def __init__(self, name, transform = None, pre_transform = None, meta_csv = None):
        root = '../input'
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbg-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
super().__init__(name = name, root = root, transform = transform, pre_transform = pre_transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
          
        path = osp.join(self.root, 'split', split_type)
    
        # short-cut if split_dict.pt exists
        if os.path.isfile(os.path.join(path, 'split_dict.pt')):
          return torch.load(os.path.join(path, 'split_dict.pt'))
    
        train_idx = pd.read_csv(osp.join(path, 'train.csv'), header = None).values.T[0]
        valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header = None).values.T[0]
        test_idx = pd.read_csv(osp.join(path, 'test.csv'), header = None).values.T[0]
    
        return {'train': torch.tensor(train_idx, dtype = torch.long), 'valid': torch.tensor(valid_idx, dtype = torch.long), 'test': torch.tensor(test_idx, dtype = torch.long)}
    
    dataset = PygOgbgMol('ogbg-molclintox')
    
    from torch_geometric.data import DataLoader
    
    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from the MoleculeNet [1], and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not. The full description of the features is provided in code. The script to convert the SMILES string [3] to the above graph object can be found here. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that those datasets share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].

Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].

    For encoding these raw input features, the dataset authors prepare simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features to obtain atom_emb and bond_emb.

    from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
    atom_encoder = AtomEncoder(emb_dim = 100)
    bond_encoder = BondEncoder(emb_dim = 100)
    
    atom_emb = atom_encoder(x) # x is the input atom feature
    edge_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
    

Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks, and can contain nan values indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC...
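Putting the snippets above together, a minimal training-loop sketch (an illustrative toy model, not the dataset authors' reference architecture; ogbg-molclintox has two binary tasks, and labels in some OGB mol datasets may contain nan):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool
from ogb.graphproppred.mol_encoder import AtomEncoder

class TinyMolGNN(torch.nn.Module):
  def __init__(self, emb_dim = 100, num_tasks = 2):
    super().__init__()
    self.atom_encoder = AtomEncoder(emb_dim = emb_dim)
    self.conv = GCNConv(emb_dim, emb_dim)
    self.head = torch.nn.Linear(emb_dim, num_tasks)

  def forward(self, batch):
    h = self.atom_encoder(batch.x)       # embed raw atom features
    h = F.relu(self.conv(h, batch.edge_index))
    h = global_mean_pool(h, batch.batch)  # one vector per molecule
    return self.head(h)

model = TinyMolGNN()
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-3)

model.train()
for batch in train_loader: # train_loader as constructed above
  optimizer.zero_grad()
  logits = model(batch)
  labels = batch.y.float()
  mask = ~torch.isnan(labels) # skip unassigned labels
  loss = F.binary_cross_entropy_with_logits(logits[mask], labels[mask])
  loss.backward()
  optimizer.step()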

10. human_proteome_triplets

    • huggingface.co
    Updated Sep 12, 2022
    Cite
    Yaron Geffen (2022). human_proteome_triplets [Dataset]. https://huggingface.co/datasets/yarongef/human_proteome_triplets
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2022
    Authors
    Yaron Geffen
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

Out of 20,577 human proteins (from the UniProt human proteome), sequences shorter than 20 amino acids or longer than 512 amino acids were removed, resulting in a set of 12,703 proteins. The uShuffle algorithm (Python package) was then used to shuffle these protein sequences while maintaining their triplet distribution. The sequences for which uShuffle failed to create a shuffled version were eliminated. Afterwards, the h-CD-HIT algorithm (web server) was used with three… See the full description on the dataset page: https://huggingface.co/datasets/yarongef/human_proteome_triplets.

  11. OGBG-MolBBBP (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-MolBBBP (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbg-molbbbp
    Explore at:
zip (471366 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

OGBG-MolBBBP

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol

    Usage in Python

    import os
    import os.path as osp
    import pandas as pd
    import torch
    from ogb.graphproppred import PygGraphPropPredDataset
    
    class PygOgbgMol(PygGraphPropPredDataset):
def __init__(self, name, transform = None, pre_transform = None, meta_csv = None):
        root = '../input'
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbg-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
super().__init__(name = name, root = root, transform = transform, pre_transform = pre_transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
          
        path = osp.join(self.root, 'split', split_type)
    
        # short-cut if split_dict.pt exists
        if os.path.isfile(os.path.join(path, 'split_dict.pt')):
          return torch.load(os.path.join(path, 'split_dict.pt'))
    
        train_idx = pd.read_csv(osp.join(path, 'train.csv'), header = None).values.T[0]
        valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header = None).values.T[0]
        test_idx = pd.read_csv(osp.join(path, 'test.csv'), header = None).values.T[0]
    
        return {'train': torch.tensor(train_idx, dtype = torch.long), 'valid': torch.tensor(valid_idx, dtype = torch.long), 'test': torch.tensor(test_idx, dtype = torch.long)}
    
    dataset = PygOgbgMol('ogbg-molbbbp')
    
    from torch_geometric.data import DataLoader
    
    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from the MoleculeNet [1], and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not. The full description of the features is provided in code. The script to convert the SMILES string [3] to the above graph object can be found here. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that those datasets share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].

Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].

    For encoding these raw input features, the dataset authors prepare simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features to obtain atom_emb and bond_emb.

    from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
    atom_encoder = AtomEncoder(emb_dim = 100)
    bond_encoder = BondEncoder(emb_dim = 100)
    
    atom_emb = atom_encoder(x) # x is the input atom feature
    edge_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
    

Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks, and can contain nan values indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC-AUC f...
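The truncated paragraph above refers to the unified OGB evaluation; as a sketch (toy arrays; the input format follows the OGB documentation):

import numpy as np
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molbbbp')
print(evaluator.expected_input_format) # describes the required input dict

y_true = np.array([[0], [1], [1], [0]])     # toy ground-truth labels
y_pred = np.array([[0.1], [0.9], [0.4], [0.2]]) # toy model scores
print(evaluator.eval({'y_true': y_true, 'y_pred': y_pred})) # e.g. {'rocauc': ...}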

  12. Data from: Duck Hunt

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    Hugo Zanini (2025). Duck Hunt [Dataset]. https://www.kaggle.com/datasets/hugozanini1/duck-hunt
    Explore at:
zip (7379197 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    Hugo Zanini
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Duck Hunt Object Detection Dataset

    This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.

    Perfect for: - Object detection model training - Computer vision research - Retro gaming AI projects - YOLO algorithm benchmarking - Educational purposes

    🎯 Dataset Statistics

Metric | Value
Total Images | 1,004
Dataset Size | 12 MB
Image Format | PNG
Annotation Format | YOLO (.txt)
Classes | 4
Train/Val Split | 711/260 (73%/27%)

    Class Distribution

Class ID | Class Name | Count | Description
0 | dog | 252 | The hunting dog in various poses (jumping, laughing, sniffing, etc.)
1 | duck_dead | 256 | Dead ducks (both black and red variants)
2 | duck_shot | 248 | Ducks in the moment of being shot
3 | duck_flying | 248 | Flying ducks in all directions (left, right, diagonal)

    📁 Dataset Structure

    yolo_dataset_augmented/
    ├── images/
    │  ├── train/      # 711 training images
    │  └── val/       # 260 validation images
    ├── labels/
    │  ├── train/      # 711 YOLO annotation files
    │  └── val/       # 260 YOLO annotation files
    ├── classes.txt     # Class names mapping
    ├── dataset.yaml     # YOLO configuration file
    └── augmented_dataset_stats.json # Detailed statistics
    

    🔧 Data Augmentation Details

    The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:

    Augmentation Techniques Applied:

    • Geometric Transformations: Rotation (±15°), horizontal/vertical flipping, scaling (0.8-1.2x), translation
    • Color Adjustments: Brightness (0.7-1.3x), contrast (0.8-1.2x), saturation (0.8-1.2x)
    • Quality Variations: Gaussian noise, slight blur for robustness
    • Advanced Techniques: Mosaic augmentation (YOLO-style 4-image combination)

    Augmentation Parameters:

    {
      'rotation_range': (-15, 15),    # Small rotations for game sprites
      'brightness_range': (0.7, 1.3),  # Brightness variations
      'contrast_range': (0.8, 1.2),   # Contrast adjustments
      'saturation_range': (0.8, 1.2),  # Color saturation
      'noise_intensity': 0.02,      # Gaussian noise
      'horizontal_flip_prob': 0.5,    # 50% chance horizontal flip
      'scaling_range': (0.8, 1.2),    # Scale variations
    }
    

    🚀 Usage Examples

    Loading with YOLOv8 (Ultralytics)

    from ultralytics import YOLO
    
    # Load and train
    model = YOLO('yolov8n.pt') # Load pretrained model
    results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
    
    # Validate
    metrics = model.val()
    
    # Predict
    results = model('path/to/test/image.png')
    

    Loading with PyTorch

    import torch
    from torch.utils.data import Dataset, DataLoader
    from PIL import Image
    import os
    
    class DuckHuntDataset(Dataset):
def __init__(self, images_dir, labels_dir, transform=None):
        self.images_dir = images_dir
        self.labels_dir = labels_dir
        self.transform = transform
        self.images = os.listdir(images_dir)
      
def __len__(self):
        return len(self.images)
      
def __getitem__(self, idx):
        img_path = os.path.join(self.images_dir, self.images[idx])
        label_path = os.path.join(self.labels_dir, 
                     self.images[idx].replace('.png', '.txt'))
        
        image = Image.open(img_path)
        # Load YOLO annotations
        with open(label_path, 'r') as f:
          labels = f.readlines()
        
        if self.transform:
          image = self.transform(image)
          
        return image, labels
    
    # Usage
    dataset = DuckHuntDataset('images/train', 'labels/train')
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    YOLO Annotation Format

    Each .txt file contains one line per object: class_id center_x center_y width height

Example annotation: 0 0.492 0.403 0.212 0.315, where values are normalized (0-1) relative to image dimensions.
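A small helper for converting one such line into pixel coordinates (a sketch; 256×240 is the native NES frame size):

def yolo_to_pixel_box(line, img_w, img_h):
  # "class_id cx cy w h" (normalized) -> (class_id, x_min, y_min, x_max, y_max) in pixels
  class_id, cx, cy, w, h = line.split()
  cx, w = float(cx) * img_w, float(w) * img_w
  cy, h = float(cy) * img_h, float(h) * img_h
  return int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(yolo_to_pixel_box("0 0.492 0.403 0.212 0.315", 256, 240))
# approximately (0, 98.816, 58.92, 153.088, 134.52)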

    📊 Technical Specifications

    • Image Dimensions: Variable (original sprite sizes preserved)
    • Color Channels: RGB (3 channels)
    • Annotation Precision: Float32 (normalized coordinates)
    • File Naming: Descriptive names indicating class and augmentation type
    • Quality: High-resolution pixel art sprites

    🎮 Dataset Context

    This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:

    • The Dog: Your hunting companion who retrieves ducks and ...
  13. 3xM 10 10 (RGB-D Instance Seg. for bin-picking)

    • kaggle.com
    zip
    Updated Nov 12, 2024
    + more versions
    Cite
    Tobia Ippolito (2024). 3xM 10 10 (RGB-D Instance Seg. for bin-picking) [Dataset]. https://www.kaggle.com/datasets/tobiaippolito/3xm-10-10
    Explore at:
zip (67215581908 bytes)
    Dataset updated
    Nov 12, 2024
    Authors
    Tobia Ippolito
    License

https://www.gnu.org/licenses/gpl-3.0.html

    Description

    In short

This dataset was used to investigate the influence of the number of unique 3D models (shapes) and materials (textures) on the shape-texture bias, performance, and generalization of deep neural network instance segmentation in my bachelor thesis.

• one of nine datasets created in Unreal Engine 5 with an NVIDIA RTX A4500
• uses 160 unique shapes and 80 unique textures
• RGB, depth, and solution masks are available
• 20,000 scenes
• ready-to-use dataloader, training, and inference; see the next section

    Usage

You can load the images like this:

    import cv2
    
    image = cv2.imread(img_path)
    if image is None:
      raise FileNotFoundError(f"Error during data loading: there is no '{img_path}'")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
    depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)
    if len(depth.shape) > 2:
      _, depth, _, _ = cv2.split(depth)
          
    mask = cv2.imread(mask_path, cv2.IMREAD_UNCHANGED)  # cv2.IMREAD_GRAYSCALE)
    

For ease of use, I recommend my own code. You can use it directly to train Mask R-CNN, or just use the dataloader. Both are shown below:

First, clone my torch GitHub project into your project:

cd ./path/to/your/project
git clone https://github.com/xXAI-botXx/torch-mask-rcnn-instance-segmentation.git

Second, install the anaconda env (optional):

cd ./path/to/your/project
cd ./torch-mask-rcnn-instance-segmentation
conda env create -f conda_env.yml

Third, you are ready to go.

Using only the dataloader for your custom project:

import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import cv2
from torch.utils.data import DataLoader

sys.path.append("./torch-mask-rcnn-instance-segmentation")

from maskrcnn_toolkit import DATA_LOADING_MODE, Dual_Dir_Dataset, collate_fn, extract_and_visualize_mask

data_mode = DATA_LOADING_MODE.ALL

dataset = Dual_Dir_Dataset(img_dir="/path/to/rgb-folder", depth_dir="/path/to/depth-folder",
              mask_dir="/path/to/mask-folder", transform=None, amount=1, start_idx=0,
              end_idx=0, image_name="...", data_mode=data_mode, use_mask=True,
              use_depth=False, log_path="./logs", width=1920, height=1080,
              should_log=True, should_print=True, should_verify=False)
data_loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=4, collate_fn=collate_fn)

# plot
for data in data_loader:
  for batch_idx in range(len(data[0])):
    if len(data) == 3:
      image = data[0][batch_idx].cpu().unsqueeze(0)
      masks = data[1][batch_idx]["masks"]
      masks = masks.cpu()
      name = data[2][batch_idx]
    else:
      image = data[0][batch_idx].cpu().unsqueeze(0)
      name = data[1][batch_idx]

    image = image.cpu().numpy().squeeze(0)
    image = np.transpose(image, (1, 2, 0)) # Convert to HWC

    # Remove 4th channel if existing
    if image.shape[2] == 4:
      depth = image[:, :, 3]
      image = image[:, :, :3]
    else:
      depth = None

    masks_gt = masks.cpu().numpy()
    masks_gt = np.transpose(masks_gt, (1, 2, 0))
    mask = extract_and_visualize_mask(masks_gt, image=None, ax=None, visualize=False, color_map=None, soft_join=False)

    # plot
    cols = 1
    if depth is not None:
      cols += 1
    if mask is not None:
      cols += 1

    fig, ax = plt.subplots(nrows=1, ncols=cols, figsize=(20, 15*cols))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.05, hspace=0.05)

    plot_idx = 0
    ax[plot_idx].imshow(image)
    ax[plot_idx].set_title("RGB Input Image")
    ax[plot_idx].axis("off")

    if depth is not None:
      plot_idx += 1
      ax[plot_idx].imshow(depth, cmap="gray")
      ax[plot_idx].set_title("Depth Input Image")
      ax[plot_idx].axis("off")

    if mask is not None:
      plot_idx += 1
      ax[plot_idx].imshow(mask)
      ax[plot_idx].set_title("Mask Ground Truth")
      ax[plot_idx].axis("off")

    plt.show()

Using the whole Mask R-CNN training pipeline:
    import sys
    sys.path.append("./torch-mask-rcnn-instance-segmentation")
    
    from maskrcnn_toolkit import DATA_LOADING_MODE, train
    
    
    # set the vars as you need
    
    WEIGHTS_PATH = None   # Path to the model weights file
    USE_DEPTH = False      # Whether to include depth information -> as rgb and depth on green channel
    VERIFY_DATA = False     # True is recommended
    
    GROUND_PATH = "D:/3xM"  
    DATASET_NAME = "3xM_Dataset_10_10"
    IMG_DIR = os.path.join(GR...
    
  14. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Explore at:
zip (971125684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    Dataset Contents

The dataset is organized into three standard splits:

• Train set
• Validation set
• Test set

Each split contains data in multiple formats:

1. Original JPG images
2. Segmentation mask JPG images
3. Parquet files containing flattened image and mask data
4. Pickle files containing serialized image and mask data

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
• Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels (see the loading sketch after this list)
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files
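To make the split_at convention concrete, here is a manual loading sketch for the parquet files (assuming pandas and numpy; the provided CatDataset class shown below handles this for you):

import numpy as np
import pandas as pd

image_size, image_channels = (224, 224), 3
split_at = image_size[0] * image_size[1] * image_channels # 150,528 values per image

data = pd.read_parquet("train_dataset.parquet").to_numpy()
images = data[:, :split_at].reshape(-1, 224, 224, 3) # flattened pixels -> images
masks = data[:, split_at:].reshape(-1, 224, 224, 1)  # remaining values -> masks
print(images.shape, masks.shape)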

    Image Preprocessing

All images were preprocessed with the following operations:

• Resized to 224×224 pixels using bilinear interpolation
• Segmentation masks were also resized to match the images, using nearest-neighbor interpolation
• Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

When used with the provided PyTorch dataset class, images are normalized with:

• Mean: [0.48235, 0.45882, 0.40784]
• Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098] (i.e., 1/255)

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

Loading time benchmarks from the original implementation:

• Parquet format: ~1.29 seconds per iteration
• Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from ca...
    
  15. Animals (Cats, Dogs, and Snakes)

    • kaggle.com
    zip
    Updated Nov 18, 2025
    + more versions
    Cite
    Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
    Explore at:
zip (40219983 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Omar Rehan
    Description

    Cats, Dogs, and Snakes Dataset

    Dataset Overview

    The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

Class | Number of Images | Description
Cats | 1,000 | Includes multiple breeds and poses
Dogs | 1,000 | Covers various breeds and backgrounds
Snakes | 1,000 | Includes multiple species and natural settings

    Total Images: 3,000

    Image Properties:

    • Resolution: 224×224 pixels (resized for consistency)
    • Color Mode: RGB
    • Format: JPEG/PNG
    • Cleaned: Duplicate, blurry, and irrelevant images removed

    Data Split Recommendation

Set | Percentage | Number of Images
Training | 70% | 2,100
Validation | 15% | 450
Test | 15% | 450

    Preprocessing

    Images in the dataset have been standardized to support machine learning pipelines:

    1. Resizing to 224×224 pixels.
    2. Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.
    3. Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

    Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Path to dataset
    dataset_path = "path/to/dataset"
    
    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
      rescale=1./255,
      validation_split=0.15 # 15% for validation
    )
    
    # Load training data
    train_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='training',
      shuffle=True
    )
    
    # Load validation data
    validation_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='validation',
      shuffle=False
    )
    
    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
    

    Key Features

    • Balanced: Equal number of samples per class reduces bias.
    • Cleaned: High-quality, relevant images improve model performance.
    • Diverse: Covers multiple breeds, species, and environments to ensure generalization.
    • Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
  16. Face Mask Detection - OpenCV University dataset

    • kaggle.com
    zip
    Updated Jul 20, 2024
    + more versions
    Cite
    Radim Közl (2024). Face Mask Detection - OpenCV University dataset [Dataset]. https://www.kaggle.com/datasets/radimkzl/face-mask-detection-opencv-university-dataset
    Explore at:
zip (225548404 bytes)
    Dataset updated
    Jul 20, 2024
    Authors
    Radim Közl
    License

http://www.gnu.org/licenses/lgpl-3.0.html

    Description

This dataset was created for Computer Vision & Deep Learning Applications at OpenCV University.

It was built for Project 4, the YOLO Face Mask Detector, and is intended for training your own YOLO model.

The dataset contains 2,712 image and annotation files, i.e., 1,356 samples (*.jpg and *.txt pairs).

It is suitable for the PyTorch versions of YOLOv5 and YOLOv8 through YOLOv10. For Colab, use this prebuilt code:

import os
import random
import shutil
import zipfile

if not os.path.exists('face_mask_dataset.zip'):
      !curl -L "https://storage.googleapis.com/kaggle-data-sets/5418712/8996055/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240721%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240721T204520Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=65e5e938f51c2f80f23fbed8b4d5460669729108f266f1977a3d3af260eeea4b7a413c4bd8050e5bba034dde8a7f7fc2fc06a43e48b3e4a43c9dd6a6f0747739e338b3ca89db762dd1797afa4cccc78bead9d39bb85bd86720cbb8d33628b37aeadda551e1394b45faaa93288d385bfbbc9b0b57ac793ed5a53917c1ba1303238a40b599abb9f3063d3a3d34bd289992d58cbf10ecf836242767ec139d24a1e78b9f11d6e897d245163fa1d5d555bffbc06eb60411dcdd28594dd0582bbe09add0fb269565a2f4a714f285ec018c463e01179794185cf5010cba2974fa3cf58ccaa1513c619b0a434707c9b22c958e61b71633540935ee6c1b804d5831002a9a" > face_mask_dataset.zip;
    
# Function to unzip a ZIP file
    def unzip_file(zip_path, extract_to):
      with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
    
    zip_path = './face_mask_dataset.zip'
    os.makedirs('face_mask_dataset', exist_ok=True)
    extract_to = './face_mask_dataset/'
    unzip_file(zip_path, extract_to)
    
    !rm face_mask_dataset.zip
    
    !mkdir train
    !mkdir valid
    !mkdir test
    
    dataset_dir = './face_mask_dataset/'
    train_dir = './train/'
    test_dir = './test/'
    valid_dir = './valid/'
    
    # create subfolder
    for folder in [train_dir, test_dir, valid_dir]:
      os.makedirs(os.path.join(folder, 'images'), exist_ok=True)
      os.makedirs(os.path.join(folder, 'labels'), exist_ok=True)
    
    # Get all image and corresponding description files
    image_files = [f for f in os.listdir(dataset_dir) if f.endswith('.jpg')]
    label_files = [f.replace('.jpg', '.txt') for f in image_files]
    
    # Checking that a corresponding description file exists for each image
    assert all(os.path.isfile(os.path.join(dataset_dir, lbl)) for lbl in label_files), "Some matching descriptor files are missing!"
    
    # Shuffle list of files
    combined = list(zip(image_files, label_files))
    random.shuffle(combined)
    image_files[:], label_files[:] = zip(*combined)
    
    # Split files according to the specified ratio
    total_files = len(image_files)
    train_split = int(0.6 * total_files)
    test_split = int(0.2 * total_files)
    
    train_files = image_files[:train_split]
    train_labels = label_files[:train_split]
    
    test_files = image_files[train_split:train_split + test_split]
    test_labels = label_files[train_split:train_split + test_split]
    
    valid_files = image_files[train_split + test_split:]
    valid_labels = label_files[train_split + test_split:]
    
    # Function to move files to their respective folders
    def move_files(files, labels, dest_dir):
      for img_file, lbl_file in zip(files, labels):
        shutil.move(os.path.join(dataset_dir, img_file), os.path.join(dest_dir, 'images', img_file))
        shutil.move(os.path.join(dataset_dir, lbl_file), os.path.join(dest_dir, 'labels', lbl_file))
    
# Move files to their folders
    move_files(train_files, train_labels, train_dir)
    move_files(test_files, test_labels, test_dir)
    move_files(valid_files, valid_labels, valid_dir)
    
    print(f"Split complete: {len(train_files)} train, {len(test_files)} test, {len(valid_files)} valid.")
    
    !rm -r face_mask_dataset
    

On Kaggle, you can use the prebuilt Face Mask Detection - OpenCV University notebook.
