16 datasets found
  1. OGBG-Code (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-Code (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbg-code/code
    Explore at:
zip (1314604183 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

OGBG-Code

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code

    Usage in Python

from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset

# Load the pre-processed dataset from the Kaggle input directory
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')

# Build loaders from the official project split
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes: AST edges, AST nodes, and the tokenized method name. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.

Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures the code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
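To make the metric concrete, here is a minimal sketch of per-sample sub-token F1 (my own illustration of the standard precision/recall definition; for reported numbers, use the official OGB Evaluator):

from collections import Counter

def subtoken_f1(pred_tokens, true_tokens):
  # Overlap is counted with multiplicity between predicted and ground-truth sub-tokens
  overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
  if overlap == 0:
    return 0.0
  precision = overlap / len(pred_tokens)
  recall = overlap / len(true_tokens)
  return 2 * precision * recall / (precision + recall)

print(subtoken_f1(['get', 'item'], ['get', 'item', 'by', 'id'])) # 0.666...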

    Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.

    Summary

Package | #Graphs | Avg. #Nodes per Graph | Avg. #Edges per Graph | Split Type | Task Type | Metric
ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score

    License: MIT License

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
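As a sketch of that unified flow for ogbg-code (the input format follows the OGB documentation; the token lists are toy values):

from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-code')
print(evaluator.expected_input_format) # describes the required input dict

# one list of sub-tokens per graph (toy values)
result = evaluator.eval({
  'seq_ref': [['get', 'item'], ['parse', 'args']],
  'seq_pred': [['get', 'item', 'by', 'id'], ['parse', 'args']],
})
print(result) # e.g. {'precision': ..., 'recall': ..., 'F1': ...}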

    References

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors at their website or their GitHub repo.

  2. 🧩 Maze Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2025
    Cite
    mexwell (2025). 🧩 Maze Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/maze-dataset
    Explore at:
zip (16391387 bytes)
    Dataset updated
    Apr 18, 2025
    Authors
    mexwell
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About

This dataset comprises 3,000 randomly generated mazes (1.5k perfect and 1.5k imperfect), each represented in a structured format suitable for algorithmic pathfinding, reinforcement learning, or maze-solving research. The mazes vary in dimensions from 10×10 to 150×150, offering a diverse range of complexity and size to support various levels of algorithmic challenge and scalability testing.

    Python Code (Perfect mazes)

import random
import numpy as np

def create_maze(dim, i): # `i` is unused here (kept from the original signature)
  # Create a grid filled with walls
  maze = np.ones((dim * 2 + 1, dim * 2 + 1))

  # Define the starting point
  x, y = (0, 0)
  maze[2 * x + 1, 2 * y + 1] = 0

  # Initialize the stack with the starting point
  stack = [(x, y)]
  while len(stack) > 0:
    x, y = stack[-1]

    # Define possible directions
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    random.shuffle(directions)

    for dx, dy in directions:
      nx, ny = x + dx, y + dy
      if nx >= 0 and ny >= 0 and nx < dim and ny < dim and maze[2 * nx + 1, 2 * ny + 1] == 1:
        maze[2 * nx + 1, 2 * ny + 1] = 0
        maze[2 * x + 1 + dx, 2 * y + 1 + dy] = 0
        stack.append((nx, ny))
        break
    else:
      stack.pop()

  # Create an entrance and an exit
  maze[1, 0] = 0
  maze[-2, -1] = 0

  return maze
    

The function above was originally written by Michael Gold. You can read his article Python's Path Through Mazes: A Journey of Creation and Solution on Medium.

    Python Code (Imperfect mazes)

&#35; assumes `import numpy as np` and `import random`, as in the snippet above
def create_imperfect_maze(dim, extra_wall_removals=0.05):
      def create_perfect_maze(dim):
        maze = np.ones((dim * 2 + 1, dim * 2 + 1), dtype=int)
        x, y = (0, 0)
        maze[2 * x + 1, 2 * y + 1] = 0
        stack = [(x, y)]
        while stack:
          x, y = stack[-1]
          directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
          random.shuffle(directions)
          for dx, dy in directions:
            nx, ny = x + dx, y + dy
            if 0 <= nx < dim and 0 <= ny < dim and maze[2 * nx + 1, 2 * ny + 1] == 1:
              maze[2 * nx + 1, 2 * ny + 1] = 0
              maze[2 * x + 1 + dx, 2 * y + 1 + dy] = 0
              stack.append((nx, ny))
              break
          else:
            stack.pop()
        maze[1, 0] = 0 # entrance
        maze[-2, -1] = 0 # exit
        return maze
    
      maze = create_perfect_maze(dim)
    
      wall_candidates = []
      for i in range(1, maze.shape[0] - 1):
        for j in range(1, maze.shape[1] - 1):
          if maze[i, j] == 1:
            if maze[i - 1, j] == 0 and maze[i + 1, j] == 0:
              wall_candidates.append((i, j))
            elif maze[i, j - 1] == 0 and maze[i, j + 1] == 0:
              wall_candidates.append((i, j))
    
      num_to_remove = int(len(wall_candidates) * extra_wall_removals)
      walls_to_remove = random.sample(wall_candidates, num_to_remove)
      for i, j in walls_to_remove:
        maze[i, j] = 0
    
      return maze
    

    Created by myself.
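Since the generated mazes are plain 0/1 grids (0 = corridor, 1 = wall), a breadth-first-search solver makes a natural companion to the generators above. A minimal sketch (my own addition; entrance (1, 0) and exit (-2, -1) match the generator code):

from collections import deque

def solve_maze(maze):
  # BFS from entrance to exit over corridor cells (value 0)
  rows, cols = maze.shape
  start, goal = (1, 0), (rows - 2, cols - 1)
  queue = deque([start])
  parent = {start: None}
  while queue:
    cell = queue.popleft()
    if cell == goal:
      path = []
      while cell is not None: # walk the parent chain back to the entrance
        path.append(cell)
        cell = parent[cell]
      return path[::-1]
    x, y = cell
    for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
      if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and maze[nxt] == 0 and nxt not in parent:
        parent[nxt] = cell
        queue.append(nxt)
  return None # no path found

path = solve_maze(create_maze(10, 0))
print(len(path)) # number of cells on the entrance-to-exit path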

Photo by Mitchell Luo on Unsplash

3. YALM-instruct1-1M

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Kuldip Patel (2025). YALM-instruct1-1M [Dataset]. https://huggingface.co/datasets/kp7742/YALM-instruct1-1M
    Explore at:
    Dataset updated
    Jul 8, 2025
    Authors
    Kuldip Patel
    Description

    YALM Instruct Data - 1

The YALM Instruct Data - 1 is a mix of instruction-tuning data in English, Hindi, math, and Python code, taken from various sources for the supervised fine-tuning task of YALM (Yet Another Language Model).

Total Samples: 1.31M
Shuffle Seed: 101
Datasets:

    HuggingFaceTB/smoltalk

    Language: English, Math, Python

    damerajee/Instruct-hindi

    Language: Hindi, Hinglish

    smangrul/hindi_instruct_v1

    Language: Hindi, Hinglish
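A minimal loading sketch (assuming the standard Hugging Face datasets API; the split name and columns are as given on the dataset card):

from datasets import load_dataset

ds = load_dataset("kp7742/YALM-instruct1-1M", split = "train")
print(ds)   # number of rows and column names
print(ds[0]) # first sample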

  4. MEG Ladder Stats: Reproducible Shuffle Tests for Quantized Steepness...

    • figshare.com
    zip
    Updated Oct 4, 2025
    Cite
    Chris Shreenan-Dyck (2025). MEG Ladder Stats: Reproducible Shuffle Tests for Quantized Steepness Alignments (v0.1.0) [Dataset]. http://doi.org/10.6084/m9.figshare.30276487.v1
    Explore at:
zip
    Dataset updated
    Oct 4, 2025
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Chris Shreenan-Dyck
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This item provides a self-contained, runnable package that reproduces the statistical validation of rung alignment used in the MEG harmonic ladder manuscript. It includes the pre-specified 12-feature dataset (in rung units) and Python code to run three null models, compute effect-size metrics, and test robustness, so referees (and readers) can verify all reported results with a single command.
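The packaged code should be consulted for the exact pre-specified tests; purely as an illustration of a shuffle test of this kind, here is a sketch with a made-up alignment statistic and null model (alignment_stat and the uniform null are stand-ins, not the manuscript's):

import numpy as np

rng = np.random.default_rng(42)

def alignment_stat(values):
  # stand-in statistic: mean squared distance to the nearest integer rung
  return np.mean((values - np.round(values)) ** 2)

features = rng.uniform(0, 12, size = 12) # 12 features in rung units (toy data)
observed = alignment_stat(features)

# null model: rung positions drawn uniformly at random
null = np.array([alignment_stat(rng.uniform(0, 12, size = 12)) for _ in range(10_000)])
p_value = (1 + np.sum(null <= observed)) / (1 + len(null)) # smaller stat = tighter alignment
print(f"observed = {observed:.4f}, p = {p_value:.4f}")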

  5. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jun 30, 2022
    Cite
Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. http://doi.org/10.5281/zenodo.6568778
    Explore at:
application/gzip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders as the ImageNet output classes.

    An example showing how to use the dataset is shown below.

    # code for testing robustness of a model
    import os.path
    
    from torchvision import datasets, transforms, models
    import torch.utils.data
    
    
    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      This is required for handling empty folders from the ImageFolder Class.
      """
    
      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx
    
    
    # extract and unzip the dataset, then write top folder here
    dataset_folder = 'data/ImageNet-Patch'
    
    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup'
    }
    
    # select folder with specific target
    target_label = 954
    
    dataset_folder = os.path.join(dataset_folder, str(target_label))
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
transform = transforms.Compose([   # renamed to avoid shadowing the torchvision transforms module
  transforms.ToTensor(),
  normalizer
])

dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transform)
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()
    
    batches = 10
    correct, attack_success, total = 0, 0, 0
    for batch_idx, (images, labels) in enumerate(loader):
      if batch_idx == batches:
        break
      pred = model(images).argmax(dim=1)
      correct += (pred == labels).sum()
      attack_success += sum(pred == target_label)
      total += pred.shape[0]
    
    accuracy = correct / total
    attack_sr = attack_success / total
    
    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)
    

  6. new cars prices

    • kaggle.com
    zip
    Updated Oct 25, 2024
    Cite
    Jeevan Sai123 (2024). new cars prices [Dataset]. https://www.kaggle.com/datasets/jeevansai123/new-cars-prices
    Explore at:
zip (37952 bytes)
    Dataset updated
    Oct 25, 2024
    Authors
    Jeevan Sai123
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

Shuffled Car Dataset

This project contains a shuffled dataset of cars, derived from the original dataset CARS_1.csv. The data includes information about various car models, their specifications, and market information.

Dataset Overview

The dataset provides detailed attributes for multiple car models, including engine specifications, body type, pricing, fuel type, and user reviews. The rows have been randomly shuffled to ensure data randomness.

Dataset Columns

• car_name: The name of the car model.
• reviews_count: The number of reviews the car has received.
• fuel_type: The type of fuel the car uses (Petrol, Diesel, Electric, etc.).
• engine_displacement: The engine displacement volume in cubic centimeters (cc).
• no_cylinder: The number of cylinders in the engine.
• seating_capacity: The seating capacity of the car (number of passengers).
• transmission_type: The type of transmission (Automatic, Manual, Electric).
• fuel_tank_capacity: The capacity of the fuel tank in liters.
• body_type: The classification of the car based on its shape and design (SUV, Sedan, Hatchback, etc.).
• rating: The user rating of the car, typically out of 5.
• starting_price: The starting price of the car in local currency.
• ending_price: The highest price of the car in local currency.
• max_torque_nm: The maximum torque the engine can produce (in Newton meters).
• max_torque_rpm: The RPM (revolutions per minute) at which the car delivers its maximum torque.
• max_power_bhp: The maximum power the engine can produce (in brake horsepower).
• max_power_rp: The RPM at which the car delivers its maximum power.

Usage

This dataset can be used for various data analysis and machine learning tasks, including:

• Predicting car prices based on engine specifications and other attributes.
• Clustering cars by their specifications (e.g., body type, fuel type).
• Analyzing customer preferences based on review counts and ratings.

How to Use

1. Load the dataset into your environment (e.g., Python, R, Excel).
2. Use appropriate data analysis and visualization tools to gain insights.
3. Perform machine learning tasks such as regression or classification using the car specifications; see the sketch below.

File Information

• Source File: CARS_1.csv
• Shuffled File: You may shuffle the dataset yourself or access the already shuffled dataset for analysis.
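As a concrete starting point, a quick price-regression baseline (a sketch; the file name is illustrative, and the columns are those listed above):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("CARS_1_shuffled.csv") # illustrative file name

features = ["engine_displacement", "no_cylinder", "seating_capacity",
      "fuel_tank_capacity", "max_torque_nm", "max_power_bhp"]
df = df.dropna(subset = features + ["starting_price"])

X_train, X_test, y_train, y_test = train_test_split(
  df[features], df["starting_price"], test_size = 0.2, random_state = 0)

model = RandomForestRegressor(n_estimators = 200, random_state = 0)
model.fit(X_train, y_train)
print("R^2 on held-out cars:", model.score(X_test, y_test))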

7. human_proteome_doublets

    • huggingface.co
    Updated Aug 25, 2022
    Cite
    Yaron Geffen (2022). human_proteome_doublets [Dataset]. https://huggingface.co/datasets/yarongef/human_proteome_doublets
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2022
    Authors
    Yaron Geffen
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

Out of 20,577 human proteins (from the UniProt human proteome), sequences shorter than 20 amino acids or longer than 512 amino acids were removed, resulting in a set of 12,703 proteins. The uShuffle algorithm (Python package) was then used to shuffle these protein sequences while maintaining their doublet distribution. The very few sequences for which uShuffle failed to create a shuffled version were eliminated. Afterwards, the h-CD-HIT algorithm (web server) was used… See the full description on the dataset page: https://huggingface.co/datasets/yarongef/human_proteome_doublets.
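A minimal sketch of the k-let shuffling step described above, assuming the ushuffle Python package (the sequence is a toy example; a let size of 2 preserves doublet counts):

from ushuffle import shuffle

seq = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" # toy protein sequence
shuffled = shuffle(seq, 2) # let size 2: the shuffled sequence keeps the same doublet (2-mer) counts
print(shuffled.decode())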

8. Data and code underlying the publications: 'Configuration models for random...

    • data.4tu.nl
    zip
    Updated Apr 17, 2025
    Cite
    ir. Y.J. Kraakman; Clara Stegehuis (2025). Data and code underlying the publications: 'Configuration models for random directed hypergraphs' and 'Hypercurveball algorithm for sampling hypergraphs with fixed degrees' [Dataset]. http://doi.org/10.4121/9beea11f-2e93-473d-9d22-8d8a6bec9d5a.v1
    Explore at:
zip
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    ir. Y.J. Kraakman; Clara Stegehuis
    License

https://www.gnu.org/licenses/gpl-3.0.html

    Dataset funded by
    Nederlandse Organisatie voor Wetenschappelijk Onderzoek
    Description

    This folder contains the code and data used to compare the performance of two algorithms for generating random hypergraphs with prescribed degree sequences. The comparison is conducted by simulating and analyzing the mixing time of each algorithm.

    Specifically, the folder includes:

    1. Python scripts (.py) for generating random directed or undirected hypergraphs using either the Hypercurveball algorithm or the Hyperedge-shuffle algorithm.
    2. A Python script (.py) for analyzing the mixing time of each algorithm. For each algorithm, this script outputs a .csv file that contains the perturbation degree at each step of the simulation.
    3. 25 hypergraph datasets (.csv), containing both undirected and directed hypergraphs.
    4. For each dataset: perturbation degree files (.csv), containing the perturbation degree value at each step of a simulation, for both algorithms. Each algorithm is simulated either 10 or 100 times per dataset.
    5. A Python script (.py) for computing various statistics of a hypergraph.


    The folder accompanies these papers:

    - Yanna J. Kraakman and Clara Stegehuis (2024). Configuration models for random directed hypergraphs. arXiv:2402.06466.

    - Yanna J. Kraakman and Clara Stegehuis (2024). Hypercurveball algorithm for sampling hypergraphs with fixed degrees. arXiv:2412.05100
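For intuition, the degree-preserving move at the heart of a hyperedge-shuffle chain can be sketched as follows (a simplified illustration, not the packaged implementation): pick two hyperedges at random and swap one vertex between them, rejecting swaps that would duplicate a vertex inside a hyperedge.

import random

def hyperedge_shuffle_step(hyperedges):
  # hyperedges: list of vertex sets; vertex degrees and hyperedge sizes are preserved
  e1, e2 = random.sample(range(len(hyperedges)), 2)
  only1 = tuple(hyperedges[e1] - hyperedges[e2])
  only2 = tuple(hyperedges[e2] - hyperedges[e1])
  if not only1 or not only2:
    return # reject: no swap possible without duplicating a vertex
  a, b = random.choice(only1), random.choice(only2)
  hyperedges[e1] = (hyperedges[e1] - {a}) | {b}
  hyperedges[e2] = (hyperedges[e2] - {b}) | {a}

random.seed(0)
H = [{1, 2, 3}, {2, 4}, {3, 4, 5}]
for _ in range(100):
  hyperedge_shuffle_step(H)
print(H) # same vertex degrees and hyperedge sizes as the input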

  9. OGBG-MolClinTox (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    + more versions
    Cite
    Redao da Taupl (2021). OGBG-MolClinTox (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbg-molclintox
    Explore at:
zip (325594 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

OGBG-MolClinTox

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol

    Usage in Python

    import os
    import os.path as osp
    import pandas as pd
    import torch
    from ogb.graphproppred import PygGraphPropPredDataset
    
    class PygOgbgMol(PygGraphPropPredDataset):
def __init__(self, name, transform = None, pre_transform = None, meta_csv = None):
        root = '../input'
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbg-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
super().__init__(name = name, root = root, transform = transform, pre_transform = pre_transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
          
        path = osp.join(self.root, 'split', split_type)
    
        # short-cut if split_dict.pt exists
        if os.path.isfile(os.path.join(path, 'split_dict.pt')):
          return torch.load(os.path.join(path, 'split_dict.pt'))
    
        train_idx = pd.read_csv(osp.join(path, 'train.csv'), header = None).values.T[0]
        valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header = None).values.T[0]
        test_idx = pd.read_csv(osp.join(path, 'test.csv'), header = None).values.T[0]
    
        return {'train': torch.tensor(train_idx, dtype = torch.long), 'valid': torch.tensor(valid_idx, dtype = torch.long), 'test': torch.tensor(test_idx, dtype = torch.long)}
    
    dataset = PygOgbgMol('ogbg-molclintox')
    
    from torch_geometric.data import DataLoader
    
    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from the MoleculeNet [1], and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not. The full description of the features is provided in code. The script to convert the SMILES string [3] to the above graph object can be found here. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that those datasets share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].

Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].

    For encoding these raw input features, the dataset authors prepare simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features to obtain atom_emb and bond_emb.

    from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
    atom_encoder = AtomEncoder(emb_dim = 100)
    bond_encoder = BondEncoder(emb_dim = 100)
    
    atom_emb = atom_encoder(x) # x is the input atom feature
    edge_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
    

Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks, and can contain nan values indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC...
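Putting the snippets above together, a minimal training-loop sketch (an illustrative toy model, not the dataset authors' reference architecture; ogbg-molclintox has two binary tasks, and labels in some OGB mol datasets may contain nan):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool
from ogb.graphproppred.mol_encoder import AtomEncoder

class TinyMolGNN(torch.nn.Module):
  def __init__(self, emb_dim = 100, num_tasks = 2):
    super().__init__()
    self.atom_encoder = AtomEncoder(emb_dim = emb_dim)
    self.conv = GCNConv(emb_dim, emb_dim)
    self.head = torch.nn.Linear(emb_dim, num_tasks)

  def forward(self, batch):
    h = self.atom_encoder(batch.x)       # embed raw atom features
    h = F.relu(self.conv(h, batch.edge_index))
    h = global_mean_pool(h, batch.batch)  # one vector per molecule
    return self.head(h)

model = TinyMolGNN()
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-3)

model.train()
for batch in train_loader: # train_loader as constructed above
  optimizer.zero_grad()
  logits = model(batch)
  labels = batch.y.float()
  mask = ~torch.isnan(labels) # skip unassigned labels
  loss = F.binary_cross_entropy_with_logits(logits[mask], labels[mask])
  loss.backward()
  optimizer.step()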

10. human_proteome_triplets

    • huggingface.co
    Updated Sep 12, 2022
    Cite
    Yaron Geffen (2022). human_proteome_triplets [Dataset]. https://huggingface.co/datasets/yarongef/human_proteome_triplets
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2022
    Authors
    Yaron Geffen
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

Out of 20,577 human proteins (from the UniProt human proteome), sequences shorter than 20 amino acids or longer than 512 amino acids were removed, resulting in a set of 12,703 proteins. The uShuffle algorithm (Python package) was then used to shuffle these protein sequences while maintaining their triplet distribution. The sequences for which uShuffle failed to create a shuffled version were eliminated. Afterwards, the h-CD-HIT algorithm (web server) was used with three… See the full description on the dataset page: https://huggingface.co/datasets/yarongef/human_proteome_triplets.

  11. OGBG-MolBBBP (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-MolBBBP (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbg-molbbbp
    Explore at:
zip (471366 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

OGBG-MolBBBP

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-mol

    Usage in Python

    import os
    import os.path as osp
    import pandas as pd
    import torch
    from ogb.graphproppred import PygGraphPropPredDataset
    
    class PygOgbgMol(PygGraphPropPredDataset):
def __init__(self, name, transform = None, pre_transform = None, meta_csv = None):
        root = '../input'
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbg-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
super().__init__(name = name, root = root, transform = transform, pre_transform = pre_transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
          
        path = osp.join(self.root, 'split', split_type)
    
        # short-cut if split_dict.pt exists
        if os.path.isfile(os.path.join(path, 'split_dict.pt')):
          return torch.load(os.path.join(path, 'split_dict.pt'))
    
        train_idx = pd.read_csv(osp.join(path, 'train.csv'), header = None).values.T[0]
        valid_idx = pd.read_csv(osp.join(path, 'valid.csv'), header = None).values.T[0]
        test_idx = pd.read_csv(osp.join(path, 'test.csv'), header = None).values.T[0]
    
        return {'train': torch.tensor(train_idx, dtype = torch.long), 'valid': torch.tensor(valid_idx, dtype = torch.long), 'test': torch.tensor(test_idx, dtype = torch.long)}
    
    dataset = PygOgbgMol('ogbg-molbbbp')
    
    from torch_geometric.data import DataLoader
    
    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from the MoleculeNet [1], and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not. The full description of the features is provided in code. The script to convert the SMILES string [3] to the above graph object can be found here. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that those datasets share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has great potential to significantly increase generalization performance on the (downstream) OGB datasets [4].

Besides the two main datasets, the dataset authors additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].

    For encoding these raw input features, the dataset authors prepare simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features to obtain atom_emb and bond_emb.

    from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
    atom_encoder = AtomEncoder(emb_dim = 100)
    bond_encoder = BondEncoder(emb_dim = 100)
    
    atom_emb = atom_encoder(x) # x is the input atom feature
    edge_emb = bond_encoder(edge_attr) # edge_attr is the input edge feature
    

Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks, and can contain nan values indicating that the corresponding label is not assigned to the molecule. For the evaluation metric, the dataset authors closely follow [2]. Specifically, for ogbg-molhiv, the dataset authors use ROC-AUC f...
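The truncated paragraph above refers to the unified OGB evaluation; as a sketch (toy arrays; the input format follows the OGB documentation):

import numpy as np
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name = 'ogbg-molbbbp')
print(evaluator.expected_input_format) # describes the required input dict

y_true = np.array([[0], [1], [1], [0]])     # toy ground-truth labels
y_pred = np.array([[0.1], [0.9], [0.4], [0.2]]) # toy model scores
print(evaluator.eval({'y_true': y_true, 'y_pred': y_pred})) # e.g. {'rocauc': ...}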

  12. Data from: Duck Hunt

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    Hugo Zanini (2025). Duck Hunt [Dataset]. https://www.kaggle.com/datasets/hugozanini1/duck-hunt
    Explore at:
zip (7379197 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    Hugo Zanini
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Duck Hunt Object Detection Dataset

    This dataset contains 1,004 labeled images from the classic NES game "Duck Hunt" (1984), specifically prepared for YOLO (You Only Look Once) object detection training. The dataset includes sprites of the iconic hunting dog and ducks in various states, augmented to provide a balanced and comprehensive training set for computer vision models.

    Perfect for: - Object detection model training - Computer vision research - Retro gaming AI projects - YOLO algorithm benchmarking - Educational purposes

    🎯 Dataset Statistics

Metric | Value
Total Images | 1,004
Dataset Size | 12 MB
Image Format | PNG
Annotation Format | YOLO (.txt)
Classes | 4
Train/Val Split | 711/260 (73%/27%)

    Class Distribution

Class ID | Class Name | Count | Description
0 | dog | 252 | The hunting dog in various poses (jumping, laughing, sniffing, etc.)
1 | duck_dead | 256 | Dead ducks (both black and red variants)
2 | duck_shot | 248 | Ducks in the moment of being shot
3 | duck_flying | 248 | Flying ducks in all directions (left, right, diagonal)

    📁 Dataset Structure

    yolo_dataset_augmented/
    ├── images/
    │  ├── train/      # 711 training images
    │  └── val/       # 260 validation images
    ├── labels/
    │  ├── train/      # 711 YOLO annotation files
    │  └── val/       # 260 YOLO annotation files
    ├── classes.txt     # Class names mapping
    ├── dataset.yaml     # YOLO configuration file
    └── augmented_dataset_stats.json # Detailed statistics
    

    🔧 Data Augmentation Details

    The original 47 images were enhanced using advanced data augmentation techniques to create a balanced dataset:

    Augmentation Techniques Applied:

    • Geometric Transformations: Rotation (±15°), horizontal/vertical flipping, scaling (0.8-1.2x), translation
    • Color Adjustments: Brightness (0.7-1.3x), contrast (0.8-1.2x), saturation (0.8-1.2x)
    • Quality Variations: Gaussian noise, slight blur for robustness
    • Advanced Techniques: Mosaic augmentation (YOLO-style 4-image combination)

    Augmentation Parameters:

    {
      'rotation_range': (-15, 15),    # Small rotations for game sprites
      'brightness_range': (0.7, 1.3),  # Brightness variations
      'contrast_range': (0.8, 1.2),   # Contrast adjustments
      'saturation_range': (0.8, 1.2),  # Color saturation
      'noise_intensity': 0.02,      # Gaussian noise
      'horizontal_flip_prob': 0.5,    # 50% chance horizontal flip
      'scaling_range': (0.8, 1.2),    # Scale variations
    }
    

    🚀 Usage Examples

    Loading with YOLOv8 (Ultralytics)

    from ultralytics import YOLO
    
    # Load and train
    model = YOLO('yolov8n.pt') # Load pretrained model
    results = model.train(data='dataset.yaml', epochs=100, imgsz=640)
    
    # Validate
    metrics = model.val()
    
    # Predict
    results = model('path/to/test/image.png')
    

    Loading with PyTorch

    import torch
    from torch.utils.data import Dataset, DataLoader
    from PIL import Image
    import os
    
    class DuckHuntDataset(Dataset):
def __init__(self, images_dir, labels_dir, transform=None):
        self.images_dir = images_dir
        self.labels_dir = labels_dir
        self.transform = transform
        self.images = os.listdir(images_dir)
      
def __len__(self):
        return len(self.images)
      
def __getitem__(self, idx):
        img_path = os.path.join(self.images_dir, self.images[idx])
        label_path = os.path.join(self.labels_dir, 
                     self.images[idx].replace('.png', '.txt'))
        
        image = Image.open(img_path)
        # Load YOLO annotations
        with open(label_path, 'r') as f:
          labels = f.readlines()
        
        if self.transform:
          image = self.transform(image)
          
        return image, labels
    
    # Usage
    dataset = DuckHuntDataset('images/train', 'labels/train')
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    YOLO Annotation Format

    Each .txt file contains one line per object: class_id center_x center_y width height

Example annotation: 0 0.492 0.403 0.212 0.315, where values are normalized (0-1) relative to image dimensions.
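A small helper for converting one such line into pixel coordinates (a sketch; 256×240 is the native NES frame size):

def yolo_to_pixel_box(line, img_w, img_h):
  # "class_id cx cy w h" (normalized) -> (class_id, x_min, y_min, x_max, y_max) in pixels
  class_id, cx, cy, w, h = line.split()
  cx, w = float(cx) * img_w, float(w) * img_w
  cy, h = float(cy) * img_h, float(h) * img_h
  return int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(yolo_to_pixel_box("0 0.492 0.403 0.212 0.315", 256, 240))
# approximately (0, 98.816, 58.92, 153.088, 134.52)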

    📊 Technical Specifications

    • Image Dimensions: Variable (original sprite sizes preserved)
    • Color Channels: RGB (3 channels)
    • Annotation Precision: Float32 (normalized coordinates)
    • File Naming: Descriptive names indicating class and augmentation type
    • Quality: High-resolution pixel art sprites

    🎮 Dataset Context

    This dataset is based on sprites from the iconic 1984 NES game "Duck Hunt," one of the most recognizable video games in history. The game featured:

    • The Dog: Your hunting companion who retrieves ducks and ...
  13. 3xM 10 10 (RGB-D Instance Seg. for bin-picking)

    • kaggle.com
    zip
    Updated Nov 12, 2024
    + more versions
    Cite
    Tobia Ippolito (2024). 3xM 10 10 (RGB-D Instance Seg. for bin-picking) [Dataset]. https://www.kaggle.com/datasets/tobiaippolito/3xm-10-10
    Explore at:
zip (67215581908 bytes)
    Dataset updated
    Nov 12, 2024
    Authors
    Tobia Ippolito
    License

https://www.gnu.org/licenses/gpl-3.0.html

    Description

    In short

This dataset was used to investigate the influence of the number of unique 3D models (shapes) and materials (textures) on the shape-texture bias, performance, and generalization of deep neural network instance segmentation in my bachelor thesis.

• one of nine datasets created in Unreal Engine 5 with an NVIDIA RTX A4500
• uses 160 unique shapes and 80 unique textures
• RGB, depth, and solution masks are available
• 20,000 scenes
• ready-to-use dataloader, training, and inference; see the next section

    Usage

You can load the images like this:

    import cv2
    
    image = cv2.imread(img_path)
    if image is None:
      raise FileNotFoundError(f"Error during data loading: there is no '{img_path}'")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
    depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)
    if len(depth.shape) > 2:
      _, depth, _, _ = cv2.split(depth)
          
    mask = cv2.imread(mask_path, cv2.IMREAD_UNCHANGED)  # cv2.IMREAD_GRAYSCALE)
    

For ease of use, I recommend my own code. You can use it directly to train Mask R-CNN, or just use the dataloader. Both are shown below:

First, clone my torch GitHub project into your project:

cd ./path/to/your/project
git clone https://github.com/xXAI-botXx/torch-mask-rcnn-instance-segmentation.git

Second, install the anaconda env (optional):

cd ./path/to/your/project
cd ./torch-mask-rcnn-instance-segmentation
conda env create -f conda_env.yml

Third, you are ready to go.

Using only the dataloader for your custom project:

import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import cv2
from torch.utils.data import DataLoader

sys.path.append("./torch-mask-rcnn-instance-segmentation")

from maskrcnn_toolkit import DATA_LOADING_MODE, Dual_Dir_Dataset, collate_fn, extract_and_visualize_mask

data_mode = DATA_LOADING_MODE.ALL

dataset = Dual_Dir_Dataset(img_dir="/path/to/rgb-folder", depth_dir="/path/to/depth-folder",
              mask_dir="/path/to/mask-folder", transform=None, amount=1, start_idx=0,
              end_idx=0, image_name="...", data_mode=data_mode, use_mask=True,
              use_depth=False, log_path="./logs", width=1920, height=1080,
              should_log=True, should_print=True, should_verify=False)
data_loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=4, collate_fn=collate_fn)

# plot
for data in data_loader:
  for batch_idx in range(len(data[0])):
    if len(data) == 3:
      image = data[0][batch_idx].cpu().unsqueeze(0)
      masks = data[1][batch_idx]["masks"]
      masks = masks.cpu()
      name = data[2][batch_idx]
    else:
      image = data[0][batch_idx].cpu().unsqueeze(0)
      name = data[1][batch_idx]

    image = image.cpu().numpy().squeeze(0)
    image = np.transpose(image, (1, 2, 0)) # Convert to HWC

    # Remove 4th channel if existing
    if image.shape[2] == 4:
      depth = image[:, :, 3]
      image = image[:, :, :3]
    else:
      depth = None

    masks_gt = masks.cpu().numpy()
    masks_gt = np.transpose(masks_gt, (1, 2, 0))
    mask = extract_and_visualize_mask(masks_gt, image=None, ax=None, visualize=False, color_map=None, soft_join=False)

    # plot
    cols = 1
    if depth is not None:
      cols += 1
    if mask is not None:
      cols += 1

    fig, ax = plt.subplots(nrows=1, ncols=cols, figsize=(20, 15*cols))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.05, hspace=0.05)

    plot_idx = 0
    ax[plot_idx].imshow(image)
    ax[plot_idx].set_title("RGB Input Image")
    ax[plot_idx].axis("off")

    if depth is not None:
      plot_idx += 1
      ax[plot_idx].imshow(depth, cmap="gray")
      ax[plot_idx].set_title("Depth Input Image")
      ax[plot_idx].axis("off")

    if mask is not None:
      plot_idx += 1
      ax[plot_idx].imshow(mask)
      ax[plot_idx].set_title("Mask Ground Truth")
      ax[plot_idx].axis("off")

    plt.show()

Using the whole Mask R-CNN training pipeline:
    import sys
    sys.path.append("./torch-mask-rcnn-instance-segmentation")
    
    from maskrcnn_toolkit import DATA_LOADING_MODE, train
    
    
    # set the vars as you need
    
    WEIGHTS_PATH = None   # Path to the model weights file
    USE_DEPTH = False      # Whether to include depth information -> as rgb and depth on green channel
    VERIFY_DATA = False     # True is recommended
    
    GROUND_PATH = "D:/3xM"  
    DATASET_NAME = "3xM_Dataset_10_10"
    IMG_DIR = os.path.join(GR...
    
  14. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Explore at:
zip (971125684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    Dataset Contents

The dataset is organized into three standard splits:

• Train set
• Validation set
• Test set

Each split contains data in multiple formats:

1. Original JPG images
2. Segmentation mask JPG images
3. Parquet files containing flattened image and mask data
4. Pickle files containing serialized image and mask data

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
• Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels (see the loading sketch after this list)
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files
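To make the split_at convention concrete, here is a manual loading sketch for the parquet files (assuming pandas and numpy; the provided CatDataset class shown below handles this for you):

import numpy as np
import pandas as pd

image_size, image_channels = (224, 224), 3
split_at = image_size[0] * image_size[1] * image_channels # 150,528 values per image

data = pd.read_parquet("train_dataset.parquet").to_numpy()
images = data[:, :split_at].reshape(-1, 224, 224, 3) # flattened pixels -> images
masks = data[:, split_at:].reshape(-1, 224, 224, 1)  # remaining values -> masks
print(images.shape, masks.shape)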

    Image Preprocessing

All images were preprocessed with the following operations:

• Resized to 224×224 pixels using bilinear interpolation
• Segmentation masks were also resized to match the images, using nearest-neighbor interpolation
• Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

When used with the provided PyTorch dataset class, images are normalized with:

• Mean: [0.48235, 0.45882, 0.40784]
• Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098] (i.e., 1/255)

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

Loading time benchmarks from the original implementation:

• Parquet format: ~1.29 seconds per iteration
• Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from ca...
    
  15. Animals (Cats, Dogs, and Snakes)

    • kaggle.com
    zip
    Updated Nov 18, 2025
    + more versions
    Cite
    Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
    Explore at:
zip (40219983 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Omar Rehan
    Description

    Cats, Dogs, and Snakes Dataset

    Dataset Overview

    The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

Class | Number of Images | Description
Cats | 1,000 | Includes multiple breeds and poses
Dogs | 1,000 | Covers various breeds and backgrounds
Snakes | 1,000 | Includes multiple species and natural settings

    Total Images: 3,000

    Image Properties:

    • Resolution: 224×224 pixels (resized for consistency)
    • Color Mode: RGB
    • Format: JPEG/PNG
    • Cleaned: Duplicate, blurry, and irrelevant images removed

    Data Split Recommendation

Set | Percentage | Number of Images
Training | 70% | 2,100
Validation | 15% | 450
Test | 15% | 450

    Preprocessing

    Images in the dataset have been standardized to support machine learning pipelines:

    1. Resizing to 224×224 pixels.
    2. Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.
    3. Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

    Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Path to dataset
    dataset_path = "path/to/dataset"
    
    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
      rescale=1./255,
      validation_split=0.15 # 15% for validation
    )
    
    # Load training data
    train_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='training',
      shuffle=True
    )
    
    # Load validation data
    validation_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='validation',
      shuffle=False
    )
    
    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
    

    Key Features

    • Balanced: Equal number of samples per class reduces bias.
    • Cleaned: High-quality, relevant images improve model performance.
    • Diverse: Covers multiple breeds, species, and environments to ensure generalization.
    • Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
  16. Face Mask Detection - OpenCV University dataset

    • kaggle.com
    zip
    Updated Jul 20, 2024
    + more versions
    Cite
    Radim Közl (2024). Face Mask Detection - OpenCV University dataset [Dataset]. https://www.kaggle.com/datasets/radimkzl/face-mask-detection-opencv-university-dataset
    Explore at:
zip (225548404 bytes)
    Dataset updated
    Jul 20, 2024
    Authors
    Radim Közl
    License

http://www.gnu.org/licenses/lgpl-3.0.html

    Description

This dataset was created for Computer Vision & Deep Learning Applications at OpenCV University.

It was built for Project 4, the YOLO Face Mask Detector, and is intended for training your own YOLO model.

The dataset contains 2,712 image and annotation files, i.e., 1,356 samples (*.jpg and *.txt pairs).

It is suitable for the PyTorch versions of YOLOv5 and YOLOv8 through YOLOv10. For Colab, use this prebuilt code:

import os
import random
import shutil
import zipfile

if not os.path.exists('face_mask_dataset.zip'):
      !curl -L "https://storage.googleapis.com/kaggle-data-sets/5418712/8996055/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240721%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240721T204520Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=65e5e938f51c2f80f23fbed8b4d5460669729108f266f1977a3d3af260eeea4b7a413c4bd8050e5bba034dde8a7f7fc2fc06a43e48b3e4a43c9dd6a6f0747739e338b3ca89db762dd1797afa4cccc78bead9d39bb85bd86720cbb8d33628b37aeadda551e1394b45faaa93288d385bfbbc9b0b57ac793ed5a53917c1ba1303238a40b599abb9f3063d3a3d34bd289992d58cbf10ecf836242767ec139d24a1e78b9f11d6e897d245163fa1d5d555bffbc06eb60411dcdd28594dd0582bbe09add0fb269565a2f4a714f285ec018c463e01179794185cf5010cba2974fa3cf58ccaa1513c619b0a434707c9b22c958e61b71633540935ee6c1b804d5831002a9a" > face_mask_dataset.zip;
    
# Function to unzip a ZIP file
    def unzip_file(zip_path, extract_to):
      with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
    
    zip_path = './face_mask_dataset.zip'
    os.makedirs('face_mask_dataset', exist_ok=True)
    extract_to = './face_mask_dataset/'
    unzip_file(zip_path, extract_to)
    
    !rm face_mask_dataset.zip
    
    !mkdir train
    !mkdir valid
    !mkdir test
    
    dataset_dir = './face_mask_dataset/'
    train_dir = './train/'
    test_dir = './test/'
    valid_dir = './valid/'
    
    # create subfolder
    for folder in [train_dir, test_dir, valid_dir]:
      os.makedirs(os.path.join(folder, 'images'), exist_ok=True)
      os.makedirs(os.path.join(folder, 'labels'), exist_ok=True)
    
    # Get all image and corresponding description files
    image_files = [f for f in os.listdir(dataset_dir) if f.endswith('.jpg')]
    label_files = [f.replace('.jpg', '.txt') for f in image_files]
    
    # Checking that a corresponding description file exists for each image
    assert all(os.path.isfile(os.path.join(dataset_dir, lbl)) for lbl in label_files), "Some matching descriptor files are missing!"
    
    # Shuffle list of files
    combined = list(zip(image_files, label_files))
    random.shuffle(combined)
    image_files[:], label_files[:] = zip(*combined)
    
    # Split files according to the specified ratio
    total_files = len(image_files)
    train_split = int(0.6 * total_files)
    test_split = int(0.2 * total_files)
    
    train_files = image_files[:train_split]
    train_labels = label_files[:train_split]
    
    test_files = image_files[train_split:train_split + test_split]
    test_labels = label_files[train_split:train_split + test_split]
    
    valid_files = image_files[train_split + test_split:]
    valid_labels = label_files[train_split + test_split:]
    
    # Function to move files to their respective folders
    def move_files(files, labels, dest_dir):
      for img_file, lbl_file in zip(files, labels):
        shutil.move(os.path.join(dataset_dir, img_file), os.path.join(dest_dir, 'images', img_file))
        shutil.move(os.path.join(dataset_dir, lbl_file), os.path.join(dest_dir, 'labels', lbl_file))
    
# Move files to their folders
    move_files(train_files, train_labels, train_dir)
    move_files(test_files, test_labels, test_dir)
    move_files(valid_files, valid_labels, valid_dir)
    
    print(f"Split complete: {len(train_files)} train, {len(test_files)} test, {len(valid_files)} valid.")
    
    !rm -r face_mask_dataset
    

On Kaggle, you can use the prebuilt Face Mask Detection - OpenCV University notebook.
