75 datasets found
  1. Stanford Background Dataset

    • kaggle.com
    zip
    Updated Sep 26, 2020
    + more versions
    Cite
    Balraj Ashwath (2020). Stanford Background Dataset [Dataset]. https://www.kaggle.com/datasets/balraj98/stanford-background-dataset/data
    Explore at:
    zip (17888030 bytes)
    Dataset updated
    Sep 26, 2020
    Authors
    Balraj Ashwath
    Description

    Content

    The Stanford Background Dataset was introduced in Gould et al. (ICCV 2009) for evaluating methods for geometric and semantic scene understanding. The dataset contains 715 images chosen from public datasets: LabelMe, MSRC, PASCAL VOC, and Geometric Context. Images were selected to be outdoor scenes, approximately 320-by-240 pixels, containing at least one foreground object, and with the horizon position within the image (though it need not be visible). Semantic and geometric labels were obtained using Amazon's Mechanical Turk (AMT).

    Acknowledgements

    The dataset is derived from Stanford DAGS Lab's Stanford Background Dataset from their Scene Understanding Datasets page. If you use this dataset in your work, you should reference:

    S. Gould, R. Fulton, D. Koller. Decomposing a Scene into Geometric and Semantically Consistent Regions. Proceedings of the International Conference on Computer Vision (ICCV), 2009.

    Inspiration

    Rapid advances in image understanding using computer vision techniques have produced many state-of-the-art deep learning models across various benchmark datasets in scene understanding. But most of these datasets are large and require several days of training time. Can we train sufficiently accurate scene understanding models using less data? How well do SOTA scene understanding models perform when trained under data constraints?

  2. Replication Package for 'How do Machine Learning Models Change?'

    • zenodo.org
    Updated Nov 13, 2024
    Cite
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández (2024). Replication Package for 'How do Machine Learning Models Change?' [Dataset]. http://doi.org/10.5281/zenodo.14128997
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.

    Our research addresses three main aspects:

    1. Categorization of Commit Changes: We classified over 200,000 commits on HF using an extended ML change taxonomy, providing a detailed breakdown of change types and their distribution across models.
    2. Analysis of Commit Sequences: We examined the sequence and dependencies of commit types using Bayesian networks to identify temporal patterns and common progression paths in model changes (see the sketch after this list).
    3. Release Analysis: We investigated the distribution and evolution of release types, analyzing how model attributes and metadata change across successive releases.
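
    The Bayesian-network idea in point 2 can be illustrated with a small sketch. This is not the paper's actual model; the network structure, column names, and labels below are illustrative assumptions, using the pgmpy library:

    ```python
    import pandas as pd
    from pgmpy.models import BayesianNetwork
    from pgmpy.estimators import MaximumLikelihoodEstimator

    # Toy commit-sequence data: each row pairs a commit type with the type of
    # the commit that preceded it in the same repository (hypothetical labels).
    commits = pd.DataFrame({
        "prev_type": ["model_init", "weights", "weights", "docs", "weights"],
        "curr_type": ["weights", "weights", "docs", "docs", "weights"],
    })

    # A one-edge network: the current commit type depends on the previous one.
    bn = BayesianNetwork([("prev_type", "curr_type")])
    bn.fit(commits, estimator=MaximumLikelihoodEstimator)
    print(bn.get_cpds("curr_type"))  # conditional distribution over next commit type
    ```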

    This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.

    Data Collection and Preprocessing

    Data Collection

    We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:

    • Model Information: Details of over 380,000 models, including dataset sizes, training hardware, evaluation metrics, model file sizes, number of downloads and likes, tags, and the raw text of model cards.
    • Commit Histories: Comprehensive commit details, including commit messages, dates, authors, and the list of files edited in each commit.
    • Release Information: Information on model releases marked by tags in their repositories.

    To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
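
    A minimal sketch of this collection step, assuming the public `huggingface_hub` API (the method names are from the library; the exact fields and filters the authors used are those listed above):

    ```python
    from huggingface_hub import HfApi

    api = HfApi()

    # Model information: id, downloads, likes, tags, and card data
    for model in api.list_models(full=True, cardData=True, limit=10):
        print(model.id, model.downloads, model.likes, model.tags)

    # Commit history for a single model repository
    for commit in api.list_repo_commits("bert-base-uncased"):
        print(commit.commit_id, commit.created_at, commit.title)
    ```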

    Data Preprocessing

    Commit Diffs

    We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
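
    A small sketch of that key-level comparison, assuming each version of `config.json` has already been parsed into a Python dict (function name and example values are illustrative):

    ```python
    import json

    def diff_config(old: dict, new: dict):
        """Report added, deleted, and updated keys between two config.json versions."""
        added = {k: new[k] for k in new.keys() - old.keys()}
        deleted = {k: old[k] for k in old.keys() - new.keys()}
        updated = {k: (old[k], new[k])
                   for k in old.keys() & new.keys() if old[k] != new[k]}
        return added, deleted, updated

    before = json.loads('{"hidden_size": 768, "num_layers": 12}')
    after = json.loads('{"hidden_size": 1024, "num_layers": 12, "dropout": 0.1}')
    print(diff_config(before, after))
    # -> ({'dropout': 0.1}, {}, {'hidden_size': (768, 1024)})
    ```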

    Commit Classification

    We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification plays a smaller role in the paper than the detailed classification using Bhatia et al.'s taxonomy.
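
    A hedged sketch of this classification step, assuming the `google.generativeai` Python SDK and a generic prompt; the authors' actual prompt, label set, and validation pipeline are described in the paper, not here:

    ```python
    import google.generativeai as genai
    from sklearn.metrics import cohen_kappa_score

    genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
    model = genai.GenerativeModel("gemini-1.5-flash")

    def classify_commit(message: str, files_changed: list[str]) -> str:
        # Illustrative prompt; the paper's actual prompt and label set differ.
        prompt = (
            "Classify this Hugging Face commit using Bhatia et al.'s ML change "
            f"taxonomy. Commit message: {message}. Files: {', '.join(files_changed)}. "
            "Reply with a single category name."
        )
        return model.generate_content(prompt).text.strip()

    # Validation idea from the paper: iterate on the prompt until LLM labels
    # agree with manual labels at Cohen's kappa >= 0.9.
    # kappa = cohen_kappa_score(manual_labels, llm_labels)
    ```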

    Model Metadata

    We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.

    Folder Structure

    The replication package is organized as follows:

    - code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.

    • Collection/: Contains two Jupyter notebooks for data collection:
      • HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
      • HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
    • Preprocessing/: Contains preprocessing scripts:
      • HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
      • HFCommitsPreprocessing.ipynb: Processes commit data, including:
        • Retrieval of diff information between commits.
        • Classification of commits following Bhatia et al.'s taxonomy using LLMs.
        • Extension and adaptation of the final commits dataset, including additional variables for Bayesian network analysis.
      • HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
    • Analysis/: Contains three Jupyter notebooks with the analysis for each research question:
      • RQ1_Analysis.ipynb: Analysis for RQ1.
      • RQ2_Analysis.ipynb: Analysis for RQ2.
      • RQ3_Analysis.ipynb: Analysis for RQ3.

    - datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.

    • Main Datasets:
      • HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
      • HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
      • HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
      • model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
      • These datasets correspond to the following dataset splits:
        • +200,000 commits from 50,000 models: Used for RQ1. Provides a broad overview of commit types and patterns across diverse models.
        • +200,000 commits from 10,000 models: Used for RQ2. Focuses on models with at least 10 commits for detailed evolutionary study.
        • +1,200 releases from 127 models: Used for RQ3.1, RQ3.2, and RQ3.3. Facilitates the investigation of release patterns and their evolution.
        • Metadata of 173 releases from 27 models: Used for RQ3.4. Analyzes the evolution of model parameters and configurations.
    • Additional Datasets:
      • HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
      • HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
      • Auxiliary datasets generated during processing are also included to facilitate reproduction of specific parts of the code without time-consuming steps.

    - metadata/: Contains the tags_metadata.yaml file used during preprocessing.

    - models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.

    - requirements.txt: Lists the required Python packages to set up the environment and run the code.

    Setup and Execution

    Prerequisites

    • Python 3.10.11 or later.
    • Jupyter Notebook or JupyterLab.

    Installation

    1. Download and Extract the Replication Package
    2. Create a Virtual Environment (Recommended):
      python -m venv venv
      source venv/bin/activate  # On Windows, use venv\Scripts\activate
    3. Install Required Packages:
      pip install -r requirements.txt

    Notes

    - LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.

    - Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.

    - Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.

    Additional Information

    Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.

    This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.

  3. Open-Source LLM Market Analysis, Size, and Forecast 2025-2029: North America...

    • technavio.com
    pdf
    Updated Jul 10, 2025
    Cite
    Technavio (2025). Open-Source LLM Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/open-source-llm-market-industry-analysis
    Explore at:
    pdf
    Dataset updated
    Jul 10, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Germany, Canada, United Kingdom, United States
    Description


    Open-Source LLM Market Size 2025-2029

    The open-source LLM market is forecast to grow by USD 54 billion at a CAGR of 33.7% from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.

    Market Insights

    North America dominated the market and is expected to account for 37% of growth during 2025-2029.
    By Application - Technology and software segment was valued at USD 4.02 billion in 2023
    By Deployment - On-premises segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 575.60 million
    Market Future Opportunities 2024: USD 53,995.50 million
    CAGR from 2024 to 2029: 33.7%
    

    Market Summary

    The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
    A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
    

    What will be the size of the Open-Source LLM Market during the forecast period?


    The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
    Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
    

    Unpacking the Open-Source LLM Market Landscape

    In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions, result

  4. Data from: Using convolutional neural networks to efficiently extract...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jan 4, 2022
    Cite
    Rachel Reeb; Naeem Aziz; Samuel Lapp; Justin Kitzes; J. Mason Heberling; Sara Kuebbing (2022). Using convolutional neural networks to efficiently extract immense phenological data from community science images [Dataset]. http://doi.org/10.5061/dryad.mkkwh7123
    Explore at:
    zip
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    Carnegie Museum of Natural History
    University of Pittsburgh
    Authors
    Rachel Reeb; Naeem Aziz; Samuel Lapp; Justin Kitzes; J. Mason Heberling; Sara Kuebbing
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Community science image libraries offer a massive, but largely untapped, source of observational data for phenological research. The iNaturalist platform offers a particularly rich archive, containing more than 49 million verifiable, georeferenced, open access images, encompassing seven continents and over 278,000 species. A critical limitation preventing scientists from taking full advantage of this rich data source is labor. Each image must be manually inspected and categorized by phenophase, which is both time-intensive and costly. Consequently, researchers may only be able to use a subset of the total number of images available in the database. While iNaturalist has the potential to yield enough data for high-resolution and spatially extensive studies, it requires more efficient tools for phenological data extraction. A promising solution is automation of the image annotation process using deep learning. Recent innovations in deep learning have made these open-source tools accessible to a general research audience. However, it is unknown whether deep learning tools can accurately and efficiently annotate phenophases in community science images. Here, we train a convolutional neural network (CNN) to annotate images of Alliaria petiolata into distinct phenophases from iNaturalist and compare the performance of the model with non-expert human annotators. We demonstrate that researchers can successfully employ deep learning techniques to extract phenological information from community science images. A CNN classified two-stage phenology (flowering and non-flowering) with 95.9% accuracy and classified four-stage phenology (vegetative, budding, flowering, and fruiting) with 86.4% accuracy. The overall accuracy of the CNN did not differ from humans (p = 0.383), although performance varied across phenophases. We found that a primary challenge of using deep learning for image annotation was not related to the model itself, but instead in the quality of the community science images. Up to 4% of A. petiolata images in iNaturalist were taken from an improper distance, were physically manipulated, or were digitally altered, which limited both human and machine annotators in accurately classifying phenology. Thus, we provide a list of photography guidelines that could be included in community science platforms to inform community scientists in the best practices for creating images that facilitate phenological analysis.

    Methods Creating a training and validation image set

    We downloaded 40,761 research-grade observations of A. petiolata from iNaturalist, ranging from 1995 to 2020. Observations on the iNaturalist platform are considered “research-grade” if the observation is verifiable (includes an image), includes the date and location observed, is growing wild (i.e. not cultivated), and at least two-thirds of community users agree on the species identification. From this dataset, we used a subset of images for model training. The total number of observations in the iNaturalist dataset is heavily skewed towards more recent years. Less than 5% of the images we downloaded (n=1,790) were uploaded between 1995-2016, while over 50% of the images were uploaded in 2020. To mitigate temporal bias, we used all available images between the years 1995 and 2016 and we randomly selected images uploaded between 2017-2020. We restricted the number of randomly-selected images in 2020 by capping the number of 2020 images to approximately the number of 2019 observations in the training set. The annotated observation records are available in the supplement (supplementary data sheet 1). The majority of the unprocessed records (those which hold a CC-BY-NC license) are also available on GBIF.org (2021).

    One of us (R. Reeb) annotated the phenology of training and validation set images using two different classification schemes: two-stage (non-flowering, flowering) and four-stage (vegetative, budding, flowering, fruiting). For the two-stage scheme, we classified 12,277 images and designated images as ‘flowering’ if there was one or more open flowers on the plant. All other images were classified as non-flowering. For the four-stage scheme, we classified 12,758 images. We classified images as ‘vegetative’ if no reproductive parts were present, ‘budding’ if one or more unopened flower buds were present, ‘flowering’ if at least one opened flower was present, and ‘fruiting’ if at least one fully-formed fruit was present (with no remaining flower petals attached at the base). Phenology categories were discrete; if there was more than one type of reproductive organ on the plant, the image was labeled based on the latest phenophase (e.g. if both flowers and fruits were present, the image was classified as fruiting).
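
    The "latest phenophase wins" rule can be written down directly; a tiny sketch (function and argument names are illustrative):

    ```python
    def four_stage_label(has_bud: bool, has_flower: bool, has_fruit: bool) -> str:
        """Label an image by the latest phenophase present (fruiting > flowering > budding)."""
        if has_fruit:
            return "fruiting"
        if has_flower:
            return "flowering"
        if has_bud:
            return "budding"
        return "vegetative"

    # A plant with both open flowers and fully-formed fruits is labeled fruiting.
    assert four_stage_label(has_bud=False, has_flower=True, has_fruit=True) == "fruiting"
    ```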

    For both classification schemes, we only included images in the model training and validation dataset if the image contained one or more plants whose reproductive parts were clearly visible and we could exclude the possibility of a later phenophase. We removed 1.6% of images from the two-stage dataset that did not meet this requirement, leaving us with a total of 12,077 images, and 4.0% of the images from the four-stage dataset, leaving us with a total of 12,237 images. We then split the two-stage and four-stage datasets into a model training dataset (80% of each dataset) and a validation dataset (20% of each dataset).

    Training a two-stage and four-stage CNN

    We adapted techniques from studies applying machine learning to herbarium specimens for use with community science images (Lorieul et al. 2019; Pearson et al. 2020). We used transfer learning to speed up training of the model and reduce the size requirements for our labeled dataset. This approach uses a model that has been pre-trained on a large dataset and so is already competent at basic tasks such as detecting lines and shapes in images. We trained a neural network (ResNet-18) using the PyTorch machine learning library (Paszke et al. 2019) within Python. We chose the ResNet-18 neural network because it has fewer convolutional layers and thus is less computationally intensive than pre-trained neural networks with more layers. In early testing, we reached the desired accuracy with the two-stage model using ResNet-18. ResNet-18 was pre-trained using the ImageNet dataset, which has 1,281,167 training images (Deng et al. 2009). We used default parameters for batch size (4), learning rate (0.001), optimizer (stochastic gradient descent), and loss function (cross entropy loss). Because this led to satisfactory performance, we did not further investigate hyperparameters.

    Because the ImageNet dataset has 1,000 classes while our data was labeled with either 2 or 4 classes, we replaced the final fully-connected layer of the ResNet-18 architecture with fully-connected layers containing an output size of 2 for the 2-class problem and 4 for the 4-class problem. We resized and cropped the images to fit ResNet’s input size of 224x224 pixels and normalized the distribution of the RGB values in each image to a mean of zero and a standard deviation of one, to simplify model calculations. During training, the CNN makes predictions on the labeled data from the training set and calculates a loss parameter that quantifies the model’s inaccuracy. The slope of the loss in relation to model parameters is found and then the model parameters are updated to minimize the loss value. After this training step, model performance is estimated by making predictions on the validation dataset. The model is not updated during this process, so that the validation data remains ‘unseen’ by the model (Rawat and Wang 2017; Tetko et al. 1995). This cycle is repeated until the desired level of accuracy is reached. We trained our model for 25 of these cycles, or epochs. We stopped training at 25 epochs to prevent overfitting, where the model becomes trained too specifically for the training images and begins to lose accuracy on images in the validation dataset (Tetko et al. 1995).
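
    A minimal PyTorch sketch of the setup described above (ResNet-18, replaced final layer, 224x224 inputs, SGD with lr 0.001, cross-entropy, batch size 4); the normalization statistics below are the usual ImageNet values, an assumption on our part:

    ```python
    import torch
    from torch import nn
    from torchvision import models, transforms

    # Pre-trained ResNet-18 with the final fully-connected layer replaced.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 4)  # 4 classes; use 2 for two-stage

    # Resize/crop to 224x224 and normalize (ImageNet statistics assumed).
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

    def train_one_epoch(loader):
        model.train()
        for images, labels in loader:  # batches of size 4 in the setup above
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    ```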

    We evaluated model accuracy and created confusion matrices using the model's predictions on the labeled validation data. This allowed us to evaluate the model's overall accuracy and identify which specific categories are the most difficult for the model to distinguish. To use the model to make phenology predictions on the full, 40,761-image dataset, we created a custom dataloader in PyTorch using the custom Dataset class, which allows loading images listed in a CSV and passing them through the model while keeping them associated with unique image IDs.
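
    A sketch of such a dataloader, assuming the CSV holds an image-ID column and a file-path column (both column names are hypothetical):

    ```python
    import pandas as pd
    from PIL import Image
    from torch.utils.data import DataLoader, Dataset
    from torchvision import transforms

    class CSVImageDataset(Dataset):
        """Loads images listed in a CSV so predictions stay paired with image IDs."""
        def __init__(self, csv_path, transform=None):
            self.frame = pd.read_csv(csv_path)
            self.transform = transform

        def __len__(self):
            return len(self.frame)

        def __getitem__(self, idx):
            row = self.frame.iloc[idx]
            image = Image.open(row["file_path"]).convert("RGB")  # hypothetical column
            if self.transform is not None:
                image = self.transform(image)
            return row["image_id"], image  # hypothetical column

    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    loader = DataLoader(CSVImageDataset("observations.csv", transform=tfm), batch_size=4)
    ```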

    Hardware information

    Model training was conducted using a personal laptop (Ryzen 5 3500U CPU and 8 GB of memory) and a desktop computer (Ryzen 5 3600 CPU, NVIDIA RTX 3070 GPU, and 16 GB of memory).

    Comparing CNN accuracy to human annotation accuracy

    We compared the accuracy of the trained CNN to the accuracy of seven inexperienced human scorers annotating a random subsample of 250 images from the full, 40,761-image dataset. An expert annotator (R. Reeb, who has over a year’s experience in annotating A. petiolata phenology) first classified the subsample images using the four-stage phenology classification scheme (vegetative, budding, flowering, fruiting). Nine images could not be classified for phenology and were removed. Next, seven non-expert annotators classified the 241 subsample images using an identical protocol. This group represented a variety of different levels of familiarity with A. petiolata phenology, ranging from no research experience to extensive research experience (two or more years working with this species). However, no one in the group had substantial experience classifying community science images and all were naïve to the four-stage phenology scoring protocol. The trained CNN was also used to classify the subsample images. We compared human annotation accuracy in each phenophase to the accuracy of the CNN using Student's t-tests.

  5. Cats vs Dogs Redux Transfer Features

    • kaggle.com
    zip
    Updated Aug 22, 2018
    Cite
    Kanwalinder Singh (2018). Cats vs Dogs Redux Transfer Features [Dataset]. https://www.kaggle.com/kanwalinder/cats-vs-dogs-redux-transfer-features
    Explore at:
    zip (1345261572 bytes)
    Dataset updated
    Aug 22, 2018
    Authors
    Kanwalinder Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Most machine learning courses start by implementing a fully-connected Deep Neural Network (DNN) and proceed towards Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), teaching skills for managing training, inference, and deployment along the way. For most beginners, the problem with building DNNs from scratch is that either the input data has to be grossly simplified (working with 64x64x3 images, for example) or the network has so many parameters that it is very hard to train. Meanwhile, Transfer Learning has made building even CNNs and RNNs from scratch unnecessary: one can reuse and/or fine-tune publicly available CNNs like Inception V3 with very little data for a new problem.

    The purpose of this dataset is to make a large dataset of 25000 training examples and 12500 test examples available from the ever popular Dogs vs Cats Redux competition, suitable for students just starting on machine learning. The base dataset, which consists of fairly large image sizes, has been transferred through publicly available CNNs like Inception V3, Inception Resnet V2, Resnet 50, Xception, and MobileNet, creating features that are very easy to build a pretty good DNN classifier with. This should make learning to build DNNs from scratch easy to do, while learning a bit of transfer learning and even "competing" in Dogs vs Cats Redux for kicks!

    Content

    As mentioned, the input data for this dataset are images from the Dogs vs Cats Redux competition. All transfer learning CNN models were obtained from keras.applications. The features derived by processing the input images through the transfer models are flat (25000x2048 training examples and 12500x2048 test examples when using Inception V3) and ready for ingestion into a DNN. In addition, the dataset provides ids from the original training and test examples so classification results can be reviewed against the base data.
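
    A sketch of how such features could be produced with `keras.applications` (the exact preprocessing pipeline used to build this dataset may differ):

    ```python
    import numpy as np
    from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
    from tensorflow.keras.preprocessing import image

    # Headless InceptionV3 with global average pooling yields one 2048-dim
    # feature vector per image.
    base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

    def extract_features(img_path: str) -> np.ndarray:
        img = image.load_img(img_path, target_size=(299, 299))  # InceptionV3 input size
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return base.predict(x)  # shape (1, 2048)
    ```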

    Note that while the classic goal of transfer learning is to apply a network on a smaller dataset and/or fine tune the transferred network on said dataset, the purpose of this dataset is subtly different: make a large dataset available for beginners to build DNNs with. Of course, a subset of the dataset can be used for classification and the base transfer models can be fine tuned.
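
    Building a small DNN classifier on the flat 2048-dimensional features might then look like this; the layer sizes are arbitrary illustrative choices, not part of the dataset:

    ```python
    from tensorflow.keras import layers, models

    clf = models.Sequential([
        layers.Input(shape=(2048,)),            # one row of transfer features
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # cat vs. dog
    ])
    clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # clf.fit(train_features, train_labels, validation_split=0.1, epochs=10)
    ```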

    Acknowledgements

    Francois Chollet's Keras framework, specifically keras.applications.

    Dr. Andrew Ng's deeplearning.ai specialization on Coursera. In my spare time, I mentor students in Coursera's Neural Networks and Deep Learning and Convolutional Neural Networks courses.

    Inspiration

    Initially I am posting just the dataset, and will later post the kernel that produced the dataset and a kernel that will use the dataset to classify for Dogs vs Cats Redux. Can you duplicate the log loss score of 0.21 currently possible with reusing the transfer models with no fine-tuning? Can you get into the top 50 by fine tuning the base models and/or augmenting the input data?

  6. O

    Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • data.openei.org
    • osti.gov
    code, data, website
    Updated Dec 31, 2018
    + more versions
    Cite
    Patrick Emami; Peter Graf (2018). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. http://doi.org/10.25984/1986147
    Explore at:
    code, website, data
    Dataset updated
    Dec 31, 2018
    Dataset provided by
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    Open Energy Data Initiative (OEDI)
    National Renewable Energy Laboratory
    Authors
    Patrick Emami; Peter Graf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BuildingsBench datasets consist of:

    • Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    • 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
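
    A sketch of loading one building's series and framing day-ahead STLF with pandas; the path and column name here are placeholders, not the dataset's documented schema (consult the per-version README below for the actual Parquet layout):

    ```python
    import pandas as pd

    # Placeholder file and column; the real layout is described in the README.
    df = pd.read_parquet("buildings_900k/building_0001.parquet")
    series = df["energy_consumption"].to_numpy()

    # Day-ahead STLF framing: one week of hourly history -> next 24 hours.
    context, horizon = 168, 24
    pairs = [
        (series[i:i + context], series[i + context:i + context + horizon])
        for i in range(0, len(series) - context - horizon, horizon)
    ]
    ```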

    BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1GB, and they are listed out below:

    1. ElectricityLoadDiagrams20112014
    2. Building Data Genome Project-2
    3. Individual household electric power consumption (Sceaux)
    4. Borealis
    5. SMART
    6. IDEAL
    7. Low Carbon London

    A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.

  7. Large-Scale AI Models

    • epoch.ai
    csv
    Updated Aug 15, 2025
    Cite
    Epoch AI (2025). Large-Scale AI Models [Dataset]. https://epoch.ai/data/ai-models
    Explore at:
    csv
    Dataset updated
    Aug 15, 2025
    Dataset authored and provided by
    Epoch AI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Global
    Variables measured
    https://epoch.ai/data/ai-models-documentation
    Measurement technique
    https://epoch.ai/data/ai-models-documentation
    Description

    The Large-Scale AI Models database documents over 200 models trained with more than 10²³ floating point operations, at the leading edge of scale and capabilities.

  8. Distributed Training Platform Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Distributed Training Platform Market Research Report 2033 [Dataset]. https://dataintelo.com/report/distributed-training-platform-market
    Explore at:
    csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Distributed Training Platform Market Outlook



    According to our latest research, the global distributed training platform market size reached USD 2.47 billion in 2024. The market is experiencing robust momentum, driven by the increasing need for scalable and efficient AI model training across industries. With a projected compound annual growth rate (CAGR) of 21.5% from 2025 to 2033, the market is forecasted to reach USD 17.2 billion by 2033. This substantial growth is largely attributed to the rapid adoption of deep learning, the proliferation of large-scale data sets, and the growing complexity of AI models that necessitate distributed training solutions.




    One of the primary growth factors fueling the distributed training platform market is the exponential rise in data generation and the corresponding demand for high-performance computing resources. As organizations increasingly leverage artificial intelligence and machine learning for business intelligence, automation, and customer engagement, the volume and complexity of data sets have surged. Distributed training platforms enable faster and more efficient model training by leveraging multiple computing nodes, significantly reducing the time required to train sophisticated models. This capability is particularly critical in sectors such as healthcare, finance, and autonomous vehicles, where real-time insights and rapid model iteration are essential for maintaining a competitive edge.




    Another significant driver is the widespread adoption of cloud computing and the evolution of hybrid cloud environments. Cloud-based distributed training platforms offer unparalleled scalability, flexibility, and cost-effectiveness, allowing organizations to dynamically allocate resources based on workload demands. The integration of advanced hardware accelerators such as GPUs and TPUs in cloud environments has further enhanced the performance of distributed training systems. Moreover, the emergence of edge computing and federated learning is expanding the applicability of distributed training platforms, enabling organizations to process and analyze data closer to the source while maintaining data privacy and compliance.




    Additionally, the increasing focus on democratizing AI and making advanced machine learning accessible to organizations of all sizes is shaping the distributed training platform market. Small and medium enterprises (SMEs) are now able to harness the power of distributed training through user-friendly platforms and managed services, leveling the playing field with large enterprises. Open-source frameworks and collaborative ecosystems are also accelerating innovation and reducing the barriers to entry for organizations seeking to implement distributed AI training. The ecosystem is further enriched by partnerships between technology providers, academic institutions, and industry consortia, fostering the development of standardized protocols and best practices.




    From a regional perspective, North America continues to dominate the distributed training platform market, owing to its advanced technological infrastructure, strong presence of leading AI companies, and significant investments in research and development. Asia Pacific is emerging as a high-growth region, driven by rapid digital transformation, expanding cloud adoption, and government initiatives supporting AI innovation. Europe is also witnessing substantial growth, particularly in sectors such as automotive, manufacturing, and healthcare, where distributed training platforms are enabling breakthroughs in automation and predictive analytics. Latin America and Middle East & Africa, while currently representing smaller shares, are expected to experience steady growth as digitalization efforts accelerate and access to cloud-based AI solutions improves.



    Component Analysis



    The distributed training platform market is segmented by component into software, hardware, and services. The software segment remains the largest contributor, accounting for a significant portion of the market revenue in 2024. This dominance is driven by the proliferation of advanced distributed training frameworks such as TensorFlow, PyTorch, and Horovod, which facilitate seamless scaling and management of AI workloads across multiple nodes. These software platforms offer robust orchestration, fault tolerance, and optimization capabilities, making them indispensable for organizations seeking to accelerate AI model development and deploy

  9. Face Features Test Dataset

    • universe.roboflow.com
    zip
    Updated Dec 6, 2021
    Cite
    Peter Lin (2021). Face Features Test Dataset [Dataset]. https://universe.roboflow.com/peter-lin/face-features-test/dataset/1
    Explore at:
    zip
    Dataset updated
    Dec 6, 2021
    Dataset authored and provided by
    Peter Lin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Face Features Bounding Boxes
    Description

    A simple dataset for benchmarking CreateML object detection models. The images are sampled from the COCO dataset, with eyes and nose bounding boxes added. It’s not meant to be serious or useful in a real application; the purpose is to look at how long it takes to train CreateML models with varying dataset and batch sizes.

    Training performance is affected by model configuration, dataset size, and batch configuration. Larger models and batches require more memory. I used a CreateML object detection project to compare performance.

    Hardware

    M1 Macbook Air
    * 8 GPU
    * 4/4 CPU
    * 16G memory
    * 512G SSD

    M1 Max Macbook Pro
    * 24 GPU
    * 2/8 CPU
    * 32G memory
    * 2T SSD

    Small Dataset (Train: 144, Valid: 16, Test: 8)

    Results

    |batch | M1 ET | M1Max ET | peak mem G |
    |------|:------|:---------|:-----------|
    |16 | 16 | 11 | 1.5 |
    |32 | 29 | 17 | 2.8 |
    |64 | 56 | 30 | 5.4 |
    |128 | 170 | 57 | 12 |

    Larger Dataset (Train: 301, Valid: 29, Test: 18)

    Results

    |batch | M1 ET | M1Max ET | peak mem G |
    |------|:------|:---------|:-----------|
    |16 | 21 | 10 | 1.5 |
    |32 | 42 | 17 | 3.5 |
    |64 | 85 | 30 | 8.4 |
    |128 | 281 | 54 | 16.5 |

    CreateML Settings

    For all tests, training was set to Full Network. I closed CreateML between each run to make sure memory issues didn't cause a slowdown. There is a bug with Monterey as of 11/2021 that leads to a memory leak. I kept an eye on the memory usage; if it looked like there was a memory leak, I restarted MacOS.

    Observations

    In general, the MBP's extra GPU cores and memory reduce training time, and more memory lets you train with larger datasets. On the M1 Macbook Air, the practical limit is 12G before memory pressure impacts performance; on the M1 Max MBP, the practical limit is 26G. To work around memory pressure, use smaller batch sizes.

    On the larger dataset with batch size 128, the M1 Max is 5x faster than the Macbook Air. Keep in mind that a real dataset should have thousands of samples, like COCO or Pascal. Ideally, you want a dataset with 100K images for experimentation and millions for the real training. The new M1 Max Macbook Pro is a cost-effective alternative to building a Windows/Linux workstation with an RTX 3090 24G. For most of 2021, the price of an RTX 3090 with 24G was around $3,000.00, which means an equivalent Windows workstation would cost about the same as the M1 Max Macbook Pro I used to run the benchmarks.

    Full Network vs Transfer Learning

    As of CreateML 3, training with full network doesn't fully utilize the GPU. I don't know why it works that way; you have to select transfer learning to fully use the GPU. The table below shows the results of transfer learning with the larger dataset. In general, the training time is faster and loss is better.

    |batch | ET min | Train Acc | Val Acc | Test Acc | Top IU Train | Top IU Valid | Top IU Test | Peak mem G | loss |
    |------|:-------|:----------|:--------|:---------|:-------------|:-------------|:------------|:-----------|:-----|
    |16 | 4 | 75 | 19 | 12 | 78 | 23 | 13 | 1.5 | 0.41 |
    |32 | 8 | 75 | 21 | 10 | 78 | 26 | 11 | 2.76 | 0.02 |
    |64 | 13 | 75 | 23 | 8 | 78 | 24 | 9 | 5.3 | 0.017 |
    |128 | 25 | 75 | 22 | 13 | 78 | 25 | 14 | 8.4 | 0.012 |

    Github Project

    The source code and full results are up on Github https://github.com/woolfel/createmlbench

  10. Data from: Exploring Human-Like Mathematical Reasoning: Perspectives on...

    • curate.nd.edu
    pdf
    Updated Dec 3, 2024
    Cite
    Zhenwen Liang (2024). Exploring Human-Like Mathematical Reasoning: Perspectives on Generalizability and Efficiency [Dataset]. http://doi.org/10.7274/27895872.v1
    Explore at:
    pdf
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    Zhenwen Liang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Mathematical reasoning, a fundamental aspect of human cognition, poses significant challenges for artificial intelligence (AI) systems. Despite recent advancements in natural language processing (NLP) and large language models (LLMs), AI's ability to replicate human-like reasoning, generalization, and efficiency remains an ongoing research challenge. In this dissertation, we address key limitations in math word problem (MWP) solving, focusing on the accuracy, generalization ability, and efficiency of AI-based mathematical reasoners by applying human-like reasoning methods and principles.

    This dissertation introduces several innovative approaches in mathematical reasoning. First, a numeracy-driven framework is proposed to enhance MWP solvers by integrating numerical reasoning into model training, surpassing human-level performance on benchmark datasets. Second, a novel multi-solution framework captures the diversity of valid solutions to math problems, improving the generalization capabilities of AI models. Third, a customized knowledge distillation technique, termed Customized Exercise for Math Learning (CEMAL), is developed to create tailored exercises for smaller models, significantly improving their efficiency and accuracy in solving MWPs. Additionally, a multi-view fine-tuning paradigm (MinT) is introduced to enable smaller models to handle diverse annotation styles from different datasets, improving their adaptability and generalization. To further advance mathematical reasoning, a benchmark, MathChat, is introduced to evaluate large language models (LLMs) in multi-turn reasoning and instruction-following tasks, demonstrating significant performance improvements. Finally, new inference-time verifiers, Math-Rev and Code-Rev, are developed to enhance reasoning verification, combining language-based and code-based solutions for improved accuracy in both math and code reasoning tasks.

    In summary, this dissertation provides a comprehensive exploration of these challenges and contributes novel solutions that push the boundaries of AI-driven mathematical reasoning. Potential future research directions are also discussed to further extend the impact of this dissertation.

  11. EyeOnWater training dataset for assessing the inclusion of water images

    • data.europa.eu
    • zenodo.org
    unknown
    + more versions
    Cite
    Zenodo, EyeOnWater training dataset for assessing the inclusion of water images [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10777441?locale=no
    Explore at:
    unknown (2062095307 bytes)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training dataset

    The EyeOnWater app is designed to assess the ocean's water quality using images captured by regular citizens. To provide an extra helping hand in determining whether an image meets the criteria for inclusion in the app, the YOLOv8 model for image classification is employed. All uploaded pictures are assessed with this model; if the model deems a water image unsuitable, it is excluded from the app's online database. Training this model requires a dataset containing a large pool of different images. The training dataset includes 12,357 'good' and 10,019 'bad' water quality images that were submitted to the EyeOnWater app.

    Technical details

    Data preprocessing: To create a larger training dataset, the set of original images (1,700 in total) was augmented by rotating, displacing, and resizing them (see the augmentation sketch at the end of this description), using the following settings:

    • Maximum rotation of 45 degrees in both directions
    • Maximum displacement of 20% of the width or height
    • Horizontal and vertical flip
    • Maximum shear range of 20% of the width
    • Pixel range of 10 units

    Data splitting: The training dataset is used 80% for training, 10% for validation, and 10% for prediction.

    Classes, labels and annotations: The training dataset contains 2 classes, labeled 'good' and 'bad'. The 'good' images are water images suited to determining the water quality using the Forel-Ule scale. The 'bad' images may include, for example, too much water reflection, a visible bottom surface, or objects, or may not include water at all.

    Parameters: The water quality is obtained from an image by comparing the water color to the 21 colors of the Forel-Ule scale. Parameter: http://vocab.nerc.ac.uk/collection/P01/current/CLFORULE/

    Data sources: The images are taken by citizen scientists, often with a smartphone.

    Data quality: Because the images are taken with smartphones, image quality can be low. In addition, the images are taken outside, in a non-confined space, so bad lighting, reflections, and other problems can occur. The images therefore need to be checked before they can be included in the app.

    Image resolution: Larger images are resized to 256 by 256 pixels; smaller images are excluded from the training dataset.

    Spatial coverage: Images are taken on a global scale.

    Contact information: For more information on the training dataset and/or the app, contact tjerk@maris.nl.
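
    The augmentation settings above map naturally onto a standard image augmenter. A sketch using Keras's ImageDataGenerator, which is a tooling assumption rather than the project's documented pipeline:

    ```python
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rotation_range=45,          # up to 45 degrees in both directions
        width_shift_range=0.2,      # displacement up to 20% of width
        height_shift_range=0.2,     # ... and of height
        horizontal_flip=True,
        vertical_flip=True,
        shear_range=0.2,            # approximate; Keras expresses shear as an angle
        channel_shift_range=10,     # "pixel range of 10 units"
    )

    # Hypothetical directory layout with 'good' and 'bad' subfolders.
    batches = augmenter.flow_from_directory(
        "eyeonwater/train", target_size=(256, 256), class_mode="binary", batch_size=32
    )
    ```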

  12. Data from: Papyrus - A large scale curated dataset aimed at bioactivity...

    • data.4tu.nl
    • figshare.com
    zip
    Updated Oct 29, 2021
    + more versions
    Cite
    Olivier Béquignon; Brandon Bongers; W. (Willem) Jespers; Adriaan P. IJzerman; Bob van de Water; Gerard JP Van westen (2021). Papyrus - A large scale curated dataset aimed at bioactivity predictions [Dataset]. http://doi.org/10.4121/16896406.v2
    Explore at:
    zip
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Olivier Béquignon; Brandon Bongers; W. (Willem) Jespers; Adriaan P. IJzerman; Bob van de Water; Gerard JP Van westen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    European Commission
    Description

    This repository contains the Papyrus dataset, an aggregated dataset of small molecule bioactivities, as described in the manuscript "Papyrus - A large scale curated dataset aimed at bioactivity predictions" (Work in Progress).

    With the recent rapid growth of publicly available ligand-protein bioactivity data, there is a trove of viable data that can be used to train machine learning algorithms. However, not all data is equal in terms of size and quality, and a significant portion of researchers' time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge in its own right. In answer to that, we have constructed the Papyrus dataset, comprising around 60 million datapoints. This dataset combines multiple large publicly available datasets, such as ChEMBL and ExCAPE-DB, with smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how the data can be filtered in a variety of ways, and we also perform some rudimentary quantitative structure-activity relationship and proteochemometrics modeling. Our ambition is to create a benchmark set that can be used for constructing predictive models, while also providing a solid baseline for related research.

  13. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Keshavarz, Hossein; Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
    Authors
    Keshavarz, Hossein; Nagappan, Meiyappan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper was submitted to the MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data. (A loading sketch follows this list.)
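
    A minimal sketch of consuming these splits; only the file names and the buggy column come from the description above, and the random-forest baseline is an illustrative choice, not the paper's model:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("dataset/apachejit_train.csv")
    test = pd.read_csv("dataset/apachejit_test_large.csv")

    # Keep numeric commit metrics as features; `buggy` is the label.
    X_train = train.drop(columns=["buggy"]).select_dtypes("number")
    y_train = train["buggy"]
    X_test = test[X_train.columns]
    y_test = test["buggy"]

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))
    ```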

    In addition to the dataset, we also provide the scripts used to build it. These scripts are written in Python 3.8; therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11; for other languages, external tools are needed. An installation guide and more details can be found here.

    The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).
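
    For orientation, a minimal sketch of the kind of commit mining these scripts perform might look as follows (assuming PyDriller 2.x; the repository URL and metrics are illustrative, not the exact ones used here):

    from pydriller import Repository

    # Iterate over commits and compute simple size metrics per commit.
    for commit in Repository("https://github.com/apache/kafka").traverse_commits():
        lines_added = sum(f.added_lines for f in commit.modified_files)
        files_changed = len(commit.modified_files)
        print(commit.hash, files_changed, lines_added)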

    References:

    1. GumTree
       Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden, September 15-19, 2014. 313–324.

    2. PyDriller
       https://pydriller.readthedocs.io/en/latest/
       Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.

  14. Data from: Subsurface Characterization and Machine Learning Predictions at...

    • catalog.data.gov
    • gdr.openei.org
    • +5more
    Updated Jan 20, 2025
    Cite
    National Renewable Energy Laboratory (2025). Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs Results [Dataset]. https://catalog.data.gov/dataset/subsurface-characterization-and-machine-learning-predictions-at-brady-hot-springs-results-6c85f
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    Geothermal power plants typically show decreasing heat and power production rates over time. Mitigation strategies include optimizing the management of existing wells - increasing or decreasing the fluid flow rates across the wells - and drilling new wells at appropriate locations. The latter is expensive, time-consuming, and subject to many engineering constraints, but the former is a viable mechanism for periodic adjustment of the available fluid allocations.

    This entry provides data and supporting literature from a study describing a new approach that combines reservoir modeling and machine learning to mitigate these declining production rates. The computational approach translates sets of potential flow rates for the active wells into reservoir-wide estimates of produced energy and discovers optimal flow allocations among the studied sets. In our computational experiments, we utilize collections of simulations for a specific reservoir (which capture subsurface characterization and realize history matching) along with machine learning models that predict temperature and pressure timeseries for production wells. We evaluate this approach using an "open-source" reservoir we have constructed that captures many of the characteristics of Brady Hot Springs, a commercially operational geothermal field in Nevada, USA. Selected results from a reservoir model of Brady Hot Springs itself are presented to show successful application to an existing system. In both cases, energy predictions prove to be highly accurate: observed prediction errors do not exceed 3.68% for temperatures and 4.75% for pressures, and cumulative energy estimation errors are below 4.04%. A typical reservoir simulation for Brady Hot Springs completes in approximately 4 hours, whereas our machine learning models yield accurate 20-year predictions for temperatures, pressures, and produced energy in 0.9 seconds. This paper aims to demonstrate how the models and techniques from our study can be applied to achieve rapid exploration of controlled parameters and optimization of other geothermal reservoirs.

    The entry includes a synthetic, yet realistic, model of a geothermal reservoir, referred to as the open-source reservoir (OSR). OSR is a 10-well (4 injection wells and 6 production wells) system that resembles Brady Hot Springs at a high level but has a number of sufficiently modified characteristics (so any similarity in specific characteristics such as temperatures and pressures is purely coincidental). We study OSR through CMG simulations with a wide range of flow allocation scenarios. Also included are a dataset with 101 simulated scenarios covering the period between 2020 and 2040, a link to the published paper about this project (which focuses on the machine learning work for predicting OSR's energy production from the simulation data), and a link to the GitHub repository with the code we have developed (please refer to the repository's readme file for instructions on how to run it). Additional links point to associated work led by the USGS to identify geologic factors associated with well productivity in geothermal fields.

    Below are the high-level steps for applying the same modeling + ML process to other geothermal reservoirs:

    1. Develop a geologic model of the geothermal field. The location of faults, upflow zones, aquifers, etc. needs to be accounted for as accurately as possible.
    2. Convert the geologic model to a reservoir model that can be used in a reservoir simulator such as CMG STARS, TETRAD, or FALCON.
    3. Using native state modeling, evaluate the initial temperature and pressure distributions; these become the initial conditions for dynamic reservoir simulations.
    4. Using history matching with tracers and available production data, tune the model to represent the subsurface reservoir as accurately as possible.
    5. Run a large number of simulations using the history-matched reservoir model. Each simulation assumes a different wellbore flow rate allocation across the injection and production wells, where the individual flow rates do not violate the practical constraints of the corresponding wells.
    6. Train ML models on the simulation data; a sketch follows this list, and the code in our GitHub repository demonstrates how these models can be trained and evaluated.
    7. Use the trained ML models to evaluate a large set of candidate flow allocations and select the most optimal ones, i.e., those producing the largest amounts of thermal energy over the modeled period. The referenced paper provides more details about this optimization process.
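
    As a rough illustration of step 6, a minimal scikit-learn sketch is given below; the file name, feature layout (one flow-rate column per well), and target column are assumptions for illustration, not the project's actual code (see the GitHub repository for that):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical layout: one flow-rate column per well plus a produced-energy target.
    df = pd.read_csv("osr_simulations.csv")
    X = df[[f"well_{i}_flow" for i in range(1, 11)]]  # 4 injection + 6 production wells
    y = df["produced_energy"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    print(f"R^2 on held-out scenarios: {model.score(X_test, y_test):.3f}")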

  15. VGG16 ImageNet Weights: Boost Your CV Models

    • kaggle.com
    zip
    Updated Aug 31, 2024
    Cite
    Evil Spirit05 (2024). VGG16 ImageNet Weights: Boost Your CV Models [Dataset]. https://www.kaggle.com/datasets/evilspirit05/vgg16-title/code
    Explore at:
    zip (54730430 bytes)
    Available download formats
    Dataset updated
    Aug 31, 2024
    Authors
    Evil Spirit05
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description
    The file vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5 contains pre-trained weights for the VGG16 convolutional neural network architecture, specifically designed for TensorFlow and Keras frameworks. This file is a crucial resource for researchers and practitioners in the field of deep learning, particularly those working on computer vision tasks.
    

    What is VGG16?

    VGG16 is a convolutional neural network architecture proposed by Karen Simonyan and Andrew Zisserman from the University of Oxford in their 2014 paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". This network achieved top results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, demonstrating exceptional performance in image classification tasks.
    

    Contents of the Weights File

    The vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5 file contains:

    • Pre-trained weights for all convolutional layers of the VGG16 network.
    • Weights for the max-pooling layers.
    • The file does NOT include weights for the top (fully connected) layers, as indicated by "notop" in the filename.

    Key Features

    • TensorFlow Compatibility: The weights are specifically formatted for use with TensorFlow and Keras, as indicated by "tf_dim_ordering" in the filename.
    • Transfer Learning Ready: By excluding the top layers, this file is ideal for transfer learning applications where you want to use VGG16 as a feature extractor or fine-tune it for your specific task.
    • Keras Integration: The .h5 format allows for easy loading into Keras models using the load_weights() function.
    • Pretrained on ImageNet: These weights are the result of training on the vast ImageNet dataset, capturing a rich set of features useful for a wide range of computer vision tasks.

    Use Cases

    • Feature Extraction: Use the pre-trained layers as a fixed feature extractor for your own image datasets.
    • Transfer Learning: Fine-tune the model on your specific dataset, potentially achieving high performance with less training data.
    • Baseline Model: Utilize as a strong baseline for computer vision tasks such as image classification, object detection, or semantic segmentation.
    • Comparative Studies: Use in research to compare against newer architectures or as part of ensemble models.

    How to Use

    Here's a basic example of how to use these weights in a Keras model:

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    num_classes = 10  # set to the number of classes in your task

    # Load the VGG16 model without top layers
    base_model = VGG16(weights='path/to/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5',
                       include_top=False,
                       input_shape=(224, 224, 3))

    # Add your own top layers
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(1024, activation='relu')(x)
    predictions = Dense(num_classes, activation='softmax')(x)

    # Create your new model
    model = Model(inputs=base_model.input, outputs=predictions)
    

    Benefits for Your Projects

    • Reduced Training Time: Start with pre-learned features, significantly reducing the time needed to train your models.
    • Improved Generalization: Leverage features learned from a diverse and large-scale dataset (ImageNet), potentially improving your model's ability to generalize.
    • Resource Efficiency: Achieve high performance even with limited computational resources or smaller datasets.
    • Flexibility: Easily adapt the VGG16 architecture to various image-related tasks beyond simple classification.

    File Details

    • Size: Approximately 58.89 MB
    • Format: HDF5 (.h5)
    • Compatibility: TensorFlow 2.x, Keras
    • Source: Usually downloaded from official Keras repositories

    Ethical Considerations

    When using these weights, be aware of potential biases inherent in the ImageNet dataset. Consider the ethical implications and potential biases in your specific application.
    By incorporating this weights file into your projects, you're building upon years of research and development in deep learning for computer vision. It's an excellent starting point for many image-related tasks and can significantly boost the performance of your models.
    
  16. Data from: Input Files and Code for: Machine learning can accurately assign...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Oct 29, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters [Dataset]. https://catalog.data.gov/dataset/input-files-and-code-for-machine-learning-can-accurately-assign-geologic-basin-to-produced
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    As more hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for the treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and their geologic environment that could be used to accurately assign a water sample to a given geologic province. Two datasets were used: one with fewer attributes (n = 7) but more samples (n = 58,541), named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271), named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. The datasets were divided into training and test sets using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on PWGD9, suggesting that a larger sample size and/or fewer attributes lead to a more successful predictive algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity, while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
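
    A minimal scikit-learn sketch of the PWGD9 workflow described above (90/10 split, extratrees-style forest with mtry = 5) might look as follows; the column names are assumptions, and ExtraTreesClassifier only approximates the R ranger configuration used in the study:

    import pandas as pd
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split

    ATTRS = ["specific_gravity", "pH", "HCO3", "Na", "Mg", "Ca", "Cl", "SO4", "TDS"]

    # Hypothetical column names for a PWGD9-style table.
    df = pd.read_csv("pwgd9.csv").dropna(subset=ATTRS + ["province"])
    X_train, X_test, y_train, y_test = train_test_split(
        df[ATTRS], df["province"], test_size=0.1, stratify=df["province"], random_state=0
    )

    # max_features=5 loosely mirrors the mtry = 5 setting reported above.
    clf = ExtraTreesClassifier(n_estimators=500, max_features=5, random_state=0)
    clf.fit(X_train, y_train)
    print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")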

  17. AI Lip Sync Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). AI Lip Sync Market Research Report 2033 [Dataset]. https://dataintelo.com/report/ai-lip-sync-market
    Explore at:
    pdf, pptx, csv
    Available download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Lip Sync Market Outlook



    According to our latest research, the global AI Lip Sync market size reached USD 412.4 million in 2024 and is projected to grow at a robust CAGR of 23.1% from 2025 to 2033. By the end of the forecast period, the market is expected to achieve a value of USD 2,494.7 million in 2033. This significant growth is largely driven by the increasing adoption of AI technologies in content creation, the surge in demand for hyper-realistic animation, and the proliferation of digital media platforms worldwide. As per our latest research, the AI lip sync market is witnessing rapid transformation, with advancements in deep learning and neural network-based solutions enabling more accurate and efficient lip-syncing across various applications.




    The primary growth factor for the AI lip sync market is the exponential increase in demand for automated, high-quality content production in the media and entertainment industry. With the rise of streaming platforms, animation studios, and video game developers, there is a growing need for tools that can streamline the traditionally labor-intensive process of lip synchronization. AI-powered lip sync solutions not only reduce production time and costs but also enhance the realism and emotional impact of digital characters. This technology is particularly valuable in dubbing for international audiences, where accurate lip movements are essential for viewer immersion. As content consumption continues to globalize, the industry is increasingly relying on AI lip sync to localize content efficiently and effectively.




    Another major driver is the integration of AI lip sync technology in emerging applications such as virtual reality (VR), augmented reality (AR), and social media. The immersive nature of VR and AR experiences demands lifelike avatars and characters that can interact with users in real time. AI lip sync bridges the gap between audio input and visual output, enabling seamless avatar communication that enhances user engagement. On social media platforms, content creators are leveraging AI lip sync tools to produce engaging short-form videos, deepfakes, and interactive content at scale. The democratization of these tools has empowered independent creators and smaller studios to compete with larger production houses, further fueling market expansion.




    The evolution of AI algorithms and the increasing availability of large datasets for training deep learning models have significantly improved the accuracy and efficiency of AI lip sync solutions. Innovations in neural rendering, generative adversarial networks (GANs), and speech-to-face models have enabled the creation of highly realistic and expressive facial animations. These technological advancements are also making AI lip sync more accessible and affordable, paving the way for adoption across a broader range of industries beyond entertainment, such as education, advertising, and corporate training. As the technology matures, we expect to see even greater integration with real-time communication platforms and interactive applications.




    From a regional perspective, North America currently leads the AI lip sync market, driven by the presence of major technology companies, a vibrant entertainment industry, and high digital adoption rates. However, the Asia Pacific region is emerging as a significant growth engine, with countries like China, Japan, and South Korea investing heavily in AI research and digital content production. Europe is also witnessing substantial adoption, particularly in the gaming and advertising sectors. The Middle East & Africa and Latin America are expected to post steady growth as digital infrastructure improves and local content production accelerates. Overall, the global AI lip sync market is poised for dynamic growth, with regional markets contributing to a diverse and competitive landscape.



    Component Analysis



    The AI lip sync market is segmented by component into software, hardware, and services, each playing a critical role in the overall value chain. Software forms the backbone of the market, accounting for the largest share due to the proliferation of advanced AI algorithms, deep learning models, and user-friendly interfaces designed for content creators. These software solutions offer a range of functionalities, from real-time facial animation to automated dubbing and lip movement correction, making them indispensable for studios and independent creators alike. The continuou

  18. Summary of datasets used in the study.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Kazi Rafat; Sadia Islam; Abdullah Al Mahfug; Md. Ismail Hossain; Fuad Rahman; Sifat Momen; Shafin Rahman; Nabeel Mohammed (2023). Summary of datasets used in the study. [Dataset]. http://doi.org/10.1371/journal.pone.0285668.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kazi Rafat; Sadia Islam; Abdullah Al Mahfug; Md. Ismail Hossain; Fuad Rahman; Sifat Momen; Shafin Rahman; Nabeel Mohammed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Deep learning techniques have recently demonstrated remarkable success in numerous domains. Typically, the success of these deep learning models is measured in terms of performance metrics such as accuracy and mean average precision (mAP). Generally, a model’s high performance is highly valued, but it frequently comes at the expense of substantial energy costs and carbon footprint emissions during the model building step. Massive emissions of CO2 have a deleterious impact on life on earth in general and are a serious ethical concern that is largely ignored in deep learning research. In this article, we mainly focus on environmental costs and the means of mitigating carbon footprints in deep learning models, with a particular focus on models created using knowledge distillation (KD). Deep learning models typically contain a large number of parameters, resulting in a ‘heavy’ model. A heavy model scores high on performance metrics but is incompatible with mobile and edge computing devices. Model compression techniques such as knowledge distillation enable the creation of lightweight, deployable models for these low-resource devices. KD generates lighter models and typically performs with slightly less accuracy than the heavier teacher model (model accuracy by the teacher model on CIFAR 10, CIFAR 100, and TinyImageNet is 95.04%, 76.03%, and 63.39%; model accuracy by KD is 91.78%, 69.7%, and 60.49%). Although the distillation process makes models deployable on low-resource devices, they were found to consume an exorbitant amount of energy and have a substantial carbon footprint (15.8, 17.9, and 13.5 times more carbon compared to the corresponding teacher model). The enormous environmental cost is primarily attributable to the tuning of the temperature hyperparameter (τ). In this article, we propose measuring the environmental costs of deep learning work (in terms of GFLOPS in millions, energy consumption in kWh, and CO2 equivalent in grams). In order to create lightweight models with low environmental costs, we propose a straightforward yet effective method for selecting the hyperparameter (τ) using a stochastic approach for each training batch fed into the models. We applied knowledge distillation (including its data-free variant) to problems involving image classification and object detection. To evaluate the robustness of our method, we ran experiments on various datasets (CIFAR 10, CIFAR 100, Tiny ImageNet, and PASCAL VOC) and models (ResNet18, MobileNetV2, Wrn-40-2). Our novel approach reduces the environmental costs by a large margin by eliminating the requirement of expensive hyperparameter tuning without sacrificing performance. Empirical results on the CIFAR 10 dataset show that the stochastic technique achieves an accuracy of 91.67%, whereas tuning achieves an accuracy of 91.78%; however, the stochastic approach reduces the energy consumption and CO2 equivalent each by a factor of 19. Similar results have been obtained with the CIFAR 100 and TinyImageNet datasets. This pattern is also observed in object detection on the PASCAL VOC dataset, where the tuning technique performs similarly to the stochastic technique, with a difference of 0.03% mAP favoring the stochastic technique, while reducing energy consumption and CO2 emissions each by a factor of 18.5.
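
    To make the stochastic temperature selection concrete, a minimal PyTorch sketch of a KD loss that samples τ once per training batch is given below; this is an illustrative reconstruction under assumed settings (α weighting, temperature range), not the authors' code:

    import torch
    import torch.nn.functional as F

    def kd_loss_stochastic(student_logits, teacher_logits, labels,
                           alpha=0.9, tau_range=(1.0, 10.0)):
        # Sample the distillation temperature once per batch instead of tuning it.
        tau = torch.empty(1).uniform_(*tau_range).item()
        soft_targets = F.softmax(teacher_logits / tau, dim=1)
        log_student = F.log_softmax(student_logits / tau, dim=1)
        # Standard KD objective: tau^2-scaled KL term plus a hard-label term.
        distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * tau ** 2
        hard = F.cross_entropy(student_logits, labels)
        return alpha * distill + (1 - alpha) * hard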

  19. YouTube-8M Dataset

    • berd-platform.de
    Updated Jul 31, 2025
    Cite
    Sami Abu-El-Haija; Nisarg Kothari; Joonseok Lee; Paul Natsev; George Toderici; Balakrishnan Varadarajan; Sudheendra Vijayanarasimhan (2025). YouTube-8M Dataset [Dataset]. http://doi.org/10.82939/3m5k1-zhn69
    Explore at:
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    YouTube (http://youtube.com/)
    Authors
    Sami Abu-El-Haija; Nisarg Kothari; Joonseok Lee; Paul Natsev; George Toderici; Balakrishnan Varadarajan; Sudheendra Vijayanarasimhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 1, 2019
    Description

    YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs with high-quality machine-generated and partially human-verified annotations from a diverse vocabulary of 3,800+ visual entities.

    It comprises two subsets:

    • 8M Segments Dataset: 230K human-verified segment labels, 1000 classes, 5 segments/video
    • 8M Dataset (May 2018 version, current): 6.1M videos, 3862 classes, 3.0 labels/video, 2.6B audio-visual features

    The dataset comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to train a strong baseline model on this dataset in less than a day on a single GPU! At the same time, the dataset's scale and diversity enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

    YouTube offers the YouTube-8M dataset for download as TensorFlow Record files on their website. Starter code for the dataset can be found on their GitHub page.
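
    A minimal TensorFlow sketch for reading the video-level records is shown below; the feature names and sizes ('mean_rgb' of length 1024, 'mean_audio' of length 128) reflect the commonly documented video-level schema but should be verified against the starter code:

    import tensorflow as tf

    # Commonly documented video-level feature schema (verify against starter code).
    FEATURES = {
        "id": tf.io.FixedLenFeature([], tf.string),
        "labels": tf.io.VarLenFeature(tf.int64),
        "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
        "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
    }

    def parse(example_proto):
        return tf.io.parse_single_example(example_proto, FEATURES)

    dataset = tf.data.TFRecordDataset(["train0000.tfrecord"]).map(parse)
    for example in dataset.take(1):
        print(example["id"], tf.sparse.to_dense(example["labels"]))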

  20. Ai Powered Video Generator Market Report | Global Forecast From 2025 To 2033...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    + more versions
    Cite
    Dataintelo (2025). Ai Powered Video Generator Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/ai-powered-video-generator-market
    Explore at:
    csv, pptx, pdf
    Available download formats
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Powered Video Generator Market Outlook



    The global AI-powered video generator market size was valued at approximately USD 1.5 billion in 2023 and is forecast to reach around USD 8.7 billion by 2032, growing at a robust compound annual growth rate (CAGR) of 21.7% over the forecast period. This remarkable growth can be attributed to the increasing demand for automated video content production across various sectors and continuous advancements in AI technology.



    One of the primary growth factors driving the AI-powered video generator market is the burgeoning need for high-quality video content. As businesses across industries increasingly rely on video for marketing, training, and customer engagement, there is a significant demand for tools that can automate video production without compromising on quality. AI-powered video generators provide an efficient and cost-effective solution, enabling companies to produce professional-grade videos quickly and at scale.



    Another significant driver is the rapid adoption of artificial intelligence and machine learning technologies across various sectors. With advancements in AI algorithms and the availability of massive datasets, AI-powered video generators can now create highly customized and dynamic content. These tools are capable of understanding context, recognizing patterns, and adapting to specific requirements, making them invaluable for personalized video marketing, virtual training sessions, and other applications.



    The growing popularity of video content on social media platforms and the increasing consumption of video on digital channels also contribute to the market's expansion. Platforms like YouTube, TikTok, and Instagram have seen exponential growth in video viewership, prompting brands and influencers to produce more video content. AI-powered video generators help meet this demand by streamlining the content creation process, allowing users to focus more on creativity and strategy rather than the technical aspects of video production.



    AI-Powered Video Analytics is emerging as a transformative force within the video content industry, offering enhanced capabilities for understanding and interpreting video data. By leveraging advanced AI algorithms, these analytics tools can automatically detect and analyze patterns, behaviors, and events within video footage. This capability is particularly beneficial for sectors such as security, retail, and sports, where real-time insights from video data can drive decision-making and operational efficiency. As the demand for intelligent video solutions grows, AI-powered video analytics is set to play a crucial role in optimizing content delivery and enhancing viewer experiences.



    Regionally, North America is expected to dominate the AI-powered video generator market during the forecast period, driven by the early adoption of advanced technologies and the presence of key market players. The Asia Pacific region is also anticipated to witness significant growth, owing to the increasing digitalization efforts and rising demand for video content in emerging economies like China and India. Europe and Latin America are expected to see steady growth, fueled by technological advancements and the growing importance of video in marketing and communication strategies.



    Component Analysis



    In the AI-powered video generator market, the component segment is broadly categorized into software, hardware, and services. Each component plays a crucial role in the functionality and performance of AI video generation systems, catering to various needs and preferences of end-users.



    The software segment is expected to hold the largest market share, driven by the continuous advancements in AI algorithms and machine learning models. Software solutions for AI video generation encompass a wide range of functionalities, including video editing, motion graphics, special effects, and content personalization. Companies are investing heavily in research and development to enhance the capabilities of their software, making it more intuitive and user-friendly. The integration of cloud-based services also adds to the flexibility and scalability of software solutions, allowing users to access advanced features without significant upfront investments.



    The hardware segment, though smaller than software, is critical for the optimal performance of AI video generators. High-performance GPUs, specialized pro
