100+ datasets found
  1. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The dataset is made available on request. If you are interested in trying out this dataset, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor 2^(k/4), with k being an integer in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
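
    For reference, the numeric scale codes in the file names (0p500, 0p595, ..., 2p000) are the factors 2^(k/4) rounded to three decimals, as the short Python check below illustrates (a minimal sketch, not part of the official loading instructions):

    # Scale factors 2^(k/4) for k = -4..4; these match the scte codes in the file names
    factors = [2 ** (k / 4) for k in range(-4, 5)]
    print([f"{x:.3f}" for x in factors])
    # ['0.500', '0.595', '0.707', '0.841', '1.000', '1.189', '1.414', '1.682', '2.000']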

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which of these exist depends on the data split contained in the file.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    # File name refers to the scale-1 training file listed above
    with h5py.File('fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    # Example: the scale-2 test file from the list above; any of the 9 test files can be used
    with h5py.File('fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5', 'r') as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    % Example: the scale-2 test file from the list above
    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5', '/x_test');
    y_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
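
    If the data are to be fed to a PyTorch model, a minimal sketch (assuming the arrays have been loaded and permuted as above) for normalising the intensities to [0, 1] and wrapping them in a DataLoader could look as follows:

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # Normalise pixel intensities from [0, 255] to [0, 1]
    x_train_t = torch.from_numpy(x_train / 255.0).float()
    y_train_t = torch.from_numpy(y_train).long()

    # Wrap into a standard PyTorch dataset/loader
    train_loader = DataLoader(TensorDataset(x_train_t, y_train_t), batch_size=64, shuffle=True)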

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  2. A Large Scale Fish Dataset

    • kaggle.com
    zip
    Updated Apr 28, 2021
    Cite
    Oğuzhan Ulucan (2021). A Large Scale Fish Dataset [Dataset]. https://www.kaggle.com/datasets/crowww/a-large-scale-fish-dataset/discussion
    Explore at:
    Available download formats: zip (3482438117 bytes)
    Dataset updated
    Apr 28, 2021
    Authors
    Oğuzhan Ulucan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Large-Scale Dataset for Segmentation and Classification

    Authors: O. Ulucan, D. Karakaya, M. Turkan
    Department of Electrical and Electronics Engineering, Izmir University of Economics, Izmir, Turkey
    Corresponding author: M. Turkan
    Contact Information: mehmet.turkan@ieu.edu.tr

    General Introduction

    This dataset contains 9 different seafood types collected from a supermarket in Izmir, Turkey, for a university-industry collaboration project at Izmir University of Economics; this work was published in ASYU 2020. The dataset includes image samples of gilt head bream, red sea bream, sea bass, red mullet, horse mackerel, black sea sprat, striped red mullet, trout, and shrimp.

    If you use this dataset in your work, please consider citing:

    @inproceedings{ulucan2020large,
      title={A Large-Scale Dataset for Fish Segmentation and Classification},
      author={Ulucan, Oguzhan and Karakaya, Diclehan and Turkan, Mehmet},
      booktitle={2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
      pages={1--5},
      year={2020},
      organization={IEEE}
    }

    • O. Ulucan, D. Karakaya, and M. Turkan (2020) A large-scale dataset for fish segmentation and classification. In Conf. Innovations Intell. Syst. Appli. (ASYU)

    Purpose of the work

    This dataset was collected in order to carry out segmentation, feature extraction, and classification tasks and to compare common segmentation, feature extraction, and classification algorithms (Semantic Segmentation, Convolutional Neural Networks, Bag of Features). All of the experimental results demonstrate the usability of our dataset for the purposes mentioned above.

    Data Gathering Equipment and Data Augmentation

    Images were collected via 2 different cameras, a Kodak Easyshare Z650 and a Samsung ST60; the resolutions of the images are 2832 x 2128 and 1024 x 768, respectively.

    Before the segmentation, feature extraction, and classification process, the dataset was resized to 590 x 445 by preserving the aspect ratio. After resizing the images, all labels in the dataset were augmented (by flipping and rotating).

    At the end of the augmentation process, the number of total images for each class became 2000; 1000 for the RGB fish images and 1000 for their pair-wise ground truth labels.
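
    As an illustration only (not the authors' original pipeline), resizing and flip/rotation augmentation of this kind can be reproduced in Python with Pillow; the file name below is hypothetical and the target size follows the description above:

    from PIL import Image

    # Hypothetical input file; the dataset images are stored as PNG files
    img = Image.open("00001.png")

    # Resize to 590 x 445 as described above
    img_small = img.resize((590, 445))

    # Simple augmentations: horizontal flip and 90-degree rotation
    img_flipped = img_small.transpose(Image.FLIP_LEFT_RIGHT)
    img_rotated = img_small.rotate(90, expand=True)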

    Description of the dataset

    The dataset contains 9 different seafood types. For each class, there are 1000 augmented images and their pair-wise augmented ground truths. Each class can be found in the "Fish_Dataset" file with their ground truth labels. All images for each class are ordered from "00000.png" to "01000.png".

    For example, if you want to access the ground truth images of the shrimp in the dataset, the folder path to follow is "Fish -> Shrimp -> Shrimp GT".

  3. Clustering of samples and variables with mixed-type data

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.

  4. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The dataset is made available on request. If you are interested in trying out this dataset, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order for all test images to have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
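
    For intuition only (a sketch, not the authors' generation code), mirror extension of a 32×32 image to 64×64 can be expressed with numpy's symmetric padding; the pad widths below are an assumption chosen to centre the original image:

    import numpy as np

    # img: a (32, 32, 3) uint8 CIFAR-10 image (placeholder array here)
    img = np.zeros((32, 32, 3), dtype=np.uint8)

    # Pad 16 pixels on each side by mirroring the image content, giving a 64x64x3 image
    img_extended = np.pad(img, ((16, 16), (16, 16), (0, 0)), mode='symmetric')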

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor 2^(k/4), with k being an integer in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which of these exist depends on the data split contained in the file.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    # File name refers to the scale-1 training file listed above
    with h5py.File('cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5', 'r') as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    # Example: the scale-2 test file from the list above; any of the 9 test files can be used
    with h5py.File('cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5', 'r') as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    % Example: the scale-2 test file from the list above
    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5', '/x_test');
    y_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

  5. Geospatial Data Pack for Visualization

    • kaggle.com
    zip
    Updated Oct 21, 2025
    Cite
    Vega Datasets (2025). Geospatial Data Pack for Visualization [Dataset]. https://www.kaggle.com/datasets/vega-datasets/geospatial-data-pack
    Explore at:
    Available download formats: zip (1422109 bytes)
    Dataset updated
    Oct 21, 2025
    Dataset authored and provided by
    Vega Datasets
    Description

    Geospatial Data Pack for Visualization 🗺️

    Learn Geographic Mapping with Altair, Vega-Lite and Vega using Curated Datasets

    Complete geographic and geophysical data collection for mapping and visualization. This consolidation includes 18 complementary datasets used by 31+ Vega, Vega-Lite, and Altair examples 📊. Perfect for learning geographic visualization techniques including projections, choropleths, point maps, vector fields, and interactive displays.

    Source data lives on GitHub and can also be accessed via CDN. The vega-datasets project serves as a common repository for example datasets used across these visualization libraries and related projects.

    Why Use This Dataset? 🤔

    • Comprehensive Geospatial Types: Explore a variety of core geospatial data models:
      • Vector Data: Includes points (like airports.csv), lines (like londonTubeLines.json), and polygons (like us-10m.json).
      • Raster-like Data: Work with gridded datasets (like windvectors.csv, annual-precip.json).
    • Diverse Formats: Gain experience with standard and efficient geospatial formats like GeoJSON (see Table 1, 2, 4), compressed TopoJSON (see Table 1), and plain CSV/TSV (see Table 2, 3, 4) for point data and attribute tables ready for joining.
    • Multi-Scale Coverage: Practice visualization across different geographic scales, from global and national (Table 1, 4) down to the city level (Table 1).
    • Rich Thematic Mapping: Includes multiple datasets (Table 3) specifically designed for joining attributes to geographic boundaries (like states or counties from Table 1) to create insightful choropleth maps.
    • Ready-to-Use & Example-Driven: Cleaned datasets tightly integrated with 31+ official examples (see Appendix) from Altair, Vega-Lite, and Vega, allowing you to immediately practice techniques like projections, point maps, network maps, and interactive displays.
    • Python Friendly: Works seamlessly with essential Python libraries like Altair (which can directly read TopoJSON/GeoJSON), Pandas, and GeoPandas, fitting perfectly into the Kaggle notebook environment.
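
    As a quick, hedged illustration of that workflow (file name and column names are assumed to match the vega-datasets version of airports.csv; adjust as needed), a simple Altair point map could look like this:

    import altair as alt
    import pandas as pd

    # Assumes airports.csv from this pack has been downloaded to the working directory
    airports = pd.read_csv("airports.csv")

    # Plot each airport as a point, projected onto a US map projection
    chart = (
        alt.Chart(airports)
        .mark_circle(size=10)
        .encode(longitude="longitude:Q", latitude="latitude:Q", tooltip=["iata:N", "state:N"])
        .project(type="albersUsa")
        .properties(width=500, height=300)
    )
    chart.save("airports_map.html")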

    Table of Contents

    Dataset Inventory 🗂️

    This pack includes 18 datasets covering base maps, reference points, statistical data for choropleths, and geophysical data.

    1. BASE MAP BOUNDARIES (Topological Data)

    | Dataset | File | Size | Format | License | Description | Key Fields / Join Info |
    |---|---|---|---|---|---|---|
    | US Map (1:10m) | us-10m.json | 627 KB | TopoJSON | CC-BY-4.0 | US state and county boundaries. Contains states and counties objects. Ideal for choropleths. | id (FIPS code) property on geometries |
    | World Map (1:110m) | world-110m.json | 117 KB | TopoJSON | CC-BY-4.0 | World country boundaries. Contains countries object. Suitable for world-scale viz. | id property on geometries |
    | London Boroughs | londonBoroughs.json | 14 KB | TopoJSON | CC-BY-4.0 | London borough boundaries. | properties.BOROUGHN (name) |
    | London Centroids | londonCentroids.json | 2 KB | GeoJSON | CC-BY-4.0 | Center points for London boroughs. | properties.id, properties.name |
    | London Tube Lines | londonTubeLines.json | 78 KB | GeoJSON | CC-BY-4.0 | London Underground network lines. | properties.name, properties.color |

    2. GEOGRAPHIC REFERENCE POINTS (Point Data) 📍

    | Dataset | File | Size | Format | License | Description | Key Fields / Join Info |
    |---|---|---|---|---|---|---|
    | US Airports | airports.csv | 205 KB | CSV | Public Domain | US airports with codes and coordinates. | iata, state, `l... |
  6. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 4, 2024
    + more versions
    Cite
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957401
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    EQT
    Authors
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

    Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set contains 76 distinct target companies, each of which has 5.3 annotated competitors on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.

    Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

    Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in the private equity (PE) industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies that are similar to the seed companies can then be retrieved in the embedding space using distance metrics such as cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
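
    As a minimal sketch of that embedding-space search (purely illustrative; the array names, shapes, and random data below are assumptions, not part of the CompanyKG release):

    import numpy as np

    # embeddings: one row per company (e.g. node embeddings such as mSBERT); seed: one seed company's embedding
    embeddings = np.random.rand(1000, 384).astype(np.float32)
    seed = embeddings[0]

    # Cosine similarity of the seed against all companies, then the 10 most similar (excluding the seed itself)
    sims = embeddings @ seed / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(seed))
    top10 = np.argsort(-sims)[1:11]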

    However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial: https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  7. Translation Application Evaluation Dataset

    • kaggle.com
    zip
    Updated Nov 22, 2024
    Cite
    DatasetEngineer (2024). Translation Application Evaluation Dataset [Dataset]. https://www.kaggle.com/datasets/datasetengineer/translation-application-evaluation-dataset
    Explore at:
    Available download formats: zip (16759014 bytes)
    Dataset updated
    Nov 22, 2024
    Authors
    DatasetEngineer
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The dataset is named "Multilingual Translation Application Performance Dataset". It contains information on the performance of various translation applications, capturing key metrics for evaluation and analysis. The data is generated from a survey that spans multiple regions and features translation tools used in diverse languages.

    Description: The dataset consists of 202,000 records, each corresponding to a translation request. The records include various features that evaluate the performance and usability of translation applications, offering valuable insights into the effectiveness of these tools.

    Features:

    • Application Name: The name of the translation application (e.g., Google Translate, DeepL, etc.).
    • Source Language: The language from which the text is translated (e.g., French, Spanish, German, etc.).
    • Translation Quality Score: A human-assessed score (on a scale of 3 to 10) indicating the overall quality of the translation.
    • Fluency Rating: A rating (on a scale of 3 to 10) reflecting how fluent and natural the translated text is.
    • Accuracy Rating: A score (on a scale of 4 to 10) representing how accurately the translated text conveys the original meaning.
    • Grammar Score: A score (on a scale of 4 to 10) assessing the grammatical correctness of the translation.
    • Translation Speed (seconds): The time taken to complete the translation in seconds.
    • Cost Efficiency: An indicator of how cost-effective each translation application is (e.g., Low, Medium, High).
    • User Experience Rating: A rating (on a scale of 3 to 10) representing user feedback on the application's ease of use and overall experience.
    • Device Compatibility: The types of devices compatible with the translation tool (e.g., Mobile, Desktop, Both).
    • File Format Support: The types of file formats supported by the application (e.g., PDF, DOC).
    • Offline Availability: Indicates whether the translation tool supports offline functionality.
    • Customization Features: The availability of customizable features, such as industry-specific glossaries.
    • Cultural Nuance: A score that indicates the application's ability to handle culturally-specific expressions and idioms.
    • Customer Support: Rating of the application's customer support services.
    • Updates Frequency: The frequency at which translation algorithms are updated.
    • Multilingual Translation Capability: The number of languages supported by the application.
    • Security and Privacy Rating: Assessment of the application's data privacy and security.
    • Overall Evaluation Score: An aggregate score that reflects the overall performance of the translation application.

    This dataset is ideal for machine learning models focused on the classification, prediction, and evaluation of translation application performance. It can be used for comparative analysis across various tools and features, offering valuable insights into the effectiveness of multilingual translation services.

  8. Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014

    • data.gov.au
    • researchdata.edu.au
    • +1more
    zip
    Updated Nov 20, 2019
    Cite
    Bioregional Assessment Program (2019). Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014 [Dataset]. https://data.gov.au/data/dataset/6f72f73c-8a61-4ae9-b8b5-3f67ec918826
    Explore at:
    Available download formats: zip (100753883)
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

    This dataset is the most current national compilation of catchment scale land use data for Australia (CLUM), as at March 2014. It is a seamless raster dataset that combines land use data for all state and territory jurisdictions, compiled at a resolution of 50 metres by 50 metres. It has been compiled from vector land use datasets collected as part of state and territory mapping programs through the Australian Collaborative Land Use and Management Program (ACLUMP). Catchment scale land use data was produced by combining land tenure and other types of land use information, fine-scale satellite data and information collected in the field. The date of mapping (1997 to 2012) and scale of mapping (1:25 000 to 1:250 000) vary, reflecting the source data capture date and scale. This information is provided in a supporting polygon dataset.

    The CLUM data shows a single dominant land use for a given area, based on the primary management objective of the land manager (as identified by state and territory agencies). Land use is classified according to the Australian Land Use and Management (ALUM) Classification version 7, a three-tiered hierarchical structure. There are five primary classes, identified in order of increasing levels of intervention or potential impact on the natural landscape. Water is included separately as a sixth primary class. Primary and secondary levels relate to the principal land use. Tertiary classes may include additional information on commodity groups, specific commodities, land management practices or vegetation information. The primary, secondary and tertiary codes work together to provide increasing levels of detail about the land use. Land may be subject to a number of concurrent land uses. For example, while the main management objective of a multiple-use production forest may be timber production, it may also provide conservation, recreation, grazing and water catchment land uses. In these cases, production forestry is commonly identified in the ALUM code as the prime land use.

    The operational scales of catchment scale mapping vary according to the intensity of land use activities and landscape context. Scales range from 1:10 000 and 1:25 000 for irrigated and peri-urban areas, to 1:100 000 for broadacre cropping regions and 1:250 000 for the semi-arid and arid pastoral zone. The date of mapping generally reflects the intensity of land use. The most current mapping occurs in intensive agricultural areas; older mapping generally occurs in the semi-arid and pastoral zones. The primary classes of land use in the ALUM Classification are:

    Conservation and natural environments-land used primarily for conservation purposes, based on maintaining the essentially natural ecosystems present;

    Production from relatively natural environments-land used mainly for primary production with limited change to the native vegetation;

    Production from dryland agriculture and plantations-land used mainly for primary production based on dryland farming systems;

    Production from irrigated agriculture and plantations-land used mostly for primary production based on irrigated farming;

    Intensive uses-land subject to extensive modification, generally in association with closer residential settlement, commercial or industrial uses;

    Water-water features (water is regarded as an essential aspect of the classification, even though it is primarily a land cover type, not a land use).

    The following areas have been updated since the November 2012 release: the entire state of Victoria; Queensland natural resource management regions Border Rivers-Maranoa, Condamine, South East Queensland (part), and South West Queensland.

    Purpose

    Land use information is critical to developing sustainable long-term solutions for natural resource management, and is used to underpin investment decisions. Users include local government, catchment authorities, emergency services, quarantine and pest management authorities, industry and community groups. Landscape processes involving soils and water generally operate at catchment scale. Land use information at catchment scale therefore has an important role to play in developing effective solutions to Australia's natural resource management issues.

    Dataset History

    Lineage:

    ABARES has produced this raster dataset from vector catchment scale land use data provided by state and territory agencies, as follows: Land Use: New South Wales (2009); Land Use Mapping of the Northern Territory 2008 (LUMP 2008); Land use mapping - Queensland current (January 2014); Land Use South Australia 2008; Tasmanian Summer 2009/2010 Land Use; Victorian Land Use Information System (VLUIS) 2010 version 4; Land Use in Western Australia, Version 5, (1997); and, Land Use in Western Australia, v7 (2008). Links to land use mapping datasets and metadata are available at the ACLUMP data download page at http://www.daff.gov.au/abares/aclump/pages/land-use/data-download.aspx State and territory vector catchment scale land use data were produced by combining land tenure and other types of land use information, fine-scale satellite data and information collected in the field, as outlined in the document 'Guidelines for land use mapping in Australia: principles, procedures and definitions, Edition 4'. Specifically, the attributes adhere to the ALUM classification, version 7. For Victoria, ABARES converted the VLUIS vector data to the ALUM classification, based on an agreed method using Valuer General Victoria land use codes, land cover and land tenure information. This method has been updated since the previous release. All contributing polygon datasets were gridded by ABARES on the ALUM code and mosaiced to minimise resampling errors. NODATA voids in Sydney, Adelaide and parts of the Australian Capital Territory were filled with Australian Bureau of Statistics Mesh blocks land use attributes with modifications based on: 1:250 000 scale topographic data for built up areas from GEODATA TOPO 250K Series 3 (Geoscience Australia 2006); land tenure data from Tenure of Australia's Forests (ABARES 2008); and, native and plantation forest data from Forests of Australia (ABARES 2008). All other NODATA voids were filled using data from Land Use of Australia, Version 4, 2005/2006 (ABARES 2010).

    Land use mapped should be regarded as a REPRESENTATION of land use only. The CLUM data shows a single dominant land use for each area mapped, even if multiple land uses occur within that area. The CLUM data is produced from datasets compiled for various dates from 1997 to 2012 and at various scales from 1:25 000 to 1:2 500 000.

    Dataset Citation

    Australian Bureau of Agricultural and Resource Economics and Sciences (2014) Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/6f72f73c-8a61-4ae9-b8b5-3f67ec918826.

  9. THVD (Talking Head Video Dataset)

    • data.mendeley.com
    Updated Apr 29, 2025
    + more versions
    Cite
    Mario Peedor (2025). THVD (Talking Head Video Dataset) [Dataset]. http://doi.org/10.17632/ykhw8r7bfx.2
    Explore at:
    Dataset updated
    Apr 29, 2025
    Authors
    Mario Peedor
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    About

    We provide a comprehensive talking-head video dataset with over 50,000 videos, totaling more than 500 hours of footage and featuring 20,841 unique identities from around the world.

    Distribution

    Detailing the format, size, and structure of the dataset:

    Data Volume:

    -Total Size: 2.7TB

    -Total Videos: 47,547

    -Identities Covered: 20,841

    -Resolution: 60% 4k(1980), 33% fullHD(1080)

    -Formats: MP4

    -Full-length videos with visible mouth movements in every frame.

    -Minimum face size of 400 pixels.

    -Video durations range from 20 seconds to 5 minutes.

    -Faces have not been cut out, full screen videos including backgrounds.

    Usage

    This dataset is ideal for a variety of applications:

    Face Recognition & Verification: Training and benchmarking facial recognition models.

    Action Recognition: Identifying human activities and behaviors.

    Re-Identification (Re-ID): Tracking identities across different videos and environments.

    Deepfake Detection: Developing methods to detect manipulated videos.

    Generative AI: Training high-resolution video generation models.

    Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.

    Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.

    Coverage

    Explaining the scope and coverage of the dataset:

    Geographic Coverage: Worldwide

    Time Range: Time range and size of the videos have been noted in the CSV file.

    Demographics: Includes information about age, gender, ethnicity, format, resolution, and file size.

    Languages Covered (Videos):

    English: 23,038 videos

    Portuguese: 1,346 videos

    Spanish: 677 videos

    Norwegian: 1,266 videos

    Swedish: 1,056 videos

    Korean: 848 videos

    Polish: 1,807 videos

    Indonesian: 1,163 videos

    French: 1,102 videos

    German: 1,276 videos

    Japanese: 1,433 videos

    Dutch: 1,666 videos

    Indian: 1,163 videos

    Czech: 590 videos

    Chinese: 685 videos

    Italian: 975 videos

    Philipeans: 920 videos

    Bulgaria: 340 videos

    Romanian: 1144 videos

    Arabic: 1691 videos

    Who Can Use It

    List examples of intended users and their use cases:

    Data Scientists: Training machine learning models for video-based AI applications.

    Researchers: Studying human behavior, facial analysis, or video AI advancements.

    Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.

    Additional Notes

    Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Potential preprocessing (cropping, downsampling) may be needed for different use cases. The dataset is not yet complete and expands daily; please contact us for the most up-to-date CSV file. The dataset has been divided into 100GB zipped files and is hosted on a private server (with the option to upload to the cloud if needed). To verify the dataset's quality, please contact me for the full CSV file.

  10. The global industrial value-added dataset under different global change scenarios (2010, 2030, and 2050)

    • scidb.cn
    Updated Aug 6, 2024
    Cite
    Song Wei; li huan huan; Duan Jianping; Li Han; Xue Qian; Zhang Xuyang (2024). The global industrial value-added dataset under different global change scenarios (2010, 2030, and 2050) [Dataset]. http://doi.org/10.57760/sciencedb.11406
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Song Wei; li huan huan; Duan Jianping; Li Han; Xue Qian; Zhang Xuyang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Temporal Coverage of Data: The data collection periods are 2010, 2030, and 2050.
    2. Spatial Coverage and Projection:
      Spatial Coverage: Global
      Longitude: -180° - 180°
      Latitude: -90° - 90°
      Projection: GCS_WGS_1984
    3. Disciplinary Scope: The data pertains to the fields of Earth Sciences and Geography.
    4. Data Volume: The total data volume is approximately 31.5 MB.
    5. Data Type: Raster (GeoTIFF)
    6. Thumbnail (illustrating dataset content or observation process/scene)
    7. Field (Feature) Name Explanation:
      a. Name Explanation: IND: Industrial Value Added
      b. Unit of Measurement: US Dollars (USD)
    8. Data Source Description:
      a. Remote Sensing Data: 2010 Global Vegetation Index data (Enhanced Vegetation Index, EVI, from MODIS monthly average data) and 2010 Nighttime Light Remote Sensing data (DMSP/OLS)
      b. Meteorological Data: From the CMCC-CM model in the Fifth International Coupled Model Intercomparison Project (CMIP5) published by the United Nations Intergovernmental Panel on Climate Change (IPCC)
      c. Statistical Data: From the World Development Indicators dataset of the World Bank and various national statistical agencies
      d. Gross Domestic Product Data: Sourced from the project "Study on the Harmful Processes of Population and Economic Systems under Global Change" under the National Key R&D Program "Mechanisms and Assessment of Risks in Population and Economic Systems under Global Change," led by Researcher Sun Fubao at the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences
      e. Other Data: Rivers, roads, settlements, and DEM, sourced from the National Oceanic and Atmospheric Administration (NOAA), Global Risk Data Platform, and Natural Earth
    9. Data Processing Methods:
      (1) Spatialization of Baseline Industrial Value Added: Using 2010 global EVI vegetation index data and nighttime light remote sensing data, we addressed the oversaturation issue in nighttime light data by constructing an adjusted nighttime light index to obtain the optimal global light data. The EANTLI model was developed using NTL, NTLn, and EVI data, where EANTLI represents the adjusted nighttime light index, NTL the original nighttime light intensity value, and NTLn the normalized nighttime light intensity value. Based on the optimal light index EANTLI and the industrial value-added data from the World Bank, we constructed a regression allocation model to derive industrial value added (I), generating the global 2010 industrial value-added data, where I represents the industrial value added for each grid cell and Ii the industrial value added for each country, with the EANTLI values derived from ArcGIS statistical analysis and the regression allocation model.
      (2) Spatial Boundaries for Future Industrial Value Added: Using the Logistic-CA-Markov simulation principle and global land use data from 2010 and 2015 (from the European Space Agency), we simulated national land use changes for 2030 and 2050 and extracted urban land data as the spatial boundaries for future industrial value added. To comprehensively characterize the influence of different factors on land use, and considering the research scale, we selected elevation, slope, population, GDP, distance to rivers, and distance to roads as land use driving factors. Accuracy validation using global 2015 land use data showed an average accuracy of 91.89%.
      (3) Estimation of Future Industrial Value Added: Based on machine learning and using the random forest model, we constructed spatialization models for industrial value added under different climate change scenarios, where tem represents temperature, prep precipitation, GDP national economic output, L urban land, D slope, and P population. The random forest model was constructed using factors such as 2010 industrial value added, urban land distribution, elevation, slope, distances to rivers, roads, railways (considering transportation), and settlements (considering noise and environmental pollution from industrial buildings), along with temperature and precipitation as climate scenario data. Except for varying temperature and precipitation values across scenarios, other variables remained constant. The model comprised 100 decision trees, with each iteration randomly selecting 90% of the samples for model construction and using the remaining 10% as test data, achieving a training sample accuracy of 0.94 and a test sample accuracy of 0.81. By analyzing the proportion of industrial value added to GDP (average from 2000 to 2020, data from the World Bank) and projected GDP under future Shared Socioeconomic Pathways (SSPs), we derived future industrial value added for each country under different SSP scenarios. Using these projections, we constructed regression models to allocate future industrial value added proportionally, resulting in spatial distribution data for 2030 and 2050 under different SSP scenarios.
    10. Applications and Achievements of the Dataset:
      a. Primary Application Areas: This dataset is mainly applied in environmental protection, ecological construction, pollution prevention and control, and the prevention and forecasting of natural disasters.
      b. Achievements in Application (Awards, Published Reports and Articles): Developed a method for downscaling national-scale industrial value-added data by integrating DMSP/OLS nighttime light data, vegetation distribution, and other data. Published the global industrial value-added dataset.
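
    For readers who want a feel for the kind of model described in step (3) above, here is a minimal, hypothetical sketch with scikit-learn (it is not the authors' code; the feature table and target are random placeholders, and only the 100-tree, 90/10 split setup follows the description):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # X: placeholder feature table (e.g. urban land, elevation, slope, distances, temperature, precipitation)
    # y: placeholder target (industrial value added per grid cell)
    X = np.random.rand(5000, 8)
    y = np.random.rand(5000)

    # 90% of samples for model construction, 10% held out as test data, 100 decision trees
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(model.score(X_train, y_train), model.score(X_test, y_test))  # R^2 on train / test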
  11. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild

    • zenodo.org
    • data.europa.eu
    zip
    Updated Oct 20, 2022
    + more versions
    Cite
    Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
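
    For example (a minimal sketch; the file name below is a placeholder for whichever daily or hourly CSV you downloaded):

    import pandas as pd

    # Placeholder file name; substitute the actual daily/hourly CSV from the dataset
    df = pd.read_csv("fitbit_daily.csv")
    print(df.head())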

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB, pointing it at the corresponding dump directory. Ensure you have the MongoDB Database Tools installed.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit <path-to-fitbit-dump>

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema <path-to-sema-dump>

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  12. Fused Image dataset for convolutional neural Network-based crack Detection...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 20, 2023
    Cite
    Shanglian Zhou; Carlos Canchila; Wei Song (2023). Fused Image dataset for convolutional neural Network-based crack Detection (FIND) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6383043
    Explore at:
    Dataset updated
    Apr 20, 2023
    Authors
    Shanglian Zhou; Carlos Canchila; Wei Song
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The “Fused Image dataset for convolutional neural Network-based crack Detection” (FIND) is a large-scale image dataset with pixel-level ground truth crack data for deep learning-based crack segmentation analysis. It features four types of image data including raw intensity image, raw range (i.e., elevation) image, filtered range image, and fused raw image. The FIND dataset consists of 2500 image patches (dimension: 256x256 pixels) and their ground truth crack maps for each of the four data types.

    The images contained in this dataset were collected from multiple bridge decks and roadways under real-world conditions. A laser scanning device was adopted for data acquisition such that the captured raw intensity and raw range images have pixel-to-pixel location correspondence (i.e., spatial co-registration feature). The filtered range data were generated by applying frequency domain filtering to eliminate image disturbances (e.g., surface variations, and grooved patterns) from the raw range data [1]. The fused image data were obtained by combining the raw range and raw intensity data to achieve cross-domain feature correlation [2,3]. Please refer to [4] for a comprehensive benchmark study performed using the FIND dataset to investigate the impact from different types of image data on deep convolutional neural network (DCNN) performance.

    If you share or use this dataset, please cite [4] and [5] in any relevant documentation.

    In addition, an image dataset for crack classification has also been published at [6].

    References:

    [1] Shanglian Zhou, & Wei Song. (2020). Robust Image-Based Surface Crack Detection Using Range Data. Journal of Computing in Civil Engineering, 34(2), 04019054. https://doi.org/10.1061/(asce)cp.1943-5487.0000873

    [2] Shanglian Zhou, & Wei Song. (2021). Crack segmentation through deep convolutional neural networks and heterogeneous image fusion. Automation in Construction, 125. https://doi.org/10.1016/j.autcon.2021.103605

    [3] Shanglian Zhou, & Wei Song. (2020). Deep learning–based roadway crack classification with heterogeneous image data fusion. Structural Health Monitoring, 20(3), 1274-1293. https://doi.org/10.1177/1475921720948434

    [4] Shanglian Zhou, Carlos Canchila, & Wei Song. (2023). Deep learning-based crack segmentation for civil infrastructure: data types, architectures, and benchmarked performance. Automation in Construction, 146. https://doi.org/10.1016/j.autcon.2022.104678

    [5] Shanglian Zhou, Carlos Canchila, & Wei Song. (2022). Fused Image dataset for convolutional neural Network-based crack Detection (FIND) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6383044

    [6] Wei Song, & Shanglian Zhou. (2020). Laser-scanned roadway range image dataset (LRRD). Laser-scanned Range Image Dataset from Asphalt and Concrete Roadways for DCNN-based Crack Classification, DesignSafe-CI. https://doi.org/10.17603/ds2-bzv3-nc78

  13. Data from: 3DHD CityScenes: High-Definition Maps in High-Density Point...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jul 16, 2024
    Cite
    Plachetka, Christopher; Sertolli, Benjamin; Fricke, Jenny; Klingner, Marvin; Fingscheidt, Tim (2024). 3DHD CityScenes: High-Definition Maps in High-Density Point Clouds [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7085089
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    TU Braunschweig
    Volkswagen AG
    Authors
    Plachetka, Christopher; Sertolli, Benjamin; Fricke, Jenny; Klingner, Marvin; Fingscheidt, Tim
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany including 467 km of individual lanes. In total, our map comprises 266,762 individual items.

    Our corresponding paper (published at ITSC 2022) is available here. Further, we have applied 3DHD CityScenes to map deviation detection here.

    Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:

    Python tools to read, generate, and visualize the dataset,

    3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection.

    The DevKit is available here:

    https://github.com/volkswagen/3DHD_devkit.

    The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.

    When using our dataset, you are welcome to cite:

    @INPROCEEDINGS{9921866, author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim}, booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)}, title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds}, year={2022}, pages={627-634}}

    Acknowledgements

    We thank the following interns for their exceptional contributions to our work.

    Benjamin Sertolli: Major contributions to our DevKit during his master thesis

    Niels Maier: Measurement campaign for data collection and data preparation

    The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.

    The Dataset

    After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.

    1. Dataset

    This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.

    During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.

    To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.

    import json

    json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
    with open(json_path) as jf:
        data = json.load(jf)
    print(data)

    2. HD_Map

    Map items are stored as lists of items in JSON format. In particular, we provide:

    traffic signs,

    traffic lights,

    pole-like objects,

    construction site locations,

    construction site obstacles (point-like such as cones, and line-like such as fences),

    line-shaped markings (solid, dashed, etc.),

    polygon-shaped markings (arrows, stop lines, symbols, etc.),

    lanes (ordinary and temporary),

    relations between elements (only for construction sites, e.g., sign to lane association).

    3. HD_Map_MetaData

    Our high-density point cloud used as basis for annotating the HD map is split in 648 tiles. This directory contains the geolocation for each tile as polygon on the map. You can view the respective tile definition using QGIS. Alternatively, we also provide respective polygons as lists of UTM coordinates in JSON.

    Files with the ending .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as “shape file” (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.

    4. HD_PointCloud_Tiles

    The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.

    x-coordinates: 4 byte integer

    y-coordinates: 4 byte integer

    z-coordinates: 4 byte integer

    intensity of reflected beams: 2 byte unsigned integer

    ground classification flag: 1 byte unsigned integer

    After reading, respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.

    import numpy as np
    import pptk

    file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
    pc_dict = {}
    key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
    # dtypes follow the byte sizes described above; little-endian byte order is assumed here
    type_list = [np.int32, np.int32, np.int32, np.uint16, np.uint8]

    with open(file_path, "rb") as f:
        num_points = int(np.fromfile(f, dtype=np.int32, count=1)[0])  # first 4 bytes
        for key, dtype in zip(key_list, type_list):
            pc_dict[key] = np.fromfile(f, dtype=dtype, count=num_points)
    # The values still need to be unnormalized as described above before visualization with pptk.

  14. Life Sciences dataset used in INFORE project, part 1

    • data.europa.eu
    unknown
    Updated Jun 29, 2020
    + more versions
    Cite
    Zenodo (2020). Life Sciences dataset used in INFORE project, part 1 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3921049?locale=hr
    Explore at:
    unknownAvailable download formats
    Dataset updated
    Jun 29, 2020
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Life Sciences dataset used in INFORE project, part 1

    The dataset comprises the output of several simulations of a model of tumor growth with different parameter values. The model is a multi-scale agent-based model of a tumor spheroid that is treated with periodic pulses of the cytokine tumor necrosis factor (TNF). The multi-scale model simulates processes including i) the diffusion, uptake, and secretion of molecular entities such as oxygen or TNF; ii) the mechanical interaction between cells; and iii) cellular processes including the cell life cycle, cell death models, and signal transduction. The multi-scale model was implemented and simulated using the PhysiBoSS framework (Letort et al. 2019).

    The dataset corresponds to different parameter combinations of our use case, matching the different panels of Figure 4 in the Documentation folder. This figure comes from the paper in the same folder. You can find a broad discussion of our use case in the Biological Use Case Documentation file.

    The results of the cell simulations can be found in example_XXX/run0/outputs, and the results of the microenvironment simulations in example_XXX/run0/microutputs. Details on how these files are built can be found in the Biological Use Case output format file (a snippet of the broad documentation file, detached for your convenience). Briefly: at each defined time step, the software writes an output and a microutput file. For instance, ecm_t00030.txt corresponds to time step 30. Each line of these files corresponds to a cell or a microenvironment entity (oxygen, TNF, etc.). For the outputs folder, the columns are defined by the first row. For the microutputs, the first three columns correspond to spatial coordinates and the fourth to the value of the density.

    The examples are:

    - example_spheroid_TNF_nopulse: corresponds to Figure 4 A.
    - example_spheroid_TNF_onepulse: corresponds to Figure 4 C.
    - example_spheroid_TNF_pulse150: corresponds to Figure 4 D left. This is the desired simulation outcome: proliferative cells die out with an increasing number of TNF pulses.
    - example_spheroid_TNF_pulse600: corresponds to Figure 4 D right.
    - example_spheroid_TNF_pulsecont: corresponds to Figure 4 B.
    - example_cells_with_ECM_mutants: does NOT correspond to Figure 4. This is an example in which the microutput folder contains two entities: oxygen and ECM. In this example you can also find a folder (ECM_mut) with the kind of visualisation that we perform to showcase results.
    - example_spheroid_TNF_pulsecont_oxy: 21 simulations with slightly different oxygen tolerance conditions, using the simulation with one continuous pulse (Figure 4 B from the presentation) as a base. The only difference among the parameter files is the "oxygen_necrotic" value, which controls the threshold above which cells commit to necrosis due to lack of oxygen. In the original simulation this value was zero, and the maximum available oxygen is 40 fg/µm^3. Here, we have studied the parameter value from 0 to 40 in steps of 5.

  15. FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

    • zenodo.org
    bin, png, zip
    Updated Jul 11, 2024
    Cite
    Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
    Explore at:
    bin, zip, pngAvailable download formats
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # FiN-2 Large-Scale Real-World PLC-Dataset

    ## About
    #### FiN-2 dataset in a nutshell:
    FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
    FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
    We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

    * * *
    ## Content
    FiN-2 dataset splits up into two compressed `csv-Files`: *nodes.csv* and *edges.csv*.

    All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
    - https://zenodo.org/record/8328105
    - https://zenodo.org/record/8328108
    - https://zenodo.org/record/8328111

    ### Node data

    | id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
    |----|----|----|----|----|----|----|----|----|----|----|----|
    |112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
    |112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
    |112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

    - id / ts: Unique identifier of the node that is measured and timestamp of the measurement
    - v1/v2/v3: Voltage measurements of all three phases
    - thd1/thd2/thd3: Total harmonic distortion of all three phases
    - phase_angle1/2/3: Phase angle of all three phases
    - temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

    ### Edge data
    | src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
    |----|----|----|----|----|----|----|----|
    |62|94|1605528900|70|72|45|...|-53|
    |62|32|1605529800|16|24|13|...|-51|
    |17|94|1605530700|37|25|24|...|-55|

    - src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
    - snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

    ### Metadata
    Metadata that is provided along with the data covers:

    - Number of cable joints
    - Cable properties (length, type, number of sections)
    - Relative position of the nodes (location, zero-centered gps)
    - Adjacent PV or wallbox installations
    - Year of installation w.r.t. the nodes and cables

    Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

    * * *
    ## Usage
    Simple data access using pandas:

    ```
    import pandas as pd

    nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
    edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

    # read the first 10 rows
    data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

    # skip the first 5 data rows and read the next 10 (data rows 6 to 15)
    data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1,6)], compression='gzip')

    # ... same for the edges
    ```

    The compressed CSV format was used to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, so the number of measurements available for a specific timestamp varies. This, together with the high sparsity of the graph, makes the CSV format inefficient for ML training.
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).


    ### Example use case (voltage forecasting)

    Voltage forecasting is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. MinMax scaling is used as simple preprocessing, and a PyTorch dataset class handles the data. A vanilla autoencoder is then used to process and forecast the voltage into the future.
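
    As a rough sketch of such a pipeline (this is not the notebook itself; the window length, horizon, node selection and MinMax scaling details are illustrative assumptions), a PyTorch dataset over the voltage columns could look like:

    ```
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class VoltageWindows(Dataset):
        """Sliding windows over the three voltage phases of a single node."""
        def __init__(self, csv_path, node_id, window=64, horizon=16):
            df = pd.read_csv(csv_path, compression='gzip')
            df = df[df["id"] == node_id].sort_values("ts")
            v = df[["v1", "v2", "v3"]].to_numpy(dtype="float32")
            v = (v - v.min(0)) / (v.max(0) - v.min(0) + 1e-8)  # simple MinMax scaling
            self.v, self.window, self.horizon = v, window, horizon

        def __len__(self):
            return max(0, len(self.v) - self.window - self.horizon + 1)

        def __getitem__(self, i):
            x = self.v[i : i + self.window]
            y = self.v[i + self.window : i + self.window + self.horizon]
            return torch.from_numpy(x), torch.from_numpy(y)
    ```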

  16. Iris Dataset - various format types

    • kaggle.com
    zip
    Updated May 3, 2024
    Cite
    Nanda Prasetia (2024). Iris Dataset - various format types [Dataset]. https://www.kaggle.com/datasets/nandaprasetia/iris-dataset-various-format-types
    Explore at:
    zip(24187 bytes)Available download formats
    Dataset updated
    May 3, 2024
    Authors
    Nanda Prasetia
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Iris Dataset consists of 150 iris samples, each having four numerical features: sepal length, sepal width, petal length, and petal width. Each sample is categorized into one of three iris species: Setosa, Versicolor, or Virginica. This dataset is widely used as a sample dataset in machine learning and statistics due to its simple and easily understandable structure.

    Feature Information : - Sepal Length (cm) - Sepal Width (cm) - Petal Length (cm) - Petal Width (cm)

    Target Information : - Iris Species : 1. Setosa 2. Versicolor 3. Virginica

    Source : The Iris Dataset is obtained from the scikit-learn (sklearn) library under the BSD (Berkeley Software Distribution) license.

    File Formats :

    1. CSV (Comma-Separated Values): CSV format is the most commonly used and easily readable format. Each row represents one sample with its features separated by commas.
    2. Excel (.xlsx): Excel format is suitable for further data analysis, visualization, and integration with other software.
    3. JSON (JavaScript Object Notation): JSON format allows data to be stored in a more complex structure, suitable for web-based data processing or applications.
    4. Parquet: Parquet format is an efficient columnar data format for large and complex data.
    5. HDF5 (Hierarchical Data Format version 5): HDF5 format stores data in hierarchical groups and datasets, excellent for storing large scientific and numerical data.
    6. Feather: Feather format is a lightweight binary format for storing data frames. It provides excellent performance for reading and writing data.
    7. SQLite Database (.db, .sqlite): SQLite is a lightweight database format suitable for local data storage and querying. It is widely used for small to medium-scale applications.
    8. Msgpack: Msgpack format is a binary serialization format that is efficient in terms of storage and speed. It is suitable for storing and transmitting data efficiently between systems.

    The Iris Dataset is one of the most iconic datasets in the world of machine learning and data science. This dataset contains information about three species of iris flowers: Setosa, Versicolor, and Virginica. With features like sepal and petal length and width, the Iris dataset has been a stepping stone for many beginners in understanding the fundamental concepts of classification and data analysis. With its clarity and diversity of features, the Iris dataset is perfect for exploring various machine learning techniques and building accurate classification models. I present the Iris dataset from scikit-learn with the hope of providing an enjoyable and inspiring learning experience for the Kaggle community!
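
    As an illustration of how such format variants can be produced (a sketch using scikit-learn and pandas; the output file names are arbitrary, and the Parquet export needs an engine such as pyarrow installed):

    ```
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    df = iris.frame  # four feature columns plus the integer 'target' column

    df.to_csv("iris.csv", index=False)         # CSV
    df.to_json("iris.json", orient="records")  # JSON
    df.to_parquet("iris.parquet")              # Parquet
    ```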

  17. SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest...

    • zenodo.org
    bin
    Updated Dec 19, 2024
    Cite
    Zhengpeng Feng; Yihang She; Keshav Srinivasan; Zhengpeng Feng; Yihang She; Keshav Srinivasan (2024). SPREAD: A Large-scale, High-fidelity Synthetic Dataset for Multiple Forest Vision Tasks (Part II) [Dataset]. http://doi.org/10.5281/zenodo.14525290
    Explore at:
    binAvailable download formats
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Zhengpeng Feng; Yihang She; Keshav Srinivasan; Zhengpeng Feng; Yihang She; Keshav Srinivasan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This page only provides the drone-view image dataset.

    The dataset contains drone-view RGB images, depth maps and instance segmentation labels collected from different scenes. Data from each scene is stored in a separate .7z file, along with a color_palette.xlsx file, which contains the RGB_id and corresponding RGB values.

    All files follow the naming convention: {central_tree_id}_{timestamp}, where {central_tree_id} represents the ID of the tree centered in the image, which is typically in a prominent position, and timestamp indicates the time when the data was collected.

    Specifically, each 7z file includes the following folders:

    • rgb: This folder contains the RGB images (PNG) of the scenes and their metadata (TXT). The metadata describes the weather conditions and the world time when the image was captured. An example metadata entry is: Weather:Snow_Blizzard,Hour:10,Minute:56,Second:36.

    • depth_pfm: This folder contains absolute depth information of the scenes, which can be used to reconstruct the point cloud of the scene through reprojection.

    • instance_segmentation: This folder stores instance segmentation labels (PNG) for each tree in the scene, along with metadata (TXT) that maps tree_id to RGB_id. The tree_id can be used to look up detailed information about each tree in obj_info_final.xlsx, while the RGB_id can be matched to the corresponding RGB values in color_palette.xlsx. This mapping allows for identifying which tree corresponds to a specific color in the segmentation image.

    • obj_info_final.xlsx: This file contains detailed information about each tree in the scene, such as position, scale, species, and various parameters, including trunk diameter (in cm), tree height (in cm), and canopy diameter (in cm).

    • landscape_info.txt: This file contains the ground location information within the scene, sampled every 0.5 meters.

    For birch_forest, broadleaf_forest, redwood_forest and rainforest, we also provided COCO-format annotation files (.json). Two such files can be found in these datasets:

    • {name}_coco.json: This file contains the annotation of each tree in the scene.
    • {name}_filtered.json: This file is derived from the previous one, but filtering is applied to rule out overlapping instances.

    ⚠️: 7z files that begin with "!" indicate that the RGB values in the images within the instance_segmentation folder cannot be found in color_palette.xlsx. Consequently, this prevents matching the trees in the segmentation images to their corresponding tree information, which may hinder the application of the dataset to certain tasks. This issue is related to a bug in Colosseum/AirSim, which has been reported in link1 and link2.
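
    For instance, the COCO-format annotation files can be inspected with a few lines of Python (a sketch; "birch_forest_filtered.json" is assumed here as an example of the {name}_filtered.json pattern described above):

    ```
    import json

    # Assumed example file name following the {name}_filtered.json convention
    with open("birch_forest_filtered.json") as f:
        coco = json.load(f)

    # COCO-style files typically hold 'images', 'annotations' and 'categories' lists
    print(len(coco.get("images", [])), "images")
    print(len(coco.get("annotations", [])), "tree instances")
    ```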

  18. Data for: Ignoring species availability biases occupancy estimates in...

    • catalog.data.gov
    • data.usgs.gov
    • +4more
    Updated Oct 30, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data for: Ignoring species availability biases occupancy estimates in single-scale occupancy models [Dataset]. https://catalog.data.gov/dataset/data-for-ignoring-species-availability-biases-occupancy-estimates-in-single-scale-occupanc
    Explore at:
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    We simulated over 28,000 datasets and saved their model outputs to answer the following three questions: (1) what is an adequate sampling design for the multi-scale occupancy model when there are a priori expectations of parameter estimates?, (2) what is an adequate sampling design when we have no expectations of parameter estimates?, and (3) what is the cost (in terms of bias, accuracy, precision and coverage of occupancy estimates) if availability is not accounted for?

    Specifically, we simulated data under four scenarios:

    Scenario 1 (n = 10,000): Species availability is constant across sites (but less than one).
    Scenario 2 (n = 9,358): Species availability is heterogeneous across sites.
    Scenario 3 (n = 2,815): Species availability is heterogeneous across years.
    Scenario 4 (n = 5,942): Species availability is correlated with detection probability.

    Then, for each scenario except the first, we analyzed the data using four different estimators: (i) constant multi-scale occupancy model, (ii) multi-scale occupancy model with a random-effects term in the availability part of the model, (iii) constant single-scale occupancy model, and (iv) single-scale occupancy model with a random-effects term in the detection part of the model. Note the formulation of the random-effects terms included in the models mimicked the way that data were simulated (e.g., if species availability was heterogeneous across sites, then a site random-effects term was included in the models). The first scenario was analyzed using models (i) and (iii) only. For simplicity, we refer to models (i) and (iii) as ‘constant’ or 'fixed-effects' models, and to models (ii) and (iv) as ‘random-effects’ models.

    The summary of simulated data and model estimates are located in four folders, each corresponding to a different simulated scenario:

    Scenario 1 (n = 10,000): Folder ModelOutput_Scen1_TwolevelSim; csv files holding data are named Results_TwoLevelAvail_2lev_x.csv
    Scenario 2 (n = 9,358): Folder ModelOutput_Scen2_HeteroSite; csv files holding data are named Results_TwoLevelAvail_Hetero_x.csv
    Scenario 3 (n = 2,815): Folder ModelOutput_Scen3_HeteroYear; csv files holding data are named Results_TwoLevelAvail_HeteroSeason_x.csv
    Scenario 4 (n = 5,942): Folder ModelOutput_Scen4_Cor; csv files holding data are named Results_TwoLevelAvail_Cor_x.csv

    Each row in each of the csv files contains information related to a different simulated dataset, including: sampling design, true parameter values, and model estimates. Other files in the folders correspond to the entire model output (.rda files), the time for a model run to complete (time_..csv), and a file indicating whether or not the model run finished (nsim...csv). For more information related to those files, we point the user to the code that generated them:

    Scenario 1 (n = 10,000): Scen1_Constant.R
    Scenario 2 (n = 9,358): Scen2_HeteroSite.R
    Scenario 3 (n = 2,815): Scen3_HeteroYear.R
    Scenario 4 (n = 5,942): Scen4_Corr.R

  19. Demir, Nurullah, Urban, Tobias, Pohlmann, Norbert, Wressnegger, Christian...

    • service.tib.eu
    Updated Nov 28, 2024
    + more versions
    Cite
    (2024). Demir, Nurullah, Urban, Tobias, Pohlmann, Norbert, Wressnegger, Christian (2023). Dataset: Dataset: a large-scale study of cookie banner interaction tools and their impact on users' privacy / part1. https://doi.org/10.35097/1708 [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1708
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: Cookie notices (or cookie banners) are a popular mechanism for websites to provide (European) Internet users a tool to choose which cookies the site may set. Banner implementations range from merely providing information that a site uses cookies, over offering the choice to accept or deny all cookies, to allowing fine-grained control of cookie usage. Users frequently get annoyed by the banners' pervasiveness, as they interrupt "natural" browsing on the Web. As a remedy, different browser extensions have been developed to automate the interaction with cookie banners. In this work, we perform a large-scale measurement study comparing the effectiveness of extensions for "cookie banner interaction". We configured the extensions to express different privacy choices (e.g., accepting all cookies, accepting functional cookies, or rejecting all cookies) to understand their capabilities to execute a user's preferences. The results show statistically significant differences in which cookies are set, how many of them are set, and which types are set, even for extensions that aim to implement the same cookie choice. Extensions for "cookie banner interaction" can effectively reduce the number of set cookies compared to no interaction with the banners. However, all extensions increase the tracking requests significantly except when rejecting all cookies.

    TechnicalRemarks: This repository hosts the dataset corresponding to the paper "A Large-Scale Study of Cookie Banner Interaction Tools and their Impact on Users’ Privacy", which was published at the Privacy Enhancing Technologies Symposium (PETS) in 2024.

  20. Data_Sheet_1_Community Detection in Large-Scale Bipartite Biological...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Genís Calderer; Marieke L. Kuijjer (2023). Data_Sheet_1_Community Detection in Large-Scale Bipartite Biological Networks.XLSX [Dataset]. http://doi.org/10.3389/fgene.2021.649440.s001
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Genís Calderer; Marieke L. Kuijjer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Networks are useful tools to represent and analyze interactions on a large, or genome-wide scale and have therefore been widely used in biology. Many biological networks—such as those that represent regulatory interactions, drug-gene, or gene-disease associations—are of a bipartite nature, meaning they consist of two different types of nodes, with connections only forming between the different node sets. Analysis of such networks requires methodologies that are specifically designed to handle their bipartite nature. Community structure detection is a method used to identify clusters of nodes in a network. This approach is especially helpful in large-scale biological network analysis, as it can find structure in networks that often resemble a “hairball” of interactions in visualizations. Often, the communities identified in biological networks are enriched for specific biological processes and thus allow one to assign drugs, regulatory molecules, or diseases to such processes. In addition, comparison of community structures between different biological conditions can help to identify how network rewiring may lead to tissue development or disease, for example. In this mini review, we give a theoretical basis of different methods that can be applied to detect communities in bipartite biological networks. We introduce and discuss different scores that can be used to assess the quality of these community structures. We then apply a wide range of methods to a drug-gene interaction network to highlight the strengths and weaknesses of these methods in their application to large-scale, bipartite biological networks.

Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793

Rescaled Fashion-MNIST dataset

Explore at:
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Time period covered
Apr 10, 2025
Description

Motivation

The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled Fashion-MNIST dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights

The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

The h5 files containing the dataset

The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

import h5py
import numpy as np

with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

# any of the test files listed above can be substituted here
with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5", "r") as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5', '/x_test');
y_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5', '/y_test');

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
