84 datasets found

Sales Dataset with Natural Language Statement
kaggle.com
Updated Oct 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gurpreet Singh India (2024). Sales Dataset with Natural Language Statement [Dataset]. https://www.kaggle.com/datasets/gurpreetsinghindia/sales-data-with-natural-language
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 1, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gurpreet Singh India
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset contains 10,000 simulated sales transaction records, each represented in natural language with diverse sentence structures. It is designed to mimic how different users might describe the same type of transaction in varying ways, making it ideal for Natural Language Processing (NLP) tasks, text-based data extraction, and accounting automation projects.

Each record in the dataset includes the following fields:

Sale Date: The date on which the transaction took place. Customer Name: A randomly generated customer name. Product: The type of product purchased. Quantity: The quantity of the product purchased. Unit Price: The price per unit of the product. Total Amount: The total price for the purchased products. Tax Rate: The percentage of tax applied to the transaction. Payment Method: The method by which the payment was made (e.g., Credit Card, Debit Card, UPI, etc.). Sentence: A natural language description of the sales transaction. The sentence structure is varied to simulate different ways people describe the same type of sales event.

Use Cases: NLP Training: This dataset is suitable for training models to extract structured information (e.g., date, customer, amount) from natural language descriptions of sales transactions. Accounting Automation: The dataset can be used to build or test systems that automate posting of sales transactions based on unstructured text input. Text Data Preprocessing: It provides a good resource for developing methods to preprocess and standardize varying formats of text descriptions. Chatbot Training: This dataset can help train chatbots or virtual assistants that handle accounting or customer inquiries by understanding different ways of expressing the same transaction details.

Key Features: High Variability: Sentences are structured in numerous ways to simulate natural human language variations. Randomized Data: Names, dates, products, quantities, prices, and payment methods are randomized, ensuring no duplication. Multi-Field Information: Each record contains key sales information essential for accounting and business use cases.

Potential Applications: Use for Named Entity Recognition (NER) tasks. Apply for information extraction challenges. Create pattern recognition models to understand different sentence structures. Test rule-based systems or machine learning models for sales data entry and accounting automation.

License: Ensure that the dataset is appropriately licensed according to your intended use. For general public and research purposes, choose a CC0: Public Domain license, unless specific restrictions apply.
Common Metadata Elements for Cataloging Biomedical Datasets
figshare.com
xlsx
Updated Jan 20, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kevin Read (2016). Common Metadata Elements for Cataloging Biomedical Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.1496573.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1496573.v1
Dataset updated
Jan 20, 2016
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Kevin Read
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset outlines a proposed set of core, minimal metadata elements that can be used to describe biomedical datasets, such as those resulting from research funded by the National Institutes of Health. It can inform efforts to better catalog or index such data to improve discoverability. The proposed metadata elements are based on an analysis of the metadata schemas used in a set of NIH-supported data sharing repositories. Common elements from these data repositories were identified, mapped to existing data-specific metadata standards from to existing multidisciplinary data repositories, DataCite and Dryad, and compared with metadata used in MEDLINE records to establish a sustainable and integrated metadata schema. From the mappings, we developed a preliminary set of minimal metadata elements that can be used to describe NIH-funded datasets. Please see the readme file for more details about the individual sheets within the spreadsheet.
H
Dataset metadata of known Dataverse installations, August 2023
dataverse.harvard.edu
search.dataone.org
Updated Aug 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8FEGUV
Dataset updated
Aug 30, 2024
Dataset provided by
Harvard Dataverse
Authors
Julian Gautier
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and improving understandings about how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data. How the metadata was downloaded The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens. How the files are organized ├── csv_files_with_metadata_from_most_known_dataverse_installations │ ├── author(citation)_2023.08.22-2023.08.28.csv │ ├── contributor(citation)_2023.08.22-2023.08.28.csv │ ├── data_source(citation)_2023.08.22-2023.08.28.csv │ ├── ... │ └── topic_classification(citation)_2023.08.22-2023.08.28.csv ├── dataverse_json_metadata_from_each_known_dataverse_installation │ ├── Abacus_2023.08.27_12.59.59.zip │ ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv │ ├── Dataverse_JSON_metadata_2023.08.27_12.59.59 │ ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json │ ├── ... │ ├── metadatablocks_v5.6 │ ├── astrophysics_v5.6.json │ ├── biomedical_v5.6.json │ ├── citation_v5.6.json │ ├── ... │ ├── socialscience_v5.6.json │ ├── ACSS_Dataverse_2023.08.26_22.14.04.zip │ ├── ADA_Dataverse_2023.08.27_13.16.20.zip │ ├── Arca_Dados_2023.08.27_13.34.09.zip │ ├── ... │ └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip └── dataverse_installations_summary_2023.08.28.csv └── dataset_pids_from_most_known_dataverse_installations_2023.08.csv └── license_options_for_each_dataverse_installation_2023.09.05.csv └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
Describe Art Dataset
universe.roboflow.com
zip
Updated Aug 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roboflow Project (2024). Describe Art Dataset [Dataset]. https://universe.roboflow.com/roboflow-project-erv3h/describe-art
Explore at:
zipAvailable download formats
Dataset updated
Aug 2, 2024
Dataset provided by
Roboflow, Inc.
Authors
Roboflow Project
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Art Images Descriptions
Description
Describe Art

## Overview Describe Art is a dataset for vision language (multimodal) tasks - it contains Art Images annotations for 6,402 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
A fMRI dataset in response to large number of short natural dynamic facial...
openneuro.org
Updated Oct 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Panpan Chen; Chi Zhang; Bao Li; Li Tong; Linyuan Wang; Shuxiao Ma; Long Cao; Ziya Yu; Bin Yan (2024). A fMRI dataset in response to large number of short natural dynamic facial expression videos [Dataset]. http://doi.org/10.18112/openneuro.ds005047.v1.0.4
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds005047.v1.0.4
Dataset updated
Oct 10, 2024
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
Panpan Chen; Chi Zhang; Bao Li; Li Tong; Linyuan Wang; Shuxiao Ma; Long Cao; Ziya Yu; Bin Yan
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Summary

Facial expression is among the most natural methods for human beings to convey their emotional information in daily life. Although the neural mechanism of facial expression has been extensively studied employing lab-controlled images and a small number of lab-controlled video stimuli, how the human brain processes natural facial expressions still needs to be investigated. To our knowledge, this type of data specifically on large number of natural facial expression videos is currently missing. We describe here the natural Facial Expressions Dataset (NFED), a fMRI dataset including responses to 1,320 short (3-second) natural facial expression video clips. These video clips is annotated with three types of labels: emotion, gender, and ethnicity, along with accompanying metadata. We validate that the dataset has good quality within and across participants and, notably, can capture temporal and spatial stimuli features. NFED provides researchers with fMRI data for understanding of the visual processing of large number of natural facial expression videos.

Data Records

The data, which were structured following the BIDS format53, were accessible at https://openneuro.org/datasets/ds00504754. The “sub-

Stimulus. Distinct folders store the stimuli for distinct fMRI experiments: "stimuli/face-video", "stimuli/floc", and "stimuli/prf" (Fig. 2b). The category labels and metadata corresponding to video stimuli are stored in the "videos-stimuli_category_metadata.tsv”. The “videos-stimuli_description.json” file describes category and metadata information of video stimuli(Fig. 2b).

Raw MRI data. Each participant's folder is comprised of 11 session folders: “sub-

Volume data from pre-processing. The pre-processed volume-based fMRI data were in the folder named “pre-processed_volume_data/sub-

Surface data from pre-processing. The pre-processed surface-based data were stored in a file named “volumetosurface/sub-

FreeSurfer recon-all. The results of reconstructing the cortical surface were saved as “recon-all-FreeSurfer/sub-

Surface-based GLM analysis data. We have conducted GLMsingle on the data of the main experiment. There is a file named “sub--

Validation. The code of technical validation was saved in the “derivatives/validation/code” folder. The results of technical validation were saved in the “derivatives/validation/results” folder (Fig. 2h). “README.md” describes the detailed information of code and results.
Z
Empathy dataset
data.niaid.nih.gov
zenodo.org
Updated Dec 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mathematical Research Data Initiative (2024). Empathy dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7683906
Explore at:
Dataset updated
Dec 18, 2024
Dataset authored and provided by
Mathematical Research Data Initiative
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The database for this study (Briganti et al. 2018; the same for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy, (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years), 57% were females and 43% were males. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject are used.

The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.

Size: A dataset of size 1973*28

Number of features: 28

Ground truth: No

Type of Graph: Mixed graph

The following gives the description of the variables:

Feature FeatureLabel Domain Item meaning from Davis 1980

001 1FS Green I daydream and fantasize, with some regularity, about things that might happen to me.

002 2EC Purple I often have tender, concerned feelings for people less fortunate than me.

003 3PT_R Yellow I sometimes find it difficult to see things from the “other guy’s” point of view.

004 4EC_R Purple Sometimes I don’t feel very sorry for other people when they are having problems.

005 5FS Green I really get involved with the feelings of the characters in a novel.

006 6PD Red In emergency situations, I feel apprehensive and ill-at-ease.

007 7FS_R Green I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it.(Reversed)

008 8PT Yellow I try to look at everybody’s side of a disagreement before I make a decision.

009 9EC Purple When I see someone being taken advantage of, I feel kind of protective towards them.

010 10PD Red I sometimes feel helpless when I am in the middle of a very emotional situation.

011 11PT Yellow sometimes try to understand my friends better by imagining how things look from their perspective

012 12FS_R Green Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed)

013 13PD_R Red When I see someone get hurt, I tend to remain calm. (Reversed)

014 14EC_R Purple Other people’s misfortunes do not usually disturb me a great deal. (Reversed)

015 15PT_R Yellow If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed)

016 16FS Green After seeing a play or movie, I have felt as though I were one of the characters.

017 17PD Red Being in a tense emotional situation scares me.

018 18EC_R Purple When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed)

019 19PD_R Red I am usually pretty effective in dealing with emergencies. (Reversed)

020 20FS Green I am often quite touched by things that I see happen.

021 21PT Yellow I believe that there are two sides to every question and try to look at them both.

022 22EC Purple I would describe myself as a pretty soft-hearted person.

023 23FS Green When I watch a good movie, I can very easily put myself in the place of a leading character.

024 24PD Red I tend to lose control during emergencies.

025 25PT Yellow When I’m upset at someone, I usually try to “put myself in his shoes” for a while.

026 26FS Green When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me.

027 27PD Red When I see someone who badly needs help in an emergency, I go to pieces.

028 28PT Yellow Before criticizing somebody, I try to imagine how I would feel if I were in their place

More information about the dataset is contained in empathy_description.html file.
Cybersecurity Attack Dataset
kaggle.com
Updated Jul 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tannu Barot (2025). Cybersecurity Attack Dataset [Dataset]. https://www.kaggle.com/datasets/tannubarot/cybersecurity-attack-and-defence-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 23, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Tannu Barot
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview This dataset is a comprehensive, easy-to-understand collection of cybersecurity incidents, threats, and vulnerabilities, designed to help both beginners and experts explore the world of digital security. It covers a wide range of modern cybersecurity challenges, from everyday web attacks to cutting-edge threats in artificial intelligence (AI), satellites, and quantum computing. Whether you're a student, a security professional, a researcher, or just curious about cybersecurity, this dataset offers a clear and structured way to learn about how cyber attacks happen, what they target, and how to defend against them.

With 14134 entries and 15 columns, this dataset provides detailed insights into 26 distinct cybersecurity domains, making it a valuable tool for understanding the evolving landscape of digital threats. It’s perfect for anyone looking to study cyber risks, develop strategies to protect systems, or build tools to detect and prevent attacks.

What’s in the Dataset? The dataset is organized into 16 columns that describe each cybersecurity incident or research scenario in detail:

ID: A unique number for each entry (e.g., 1, 2, 3). Title: A short, descriptive name of the attack or scenario (e.g., "Authentication Bypass via SQL Injection"). Category: The main cybersecurity area, like Mobile Security, Satellite Security, or AI Exploits. Attack Type: The specific kind of attack, such as SQL Injection, Cross-Site Scripting (XSS), or GPS Spoofing. Scenario Description: A plain-language explanation of how the attack works or what the scenario involves. Tools Used: Software or tools used to carry out or test the attack (e.g., Burp Suite, SQLMap, GNURadio). Attack Steps: A step-by-step breakdown of how the attack is performed, written clearly for all audiences. Target Type: The system or technology attacked, like web apps, satellites, or login forms. Vulnerability: The weakness that makes the attack possible (e.g., unfiltered user input or weak encryption). MITRE Technique: A code from the MITRE ATT&CK framework, linking the attack to a standard classification (e.g., T1190 for exploiting public-facing apps). Impact: What could happen if the attack succeeds, like data theft, system takeover, or financial loss. Detection Method: Ways to spot the attack, such as checking logs or monitoring unusual activity. Solution: Practical steps to prevent or fix the issue, like using secure coding or stronger encryption. Tags: Keywords to help search and categorize entries (e.g., SQLi, WebSecurity, SatelliteSpoofing). Source: Where the information comes from, like OWASP, MITRE ATT&CK, or Space-ISAC.

Cybersecurity Domains Covered The dataset organizes cybersecurity into 26 key areas:

AI / ML Security

AI Agents & LLM Exploits

AI Data Leakage & Privacy Risks

Automotive / Cyber-Physical Systems

Blockchain / Web3 Security

Blue Team (Defense & SOC)

Browser Security

Cloud Security

DevSecOps & CI/CD Security

Email & Messaging Protocol Exploits

Forensics & Incident Response

Insider Threats

IoT / Embedded Devices

Mobile Security

Network Security

Operating System Exploits

Physical / Hardware Attacks

Quantum Cryptography & Post-Quantum Threats

Red Team Operations

Satellite & Space Infrastructure Security

SCADA / ICS (Industrial Systems)

Supply Chain Attacks

Virtualization & Container Security

Web Application Security

Wireless Attacks

Zero-Day Research / Fuzzing

Why Is This Dataset Important? Cybersecurity is more critical than ever as our world relies on technology for everything from banking to space exploration. This dataset is a one-stop resource to understand:

What threats exist: From simple web attacks to complex satellite hacks. How attacks work: Clear explanations of how hackers exploit weaknesses. How to stay safe: Practical solutions to prevent or stop attacks. Future risks: Insight into emerging threats like AI manipulation or quantum attacks. It’s a bridge between technical details and real-world applications, making cybersecurity accessible to everyone.

Potential Uses This dataset can be used in many ways, whether you’re a beginner or an expert:

Learning and Education: Students can explore how cyber attacks work and how to defend against them. Threat Intelligence: Security teams can identify common attack patterns and prepare better defenses. Security Planning: Businesses and governments can use it to prioritize protection for critical systems like satellites or cloud infrastructure. Machine Learning: Data scientists can train models to detect threats or predict vulnerabilities. Incident Response Training: Practice responding to cyber incidents, from web hacks to satellite tampering.

Ethical Considerations Purpose: The dataset is for educational and research purposes only, to help improve cybersecurity knowledge and de...
ML Methods/Algorithms Dataset
kaggle.com
Updated Apr 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abhishek Ranjan (2023). ML Methods/Algorithms Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/5548954
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5548954
Dataset updated
Apr 28, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Abhishek Ranjan
License
https://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/
Description
The ML Methods dataset returned by the PapersWithCode API represents a machine learning method or model, such as a neural network architecture or an optimization algorithm. It contains various attributes that describe the method, including its ID, name, description, and the papers that introduce it.

Column Descriptors

ID: A unique identifier for the method.

Name: The name of the method, which typically describes its architecture or algorithm.

Full Name: The full_name attribute of the Method dataset via the Papers with Code API represents the full name of a machine learning method, including any additional information such as version numbers or authors.

Description: A detailed description of the method, which may include information about its design choices, implementation details, and performance characteristics.

Paper: A list of Paper objects that introduce or describe the method

Use Case

The dataset can be used for NLP tasks, Data Analysis, Feature Engineering, etc. For instance, You could use clustering algorithms to group similar papers together based on their content.

The specific approach you take will depend on your research question and the tools and techniques you are familiar with.
SELTO Dataset
zenodo.org
data.niaid.nih.gov
application/gzip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7781392
Dataset updated
May 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

E - Young's modulus [Pa]

ν - Poisson's ratio [-]

σ_ys - a yield stress [Pa]

h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh

Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

from dl4to.datasets import SELTODataset dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

import pandas as pd root = ... file_path = f'{root}/{i}.csv' columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density'] df = pd.read_csv(file_path, names=columns)

Similarly, we can import a i_info.csv file via:

file_path = f'{root}/{i}_info.csv' info_column_names = ['E', 'ν', 'σ_ys', 'h'] df_info = pd.read_csv(file_path, names=info_columns)

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch def get_torch_tensors_from_dataframe(df, dtype=torch.float32): shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1 voxels = [df['x'].values, df['y'].values, df['z'].values] Ω_design = torch.zeros(1, *shape, dtype=int) Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['Ω_design'].values.astype(int)) Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype) Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype) Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype) Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype) F = torch.zeros(3, *shape, dtype=dtype) F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype) F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype) F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype) density = torch.zeros(1, *shape, dtype=dtype) density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype) return Ω_design, Ω_Dirichlet, F, density
Synthetic dataset 2
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Noah Mitchell; Dillon Cislo (2023). Synthetic dataset 2 [Dataset]. http://doi.org/10.6084/m9.figshare.22814786.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22814786.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Noah Mitchell; Dillon Cislo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Synthetic dataset of nuclei on a tube-like tissue that changes shape, for analysis demonstration with TubULAR.

TubULAR is a set of tools for working with 3D data of surfaces – potentially complex and dynamic – that can be described as tubes. Developing guts, pumping hearts, and other visceral organs can be treated as tubes with potentially complex and dynamic shapes. With TubULAR, we can describe the tissue motion on the tube-like surface and quantify how it changes over time.

This synthetic dataset is a tube of cells with nuclei and membrane that coils into a loop, then uncoils into a straight tube. To generate the dataset, the surface geometry was encoded numerically. We placed 120 nuclei-like blobs of intensity centered at locations across the surface. Locations were chosen as a solution to an iterative farthest-point search, so that nuclei are well-spaced from each other. We then performed a Voronoi tessellation to create a channel mimicking `cell-cell junctions'. The nuclei sizes were determined based on the distance of each nucleus to the nearest membrane location.

For more on the codebase, visit: https://npmitchell.github.io/tubular/ https://github.com/npmitchell/tubular
r
Tabascal SNN-NLN Dataset
researchdata.edu.au
data.niaid.nih.gov
+1more
Updated 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicholas James Pritchard; School of Physics, Maths and Computing (2023). Tabascal SNN-NLN Dataset [Dataset]. http://doi.org/10.5281/ZENODO.8401763
Explore at:
Unique identifier
https://doi.org/10.5281/ZENODO.8401763
Dataset updated
2023
Dataset provided by
The University of Western Australia
Zenodo
Authors
Nicholas James Pritchard; School of Physics, Maths and Computing
Description
Dataset for training and evaluating RFI detection schemes representing MeerKat instrumentation and predominantly satellite-based contamination. These datasets are produced using Tabascal and output in hdf5 format. The choice of format is to allow for easy use with machine-learning workflows, not other astronomy pipelines (for example, measurement sets). These datasets are prepared for immediate loading with Tensorflow. The attached config.json files describe the parameters used to generate these datasets.

Dataset parameters Name Num Satellite Sources Num Ground RFI Sources obs_100AST_0SAT_0GRD_512BSL_64A_512T-0440-1462_016I_512F-1.227e+09-1.334e+09 0 0 obs_100AST_1SAT_0GRD_512BSL_64A_512T-0440-1462_016I_512F-1.227e+09-1.334e+09 1 0 obs_100AST_1SAT_3GRD_512BSL_64A_512T-0440-1462_016I_512F-1.227e+09-1.334e+09 1 3 obs_100AST_2SAT_0GRD_512BSL_64A_512T-0440-1462_016I_512F-1.227e+09-1.334e+09 2 0 obs_100AST_2SAT_3GRD_512BSL_64A_512T-0440-1462_016I_512F-1.227e+09-1.334e+09 2 3

Using simulated data allows for access to ground truth for noise contamination. As such, these datasets contain the observation visibility amplitudes (without noise), noise visibilities and boolean pixel-wise masks at several thresholds on the noise visibilities. We outline the dimensions of all datasets below:

Dataset Dimensions Field vis masks_orig masks_0 masks_1 masks_2 masks_4 masks_8 masks_16 Datatype float32 float32 bool bool bool bool bool bool Of course, one can produce masks at arbitrary thresholds, but for convenience, we include several pre-computed options.

All datasets and all fields have the dimensions 512, 512, 512, 1 (baseline, time, frequency, amplitude/mask)
Adult income dataset
kaggle.com
zip
Updated Oct 6, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
1251 (2016). Adult income dataset [Dataset]. https://www.kaggle.com/datasets/wenruliu/adult-income-dataset/discussion/285989
Explore at:
zip(667679 bytes)Available download formats
Dataset updated
Oct 6, 2016
Authors
1251
Description
An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, and etc.

This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practices.

Fields The dataset contains 16 columns Target filed: Income -- The income is divide into two classes: <=50K and >50K
Number of attributes: 14 -- These are the demographics and other features to describe a person

We can explore the possibility in predicting income level based on the individual’s personal information.

Acknowledgements This dataset named “adult” is found in the UCI machine learning repository http://www.cs.toronto.edu/~delve/data/adult/desc.html

The detailed description on the dataset can be found in the original UCI documentation http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html
100 categorized URLs of web pages that describe, contain, or link to...
zenodo.org
csv
Updated Jul 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Plamena Neycheva; Robert Jäschke; Robert Jäschke; Plamena Neycheva (2025). 100 categorized URLs of web pages that describe, contain, or link to (research) datasets [Dataset]. http://doi.org/10.5281/zenodo.16418048
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.16418048
Dataset updated
Jul 25, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Plamena Neycheva; Robert Jäschke; Robert Jäschke; Plamena Neycheva
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Nov 12, 2023
Description
This dataset is a list of 100 manually collected URLs of web pages that describe, contain, or link to (research) datasets. The list was annotated and categorised with the following fields:

URL: a URL to a page that describes, contains, or links (research) datasets

URL to dataset page: either the same URL or, for URLs that point to repository-like systems, a sub-page specific to a few datasets

type of page: 0 = list of data sets, 1 = description of a data set, 2 = reference to data sets, 3 = project website, 4 = research data repository, 5 = miscellaneous; this field can contain several values

number of datasets: how many datasets were found on the dataset page

file formats: which file formats were found on the dataset page (e.g., jpg, txt, csv)

types of datasets: which type of data were found on the dataset page (e.g., text, image, video, table)

available metadata: which metadata for the datasets were found on the dataset page (e.g., title, description, year)
Z
Data from: HL Dataset: Visually-grounded Description of Scenes, Actions and...
data.niaid.nih.gov
Updated Feb 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Albert Gatt (2024). HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10723070
Explore at:
Dataset updated
Feb 28, 2024
Dataset provided by
Michele Cafagna
Kees van Deemter
Albert Gatt
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
Z
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
zenodo.org
Updated Jun 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anastasija Nikiforova (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Explore at:
Dataset updated
Jun 23, 2023
Dataset provided by
Charalampos Alexopoulos
Magdalena Ciesielska
Nina Rizun
Anastasija Nikiforova
Andrea Miletič
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb) It being made public both to act as supplementary data for "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out to by searching digital libraries covered by Scopus, Web of Science (WoS), Digital Government Research library (DGRL).

These databases were queried for keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those, where these objects were primary research objects rather than mentioned in the body, e.g., as a future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure Each study was independently examined by at least two authors, where after the in-depth examination of the full-text of the article, the structured protocol has been filled for each study. The structure of the survey is available in the supplementary file available (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx) The data collected for each study by two researchers were then synthesized in one final version by the third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol Spreadsheets #1 provides the filled protocol for relevant studies. Spreadsheet#2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet 2) Complete reference - the complete source information to refer to the study 3) Year of publication - the year in which the study was published 4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter} 5) DOI / Website- a link to the website where the study can be found 6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science 7) Availability in OA - availability of an article in the Open Access 8) Keywords - keywords of the paper as indicated by the authors 9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited infor-mation about the research methods used)? 18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, sec-ondary - mentioned but not studied (e.g., as part of discus-sion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term? 20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output") 21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the rela-tionships between these components? (detailed description) 22) Stakeholders and their roles - what stakeholders or actors does HVD determination in-volve? What are their roles? 23) Data - what data do HVD cover? 24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt

Latest Data Professionals Salary Dataset

kaggle.com

Updated Jul 9, 2023

Facebook

Twitter

Click to copy link

Link copied

Cite

Aman Chauhan (2023). Latest Data Professionals Salary Dataset [Dataset]. https://www.kaggle.com/datasets/whenamancodes/data-professionals-salary-dataset-2022

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 9, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Aman Chauhan

Description

About Dataset

Context

Analytics refers to the methodical examination and calculation of data or statistics. Its purpose is to uncover, interpret, and convey meaningful patterns found within the data. Additionally, analytics involves utilizing these data patterns to make informed decisions. It proves valuable in domains abundant with recorded information, employing a combination of statistics, computer programming, and operations research to measure performance.

Businesses can leverage analytics to describe, predict, and enhance their overall performance. Various branches of analytics encompass predictive analytics, prescriptive analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big Data Analytics, retail analytics, supply chain analytics, store assortment and stock-keeping unit optimization, marketing optimization and marketing mix modeling, web analytics, call analytics, speech analytics, sales force sizing and optimization, price and promotion modeling, predictive science, graph analytics, credit risk analysis, and fraud analytics. Due to the extensive computational requirements involved (particularly with big data), analytics algorithms and software utilize state-of-the-art methods from computer science, statistics, and mathematics.

Data Dictionary

Columns	Description
Company Name	Company Name refers to the name of the organization or company where an individual is employed. It represents the specific entity that provides job opportunities and is associated with a particular industry or sector.
Job Title	Job Title refers to the official designation or position held by an individual within a company or organization. It represents the specific role or responsibilities assigned to the person in their professional capacity.
Salaries Reported	Salaries Reported indicates the information or data related to the salaries of employees within a company or industry. This data may be collected and reported through various sources, such as surveys, employee disclosures, or public records.
Location	Location refers to the specific geographical location or area where a company or job position is situated. It provides information about the physical location or address associated with the company's operations or the job's work environment.
Salary	Salary refers to the monetary compensation or remuneration received by an employee in exchange for their work or services. It represents the amount of money paid to an individual on a regular basis, typically in the form of wages or a fixed annual income.

Content

This Dataset consists of salaries for Data Scientists, Machine Learning Engineers, Data Analysts, and Data Engineers in various cities across India (2022).

-Salary Dataset.csv -Partially Cleaned Salary Dataset.csv

Acknowledgements

This Dataset is created from https://www.glassdoor.co.in/. If you want to learn more, you can visit the Website.

Mars rover dataset

kaggle.com

Updated Mar 1, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Gaurav Kumar (2025). Mars rover dataset [Dataset]. https://www.kaggle.com/datasets/gauravkumar2525/mars-rover-dataset/code

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Mar 1, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Gaurav Kumar

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

1. Description

The description section is crucial for helping users understand the purpose, context, and potential applications of your dataset. It should include the following details:

Dataset Overview: Provide a clear summary of what the dataset contains. For example, "This dataset includes images captured by NASA’s Curiosity Rover on Mars, along with metadata such as the camera used, the Martian sol (day) when the photo was taken, and the corresponding Earth date."
Source of the Data: Explain where the data comes from. If it’s obtained via an API, mention the source (e.g., "The images and metadata are retrieved from NASA's Mars Rover Photos API."). If the dataset is curated from multiple sources, list them.
Purpose and Use Cases: Describe why this dataset was created and how it can be used. For example:
- Machine Learning: Train models for image classification, object detection, and anomaly detection.
- Scientific Research: Analyze Martian surface patterns, study terrain features, or examine rover camera performance.
- Space Exploration: Understand Mars' environmental conditions and assist in future exploration planning.
Data Format and Organization: Briefly mention the format of the files (e.g., CSV file for metadata, ZIP file containing images) and how they are structured.
Licensing and Permissions: Specify if the dataset has any restrictions on usage. Since NASA data is typically public domain, state that users are free to use it for research and development.
Limitations or Considerations: Mention any potential challenges, such as missing data, limited coverage, or resolution constraints.

2. File Information

This section provides details about the files included in your dataset, helping users navigate and use them efficiently. Key points to include:

List of Files: Clearly mention all the files and their formats, such as:
- mars_rover_dataset.csv (CSV file containing metadata of images)
- mars_images.zip (Compressed folder containing all images)
Purpose of Each File:
- CSV File: Contains structured data, including image IDs, timestamps, camera details, and URLs.
- ZIP File: Stores actual Mars images, which can be extracted and used for ML training or visualization.
File Dependencies: Explain how files relate to each other. For example, "The img_src column in mars_rover_dataset.csv corresponds to the images stored in mars_images.zip. Users should extract the images before using the dataset for model training."
How to Access the Files: Provide instructions on downloading and extracting files. Example:
bash unzip mars_images.zip
This ensures that users can quickly set up the dataset in their working environment.

3. Column Descriptions

This section explains the meaning of each column in the dataset, ensuring users can analyze and interpret the data correctly. A well-structured table format is often useful:

Column Name	Description
`id`	Unique identifier for each image.
`sol`	Martian sol (day) when the image was captured.
`camera_name`	Abbreviated name of the rover's camera (e.g., "FHAZ" for Front Hazard Camera).
`camera_full_name`	Full descriptive name of the camera.
`img_src`	URL link to the image. Users can download images using this link.
`earth_date`	The Earth date corresponding to the Martian sol.
`rover_name`	Name of the rover that captured the image (e.g., "Curiosity").
`rover_status`	Current operational status of the rover (e.g., "Active" or "Complete").
`landing_date`	Date when the rover landed on Mars.
`launch_date`	Date when the rover was launched from Earth.

Additional Details:

Data Types: Indicate whether a column contains numbers, text, or dates.
Data Format: Example: earth_date is in YYYY-MM-DD format.
Special Notes: If any column has missing values or requires preprocessing, mention it.

This section helps users quickly understand the dataset's structure, making it easier for them to work with the data effectively.

d
Dynamic Land Cover Dataset
data.gov.au
researchdata.edu.au
zip
Updated Aug 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bioregional Assessment Program (2023). Dynamic Land Cover Dataset [Dataset]. https://data.gov.au/data/dataset/1556b944-731c-4b7f-a03e-14577c7e68db
Explore at:
zip(186838338)Available download formats
Dataset updated
Aug 11, 2023
Dataset authored and provided by
Bioregional Assessment Program
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Abstract

This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied.

The Dynamic Land Cover Dataset of Australia is the first nationally consistent and thematically comprehensive land cover reference for Australia. It is the result of a collaboration between Geoscience Australia and the Australian Bureau of Agriculture and Resource Economics and Sciences, and provides a base-line for identifying and reporting on change and trends in vegetation cover and extent. Land cover is the observed biophysical cover on the Earth¿s surface, including native vegetation, soils, exposed rocks and water bodies as well as anthropogenic elements such as plantations, crops and built environments. Remote sensing data recorded over a period of time allows the observation of land cover dynamics. Different land cover types display distinct responses due to seasonal, climatic and anthropogenic drivers. Classifying these responses provides a robust and repeatable way of characterising land cover types. A key aspect of land cover is vegetation greenness. The greenness of vegetation is directly related to the amount of photosynthesis occurring, and can be measured as an index such as the Enhanced Vegetation Index (EVI). The Dynamic Land Cover Dataset presents land cover information for every 250m by 250m area of the country from April 2000 to April 2008. The classification scheme used to describe land cover categories in the Dynamic Land Cover Dataset conforms to the 2007 International Standards Organisation (ISO) land cover standard (19144-2). The Dynamic Land Cover Dataset shows Australian land covers clustered into 34 ISO classes. These reflect the structural character of vegetation, ranging from cultivated and managed land covers (crops and pastures) to natural land covers such as closed forest and sparse, open grasslands.

Dataset History

The source data for the Dynamic Land Cover Dataset is a time series of Enhanced Vegetation Index (EVI) data from the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Terra and Aqua satellites operated by NASA. The time series includes 186 snapshots of vegetation greenness for each 250m by 250m area across the continent over an 8 year period from 2000 to 2008. Complete information on the creation of this product can be found in the following documents available from the Geoscience Australia website www.ga.gov.au/landcover.

Dataset Citation

Geoscience Australia (2010) Dynamic Land Cover Dataset. Bioregional Assessment Source Dataset. Viewed 27 September 2017, http://data.bioregionalassessments.gov.au/dataset/1556b944-731c-4b7f-a03e-14577c7e68db.
NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep...
zenodo.org
text/x-python, zip
Updated Apr 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulio Del Corso; Giulio Del Corso; Volpini Federico; Volpini Federico; Claudia Caudai; Claudia Caudai; Davide Moroni; Davide Moroni; Sara Colantonio; Sara Colantonio (2025). NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep learning models [Dataset]. http://doi.org/10.5281/zenodo.15194187
Explore at:
zip, text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15194187
Dataset updated
Apr 16, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Giulio Del Corso; Giulio Del Corso; Volpini Federico; Volpini Federico; Claudia Caudai; Claudia Caudai; Davide Moroni; Davide Moroni; Sara Colantonio; Sara Colantonio
License
Attribution-NonCommercial-NoDerivs 2.5 (CC BY-NC-ND 2.5)https://creativecommons.org/licenses/by-nc-nd/2.5/
License information was derived automatically
Time period covered
Dec 18, 2024
Description
NADA (Not-A-Database) is an easy-to-use geometric shape data generator that allows users to define non-uniform multivariate parameter distributions to test novel methodologies. The full open-source package is provided at GIT:NA_DAtabase. See Technical Report for details on how to use the provided package.

This database includes 3 repositories:

NADA_Dis: Is the model able to correctly characterize/Disentangle a complex latent space?
The repository contains 3x100,000 synthetic black and white images to test the ability of the models to correctly define a proper latent space (e.g., autoencoders) and disentangle it. The first 100,000 images contain 4 shapes and uniform parameter space distributions, while the other images have a more complex underlying distribution (truncated Gaussian and correlated marginal variables).

NADA_OOD: Does the model identify Out-Of-Distribution images?
The repository contains 100,000 training images (4 different shapes with 3 possible colors located in the upper left corner of the canvas) and 6x100,000 increasingly different sets of images (changing the color class balance, reducing the radius of the shape, moving the shape to the lower left corner) providing increasingly challenging out-of-distribution images.
This can help to test not only the capability of a model, but also methods that produce reliability estimates and should correctly classify OOD elements as "unreliable" as they are far from the original distributions.

NADA_AlEp: Does the model distinguish between different types (Aleatoric/Epistemic) of uncertainties?
The repository contains 5x100,000 images with different type of noise/uncertainties:

NADA_AlEp_0_Clean: Dataset clean of noise to use as a possible training set.

NADA_AlEp_1_White_Noise: Epistemic white noise dataset. Each image is perturbed with an amount of white noise randomly sampled from 0% to 90%.

NADA_AlEp_2_Deformation: Dataset with Epistemic deformation noise. Each image is deformed by a randomly amount uniformly sampled between 0% and 90%. 0% corresponds to the original image, while 100% is a full deformation to the circumscribing circle.

NADA_AlEp_3_Label: Dataset with label noise. Formally, 20% of Triangles of a given color are missclassified as a Square with a random color (among Blue, Orange, and Brown) and viceversa (Squares to Triangles). Label noise introduces \textit{Aleatoric Uncertainty} because it is inherent in the data and cannot be reduced.

NADA_AlEp_4_Combined: Combined dataset with all previous sources of uncertainty.

Each image can be used for classification (shape/color) or regression (radius/area) tasks.

All datasets can be modified and adapted to the user's research question using the included open source data generator.
e
Data from: Analysis of the Dataset for the Terms Novice Programmers Use to...
b2find.eudat.eu
Updated Jun 2, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Analysis of the Dataset for the Terms Novice Programmers Use to Describe Code Snippets in Java [Dataset]. https://b2find.eudat.eu/dataset/0e3aa9e0-79a2-5f7c-9aa0-83c168ca5c34
Explore at:
Dataset updated
Jun 2, 2024
Description
This dataset consists of about 1800 free-text responses in German from 123 students in an introductory programming course. For 15 different code snippets in Java, the participants described how they would explain what the corresponding code snippet does. This dataset includes also the analysis of the responses.

Facebook

Twitter

Click to copy link

Link copied

Cite

Gurpreet Singh India (2024). Sales Dataset with Natural Language Statement [Dataset]. https://www.kaggle.com/datasets/gurpreetsinghindia/sales-data-with-natural-language

Sales Dataset with Natural Language Statement

A dataset of 10,000 sales transactions represented in natural language statement

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Oct 1, 2024

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Gurpreet Singh India

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

This dataset contains 10,000 simulated sales transaction records, each represented in natural language with diverse sentence structures. It is designed to mimic how different users might describe the same type of transaction in varying ways, making it ideal for Natural Language Processing (NLP) tasks, text-based data extraction, and accounting automation projects.

Each record in the dataset includes the following fields:

Sale Date: The date on which the transaction took place. Customer Name: A randomly generated customer name. Product: The type of product purchased. Quantity: The quantity of the product purchased. Unit Price: The price per unit of the product. Total Amount: The total price for the purchased products. Tax Rate: The percentage of tax applied to the transaction. Payment Method: The method by which the payment was made (e.g., Credit Card, Debit Card, UPI, etc.). Sentence: A natural language description of the sales transaction. The sentence structure is varied to simulate different ways people describe the same type of sales event.

Use Cases: NLP Training: This dataset is suitable for training models to extract structured information (e.g., date, customer, amount) from natural language descriptions of sales transactions. Accounting Automation: The dataset can be used to build or test systems that automate posting of sales transactions based on unstructured text input. Text Data Preprocessing: It provides a good resource for developing methods to preprocess and standardize varying formats of text descriptions. Chatbot Training: This dataset can help train chatbots or virtual assistants that handle accounting or customer inquiries by understanding different ways of expressing the same transaction details.

Key Features: High Variability: Sentences are structured in numerous ways to simulate natural human language variations. Randomized Data: Names, dates, products, quantities, prices, and payment methods are randomized, ensuring no duplication. Multi-Field Information: Each record contains key sales information essential for accounting and business use cases.

Potential Applications: Use for Named Entity Recognition (NER) tasks. Apply for information extraction challenges. Create pattern recognition models to understand different sentence structures. Test rule-based systems or machine learning models for sales data entry and accounting automation.

License: Ensure that the dataset is appropriately licensed according to your intended use. For general public and research purposes, choose a CC0: Public Domain license, unless specific restrictions apply.

Clear search

Close search

Google apps

Main menu

Sales Dataset with Natural Language Statement

Common Metadata Elements for Cataloging Biomedical Datasets

Dataset metadata of known Dataverse installations, August 2023

Describe Art Dataset

Describe Art

A fMRI dataset in response to large number of short natural dynamic facial...

Empathy dataset

Cybersecurity Attack Dataset

ML Methods/Algorithms Dataset

Column Descriptors

Use Case

SELTO Dataset

Synthetic dataset 2

Tabascal SNN-NLN Dataset

Adult income dataset

100 categorized URLs of web pages that describe, contain, or link to...

Data from: HL Dataset: Visually-grounded Description of Scenes, Actions and...

Dataset: A Systematic Literature Review on the topic of High-value datasets

Latest Data Professionals Salary Dataset

About Dataset

Context

Data Dictionary

Content

Acknowledgements

Mars rover dataset

1. Description

2. File Information

3. Column Descriptions

Additional Details:

Dynamic Land Cover Dataset

Abstract

Dataset History

Dataset Citation

NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep...

Data from: Analysis of the Dataset for the Terms Novice Programmers Use to...

Sales Dataset with Natural Language Statement

A dataset of 10,000 sales transactions represented in natural language statement