100+ datasets found
  1. Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0)

    • data.usgs.gov
    • datasets.ai
    • +4 more
    Updated Sep 17, 2024
    Cite
    Ayman Alzraiee; Carol Luukkonen; Richard Niswonger; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Lisa Miller; Kristen Valseth; Joshua Larsen; Donald Martin; Cheryl Dieter; Jana Stewart; Scott Paulinski (2024). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. http://doi.org/10.5066/P9FUL880
    Explore at:
    10 scholarly articles cite this dataset
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Ayman Alzraiee; Carol Luukkonen; Richard Niswonger; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Lisa Miller; Kristen Valseth; Joshua Larsen; Donald Martin; Cheryl Dieter; Jana Stewart; Scott Paulinski
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 31, 2020
    Description

    This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, Python code used to develop and train the water-use machine learning model, and output water-use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public-supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public su ...
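
    A minimal sketch, assuming pandas and assuming the totals CSV carries a HUC12 identifier column alongside monthly estimate columns (the "huc12" column name is hypothetical; check the file's header):

        import pandas as pd

        # Load the monthly HUC12 totals; values are in million gallons per day.
        df = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"huc12": str})

        monthly_cols = [c for c in df.columns if c != "huc12"]
        df["mean_mgd"] = df[monthly_cols].mean(axis=1)  # long-run mean per HUC12
        print(df[["huc12", "mean_mgd"]].head())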

  2. Data Versioning Tool Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 4, 2024
    Cite
    Dataintelo (2024). Data Versioning Tool Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-versioning-tool-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Versioning Tool Market Outlook



    The global Data Versioning Tool market size was valued at approximately USD 1.5 billion in 2023 and is forecasted to reach around USD 4.8 billion by 2032, reflecting a robust CAGR of 13.7% during the forecast period. The growth in this market is primarily driven by the increasing need for efficient data management and the rising adoption of data-driven decision-making across various industries.
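
    The quoted growth figures can be sanity-checked with the standard CAGR formula; a one-off arithmetic sketch (the horizon is the 9 years from 2023 to 2032):

        # CAGR = (end / start) ** (1 / years) - 1
        start, end, years = 1.5, 4.8, 2032 - 2023
        cagr = (end / start) ** (1 / years) - 1
        print(f"implied CAGR: {cagr:.1%}")  # about 13.8%, close to the quoted 13.7%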



    One of the significant growth factors for the Data Versioning Tool market is the exponential increase in the volume of data generated by enterprises. The advent of Big Data, IoT, and AI technologies has led to a data explosion, necessitating advanced tools to manage and version this data effectively. Data versioning tools facilitate the tracking of changes, enabling organizations to maintain data integrity, compliance, and governance. This ensures that organizations can handle their data efficiently, leading to enhanced data quality and better analytical outcomes.



    Another driver contributing to the market's growth is the rising awareness of data security and compliance regulations. With stringent regulatory requirements such as GDPR, HIPAA, and CCPA, organizations are compelled to adopt robust data management practices. Data versioning tools provide an audit trail of data changes, which is crucial for compliance and reporting purposes. This capability helps organizations mitigate risks associated with data breaches and non-compliance, thereby fostering the adoption of these tools.



    The increasing popularity of cloud computing also acts as a catalyst for the growth of the Data Versioning Tool market. Cloud-based data versioning tools offer scalability, flexibility, and cost-effectiveness, making them an attractive option for businesses of all sizes. These tools enable real-time collaboration and access to versioned data from any location, which is particularly beneficial in today's remote working environment. The seamless integration of cloud-based data versioning tools with other cloud services further enhances their value proposition, driving market growth.



    Regionally, North America held the largest market share in 2023, attributed to the presence of major technology companies and the high adoption rate of advanced data management solutions. The Asia Pacific region is expected to exhibit the highest CAGR during the forecast period, driven by the rapid digital transformation and increasing investments in data infrastructure by emerging economies like China and India. Europe also presents significant growth opportunities due to stringent data protection regulations and the growing emphasis on data governance.



    Component Analysis



    The Data Versioning Tool market is segmented into software and services based on the component. The software segment held a dominant share in the market in 2023, driven by the high demand for advanced data management solutions. These software tools offer a wide range of functionalities, including data tracking, version control, and rollback capabilities, which are essential for maintaining data integrity and consistency. The integration of AI and machine learning algorithms in these tools further enhances their efficiency, making them indispensable for modern enterprises.



    The services segment, although smaller, is expected to grow at a significant pace during the forecast period. This growth is attributed to the increasing need for consulting, implementation, and support services associated with data versioning tools. Organizations often require expert guidance to deploy these tools effectively and integrate them with their existing systems. Additionally, the ongoing maintenance and updates necessitate continuous support services, driving the demand in this segment.



    The software segment can be further categorized into on-premises and cloud-based solutions. On-premises software is preferred by organizations with stringent data security requirements and those that need complete control over their data. However, the cloud-based software segment is expected to witness higher growth due to its scalability, cost-effectiveness, and ease of deployment. The cloud model also supports real-time collaboration and remote access, which are critical in today's distributed work environments.



    Within the services segment, consulting services are anticipated to hold a substantial share. As organizations embark on their data management journeys, they seek expert advice to choose the right tools and strategies. Implementation services are a

  3. ModelOps and MLOps Platforms Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 23, 2025
    Cite
    Data Insights Market (2025). ModelOps and MLOps Platforms Report [Dataset]. https://www.datainsightsmarket.com/reports/modelops-and-mlops-platforms-1946071
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The ModelOps and MLOps platforms market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across various industries. The surge in data volume and complexity necessitates efficient management and deployment of ML models, fueling the demand for platforms that streamline the entire machine learning lifecycle. These platforms offer functionalities such as model versioning, monitoring, and deployment, enabling organizations to improve model performance, reduce operational costs, and accelerate time-to-market for AI-powered solutions. The market is segmented by deployment type (cloud, on-premise), organization size (small, medium, large), and industry vertical (finance, healthcare, retail, etc.), with cloud-based deployments gaining significant traction due to scalability and cost-effectiveness. Key players are actively investing in research and development, incorporating advanced features like automated model retraining, explainable AI (XAI), and MLOps automation to enhance platform capabilities and cater to evolving business needs. Competition is intensifying, with both established technology vendors and specialized startups vying for market share through strategic partnerships, acquisitions, and innovative product offerings.

    The forecast period (2025-2033) promises further expansion, fueled by factors such as the growing adoption of edge AI, the rise of generative AI, and the increasing demand for real-time analytics. However, challenges such as the need for skilled professionals, data security and privacy concerns, and the complexity of integrating MLOps into existing IT infrastructures remain.

    Despite these challenges, the long-term outlook remains positive, with the market expected to witness substantial growth driven by continuous technological advancements, wider industry adoption, and increasing awareness of the benefits of streamlined ML model management. This market will be shaped by the ability of vendors to provide user-friendly interfaces, robust scalability, and seamless integration with existing data pipelines and business processes. The focus will shift towards addressing the complexities of deploying and managing increasingly sophisticated AI models in production environments.

  4. Software code quality and source code metrics dataset

    • data.mendeley.com
    • narcis.nl
    Updated Feb 17, 2021
    Cite
    Sayed Mohsin Reza (2021). Software code quality and source code metrics dataset [Dataset]. http://doi.org/10.17632/77p6rzb73n.2
    Explore at:
    Dataset updated
    Feb 17, 2021
    Authors
    Sayed Mohsin Reza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains quality and source code metrics information for 60 versions under 10 different repositories. The dataset is extracted at 3 levels: (1) Class, (2) Method, (3) Package. The dataset was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to convey size, popularity, and maintainability. File versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and of how the repository grows over time. File attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the real dataset as a unique identifier to show values for packages, classes, and methods.
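
    A minimal sketch, assuming pandas, for getting oriented in the three index files; no column names are assumed, so the headers are printed for inspection:

        import pandas as pd

        # Load the three index files shipped alongside the quality_attributes folder.
        repos = pd.read_csv("repositories.csv")
        versions = pd.read_csv("versions.csv")
        attrs = pd.read_csv("attribute-details.csv")

        # attribute-details.csv maps each metric's short form (the identifier used
        # in the per-class/method/package files) to its full name and description.
        for frame in (repos, versions, attrs):
            print(frame.columns.tolist())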

  5. Exploration of DVC (Data Version Control) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 8, 2023
    Cite
    (2023). Exploration of DVC (Data Version Control) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9459b6e6-14b1-5c9a-838c-23a12195b039
    Explore at:
    Dataset updated
    May 8, 2023
    Description

    The repository contains tutorials and code created during an exploration of DVC (Data Version Control) as a potential tool for managing machine learning pipelines within HZDR. The tutorials aim to help in understanding the tool's features and drawbacks, and also serve as future teaching material.
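
    For readers unfamiliar with the tool, a minimal sketch of DVC's Python API for reading a versioned file out of a DVC-tracked repository; the path, repository URL, and tag below are placeholders, not part of this dataset:

        import dvc.api

        # Open a tracked file at a specific Git revision; pinning "rev" to a tag
        # is the core of DVC-style data versioning. All arguments are placeholders.
        with dvc.api.open(
            "data/train.csv",
            repo="https://github.com/org/repo",
            rev="v1.0",
        ) as f:
            print(f.readline())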

  6. Predicting Vulnerability Inducing Function Versions Using Node Embeddings and Graph Neural Networks - Wireshark

    • data.mendeley.com
    Updated Jan 19, 2022
    Cite
    Sefa Eren Şahin (2022). Predicting Vulnerability Inducing Function Versions Using Node Embeddings and Graph Neural Networks - Wireshark [Dataset]. http://doi.org/10.17632/ymtf9znmfz.2
    Explore at:
    Dataset updated
    Jan 19, 2022
    Authors
    Sefa Eren Şahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wireshark Vulnerability Prediction Dataset

    This dataset was constructed by a team of researchers at Istanbul Technical University, Faculty of Computer and Informatics, and used in the paper entitled "Predicting Vulnerability Inducing Function Versions Using Node Embeddings and Graph Neural Networks". Please see the GitHub repository https://github.com/erensahin/gnn-vulnerability-prediction for more details on usage.

    This dataset consists of two main parts:

    • AST dumps, which can be used as inputs for any machine learning model (ast_input)
    • Wireshark file changes and bugs (file_changes_and_bugs)

    ast_input

    The ast_input folder contains three files:

    • ast_input.zip: a compressed archive of AST dumps in Python pickle format. Use the Python pickle library to load the data (see the sketch after this list).
    • node_embeddings_by_kind.pkl: embedding vectors corresponding to AST node kinds, in Python pickle format.
    • token_id_vocabulary.pkl: a map of token ids to their corresponding tokens, in Python pickle format.
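
    A minimal sketch of the loading step suggested above; the variable names are illustrative only:

        import pickle

        # Embedding vectors per AST node kind.
        with open("node_embeddings_by_kind.pkl", "rb") as f:
            node_embeddings = pickle.load(f)

        # Mapping of token ids to their corresponding tokens.
        with open("token_id_vocabulary.pkl", "rb") as f:
            token_vocab = pickle.load(f)

        print(type(node_embeddings), type(token_vocab))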

    file_changes_and_bugs

    file_changes_and_bugs folder consists of five files:

    • wireshark_file_changes.csv: a list of file changes made in the Wireshark repository. File changes are basically commit-file pairs.
    • wireshark_cve_bug_matching.csv: maps CVE entries to bug ids in the Wireshark bug repository. This is scraped from https://www.wireshark.org/security/
    • additional_bugs.csv: additional security-related bugs that our team manually identified by investigating security advisories and bug reports.
    • wireshark_bug_commit_matching.csv: maps security bugs (vulnerabilities) to commits in the Wireshark source code repository.
    • wireshark_bug_inducing_file_changes.csv: maps vulnerabilities in Wireshark source files in terms of the commits in which a vulnerability was induced and fixed.
  7. AI4Arctic / ASIP Sea Ice Dataset - version 2

    • data.dtu.dk
    • figshare.com
    pdf
    Updated Jul 12, 2023
    Cite
    Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler; Leif Toudal Pedersen; David Malmgren-Hansen; Allan Aasbjerg Nielsen; Henning Skriver (2023). AI4Arctic / ASIP Sea Ice Dataset - version 2 [Dataset]. http://doi.org/10.11583/DTU.13011134.v3
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler; Leif Toudal Pedersen; David Malmgren-Hansen; Allan Aasbjerg Nielsen; Henning Skriver
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AI4Arctic / ASIP Sea Ice Dataset - version 2 (ASID-v2) contains 461 Sentinel-1 Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute in 2018-2019. Ice charts contain sea ice concentration, stage of development, and form of ice, provided as manually drawn polygons. The ice charts have been projected into the Sentinel-1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the Sentinel-1 data. Details are described in the manual published together with the dataset. The manual has been revised; the latest version is dated 30-09-2020.

  8. Data from: Machine Learning the Tip of the Red Giant Branch

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Oct 13, 2022
    Cite
    Mitchell T Dennis; Jeremy Sakstein (2022). Machine Learning the Tip of the Red Giant Branch [Dataset]. http://doi.org/10.5281/zenodo.7197506
    Explore at:
    Dataset updated
    Oct 13, 2022
    Authors
    Mitchell T Dennis; Jeremy Sakstein
    Description

    Reproduction package for the paper "Machine Learning the Tip of the Red Giant Branch".

    Authors: Mitchell T. Dennis (mtde226@hawaii.edu); Jeremy Sakstein (sakstein@hawaii.edu)

    Software: MESA version 15140 (http://mesa.sourceforge.net/); MESASDK version 20210401 (http://www.astro.wisc.edu/~townsend/static.php?ref=mesasdk); GFORTRAN GCC version 9.2.0

    Citation Policy: If you use any of this reproduction package for independent work, we recommend you cite the following papers:

    • Astrophys. J. Suppl. 192, 3 (2011)
    • Astrophys. J. Suppl. 208, 4 (2013)
    • Astrophys. J. Suppl. 234, 34 (2018)
    • Astrophys. J. Suppl. 243, 10 (2019)

    For more information, see the Readme.md

  9. Replication Package for "Why Do Deep Learning Projects Differ in Compatible Framework Versions? An Exploratory Study"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 13, 2023
    Cite
    Huashan Lei (2023). Replication Package for "Why Do Deep Learning Projects Differ in Compatible Framework Versions? An Exploratory Study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8266949
    Explore at:
    Dataset updated
    Sep 13, 2023
    Dataset provided by
    Jun Wang
    Yepang Liu
    Shuai Zhang
    Yulei Sui
    Huashan Lei
    Guanping Xiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the scripts and data used to generate the results reported in the paper. Detailed information is provided in README.md.

    code

    This folder contains all the scripts used for the experiment. upgrade.py and downgrade.py perform the upgrade and downgrade runs, pairing.py generates the DFVC pairs, and main.py identifies the root causes of DFVC pairs.

    result

    This folder contains all the results of the experiments, including the runtime output (e.g., a_1.0.0.txt), the runtime environment (e.g., condalist_1.0.0.txt), and the project's runtime commands (e.g., pytorch-cifar.xlsx) for all 90 tested PyTorch projects and 50 tested TensorFlow projects.

    Distribution of dfvc pairs.xlsx

    This file includes 6,926 DFVC pairs and their root causes.

    Tested framework versions.xlsx

    This file includes the framework versions tested and the Python versions that the framework versions are compatible with.

    Tested projects.xlsx

    This file includes the tested 90 PyTorch projects and 50 TensorFlow projects. We provide the following main information: (a) project name, (b) stars, (c) link, (d) the starting version, (e) python version, (f) incompatible upgrade/downgrade version, and (g) compatible versions.

  10. Metadata record for: Compendiums of cancer transcriptomes for machine learning applications

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Su Bin Lim; Swee Jin Tan; Wan-Teck Lim; Chwee Teck Lim (2023). Metadata record for: Compendiums of cancer transcriptomes for machine learning applications [Dataset]. http://doi.org/10.6084/m9.figshare.9901763.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Su Bin Lim; Swee Jin Tan; Wan-Teck Lim; Chwee Teck Lim
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor "Compendiums of cancer transcriptomes for machine learning applications". Contents:

        1. human readable metadata summary table in CSV format
        2. machine readable metadata file in JSON format
        3. machine readable metadata file in ISA-Tab format (zipped folder)

    Versioning Note: A revised version was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
  11. Mechanical MNIST crack path extended version

    • search.dataone.org
    • datadryad.org
    • +1 more
    Updated May 3, 2025
    Cite
    Saeed Mohammadzadeh; Emma Lejeune (2025). Mechanical MNIST crack path extended version [Dataset]. http://doi.org/10.5061/dryad.rv15dv486
    Explore at:
    Dataset updated
    May 3, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Saeed Mohammadzadeh; Emma Lejeune
    Time period covered
    Jan 1, 2021
    Description

    The Mechanical MNIST Crack Path dataset contains Finite Element simulation results from phase-field models of quasi-static brittle fracture in heterogeneous material domains subjected to prescribed loading and boundary conditions. For all samples, the material domain is a square with a side length of 1. There is an initial crack of fixed length (0.25) on the left edge of each domain. The bottom edge of the domain is fixed in x (horizontal) and y (vertical), the right edge of the domain is fixed in x and free in y, and the left edge is free in both x and y. The top edge is free in x, and in y it is displaced such that, at each step, the displacement increases linearly from zero at the top right corner to the maximum displacement on the top left corner. Maximum displacement starts at 0.0 and increases to 0.02 by increments of 0.0001 (200 simulation steps in total). The heterogeneous material distribution is obtained by adding rigid circular inclusions to the domain using the Fashion MNIST...
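
    The loading schedule above can be written down directly; a worked sketch (illustrative only, not the release's simulation code):

        import numpy as np

        n_steps = 200
        max_disp = 0.0001 * np.arange(1, n_steps + 1)  # 0.0001 ... 0.02 per step

        # Linear profile along the top edge: maximum displacement at the left
        # corner (x = 0), zero at the right corner (x = 1).
        x = np.linspace(0.0, 1.0, 101)
        final_profile = max_disp[-1] * (1.0 - x)
        print(max_disp[-1], final_profile[:3])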

  12. LSD4WSD VX: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling - Full Analysis Version

    • ieee-dataport.org
    Updated Oct 30, 2023
    Cite
    matthieu gallet (2023). LSD4WSD VX: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling - Full Analysis Version [Dataset]. https://ieee-dataport.org/documents/lsd4wsd-vx-open-dataset-wet-snow-detection-sar-data-and-physical-labelling-full-analysis
    Explore at:
    Dataset updated
    Oct 30, 2023
    Authors
    matthieu gallet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.

  13. NBA Player Dataset & Prediction Model Artifacts

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png +2
    Updated Apr 28, 2025
    Cite
    Burak Baltali; Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43
    Explore at:
    Available download formats: json, png, csv, bin, txt, text/markdown
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Burak Baltali; Burak Baltali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

    Brief overview of files

    1. End-of-season box-score aggregates (2012-13 through 2023-24), split into train/test.

    2. The Jupyter notebook (Analysis.ipynb); all the code can be executed in there.

    3. The trained model binary (nba_model.pkl); a serialized Random Forest model artifact.

    4. Evaluation plots (LAL vs. whole-league) for regular-season & playoff predictions, provided as PNG outputs.

    5. FAIR4ML metadata (fair4ml_metadata.jsonld); see README.md and abbreviations.txt for file details.

    6. For further information, see the GitHub repository (link below).

    File Details

    Notebook

    Analysis.ipynb: contains the graphical output of the trained and tested data.

    Training/Test CSV Data

    Name | Description | PID
    regular_train.csv | For training on the regular season, the seasons 2012-2013 through 2021-2022 were selected | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced
    regular_test.csv | For testing on the regular season, the 2022-2023 season was selected | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
    playoff_train.csv | For training on the playoffs, the seasons 2012-2013 through 2022-2023 were selected | bcb3cf2b-27df-48cc-8b76-9e49254783d0
    playoff_test.csv | For testing on the playoffs, the 2023-2024 season was selected | de37d568-e97f-4cb9-bc05-2e600cc97102

    Others

    abbreviations.txt: contains the fundamental abbreviations for the columns in the CSV data

    Additional Notes

    Raw csv files are taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)

    Some preprocessing had to be done before uploading into DBRepo.

    Plots have also been uploaded as an output for visual purposes.

    A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
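
    A hedged sketch of reusing the listed artifacts: unpickle the Random Forest and score the regular-season test split. The feature selection is hypothetical; consult README.md and abbreviations.txt for the real schema:

        import pickle
        import pandas as pd

        with open("nba_model.pkl", "rb") as f:
            model = pickle.load(f)  # serialized Random Forest artifact

        test = pd.read_csv("regular_test.csv")
        X = test.select_dtypes("number")  # assumed: numeric box-score columns as features
        print(model.predict(X)[:5])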

  14. Spacekit Data Archive

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 3, 2025
    Cite
    Kein, Ru (2025). Spacekit Data Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7830048
    Explore at:
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Kein, Ru
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Collection of datasets, models and training results for spacekit machine learning algorithms. To learn more, please visit https://spacekit.readthedocs.io/en/latest/

    Versioning note: modifications to existing uploads are indicated by major version iterations (e.g. 1.0, 2.0, 3.0); new file additions are denoted by minor version increments (e.g. 1.1, 1.2, 1.3) since these are inherently backwards compatible.

  15. A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 7, 2022
    Cite
    Bernardo Rondelli (2022). A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4536032
    Explore at:
    Dataset updated
    Sep 7, 2022
    Dataset provided by
    César Parra-Rojas
    Bernardo Rondelli
    Nicolau Duran-Silva
    Francesco Alessandro Massucci
    Enric Fuster
    Fernando Roda
    Nicandro Bovenzi
    Chiara Toietta
    Arnau Quinquillà
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI)

    This controlled vocabulary of keywords related to the field of Artificial Intelligence (AI) was built by SIRIS Academic in collaboration with ART-ER (the R&I and sustainable development in-house agency of the Emilia-Romagna region in Italy) and the Generalitat de Catalunya (the regional government of Catalonia, Spain), in order to identify AI research, development and innovation activities. The work was carried out with the advice of domain experts, and the vocabulary was ultimately applied to inform regional strategies on AI and on research and innovation policy.

    The aim of this vocabulary is to enable one to retrieve texts (e.g. R&D projects and scientific publications) that feature the vocabulary's concepts in their titles and abstracts, on the assumption that such records make some contribution to applications, techniques or issues in the domain of AI.

    The present effort was carried out because, despite the high number of contributions and technological developments in the field of AI, there is no closed or static vocabulary of concepts that allows one to unequivocally define the boundaries of what should (or should not) be considered "an Artificial Intelligence intellectual product". Indeed, the literature presents different definitions of the domain, with visions that can be contradictory. AI today encompasses a wide variety of subdomains, ranging from general-purpose areas such as learning and perception to more specific ones such as autonomous vehicle driving, theorem proving, or industrial process monitoring. AI synthesises and automates intellectual tasks, and is therefore potentially relevant to any area of human intellectual activity. In this sense, it is a genuinely universal and multidisciplinary field. AI draws upon disciplines as diverse as cybernetics, mathematics, philosophy, sociology and economics.

    As a ground for the construction of the AI controlled vocabulary, an initial set of concepts was taken from different subdomains of the ACM Computing Classification System 2012 to define the boundaries of the AI domain. Notably, some relevant AI subdomains that have an independent category in the ACM taxonomy outside of AI were nevertheless included in the list of subdomains. In order to align the ACM taxonomical definition with the Catalan Strategy of AI, CATALONIA.AI, version 1 of this resource included the emerging area of AI Ethics in the vocabulary, while some other categories not relevant to the objectives were removed from the subdomains list. In the current version 2, the classification and the labels of the subdomains have been revised to reflect the evolution of the field. Some fields have been grouped in order to reduce the overlap between subdomains and to provide a taxonomy that makes more sense for the analysis of R&I ecosystems.

    The different subdomains in the versions are presented in the following table:

    Version | Subdomains
    Version 2 | (1) Machine learning and deep learning; (2) Computer Vision; (3) Natural Language Processing and speech recognition; (4) Intelligent agents, planning, scheduling, problem-solving, control methods, and search; (5) Expert Systems, Knowledge representation and reasoning; (6) AI Ethics.
    Version 1 | (1) General; (2) Machine Learning; (3) Computer Vision; (4) Natural Language Processing; (5) Knowledge Representation and Reasoning; (6) Distributed Artificial Intelligence; (7) Expert Systems, Problem-Solving, Control Methods and Search; (8) AI Ethics.

    Although a keyword rule-based approach suffers from the major shortcoming of capturing neither all the lexical and linguistic variants of specific concepts nor the context of the words (namely, keyword-based approaches will miss relevant texts if the specific pattern is not matched during the search), the present vocabulary allowed us to obtain fairly good results, owing to the specificity of the concepts describing the AI domain. Furthermore, an understandable and transparent controlled vocabulary allows better control over the final results and over the final definition of the domain borders. Also, a plain list of terms allows much easier and more interactive engagement of interested stakeholders with different degrees of knowledge (for instance, domain experts, policy-makers and potential users), who can use the vocabulary to retrieve pertinent literature or to enrich the resource itself.

    The vocabulary has been built taking advantage of advanced language models and resources from knowledge datasets such as arXiv, DBpedia and Wikipedia. The resulting vocabulary comprises 833 keywords, and has been validated by experts from several universities in Emilia-Romagna and Catalonia.

    Version 0.5 of this resource was developed by SIRIS Academic in 2019 in collaboration with ART-ER, Emilia-Romagna (Quinquillá et al., 2020); version 1 was the result of an update done in 2020 in collaboration with the Generalitat de Catalunya; and the current version (version 2) resulted in 2021 from the collaboration with ART-ER and the integration of an additional set of keywords provided by the Artificial Intelligence and Intelligent Systems (AIIS) Laboratory of the CINI (Consorzio interuniversitario nazionale per l'informatica, based in Rome, Italy).

    The methodology for the construction of the controlled vocabulary is presented in the following steps:

    An initial set of scientific publications was collected by retrieving the following records as a weakly-supervised dataset (in the sense that records are linked to AI by their taxonomy and not by a manual label) in the domain of Artificial Intelligence:

    Publications from Scopus with the keyword “Artificial Intelligence”

    Publications from arXiv in the category “Artificial Intelligence”

    Publications in relevant journals in the scientific domain of “Artificial Intelligence”

    An automated algorithm was used to retrieve, from the APIs of DBpedia, a series of terms that have some categorical relationship (i.e. those indexed as "sub-categories of", "equivalent to", among other relations in DBpedia) with the Artificial Intelligence concept and with the AI categories in the ACM taxonomy. The DBpedia tree was exploited down to level 3; relevant categories were manually selected (for instance: Classification algorithms, Machine learning or Evolutionary computation) and others were ignored (for instance: Artificial intelligence in fiction, Robots or History of artificial intelligence) because they were not relevant, or not specific to the domain.

    The keywords in the dataset's publications were extracted from the keyword sections and from the abstracts. The keywords with the highest TF-IDF, using an IDF matrix in the open domain, were selected. The co-occurrence of keywords with categories in specific AI subdomains, together with a clustering of the main keywords, was used to categorize the keywords at the thematic level.

    This list of keywords tagged by thematic category was manually revised, removing non-pertinent keywords and correcting wrong categorizations.

    The weakly-supervised dataset in the domain of Artificial Intelligence was used to train a Word2Vec (Mikolov et al., 2013) word embedding model (a machine learning model based on neural networks).

    The terms’ list is then enriched by means of automatic methods, which are run in parallel:

    The trained Word2Vec model is used to select, among the indexed keywords of the reference corpus, all terms “semantically close” to the initial set of words. This step is carried out to select terms that might not appear in the texts themselves, but that were deemed pertinent to label the textual records.

    Further, terms that are mentioned in the texts of the reference corpus and that are valued by the trained Word2Vec model as "semantically close" to the initial set of words are also retained. This step is performed to include in the controlled vocabulary a series of terms that are related to the focus of the domain and which are used by practitioners.
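
    A minimal sketch of this enrichment idea, assuming gensim (the text cites Mikolov et al., not a specific library); the toy corpus, seed list, and similarity cut-off are all placeholders:

        from gensim.models import Word2Vec

        corpus = [
            ["deep", "learning", "image", "classification"],
            ["machine", "learning", "model", "training"],
            ["computer", "vision", "object", "detection"],
        ] * 50  # stand-in for the tokenized titles and abstracts

        model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, seed=1)

        seeds = ["learning", "vision"]
        candidates = {
            term
            for s in seeds if s in model.wv
            for term, score in model.wv.most_similar(s, topn=10)
            if score > 0.0  # "semantically close" threshold: a tunable assumption
        }
        print(sorted(candidates))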

    The final list produced by steps 2-6 is manually revised.

    The definition of the vocabulary does not, per se, allow one to identify STI contributions to AI: this activity boils down to matching the terms in the controlled vocabulary against the content of the gathered STI textual records. To carry out this task successfully, a series of pattern matching rules must be defined to capture possible variants of the same concept, such as permutations of words within the concept and/or the presence of null words to be skipped. For this reason, we have carefully crafted matching rules that take into account permutations of words and that allow the words within a concept to be within a certain distance of one another. Some relatively ambiguous keywords (which may match unwanted pieces of text) have a set of associated "extra" terms. These "extra" terms are defined as further terms that must co-appear, in the same sentence, together with their associated ambiguous keywords.
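
    A schematic sketch of such matching rules (not the authors' implementation): every word of a keyword must occur within a small window of the sentence, in any order, and ambiguous keywords additionally require one of their "extra" terms in the same sentence:

        from itertools import product

        def keyword_matches(tokens, keyword_words, window=4, extra_terms=None):
            # Ambiguous keywords need a disambiguating "extra" term in the sentence.
            if extra_terms and not any(t in tokens for t in extra_terms):
                return False
            positions = [[i for i, t in enumerate(tokens) if t == w]
                         for w in keyword_words]
            if any(not p for p in positions):
                return False  # some word of the keyword never occurs
            # Any-order proximity: some choice of occurrences spans <= window tokens.
            return any(max(c) - min(c) <= window for c in product(*positions))

        sent = "we train a deep convolutional network end to end".split()
        print(keyword_matches(sent, ["network", "deep"]))  # True: within 4 tokens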

    Finally, each keyword in the vocabulary was assigned one or more AI subdomains, so that the vocabulary can also be used to tag collections of texts within narrower AI sub-domains. In order to complement the alignment between keywords and subdomains, a set of subdomain-specific keywords have been defined to better capture the scope of the subdomains. These allow better characterization of subdomains that are more difficult to define only by means of unambiguous specific concepts, or that overlap with the wide “machine learning” subdomain (example: machine learning applied to object recognition or

  16. Version Control Systems Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 17, 2025
    Cite
    Archive Market Research (2025). Version Control Systems Report [Dataset]. https://www.archivemarketresearch.com/reports/version-control-systems-565976
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Version Control Systems (VCS) market is experiencing robust growth, driven by the increasing adoption of DevOps practices, cloud computing, and the rising need for collaborative software development. The market size in 2025 is estimated at $607.6 million. While the specific CAGR isn't provided, considering the industry trends and the presence of major players like Microsoft, Amazon Web Services, and IBM, a conservative estimate would place the CAGR between 10% and 15% for the forecast period (2025-2033). This growth is fueled by several factors, including the expanding adoption of Agile methodologies, the demand for improved software quality and faster release cycles, and the increasing complexity of software projects. The market is segmented across various types of VCS, including distributed systems (like Git), centralized systems, and cloud-based solutions. The competitive landscape is highly fragmented, with established players alongside emerging innovative companies. The growth of open-source VCS solutions and the increasing focus on security and integration with other development tools are further shaping the market dynamics.

    The continued rise of cloud-native applications and microservices architectures will significantly contribute to the growth trajectory of the VCS market throughout the forecast period. Companies are increasingly adopting cloud-based VCS solutions for enhanced scalability, accessibility, and collaboration. Furthermore, the growing demand for robust security features and compliance with industry regulations is compelling organizations to invest in sophisticated VCS platforms. The market is expected to see further consolidation in the coming years, with larger players potentially acquiring smaller companies to expand their market share and capabilities. The integration of AI and machine learning into VCS platforms for automated code review and improved development workflows will be a key trend shaping the future of the market.

  17. Dataset and scripts for A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry

    • rodare.hzdr.de
    zip
    Updated Oct 1, 2021
    Cite
    Fiedler, Lenz; Shah, Karan; Cangi, Attila; Bussmann, Michael (2021). Dataset and scripts for A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry [Dataset]. http://doi.org/10.14278/rodare.1197
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    HZDR / CASUS
    Authors
    Fiedler, Lenz; Shah, Karan; Cangi, Attila; Bussmann, Michael
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains additional data for the publication "A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry". Its goal is to enable interested people to reproduce the citation analysis carried out in the aforementioned publication.

    Prerequisites

    The following software versions were used for the Python part of this dataset:

    Python: 3.8.6

    Scholarly: 1.2.0

    Pyzotero: 1.4.24

    Numpy: 1.20.1

    Contents

    results/ : Contains the .csv files that were the results of the citation analysis. Paper groupings follow the ones outlined in the publication.

    scripts/ : Contains scripts to perform the citation analysis.

    Zotero.cached.pkl : Contains the cached Zotero library.

    Usage

    To reproduce the results of the citation analysis, you can use citation_analysis.py in conjunction with the cached Zotero library. Manual additions can be verified using the check_consistency script.
    Please note that you will need a Tor key for the citation analysis, and access to our Zotero library if you don't want to use the cached version. If you need this access, simply contact us.
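
    Since the cache ships as a pickle file, a minimal loading sketch (the structure of the unpickled object is whatever the authors' scripts stored; inspecting its type is a safe first step):

        import pickle

        with open("Zotero.cached.pkl", "rb") as f:
            library = pickle.load(f)  # snapshot of the Zotero library

        print(type(library))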

  18. SynSpeech Dataset (Large Version Part 1)

    • figshare.com
    txt
    Updated Nov 7, 2024
    Cite
    Yusuf Brima (2024). SynSpeech Dataset (Large Version Part 1) [Dataset]. http://doi.org/10.6084/m9.figshare.27628047.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    figshare
    Authors
    Yusuf Brima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SynSpeech Dataset (Large Version Part 1) is an English-language synthetic speech dataset designed for benchmarking disentangled speech representation learning methods. Created using OpenVoice and LibriSpeech-100, it includes 249 unique speakers, each with 500 distinct sentences spoken in four styles: "default," "friendly," "sad," and "whispering," recorded at a 16 kHz sampling rate.

    Due to file size limitations, the dataset has been split into two nearly equal halves. This first half contains data for 136 of the 249 speakers, along with metadata detailing speaker information, gender, speaking style, text, and file paths. The synspeech_Large_Metadata.csv file provides metadata for both halves, and both parts of the archive must be extracted and placed within the same parent directory for full functionality.

    Data is organized by speaker ID, making this dataset ideal for applications in representation learning, speaker and content factorization, and TTS synthesis.
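
    A hedged sketch, assuming pandas, of using the global metadata file to locate one speaker's whispered utterances; the column names are hypothetical, since the description only promises speaker, gender, style, text, and file-path fields:

        import pandas as pd

        meta = pd.read_csv("synspeech_Large_Metadata.csv")
        print(meta.columns.tolist())  # check the real schema first

        first_speaker = meta["speaker_id"].iloc[0]  # assumed column name
        whispers = meta[(meta["speaker_id"] == first_speaker)
                        & (meta["style"] == "whispering")]  # style value per description
        print(len(whispers), "whispered utterances for the first speaker")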

  19. RF Probability (version 0.1): Mapping Wetlands with High Resolution Planet SuperDove Satellite Imagery

    • figshare.canterbury.ac.nz
    tiff
    Updated Jul 28, 2025
    Cite
    Matthew Wilson; Saif Khan (2025). RF Probability (version 0.1): Mapping Wetlands with High Resolution Planet SuperDove Satellite Imagery [Dataset]. http://doi.org/10.26021/canterburynz.29310392.v1
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    University of Canterbury Data Repository
    Authors
    Matthew Wilson; Saif Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets represent predictions and associated probabilities using four machine learning methods, associated with the collection: Mapping Wetlands with High Resolution Planet SuperDove Satellite Imagery: An Assessment of Machine Learning Models Across the Diverse Waterscapes of New Zealand (10.26021/canterburynz.c.7848596). The following datasets are available:

    • HGB prediction
    • HGB probability
    • MLPC prediction
    • MLPC probability
    • Random forest prediction
    • Random forest probability [this dataset]
    • XGBoost prediction
    • XGBoost probability

    For details of the models developed, please see the collection and associated paper. The following files are available in each dataset, each representing an area within New Zealand:

    • xxxxx_mmm_prediction.tif: model prediction, encoded as 8-bit integers where 1 is predicted as wetland (>50% probability), and NA (no data) is non-wetland.
    • xxxxx_mmm_probability.tif: model wetland probability, encoded as 16-bit integers, with probability values from 0 to 1 rescaled from 0 to 10,000. Divide the values by 10,000 to obtain probabilities to four decimal places.

    In the tile filenames, xxxxx refers to the UUID of the grid area, which can be found in the file nzgrid_uuid.gpkg, and mmm is a code which refers to the model used:

    • hgb: histogram gradient boost
    • mlpc: multi-layer perceptron classification
    • rf: random forest
    • xgb: extreme gradient boosting

    In addition to the tif images, two virtual raster tile files are included to enable mapping at the national scale: _mmm_prediction.vrt and _mmm_probability.vrt.

    All tif images are saved using cloud optimised geotiff (COG), which makes them fast to display even at a national level, although it increases the data size. Total size is around 700 MB for the prediction datasets, and ~75 GB for the probability datasets. Metadata for the Planet SuperDove imagery used for each pixel of the predictions is available here: https://doi.org/10.26021/canterburynz.29231837.v
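
    A minimal sketch, assuming rasterio, of recovering probabilities from a 16-bit probability tile exactly as the description instructs; the filename is a placeholder following the xxxxx_mmm_probability.tif pattern:

        import rasterio

        # Read band 1 of a probability tile (integer values 0..10000).
        with rasterio.open("xxxxx_rf_probability.tif") as src:
            scaled = src.read(1)

        prob = scaled / 10000.0  # rescale back to probabilities in [0, 1]
        print(prob.min(), prob.max())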

  20. Build and measurements of Linux kernel configurations across different versions

    • data.niaid.nih.gov
    Updated Dec 14, 2022
    Cite
    Hugo Martin (2022). Build and measurements of Linux kernel configurations across different versions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7433622
    Explore at:
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Juliana Alves Pereira
    Luc Lesoil
    Jean-Marc Jézéquel
    Mathieu Acher
    Hugo Martin
    Djamel Eddine Khelladi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With large-scale and complex configurable systems, it is hard for users to choose the right combination of options (i.e., configurations) to obtain the desired trade-off between functionality and performance goals such as speed or size. Machine learning can help relate these goals to the configurable system's options, and thus predict the effect of options on the outcome, typically after a costly training step. However, many configurable systems evolve at such a rapid pace that it is impractical to retrain a new model from scratch for each new version. Taking the extreme case of the Linux kernel with its ≈14,500 configuration options, we investigate how predictions of kernel binary size degrade over successive versions, and how transfer learning can be adapted and applied to mitigate this degradation.

    We used and are sharing a unique and large dataset consisting of the binary sizes (compressed and non-compressed) of thousands of configurations for different versions of the kernel, spanning three years (4.13, 4.15, 4.20, 5.0, 5.4, 5.7, and 5.8): overall, around 200K configurations over 10K+ options/features and seven versions.

    This dataset has been used in the Transactions of Software Engineering (TSE) article "Transfer Learning Across Variants and Versions: The Case of Linux Kernel Size" (preprint: https://hal.inria.fr/hal-03358817)
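
    An illustrative sketch of the transfer setting studied here (not the paper's exact method, and with synthetic stand-in data): fit a size predictor on an older version, then correct its drift on a newer version with a small linear shift model fitted on a handful of fresh measurements:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        X_old = rng.integers(0, 2, (500, 50))        # option vectors, older version
        y_old = X_old @ rng.normal(1, 0.3, 50) + 30  # binary sizes (synthetic)
        X_new = rng.integers(0, 2, (60, 50))         # a few configs, newer version
        y_new = X_new @ rng.normal(1, 0.3, 50) + 38  # sizes have drifted

        base = GradientBoostingRegressor().fit(X_old, y_old)
        # Linear "shift": map the old model's predictions onto the new version.
        shift = LinearRegression().fit(base.predict(X_new).reshape(-1, 1), y_new)
        pred_new = shift.predict(base.predict(X_new).reshape(-1, 1))
        print(round(float(np.mean(np.abs(pred_new - y_new))), 2))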
