100+ datasets found
  1. Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0)

    • data.usgs.gov
    • datasets.ai
    • +4 more
    Updated Sep 17, 2024
    Cite
    Ayman Alzraiee; Carol Luukkonen; Richard Niswonger; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Lisa Miller; Kristen Valseth; Joshua Larsen; Donald Martin; Cheryl Dieter; Jana Stewart; Scott Paulinski (2024). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. http://doi.org/10.5066/P9FUL880
    Explore at:
    10 scholarly articles cite this dataset
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Ayman Alzraiee; Carol Luukkonen; Richard Niswonger; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Lisa Miller; Kristen Valseth; Joshua Larsen; Donald Martin; Cheryl Dieter; Jana Stewart; Scott Paulinski
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 31, 2020
    Description

    This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, Python code used to develop and train the water-use machine learning model, and output water-use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public-supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public su ...
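
    A minimal sketch, assuming pandas and assuming the totals CSV carries a HUC12 identifier column alongside monthly estimate columns (the "huc12" column name is hypothetical; check the file's header):

        import pandas as pd

        # Load the monthly HUC12 totals; values are in million gallons per day.
        df = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"huc12": str})

        monthly_cols = [c for c in df.columns if c != "huc12"]
        df["mean_mgd"] = df[monthly_cols].mean(axis=1)  # long-run mean per HUC12
        print(df[["huc12", "mean_mgd"]].head())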

  2. Data Versioning Tool Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 4, 2024
    Cite
    Dataintelo (2024). Data Versioning Tool Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-versioning-tool-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Versioning Tool Market Outlook



    The global Data Versioning Tool market size was valued at approximately USD 1.5 billion in 2023 and is forecasted to reach around USD 4.8 billion by 2032, reflecting a robust CAGR of 13.7% during the forecast period. The growth in this market is primarily driven by the increasing need for efficient data management and the rising adoption of data-driven decision-making across various industries.
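
    The quoted growth figures can be sanity-checked with the standard CAGR formula; a one-off arithmetic sketch (the horizon is the 9 years from 2023 to 2032):

        # CAGR = (end / start) ** (1 / years) - 1
        start, end, years = 1.5, 4.8, 2032 - 2023
        cagr = (end / start) ** (1 / years) - 1
        print(f"implied CAGR: {cagr:.1%}")  # about 13.8%, close to the quoted 13.7%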



    One of the significant growth factors for the Data Versioning Tool market is the exponential increase in the volume of data generated by enterprises. The advent of Big Data, IoT, and AI technologies has led to a data explosion, necessitating advanced tools to manage and version this data effectively. Data versioning tools facilitate the tracking of changes, enabling organizations to maintain data integrity, compliance, and governance. This ensures that organizations can handle their data efficiently, leading to enhanced data quality and better analytical outcomes.



    Another driver contributing to the market's growth is the rising awareness of data security and compliance regulations. With stringent regulatory requirements such as GDPR, HIPAA, and CCPA, organizations are compelled to adopt robust data management practices. Data versioning tools provide an audit trail of data changes, which is crucial for compliance and reporting purposes. This capability helps organizations mitigate risks associated with data breaches and non-compliance, thereby fostering the adoption of these tools.



    The increasing popularity of cloud computing also acts as a catalyst for the growth of the Data Versioning Tool market. Cloud-based data versioning tools offer scalability, flexibility, and cost-effectiveness, making them an attractive option for businesses of all sizes. These tools enable real-time collaboration and access to versioned data from any location, which is particularly beneficial in today's remote working environment. The seamless integration of cloud-based data versioning tools with other cloud services further enhances their value proposition, driving market growth.



    Regionally, North America held the largest market share in 2023, attributed to the presence of major technology companies and the high adoption rate of advanced data management solutions. The Asia Pacific region is expected to exhibit the highest CAGR during the forecast period, driven by the rapid digital transformation and increasing investments in data infrastructure by emerging economies like China and India. Europe also presents significant growth opportunities due to stringent data protection regulations and the growing emphasis on data governance.



    Component Analysis



    The Data Versioning Tool market is segmented into software and services based on the component. The software segment held a dominant share in the market in 2023, driven by the high demand for advanced data management solutions. These software tools offer a wide range of functionalities, including data tracking, version control, and rollback capabilities, which are essential for maintaining data integrity and consistency. The integration of AI and machine learning algorithms in these tools further enhances their efficiency, making them indispensable for modern enterprises.



    The services segment, although smaller, is expected to grow at a significant pace during the forecast period. This growth is attributed to the increasing need for consulting, implementation, and support services associated with data versioning tools. Organizations often require expert guidance to deploy these tools effectively and integrate them with their existing systems. Additionally, the ongoing maintenance and updates necessitate continuous support services, driving the demand in this segment.



    The software segment can be further categorized into on-premises and cloud-based solutions. On-premises software is preferred by organizations with stringent data security requirements and those that need complete control over their data. However, the cloud-based software segment is expected to witness higher growth due to its scalability, cost-effectiveness, and ease of deployment. The cloud model also supports real-time collaboration and remote access, which are critical in today's distributed work environments.



    Within the services segment, consulting services are anticipated to hold a substantial share. As organizations embark on their data management journeys, they seek expert advice to choose the right tools and strategies. Implementation services are a

  3. ModelOps and MLOps Platforms Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 23, 2025
    Cite
    Data Insights Market (2025). ModelOps and MLOps Platforms Report [Dataset]. https://www.datainsightsmarket.com/reports/modelops-and-mlops-platforms-1946071
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The ModelOps and MLOps platforms market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across various industries. The surge in data volume and complexity necessitates efficient management and deployment of ML models, fueling the demand for platforms that streamline the entire machine learning lifecycle. These platforms offer functionalities such as model versioning, monitoring, and deployment, enabling organizations to improve model performance, reduce operational costs, and accelerate time-to-market for AI-powered solutions. The market is segmented by deployment type (cloud, on-premise), organization size (small, medium, large), and industry vertical (finance, healthcare, retail, etc.), with cloud-based deployments gaining significant traction due to scalability and cost-effectiveness. Key players are actively investing in research and development, incorporating advanced features like automated model retraining, explainable AI (XAI), and MLOps automation to enhance platform capabilities and cater to evolving business needs. Competition is intensifying, with both established technology vendors and specialized startups vying for market share through strategic partnerships, acquisitions, and innovative product offerings.

    The forecast period (2025-2033) promises further expansion, fueled by factors such as the growing adoption of edge AI, the rise of generative AI, and the increasing demand for real-time analytics. However, challenges such as the need for skilled professionals, data security and privacy concerns, and the complexity of integrating MLOps into existing IT infrastructures remain.

    Despite these challenges, the long-term outlook remains positive, with the market expected to witness substantial growth driven by continuous technological advancements, wider industry adoption, and increasing awareness of the benefits of streamlined ML model management. This market will be shaped by the ability of vendors to provide user-friendly interfaces, robust scalability, and seamless integration with existing data pipelines and business processes. The focus will shift towards addressing the complexities of deploying and managing increasingly sophisticated AI models in production environments.

  4. Software code quality and source code metrics dataset

    • data.mendeley.com
    • narcis.nl
    Updated Feb 17, 2021
    Cite
    Sayed Mohsin Reza (2021). Software code quality and source code metrics dataset [Dataset]. http://doi.org/10.17632/77p6rzb73n.2
    Explore at:
    Dataset updated
    Feb 17, 2021
    Authors
    Sayed Mohsin Reza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains quality and source code metrics information for 60 versions under 10 different repositories. The dataset is extracted at 3 levels: (1) Class, (2) Method, (3) Package. The dataset was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to convey size, popularity, and maintainability. File versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and of how the repository grows over time. File attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the real dataset as a unique identifier to show values for packages, classes, and methods.
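
    A minimal sketch, assuming pandas, for getting oriented in the three index files; no column names are assumed, so the headers are printed for inspection:

        import pandas as pd

        # Load the three index files shipped alongside the quality_attributes folder.
        repos = pd.read_csv("repositories.csv")
        versions = pd.read_csv("versions.csv")
        attrs = pd.read_csv("attribute-details.csv")

        # attribute-details.csv maps each metric's short form (the identifier used
        # in the per-class/method/package files) to its full name and description.
        for frame in (repos, versions, attrs):
            print(frame.columns.tolist())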

  5. Exploration of DVC (Data Version Control) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 8, 2023
    Cite
    (2023). Exploration of DVC (Data Version Control) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9459b6e6-14b1-5c9a-838c-23a12195b039
    Explore at:
    Dataset updated
    May 8, 2023
    Description

    The repository contains tutorials and code created during an exploration of DVC (Data Version Control) as a potential tool for managing machine learning pipelines within HZDR. The tutorials aim to help in understanding the tool's features and drawbacks, and also serve as future teaching material.
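
    For readers unfamiliar with the tool, a minimal sketch of DVC's Python API for reading a versioned file out of a DVC-tracked repository; the path, repository URL, and tag below are placeholders, not part of this dataset:

        import dvc.api

        # Open a tracked file at a specific Git revision; pinning "rev" to a tag
        # is the core of DVC-style data versioning. All arguments are placeholders.
        with dvc.api.open(
            "data/train.csv",
            repo="https://github.com/org/repo",
            rev="v1.0",
        ) as f:
            print(f.readline())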

  6. Predicting Vulnerability Inducing Function Versions Using Node Embeddings and Graph Neural Networks - Wireshark

    • data.mendeley.com
    Updated Jan 19, 2022
    Cite
    Sefa Eren Şahin (2022). Predicting Vulnerability Inducing Function Versions Using Node Embeddings and Graph Neural Networks - Wireshark [Dataset]. http://doi.org/10.17632/ymtf9znmfz.2
    Explore at:
    Dataset updated
    Jan 19, 2022
    Authors
    Sefa Eren Şahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wireshark Vulnerability Prediction Dataset

    This dataset was constructed by a team of researchers at Istanbul Technical University, Faculty of Computer and Informatics, and used in the paper entitled "Predicting Vulnerability Inducing Function Versions Using Node Embeddings and Graph Neural Networks". Please see the GitHub repository https://github.com/erensahin/gnn-vulnerability-prediction for more details on usage.

    This dataset consists of two main parts:

    • AST dumps, which can be used as inputs for any machine learning model (ast_input)
    • Wireshark file changes and bugs (file_changes_and_bugs)

    ast_input

    The ast_input folder contains three files:

    • ast_input.zip: a compressed archive of AST dumps in Python pickle format. Use the Python pickle library to load the data (see the sketch after this list).
    • node_embeddings_by_kind.pkl: embedding vectors corresponding to AST node kinds, in Python pickle format.
    • token_id_vocabulary.pkl: a map of token ids to their corresponding tokens, in Python pickle format.
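
    A minimal sketch of the loading step suggested above; the variable names are illustrative only:

        import pickle

        # Embedding vectors per AST node kind.
        with open("node_embeddings_by_kind.pkl", "rb") as f:
            node_embeddings = pickle.load(f)

        # Mapping of token ids to their corresponding tokens.
        with open("token_id_vocabulary.pkl", "rb") as f:
            token_vocab = pickle.load(f)

        print(type(node_embeddings), type(token_vocab))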

    file_changes_and_bugs

    file_changes_and_bugs folder consists of five files:

    • wireshark_file_changes.csv: a list of file changes made in the Wireshark repository. File changes are basically commit-file pairs.
    • wireshark_cve_bug_matching.csv: maps CVE entries to bug ids in the Wireshark bug repository. This is scraped from https://www.wireshark.org/security/
    • additional_bugs.csv: additional security-related bugs that our team manually identified by investigating security advisories and bug reports.
    • wireshark_bug_commit_matching.csv: maps security bugs (vulnerabilities) to commits in the Wireshark source code repository.
    • wireshark_bug_inducing_file_changes.csv: maps vulnerabilities in Wireshark source files in terms of the commits in which a vulnerability was induced and fixed.
  7. AI4Arctic / ASIP Sea Ice Dataset - version 2

    • data.dtu.dk
    • figshare.com
    pdf
    Updated Jul 12, 2023
    Cite
    Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler; Leif Toudal Pedersen; David Malmgren-Hansen; Allan Aasbjerg Nielsen; Henning Skriver (2023). AI4Arctic / ASIP Sea Ice Dataset - version 2 [Dataset]. http://doi.org/10.11583/DTU.13011134.v3
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler; Leif Toudal Pedersen; David Malmgren-Hansen; Allan Aasbjerg Nielsen; Henning Skriver
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AI4Arctic / ASIP Sea Ice Dataset - version 2 (ASID-v2) contains 461 Sentinel-1 Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute in 2018-2019. Ice charts contain sea ice concentration, stage of development, and form of ice, provided as manually drawn polygons. The ice charts have been projected into the Sentinel-1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the Sentinel-1 data. Details are described in the manual published together with the dataset. The manual has been revised; the latest version is dated 30-09-2020.

  8. Data from: Machine Learning the Tip of the Red Giant Branch

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Oct 13, 2022
    Cite
    Mitchell T Dennis; Jeremy Sakstein (2022). Machine Learning the Tip of the Red Giant Branch [Dataset]. http://doi.org/10.5281/zenodo.7197506
    Explore at:
    Dataset updated
    Oct 13, 2022
    Authors
    Mitchell T Dennis; Jeremy Sakstein
    Description

    Reproduction package for the paper "Machine Learning the Tip of the Red Giant Branch".

    Authors: Mitchell T. Dennis (mtde226@hawaii.edu); Jeremy Sakstein (sakstein@hawaii.edu)

    Software: MESA version 15140 (http://mesa.sourceforge.net/); MESASDK version 20210401 (http://www.astro.wisc.edu/~townsend/static.php?ref=mesasdk); GFORTRAN GCC version 9.2.0

    Citation Policy: If you use any of this reproduction package for independent work, we recommend you cite the following papers:

    • Astrophys. J. Suppl. 192, 3 (2011)
    • Astrophys. J. Suppl. 208, 4 (2013)
    • Astrophys. J. Suppl. 234, 34 (2018)
    • Astrophys. J. Suppl. 243, 10 (2019)

    For more information, see the Readme.md

  9. Replication Package for "Why Do Deep Learning Projects Differ in Compatible Framework Versions? An Exploratory Study"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 13, 2023
    Cite
    Huashan Lei (2023). Replication Package for "Why Do Deep Learning Projects Differ in Compatible Framework Versions? An Exploratory Study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8266949
    Explore at:
    Dataset updated
    Sep 13, 2023
    Dataset provided by
    Jun Wang
    Yepang Liu
    Shuai Zhang
    Yulei Sui
    Huashan Lei
    Guanping Xiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the scripts and data used to generate the results reported in the paper. Detailed information is provided in README.md.

    code

    This folder contains all the scripts used for the experiment. upgrade.py and downgrade.py perform the upgrade and downgrade runs, pairing.py generates the DFVC pairs, and main.py identifies the root causes of DFVC pairs.

    result

    This folder contains all the results of the experiments, including the runtime output (e.g., a_1.0.0.txt), the runtime environment (e.g., condalist_1.0.0.txt), and the project's runtime commands (e.g., pytorch-cifar.xlsx) for all 90 tested PyTorch projects and 50 tested TensorFlow projects.

    Distribution of dfvc pairs.xlsx

    This file includes 6,926 DFVC pairs and their root causes.

    Tested framework versions.xlsx

    This file includes the framework versions tested and the Python versions that the framework versions are compatible with.

    Tested projects.xlsx

    This file includes the tested 90 PyTorch projects and 50 TensorFlow projects. We provide the following main information: (a) project name, (b) stars, (c) link, (d) the starting version, (e) python version, (f) incompatible upgrade/downgrade version, and (g) compatible versions.

  10. Metadata record for: Compendiums of cancer transcriptomes for machine learning applications

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Su Bin Lim; Swee Jin Tan; Wan-Teck Lim; Chwee Teck Lim (2023). Metadata record for: Compendiums of cancer transcriptomes for machine learning applications [Dataset]. http://doi.org/10.6084/m9.figshare.9901763.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Su Bin Lim; Swee Jin Tan; Wan-Teck Lim; Chwee Teck Lim
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor "Compendiums of cancer transcriptomes for machine learning applications". Contents:

        1. human readable metadata summary table in CSV format
        2. machine readable metadata file in JSON format
        3. machine readable metadata file in ISA-Tab format (zipped folder)

    Versioning Note: A revised version was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
  11. Mechanical MNIST crack path extended version

    • search.dataone.org
    • datadryad.org
    • +1 more
    Updated May 3, 2025
    Cite
    Saeed Mohammadzadeh; Emma Lejeune (2025). Mechanical MNIST crack path extended version [Dataset]. http://doi.org/10.5061/dryad.rv15dv486
    Explore at:
    Dataset updated
    May 3, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Saeed Mohammadzadeh; Emma Lejeune
    Time period covered
    Jan 1, 2021
    Description

    The Mechanical MNIST Crack Path dataset contains Finite Element simulation results from phase-field models of quasi-static brittle fracture in heterogeneous material domains subjected to prescribed loading and boundary conditions. For all samples, the material domain is a square with a side length of 1. There is an initial crack of fixed length (0.25) on the left edge of each domain. The bottom edge of the domain is fixed in x (horizontal) and y (vertical), the right edge of the domain is fixed in x and free in y, and the left edge is free in both x and y. The top edge is free in x, and in y it is displaced such that, at each step, the displacement increases linearly from zero at the top right corner to the maximum displacement on the top left corner. Maximum displacement starts at 0.0 and increases to 0.02 by increments of 0.0001 (200 simulation steps in total). The heterogeneous material distribution is obtained by adding rigid circular inclusions to the domain using the Fashion MNIST...
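
    The loading schedule above can be written down directly; a worked sketch (illustrative only, not the release's simulation code):

        import numpy as np

        n_steps = 200
        max_disp = 0.0001 * np.arange(1, n_steps + 1)  # 0.0001 ... 0.02 per step

        # Linear profile along the top edge: maximum displacement at the left
        # corner (x = 0), zero at the right corner (x = 1).
        x = np.linspace(0.0, 1.0, 101)
        final_profile = max_disp[-1] * (1.0 - x)
        print(max_disp[-1], final_profile[:3])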

  12. LSD4WSD VX: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling - Full Analysis Version

    • ieee-dataport.org
    Updated Oct 30, 2023
    Cite
    matthieu gallet (2023). LSD4WSD VX: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling - Full Analysis Version [Dataset]. https://ieee-dataport.org/documents/lsd4wsd-vx-open-dataset-wet-snow-detection-sar-data-and-physical-labelling-full-analysis
    Explore at:
    Dataset updated
    Oct 30, 2023
    Authors
    matthieu gallet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.

  13. NBA Player Dataset & Prediction Model Artifacts

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png +2
    Updated Apr 28, 2025
    Cite
    Burak Baltali; Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43
    Explore at:
    Available download formats: json, png, csv, bin, txt, text/markdown
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Burak Baltali; Burak Baltali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

    Brief overview of files

    1. End-of-season box-score aggregates (2012-13 through 2023-24), split into train/test.

    2. The Jupyter notebook (Analysis.ipynb); all the code can be executed in there.

    3. The trained model binary (nba_model.pkl); a serialized Random Forest model artifact.

    4. Evaluation plots (LAL vs. whole-league) for regular-season & playoff predictions, provided as PNG outputs.

    5. FAIR4ML metadata (fair4ml_metadata.jsonld); see README.md and abbreviations.txt for file details.

    6. For further information, see the GitHub repository (link below).

    File Details

    Notebook

    Analysis.ipynb: contains the graphical output of the trained and tested data.

    Training/Test CSV Data

    Name | Description | PID
    regular_train.csv | For training on the regular season, the seasons 2012-2013 through 2021-2022 were selected | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced
    regular_test.csv | For testing on the regular season, the 2022-2023 season was selected | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
    playoff_train.csv | For training on the playoffs, the seasons 2012-2013 through 2022-2023 were selected | bcb3cf2b-27df-48cc-8b76-9e49254783d0
    playoff_test.csv | For testing on the playoffs, the 2023-2024 season was selected | de37d568-e97f-4cb9-bc05-2e600cc97102

    Others

    abbreviations.txt: contains the fundamental abbreviations for the columns in the CSV data

    Additional Notes

    Raw csv files are taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)

    Some preprocessing had to be done before uploading into DBRepo.

    Plots have also been uploaded as an output for visual purposes.

    A more detailed version can be found on github (Link: https://github.com/bubaltali/nba-prediction-analysis/)
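
    A hedged sketch of reusing the listed artifacts: unpickle the Random Forest and score the regular-season test split. The feature selection is hypothetical; consult README.md and abbreviations.txt for the real schema:

        import pickle
        import pandas as pd

        with open("nba_model.pkl", "rb") as f:
            model = pickle.load(f)  # serialized Random Forest artifact

        test = pd.read_csv("regular_test.csv")
        X = test.select_dtypes("number")  # assumed: numeric box-score columns as features
        print(model.predict(X)[:5])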

  14. Spacekit Data Archive

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 3, 2025
    Cite
    Kein, Ru (2025). Spacekit Data Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7830048
    Explore at:
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Kein, Ru
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Collection of datasets, models and training results for spacekit machine learning algorithms. To learn more, please visit https://spacekit.readthedocs.io/en/latest/

    Versioning note: modifications to existing uploads are indicated by major version iterations (e.g. 1.0, 2.0, 3.0); new file additions are denoted by minor version increments (e.g. 1.1, 1.2, 1.3) since these are inherently backwards compatible.

  15. A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 7, 2022
    Cite
    Bernardo Rondelli (2022). A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4536032
    Explore at:
    Dataset updated
    Sep 7, 2022
    Dataset provided by
    César Parra-Rojas
    Bernardo Rondelli
    Nicolau Duran-Silva
    Francesco Alessandro Massucci
    Enric Fuster
    Fernando Roda
    Nicandro Bovenzi
    Chiara Toietta
    Arnau Quinquillà
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI)

    This controlled vocabulary of keywords related to the field of Artificial Intelligence (AI) was built by SIRIS Academic in collaboration with ART-ER (the R&I and sustainable development in-house agency of the Emilia-Romagna region in Italy) and the Generalitat de Catalunya (the regional government of Catalonia, Spain), in order to identify AI research, development and innovation activities. The work was carried out with the advice of domain experts, and the vocabulary was ultimately applied to inform regional strategies on AI and on research and innovation policy.

    The aim of this vocabulary is to enable one to retrieve texts (e.g. R&D projects and scientific publications) that feature the vocabulary's concepts in their titles and abstracts, on the assumption that such records make some contribution to applications, techniques or issues in the domain of AI.

    The present effort was carried out because, despite the high number of contributions and technological developments in the field of AI, there is no closed or static vocabulary of concepts that allows one to unequivocally define the boundaries of what should (or should not) be considered "an Artificial Intelligence intellectual product". Indeed, the literature presents different definitions of the domain, with visions that can be contradictory. AI today encompasses a wide variety of subdomains, ranging from general-purpose areas such as learning and perception to more specific ones such as autonomous vehicle driving, theorem proving, or industrial process monitoring. AI synthesises and automates intellectual tasks, and is therefore potentially relevant to any area of human intellectual activity. In this sense, it is a genuinely universal and multidisciplinary field. AI draws upon disciplines as diverse as cybernetics, mathematics, philosophy, sociology and economics.

    As a ground for the construction of the AI controlled vocabulary, an initial set of concepts was taken from different subdomains of the ACM Computing Classification System 2012 to define the boundaries of the AI domain. Notably, some relevant AI subdomains that have an independent category in the ACM taxonomy outside of AI were nevertheless included in the list of subdomains. In order to align the ACM taxonomical definition with the Catalan Strategy of AI, CATALONIA.AI, version 1 of this resource included the emerging area of AI Ethics in the vocabulary, while some other categories not relevant to the objectives were removed from the subdomains list. In the current version 2, the classification and the labels of the subdomains have been revised to reflect the evolution of the field. Some fields have been grouped in order to reduce the overlap between subdomains and to provide a taxonomy that makes more sense for the analysis of R&I ecosystems.

    The different subdomains in the versions are presented in the following table:

    Version | Subdomains
    Version 2 | (1) Machine learning and deep learning; (2) Computer Vision; (3) Natural Language Processing and speech recognition; (4) Intelligent agents, planning, scheduling, problem-solving, control methods, and search; (5) Expert Systems, Knowledge representation and reasoning; (6) AI Ethics.
    Version 1 | (1) General; (2) Machine Learning; (3) Computer Vision; (4) Natural Language Processing; (5) Knowledge Representation and Reasoning; (6) Distributed Artificial Intelligence; (7) Expert Systems, Problem-Solving, Control Methods and Search; (8) AI Ethics.

    Although a keyword rule-based approach suffers from the major shortcoming of capturing neither all the lexical and linguistic variants of specific concepts nor the context of the words (namely, keyword-based approaches will miss relevant texts if the specific pattern is not matched during the search), the present vocabulary allowed us to obtain fairly good results, owing to the specificity of the concepts describing the AI domain. Furthermore, an understandable and transparent controlled vocabulary allows better control over the final results and over the final definition of the domain borders. Also, a plain list of terms allows much easier and more interactive engagement of interested stakeholders with different degrees of knowledge (for instance, domain experts, policy-makers and potential users), who can use the vocabulary to retrieve pertinent literature or to enrich the resource itself.

    The vocabulary has been built taking advantage of advanced language models and resources from knowledge datasets such as arXiv, DBpedia and Wikipedia. The resulting vocabulary comprises 833 keywords, and has been validated by experts from several universities in Emilia-Romagna and Catalonia.

    Version 0.5 of this resource was developed by SIRIS Academic in 2019 in collaboration with ART-ER, Emilia-Romagna (Quinquillá et al., 2020); version 1 was the result of an update done in 2020 in collaboration with the Generalitat de Catalunya; and the current version (version 2) resulted in 2021 from the collaboration with ART-ER and the integration of an additional set of keywords provided by the Artificial Intelligence and Intelligent Systems (AIIS) Laboratory of the CINI (Consorzio interuniversitario nazionale per l'informatica, based in Rome, Italy).

    The methodology for the construction of the controlled vocabulary is presented in the following steps:

    An initial set of scientific publications was collected by retrieving the following records as a weakly-supervised dataset (in the sense that records are linked to AI by their taxonomy and not by a manual label) in the domain of Artificial Intelligence:

    Publications from Scopus with the keyword “Artificial Intelligence”

    Publications from arXiv in the category “Artificial Intelligence”

    Publications in relevant journals in the scientific domain of “Artificial Intelligence”

    An automated algorithm was used to retrieve, from the APIs of DBpedia, a series of terms that have some categorical relationship (i.e. those indexed as "sub-categories of", "equivalent to", among other relations in DBpedia) with the Artificial Intelligence concept and with the AI categories in the ACM taxonomy. The DBpedia tree was exploited down to level 3; relevant categories were manually selected (for instance: Classification algorithms, Machine learning or Evolutionary computation) and others were ignored (for instance: Artificial intelligence in fiction, Robots or History of artificial intelligence) because they were not relevant, or not specific to the domain.

    The keywords in the dataset's publications were extracted from the keyword sections and from the abstracts. The keywords with the highest TF-IDF, using an IDF matrix in the open domain, were selected. The co-occurrence of keywords with categories in specific AI subdomains, together with a clustering of the main keywords, was used to categorize the keywords at the thematic level.

    This list of keywords tagged by thematic category was manually revised, removing non-pertinent keywords and correcting wrong categorizations.

    The weakly-supervised dataset in the domain of Artificial Intelligence was used to train a Word2Vec (Mikolov et al., 2013) word embedding model (a machine learning model based on neural networks).

    The terms’ list is then enriched by means of automatic methods, which are run in parallel:

    The trained Word2Vec model is used to select, among the indexed keywords of the reference corpus, all terms “semantically close” to the initial set of words. This step is carried out to select terms that might not appear in the texts themselves, but that were deemed pertinent to label the textual records.

    Further, terms that are mentioned in the texts of the reference corpus and that are valued by the trained Word2Vec model as "semantically close" to the initial set of words are also retained. This step is performed to include in the controlled vocabulary a series of terms that are related to the focus of the domain and which are used by practitioners.
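
    A minimal sketch of this enrichment idea, assuming gensim (the text cites Mikolov et al., not a specific library); the toy corpus, seed list, and similarity cut-off are all placeholders:

        from gensim.models import Word2Vec

        corpus = [
            ["deep", "learning", "image", "classification"],
            ["machine", "learning", "model", "training"],
            ["computer", "vision", "object", "detection"],
        ] * 50  # stand-in for the tokenized titles and abstracts

        model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, seed=1)

        seeds = ["learning", "vision"]
        candidates = {
            term
            for s in seeds if s in model.wv
            for term, score in model.wv.most_similar(s, topn=10)
            if score > 0.0  # "semantically close" threshold: a tunable assumption
        }
        print(sorted(candidates))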

    The final list produced by steps 2-6 is manually revised.

    The definition of the vocabulary does not, per se, allow one to identify STI contributions to AI: this activity boils down to matching the terms in the controlled vocabulary against the content of the gathered STI textual records. To carry out this task successfully, a series of pattern matching rules must be defined to capture possible variants of the same concept, such as permutations of words within the concept and/or the presence of null words to be skipped. For this reason, we have carefully crafted matching rules that take into account permutations of words and that allow the words within a concept to be within a certain distance of one another. Some relatively ambiguous keywords (which may match unwanted pieces of text) have a set of associated "extra" terms. These "extra" terms are defined as further terms that must co-appear, in the same sentence, together with their associated ambiguous keywords.
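
    A schematic sketch of such matching rules (not the authors' implementation): every word of a keyword must occur within a small window of the sentence, in any order, and ambiguous keywords additionally require one of their "extra" terms in the same sentence:

        from itertools import product

        def keyword_matches(tokens, keyword_words, window=4, extra_terms=None):
            # Ambiguous keywords need a disambiguating "extra" term in the sentence.
            if extra_terms and not any(t in tokens for t in extra_terms):
                return False
            positions = [[i for i, t in enumerate(tokens) if t == w]
                         for w in keyword_words]
            if any(not p for p in positions):
                return False  # some word of the keyword never occurs
            # Any-order proximity: some choice of occurrences spans <= window tokens.
            return any(max(c) - min(c) <= window for c in product(*positions))

        sent = "we train a deep convolutional network end to end".split()
        print(keyword_matches(sent, ["network", "deep"]))  # True: within 4 tokens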

    Finally, each keyword in the vocabulary was assigned one or more AI subdomains, so that the vocabulary can also be used to tag collections of texts within narrower AI sub-domains. In order to complement the alignment between keywords and subdomains, a set of subdomain-specific keywords have been defined to better capture the scope of the subdomains. These allow better characterization of subdomains that are more difficult to define only by means of unambiguous specific concepts, or that overlap with the wide “machine learning” subdomain (example: machine learning applied to object recognition or

  16. Version Control Systems Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 17, 2025
    Cite
    Archive Market Research (2025). Version Control Systems Report [Dataset]. https://www.archivemarketresearch.com/reports/version-control-systems-565976
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Version Control Systems (VCS) market is experiencing robust growth, driven by the increasing adoption of DevOps practices, cloud computing, and the rising need for collaborative software development. The market size in 2025 is estimated at $607.6 million. While the specific CAGR isn't provided, considering the industry trends and the presence of major players like Microsoft, Amazon Web Services, and IBM, a conservative estimate would place the CAGR between 10% and 15% for the forecast period (2025-2033). This growth is fueled by several factors, including the expanding adoption of Agile methodologies, the demand for improved software quality and faster release cycles, and the increasing complexity of software projects. The market is segmented across various types of VCS, including distributed systems (like Git), centralized systems, and cloud-based solutions. The competitive landscape is highly fragmented, with established players alongside emerging innovative companies. The growth of open-source VCS solutions and the increasing focus on security and integration with other development tools are further shaping the market dynamics.

    The continued rise of cloud-native applications and microservices architectures will significantly contribute to the growth trajectory of the VCS market throughout the forecast period. Companies are increasingly adopting cloud-based VCS solutions for enhanced scalability, accessibility, and collaboration. Furthermore, the growing demand for robust security features and compliance with industry regulations is compelling organizations to invest in sophisticated VCS platforms. The market is expected to see further consolidation in the coming years, with larger players potentially acquiring smaller companies to expand their market share and capabilities. The integration of AI and machine learning into VCS platforms for automated code review and improved development workflows will be a key trend shaping the future of the market.

  17. Dataset and scripts for A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry

    • rodare.hzdr.de
    zip
    Updated Oct 1, 2021
    Cite
    Fiedler, Lenz; Shah, Karan; Cangi, Attila; Bussmann, Michael (2021). Dataset and scripts for A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry [Dataset]. http://doi.org/10.14278/rodare.1197
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    HZDR / CASUS
    Authors
    Fiedler, Lenz; Shah, Karan; Cangi, Attila; Bussmann, Michael
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains additional data for the publication "A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry". Its goal is to enable interested people to reproduce the citation analysis carried out in the aforementioned publication.

    Prerequisites

    The following software versions were used for the Python part of this dataset:

    Python: 3.8.6

    Scholarly: 1.2.0

    Pyzotero: 1.4.24

    Numpy: 1.20.1

    Contents

    results/ : Contains the .csv files that were the results of the citation analysis. Paper groupings follow the ones outlined in the publication.

    scripts/ : Contains scripts to perform the citation analysis.

    Zotero.cached.pkl : Contains the cached Zotero library.

    Usage

    To reproduce the results of the citation analysis, you can use citation_analysis.py in conjunction with the cached Zotero library. Manual additions can be verified using the check_consistency script.
    Please note that you will need a Tor key for the citation analysis, and access to our Zotero library if you don't want to use the cached version. If you need this access, simply contact us.
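
    Since the cache ships as a pickle file, a minimal loading sketch (the structure of the unpickled object is whatever the authors' scripts stored; inspecting its type is a safe first step):

        import pickle

        with open("Zotero.cached.pkl", "rb") as f:
            library = pickle.load(f)  # snapshot of the Zotero library

        print(type(library))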

  18. SynSpeech Dataset (Large Version Part 1)

    • figshare.com
    txt
    Updated Nov 7, 2024
    Cite
    Yusuf Brima (2024). SynSpeech Dataset (Large Version Part 1) [Dataset]. http://doi.org/10.6084/m9.figshare.27628047.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    figshare
    Authors
    Yusuf Brima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SynSpeech Dataset (Large Version Part 1) is an English-language synthetic speech dataset designed for benchmarking disentangled speech representation learning methods. Created using OpenVoice and LibriSpeech-100, it includes 249 unique speakers, each with 500 distinct sentences spoken in four styles: "default," "friendly," "sad," and "whispering," recorded at a 16 kHz sampling rate.

    Due to file size limitations, the dataset has been split into two nearly equal halves. This first half contains data for 136 of the 249 speakers, along with metadata detailing speaker information, gender, speaking style, text, and file paths. The synspeech_Large_Metadata.csv file provides metadata for both halves, and both parts of the archive must be extracted and placed within the same parent directory for full functionality.

    Data is organized by speaker ID, making this dataset ideal for applications in representation learning, speaker and content factorization, and TTS synthesis.
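
    A hedged sketch, assuming pandas, of using the global metadata file to locate one speaker's whispered utterances; the column names are hypothetical, since the description only promises speaker, gender, style, text, and file-path fields:

        import pandas as pd

        meta = pd.read_csv("synspeech_Large_Metadata.csv")
        print(meta.columns.tolist())  # check the real schema first

        first_speaker = meta["speaker_id"].iloc[0]  # assumed column name
        whispers = meta[(meta["speaker_id"] == first_speaker)
                        & (meta["style"] == "whispering")]  # style value per description
        print(len(whispers), "whispered utterances for the first speaker")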

  19. RF Probability (version 0.1): Mapping Wetlands with High Resolution Planet SuperDove Satellite Imagery

    • figshare.canterbury.ac.nz
    tiff
    Updated Jul 28, 2025
    Cite
    Matthew Wilson; Saif Khan (2025). RF Probability (version 0.1): Mapping Wetlands with High Resolution Planet SuperDove Satellite Imagery [Dataset]. http://doi.org/10.26021/canterburynz.29310392.v1
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    University of Canterbury Data Repository
    Authors
    Matthew Wilson; Saif Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets represent predictions and associated probabilities using four machine learning methods, associated with the collection: Mapping Wetlands with High Resolution Planet SuperDove Satellite Imagery: An Assessment of Machine Learning Models Across the Diverse Waterscapes of New Zealand (10.26021/canterburynz.c.7848596). The following datasets are available:

    • HGB prediction
    • HGB probability
    • MLPC prediction
    • MLPC probability
    • Random forest prediction
    • Random forest probability [this dataset]
    • XGBoost prediction
    • XGBoost probability

    For details of the models developed, please see the collection and associated paper. The following files are available in each dataset, each representing an area within New Zealand:

    • xxxxx_mmm_prediction.tif: model prediction, encoded as 8-bit integers where 1 is predicted as wetland (>50% probability), and NA (no data) is non-wetland.
    • xxxxx_mmm_probability.tif: model wetland probability, encoded as 16-bit integers, with probability values from 0 to 1 rescaled from 0 to 10,000. Divide the values by 10,000 to obtain probabilities to four decimal places.

    In the tile filenames, xxxxx refers to the UUID of the grid area, which can be found in the file nzgrid_uuid.gpkg, and mmm is a code which refers to the model used:

    • hgb: histogram gradient boost
    • mlpc: multi-layer perceptron classification
    • rf: random forest
    • xgb: extreme gradient boosting

    In addition to the tif images, two virtual raster tile files are included to enable mapping at the national scale: _mmm_prediction.vrt and _mmm_probability.vrt.

    All tif images are saved using cloud optimised geotiff (COG), which makes them fast to display even at a national level, although it increases the data size. Total size is around 700 MB for the prediction datasets, and ~75 GB for the probability datasets. Metadata for the Planet SuperDove imagery used for each pixel of the predictions is available here: https://doi.org/10.26021/canterburynz.29231837.v
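
    A minimal sketch, assuming rasterio, of recovering probabilities from a 16-bit probability tile exactly as the description instructs; the filename is a placeholder following the xxxxx_mmm_probability.tif pattern:

        import rasterio

        # Read band 1 of a probability tile (integer values 0..10000).
        with rasterio.open("xxxxx_rf_probability.tif") as src:
            scaled = src.read(1)

        prob = scaled / 10000.0  # rescale back to probabilities in [0, 1]
        print(prob.min(), prob.max())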

  20. Build and measurements of Linux kernel configurations across different versions

    • data.niaid.nih.gov
    Updated Dec 14, 2022
    Cite
    Hugo Martin (2022). Build and measurements of Linux kernel configurations across different versions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7433622
    Explore at:
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Juliana Alves Pereira
    Luc Lesoil
    Jean-Marc Jézéquel
    Mathieu Acher
    Hugo Martin
    Djamel Eddine Khelladi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With large-scale and complex configurable systems, it is hard for users to choose the right combination of options (i.e., configurations) to obtain the desired trade-off between functionality and performance goals such as speed or size. Machine learning can help relate these goals to the configurable system's options, and thus predict the effect of options on the outcome, typically after a costly training step. However, many configurable systems evolve at such a rapid pace that it is impractical to retrain a new model from scratch for each new version. Taking the extreme case of the Linux kernel with its ≈14,500 configuration options, we investigate how predictions of kernel binary size degrade over successive versions, and how transfer learning can be adapted and applied to mitigate this degradation.

    We used and are sharing a unique and large dataset consisting of the binary sizes (compressed and non-compressed) of thousands of configurations for different versions of the kernel, spanning three years (4.13, 4.15, 4.20, 5.0, 5.4, 5.7, and 5.8): overall, around 200K configurations over 10K+ options/features and seven versions.

    This dataset has been used in the Transactions of Software Engineering (TSE) article "Transfer Learning Across Variants and Versions: The Case of Linux Kernel Size" (preprint: https://hal.inria.fr/hal-03358817)
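
    An illustrative sketch of the transfer setting studied here (not the paper's exact method, and with synthetic stand-in data): fit a size predictor on an older version, then correct its drift on a newer version with a small linear shift model fitted on a handful of fresh measurements:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        X_old = rng.integers(0, 2, (500, 50))        # option vectors, older version
        y_old = X_old @ rng.normal(1, 0.3, 50) + 30  # binary sizes (synthetic)
        X_new = rng.integers(0, 2, (60, 50))         # a few configs, newer version
        y_new = X_new @ rng.normal(1, 0.3, 50) + 38  # sizes have drifted

        base = GradientBoostingRegressor().fit(X_old, y_old)
        # Linear "shift": map the old model's predictions onto the new version.
        shift = LinearRegression().fit(base.predict(X_new).reshape(-1, 1), y_new)
        pred_new = shift.predict(base.predict(X_new).reshape(-1, 1))
        print(round(float(np.mean(np.abs(pred_new - y_new))), 2))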
