30 datasets found
  1. polyOne Data Set - 100 million hypothetical polymers including 29 properties

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2023
    Cite
    Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Rampi Ramprasad
    Christopher Kuenneth
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

    Load the sharded data set with dask:

    ```python
    import dask.dataframe as dd

    ddf = dd.read_parquet("*.parquet", engine="pyarrow")
    ```

    For example, compute the description of the data set:

    ```python
    df_describe = ddf.describe().compute()
    df_describe
    ```

    PSMILES strings only

    • generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
    • generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
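
    If only the PSMILES strings are needed, the two text files can be streamed line by line. A minimal sketch, assuming the files have been downloaded into the working directory:

    ```python
    # Minimal sketch: stream PSMILES strings from the text files without
    # loading all 80 or 20 million lines into memory at once.
    def iter_psmiles(path):
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # one PSMILES string per line; skip blanks
                    yield line

    # Example: peek at the first training string.
    first = next(iter_psmiles("generated_polymer_smiles_train.txt"))
    print("first PSMILES:", first)
    ```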
    
  2. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation and geospatial metadata blocks of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for author names, affiliations, identifier types and identifiers.
    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories:

    • The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column indicating whether the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in.
    • One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name, which should help those who are interested in the metadata of only the latest version of each dataset.
    • The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the datasets' Dataverse JSON exports.

    The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
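
    To give a sense of how the per-field CSV files can be explored, the sketch below loads one of them with pandas. The file name comes from the listing above; the exact column headers are not documented here, so the sketch only inspects them rather than assuming any.

    ```python
    import pandas as pd

    # Minimal sketch: load the "Author" metadata exported from all installations
    # (file name taken from the directory listing above).
    authors = pd.read_csv(
        "csv_files_with_metadata_from_most_known_dataverse_installations/"
        "author(citation)_2023.08.22-2023.08.28.csv"
    )

    # Inspect the columns before relying on any particular field name.
    print(authors.columns.tolist())
    print(authors.head())
    ```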

  3. Python Systems for Empirical Analysis

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Matteo Orrù; Matteo Orrù (2020). Python Systems for Empirical Analysis [Dataset]. http://doi.org/10.5281/zenodo.268468
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo Orrù; Matteo Orrù
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reference

    Studies that use the data (in any form) are required to include the following reference:

    @inproceedings{Orru2015,
      abstract  = {The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.},
      author    = {Orrú, Matteo and Tempero, Ewan and Marchesi, Michele and Tonelli, Roberto and Destefanis, Giuseppe},
      booktitle = {Submitted to PROMISE '15},
      keywords  = {Python, Empirical Studies, Curated Code Collection},
      title     = {A Curated Benchmark Collection of Python Systems for Empirical Studies on Software Engineering},
      year      = {2015}
    }

    About the Data

    Overview

    This paper presents a dataset of metrics taken from a curated collection of 51 popular Python software systems.

    The dataset reports 41 metrics of different categories: volume/size, complexity, and object-oriented metrics. These metrics are computed at both file and class level. We provide metrics for every file and class of each system, as well as global metrics computed on the entire system. Moreover, we provide 14 metadata items for each system.

    Paper Abstract

    The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.

  4. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
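
    One common way to sample The Stack without downloading the full near-deduplicated 3 TB is to stream a single language subset with the Hugging Face datasets library. A minimal sketch; the "data/python" subdirectory and the "content" column are assumptions based on typical BigCode layouts, so check the dataset card (and accept its terms) first:

    ```python
    from datasets import load_dataset  # pip install datasets

    # Minimal sketch: stream one language slice instead of downloading everything.
    # "data/python" and the "content" field are assumptions; verify on the dataset page.
    ds = load_dataset(
        "bigcode/the-stack",
        data_dir="data/python",
        split="train",
        streaming=True,
    )

    for i, example in enumerate(ds):
        print(example["content"][:200])
        if i >= 2:
            break
    ```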

  5. TweetyNet results

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Apr 29, 2022
    Cite
    David Nicholson; Yarden Cohen (2022). TweetyNet results [Dataset]. http://doi.org/10.5061/dryad.gtht76hk4
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 29, 2022
    Dataset provided by
    Dryad
    Authors
    David Nicholson; Yarden Cohen
    Time period covered
    2020
    Description

    Please see README at https://github.com/yardencsGitHub/tweetynet/blob/master/article/README.md that explains how to install the software. To replicate the key result as described in the section of the article "TweetyNet annotates with low error rates across individuals and species", please follow the set-up instructions in that README, and then run this script: https://github.com/yardencsGitHub/tweetynet/blob/master/article/src/scripts/replicate-key-result/runall.sh For guidance on how to adapt pre-trained models to new datasets, please see the vak documentation: https://vak.readthedocs.io/en/latest/

  6. darpa_sd2_perovskites Dataset

    • paperswithcode.com
    Updated May 25, 2020
    Cite
    Ian M. Pendleton; Mary K. Caucci; Michael Tynes; Aaron Dharna; Mansoor Ani Najeeb Nellikkal; Zhi Li; Emory M. Chan; Alexander J. Norquist; and Joshua Schrier (2020). darpa_sd2_perovskites Dataset [Dataset]. https://paperswithcode.com/dataset/darpa-sd2-perovskites
    Explore at:
    Dataset updated
    May 25, 2020
    Authors
    Ian M. Pendleton; Mary K. Caucci; Michael Tynes; Aaron Dharna; Mansoor Ani Najeeb Nellikkal; Zhi Li; Emory M. Chan; Alexander J. Norquist; and Joshua Schrier
    Description

    Included in this content:

    • 0045.perovskitedata.csv - main dataset used in this article. A more detailed description can be found in the “dataset overview” section below.
    • Chemical Inventory.csv - the hand-curated file of all chemicals used in the construction of the perovskite dataset. This file includes identifiers, chemical properties, and other information.
    • ExcessMolarVolumeData.xlsx - record of experimental data, computations, and final dataset used in the generation of the excess molar volume plots.
    • MLModelMetrics.xlsx - all of the ML metrics organized in one place (excludes the reactant-set-specific breakdown; see ML_Logs.zip for those files).
    • OrganoammoniumDensityDataset.xlsx - complete set of the data used to generate the density values. Example calculations included.
    • model_matchup_main.py - Python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the “ML Code” section below. This file is also hosted on GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/model_matchup_main_20191231.py

    • SolutionVolumeDataset - complete set of 219 solutions in the perovskite dataset. Tabs include the automatically generated reagent information from ESCALATE, hand-curated reagent information from early runs, and the generation of the dataset used in the creation of Figure 5.
    • error_auditing.zip - code and historical datasets used for reporting the dataset auditing.
    • “AllCode.zip”, which contains:
      • model_matchup_main_20191231.py - Python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the “ML Code” section below. This file is also hosted on GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/0045.perovskitedata.csv
      • VmE_CurveFitandPlot.py - Python code for generating the third-order polynomial fit to the VmE vs. mole fraction of FAH included in the main text. Requires ‘MolFractionResults.csv’ to function (also included).
      • Calculation_Vm_Ve_CURVEFITTING.nb - Mathematica code for generating the third-order polynomial fit to the VmE vs. mole fraction of FAH included in the main text.
      • Covariance_Analysis.py - Python code for ingesting and plotting the covariance of features and volumes in the perovskite dataset. Includes renaming dictionaries used for the publication.
      • FeatureComparison_Plotting.py - Python code for reading in and plotting features for the ‘GBT’ and ‘OHGBT’ folders in this directory. The code parses the contents of these folders and generates the feature comparison metrics used for Figure 9 and the associated Figure S8. Some assembly required.
      • Requirements.txt - all of the packages used in the generation of this paper.
      • 0045.perovskitedata.csv - the main dataset described throughout the article. This file is required to run some of the code and is therefore kept near the code.

    • “ML_Logs.zip”, which contains a folder describing every model generated for this article. In each folder there are a number of files:
      • Features_named_important.csv and features_value_importance.csv - these files are linked together and describe the weighted feature contributions from features (only present for GBT models).
      • AnalysisLog.txt - log file of the run including all options, data curation, and model training summaries.
      • LeaveOneOut_Summary.csv - results of the leave-one-reactant-set-out studies on the model (if performed).
      • LOOModelInfo.txt - hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
      • STTSModelInfo.txt - hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
      • StandardTestTrain_Summary.csv - results of the 6-fold cross-validation ML performance (for the hold-out case).
      • LeaveOneOut_FullDataset_ByAmine.csv - results of the leave-one-reactant-set-out studies performed on the full dataset (all experiments), specified by reactant set (delineated by the amine).
      • LeaveOneOut_StratifiedData_ByAmine.csv - results of the leave-one-reactant-set-out studies performed on a random stratified sample (96 random experiments), specified by reactant set (delineated by the amine).
      • model_matchup_main_*.py - code used to generate all of the runs contained in a particular folder. The code is exactly what was used at run time to generate a given dataset (requires the 0045.perovskitedata.csv file to run).
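
    For orientation, the main CSV can be inspected with pandas before diving into the ML pipeline. A minimal sketch; no column names are assumed, since they are not listed in this description:

    ```python
    import pandas as pd

    # Minimal sketch: load the main perovskite dataset and get a quick overview.
    # Only the file name is taken from the listing above.
    df = pd.read_csv("0045.perovskitedata.csv")

    print(df.shape)
    print(df.head())
    ```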

  7. Data from: Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Jeroen Kalkman (2024). Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7870794
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    George-Othon Glentis
    Jeroen Kalkman
    Jos de Wit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the code and data underlying the publication "Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source" in Biomedical Optics Express 14, 3532-3554 (2023) (doi.org/10.1364/BOE.487345).

    The reader is free to use the scripts and data in this repository, as long as the manuscript is correctly cited in their work. For further questions, please contact the corresponding author.

    Description of the code and datasets

    Table 1 describes all the Matlab and Python scripts in this repository. Table 2 describes the datasets. The input datasets are the phase-corrected datasets, as the raw data is large in size and phase correction using a coverslip as reference is rather straightforward. Processed datasets are also added to the repository to allow running only a limited number of scripts, or to obtain, for example, the aberration-corrected data without the need to use Python. Note that the simulation input data (input_simulations_pointscatters_SLDshape_98zf_noise75.mat) is generated with random noise, so if this file is overwritten the results may vary slightly. Also, the aberration correction is done with random apertures, so the processed aberration-corrected data (exp_pointscat_image_MIAA_ISAM_CAO.mat and exp_leaf_image_MIAA_ISAM_CAO.mat) will also change slightly if the aberration correction script is run anew. The current processed datasets are used as the basis for the figures in the publication. For details on the implementation we refer to the publication.

    Table 1: The Matlab and Python scripts with their descriptions

    • MIAA_ISAM_processing.m: This script performs the DFT, RFIAA, and MIAA processing of the phase-corrected data that can be loaded from the datasets. Afterwards it applies ISAM to the DFT and MIAA data and plots the results in a figure (via the scripts plot_figure3, plot_figure5, and plot_simulationdatafigure).
    • resolution_analysis_figure4.m: This script loads the data from the point scatterers (absolute amplitude data), locates the point scatterers, and fits them to obtain the resolution data. Finally it plots figure 4 of the publication.
    • fiaa_oct_c1.m, oct_iaa_c1.m, rec_fiaa_oct_c1.m, rfiaa_oct_c1.m: These four functions are used to apply fast IAA and MIAA. See the script MIAA_ISAM_processing.m for their usage.
    • viridis.m, morgenstemning.m: These scripts define the colormaps for the figures.
    • plot_figure3.m, plot_figure5.m, plot_simulationdatafigure.m: These scripts are used to plot figures 3 and 5 and a figure with simulation data. They are executed at the end of the script MIAA_ISAM_processing.m.
    • computational_adaptive_optics_script.py: Python script that applies computational adaptive optics to obtain the data for figure 6 of the manuscript.
    • zernike_functions2.py: Python script that gives the values and Cartesian derivatives of the Zernike polynomials.
    • figure6_ComputationalAdaptiveOptics.m: Script that loads the CAO data that was saved in Python, analyzes the resolution, and plots figure 6.
    • OCTsimulations_3D_script2.py: Python script that simulates OCT data, adds noise, and saves it as a .mat file for use in the Matlab script above.
    • OCTsimulations2.py: Module that contains a Python class that can be used to simulate 3D OCT datasets based on a Gaussian beam.
    • Matlab toolbox DIPimage 2.9.zip: DIPimage is used in the scripts. The toolbox can be downloaded online or this zip can be used.
    
    Table 2: The datasets in this Zenodo repository

    • input_leafdisc_phasecorrected.mat: Phase-corrected input image of the leaf disc (used in figure 5).
    • input_TiO2gelatin_004_phasecorrected.mat: Phase-corrected input image of the TiO2 in gelatin sample.
    • input_simulations_pointscatters_SLDshape_98zf_noise75: Input simulation data that, once processed, is used in figure 4.
    • exp_pointscat_image_DFT.mat, exp_pointscat_image_DFT_ISAM.mat, exp_pointscat_image_RFIAA.mat, exp_pointscat_image_MIAA_ISAM.mat, exp_pointscat_image_MIAA_ISAM_CAO.mat: Processed experimental amplitude data for the TiO2 point-scattering sample with, respectively, DFT, DFT+ISAM, RFIAA, MIAA+ISAM, and MIAA+ISAM+CAO. These datasets are used for fitting in figure 4 (except for CAO); MIAA_ISAM and MIAA_ISAM_CAO are used for figure 6.
    • simu_pointscat_image_DFT.mat, simu_pointscat_image_RFIAA.mat, simu_pointscat_image_DFT_ISAM.mat, simu_pointscat_image_MIAA_ISAM.mat: Processed amplitude data from the simulation dataset, used in the script for figure 4 for the resolution analysis.
    • exp_leaf_image_MIAA_ISAM.mat, exp_leaf_image_MIAA_ISAM_CAO.mat: Processed amplitude data from the leaf sample, with and without aberration correction, used to produce figure 6.
    • exp_leaf_zernike_coefficients_CAO_normal_wmaf.mat, exp_pointscat_zernike_coefficients_CAO_normal_wmaf.mat: Estimated Zernike coefficients and their weighted moving average, used for the computational aberration correction. Some of this data is plotted in figure 6 of the manuscript.
    • input_zernike_modes.mat: The reference Zernike modes corresponding to the data that is loaded, to give the modes their proper names.
    • exp_pointscat_MIAA_ISAM_complex.mat, exp_leaf_MIAA_ISAM_complex: Complex MIAA+ISAM processed data that is used as input for the computational aberration correction.
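
    The processed .mat files can also be inspected from Python. A minimal sketch using scipy; the variable names inside the files are not documented here, so the sketch just lists them (if the files were saved in MATLAB v7.3 format, h5py would be needed instead of scipy):

    ```python
    from scipy.io import loadmat

    # Minimal sketch: open one processed amplitude dataset and list its variables.
    data = loadmat("exp_pointscat_image_MIAA_ISAM.mat")

    for key, value in data.items():
        if not key.startswith("__"):  # skip the MATLAB header entries
            print(key, getattr(value, "shape", type(value)))
    ```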
    
  8. Data from: Data and Results for GIS-Based Identification of Areas that have Resource Potential for Lode Gold in Alaska

    • catalog.data.gov
    • datasets.ai
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data and Results for GIS-Based Identification of Areas that have Resource Potential for Lode Gold in Alaska [Dataset]. https://catalog.data.gov/dataset/data-and-results-for-gis-based-identification-of-areas-that-have-resource-potential-for-lo
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release contains the analytical results and evaluated source data files of geospatial analyses for identifying areas in Alaska that may be prospective for different types of lode gold deposits, including orogenic, reduced-intrusion-related, epithermal, and gold-bearing porphyry. The spatial analysis is based on queries of statewide source datasets of aeromagnetic surveys, the Alaska Geochemical Database (AGDB3), the Alaska Resource Data File (ARDF), and the Alaska Geologic Map (SIM3340) within areas defined by 12-digit HUCs (subwatersheds) from the National Watershed Boundary dataset. The packages of files available for download are:

    1. LodeGold_Results_gdb.zip - The analytical results in geodatabase polygon feature classes, which contain the scores for each source dataset layer query, the cumulative score, and a designation of high, medium, or low potential and high, medium, or low certainty for a deposit type within the HUC. The data is described by FGDC metadata. An mxd file and cartographic feature classes are provided for display of the results in ArcMap. An included README file describes the complete contents of the zip file.
    2. LodeGold_Results_shape.zip - Copies of the results from the geodatabase, also provided in shapefile and CSV formats. The included README file describes the complete contents of the zip file.
    3. LodeGold_SourceData_gdb.zip - The source datasets in geodatabase and GeoTIFF format. Data layers include aeromagnetic surveys, AGDB3, ARDF, lithology from SIM3340, and HUC subwatersheds. The data is described by FGDC metadata. An mxd file and cartographic feature classes are provided for display of the source data in ArcMap. Also included are the Python scripts used to perform the analyses; users may modify the scripts to design their own analyses. The included README files describe the complete contents of the zip file and explain the usage of the scripts.
    4. LodeGold_SourceData_shape.zip - Copies of the geodatabase source dataset derivatives from ARDF and lithology from SIM3340 created for this analysis, also provided in shapefile and CSV formats. The included README file describes the complete contents of the zip file.
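
    For users working outside ArcMap, the shapefile results can be read from Python. A hypothetical sketch, assuming geopandas is installed and that the archive contains a single shapefile (the inner file names are not listed in this description):

    ```python
    import geopandas as gpd  # assumption: geopandas/fiona available

    # Hypothetical sketch: read the results straight from the downloaded zip.
    # If the archive holds several shapefiles, the inner path must be given,
    # e.g. "zip://LodeGold_Results_shape.zip!<layer>.shp".
    results = gpd.read_file("LodeGold_Results_shape.zip")

    print(results.crs)
    print(results.columns.tolist())
    ```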

  9. Python-HBRT model and groundwater levels used for estimating the static, shallow water table depth for the State of Wisconsin

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Python-HBRT model and groundwater levels used for estimating the static, shallow water table depth for the State of Wisconsin [Dataset]. https://catalog.data.gov/dataset/python-hbrt-model-and-groundwater-levels-used-for-estimating-the-static-shallow-water-tabl
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Wisconsin
    Description

    A histogram-based boosted regression tree (HBRT) method was used to predict the depth to the surficial aquifer water table (in feet) throughout the State of Wisconsin. This method used a combination of discrete groundwater levels from the U.S. Geological Survey National Water Information System, continuous groundwater levels from the National Groundwater Monitoring Network, the State of Wisconsin well-construction database, and NHDPlus version 2.1-derived points. The water table depth was predicted with the HBRT model available through scikit-learn in Python version 3.10.10. The HBRT model can predict the surficial water table depth for any latitude and longitude in Wisconsin. A total of 48 predictor variables were used for model development, including basic well characteristics, soil properties, aquifer properties, hydrologic position on the landscape, recharge and evapotranspiration rates, and bedrock characteristics. Model results indicate that the mean surficial water table depth across Wisconsin is 28.3 feet below land surface, with a root mean square error of 7.40 feet for the holdout data of the HBRT model.

    Aside from the overall HBRT methods contained in the Python script, this data release includes a self-contained model directory for recreating the HBRT model published in this data release. The model directory also includes a model object for the HBRT model used to predict the surficial aquifer water table depth (in feet) for the State of Wisconsin. Three separate directories within this data release define the input predictor variables, water levels, and NHD points for the HBRT model. The 'bedrock-overlay' sub-directory contains geospatial data that define the special selection zones used in the depth-to-water well selection (DTW_well_selection_zones.docx). The 'water-levels' sub-directory contains input files for the NHDPlus version 2.1 points, the State of Wisconsin well-construction spreadsheets, and water level summary files. The 'python-attributes' sub-directory contains predictor variable rasters and vector data that predict the surficial water table depth for Wisconsin, a Jupyter Notebook used for the attribution, and input files for the well and NHD points.
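
    For context on the modeling approach, the sketch below fits a histogram-based boosted regression tree with scikit-learn. The features and target are synthetic placeholders, not the 48 predictor variables or water levels from this data release:

    ```python
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Synthetic stand-in data; the real model uses the 48 documented predictors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=1000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Histogram-based gradient-boosted regression trees (the "HBRT" of the release).
    model = HistGradientBoostingRegressor(max_iter=200, learning_rate=0.1)
    model.fit(X_train, y_train)

    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"holdout RMSE: {rmse:.2f}")
    ```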

  10. fashion_mnist

    • tensorflow.org
    • opendatalab.com
    • +4more
    Updated Jun 1, 2024
    Cite
    (2024). fashion_mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/fashion_mnist
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('fashion_mnist', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png

  11. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; Daniel Sokolowski; David Spielmann; David Spielmann; Guido Salvaneschi; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; Daniel Sokolowski; David Spielmann; David Spielmann; Guido Salvaneschi; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
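
    A small sketch of how these metadata tables can be combined with pandas, using the documented column names; the exact file layout inside metadata.zip is an assumption:

    ```python
    import pandas as pd

    # Minimal sketch, assuming metadata.zip was extracted into the working directory.
    repositories = pd.read_csv("repositories.csv")
    programs = pd.read_csv("programs.csv")

    # Keep only repositories whose licenses permit redistribution
    # ("redistributable" is a documented boolean column).
    redistributable = repositories[repositories["redistributable"]]

    # Join IaC programs to their repositories via the documented ID columns.
    joined = programs.merge(
        redistributable, left_on="repository", right_on="ID",
        suffixes=("_program", "_repo"),
    )

    # Count programs per PL-IaC solution and language.
    print(joined.groupby(["solution", "language"]).size())
    ```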

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. Github access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in the CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.

  12. PhysioNet Challenge 2020 Dataset

    • paperswithcode.com
    Updated Dec 30, 2020
    Cite
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna (2020). PhysioNet Challenge 2020 Dataset [Dataset]. https://paperswithcode.com/dataset/physionet-challenge-2020
    Explore at:
    Dataset updated
    Dec 30, 2020
    Authors
    Erick A. Perez Alday; Annie Gu; Amit Shah; Chad Robichaux; An-Kwok Ian Wong; Chengyu Liu; Feifei Liu; Ali Bahrami Rad; Andoni Elola; Salman Seyedi; Qiao Li; ASHISH SHARMA; Gari D. Clifford; Matthew A. Reyna
    Description

    Data

    The data for this Challenge are from multiple sources:

    • CPSC Database and CPSC-Extra Database
    • INCART Database
    • PTB and PTB-XL Database
    • The Georgia 12-lead ECG Challenge (G12EC) Database
    • Undisclosed Database

    The first source is the public (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from the CPSC2018 is NOT the test data from the CPSC2018. The test data of the CPSC2018 is included in the final private database that has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) 12-ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.

    The second source set is the public dataset from St Petersburg INCART 12-lead Arrhythmia Database. This database consists of 74 annotated recordings extracted from 32 Holter records. Each record is 30 minutes long and contains 12 standard leads, each sampled at 257 Hz.

    The third source from the Physikalisch Technische Bundesanstalt (PTB) comprises two public databases: the PTB Diagnostic ECG Database and the PTB-XL, a large publicly available electrocardiography dataset. The first PTB database contains 516 records (male: 377, female: 139). Each recording was sampled at 1000 Hz. The PTB-XL contains 21,837 clinical 12-lead ECGs (male: 11,379 and female: 10,458) of 10 second length with a sampling frequency of 500 Hz.

    The fourth source is a Georgia database which represents a unique demographic of the Southeastern United States. This training set contains 10,344 12-lead ECGs (male: 5,551, female: 4,793) of 10 second length with a sampling frequency of 500 Hz.

    The fifth source is an undisclosed American database that is geographically distinct from the Georgia database. This source contains 10,000 ECGs (all retained as test data).

    All data is provided in WFDB format. Each ECG recording has a binary MATLAB v4 file (see page 27) for the ECG signal data and a text file in WFDB header format describing the recording and patient attributes, including the diagnosis (the labels for the recording). The binary files can be read using the load function in MATLAB and the scipy.io.loadmat function in Python; please see our baseline models for examples of loading the data. The first line of the header provides information about the total number of leads and the total number of samples or points per lead. The following lines describe how each lead was saved, and the last lines provide information on demographics and diagnosis. Below is an example header file A0001.hea:

    A0001 12 500 7500 05-Feb-2020 11:39:16
    A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
    A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
    A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
    A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
    A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
    A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
    A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
    A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
    A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
    A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
    A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
    A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
    
    Age: 74
    Sex: Male
    Dx: 426783006
    Rx: Unknown
    Hx: Unknown
    Sx: Unknown
    

    From the first line, we see that the recording number is A0001, and the recording file is A0001.mat. The recording has 12 leads, each recorded at 500 Hz sample frequency, and contains 7500 samples. From the next 12 lines, we see that each signal was written at 16 bits with an offset of 24 bits, the amplitude resolution is 1000 with units in mV, the resolution of the analog-to-digital converter (ADC) used to digitize the signal is 16 bits, and the baseline value corresponding to 0 physical units is 0. The first value of the signal, the checksum, and the lead name are included for each signal. From the final 6 lines, we see that the patient is a 74-year-old male with a diagnosis (Dx) of 426783006. The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.
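
    Following the loadmat hint above, here is a minimal sketch for reading one recording and its header from Python; the "val" key inside the MATLAB file is an assumption commonly used in these exports, so verify it against the Challenge baseline models:

    ```python
    from scipy.io import loadmat

    record = "A0001"

    # Signal data: stored in a binary MATLAB .mat file.
    mat = loadmat(f"{record}.mat")
    signal = mat["val"]  # assumed key; check the baseline code if this differs
    print("signal shape (leads x samples):", signal.shape)

    # Header: plain text in WFDB header format.
    with open(f"{record}.hea") as fh:
        first_line = fh.readline().split()

    name, n_leads, fs, n_samples = first_line[:4]
    print(f"{name}: {n_leads} leads at {fs} Hz, {n_samples} samples per lead")
    ```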

    Each ECG recording has one or more labels describing different types of abnormalities in SNOMED-CT codes. The full list of diagnoses for the challenge has been posted here as a 3-column CSV file: long-form description, corresponding SNOMED-CT code, abbreviation. Although these descriptions apply to all training data, there may be fewer classes in the test data, and in different proportions. However, every class in the test data will be represented in the training data.

  13. HUN GW Model code v01

    • researchdata.edu.au
    • data.gov.au
    • +2more
    Updated Jun 1, 2018
    + more versions
    Cite
    Bioregional Assessment Program (2018). HUN GW Model code v01 [Dataset]. https://researchdata.edu.au/hun-gw-model-code-v01/2993659
    Explore at:
    Dataset updated
    Jun 1, 2018
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme without the use of source datasets. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    Computer code and templates used to create the Hunter groundwater model.

    Broadly speaking, there are two types of files: those in templates_and_inputs that are template files used by the code; and everything else, which is the computer code itself.

    An example of a type of file in templates_and_inputs are all the uaXXXX.txt, which describe the parameters used in uncertainty analysis XXXX.

    Much of the computer code is in the form of Python scripts, and most of these are run using either preprocess.py or postprocess.py (using subprocess.call). Each of the Python scripts employs optparse, and so is largely self-documenting. Each of the Python scripts also requires an index file as an input: an XML file that contains all metadata associated with the model-building process, so that the scripts can discover where the raw data needed to build the model is located. The HUN GW Model v01 contains the index file (index.xml) used to build the Hunter groundwater model.
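
    As an illustration of the pattern described above (not the actual HUN scripts), a minimal optparse-based script that takes an index XML file as input might look like this:

    ```python
    # Hypothetical illustration only: an optparse command-line interface that
    # reads an index XML file to discover where input data lives.
    import optparse
    import xml.etree.ElementTree as ET

    def main():
        parser = optparse.OptionParser(usage="usage: %prog [options] index.xml")
        parser.add_option("-v", "--verbose", action="store_true", default=False,
                          help="print the entries found in the index file")
        options, args = parser.parse_args()
        if len(args) != 1:
            parser.error("exactly one index XML file is required")

        tree = ET.parse(args[0])
        for element in tree.getroot():
            if options.verbose:
                print(element.tag, element.attrib)

    if __name__ == "__main__":
        main()
    ```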

    Finally, the "code" directory contains a snapshot of the MOOSE C++ code used to run the model.

    Dataset History

    Computer code and templates were written by hand.

    Dataset Citation

    Bioregional Assessment Programme (2016) HUN GW Model code v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/e54a1246-0076-4799-9ecf-6d673cf5b1da.

  14. clinc_oos

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). clinc_oos [Dataset]. https://www.tensorflow.org/datasets/catalog/clinc_oos
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope (OOS), i.e., queries that do not fall into any of the system's supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. It offers a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('clinc_oos', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  15. Architectural Design Decisions for the Machine Learning Workflow: Dataset and Code

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 16, 2024
    + more versions
    Cite
    Stephen John Warnett; Stephen John Warnett; Uwe Zdun; Uwe Zdun (2024). Architectural Design Decisions for the Machine Learning Workflow: Dataset and Code [Dataset]. http://doi.org/10.5281/zenodo.5730291
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephen John Warnett; Stephen John Warnett; Uwe Zdun; Uwe Zdun
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    Title: Architectural Design Decisions for the Machine Learning Workflow: Dataset and Code

    Authors: Stephen John Warnett; Uwe Zdun

    About: This is the dataset and code artifact for the article entitled "Architectural Design Decisions for the Machine Learning Workflow".

    Contents: The "_generated" directory contains the generated results, including latex files with tables for use in publications and the Architectural Design Decision model in textual and graphical form. "Generators" contains Python applications that can be run to generate the above. "Metamodels" contains a Python file with type definitions. "Sources_coding" contains our source codings and audit trail. "Add_models" contains the Python implementation of our model and source codings. Finally, "appendix" contains a detailed description of our research method.

    Article Abstract: Bringing machine learning models to production is challenging as it is often fraught with uncertainty and confusion, partially due to the disparity between software engineering and machine learning practices, but also due to knowledge gaps on the level of the individual practitioner. We conducted a qualitative investigation into the architectural decisions faced by practitioners as documented in gray literature based on Straussian Grounded Theory and modeled current practices in machine learning. Our novel Architectural Design Decision model is based on current practitioner understanding of the topic and helps bridge the gap between science and practice, foster scientific understanding of the subject, and support practitioners via the integration and consolidation of the myriad decisions they face. We describe a subset of the Architectural Design Decisions that were modeled, discuss uses for the model, and outline areas in which further research may be pursued.

    Objective: This article aims to study current practitioner understanding of architectural concepts associated with data processing, model building, and Automated Machine Learning (AutoML) within the context of the machine learning workflow.

    Method: Applying Straussian Grounded Theory to gray literature sources containing practitioner views on machine learning practices, we studied methods and techniques currently applied by practitioners in the context of machine learning solution development and gained valuable insights into the software engineering and architectural state of the art as applied to ML.

    Results: Our study resulted in a model of Architectural Design Decisions, practitioner practices, and decision drivers in the field of software engineering and software architecture for machine learning.

    Conclusions: The resulting Architectural Design Decisions model can help researchers better understand practitioners' needs and the challenges they face, and guide their decisions based on existing practices. The study also opens new avenues for further research in the field, and the design guidance provided by our model can also help reduce design effort and risk. In future work, we plan on using our findings to provide automated design advice to machine learning engineers.

  16. Protected Areas Database of the United States (PAD-US) 3.0 Vector Analysis and Summary Statistics

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Protected Areas Database of the United States (PAD-US) 3.0 Vector Analysis and Summary Statistics [Dataset]. https://catalog.data.gov/dataset/protected-areas-database-of-the-united-states-pad-us-3-0-vector-analysis-and-summary-stati
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and supporting user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order of Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee owned lands loaded before overlapping management designations, and easements).

    The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip"), an associated item of PAD-US 3.0 Spatial Analysis and Statistics (https://doi.org/10.5066/P9KLBB5D), was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries. Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip"). Comma-separated Value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format and enable users to explore and download summary statistics of interest (Comma-separated Table [CSV], Microsoft Excel Workbook [.XLSX], Portable Document Format [.PDF] Report) from the PAD-US Lands and Inland Water Statistics Dashboard (https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allows for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D).

    Note, the PAD-US inventory is now considered functionally complete, with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/). In addition, changes in protected area status between versions of the PAD-US may be attributed to improving the completeness and accuracy of the spatial data more than actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data. While PAD-US is the official aggregation of protected areas (https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html), agencies are the best source of their lands data.

  17. R

    Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    # Render the Roboflow dataset README inside the notebook (pass the path via filename=)
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into train/validation/test sets. To train a custom YOLOv7 model, we need to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    Image (sample detection output, car-person-2.PNG): https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading the YOLOv7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    Image (W&B logging illustration): https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except Exception:
      # Fall back to anonymous logging if no W&B secret is configured
      anonymous = 'must'
      wandb.login(anonymous='must')
      print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
         'Use the label name WANDB. Get your W&B access token from https://wandb.ai/authorize')
      
      
    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    Image (computer vision dataset workflow): https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    Image (Roboflow dataset version page): https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training a custom YOLOv7 model from pretrained weights

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine
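
    The listing truncates the argument walkthrough here. For orientation only, a typical training invocation, with placeholder values and flag names as documented in the WongKinYiu/yolov7 repository, might look roughly like this:

    # Illustrative sketch only: epochs, batch size, and run name are placeholders;
    # {dataset.location} is the folder downloaded by the Roboflow cell above.
    !python train.py --workers 8 --device 0 --batch-size 16 --data {dataset.location}/data.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights 'yolov7.pt' --hyp data/hyp.scratch.custom.yaml --epochs 55 --name yolov7-car-person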

  18. n

    Data from: PyProcar: A Python library for electronic structure...

    • narcis.nl
    • data.mendeley.com
    Updated Dec 18, 2019
    Cite
    Herath, U (via Mendeley Data) (2019). PyProcar: A Python library for electronic structure pre/post-processing [Dataset]. http://doi.org/10.17632/d4rrfy3dy4.1
    Explore at:
    Dataset updated
    Dec 18, 2019
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Herath, U (via Mendeley Data)
    Description

    The PyProcar Python package plots the band structure and the Fermi surface as a function of site- and/or s,p,d,f-projected wavefunctions obtained for each k-point in the Brillouin zone and each band in an electronic structure calculation. This can be performed on top of any electronic structure code, as long as the band and projection information is written in the PROCAR format, as done by the VASP and ABINIT codes. PyProcar can be easily modified to read other formats as well. This package is particularly suitable for understanding atomic effects on the band structure, Fermi surface, spin texture, etc. PyProcar can be conveniently used in a command-line mode, where each parameter defines a plot property. In the case of Fermi surfaces, the package is able to plot the surface with colors depending on other properties such as the electron velocity or spin projection. The mesh used to calculate the property does not need to be the same as the one used to obtain the Fermi surface. A file with a specific property evaluated for each k-point in a k-mesh and for each band can be used to project other properties such as the electron-phonon mean free path, Fermi velocity, electron effective mass, etc. Another existing feature refers to the band unfolding of supercell calculations into predefined unit cells.
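
    As a quick illustration, the calls below sketch how the command-line-style Python interface is driven. This is a hedged sketch based on the PyProcar 4.x/5.x interface; the exact function arguments may differ between releases, so check the documentation of the version you install.

    # Minimal sketch: assumes VASP PROCAR/OUTCAR files in the working directory.
    import pyprocar

    # Plain band structure within a +/- 5 eV window around the Fermi level
    pyprocar.bandsplot('PROCAR', outcar='OUTCAR', elimit=[-5, 5], mode='plain')

    # Orbital-projected ("parametric") bands, e.g. d orbitals of atoms 0 and 1
    pyprocar.bandsplot('PROCAR', outcar='OUTCAR', elimit=[-5, 5],
              mode='parametric', atoms=[0, 1], orbitals=[4, 5, 6, 7, 8])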

  19. SELTO Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 23, 2023
    + more versions
    Cite
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7034899
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.

    One can find a description of the provided dataset partitions in Section 3 of Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.


    Every dataset container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and a corresponding binarized SIMP solution. Every file of the form {i}.csv contains all voxel-wise information about the sample i. Every file of the form {i}_info.csv contains scalar parameters of the topology optimization problem, such as material parameters.


    This dataset represents topology optimization problems and solutions on the basis of voxels. We define all spatially varying quantities via the voxels' centers -- rather than via the vertices or surfaces of the voxels.
    In {i}.csv files, each row corresponds to one voxel in the design space. The columns correspond to ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density'].

    • x, y, z - These are three integer indices stating the index/location of the voxel within the voxel mesh.
    • design_space - This is one ternary variable indicating the type of material density constraint on the voxel within the TO problem formulation. "0" and "1" indicate a material density fixed at 0 or 1, respectively. "-1" indicates the absence of constraints.
    • dirichlet_x, dirichlet_y, dirichlet_z - These are three binary variables defining whether the voxel contains homogeneous Dirichlet constraints in the respective axis direction.
    • force_x, force_y, force_z - These are three floating point variables giving the three spatial components of the forces applied to each voxel. All forces are body forces given in [N/m^3].
    • density - This is a binary variable stating whether the voxel carries material in the solution of the topology optimization problem.

    Any of these files with the index i can be imported using pandas by executing:

    import pandas as pd
    
    directory = ...
    file_path = f'{directory}/{i}.csv'
    column_names = ['x', 'y', 'z', 'design_space','dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density']
    data = pd.read_csv(file_path, names=column_names)

    From this pandas dataframe one can extract the torch tensors of forces F, Dirichlet conditions ω_Dirichlet, and design space information ω_design using the following functions:

    import torch
    
    def get_shape_and_voxels(data):
      shape = data[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
      vox_x = data['x'].values
      vox_y = data['y'].values
      vox_z = data['z'].values
      voxels = [vox_x, vox_y, vox_z]
      return shape, voxels
    
    
    def get_forces_boundary_conditions_and_design_space(data, shape, voxels):
      F = torch.zeros(3, *shape, dtype=torch.float32)
      F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_x'].values, dtype=torch.float32)
      F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_y'].values, dtype=torch.float32)
      F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_z'].values, dtype=torch.float32)
    
      ω_Dirichlet = torch.zeros(3, *shape, dtype=torch.float32)
      ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_x'].values, dtype=torch.float32)
      ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_y'].values, dtype=torch.float32)
      ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_z'].values, dtype=torch.float32)
    
      ω_design = torch.zeros(1, *shape, dtype=int)
      ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['design_space'].values.astype(int))
      return F, ω_Dirichlet, ω_design
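
    Putting these helpers together, a minimal end-to-end load of one sample might look like the following sketch; the directory path and sample index are placeholders.

    # Hypothetical usage of the snippets above for a single sample index i.
    i = 0
    directory = '/path/to/SELTO_dataset'  # placeholder

    data = pd.read_csv(f'{directory}/{i}.csv', names=column_names)
    shape, voxels = get_shape_and_voxels(data)
    F, ω_Dirichlet, ω_design = get_forces_boundary_conditions_and_design_space(data, shape, voxels)

    # The ground-truth density solution as a (1, nx, ny, nz) tensor
    density = torch.zeros(1, *shape, dtype=torch.float32)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['density'].values, dtype=torch.float32)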

    The corresponding {i}_info.csv files only have one row with column labels ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z'].

    • E - Young's modulus [Pa]
    • ν - Poisson's ratio [-]
    • σ_ys - Yield stress [Pa]
    • vox_size - Length of the edge of a (cube-shaped) voxel [m]
    • p_x, p_y, p_z - Location of the root of the design space [m]

    Analogously to above, one can import any {i}_info.csv file by executing:

    file_path = f'{directory}/{i}_info.csv'
    data_info_column_names = ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z']
    data_info = pd.read_csv(file_path, names=data_info_column_names)
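
    The individual scalars can then be read off that single row, for example:

    # Material and geometry parameters of sample i
    E = float(data_info['E'].iloc[0])                # Young's modulus [Pa]
    ν = float(data_info['ν'].iloc[0])                # Poisson's ratio [-]
    vox_size = float(data_info['vox_size'].iloc[0])  # voxel edge length [m]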

  20. AIMO-24: Model (openai-community/gpt2-large)

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    Dinh Thoai Tran @ randrise.com (2024). AIMO-24: Model (openai-community/gpt2-large) [Dataset]. https://www.kaggle.com/datasets/dinhttrandrise/aimo-24-model-openai-community-gpt2-large
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Apr 7, 2024
    Authors
    Dinh Thoai Tran @ randrise.com
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    language: en

    license: mit

    GPT-2 Large

    Table of Contents

    Model Details

    Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. It is pretrained on English text using a causal language modeling (CLM) objective.

    How to Get Started with the Model

    Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

    >>> from transformers import pipeline, set_seed
    >>> generator = pipeline('text-generation', model='gpt2-large')
    >>> set_seed(42)
    >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    
    [{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
     {'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
     {'generated_text': "Hello, I'm a language model, why does this matter for you?
    
    When I hear new languages, I tend to start thinking in terms"},
     {'generated_text': "Hello, I'm a language model, a functional language...
    
    I don't need to know anything else. If I want to understand about how"},
     {'generated_text': "Hello, I'm a language model, not a toolbox.
    
    In a nutshell, a language model is a set of attributes that define how"}]
    

    Here is how to use this model to get the features of a given text in PyTorch:

    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = GPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    

    and in TensorFlow:

    from transformers import GPT2Tokenizer, TFGPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
    model = TFGPT2Model.from_pretrained('gpt2-large')
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    

    Uses

    Direct Use

    In their model card about GPT-2, OpenAI wrote:

    The primary intended users of these models are AI researchers and practitioners.

    We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.

    Downstream Use

    In their model card about GPT-2, OpenAI wrote:

    Here are some secondary use cases we believe are likely:

    • Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
    • Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
    • Entertainment: Creation of games, chat bots, and amusing generations.

    Misuse and Out-of-scope Use

    In their model card about GPT-2, OpenAI wrote:

    Because large-scale language models like GPT-2 ...
