polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The file names start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd

ddf = dd.read_parquet("*.parquet", engine="pyarrow")

# For example, compute summary statistics of the data set
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
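A minimal sketch for reading these files (one PSMILES string per line):

```python
# Read the training PSMILES strings, one per line.
with open("generated_polymer_smiles_train.txt") as f:
    psmiles = [line.strip() for line in f]

print(len(psmiles), "PSMILES strings")
print(psmiles[:3])
```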
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.
How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.
How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation)_2023.08.22-2023.08.28.csv
│   ├── contributor(citation)_2023.08.22-2023.08.28.csv
│   ├── data_source(citation)_2023.08.22-2023.08.28.csv
│   ├── ...
│   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2023.08.27_12.59.59.zip
│   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
│   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
│   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   ├── ...
│   ├── metadatablocks_v5.6
│   ├── astrophysics_v5.6.json
│   ├── biomedical_v5.6.json
│   ├── citation_v5.6.json
│   ├── ...
│   ├── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
│   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
│   ├── Arca_Dados_2023.08.27_13.34.09.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
├── dataverse_installations_summary_2023.08.28.csv
├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
├── license_options_for_each_dataverse_installation_2023.09.05.csv
└── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv
This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for each author name, affiliation, identifier type and identifier.
The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
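For reference, the two-column API-token CSV described above can be prepared with a few lines of Python; the file name and the example row are placeholders, while the column names "hostname" and "apikey" come from the description:

import csv

# One row per installation that requires an API token (values are placeholders).
rows = [
    {"hostname": "https://demo.dataverse.org", "apikey": "00000000-0000-0000-0000-000000000000"},
]

with open("installation_api_tokens.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["hostname", "apikey"])
    writer.writeheader()
    writer.writerows(rows)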
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference
Studies that use the data (in any form) are required to include the following reference:
@inproceedings{Orru2015,
  abstract = {The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.},
  author = {Orrú, Matteo and Tempero, Ewan and Marchesi, Michele and Tonelli, Roberto and Destefanis, Giuseppe},
  booktitle = {Submitted to PROMISE '15},
  keywords = {Python, Empirical Studies, Curated Code Collection},
  title = {A Curated Benchmark Collection of Python Systems for Empirical Studies on Software Engineering},
  year = {2015}
}
About the Data
Overview
This paper presents a dataset of metrics taken from a curated collection of 51 popular Python software systems.
The dataset reports 41 metrics of different categories: volume/size, complexity, and object-oriented metrics. These metrics are computed both at file and class level. We provide metrics for every file and class of each system, as well as global metrics computed on the entire system. Moreover, we provide 14 metadata items for each system.
Paper Abstract
The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.
License: other (https://choosealicense.com/licenses/other/)
Dataset Card for The Stack
Changelog
Release Description
v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.
v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
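For reference, the dataset can be loaded from the Hugging Face Hub with the datasets library; this is a minimal sketch, and the data_dir value used to select a single-language subset is an assumption:

from datasets import load_dataset

# Stream a single-language slice instead of downloading the full multi-terabyte dataset.
ds = load_dataset("bigcode/the-stack", data_dir="data/python", split="train", streaming=True)
for sample in ds.take(1):
    print(sample.keys())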
Please see README at https://github.com/yardencsGitHub/tweetynet/blob/master/article/README.md that explains how to install the software.
To replicate the key result as described in the section of the article "TweetyNet annotates with low error rates across individuals and species", please follow the set-up instructions in that README, and then run this script:
https://github.com/yardencsGitHub/tweetynet/blob/master/article/src/scripts/replicate-key-result/runall.sh
For guidance on how to adapt pre-trained models to new datasets, please see the vak documentation: https://vak.readthedocs.io/en/latest/
Included in this content:
0045.perovskitedata.csv - main dataset used in this article. A more detailed description can be found in the “dataset overview” section below.
Chemical Inventory.csv - the hand-curated file of all chemicals used in the construction of the perovskite dataset. This file includes identifiers, chemical properties, and other information.
ExcessMolarVolumeData.xlsx - record of experimental data, computations, and final dataset used in the generation of the excess molar volume plots.
MLModelMetrics.xlsx - all of the ML metrics organized in one place (excludes the reactant-set-specific breakdown; see ML_Logs.zip for those files).
OrganoammoniumDensityDataset.xlsx - complete set of the data used to generate the density values. Example calculations included.
model_matchup_main.py - python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the “ML Code” section below. This file is also hosted on GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/model_matchup_main_20191231.py
SolutionVolumeDataset - complete set of 219 solutions in the perovskite dataset. Tabs include the automatically generated reagent information from ESCALATE, hand curated reagent information from early runs, and the generation of the dataset used in the creation of Figure 5.
error_auditing.zip - code and historical datasets used for reporting the dataset auditing.
“AllCode.zip” which contains:
model_matchup_main_20191231.py - python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the “ML Code” section below. This file is also hosted on GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/model_matchup_main_20191231.py
VmE_CurveFitandPlot.py - python code for generating the third order polynomial fit to the VmE vs mole fraction of FAH included in the main text. Requires the ‘MolFractionResults.csv’ to function (also included).
Calculation_Vm_Ve_CURVEFITTING.nb - mathematica code for generating the third order polynomial fit to the VmE vs mole fraction of FAH included in the main text.
Covariance_Analysis.py - python code for ingesting and plotting the covariance of features and volumes in the perovskite dataset. Includes renaming dictionaries used for the publication.
FeatureComparison_Plotting.py - python code for reading in and plotting features for the ‘GBT’ and ‘OHGBT’ folders in this directory. The code parses the contents of these folders and generates feature comparison metrics used for Figure 9 and the associated Figure S8. Some assembly required.
Requirements.txt - all of the packages used in the generation of this paper
0045.perovskitedata.csv - the main dataset described throughout the article. This file is required to run some of the code and is therefore kept near the code.
“ML_Logs.zip” which contains:
A folder describing every model generated for this article. In each folder there are a number of files:
Features_named_important.csv and features_value_importance.csv - these files are linked together and describe the weighted contributions of the features (only present for GBT models)
AnalysisLog.txt - Log file of the run including all options, data curation and model training summaries
LeaveOneOut_Summary.csv - Results of the leave-one-reactant set-out studies on the model (if performed)
LOOModelInfo.txt - Hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
STTSModelInfo.txt - Hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
StandardTestTrain_Summary.csv - Results of the 6-fold cross-validation ML performance (for the hold-out case)
LeaveOneOut_FullDataset_ByAmine.csv - Results of the leave-one-reactant set-out studies performed on the full dataset (all experiments) specified by reactant set (delineated by the amine)
LeaveOneOut_StratifiedData_ByAmine.csv - Results of the leave-one-reactant set-out studies performed on a random stratified sample (96 random experiments) specified by reactant set (delineated by the amine)
model_matchup_main_*.py - code used to generate all of the runs contained in a particular folder. The code is exactly what was used at run time to generate a given dataset (requires 0045.perovskitedata.csv file to run).
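One way to gather the per-model summary CSVs described above is sketched below; the folder layout is assumed to match the listing, and the columns are whatever each summary file contains:

import glob
import pandas as pd

# Collect every leave-one-out summary from the unpacked ML_Logs folders (paths are assumptions).
summaries = []
for path in glob.glob("ML_Logs/**/LeaveOneOut_Summary.csv", recursive=True):
    df = pd.read_csv(path)
    df["run_folder"] = path
    summaries.append(df)

all_runs = pd.concat(summaries, ignore_index=True)
print(all_runs.head())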
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the code and data underlying the publication "Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source" in Biomedical Optics Express 14, 3532-3554 (2023) (doi.org/10.1364/BOE.487345).
The reader is free to use the scripts and data in this repository, as long as the manuscript is correctly cited in their work. For further questions, please contact the corresponding author.
Description of the code and datasets
Table 1 describes all the Matlab and Python scripts in this repository. Table 2 describes the datasets. The input datasets are the phase-corrected datasets, as the raw data is large in size and phase correction using a coverslip as reference is rather straightforward. Processed datasets are also added to the repository to allow running only a limited number of scripts, or to obtain, for example, the aberration-corrected data without needing to use Python. Note that the simulation input data (input_simulations_pointscatters_SLDshape_98zf_noise75.mat) is generated with random noise, so if it is overwritten the results may vary slightly. Also, the aberration correction is done with random apertures, so the processed aberration-corrected data (exp_pointscat_image_MIAA_ISAM_CAO.mat and exp_leaf_image_MIAA_ISAM_CAO.mat) will also change slightly if the aberration correction script is run anew. The current processed datasets are used as the basis for the figures in the publication. For details on the implementation we refer to the publication.
Table 1: The Matlab and Python scripts with their descriptions
MIAA_ISAM_processing.m - This script performs the DFT, RFIAA and MIAA processing of the phase-corrected data that can be loaded from the datasets. Afterwards it also applies ISAM to the DFT and MIAA data and plots the results in a figure (via the scripts plot_figure3, plot_figure5 and plot_simulationdatafigure).
resolution_analysis_figure4.m - This script loads the data from the point scatterers (absolute amplitude data), locates the point scatterers and fits them to obtain the resolution data. Finally it plots figure 4 of the publication.
fiaa_oct_c1.m, oct_iaa_c1.m, rec_fiaa_oct_c1.m, rfiaa_oct_c1.m - These four functions are used to apply fast IAA and MIAA. See script MIAA_ISAM_processing.m for their usage.
viridis.m, morgenstemning.m - These scripts define the colormaps for the figures.
plot_figure3.m, plot_figure5.m, plot_simulationdatafigure.m - These scripts are used to plot figures 3 and 5 and a figure with simulation data. They are executed at the end of script MIAA_ISAM_processing.m.
computational_adaptive_optics_script.py - Python script that applies computational adaptive optics to obtain the data for figure 6 of the manuscript.
zernike_functions2.py - Python script that gives the values and Cartesian derivatives of the Zernike polynomials.
figure6_ComputationalAdaptiveOptics.m - Script that loads the CAO data that was saved in Python, analyzes the resolution, and plots figure 6.
OCTsimulations_3D_script2.py - Python script that simulates OCT data, adds noise and saves it as a .mat file for use in the Matlab script above.
OCTsimulations2.py - Module that contains a Python class that can be used to simulate 3D OCT datasets based on a Gaussian beam.
Matlab toolbox DIPimage 2.9.zip - DIPimage is used in the scripts. The toolbox can be downloaded online or this zip can be used.
Table 2: The datasets in this Zenodo repository
input_leafdisc_phasecorrected.mat - Phase-corrected input image of the leaf disc (used in figure 5).
input_TiO2gelatin_004_phasecorrected.mat - Phase-corrected input image of the TiO2-in-gelatin sample.
input_simulations_pointscatters_SLDshape_98zf_noise75.mat - Input simulation data that, once processed, is used in figure 4.
exp_pointscat_image_DFT.mat, exp_pointscat_image_DFT_ISAM.mat, exp_pointscat_image_RFIAA.mat, exp_pointscat_image_MIAA_ISAM.mat, exp_pointscat_image_MIAA_ISAM_CAO.mat - Processed experimental amplitude data for the TiO2 point-scattering sample with, respectively, DFT, DFT+ISAM, RFIAA, MIAA+ISAM and MIAA+ISAM+CAO. These datasets are used for fitting in figure 4 (except for CAO), and MIAA_ISAM and MIAA_ISAM_CAO are used for figure 6.
simu_pointscat_image_DFT.mat, simu_pointscat_image_RFIAA.mat, simu_pointscat_image_DFT_ISAM.mat, simu_pointscat_image_MIAA_ISAM.mat - Processed amplitude data from the simulation dataset, which is used in the script for figure 4 for the resolution analysis.
exp_leaf_image_MIAA_ISAM.mat, exp_leaf_image_MIAA_ISAM_CAO.mat - Processed amplitude data from the leaf sample, with and without aberration correction, which is used to produce figure 6.
exp_leaf_zernike_coefficients_CAO_normal_wmaf.mat, exp_pointscat_zernike_coefficients_CAO_normal_wmaf.mat - Estimated Zernike coefficients and their weighted moving average, used for the computational aberration correction. Some of this data is plotted in figure 6 of the manuscript.
input_zernike_modes.mat - The reference Zernike modes corresponding to the data that is loaded, to give the modes the proper names.
exp_pointscat_MIAA_ISAM_complex.mat, exp_leaf_MIAA_ISAM_complex.mat - Complex MIAA+ISAM processed data that is used as input for the computational aberration correction.
This data release contains the analytical results and evaluated source data files of geospatial analyses for identifying areas in Alaska that may be prospective for different types of lode gold deposits, including orogenic, reduced-intrusion-related, epithermal, and gold-bearing porphyry. The spatial analysis is based on queries of statewide source datasets of aeromagnetic surveys, Alaska Geochemical Database (AGDB3), Alaska Resource Data File (ARDF), and Alaska Geologic Map (SIM3340) within areas defined by 12-digit HUCs (subwatersheds) from the National Watershed Boundary dataset. The packages of files available for download are: 1. LodeGold_Results_gdb.zip - The analytical results in geodatabase polygon feature classes which contain the scores for each source dataset layer query, the accumulative score, and a designation for high, medium, or low potential and high, medium, or low certainty for a deposit type within the HUC. The data is described by FGDC metadata. An mxd file, and cartographic feature classes are provided for display of the results in ArcMap. An included README file describes the complete contents of the zip file. 2. LodeGold_Results_shape.zip - Copies of the results from the geodatabase are also provided in shapefile and CSV formats. The included README file describes the complete contents of the zip file. 3. LodeGold_SourceData_gdb.zip - The source datasets in geodatabase and geotiff format. Data layers include aeromagnetic surveys, AGDB3, ARDF, lithology from SIM3340, and HUC subwatersheds. The data is described by FGDC metadata. An mxd file and cartographic feature classes are provided for display of the source data in ArcMap. Also included are the python scripts used to perform the analyses. Users may modify the scripts to design their own analyses. The included README files describe the complete contents of the zip file and explain the usage of the scripts. 4. LodeGold_SourceData_shape.zip - Copies of the geodatabase source dataset derivatives from ARDF and lithology from SIM3340 created for this analysis are also provided in shapefile and CSV formats. The included README file describes the complete contents of the zip file.
A histogram-based boosted regression tree (HBRT) method was used to predict the depth to the surficial aquifer water table (in feet) throughout the State of Wisconsin. This method used a combination of discrete groundwater levels from the U.S. Geological Survey National Water Information System, continuous groundwater levels from the National Groundwater Monitoring Network, the State of Wisconsin well-construction database, and NHDPlus version 2.1-derived points. The water table depth was predicted with the HBRT model available through scikit-learn in Python version 3.10.10. The HBRT model can predict the surficial water table depth for any latitude and longitude in Wisconsin. A total of 48 predictor variables were used for model development, including basic well characteristics, soil properties, aquifer properties, hydrologic position on the landscape, recharge and evapotranspiration rates, and bedrock characteristics. Model results indicate that the mean surficial water table depth across Wisconsin is 28.3 feet below land surface, with a root mean square error of 7.40 feet for the holdout data to the HBRT model.
Aside from the overall HBRT methods contained as part of the Python script, this data release includes a self-contained model directory for recreating the HBRT model published in this data release. The model directory also includes a model object for the HBRT model used to predict the surficial aquifer water table depth (in feet) for the State of Wisconsin. Three separate directories are available within this data release that define the input predictor variables, water levels, and NHD points for the HBRT model. The 'bedrock-overlay' sub-directory contains geospatial data that define the special selection zones used in the depth-to-water well selection (DTW_well_selection_zones.docx). The 'water-levels' sub-directory contains input files for the NHDPlus version 2.1 points, the State of Wisconsin well construction spreadsheets, and water level summary files. The 'python-attributes' sub-directory contains predictor variable rasters and vector data that predict the surficial water table depth for Wisconsin and a Jupyter Notebook used for the attribution and input files for well and NHD points.
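As a loose illustration of the modeling approach (not the published configuration), scikit-learn's histogram-based gradient boosting regressor can be used as follows; the input file, column names, and hyperparameters are placeholders:

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical predictor table with a depth-to-water target column (names are placeholders).
data = pd.read_csv("predictor_variables_with_dtw.csv")
X = data.drop(columns=["dtw_ft"])
y = data["dtw_ft"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Histogram-based boosted regression trees, as provided by scikit-learn.
model = HistGradientBoostingRegressor(max_iter=500, learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)
print("holdout R^2:", model.score(X_test, y_test))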
Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
programs.csv:
testing-files.csv:
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
The script uses the GitHub code search API and inherits its limitations:
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
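The exact arguments of download-repositories.py are not reproduced here, but the shallow recursive copy described above corresponds to a git invocation like the following sketch (repository URL and target folder are placeholders):

import subprocess

# Shallow clone of the default branch's HEAD only, including submodules.
repo_url = "https://github.com/OWNER/REPO.git"  # placeholder
target_dir = "downloads/123456789"              # placeholder: named by repository ID
subprocess.run(
    ["git", "clone", "--depth", "1", "--recurse-submodules", "--shallow-submodules",
     repo_url, target_dir],
    check=True,
)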
Data
The data for this Challenge are from multiple sources:
CPSC Database and CPSC-Extra Database
INCART Database
PTB and PTB-XL Database
The Georgia 12-lead ECG Challenge (G12EC) Database
Undisclosed Database
The first source is the public (CPSC Database) and unused data (CPSC-Extra Database) from the China Physiological Signal Challenge in 2018 (CPSC2018), held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. The unused data from the CPSC2018 is NOT the test data from the CPSC2018. The test data of the CPSC2018 is included in the final private database that has been sequestered. This training set consists of two sets of 6,877 (male: 3,699; female: 3,178) and 3,453 (male: 1,843; female: 1,610) 12-lead ECG recordings lasting from 6 seconds to 60 seconds. Each recording was sampled at 500 Hz.
The second source set is the public dataset from St Petersburg INCART 12-lead Arrhythmia Database. This database consists of 74 annotated recordings extracted from 32 Holter records. Each record is 30 minutes long and contains 12 standard leads, each sampled at 257 Hz.
The third source from the Physikalisch Technische Bundesanstalt (PTB) comprises two public databases: the PTB Diagnostic ECG Database and the PTB-XL, a large publicly available electrocardiography dataset. The first PTB database contains 516 records (male: 377, female: 139). Each recording was sampled at 1000 Hz. The PTB-XL contains 21,837 clinical 12-lead ECGs (male: 11,379 and female: 10,458) of 10 second length with a sampling frequency of 500 Hz.
The fourth source is a Georgia database which represents a unique demographic of the Southeastern United States. This training set contains 10,344 12-lead ECGs (male: 5,551, female: 4,793) of 10 second length with a sampling frequency of 500 Hz.
The fifth source is an undisclosed American database that is geographically distinct from the Georgia database. This source contains 10,000 ECGs (all retained as test data).
All data is provided in WFDB format. Each ECG recording has a binary MATLAB v4 file (see page 27) for the ECG signal data and a text file in WFDB header format describing the recording and patient attributes, including the diagnosis (the labels for the recording). The binary files can be read using the load function in MATLAB and the scipy.io.loadmat function in Python; please see our baseline models for examples of loading the data. The first line of the header provides information about the total number of leads and the total number of samples or points per lead. The following lines describe how each lead was saved, and the last lines provide information on demographics and diagnosis. Below is an example header file A0001.hea:
A0001 12 500 7500 05-Feb-2020 11:39:16
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
Age: 74
Sex: Male
Dx: 426783006
Rx: Unknown
Hx: Unknown
Sx: Unknown
From the first line, we see that the recording number is A0001, and the recording file is A0001.mat. The recording has 12 leads, each recorded at 500 Hz sample frequency, and contains 7500 samples. From the next 12 lines, we see that each signal was written at 16 bits with an offset of 24 bits, the amplitude resolution is 1000 with units in mV, the resolution of the analog-to-digital converter (ADC) used to digitize the signal is 16 bits, and the baseline value corresponding to 0 physical units is 0. The first value of the signal, the checksum, and the lead name are included for each signal. From the final 6 lines, we see that the patient is a 74-year-old male with a diagnosis (Dx) of 426783006. The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.
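As a minimal sketch of reading one recording in Python (the 'val' key is the usual location of the signal matrix in these Challenge files, but treat it as an assumption and adjust if needed):

import numpy as np
from scipy.io import loadmat

# Load the binary MATLAB v4 signal file for recording A0001.
mat = loadmat("A0001.mat")
signals = np.asarray(mat["val"])  # assumed key; expected shape (leads, samples), e.g. (12, 7500)

# Parse the first line of the WFDB-style header: record name, leads, sampling frequency, samples.
with open("A0001.hea") as f:
    first_line = f.readline().split()
num_leads, fs, num_samples = int(first_line[1]), int(first_line[2]), int(first_line[3])
print(signals.shape, num_leads, fs, num_samples)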
Each ECG recording has one or more labels from different types of abnormalities in SNOMED-CT codes. The full list of diagnoses for the challenge has been posted here as a 3-column CSV file: long-form description, corresponding SNOMED-CT code, abbreviation. Although these descriptions apply to all training data, there may be fewer classes in the test data, and in different proportions. However, every class in the test data will be represented in the training data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme without the use of source datasets. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
Computer code and templates used to create the Hunter groundwater model.
Broadly speaking, there are two types of files: those in templates_and_inputs that are template files used by the code; and everything else, which is the computer code itself.
An example of the kind of file in templates_and_inputs is the set of uaXXXX.txt files, which describe the parameters used in uncertainty analysis XXXX.
Much of the computer code is in the form of python scripts, and most of these are run using either preprocess.py or postprocess.py (using subprocess.call). Each of the python scripts employs optparse, and so is largely self-documenting. Each of the python scripts also requires an index file as an input; this is an XML file containing all metadata associated with the model-building process, so that the scripts can discover where the raw data needed to build the model is located. The HUN GW Model v01 contains the index file (index.xml) used to build the Hunter groundwater model.
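The scripts themselves are not reproduced here, but the pattern described (an optparse-based command line that points at the index XML file) can be sketched as follows; the option names are illustrative, not those of the actual scripts:

import optparse
import xml.etree.ElementTree as ET

# Illustrative optparse-based command line (option names are hypothetical).
parser = optparse.OptionParser(description="Example driver that reads the model-building index file")
parser.add_option("-i", "--index", dest="index_file", default="index.xml",
                  help="XML index file holding the model-building metadata")
(options, args) = parser.parse_args()

# The index file tells the scripts where the raw data lives; list its top-level entries.
tree = ET.parse(options.index_file)
for child in tree.getroot():
    print(child.tag, child.attrib)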
Finally, the "code" directory contains a snapshot of the MOOSE C++ code used to run the model.
Computer code and templates were written by hand.
Bioregional Assessment Programme (2016) HUN GW Model code v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/e54a1246-0076-4799-9ecf-6d673cf5b1da.
Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope (OOS), i.e., queries that do not fall into any of the system's supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. It offers a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('clinc_oos', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
Title: Architectural Design Decisions for the Machine Learning Workflow: Dataset and Code
Authors: Stephen John Warnett; Uwe Zdun
About: This is the dataset and code artifact for the article entitled "Architectural Design Decisions for the Machine Learning Workflow".
Contents: The "_generated" directory contains the generated results, including latex files with tables for use in publications and the Architectural Design Decision model in textual and graphical form. "Generators" contains Python applications that can be run to generate the above. "Metamodels" contains a Python file with type definitions. "Sources_coding" contains our source codings and audit trail. "Add_models" contains the Python implementation of our model and source codings. Finally, "appendix" contains a detailed description of our research method.
Article Abstract: Bringing machine learning models to production is challenging as it is often fraught with uncertainty and confusion, partially due to the disparity between software engineering and machine learning practices, but also due to knowledge gaps on the level of the individual practitioner. We conducted a qualitative investigation into the architectural decisions faced by practitioners as documented in gray literature based on Straussian Grounded Theory and modeled current practices in machine learning. Our novel Architectural Design Decision model is based on current practitioner understanding of the topic and helps bridge the gap between science and practice, foster scientific understanding of the subject, and support practitioners via the integration and consolidation of the myriad decisions they face. We describe a subset of the Architectural Design Decisions that were modeled, discuss uses for the model, and outline areas in which further research may be pursued.
Objective: This article aims to study current practitioner understanding of architectural concepts associated with data processing, model building, and Automated Machine Learning (AutoML) within the context of the machine learning workflow.
Method: Applying Straussian Grounded Theory to gray literature sources containing practitioner views on machine learning practices, we studied methods and techniques currently applied by practitioners in the context of machine learning solution development and gained valuable insights into the software engineering and architectural state of the art as applied to ML.
Results: Our study resulted in a model of Architectural Design Decisions, practitioner practices, and decision drivers in the field of software engineering and software architecture for machine learning.
Conclusions: The resulting Architectural Design Decisions model can help researchers better understand practitioners' needs and the challenges they face, and guide their decisions based on existing practices. The study also opens new avenues for further research in the field, and the design guidance provided by our model can also help reduce design effort and risk. In future work, we plan on using our findings to provide automated design advice to machine learning engineers.
Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and to support user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order of Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee owned lands loaded before overlapping management designations, and easements). The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") associated item of PAD-US 3.0 Spatial Analysis and Statistics ( https://doi.org/10.5066/P9KLBB5D ) was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries. Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip") and Comma-separated Value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format and enable users to explore and download summary statistics of interest (Comma-separated Table [CSV], Microsoft Excel Workbook [.XLSX], Portable Document Format [.PDF] Report) from the PAD-US Lands and Inland Water Statistics Dashboard ( https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics ). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allow for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D ). Note, the PAD-US inventory is now considered functionally complete with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/ ). In addition, changes in protected area status between versions of the PAD-US may be attributed to improving the completeness and accuracy of the spatial data more than actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data. While PAD-US is the official aggregation of protected areas ( https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html ), agencies are the best source of their lands data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into the required format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except:
    wandb.login(anonymous='must')
    print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
          'Use the label name WANDB. Get your W&B access token from here: https://wandb.ai/authorize')
wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
Image: computer vision cycle (https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png)
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
Image: Roboflow (https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG)
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine batch size
A sketch of a typical training command is shown below.
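A typical invocation might look like the following; the numeric values, the run name, and the data.yaml path inside the Roboflow download are assumptions, not the exact command used for the article:

!python train.py --img 640 --batch 16 --epochs 30 --data {dataset.location}/data.yaml --weights 'yolov7.pt' --name yolov7-car-person-custom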
The PyProcar Python package plots the band structure and the Fermi surface as a function of site- and/or s,p,d,f-projected wavefunctions obtained for each k-point in the Brillouin zone and each band in an electronic structure calculation. This can be performed on top of any electronic structure code, as long as the band and projection information is written in the PROCAR format, as done by the VASP and ABINIT codes. PyProcar can be easily modified to read other formats as well. This package is particularly suitable for understanding atomic contributions to the band structure, Fermi surface, spin texture, etc. PyProcar can be conveniently used in a command-line mode, where each one of the parameters defines a plot property. In the case of Fermi surfaces, the package is able to plot the surface with colors depending on other properties such as the electron velocity or spin projection. The mesh used to calculate the property does not need to be the same as the one used to obtain the Fermi surface. A file with a specific property evaluated for each k-point in a k-mesh and for each band can be used to project other properties such as the electron-phonon mean free path, Fermi velocity, electron effective mass, etc. Another existing feature is the band unfolding of supercell calculations into predefined unit cells.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.
One can find a description of the provided dataset partitions in Section 3 of Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
Every dataset container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and a corresponding binarized SIMP solution. Every file of the form {i}.csv contains all voxel-wise information about the sample i. Every file of the form {i}_info.csv contains scalar parameters of the topology optimization problem, such as material parameters.
This dataset represents topology optimization problems and solutions on the basis of voxels. We define all spatially varying quantities via the voxels' centers, rather than via the vertices or surfaces of the voxels.
In {i}.csv files, each row corresponds to one voxel in the design space. The columns correspond to ['x', 'y', 'z', 'design_space', 'dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density'].
Any of these files with the index i can be imported using pandas by executing:
import pandas as pd
directory = ...
file_path = f'{directory}/{i}.csv'
column_names = ['x', 'y', 'z', 'design_space','dirichlet_x', 'dirichlet_y', 'dirichlet_z', 'force_x', 'force_y', 'force_z', 'density']
data = pd.read_csv(file_path, names=column_names)
From this pandas dataframe one can extract the torch tensors of forces F, Dirichlet conditions ω_Dirichlet, and design space information ω_design using the following functions:
import torch
def get_shape_and_voxels(data):
shape = data[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
vox_x = data['x'].values
vox_y = data['y'].values
vox_z = data['z'].values
voxels = [vox_x, vox_y, vox_z]
return shape, voxels
def get_forces_boundary_conditions_and_design_space(data, shape, voxels):
F = torch.zeros(3, *shape, dtype=torch.float32)
F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_x'].values, dtype=torch.float32)
F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_y'].values, dtype=torch.float32)
F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['force_z'].values, dtype=torch.float32)
ω_Dirichlet = torch.zeros(3, *shape, dtype=torch.float32)
ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_x'].values, dtype=torch.float32)
ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_y'].values, dtype=torch.float32)
ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(data['dirichlet_z'].values, dtype=torch.float32)
ω_design = torch.zeros(1, *shape, dtype=int)
ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['design_space'].values.astype(int))
return F, ω_Dirichlet, ω_design
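Putting the two helpers together for one sample (continuing from the code above):

# Build the tensors for the sample loaded above.
shape, voxels = get_shape_and_voxels(data)
F, ω_Dirichlet, ω_design = get_forces_boundary_conditions_and_design_space(data, shape, voxels)
print(F.shape, ω_Dirichlet.shape, ω_design.shape)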
The corresponding {i}_info.csv files only have one row with column labels ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z'].
Analogously to above, one can import any {i}_info.csv file by executing:
file_path = f'{directory}/{i}_info.csv'
data_info_column_names = ['E', 'ν', 'σ_ys', 'vox_size', 'p_x', 'p_y', 'p_z']
data_info = pd.read_csv(file_path, names=data_info_column_names)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. It is pretrained on English text using a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
{'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
{'generated_text': "Hello, I'm a language model, why does this matter for you?
When I hear new languages, I tend to start thinking in terms"},
{'generated_text': "Hello, I'm a language model, a functional language...
I don't need to know anything else. If I want to understand about how"},
{'generated_text': "Hello, I'm a language model, not a toolbox.
In a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...