100+ datasets found

Datasets for manuscript "A data engineering framework for chemical flow...
catalog.data.gov
gimi9.com
Updated Nov 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
Explore at:
Dataset updated
Nov 7, 2021
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The EPA GitHub repository PAU4ChemAs as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flows transfers, releases estimation, and identification of potential occupation exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the obtained outputs files after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
Z
Data from: #PraCegoVer dataset
data.niaid.nih.gov
Updated Jan 19, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Explore at:
Dataset updated
Jan 19, 2023
Dataset provided by
Gabriel Oliveira dos Santos
Esther Luna Colombini
Sandra Avila
Description
Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

containing the images. The file dataset.json comprehends a list of json objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
P
EVIL-Encoders Dataset
paperswithcode.com
opendatalab.com
Updated Aug 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). EVIL-Encoders Dataset [Dataset]. https://paperswithcode.com/dataset/evil-encoders
Explore at:
Dataset updated
Aug 31, 2021
Description
This dataset contains samples to generate Python code for security exploits. In order to make the dataset representative of real exploits, it includes code snippets drawn from exploits from public databases. Differing from general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Therefore, real exploits make extensive use of Python instructions for converting data between different encoders, for performing low-level arithmetic and logical operations, and for bit-level slicing, which cannot be found in the previous general-purpose Python datasets. In total, we built a dataset that consists of 1,114 original samples of exploit-tailored Python snippets and their corresponding intent in the English language. These samples include complex and nested instructions, as typical of Python programming. In order to perform more realistic training and for a fair evaluation, we left untouched the developers' original code snippets and did not decompose them. We provided English intents to describe nested instructions altogether. In order to bootstrap the training process for the NMT model, we include in our dataset both the original, exploit-oriented snippets and snippets from a previous general-purpose Python dataset. This enables the NMT model to generate code that can mix general-purpose and exploit-oriented instructions. Among the several datasets for Python code generation, we choose the Django dataset due to its large size. This corpus contains 14,426 unique pairs of Python statements from the Django Web application framework and their corresponding description in English. Therefore, our final dataset contains 15,540 unique pairs of Python code snippets alongside their intents in natural language.
U
Python code used to determine average yearly and monthly tourism per 1000...
data.usgs.gov
catalog.data.gov
Updated Aug 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter (2024). Python code used to determine average yearly and monthly tourism per 1000 residents for public-supply water service areas [Dataset]. http://doi.org/10.5066/P9FUL880
Explore at:
Unique identifier
https://doi.org/10.5066/P9FUL880
Dataset updated
Aug 27, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Authors
Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Time period covered
2021
Description
This child item describes Python code used to estimate average yearly and monthly tourism per 1000 residents within public-supply water service areas. Increases in population due to tourism may impact amounts of water used by public-supply water systems. This data release contains model input datasets, Python code used to develop the tourism information, and output estimates of tourism. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Output from this code was used as an input feature in the public supply delivery and water use machine learning models. This page includes the following files: tourism_input_data.zip - a zip file containing input data sets used by the tourism Python code tourism_output.zip - a zip file with output produced by the tourism Python code README.txt - a README file describing the data files and code requirements tourism_study_code.zip - a zip file containin ...
h
code-search-net-python
huggingface.co
Updated Dec 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fernando Tarin Morales (2023). code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-python
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 27, 2023
Authors
Fernando Tarin Morales
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for "code-search-net-python"

Dataset Description

Homepage: None Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python Paper: None Leaderboard: None Point of Contact: @Nan-Do

Dataset Summary

This dataset is the Python portion of the CodeSarchNet annotated with a summary column.The code-search-net dataset includes open source functions that include comments found at GitHub.The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
d
Dataset metadata of known Dataverse installations
search.dataone.org
dataverse.harvard.edu
+1more
Updated Nov 22, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/DCDKZQ
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Gautier, Julian
Description
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data. How the metadata was downloaded The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens. How the files are organized ├── csv_files_with_metadata_from_most_known_dataverse_installations │ ├── author(citation).csv │ ├── basic.csv │ ├── contributor(citation).csv │ ├── ... │ └── topic_classification(citation).csv ├── dataverse_json_metadata_from_each_known_dataverse_installation │ ├── Abacus_2022.10.02_17.11.19.zip │ ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv │ ├── Dataverse_JSON_metadata_2022.10.02_17.11.19 │ ├── hdl_11272.1_AB2_0AQZNT_v1.0.json │ ├── ... │ ├── metadatablocks_v5.6 │ ├── astrophysics_v5.6.json │ ├── biomedical_v5.6.json │ ├── citation_v5.6.json │ ├── ... │ ├── socialscience_v5.6.json │ ├── ACSS_Dataverse_2022.10.02_17.26.19.zip │ ├── ADA_Dataverse_2022.10.02_17.26.57.zip │ ├── Arca_Dados_2022.10.02_17.44.35.zip │ ├── ... │ └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip └── dataset_pids_from_most_known_dataverse_installations.csv └── licenses_used_by_dataverse_installations.csv └── metadatablocks_from_most_known_dataverse_installations.csv This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files. The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
h
Magicoder-Evol-Instruct-110K-python
huggingface.co
Updated Nov 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
pxy (2024). Magicoder-Evol-Instruct-110K-python [Dataset]. https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 17, 2024
Authors
pxy
Description
Dataset Card for "Magicoder-Evol-Instruct-110K-python"

from datasets import load_dataset

Load your dataset

dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train") # Replace with your dataset and split

Define a filter function

def contains_python(entry): for c in entry["messages"]: if "python" in c['content'].lower(): return True # return "python" in entry["messages"].lower() # Replace 'column_name' with the column to search

… See the full description on the dataset page: https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python.
c
Python code used to download gridMET climate data for public-supply water...
s.cnmilf.com
data.usgs.gov
+1more
Updated Aug 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Python code used to download gridMET climate data for public-supply water service areas [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/python-code-used-to-download-gridmet-climate-data-for-public-supply-water-service-areas
Explore at:
Dataset updated
Aug 29, 2024
Dataset provided by
U.S. Geological Survey
Description
This child item describes Python code used to retrieve gridMET climate data for a specific area and time period. Climate data were retrieved for public-supply water service areas, but the climate data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the climate data collector code were used as input feature variables in the public supply delivery and water use machine learning models. This page includes the following file: climate_data_collector.zip - a zip file containing the climate data collector Python code used to retrieve climate data and a README file.
Z
Python Systems for Empirical Analysis
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matteo Orrù (2020). Python Systems for Empirical Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_268468
Explore at:
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Matteo Orrù
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reference

Studies who have been using the data (in any form) are required to include the following reference:

@inproceedings{Orru2015, abstract = {The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.}, author = {Orrú, Matteo and Tempero, Ewan and Marchesi, Michele and Tonelli, Roberto and Destefanis, Giuseppe}, booktitle = {Submitted to PROMISE '15}, keywords = {Python, Empirical Studies, Curated Code Collection}, title = {A Curated Benchmark Collection of Python Systems for Empirical Studies on Software Engineering}, year = {2015} }

About the Data

Overview

This paper presents a dataset of metrics taken from a curated collection of 51 popular Python software systems.

The dataset reports 41 metrics of different categories: volume/size, complexity and object oriented metrics. These metrics and computed both at file and class level. We provide metrics for any file and class of each system and global metrics (computed on the entire system). Moreover we provide 14 meta-data for each system.

Paper Abstract

The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.
Z
Multimodal Vision-Audio-Language Dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roig, Gemma (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Schaumlöffel, Timothy
Roig, Gemma
Choksi, Bhavin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report. Annotation The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library. The split into train, validation and test set follows the split of the original datasets. Installation

pip install pandas pyarrow Example

import pandas as pddf = pd.read_parquet('annotation_train.parquet', engine='pyarrow')print(df.iloc[0])

dataset AudioSet filename train/---2_BBVHAA.mp3 captions_visual [a man in a black hat and glasses.] captions_auditory [a man speaks and dishes clank.] tags [Speech] Description The annotation file consists of the following fields:filename: Name of the corresponding file (video or audio file)dataset: Source dataset associated with the data pointcaptions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual contentcaptions_auditory: A list of captions related to the auditory content of the videotags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided Data files The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
h
starcoderdata-python-edu
huggingface.co
Updated Sep 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
starcoderdata-python-edu [Dataset]. https://huggingface.co/datasets/jon-tow/starcoderdata-python-edu
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 18, 2024
Authors
Jonathan Tow
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
starcoderdata-python-edu

StarCoder Training Dataset Cleaned and Scored

Dataset Details Dataset Description

This dataset is a filtered version of StarCoder Training Dataset that has been scored with the python-edu-scorer.

Dataset Sources

Repository: https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966 Paper: SmolLM - blazingly fast and remarkably powerful

Citation

@misc{allal2024SmolLM, title={SmolLM… See the full description on the dataset page: https://huggingface.co/datasets/jon-tow/starcoderdata-python-edu.
d
Python code used to download U.S. Census Bureau data for public-supply water...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Python code used to download U.S. Census Bureau data for public-supply water service areas [Dataset]. https://catalog.data.gov/dataset/python-code-used-to-download-u-s-census-bureau-data-for-public-supply-water-service-areas
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
This child item describes Python code used to query census data from the TigerWeb Representational State Transfer (REST) services and the U.S. Census Bureau Application Programming Interface (API). These data were needed as input feature variables for a machine learning model to predict public supply water use for the conterminous United States. Census data were retrieved for public-supply water service areas, but the census data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the census data collector code were used as input features in the public supply delivery and water use machine learning models. This page includes the following file: census_data_collector.zip - a zip file containing the census data collector Python code used to retrieve data from the U.S. Census Bureau and a README file.
o
Python based Github Repositories(above 500 stars)
opendatabay.com
.undefined
Updated Jun 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Python based Github Repositories(above 500 stars) [Dataset]. https://www.opendatabay.com/data/ai-ml/ad6ea39b-cd2b-436b-81e3-2ad733b45d07
Explore at:
.undefinedAvailable download formats
Dataset updated
Jun 29, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Software and Technology
Description
About the Dataset: This dataset is a collection of all the python language-based public repositories which are having 500 or above stars on Github. The dataset is collected on 05-05-2022, having a total repository count of 9031.

An upvote would be great if you found this dataset useful 🙂

Purpose

Generate descriptive statistics Data visualization. NLP can be performed on the description field of repositories Clustering by topics. Finding hidden gems of open source projects Description of Columns:

Column Description full_name Full name of repository repo_lang Programming used in the repository repo_topics Topics of the repository created_at Repository creation date description Description of repository forks_count Total fork count of the repository open_issues_count Current open issues count repo_size size of repo repo_stargazers_count Star count of repository repo_subscribers_count Subscriber count of repository repo_watchers_count Watchers count git_url Git URL of repository html_url HTML URL of repository

License

CC0

Original Data Source: Python based Github Repositories(above 500 stars)
f
Data and tools for studying isograms
figshare.com
Updated Jul 31, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5245810.v1
Dataset updated
Jul 31, 2017
Dataset provided by
figshare
Authors
Florian Breit
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).Below follows a brief description, first, of the included datasets and, second, of the included scripts.1. DatasetsThe data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.1.1 CSV formatThe CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.The CSV files contain one row per data point, with the colums separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

Label Data type Description

isogramy int The order of isogramy, e.g. "2" is a second order isogram

length int The length of the word in letters

word text The actual word/isogram in ASCII

source_pos text The Part of Speech tag from the original corpus

count int Token count (total number of occurences)

vol_count int Volume count (number of different sources which contain the word)

count_per_million int Token count per million words

vol_count_as_percent int Volume count as percentage of the total number of volumes

is_palindrome bool Whether the word is a palindrome (1) or not (0)

is_tautonym bool Whether the word is a tautonym (1) or not (0)

The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

Label

Data type

Description

!total_1grams

int

The total number of words in the corpus

!total_volumes

int

The total number of volumes (individual sources) in the corpus

!total_isograms

int

The total number of isograms found in the corpus (before compacting)

!total_palindromes

int

How many of the isograms found are palindromes

!total_tautonyms

int

How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.1.2 SQLite database formatOn the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:• Compacted versions of each dataset, where identical headwords are combined into a single entry.• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.The intersected dataset is by far the least noisy, but is missing some real isograms, too.The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.To get an idea of the various ways the database can be queried for various bits of data see the R script described below, which computes statistics based on the SQLite database.2. ScriptsThere are three scripts: one for tiding Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).2.1 Source dataThe scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html], (download all.al.gz).For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.2.2 Data preparationBefore processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.Tidying and reformatting can be done by running one of the following commands:python isograms.py --ngrams --indir=INDIR --outfile=OUTFILEpython isograms.py --bnc --indir=INFILE --outfile=OUTFILEReplace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.2.3 Isogram ExtractionAfter preparing the data as above, isograms can be extracted from by running the following command on the reformatted and tidied files:python isograms.py --batch --infile=INFILE --outfile=OUTFILEHere INFILE should refer the the output from the previosu data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.2.4 Creating a SQLite3 databaseThe output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them, if you only want to load one, just create an empty file for the other one).2. Copy the “create-database.sql” script into the same directory as the two data files.3. On the command line, go to the directory where the files and the SQL script are. 4. Type: sqlite3 isograms.db 5. This will create a database called “isograms.db”.See the section 1 for a basic descript of the output data and how to work with the database.2.5 Statistical processingThe repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
Z
PIPr: A Dataset of Public Infrastructure as Code Programs
data.niaid.nih.gov
zenodo.org
Updated Nov 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Spielmann, David (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8262770
Explore at:
Dataset updated
Nov 28, 2023
Dataset provided by
Spielmann, David
Sokolowski, Daniel
Salvaneschi, Guido
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:

metadata.zip: The dataset metadata and analysis results as CSV files. scripts-and-logs.zip: Scripts and logs of the dataset creation. LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text. README.md: This document. redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program. This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io. Metadata The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files. repositories.csv:

ID (integer): GitHub repository ID url (string): GitHub repository URL downloaded (boolean): Whether cloning the repository succeeded name (string): Repository name description (string): Repository description licenses (string, list of strings): Repository licenses redistributable (boolean): Whether the repository's licenses permit redistribution created (string, date & time): Time of the repository's creation updated (string, date & time): Time of the last update to the repository pushed (string, date & time): Time of the last push to the repository fork (boolean): Whether the repository is a fork forks (integer): Number of forks archive (boolean): Whether the repository is archived programs (string, list of strings): Project file path of each IaC program in the repository programs.csv:

ID (string): Project file path of the IaC program repository (integer): GitHub repository ID of the repository containing the IaC program directory (string): Path of the directory containing the IaC program's project file solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi") language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml") name (string): IaC program name description (string): IaC program description runtime (string): Runtime string of the IaC program testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking") tests (string, list of strings): File paths of IaC program's tests testing-files.csv:

file (string): Testing file path language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript") techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking") keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(") program (string): Project file path of the testing file's IaC program Dataset Creation scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below). Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories. Searching Repositories The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

Github access token. Name of the CSV output file. Filename to search for. File extensions to search for, separated by commas. Min file size for the search (for all files: 0). Max file size for the search or * for unlimited (for all files: *). Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/ AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup Limitations The script uses the GitHub code search API and inherits its limitations:

Only forks with more stars than the parent repository are included. Only the repositories' default branches are considered. Only files smaller than 384 KB are searchable. Only repositories with fewer than 500,000 files are considered. Only repositories that have had activity or have been returned in search results in the last year are considered. More details: https://docs.github.com/en/search-github/searching-on-github/searching-code The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api Downloading Repositories download-repositories.py downloads all repositories in CSV files generated through search-respositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

Name of the repositories CSV files generated through search-repositories.py, separated by commas. Output directory to download the repositories to. Name of the CSV output file. The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
Best Books Ever Dataset
zenodo.org
csv
Updated Nov 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4265096
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).

Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.

Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness | | ------------- | ------------- | ------------- | | bookId | Book Identifier as in goodreads.com | 100 | | title | Book title | 100 | | series | Series Name | 45 | | author | Book's Author | 100 | | rating | Global goodreads rating | 100 | | description | Book's description | 97 | | language | Book's language | 93 | | isbn | Book's ISBN | 92 | | genres | Book's genres | 91 | | characters | Main characters | 26 | | bookFormat | Type of binding | 97 | | edition | Type of edition (ex. Anniversary Edition) | 9 | | pages | Number of pages | 96 | | publisher | Editorial | 93 | | publishDate | publication date | 98 | | firstPublishDate | Publication date of first edition | 59 | | awards | List of awards | 20 | | numRatings | Number of total ratings | 100 | | ratingsByStars | Number of ratings by stars | 97 | | likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 | | setting | Story setting | 22 | | coverImg | URL to cover image | 99 | | bbeScore | Score in Best Books Ever list | 100 | | bbeVotes | Number of votes in Best Books Ever list | 100 | | price | Book's price (extracted from Iberlibro) | 73 |

Data from: Computational 3D resolution enhancement for optical coherence...

data.niaid.nih.gov
zenodo.org

Updated Jul 12, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Jeroen Kalkman (2024). Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7870794

Explore at:

Dataset updated

Jul 12, 2024

Dataset provided by

George-Othon Glentis
Jeroen Kalkman
Jos de Wit

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository contains the code and data underlying the publication "Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source" in Biomedical Optics Express 14, 3532-3554 (2023) (doi.org/10.1364/BOE.487345).

The reader is free to use the scripts and data in this depository, as long as the manuscript is correctly cited in their work. For further questions, please contact the corresponding author.

Description of the code and datasets

Table 1 describes all the Matlab and Python scripts in this depository. Table 2 describes the datasets. The input datasets are the phase corrected datasets, as the raw data is large in size and phase correction using a coverslip as reference is rather straightforward. Processed datasets are also added to the repository to allow for running only a limited number of scripts, or to obtain for example the aberration corrected data without the need to use python. Note that the simulation input data (input_simulations_pointscatters_SLDshape_98zf_noise75.mat) is generated with random noise, so if this is overwritten de results may slightly vary. Also the aberration correction is done with random apertures, so the processed aberration corrected data (exp_pointscat_image_MIAA_ISAM_CAO.mat and exp_leaf_image_MIAA_ISAM_CAO.mat) will also slightly change if the aberration correction script is run anew. The current processed datasets are used as basis for the figures in the publication. For details on the implementation we refer to the publication.

Table 1: The Matlab and Python scripts with their description


    Script name
    Description


    MIAA_ISAM_processing.m
    This scripts performs the DFT, RFIAA and MIAA processing of the phase-corrected data that can be loaded from the datasets. Afterwards it also applies ISAM on the DFT and MIAA data and plots the results in a figure (via the scripts plot_figure3, plot_figure5 and plot_simulationdatafigure).


    resolution_analysis_figure4.m
    This figure loads the data from the point scatterers (absolute amplitude data), seeks the point scatterrers and fits them to obtain the resolution data. Finally it plots figure 4 of the publication.


    fiaa_oct_c1.m, oct_iaa_c1.m, rec_fiaa_oct_c1.m, rfiaa_oct_c1.m 
    These four functions are used to apply fast IAA and MIAA. See script MIAA_ISAM_processing.m for their usage.


    viridis.m, morgenstemning.m
    These scripts define the colormaps for the figures.


    plot_figure3.m, plot_figure5.m, plot_simulationdatafigure.m
    These scripts are used to plot the figures 3 and 5 and a figure with simulation data. These scripts are executed at the end of script MIAA_ISAM_processing.m.


    Python script: computational_adaptive_optics_script.py
    Python script that applied computational adaptive optics to obtain the data for figure 6 of the manuscript.


    Python script: zernike_functions2.py
    Python script that gives the values and carthesian derrivatives of the Zernike polynomials.


    figure6_ComputationalAdaptiveOptics.m
    Script that loads the CAO data that was saved in Python, analyzes the resolution, and plots figure 6.


    Python script: OCTsimulations_3D_script2.py
    Python script simulates OCT data, adds noise and saves it as .mat file for use in the matlab script above.


    Python script: OCTsimulations2.py
    Module that contains a python class that can be used to simulate 3D OCT datasets based on a Gaussian beam.


    Matlab toolbox DIPimage 2.9.zip
    Dipimage is used in the scripts. The toolbox can be downloaded online or this zip can be used.






The datasets in this Zenodo repository


    Name
    Description


    input_leafdisc_phasecorrected.mat
    Phase corrected input image of the leaf disc (used in figure 5).


    input_TiO2gelatin_004_phasecorrected.mat
    Phase corrected input image of the TiO2 in gelatin sample.


    input_simulations_pointscatters_SLDshape_98zf_noise75
    Input simulation data that, once processed, is used in figure 4.

exp_pointscat_image_DFT.mat

exp_pointscat_image_DFT_ISAM.mat

exp_pointscat_image_RFIAA.mat

exp_pointscat_image_MIAA_ISAM.mat

exp_pointscat_image_MIAA_ISAM_CAO.mat

    Processed experimental amplitude data for the TiO2 point scattering sample with respectively DFT, DFT+ISAM, RFIAA, MIAA+ISAM and MIAA+ISAM+CAO. These datasets are used for fitting in figure 4 (except for CAO), and MIAA_ISAM and MIAA_ISAM_CAO are used for figure 6.

simu_pointscat_image_DFT.mat

simu_pointscat_image_RFIAA.mat

simu_pointscat_image_DFT_ISAM.mat

simu_pointscat_image_MIAA_ISAM.mat

    Processed amplitude data from the simulation dataset, which is used in the script for figure 4 for the resolution analysis.

exp_leaf_image_MIAA_ISAM.mat

exp_leaf_image_MIAA_ISAM_CAO.mat

    Processed amplitude data from the leaf sample, with and without aberration correction which is used to produce figure 6.

exp_leaf_zernike_coefficients_CAO_normal_wmaf.mat

exp_pointscat_zernike_coefficients_CAO_normal_wmaf.mat

    Estimated Zernike coefficients and the weighted moving average of them that is used for the computational aberration correction. Some of this data is plotted in Figure 6 of the manuscript.


    input_zernike_modes.mat
    The reference Zernike modes corresponding to the data that is loaded to give the modes the proper name.

exp_pointscat_MIAA_ISAM_complex.mat

exp_leaf_MIAA_ISAM_complex

    Complex MIAA+ISAM processed data that is used as input for the computational aberration correction.

SELTO Dataset
zenodo.org
data.niaid.nih.gov
application/gzip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7781392
Dataset updated
May 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

E - Young's modulus [Pa]

ν - Poisson's ratio [-]

σ_ys - a yield stress [Pa]

h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh

Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

from dl4to.datasets import SELTODataset dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

import pandas as pd root = ... file_path = f'{root}/{i}.csv' columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density'] df = pd.read_csv(file_path, names=columns)

Similarly, we can import a i_info.csv file via:

file_path = f'{root}/{i}_info.csv' info_column_names = ['E', 'ν', 'σ_ys', 'h'] df_info = pd.read_csv(file_path, names=info_columns)

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch def get_torch_tensors_from_dataframe(df, dtype=torch.float32): shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1 voxels = [df['x'].values, df['y'].values, df['z'].values] Ω_design = torch.zeros(1, *shape, dtype=int) Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['Ω_design'].values.astype(int)) Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype) Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype) Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype) Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype) F = torch.zeros(3, *shape, dtype=dtype) F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype) F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype) F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype) density = torch.zeros(1, *shape, dtype=dtype) density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype) return Ω_design, Ω_Dirichlet, F, density
d
Galilee Hydrological Response Variable (HRV) model
data.gov.au
researchdata.edu.au
zip
Updated Apr 13, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bioregional Assessment Program (2022). Galilee Hydrological Response Variable (HRV) model [Dataset]. https://data.gov.au/data/dataset/facb2291-249c-4903-93ce-d9c3f83e7640
Explore at:
zip(307155160)Available download formats
Dataset updated
Apr 13, 2022
Dataset authored and provided by
Bioregional Assessment Program
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement.

The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

This dataset contains the scripts and files for the uncertainty analysis presented in product 262 Groundwater modelling for the Galilee subregion (Peeters et al. 2016)

Purpose

Below is a description of the various files in the dataset; for csv files the content is described, for python scripts the purpose, input and output is described.

'GAL_AEM_dmax_Parameters.csv' describes the parameters used in the modelling, their transformation, minimum, maximum, distribution and corresponding mean and standard deviation of the distribution.

'GAL_AEM_dmax_Posterior_Parameters.csv' has the parameter combinations that form the posterior parameter ensemble.

'GAL_AEM_postprocess.py' reads and collates the model output from GAL_AEM_Model and results in following files:

'GAL_AEM_dmax_Predictions.csv' describes the maximum drawdwon at the receptors (min,max,median)

'GAL_AEM_dmax_PosteriorPredictions.csv' contains the posterior predictions for all receptors of maximum drawdown

'GAL_AEM_tmax_Predictions.csv' describes the time to maximum drawdwon at the receptors (min,max,median)

'GAL_AEM_tmax_PosteriorPredictions.csv' contains the posterior predictions for all rec

Dataset History

This dataset contains all the files and scripts of the uncertainty analysis reported in product 262 for the Galilee subregion (Peeters et al. 2016). The custom-made python scripts read and collate the simulated drawdown time series at the prediction nodes and the change in surface water groundwater flux computed by the Galilee Analytic Element model (dataset Galilee groundwater numerical modelling AEM models). The drawdown timeseries are summarised in maximum drawdown (dmax) and time to maximum drawdown (tmax). A total of 10,000 different parameter combinations are randomly sampled from the posterior probability distributions described in Peeters et al. (2016). The dmax and tmax values corresponding to these parameter combinations are further summarised in percentiles of dmax and tmax for each node. This is the information that is used to visualise the results in Peeters et al. (2016) and compute the sensitivity indices of the predictions to the parameters.

Dataset Citation

Bioregional Assessment Programme (2016) Galilee Hydrological Response Variable (HRV) model. Bioregional Assessment Derived Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/facb2291-249c-4903-93ce-d9c3f83e7640.

Dataset Ancestors

Derived From Surface Geology of Australia, 1:2 500 000 scale, 2012 edition

Derived From Geoscience Australia GEODATA TOPO series - 1:1 Million to 1:10 Million scale

Derived From Galilee model HRV receptors gdb

Derived From Galilee groundwater numerical modelling AEM models
Z
CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type...
data.niaid.nih.gov
zenodo.org
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Brust, Clemens-Alexander (2022). CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type Inference Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5747023
Explore at:
Dataset updated
Jan 28, 2022
Dataset provided by
Brust, Clemens-Alexander
Heinze, Thomas
Gruner, Bernd
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains python repositories mined on GitHub on January 20, 2021. It allows a cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, each containing only projects from the web or scientific calculation domain, respectively. Therefore we searched for projects with dependencies to either NumPy or Flask. Furthermore, only projects with dependencies to mypy were considered, because this should ensure that at least parts of the projects have type annotations. These can be used later as ground truth. Further details about the dataset will be described in an upcoming paper, as soon as it is published it will be linked here. The dataset consists of two files for the two sub-datasets. The web domain dataset contains 3129 repositories and the scientific calculation domain dataset contains 4783 repositories. The files have two columns with the URL to the GitHub repository and the used commit hash. Thus, it is possible to download the dataset using shell or python scripts, for example, the pipeline provided by ManyTypes4Py can be used. If repositories do not exist anymore or are private, you can contact us via the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.

Facebook

Twitter

Click to copy link

Link copied

Cite

U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr

Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations"

Explore at:

Dataset updated

Nov 7, 2021

Dataset provided by

United States Environmental Protection Agencyhttp://www.epa.gov/

Description

The EPA GitHub repository PAU4ChemAs as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flows transfers, releases estimation, and identification of potential occupation exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the obtained outputs files after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).

Clear search

Close search

Google apps

Main menu

Datasets for manuscript "A data engineering framework for chemical flow...

Data from: #PraCegoVer dataset

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

EVIL-Encoders Dataset

Python code used to determine average yearly and monthly tourism per 1000...

code-search-net-python

Dataset metadata of known Dataverse installations

Magicoder-Evol-Instruct-110K-python

Load your dataset

Define a filter function

… See the full description on the dataset page: https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python.

Python code used to download gridMET climate data for public-supply water...

Python Systems for Empirical Analysis

Multimodal Vision-Audio-Language Dataset

starcoderdata-python-edu

Python code used to download U.S. Census Bureau data for public-supply water...

Python based Github Repositories(above 500 stars)

License

Data and tools for studying isograms

PIPr: A Dataset of Public Infrastructure as Code Programs

Best Books Ever Dataset

Data from: Computational 3D resolution enhancement for optical coherence...

SELTO Dataset

Galilee Hydrological Response Variable (HRV) model

Abstract

Purpose

Dataset History

Dataset Citation

Dataset Ancestors

CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type...

Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations"