The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: the CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset.
For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files. The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference
Studies that use the data (in any form) are required to include the following reference:
@inproceedings{Orru2015,
  author    = {Orrú, Matteo and Tempero, Ewan and Marchesi, Michele and Tonelli, Roberto and Destefanis, Giuseppe},
  title     = {A Curated Benchmark Collection of Python Systems for Empirical Studies on Software Engineering},
  booktitle = {Submitted to PROMISE '15},
  year      = {2015},
  keywords  = {Python, Empirical Studies, Curated Code Collection},
  abstract  = {The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.}
}
About the Data
Overview
This paper presents a dataset of metrics taken from a curated collection of 51 popular Python software systems.
The dataset reports 41 metrics of different categories: volume/size, complexity, and object-oriented metrics. These metrics are computed at both file and class level. We provide metrics for every file and class of each system, as well as global metrics computed on the entire system. Moreover, we provide 14 metadata items for each system.
Paper Abstract
The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.
This dataset contains samples to generate Python code for security exploits. In order to make the dataset representative of real exploits, it includes code snippets drawn from exploits from public databases. Differing from general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Therefore, real exploits make extensive use of Python instructions for converting data between different encoders, for performing low-level arithmetic and logical operations, and for bit-level slicing, which cannot be found in the previous general-purpose Python datasets. In total, we built a dataset that consists of 1,114 original samples of exploit-tailored Python snippets and their corresponding intent in the English language. These samples include complex and nested instructions, as typical of Python programming. In order to perform more realistic training and for a fair evaluation, we left untouched the developers' original code snippets and did not decompose them. We provided English intents to describe nested instructions altogether. In order to bootstrap the training process for the NMT model, we include in our dataset both the original, exploit-oriented snippets and snippets from a previous general-purpose Python dataset. This enables the NMT model to generate code that can mix general-purpose and exploit-oriented instructions. Among the several datasets for Python code generation, we choose the Django dataset due to its large size. This corpus contains 14,426 unique pairs of Python statements from the Django Web application framework and their corresponding description in English. Therefore, our final dataset contains 15,540 unique pairs of Python code snippets alongside their intents in natural language.
Dataset Card for "Magicoder-Evol-Instruct-110K-python"
from datasets import load_dataset
dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train") # Replace with your dataset and split
def contains_python(entry):
    # Keep the example if any message in the conversation mentions "python".
    for c in entry["messages"]:
        if "python" in c["content"].lower():
            return True
    return False
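The filter can then be applied to keep only the Python-related conversations; a minimal sketch of how the "-python" subset could be derived (the variable name and the commented push step are illustrative, not part of the original card):

filtered = dataset.filter(contains_python)
print(filtered)
# filtered.push_to_hub("your-username/Magicoder-Evol-Instruct-110K-python")  # optional, illustrative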
https://spdx.org/licenses/CC0-1.0.html
This dataset accompanies the eLife publication "Automated annotation of birdsong with a neural network that segments spectrograms". In the article, we describe and benchmark TweetyNet, a neural network architecture that automates the annotation of birdsong by segmenting spectrograms. Here we provide checkpoint files that contain the weights of trained TweetyNet models. The checkpoints we provide correspond to the models that obtained the lowest error rates on the benchmark datasets used (as reported in the Results section titled "TweetyNet annotates with low error rates across individuals and species"). We share these checkpoints to enable other researchers to replicate our key result, and to allow users of our software to leverage them, for example to improve performance on their data by adapting pre-trained models with transfer learning methods.
Methods
Checkpoint files were generated using the vak library (https://vak.readthedocs.io/en/latest/), running it with configuration files that are part of the code repository associated with the TweetyNet manuscript (https://github.com/yardencsGitHub/tweetynet). Those "config files" are in the directory "article/data/configs" and can be run on the appropriate datasets (as described in the paper). The "source data" files used to generate the figures were created by running scripts on the final results of running the vak library. Those source data files and scripts are in the code repository as well. For further detail, please see the methods section in https://elifesciences.org/articles/63853
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
E - Young's modulus [Pa]
ν - Poisson's ratio [-]
σ_ys - yield stress [Pa]
h - discretization size of the voxel grid [m]
The columns of i.csv correspond to the following voxel-wise information:
x, y, z - the indices that state the location of the voxel within the voxel mesh
Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel: 0 and 1 indicate that the density is fixed at 0 or 1, respectively; -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved, name is the name of the SELTO subset (one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex"), and train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd
root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
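If i_info.csv holds a single row of scalars in the column order given above (an assumption for illustration, not stated explicitly in the description), the values can be unpacked directly:

E, ν, σ_ys, h = df_info.iloc[0]  # assumes one row of scalar values per sample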
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch
def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # Voxel indices are zero-based, so the grid shape is the largest index + 1.
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
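A short usage sketch for the function above, assuming df was loaded from an i.csv file as shown earlier:

Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)  # each tensor has shape (channels, nx, ny, nz)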
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France between Jan-2016 and Aug-2016 were interviewed in order to have exploitable data. Therefore, this sample might not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
We arbitrarily chose characteristics to describe data scientists and data set sizes.
Data set size:
For the data, it uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. It might not be "free" of internal contradictions, as with all research.
Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope (OOS), i.e., queries that do not fall into any of the system's supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. It offers a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('clinc_oos', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains post-processed data obtained from a variational Monte Carlo approach for a Hubbard model with a complex, spin- and direction-dependent phase. This model is believed to properly describe the essential features of the WSe2 twisted homo-bilayer. The included Python notebook allows one to generate figures regarding the formation of the Mott insulating phase and spin ordering.
Wikipedia - Image/Caption Matching Kaggle Competition.
This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research, as detailed in this SIGIR paper.
In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wit_kaggle', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wit_kaggle-train_with_extended_features-1.0.2.png
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:
metadata.zip: The dataset metadata and analysis results as CSV files.
scripts-and-logs.zip: Scripts and logs of the dataset creation.
LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
README.md: This document.
redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

Metadata
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

repositories.csv:
ID (integer): GitHub repository ID
url (string): GitHub repository URL
downloaded (boolean): Whether cloning the repository succeeded
name (string): Repository name
description (string): Repository description
licenses (string, list of strings): Repository licenses
redistributable (boolean): Whether the repository's licenses permit redistribution
created (string, date & time): Time of the repository's creation
updated (string, date & time): Time of the last update to the repository
pushed (string, date & time): Time of the last push to the repository
fork (boolean): Whether the repository is a fork
forks (integer): Number of forks
archive (boolean): Whether the repository is archived
programs (string, list of strings): Project file path of each IaC program in the repository

programs.csv:
ID (string): Project file path of the IaC program
repository (integer): GitHub repository ID of the repository containing the IaC program
directory (string): Path of the directory containing the IaC program's project file
solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
name (string): IaC program name
description (string): IaC program description
runtime (string): Runtime string of the IaC program
testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
tests (string, list of strings): File paths of IaC program's tests

testing-files.csv:
file (string): Testing file path
language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
program (string): Project file path of the testing file's IaC program

Dataset Creation
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

Searching Repositories
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. Github access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

Limitations
The script uses the GitHub code search API and inherits its limitations:
Only forks with more stars than the parent repository are included.
Only the repositories' default branches are considered.
Only files smaller than 384 KB are searchable.
Only repositories with fewer than 500,000 files are considered.
Only repositories that have had activity or have been returned in search results in the last year are considered.
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

Downloading Repositories
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
2. Output directory to download the repositories to.
3. Name of the CSV output file.
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This file comprises an HDF5-compressed table intended for use with the Python package Pandas. Its columns describe 42 metrics, or computational details on those metrics; its rows are scenes, indexed by a string of the form "yyyy-mm-dd-s-n", where:
- y: year
- m: month
- d: day
- s: satellite (a - Aqua, t - Terra)
- n: scene number on the date
The file's metadata contains a dictionary that converts column headers into more legible descriptions. See e.g. https://stackoverflow.com/a/29130146 for instructions to load this data. Use keyword 'mydata' to access the data and metadata in the file.
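A minimal loading sketch following the Stack Overflow approach referenced above (the file name is a placeholder; the 'mydata' key is the one stated above):

import pandas as pd

with pd.HDFStore("scene_metrics.h5", mode="r") as store:   # replace with the actual downloaded file name
    df = store["mydata"]                                   # table of 42 metrics, one row per scene
    column_descriptions = store.get_storer("mydata").attrs.metadata  # dict mapping column headers to legible descriptions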
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The dataset was derived by the Bioregional Assessment Programme without the use of source datasets. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. The dataset comprises computer code and templates used to create the Hunter groundwater model. Broadly speaking, there are two types of files: those in templates_and_inputs, which are template files used by the code; and everything else, which is the computer code itself. An example of a type of file in templates_and_inputs are all the uaXXXX.txt, which describe the parameters used in uncertainty analysis XXXX. Much of the computer code is in the form of Python scripts, and most of these are run using either preprocess.py or postprocess.py (using subprocess.call). Each of the Python scripts employs optparse, and so is largely self-documenting. Each of the Python scripts also requires an index file as an input, which is an XML file containing all metadata associated with the model building process, so that the scripts can discover where the raw data needed to build the model is located. The HUN GW Model v01 contains the index file (index.xml) used to build the Hunter groundwater model. Finally, the "code" directory contains a snapshot of the MOOSE C++ code used to run the model.
Dataset History
Computer code and templates were written by hand.
Dataset Citation
Bioregional Assessment Programme (2016) HUN GW Model code v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/e54a1246-0076-4799-9ecf-6d673cf5b1da.
The klib library enables us to quickly visualize missing data, perform data cleaning, plot data distributions, plot correlations, and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).
Original Github repo
klib header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png
!pip install klib
import klib
import pandas as pd
df = pd.DataFrame(data)  # 'data' stands in for your own dataset, e.g. a dict of columns or a loaded CSV
# klib.describe functions for visualizing datasets
- klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
- klib.corr_mat(df) # returns a color-encoded correlation matrix
- klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
- klib.dist_plot(df) # returns a distribution plot for every numeric feature
- klib.missingval_plot(df) # returns a figure containing information about missing values
Take a look at this starter notebook.
Further examples, as well as applications of the functions can be found here.
Pull requests and ideas, especially for further functions, are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this GitHub repo.
Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon's iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.
Over 130 million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in the AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote and escape characters).
Each dataset contains the following columns:
marketplace - 2-letter country code of the marketplace where the review was written.
customer_id - Random identifier that can be used to aggregate reviews written by a single author.
review_id - The unique ID of the review.
product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
product_parent - Random identifier that can be used to aggregate reviews for the same product.
product_title - Title of the product.
product_category - Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
star_rating - The 1-5 star rating of the review.
helpful_votes - Number of helpful votes.
total_votes - Number of total votes the review received.
vine - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline - The title of the review.
review_body - The review text.
review_date - The date the review was written.
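For working with the raw TSV files directly, independently of the TensorFlow Datasets loader shown below, a minimal pandas sketch (the local path is a placeholder; quoting is disabled because the files contain no quote or escape characters):

import csv
import pandas as pd

tsv_path = "path/to/amazon_reviews_sample.tsv"  # placeholder for a TSV downloaded from the amazon-reviews-pds bucket
reviews = pd.read_csv(tsv_path, sep="\t", quoting=csv.QUOTE_NONE, on_bad_lines="skip")
print(reviews[["product_title", "star_rating", "review_date"]].head())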
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('amazon_us_reviews', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Included in this content:
0045.perovskitedata.csv - main dataset used in this article. A more detailed description can be found in the "dataset overview" section below.
Chemical Inventory.csv - the hand-curated file of all chemicals used in the construction of the perovskite dataset. This file includes identifiers, chemical properties, and other information.
ExcessMolarVolumeData.xlsx - record of experimental data, computations, and final dataset used in the generation of the excess molar volume plots.
MLModelMetrics.xlsx - all of the ML metrics organized in one place (excludes the reactant-set-specific breakdown; see ML_Logs.zip for those files).
OrganoammoniumDensityDataset.xlsx - complete set of the data used to generate the density values. Example calculations included.
model_matchup_main.py - Python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the "ML Code" section below. This file is also hosted on GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/model_matchup_main_20191231.py
SolutionVolumeDataset - complete set of 219 solutions in the perovskite dataset. Tabs include the automatically generated reagent information from ESCALATE, hand curated reagent information from early runs, and the generation of the dataset used in the creation of Figure 5.
error_auditing.zip - code and historical datasets used for reporting the dataset auditing.
“AllCode.zip” which contains:
model_matchup_main_20191231.py - Python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the "ML Code" section below. This file is also hosted on GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/0045.perovskitedata.csv
VmE_CurveFitandPlot.py - python code for generating the third order polynomial fit to the VmE vs mole fraction of FAH included in the main text. Requires the ‘MolFractionResults.csv’ to function (also included).
Calculation_Vm_Ve_CURVEFITTING.nb - mathematica code for generating the third order polynomial fit to the VmE vs mole fraction of FAH included in the main text.
Covariance_Analysis.py - python code for ingesting and plotting the covariance of features and volumes in the perovskite dataset. Includes renaming dictionaries used for the publication.
FeatureComparison_Plotting.py - python code for reading in and plotting features for the ‘GBT’ and ‘OHGBT’ folders in this directory. The code parses the contents of these folders and generates feature comparison metrics used for Figure 9 and the associated Figure S8. Some assembly required.
Requirements.txt - all of the packages used in the generation of this paper
0045.perovskitedata.csv - the main dataset described throughout the article. This file is required to run some of the code and is therefore kept near the code.
“ML_Logs.zip” which contains:
A folder describing every model generated for this article. In each folder there are a number of files:
Features_named_important.csv and features_value_importance.csv - these files are linked together and describe the weighted feature contributions from features (only present for GBT models)
AnalysisLog.txt - Log file of the run including all options, data curation and model training summaries
LeaveOneOut_Summary.csv - Results of the leave-one-reactant set-out studies on the model (if performed)
LOOModelInfo.txt - Hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
STTSModelInfo.txt - Hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
StandardTestTrain_Summary.csv - Results of the 6 fold cross validation ML performance (for the hold out case)
LeaveOneOut_FullDataset_ByAmine.csv - Results of the leave-one-reactant set-out studies performed on the full dataset (all experiments) specified by reactant set (delineated by the amine)
LeaveOneOut_StratifiedData_ByAmine.csv - Results of the leave-one-reactant set-out studies performed on a random stratified sample (96 random experiments) specified by reactant set (delineated by the amine)
model_matchup_main_*.py - code used to generate all of the runs contained in a particular folder. The code is exactly what was used at run time to generate a given dataset (requires 0045.perovskitedata.csv file to run).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global Aridity Index and Potential Evapotranspiration Database: CMIP_6 Future Projections (Future_Global_AI_PET)
Robert J. Zomer 1, 2, 3; Antonio Trabucco 1, 4
1. Euro-Mediterranean Center on Climate Change, IAFES Division, Sassari, Italy.
2. Centre for Mountain Futures, Kunming Institute of Botany, Chinese Academy of Science, Kunming, Yunnan, China.
3. CIFOR-ICRAF China Program, World Agroforestry (ICRAF), Kunming, Yunnan, China.
4. National Biodiversity Future Center (NBFC), Palermo, Italy.
The Global Aridity Index and Potential Evapotranspiration (Global AI-PET) Database: CMIP_6 Future Projections – Version 1 (Future_Global_AI_PET) provides a high-resolution (30 arc-seconds) global raster dataset of average monthly and annual potential evapotranspiration (PET) and aridity index (AI) for two historical (1960-1990; 1970-2000) and two future (2021-2040; 2041-2060) time periods for each of 22 CMIP6 Earth System Models (ESMs) across four emission scenarios (SSP: 126, 245, 370, 585). The database also includes three averaged multi-model ensembles produced for each of the four emission scenarios:
· All Models: includes all of the 22 ESMs, as available within a particular SSP.
· High Risk: includes 5 ESMs identified as projecting the highest increases in temperature and precipitation and lying outside and significantly higher than the majority of estimates.
· Majority Consensus: includes 15 ESMs, that is, all available ESMs excluding those in the "High Risk" category and those missing data across all of the 4 SSPs. Further referred to herein as the "Consensus" category.
These geo-spatial datasets have been produced with the support of the Euro-Mediterranean Center on Climate Change, IAFES Division; the Centre for Mountain Futures, Kunming Institute of Botany, Chinese Academy of Science; the CIFOR-ICRAF China Program, World Agroforestry (CIFOR-ICRAF); and the National Biodiversity Future Center (NBFC). These datasets are provided under a CC BY 4.0 License (please attribute), in standard GeoTIFF format, WGS84 Geographic Coordinate System, 30 arc-seconds or ~1 km at the equator, to support studies contributing to sustainable development, biodiversity and environmental conservation, poverty alleviation, and adaptation to climate change, among other global, regional, national, and local concerns.
The Future_Global_AI_PET is available online from the Science Data Bank (ScienceDB) at: https://doi.org/10.57760/sciencedb.nbsdc.00086
Previous versions of the Global Aridity Index and PET Database are available online here: https://figshare.com/articles/dataset/Global_Aridity_Index_and_Potential_Evapotranspiration_ET0_Climate_Database_v2/7504448/6
Technical questions regarding the datasets can be directed to Robert Zomer: r.zomer@mac.com or Antonio Trabucco: antonio.trabucco@cmcc.it
Methods:
Based on the results of comparative validations, the Hargreaves model was evaluated as one of the best fits for modeling PET and the aridity index globally with the available high-resolution downscaled and bias-corrected climate projections, and was chosen for the implementation of the Global-AI_PET CMIP6 Future Projections. This method performs almost as well as the Penman-Monteith method, but requires less parameterization and has significantly lower sensitivity to error in climatic inputs (Hargreaves and Allen, 2003). The currently available downscaled CMIP6 projections (from WorldClim) provide only a few climate variables, which are suitable for temperature-based evapotranspiration methods such as the Hargreaves method.
Hargreaves (1985, 1994) uses mean monthly temperature (Tmean), mean monthly temperature range (TD) and extraterrestrial radiation (RA, radiation at the top of the atmosphere) to calculate ET0, as shown below:
PET = 0.023 * RA * (Tmean + 17.8) * TD^0.5
where RA is extraterrestrial radiation at the top of the atmosphere, TD is the difference between mean maximum temperatures and mean minimum temperatures (Tmax - Tmin), and Tmean is equal to (Tmax + Tmin) / 2. The Hargreaves equation has been implemented globally on a per-grid-cell basis at 30 arc-seconds resolution (~1 km² at the equator), in ArcGIS (v11.1) using Python v3.2 (see the code availability section), to estimate PET/AI globally using future projections provided by the CMIP6 collaboration. The data to parametrize the equation were obtained from the WorldClim (worldclim.org) online data repository, which provides bias-corrected downscaled monthly values of minimum temperature, maximum temperature, and precipitation for 25 CMIP6 Earth System Models (ESMs), across four Shared Socio-economic Pathways (SSPs): 126, 245, 370 and 585. PET/AI was estimated for two historical periods, WorldClim 1.4 (1960-1990) and WorldClim 2.1 (1970-2000), representing on average a decade's change, by applying the Hargreaves methodology described above. Similarly, PET/AI was estimated for two future time periods, namely 2021-2040 and 2041-2060, for each of the 25 models across their respective four SSP scenarios (126, 245, 370, 585).
Aridity Index
Aridity is often expressed as an Aridity Index, the ratio of precipitation over PET, signifying the amount of precipitation available in relation to atmospheric water demand and quantifying the water (from rainfall) availability for plant growth after ET demand has been met, comparing incoming moisture totals with potential outgoing moisture. The AI for the averaged time periods has been calculated on a per-grid-cell basis as:
AI = MA_Prec / MA_PET
where:
AI = Aridity Index
MA_Prec = Mean Annual Precipitation
MA_PET = Mean Annual Reference Evapotranspiration
The mean annual precipitation (MA_Prec) values were obtained from the CMIP6 climate projections, while the ET0 datasets estimated on a monthly average basis by the method described above were aggregated to mean annual values (MA_PET). Using this formulation, AI values are unitless, increasing with more humid conditions and decreasing with more arid conditions.
Multi-Model Averaged Ensembles
Based upon the distribution of the various scenarios along a gradient of their projected temperature and precipitation estimates for each of the four SSPs and two future time periods, three multi-model ensembles, each articulated by their four respective SSPs, were identified. The three parameters of monthly minimum temperature, monthly maximum temperature and monthly precipitation for the ESMs included within each of these ensemble categories were averaged for each of their respective SSPs. These averaged parameters were then used to calculate the PET/AI as per the above methodology.
Code Availability:
The algorithm and code in Python used to calculate PET and AI are available on Figshare at the link below:
https://figshare.com/articles/software/Global_Future_PET_AI_Algorithm_Code_Python_-_Calculate_PET_AI/24978666
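For orientation, the per-grid-cell computation described above can be sketched in a few lines of Python/NumPy (an illustrative reimplementation, not the published Figshare code, which was run within ArcGIS):

import numpy as np

def hargreaves_pet(tmean, tmax, tmin, ra):
    # PET = 0.023 * RA * (Tmean + 17.8) * TD^0.5, with RA expressed as equivalent evaporation per time step
    td = np.clip(tmax - tmin, 0.0, None)  # mean diurnal temperature range (TD)
    return 0.023 * ra * (tmean + 17.8) * np.sqrt(td)

def aridity_index(ma_prec, ma_pet):
    # AI = MA_Prec / MA_PET (unitless); higher values indicate more humid conditions
    return ma_prec / ma_pet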
DATA FORMAT
PET datasets are available as monthly averages (12 datasets, i.e. one dataset for each month, averaged over the specified time period) or as an annual average (1 dataset) for the specified time period. Aridity Index grid layers are available as one grid layer representing the annual average over the specified period.
The following nomenclature is used to describe the dataset:
Zipped files - directory names refer to: Model_SSP_Time-Period
For example: ACCESS-CM2_126_2021-2040.zip is Model: ACCESS-CM2 / SSP: 126 / Time-Period: 2021-2040
Prefix of files (TIFFs) is either:
pet_ for PET layers
aridity_index for Aridity Index (no suffix)
Suffix for PET files is either:
1, 2, ... 12 - month of the year
yr - yearly average
sd - standard deviation
Examples:
pet_02.tif is the PET average for the month of February.
pet_yr.tif is the PET annual average.
pet_sd.tif is the standard deviation of the annual PET.
aridity_index.tif is the annual aridity index.
The PET values are defined as total mm of PET per month or per year. The Aridity Index values are unitless. The geospatial dataset is in geographic coordinates; datum and spheroid are WGS84; spatial units are decimal degrees. The spatial resolution is 30 arc-seconds or 0.008333 degrees. Arc degrees and seconds are angular distances, and conversion to linear units (like km) varies with latitude. The Future-PET and Future-Aridity Index data layers have been processed and finalized for distribution online as GeoTIFFs. These datasets have been zipped (.zip) into monthly series or individual annual layers, by each combination of climate model/scenario, and are available for online access.
Data Storage Hierarchy
The database is organized for storage into a hierarchy of directories (see ReadMe.doc). (Individual zipped files are generally about 1 GB or less.)
Associated Peer-Reviewed Journal Article:
Zomer RJ, Xu J, Spano D and Trabucco A. 2024. CMIP6-based global estimates of future aridity index and potential evapotranspiration for 2021-2060. Open Research Europe 4:157. https://doi.org/10.12688/openreseurope.18110.1
For further information, please refer to these earlier papers describing the database and methodology:
Zomer, R.J.; Xu, J.; Trabucco, A. 2022. Version 3 of the Global Aridity Index and Potential Evapotranspiration Database. Scientific Data 9, 409.
Zomer, R.J.; Bossio, D.A.; Trabucco, A.; van Straaten, O.; Verchot, L.V. 2008. Climate Change Mitigation: A Spatial Analysis of Global Land Suitability for Clean Development Mechanism Afforestation and Reforestation. Agric. Ecosystems and Environment. 126:67-80.
Trabucco, A.; Zomer, R.J.; Bossio, D.A.; van Straaten, O.; Verchot, L.V. 2008. Climate Change Mitigation through Afforestation / Reforestation: A global analysis of hydrologic
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CuBERT ETH150 Open Benchmarks
This is an unofficial HuggingFace upload of the CuBERT ETH150 Open Benchmarks. This dataset was released along with Learning and Evaluating Contextual Embedding of Source Code.
Benchmarks and Fine-Tuned Models
Here we describe the 6 Python benchmarks we created. All 6 benchmarks were derived from ETH Py150 Open. All examples are stored as sharded text files. Each text line corresponds to a separate example encoded as a JSON object. For each… See the full description on the dataset page: https://huggingface.co/datasets/claudios/cubert_ETHPy150Open.
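A tentative loading sketch with the Hugging Face datasets library (the repository ID is taken from the link above; if the dataset page defines per-benchmark configurations, a configuration name must be passed as the second argument):

from datasets import load_dataset

# Illustrative only; check the dataset page for the available configuration names.
ds = load_dataset("claudios/cubert_ETHPy150Open")
print(ds)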