100+ datasets found
  1. SCodeSearcher

    • figshare.com
    zip
    Updated Mar 11, 2024
    Cite
    Jia Li (2024). SCodeSearcher [Dataset]. http://doi.org/10.6084/m9.figshare.25359841.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ## File path configuration

    Before you start training the model, make sure that all file paths are correctly set for your local environment. This includes the training data, the location where the model is saved, and any associated configuration files.

    - **Training data paths**: Check the paths of the training data to make sure they point to the correct location.
    - **Model save path**: The `checkpoint` directory is used to save model weights during training. Make sure this path points to the local directory where you want to save the model weights.

    **Important note**: Before running the run.sh script, open the script and any related Python files, then check and update the path settings.

    ## Soft contrastive learning

    To run soft contrastive learning, navigate to the corresponding directory and run the following commands:

    ```sh
    cd $Project_Path
    bash run.sh
    ```

    ## Parameter setting description

    - To adjust the weight range of positive samples, modify the softmax operation for `ai` on line 158 of `utils.py`.
    - To adjust the weight range of negative samples, adjust `bi` on line 163 of `utils.py`.

    ## Code Search

    The dataset file contains the code retrieval datasets and the code classification datasets.

    ```sh
    python run.py \
        --output_dir=./python \
        --config_name=/graphcodebert-base \
        --model_name_or_path=/graphcodebert-base \
        --tokenizer_name=/graphcodebert-base \
        --lang=python \
        --do_train \
        --train_data_file=/dataset/CSN-Python/train.jsonl \
        --eval_data_file=/dataset/CSN-Python/test.jsonl \
        --test_data_file=/dataset/CSN-Python/test.jsonl \
        --codebase_file=/dataset/CSN-Python/codebase.jsonl \
        --num_train_epochs 20 \
        --code_length 318 \
        --data_flow_length 64 \
        --nl_length 256 \
        --train_batch_size 32 \
        --eval_batch_size 64 \
        --learning_rate 2e-5 \
        --seed 42
    ```

  2. Population Assessment of Tobacco and Health (PATH) Study [United States]...

    • icpsr.umich.edu
    Updated Jun 27, 2025
    + more versions
    Cite
    Inter-university Consortium for Political and Social Research [distributor] (2025). Population Assessment of Tobacco and Health (PATH) Study [United States] Special Collection Restricted-Use Files [Dataset]. http://doi.org/10.3886/ICPSR37519.v13
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/37519/terms

    Area covered
    United States
    Description

    The PATH Study was launched in 2011 to inform the Food and Drug Administration's regulatory activities under the Family Smoking Prevention and Tobacco Control Act (TCA). The PATH Study is a collaboration between the National Institute on Drug Abuse (NIDA), National Institutes of Health (NIH), and the Center for Tobacco Products (CTP), Food and Drug Administration (FDA). The study sampled over 150,000 mailing addresses across the United States to create a national sample of people who use or do not use tobacco. 45,971 adults and youth constitute the first (baseline) wave, Wave 1, of data collected by this longitudinal cohort study. These 45,971 adults and youth, plus the 7,207 "shadow youth" (children ages 9 to 11 sampled at Wave 1), make up the 53,178 participants that constitute the Wave 1 Cohort. Respondents are asked to complete an interview at each follow-up wave. Youth who turn 18 by the current wave of data collection are considered "aged-up adults" and are invited to complete the Adult Interview. Additionally, "shadow youth" are considered "aged-up youth" upon turning 12 years old, when they are asked to complete an interview after parental consent.

    At Wave 4, a probability sample of 14,098 adults, youth, and shadow youth ages 10 to 11 was selected from the civilian, noninstitutionalized population at the time of Wave 4. This sample was recruited from residential addresses not selected for Wave 1 in the same sampled primary sampling units (PSUs) and segments using similar within-household sampling procedures. This "replenishment sample" was combined for estimation and analysis purposes with Wave 4 adult and youth respondents from the Wave 1 Cohort who were in the civilian, noninstitutionalized population at the time of Wave 4. This combined set of Wave 4 participants, 52,731 participants in total, forms the Wave 4 Cohort.

    At Wave 7, a probability sample of 14,863 adults, youth, and shadow youth ages 9 to 11 was selected from the civilian, noninstitutionalized population at the time of Wave 7. This sample was recruited from residential addresses not selected for Wave 1 or Wave 4 in the same sampled PSUs and segments using similar within-household sampling procedures. This "second replenishment sample" was combined for estimation and analysis purposes with the Wave 7 adult and youth respondents from the earlier cohorts who were at least age 15 and in the civilian, noninstitutionalized population at the time of Wave 7. This combined set of Wave 7 participants, 46,169 participants in total, forms the Wave 7 Cohort. Please refer to the Restricted-Use Files User Guide, which provides further details about children designated as "shadow youth" and the formation of the Wave 1, Wave 4, and Wave 7 Cohorts.

    Wave 4.5 was a special data collection for youth only who were aged 12 to 17 at the time of the Wave 4.5 interview. Wave 4.5 was the fourth annual follow-up wave for those who were members of the Wave 1 Cohort; for those who were sampled at Wave 4, Wave 4.5 was the first annual follow-up wave. Wave 5.5, conducted in 2020, was a special data collection for Wave 4 Cohort youth and young adults ages 13 to 19 at the time of the Wave 5.5 interview. Also in 2020, a subsample of Wave 4 Cohort adults ages 20 and older were interviewed via the PATH Study Adult Telephone Survey (PATH-ATS). Wave 7.5 was a special collection for Wave 4 and Wave 7 Cohort youth and young adults ages 12 to 22 at the time of the Wave 7.5 interview. For those who were sampled at Wave 7, Wave 7.5 was the first annual follow-up wave.
    Dataset 1002 (DS1002) contains the data from the Wave 4.5 Youth and Parent Questionnaire. This file contains 1,617 variables and 13,131 cases. Of these cases, 11,378 are continuing youth who completed a prior Youth Interview. The other 1,753 cases are "aged-up youth" who were previously sampled as "shadow youth." Datasets 1112, 1212, and 1222 (DS1112, DS1212, and DS1222) are data files comprising the weight variables for Wave 4.5. The "all-waves" weight file contains weights for participants in the Wave 1 Cohort who completed a Wave 4.5 Youth Interview and completed interviews (if old enough to do so) or verified their information with the study (if not old enough to be interviewed) in Waves 1, 2, 3, and 4. There are two separate files with "single wave" weights: one for the Wave 1 Cohort and one for the Wave 4 Cohort. The "single-wave" weight file for the Wave 1 Cohort contains weights for youth who c

  3. Path Mixing VLN dataset

    • zenodo.org
    json
    Updated Oct 10, 2023
    Cite
    Anonymous; Anonymous (2023). Path Mixing VLN dataset [Dataset]. http://doi.org/10.5281/zenodo.8422186
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The R2R dataset consists of human-annotated instructions corresponding to the paths in these graphs. Each path consists of a sequence of viewpoints encountered by the agent during navigation. A derived dataset, Fine-Grained R2R (FGR2R) [12], annotated parts of instructions with corresponding graph edges to obtain a fine-grained dataset. Existing works in VLN have shown that more instruction examples can improve an agent's performance in previously unseen environments.

    Hence, to augment training data, we mix parts of paths from the FGR2R dataset to obtain additional instruction-trajectory pairs. The paths are mixed with other neighboring paths that are part of the same house, which sustains both view and instruction consistency. To mix paths, we identify all the edges in the graph corresponding to the start of navigation (ε_start) and the end of navigational episodes (ε_end). These edges are important for mixing as they correspond to micro-instructions ("Walk away from the desk", "Turn right", etc.) that refer to start and stop positions in the house, while other edges correspond to instructions that back-reference previous locations. The remaining transition edges ε_trans are mixed to obtain a path ε_start → ε_trans → ε_end. Not all edges are inter-connectable, as some of the nodes could be spatially close to each other, reducing the visual variety of viewpoints or resulting in the repetition of micro-instructions (short but actionable instructions) in the final instruction. Accordingly, the edges are connected based on the following criteria: (1) the distance between any two nodes should be greater than 3 m and the angle between edges should not be acute, to prevent navigating in loops; (2) the distance between the start and end nodes should be greater than 3 m to ensure that the path ends up in a different room; (3) the start and end nodes cannot have a common edge; (4) micro-instructions from common edges of different paths are chosen randomly. The final instruction is the sequence of micro-instructions and the path is the sequence of edges (Figure 2); a loose sketch of criteria (1)-(3) appears below. Using this method, we generate 162k instruction-trajectory pairs with path lengths between 5 m and 30 m. The final dataset has on average 7.27 views per path, a mean trajectory length of 14.4 m, and an average of 82 words per instruction.
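
    One way the connection criteria could be checked, assuming nodes carry (x, y, z) positions and an adjacency map; the function and variable names are illustrative, not from the dataset:

    ```python
    import math

    def angle_at(p, q, r):
        """Angle at q formed by segments q->p and q->r, in degrees."""
        v1 = tuple(a - b for a, b in zip(p, q))
        v2 = tuple(a - b for a, b in zip(r, q))
        dot = sum(a * b for a, b in zip(v1, v2))
        cos = dot / (math.dist(p, q) * math.dist(r, q))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    def can_connect(path_nodes, pos, adj):
        """Check criteria (1)-(3) for a candidate mixed path given as a node
        sequence; pos maps node -> (x, y, z), adj maps node -> neighbor set."""
        # (1) consecutive nodes must be > 3 m apart, with no acute turn angles
        for a, b in zip(path_nodes, path_nodes[1:]):
            if math.dist(pos[a], pos[b]) <= 3.0:
                return False
        for a, b, c in zip(path_nodes, path_nodes[1:], path_nodes[2:]):
            if angle_at(pos[a], pos[b], pos[c]) < 90.0:
                return False
        # (2) start and end must be > 3 m apart (path ends in a different room)
        if math.dist(pos[path_nodes[0]], pos[path_nodes[-1]]) <= 3.0:
            return False
        # (3) start and end nodes must not share a common edge
        return path_nodes[-1] not in adj[path_nodes[0]]
    ```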

  4. Data underlying the research on path planning of robot unknown environment...

    • data.4tu.nl
    zip
    Updated Sep 29, 2023
    Cite
    Bo Wei Xu; Jun Peng Zhang (2023). Data underlying the research on path planning of robot unknown environment based on improved A * algorithm [Dataset]. http://doi.org/10.4121/285af7be-36da-4b69-a75d-bf822ebc107f.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Bo Wei Xu; Jun Peng Zhang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2023
    Description

    This dataset is a source-code file written in MATLAB. We propose an improved algorithm based on the traditional A* algorithm that expands the search step and search angle: Improv-A*. This algorithm improves both search speed and search efficiency, reducing the total planned distance. To combine static global path planning with dynamic local path planning, we attempt to integrate the Improv-A* algorithm with the artificial potential field method to achieve dynamic path planning for unmanned aerial vehicles.

  5. path-vqa

    • huggingface.co
    Cite
    Flavia Giammarino, path-vqa [Dataset]. https://huggingface.co/datasets/flaviagiammarino/path-vqa
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Flavia Giammarino
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for PathVQA

      Dataset Description
    

    PathVQA is a dataset of question-answer pairs on pathology images. The dataset is intended to be used for training and testing Medical Visual Question Answering (VQA) systems. The dataset includes both open-ended questions and binary "yes/no" questions. The dataset is built from two publicly-available pathology textbooks: "Textbook of Pathology" and "Basic Pathology", and a publicly-available digital library: "Pathology… See the full description on the dataset page: https://huggingface.co/datasets/flaviagiammarino/path-vqa.
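
    A minimal way to load the dataset for experimentation, assuming the Hugging Face `datasets` library; the split and field names here are assumptions based on typical VQA datasets, so check the dataset page:

    ```python
    from datasets import load_dataset

    ds = load_dataset("flaviagiammarino/path-vqa")
    print(ds)                    # lists the available splits and features
    sample = ds["train"][0]      # assumed split name
    print(sample["question"], "->", sample["answer"])  # assumed field names
    ```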

  6. View Path Cross Street Data in The Villages, FL

    • ownerly.com
    Updated Dec 14, 2021
    + more versions
    Cite
    Ownerly (2021). View Path Cross Street Data in The Villages, FL [Dataset]. https://www.ownerly.com/fl/the-villages/view-path-home-details
    Explore at:
    Dataset updated
    Dec 14, 2021
    Dataset authored and provided by
    Ownerly
    Area covered
    Florida, The Villages, View Path
    Description

    This dataset provides information about the number of properties, residents, and average property values for View Path cross streets in The Villages, FL.

  7. Paths and Barriers Identification: Algorithm based on LiDAR data

    • search.dataone.org
    • borealisdata.ca
    • +1more
    Updated Dec 28, 2023
    Cite
    Lyu, Yiwei (2023). Paths and Barriers Identification: Algorithm based on LiDAR data [Dataset]. http://doi.org/10.5683/SP3/NRVT59
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Lyu, Yiwei
    Description

    A wayfinding system is important for people on campus. However, the existing wayfinding system of UBC does not consider some walkable paths that are not shown on the street map. The wayfinding system also ignores barriers on the paths, such as stairs, which can be obstacles for wheelchair users. LiDAR has developed rapidly in recent years; it can collect elevation information for objects on the ground. The University of British Columbia (UBC) collects and publishes a LiDAR dataset of the campus every year. This project uses the elevation and point-intensity information from the LiDAR point dataset to identify walkable paths and the barriers on them. Two algorithms are presented. The first is an intensity-based path identification algorithm, which assumes that concrete paths have a homogeneous intensity. The other is a barrier identification algorithm based on the Canny edge detection algorithm (sketched below). Both algorithms work well in the research area, and they have the potential to be developed into an automatic process and become part of the wayfinding system.
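
    As a rough illustration of the barrier-identification idea (not the authors' implementation), one could rasterize the LiDAR elevations into a grid and run OpenCV's Canny detector over it, so that abrupt elevation changes such as stairs show up as edges. The file name, normalization, and thresholds below are assumptions:

    ```python
    import numpy as np
    import cv2

    # Hypothetical raster of LiDAR elevations (rows x cols of heights in metres)
    elevation = np.load("elevation_grid.npy")

    # Scale to 8-bit, as cv2.Canny expects a uint8 image
    norm = cv2.normalize(elevation, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Strong elevation gradients (e.g., stair risers) become candidate barriers
    edges = cv2.Canny(norm, threshold1=50, threshold2=150)
    cv2.imwrite("barriers.png", edges)
    ```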

  8. Honea Path, SC Age Cohorts Dataset: Children, Working Adults, and Seniors in...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Honea Path, SC Age Cohorts Dataset: Children, Working Adults, and Seniors in Honea Path - Population and Percentage Analysis // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/4b887b3f-f122-11ef-8c1b-3860777c1fe6/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Carolina, Honea Path
    Variables measured
    Population Over 65 Years, Population Under 18 Years, Population Between 18 and 64 Years, Percent of Total Population for Age Groups
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age cohorts. We divided the age cohorts into three buckets: children (under 18 years), working population (18 to 64 years), and senior population (over 65 years). For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Honea Path population by age cohorts (Children: Under 18 years; Working population: 18-64 years; Senior population: 65 years or more). It lists the population in each age cohort group along with its percentage relative to the total population of Honea Path. The dataset can be utilized to understand the population distribution across children, working population and senior population for dependency ratio, housing requirements, ageing, migration patterns etc.

    Key observations

    The largest age group was 18 to 64 years, with a population of 2,230 (59.72% of the total population). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age cohorts:

    • Under 18 years
    • 18 to 64 years
    • 65 years and over

    Variables / Data Columns

    • Age Group: This column displays the age cohort for the Honea Path population analysis. There are 3 expected values (Children, Working Population, and Senior Population).
    • Population: The population of the age cohort in Honea Path.
    • Percent of Total Population: The population of the age cohort as a percent of the total population of Honea Path.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Honea Path Population by Age, which you can refer to for further research.

  9. Dataset about Port Research

    • data.4tu.nl
    xlsx
    Updated Sep 1, 2021
    Cite
    Zihui Yang (2021). Dataset about Port Research [Dataset]. http://doi.org/10.4121/14298851.v3
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Sep 1, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Zihui Yang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    After determining that there is no direct connection between two ports in the network diagram, the direct connection distance between ports is obtained through the port.sol.com.cn, SeaRates.com, and McDistance shipping calculation tools. If the three queried values differ substantially, the average value method is used for optimization, producing the table Port Distance.

    Using the Floyd algorithm, the path between two ports in the port network graph is solved on the basis of the table Port Distance. There may be multiple shortest paths between two ports, but this situation is not considered here; the only result is the result of the Python simulation, producing the table Port Shortest Path. A sketch of this step appears below.

    After obtaining the Port Shortest Path, the value of the shortest path between two ports is calculated, producing the table Port Shortest Path Value.

    According to the shortest paths between ports, the number of routes through each port is counted, and K-Medoids clustering is then used to construct the model of strategic importance of ports, producing the table The number of ports is crossed by the shortest path.

    Following the principle of the Betweenness Centrality model, the betweenness centrality of each port in the whole network is obtained from the table Port Shortest Path, and K-Medoids is then applied, producing the table Port Betweenness Centrality.

    The values and contents of the table The number of ports is crossed by the shortest path and the table Betweenness Centrality Group are combined to produce the table Total Group, to facilitate data search.
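
    A compact sketch of the Floyd-Warshall step described above, assuming the Port Distance table is given as a nested dict of direct distances (all names are illustrative); ties between equally short paths are ignored, as in the description:

    ```python
    INF = float("inf")

    def floyd_warshall(ports, direct):
        """direct[a][b] is the direct distance a->b from the Port Distance table."""
        dist = {a: {b: (0 if a == b else direct.get(a, {}).get(b, INF))
                    for b in ports} for a in ports}
        nxt = {a: {b: (b if a != b and dist[a][b] < INF else None)
                   for b in ports} for a in ports}
        for k in ports:
            for i in ports:
                for j in ports:
                    if dist[i][k] + dist[k][j] < dist[i][j]:
                        dist[i][j] = dist[i][k] + dist[k][j]
                        nxt[i][j] = nxt[i][k]
        return dist, nxt

    def shortest_path(nxt, a, b):
        """Reconstruct one shortest path a->b from the successor table."""
        if nxt[a][b] is None:
            return None
        path = [a]
        while a != b:
            a = nxt[a][b]
            path.append(a)
        return path
    ```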

  10. Dataset of publication dates of book series where books equals Pennsylvania...

    • workwithdata.com
    Updated Nov 25, 2024
    + more versions
    Cite
    Work With Data (2024). Dataset of publication dates of book series where books equals Pennsylvania off the beaten path : discover your fun [Dataset]. https://www.workwithdata.com/datasets/book-series?col=book_series%2Cj0-publication_date&f=1&fcol0=j0-book&fop0=%3D&fval0=Pennsylvania+off+the+beaten+path+%3A+discover+your+fun&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered to the book "Pennsylvania off the beaten path : discover your fun". It features 2 columns, including publication dates.

  11. trial_image_dataset

    • huggingface.co
    Updated Aug 14, 2024
    Cite
    Sri Sakthi Prathosh R (2024). trial_image_dataset [Dataset]. https://huggingface.co/datasets/Power108/trial_image_dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2024
    Authors
    Sri Sakthi Prathosh R
    Description

    My Dataset

    This dataset includes research articles with metadata and images.

      Features
    

    The dataset contains the following features:

    • pmid: The PubMed ID of the article (string).
    • pmcid: The PubMed Central ID of the article (string).
    • title: The title of the article (string).
    • abstract: The abstract of the article (string).
    • fulltext: The full text of the article (string).
    • images: Contains image data with the following fields: bytes: binary image data; path: relative path to… See the full description on the dataset page: https://huggingface.co/datasets/Power108/trial_image_dataset.

  12. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files; a loading sketch follows the column lists below.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
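
    A sketch of how these tables could be combined with pandas, assuming the CSV files have been extracted from metadata.zip into the working directory:

    ```python
    import pandas as pd

    repos = pd.read_csv("repositories.csv")
    programs = pd.read_csv("programs.csv")

    # Join each IaC program to its repository record; both tables have an
    # "ID" column, so suffixes keep them apart.
    merged = programs.merge(repos, left_on="repository", right_on="ID",
                            suffixes=("_program", "_repo"))

    # Example: language breakdown per PL-IaC solution in redistributable repos
    # (assumes the "redistributable" column parses as boolean)
    redistributable = merged[merged["redistributable"]]
    print(redistributable.groupby("solution")["language"].value_counts())
    ```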

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
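
    Putting the documented argument order together, a hypothetical invocation that searches for Pulumi projects might look as follows (the token and output file name are placeholders; AWS CDK and CDKTF searches are analogous with the values above):

    ```sh
    python search-repositories.py "$GITHUB_TOKEN" pulumi-repos.csv Pulumi yml,yaml 0 '*'
    ```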

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
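
    A hypothetical invocation following the documented argument order (the CSV, directory, and output names are placeholders):

    ```sh
    python download-repositories.py pulumi-repos.csv,cdk-repos.csv repos/ downloads.csv
    ```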

  13. Ground Truth for Entity Relatedness Problem over DBpedia datasets

    • figshare.com
    zip
    Updated Aug 17, 2021
    Cite
    Javier Guillot Jiménez (2021). Ground Truth for Entity Relatedness Problem over DBpedia datasets [Dataset]. http://doi.org/10.6084/m9.figshare.15181086.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 17, 2021
    Dataset provided by
    figshare
    Authors
    Javier Guillot Jiménez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. More precisely, this problem can be defined as: "Given an RDF graph G and a pair of entities a and b, represented in G, compute the paths in G from a to b that best describe the connectivity between them."

    This dataset supports the evaluation of approaches that address the entity relatedness problem and contains a total of 240 ranked lists with 50 relationship paths each between entity pairs in two familiar domains, music and movies, in two subsets of DBpedia that we call DBpedia21M and DBpedia45M. Specifically, we extracted data from the following two publicly available subsets of the English DBpedia corpus to form our two knowledge bases:

    1. mappingbased-objects: https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2021.03.01/mappingbased-objects_lang=en.ttl.bz2
    2. infobox-properties: https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2021.03.01/infobox-properties_lang=en.ttl.bz2

    DBpedia21M contains the statements in the mappingbased-objects dataset, and DBpedia45M contains the union of the statements in mappingbased-objects and infobox-properties. In both cases, we exclude statements involving literals or blank nodes.

    For each dataset (DBpedia21M and DBpedia45M), the ground truth contains 120 ranked lists with 50 relationship paths each. Each list corresponds to the most relevant paths between one of the 20 entity pairs, 10 from the music domain and 10 from the movie domain, found using different path search strategies.

    A path search strategy consists of an entity similarity measure and a path ranking measure. The ground truth was created using the following 6 strategies:

    1. Jaccard Index & Predicate Frequency Inverse Triple Frequency (PF-ITF)
    2. Jaccard Index & Exclusivity-based Relatedness (EBR)
    3. Jaccard Index & Pointwise Mutual Information (PMI)
    4. Wikipedia Link-based Measure (WLM) & PF-ITF
    5. WLM & EBR
    6. WLM & PMI

    The filename of a file that contains the ranked list of 50 relationship paths between a pair of entities has the following format:

    [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt

    Example 1: DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt
    Example 2: DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt

    The file in Example 1 contains the top-50 most relevant paths between Michael Jackson and Whitney Houston in DBpedia21M using search strategy number 2 (Jaccard Index & EBR). The file in Example 2 contains the top-50 most relevant paths between Paul Newman and Joanne Woodward in DBpedia45M using search strategy number 4 (WLM & PF-ITF).

    The data is split into two files, one for each dataset, compressed in .zip format:

    • DBpedia21M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia21M dataset.
    • DBpedia45M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia45M dataset.
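
    A small helper to unpack the documented filename format (the regex and names are illustrative; it assumes the two entity names are separated by the first hyphen):

    ```python
    import re

    PATTERN = re.compile(
        r"^(?P<dataset>[^.]+)\.(?P<pair_id>\d+)\.(?P<strategy>\d+)\.(?P<pair>.+)\.txt$"
    )

    def parse_name(filename):
        m = PATTERN.match(filename)
        entity1, entity2 = m.group("pair").split("-", 1)
        return {
            "dataset": m.group("dataset"),         # DBpedia21M or DBpedia45M
            "pair_id": int(m.group("pair_id")),
            "strategy": int(m.group("strategy")),  # 1-6, as listed above
            "entities": (entity1, entity2),
        }

    print(parse_name("DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt"))
    ```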

  14. ‘Path to a SAEIV line’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Path to a SAEIV line’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-europa-eu-path-to-a-saeiv-line-fc4f/e9e4f166/?iid=006-116&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Path to a SAEIV line’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/61a025be66bcd934d64ed79e on 17 January 2022.

    --- Dataset description provided by original source is as follows ---

    This non-graphic dataset from the Operating Assistance and Traveller Information System (SAEIV) represents the variants of line paths used by vehicles in the TBM network. A line path is an orderly sequence of consecutive sections, with a direction (outbound or return).

    This dataset can be linked to: Course of a vehicle on a path, SAEIV commercial line, Elementary Pathway Round, Deviation Round, Physical stop on the network, Bus schedules for the next 14 days, and Vehicle in service on the network.

    This dataset is refreshed every hour. Be careful: for performance reasons, this dataset (Table, Map, Analysis, and Export tabs) can be updated less frequently than the source, and a deviation may exist. We also invite you to use our web services (see the Webservices BM tab) to retrieve the freshest data.

    --- Original source retains full ownership of the source dataset ---

  15. No one true path: uncovering the interplay between geography, institutions,...

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    txt, xls
    Updated Dec 7, 2022
    Cite
    Chih Ming Tan; Chih Ming Tan (2022). No one true path: uncovering the interplay between geography, institutions, and fractionalization in economic development (replication data) [Dataset]. http://doi.org/10.15456/jae.2022320.0720867021
    Explore at:
    Available download formats: xls (148480), txt (4148), txt (25343)
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Chih Ming Tan; Chih Ming Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Do institutions rule when explaining cross-country divergence? By employing regression tree analysis to uncover the existence and nature of multiple development clubs and growth regimes, this paper finds that to a large extent they do. However, the role of ethnic fractionalization cannot be dismissed. The findings suggest that sufficiently high-quality institutions may be necessary for the negative impact on development from high levels of ethnic fractionalization to be mitigated. Interestingly, I find no role for geographic factors, neither those associated with climate nor those associated with physical isolation, in explaining divergence. There is also no evidence to suggest a role for religious fractionalization.

  16. climateset

    • huggingface.co
    Updated Mar 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ClimateSet (2024). climateset [Dataset]. https://huggingface.co/datasets/climateset/climateset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 6, 2024
    Authors
    ClimateSet
    License

    https://choosealicense.com/licenses/other/

    Description

    Terms of Use

    By using the dataset, you agree to comply with the dataset license (CC-by-4.0-Deed).

      Download Instructions
    

    To download one file, please use:

    ```python
    from huggingface_hub import hf_hub_download

    # Path of the directory where the data will be downloaded on your local machine
    local_directory = 'LOCAL_DIRECTORY'

    # Relative path of the file in the repository
    filepath = 'FILE_PATH'

    repo_id = "climateset/climateset"
    repo_type = "dataset"
    hf_hub_download(repo_id=repo_id…
    ```

    See the full description on the dataset page: https://huggingface.co/datasets/climateset/climateset.
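
    For completeness, a plausible completion of the truncated call above, assuming the standard hf_hub_download keyword arguments (filename, repo_type, local_dir); consult the dataset page for the authoritative snippet:

    ```python
    from huggingface_hub import hf_hub_download

    local_directory = 'LOCAL_DIRECTORY'  # where the file should be stored locally
    filepath = 'FILE_PATH'               # relative path of the file in the repository

    repo_id = "climateset/climateset"
    repo_type = "dataset"
    # Assumed completion of the truncated call above
    hf_hub_download(repo_id=repo_id, repo_type=repo_type,
                    filename=filepath, local_dir=local_directory)
    ```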

  17. Dataset about Port Research

    • data.4tu.nl
    zip
    Cite
    Zihui Yang, Dataset about Port Research [Dataset]. http://doi.org/10.4121/14298851.v2
    Explore at:
    Available download formats: zip
    Dataset provided by
    4TU.ResearchData
    Authors
    Zihui Yang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    After determining that there is no direct connection between ports in the network diagram, the direct connection distance between ports is obtained through the port.sol.com.cn, SeaRates.com, and McDistance shipping calculation tools. If the three queried values differ substantially, the average value method is used for optimization, producing the table Port Distance.
    Using the Floyd algorithm, the path between two ports in the port network graph is solved on the basis of the table Port Distance. There may be multiple shortest paths between two ports, but this situation is not considered here; the only result is the result of the Python simulation, producing the table Port Shortest Path.
    After obtaining the Port Shortest Path, the value of the shortest path between two ports is calculated, producing the table Port Shortest Path Value.
    According to the shortest paths between ports, the number of routes through each port is counted, and K-Medoids is then used to construct the model of strategic importance of ports, producing the table Number of ports are crossed by the shortest path.
    Following the principle of the Betweenness Centrality model, the betweenness centrality of each port in the whole network is obtained from the table Port Shortest Path, and K-Medoids is then applied to obtain the table Port Betweenness Centrality.
    The values and contents of the table Number of ports are crossed by the shortest path and the Betweenness Centrality Group table are combined to produce the Total Group table, to facilitate data search.

  18. Landsat 4-9 Tiling Grid Path/Row World Reference System (WRS-2) (USGS)

    • researchdata.edu.au
    Updated Nov 9, 2022
    Cite
    Landsat Missions (2022). Landsat 4-9 Tiling Grid Path/Row World Reference System (WRS-2) (USGS) [Dataset]. https://researchdata.edu.au/landsat-4-9-2-usgs/2973832
    Explore at:
    Dataset updated
    Nov 9, 2022
    Dataset provided by
    Australian Institute of Marine Science (AIMS)
    Australian Ocean Data Network
    Authors
    Landsat Missions
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs

    Time period covered
    Jul 16, 1982 - Jan 1, 2036
    Area covered
    Description

    This dataset shows the tiling grid and the Path and Row IDs for Landsat 4 - 9 satellite imagery. The IDs are useful for selecting imagery of an area of interest. Landsat 4 - 9 are a series of Earth observation satellites, part of the US Landsat program, which has monitored Earth's land surfaces since 1982.

    The Worldwide Reference System (WRS) is a global notation system used for cataloging and indexing Landsat imagery. It employs a grid-based system consisting of path and row numbers, where the path indicates the longitude and the row indicates the latitude, allowing users to easily locate and identify specific scenes covering a particular area on Earth.

    Landsat satellites 4, 5, 7, 8, and 9 follow WRS-2, which this dataset describes.

    This dataset corresponds to the descending Path/Row identifiers, as these correspond to daytime scenes.

    eAtlas Notes: It should be noted that the extent boundaries of the scene polygons in this dataset are only indicative of the imagery extent. The individual Landsat 5 images move around by about 10 km, and the shapes of the Landsat 8 and 9 images do not match the shape of the WRS-2 polygons: the top and bottom edges are at a different angle from the imagery, which is more square in shape. The left and right edges of the polygons are also smaller than the imagery. As a result, this dataset is probably not suitable as a clipping mask for the imagery from these satellites.

    This dataset is suitable for determining the approximate extent of the imagery and the associated Row and Path IDs for a given scene.

    Why is this dataset in the eAtlas?: Landsat imagery is very useful for studying and mapping reef systems. Selecting imagery for study often requires knowing the Path and Row numbers for the area of interest. This dataset is intended as a reference layer, and this metadata record is included so the associated mapping layer can link to it. The eAtlas is not the custodian of this dataset, and copies of the data should be obtained from the original sources. The eAtlas does, however, keep a cached version of the dataset from the time this record was set up, to make available should the original dataset no longer be available.

    eAtlas Processing: The original data was sourced from USGS (See links). No modifications to the underlying data were performed.

    Location of the data: This dataset is filed in the eAtlas enduring data repository at: data\non-custodian\2020-2024\World_USGS_Landsat-WRS-2

  19. Inter-Domain Path Computation under Node-defined Domain Uniqueness...

    • data.mendeley.com
    • narcis.nl
    Updated Feb 16, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Thanh Pham Dinh (2022). Inter-Domain Path Computation under Node-defined Domain Uniqueness Constraint [Dataset]. http://doi.org/10.17632/tpg2nbcsc5.2
    Explore at:
    Dataset updated
    Feb 16, 2022
    Authors
    Thanh Pham Dinh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • The test data for Inter-Domain Path Computation under the Node-defined Domain Uniqueness constraint (IDPC-NDU).

    • Since no public dataset is available for the IDPC-NDU problem, two distinct types of instances were created based on the dataset of IDPC-EDU, which is also a shortest-path problem. We first generated three parameters for each instance: the number of nodes, the number of domains, and the number of edges. After that, an optimal path p is generated, where the weight of every edge on p is 1 and the number of domains on p is approximately the input graph's domain number. Next, noise is added to the instance: for every node in p, besides random-weight edges, several random weight-one edges from that node to nodes not in p are added, along with some random edges whose weights are greater than the total cost of p. These traps make it harder for simple greedy algorithms to find the optimal solution. In Type 2 instances especially, feasible paths whose length is less than three are removed. The datasets are categorized into two kinds regarding dimensionality: small instances, each of which has between 50 and 2,000 vertices, and large instances, each of which has over 2,000 vertices.

    • Filename idpc_

    The first line of a file contains two integers N and D, the number of nodes and the number of domains, respectively. The second line contains two integers s and t, the source node and the terminal node. Each subsequent line contains four integers u, v, w, d, representing an edge (u, v) with weight w belonging to domain d. A minimal reader for this format is sketched below.
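
    A minimal reader for the instance format described above, assuming whitespace-separated integers in a plain-text file (function and variable names are illustrative):

    ```python
    def read_idpc(path):
        with open(path) as f:
            n_nodes, n_domains = map(int, f.readline().split())
            source, terminal = map(int, f.readline().split())
            edges = []
            for line in f:
                if line.strip():
                    u, v, w, d = map(int, line.split())
                    edges.append((u, v, w, d))  # edge (u, v), weight w, domain d
        return n_nodes, n_domains, source, terminal, edges
    ```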

  20. Inspire data set BPL “See path, construction lines”

    • data.europa.eu
    wfs, wms
    + more versions
    Cite
    (2023). Inspire data set BPL “See path, construction lines” [Dataset]. https://data.europa.eu/88u/dataset/b80aa8e5-0113-45a0-9d3a-76906e858ea4
    Explore at:
    Available download formats: wms, wfs
    Description

    Development plan “See Path, Building Lines” of the city of Sachsenheim, transformed according to INSPIRE, based on an XPlanung dataset in version 5.0.
