100+ datasets found
  1. SCodeSearcher

    • figshare.com
    zip
    Updated Mar 11, 2024
    Cite
    Jia Li (2024). SCodeSearcher [Dataset]. http://doi.org/10.6084/m9.figshare.25359841.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ## File path configuration

    Before you start training the model, make sure that all file paths are correctly set for your local environment. This includes the training data, the location where the model is saved, and any associated configuration files.

    - **Training data paths**: Check the paths of the training data to make sure they point to the correct location.
    - **Model save path**: The `checkpoint` directory is used to save model weights during training. Make sure this path points to the local directory where you want to save the model weights.

    **Important note**: Before running the run.sh script, open the script and any related Python files, then check and update the path settings.

    ## Soft contrastive learning

    To run soft contrastive learning, navigate to the corresponding directory and run the following commands:

    ```sh
    cd $Project_Path
    bash run.sh
    ```

    ## Parameter setting description

    - To adjust the weight range of positive samples, modify the softmax operation for `ai` on line 158 of `utils.py`.
    - To adjust the weight range of negative samples, adjust `bi` on line 163 of `utils.py`.

    ## Code Search

    The dataset file contains the code retrieval datasets and the code classification datasets.

    ```sh
    python run.py \
        --output_dir=./python \
        --config_name=/graphcodebert-base \
        --model_name_or_path=/graphcodebert-base \
        --tokenizer_name=/graphcodebert-base \
        --lang=python \
        --do_train \
        --train_data_file=/dataset/CSN-Python/train.jsonl \
        --eval_data_file=/dataset/CSN-Python/test.jsonl \
        --test_data_file=/dataset/CSN-Python/test.jsonl \
        --codebase_file=/dataset/CSN-Python/codebase.jsonl \
        --num_train_epochs 20 \
        --code_length 318 \
        --data_flow_length 64 \
        --nl_length 256 \
        --train_batch_size 32 \
        --eval_batch_size 64 \
        --learning_rate 2e-5 \
        --seed 42
    ```

  2. Population Assessment of Tobacco and Health (PATH) Study [United States]...

    • icpsr.umich.edu
    Updated Jun 27, 2025
    + more versions
    Cite
    Inter-university Consortium for Political and Social Research [distributor] (2025). Population Assessment of Tobacco and Health (PATH) Study [United States] Special Collection Restricted-Use Files [Dataset]. http://doi.org/10.3886/ICPSR37519.v13
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/37519/terms

    Area covered
    United States
    Description

    The PATH Study was launched in 2011 to inform the Food and Drug Administration's regulatory activities under the Family Smoking Prevention and Tobacco Control Act (TCA). The PATH Study is a collaboration between the National Institute on Drug Abuse (NIDA), National Institutes of Health (NIH), and the Center for Tobacco Products (CTP), Food and Drug Administration (FDA). The study sampled over 150,000 mailing addresses across the United States to create a national sample of people who use or do not use tobacco. 45,971 adults and youth constitute the first (baseline) wave, Wave 1, of data collected by this longitudinal cohort study. These 45,971 adults and youth, plus the 7,207 "shadow youth" (children ages 9 to 11 sampled at Wave 1), make up the 53,178 participants that constitute the Wave 1 Cohort. Respondents are asked to complete an interview at each follow-up wave. Youth who turn 18 by the current wave of data collection are considered "aged-up adults" and are invited to complete the Adult Interview. Additionally, "shadow youth" are considered "aged-up youth" upon turning 12 years old, when they are asked to complete an interview after parental consent.

    At Wave 4, a probability sample of 14,098 adults, youth, and shadow youth ages 10 to 11 was selected from the civilian, noninstitutionalized population at the time of Wave 4. This sample was recruited from residential addresses not selected for Wave 1 in the same sampled primary sampling units (PSUs) and segments using similar within-household sampling procedures. This "replenishment sample" was combined for estimation and analysis purposes with Wave 4 adult and youth respondents from the Wave 1 Cohort who were in the civilian, noninstitutionalized population at the time of Wave 4. This combined set of Wave 4 participants, 52,731 participants in total, forms the Wave 4 Cohort.

    At Wave 7, a probability sample of 14,863 adults, youth, and shadow youth ages 9 to 11 was selected from the civilian, noninstitutionalized population at the time of Wave 7. This sample was recruited from residential addresses not selected for Wave 1 or Wave 4 in the same sampled PSUs and segments using similar within-household sampling procedures. This "second replenishment sample" was combined for estimation and analysis purposes with the Wave 7 adult and youth respondents from the earlier cohorts who were at least age 15 and in the civilian, noninstitutionalized population at the time of Wave 7. This combined set of Wave 7 participants, 46,169 participants in total, forms the Wave 7 Cohort. Please refer to the Restricted-Use Files User Guide, which provides further details about children designated as "shadow youth" and the formation of the Wave 1, Wave 4, and Wave 7 Cohorts.

    Wave 4.5 was a special data collection for youth only who were aged 12 to 17 at the time of the Wave 4.5 interview. Wave 4.5 was the fourth annual follow-up wave for those who were members of the Wave 1 Cohort; for those who were sampled at Wave 4, Wave 4.5 was the first annual follow-up wave. Wave 5.5, conducted in 2020, was a special data collection for Wave 4 Cohort youth and young adults ages 13 to 19 at the time of the Wave 5.5 interview. Also in 2020, a subsample of Wave 4 Cohort adults ages 20 and older were interviewed via the PATH Study Adult Telephone Survey (PATH-ATS). Wave 7.5 was a special collection for Wave 4 and Wave 7 Cohort youth and young adults ages 12 to 22 at the time of the Wave 7.5 interview. For those who were sampled at Wave 7, Wave 7.5 was the first annual follow-up wave.
    Dataset 1002 (DS1002) contains the data from the Wave 4.5 Youth and Parent Questionnaire. This file contains 1,617 variables and 13,131 cases. Of these cases, 11,378 are continuing youth who completed a prior Youth Interview. The other 1,753 cases are "aged-up youth" who were previously sampled as "shadow youth." Datasets 1112, 1212, and 1222 (DS1112, DS1212, and DS1222) are data files comprising the weight variables for Wave 4.5. The "all-waves" weight file contains weights for participants in the Wave 1 Cohort who completed a Wave 4.5 Youth Interview and completed interviews (if old enough to do so) or verified their information with the study (if not old enough to be interviewed) in Waves 1, 2, 3, and 4. There are two separate files with "single wave" weights: one for the Wave 1 Cohort and one for the Wave 4 Cohort. The "single-wave" weight file for the Wave 1 Cohort contains weights for youth who c

  3. Path Mixing VLN dataset

    • zenodo.org
    json
    Updated Oct 10, 2023
    Cite
    Anonymous; Anonymous (2023). Path Mixing VLN dataset [Dataset]. http://doi.org/10.5281/zenodo.8422186
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The R2R dataset consists of human-annotated instructions corresponding to the paths in these graphs. Each path consists of a sequence of viewpoints encountered by the agent during navigation. A derived dataset, Fine-Grained R2R (FGR2R) [12], annotated parts of instructions with corresponding graph edges to obtain a fine-grained dataset. Existing works in VLN have shown that more instruction examples can improve an agent's performance in previously unseen environments.

    Hence, to augment training data, we mix parts of paths from the FGR2R dataset to obtain additional instruction-trajectory pairs. The paths are mixed with other neighboring paths that are part of the same house, which sustains both view and instruction consistency. To mix paths, we identify all the edges in the graph corresponding to the start of navigation (ε_start) and the end of navigational episodes (ε_end). These edges are important for mixing as they correspond to micro-instructions ("Walk away from the desk", "Turn right", etc.) that refer to start and stop positions in the house, while other edges correspond to instructions that back-reference previous locations. The remaining transition edges ε_trans are mixed to obtain a path ε_start → ε_trans → ε_end. Not all edges are inter-connectable, as some of the nodes could be spatially close to each other, reducing the visual variety of viewpoints or resulting in the repetition of micro-instructions (short but actionable instructions) in the final instruction. Accordingly, the edges are connected based on the following criteria: (1) the distance between any two nodes should be greater than 3 m and the angle between edges should not be acute, to prevent navigating in loops; (2) the distance between the start and end nodes should be greater than 3 m to ensure that the path ends up in a different room; (3) the start and end nodes cannot have a common edge; (4) micro-instructions from common edges of different paths are chosen randomly. The final instruction is the sequence of micro-instructions and the path is the sequence of edges (Figure 2); a loose sketch of criteria (1)-(3) appears below. Using this method, we generate 162k instruction-trajectory pairs with path lengths between 5 m and 30 m. The final dataset has on average 7.27 views per path, a mean trajectory length of 14.4 m, and an average of 82 words per instruction.
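
    One way the connection criteria could be checked, assuming nodes carry (x, y, z) positions and an adjacency map; the function and variable names are illustrative, not from the dataset:

    ```python
    import math

    def angle_at(p, q, r):
        """Angle at q formed by segments q->p and q->r, in degrees."""
        v1 = tuple(a - b for a, b in zip(p, q))
        v2 = tuple(a - b for a, b in zip(r, q))
        dot = sum(a * b for a, b in zip(v1, v2))
        cos = dot / (math.dist(p, q) * math.dist(r, q))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    def can_connect(path_nodes, pos, adj):
        """Check criteria (1)-(3) for a candidate mixed path given as a node
        sequence; pos maps node -> (x, y, z), adj maps node -> neighbor set."""
        # (1) consecutive nodes must be > 3 m apart, with no acute turn angles
        for a, b in zip(path_nodes, path_nodes[1:]):
            if math.dist(pos[a], pos[b]) <= 3.0:
                return False
        for a, b, c in zip(path_nodes, path_nodes[1:], path_nodes[2:]):
            if angle_at(pos[a], pos[b], pos[c]) < 90.0:
                return False
        # (2) start and end must be > 3 m apart (path ends in a different room)
        if math.dist(pos[path_nodes[0]], pos[path_nodes[-1]]) <= 3.0:
            return False
        # (3) start and end nodes must not share a common edge
        return path_nodes[-1] not in adj[path_nodes[0]]
    ```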

  4. Data underlying the research on path planning of robot unknown environment...

    • data.4tu.nl
    zip
    Updated Sep 29, 2023
    Cite
    Bo Wei Xu; Jun Peng Zhang (2023). Data underlying the research on path planning of robot unknown environment based on improved A * algorithm [Dataset]. http://doi.org/10.4121/285af7be-36da-4b69-a75d-bf822ebc107f.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Bo Wei Xu; Jun Peng Zhang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2023
    Description

    This dataset is a source-code file written in MATLAB. We propose an improved algorithm based on the traditional A* algorithm that expands the search step and search angle: Improv-A*. This algorithm improves both search speed and search efficiency, reducing the total planned distance. To combine static global path planning with dynamic local path planning, we attempt to integrate the Improv-A* algorithm with the artificial potential field method to achieve dynamic path planning for unmanned aerial vehicles.

  5. path-vqa

    • huggingface.co
    Cite
    Flavia Giammarino, path-vqa [Dataset]. https://huggingface.co/datasets/flaviagiammarino/path-vqa
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Flavia Giammarino
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for PathVQA

      Dataset Description
    

    PathVQA is a dataset of question-answer pairs on pathology images. The dataset is intended to be used for training and testing Medical Visual Question Answering (VQA) systems. The dataset includes both open-ended questions and binary "yes/no" questions. The dataset is built from two publicly-available pathology textbooks: "Textbook of Pathology" and "Basic Pathology", and a publicly-available digital library: "Pathology… See the full description on the dataset page: https://huggingface.co/datasets/flaviagiammarino/path-vqa.
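
    A minimal way to load the dataset for experimentation, assuming the Hugging Face `datasets` library; the split and field names here are assumptions based on typical VQA datasets, so check the dataset page:

    ```python
    from datasets import load_dataset

    ds = load_dataset("flaviagiammarino/path-vqa")
    print(ds)                    # lists the available splits and features
    sample = ds["train"][0]      # assumed split name
    print(sample["question"], "->", sample["answer"])  # assumed field names
    ```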

  6. View Path Cross Street Data in The Villages, FL

    • ownerly.com
    Updated Dec 14, 2021
    + more versions
    Cite
    Ownerly (2021). View Path Cross Street Data in The Villages, FL [Dataset]. https://www.ownerly.com/fl/the-villages/view-path-home-details
    Explore at:
    Dataset updated
    Dec 14, 2021
    Dataset authored and provided by
    Ownerly
    Area covered
    Florida, The Villages, View Path
    Description

    This dataset provides information about the number of properties, residents, and average property values for View Path cross streets in The Villages, FL.

  7. Paths and Barriers Identification: Algorithm based on LiDAR data

    • search.dataone.org
    • borealisdata.ca
    • +1more
    Updated Dec 28, 2023
    Cite
    Lyu, Yiwei (2023). Paths and Barriers Identification: Algorithm based on LiDAR data [Dataset]. http://doi.org/10.5683/SP3/NRVT59
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Lyu, Yiwei
    Description

    A wayfinding system is important for people on campus. However, the existing wayfinding system of UBC does not consider some walkable paths that are not shown on the street map. The wayfinding system also ignores barriers on the paths, such as stairs, which can be obstacles for wheelchair users. LiDAR has developed rapidly in recent years; it can collect elevation information for objects on the ground. The University of British Columbia (UBC) collects and publishes a LiDAR dataset of the campus every year. This project uses the elevation and point-intensity information from the LiDAR point dataset to identify walkable paths and the barriers on them. Two algorithms are presented. The first is an intensity-based path identification algorithm, which assumes that concrete paths have a homogeneous intensity. The other is a barrier identification algorithm based on the Canny edge detection algorithm (sketched below). Both algorithms work well in the research area, and they have the potential to be developed into an automatic process and become part of the wayfinding system.
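
    As a rough illustration of the barrier-identification idea (not the authors' implementation), one could rasterize the LiDAR elevations into a grid and run OpenCV's Canny detector over it, so that abrupt elevation changes such as stairs show up as edges. The file name, normalization, and thresholds below are assumptions:

    ```python
    import numpy as np
    import cv2

    # Hypothetical raster of LiDAR elevations (rows x cols of heights in metres)
    elevation = np.load("elevation_grid.npy")

    # Scale to 8-bit, as cv2.Canny expects a uint8 image
    norm = cv2.normalize(elevation, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Strong elevation gradients (e.g., stair risers) become candidate barriers
    edges = cv2.Canny(norm, threshold1=50, threshold2=150)
    cv2.imwrite("barriers.png", edges)
    ```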

  8. Honea Path, SC Age Cohorts Dataset: Children, Working Adults, and Seniors in...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Honea Path, SC Age Cohorts Dataset: Children, Working Adults, and Seniors in Honea Path - Population and Percentage Analysis // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/4b887b3f-f122-11ef-8c1b-3860777c1fe6/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Carolina, Honea Path
    Variables measured
    Population Over 65 Years, Population Under 18 Years, Population Between 18 and 64 Years, Percent of Total Population for Age Groups
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age cohorts. We divided the age cohorts into three buckets: children (under 18 years), working population (18 to 64 years), and senior population (over 65 years). For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Honea Path population by age cohorts (Children: Under 18 years; Working population: 18-64 years; Senior population: 65 years or more). It lists the population in each age cohort group along with its percentage relative to the total population of Honea Path. The dataset can be utilized to understand the population distribution across children, working population and senior population for dependency ratio, housing requirements, ageing, migration patterns etc.

    Key observations

    The largest age group was 18 to 64 years, with a population of 2,230 (59.72% of the total population). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age cohorts:

    • Under 18 years
    • 18 to 64 years
    • 65 years and over

    Variables / Data Columns

    • Age Group: This column displays the age cohort for the Honea Path population analysis. There are 3 expected values (Children, Working Population, and Senior Population).
    • Population: The population of the age cohort in Honea Path.
    • Percent of Total Population: The population of the age cohort as a percent of the total population of Honea Path.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Honea Path Population by Age, which you can refer to for further research.

  9. Dataset about Port Research

    • data.4tu.nl
    xlsx
    Updated Sep 1, 2021
    Cite
    Zihui Yang (2021). Dataset about Port Research [Dataset]. http://doi.org/10.4121/14298851.v3
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Sep 1, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Zihui Yang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    After determining that there is no direct connection between two ports in the network diagram, the direct connection distance between ports is obtained through the port.sol.com.cn, SeaRates.com, and McDistance shipping calculation tools. If the three queried values differ substantially, the average value method is used for optimization, producing the table Port Distance.

    Using the Floyd algorithm, the path between two ports in the port network graph is solved on the basis of the table Port Distance. There may be multiple shortest paths between two ports, but this situation is not considered here; the only result is the result of the Python simulation, producing the table Port Shortest Path. A sketch of this step appears below.

    After obtaining the Port Shortest Path, the value of the shortest path between two ports is calculated, producing the table Port Shortest Path Value.

    According to the shortest paths between ports, the number of routes through each port is counted, and K-Medoids clustering is then used to construct the model of strategic importance of ports, producing the table The number of ports is crossed by the shortest path.

    Following the principle of the Betweenness Centrality model, the betweenness centrality of each port in the whole network is obtained from the table Port Shortest Path, and K-Medoids is then applied, producing the table Port Betweenness Centrality.

    The values and contents of the table The number of ports is crossed by the shortest path and the table Betweenness Centrality Group are combined to produce the table Total Group, to facilitate data search.
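
    A compact sketch of the Floyd-Warshall step described above, assuming the Port Distance table is given as a nested dict of direct distances (all names are illustrative); ties between equally short paths are ignored, as in the description:

    ```python
    INF = float("inf")

    def floyd_warshall(ports, direct):
        """direct[a][b] is the direct distance a->b from the Port Distance table."""
        dist = {a: {b: (0 if a == b else direct.get(a, {}).get(b, INF))
                    for b in ports} for a in ports}
        nxt = {a: {b: (b if a != b and dist[a][b] < INF else None)
                   for b in ports} for a in ports}
        for k in ports:
            for i in ports:
                for j in ports:
                    if dist[i][k] + dist[k][j] < dist[i][j]:
                        dist[i][j] = dist[i][k] + dist[k][j]
                        nxt[i][j] = nxt[i][k]
        return dist, nxt

    def shortest_path(nxt, a, b):
        """Reconstruct one shortest path a->b from the successor table."""
        if nxt[a][b] is None:
            return None
        path = [a]
        while a != b:
            a = nxt[a][b]
            path.append(a)
        return path
    ```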

  10. Dataset of publication dates of book series where books equals Pennsylvania...

    • workwithdata.com
    Updated Nov 25, 2024
    + more versions
    Cite
    Work With Data (2024). Dataset of publication dates of book series where books equals Pennsylvania off the beaten path : discover your fun [Dataset]. https://www.workwithdata.com/datasets/book-series?col=book_series%2Cj0-publication_date&f=1&fcol0=j0-book&fop0=%3D&fval0=Pennsylvania+off+the+beaten+path+%3A+discover+your+fun&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered to the book "Pennsylvania off the beaten path : discover your fun". It features 2 columns, including publication dates.

  11. trial_image_dataset

    • huggingface.co
    Updated Aug 14, 2024
    Cite
    Sri Sakthi Prathosh R (2024). trial_image_dataset [Dataset]. https://huggingface.co/datasets/Power108/trial_image_dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2024
    Authors
    Sri Sakthi Prathosh R
    Description

    My Dataset

    This dataset includes research articles with metadata and images.

      Features
    

    The dataset contains the following features:

    • pmid: The PubMed ID of the article (string).
    • pmcid: The PubMed Central ID of the article (string).
    • title: The title of the article (string).
    • abstract: The abstract of the article (string).
    • fulltext: The full text of the article (string).
    • images: Contains image data with the following fields: bytes: binary image data; path: relative path to… See the full description on the dataset page: https://huggingface.co/datasets/Power108/trial_image_dataset.

  12. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files; a loading sketch follows the column lists below.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
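
    A sketch of how these tables could be combined with pandas, assuming the CSV files have been extracted from metadata.zip into the working directory:

    ```python
    import pandas as pd

    repos = pd.read_csv("repositories.csv")
    programs = pd.read_csv("programs.csv")

    # Join each IaC program to its repository record; both tables have an
    # "ID" column, so suffixes keep them apart.
    merged = programs.merge(repos, left_on="repository", right_on="ID",
                            suffixes=("_program", "_repo"))

    # Example: language breakdown per PL-IaC solution in redistributable repos
    # (assumes the "redistributable" column parses as boolean)
    redistributable = merged[merged["redistributable"]]
    print(redistributable.groupby("solution")["language"].value_counts())
    ```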

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
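
    Putting the documented argument order together, a hypothetical invocation that searches for Pulumi projects might look as follows (the token and output file name are placeholders; AWS CDK and CDKTF searches are analogous with the values above):

    ```sh
    python search-repositories.py "$GITHUB_TOKEN" pulumi-repos.csv Pulumi yml,yaml 0 '*'
    ```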

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
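
    A hypothetical invocation following the documented argument order (the CSV, directory, and output names are placeholders):

    ```sh
    python download-repositories.py pulumi-repos.csv,cdk-repos.csv repos/ downloads.csv
    ```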

  13. Ground Truth for Entity Relatedness Problem over DBpedia datasets

    • figshare.com
    zip
    Updated Aug 17, 2021
    Cite
    Javier Guillot Jiménez (2021). Ground Truth for Entity Relatedness Problem over DBpedia datasets [Dataset]. http://doi.org/10.6084/m9.figshare.15181086.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 17, 2021
    Dataset provided by
    figshare
    Authors
    Javier Guillot Jiménez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. More precisely, this problem can be defined as: "Given an RDF graph G and a pair of entities a and b, represented in G, compute the paths in G from a to b that best describe the connectivity between them."

    This dataset supports the evaluation of approaches that address the entity relatedness problem and contains a total of 240 ranked lists with 50 relationship paths each between entity pairs in two familiar domains, music and movies, in two subsets of DBpedia that we call DBpedia21M and DBpedia45M. Specifically, we extracted data from the following two publicly available subsets of the English DBpedia corpus to form our two knowledge bases:

    1. mappingbased-objects: https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2021.03.01/mappingbased-objects_lang=en.ttl.bz2
    2. infobox-properties: https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2021.03.01/infobox-properties_lang=en.ttl.bz2

    DBpedia21M contains the statements in the mappingbased-objects dataset, and DBpedia45M contains the union of the statements in mappingbased-objects and infobox-properties. In both cases, we exclude statements involving literals or blank nodes.

    For each dataset (DBpedia21M and DBpedia45M), the ground truth contains 120 ranked lists with 50 relationship paths each. Each list corresponds to the most relevant paths between one of the 20 entity pairs, 10 from the music domain and 10 from the movie domain, found using different path search strategies.

    A path search strategy consists of an entity similarity measure and a path ranking measure. The ground truth was created using the following 6 strategies:

    1. Jaccard Index & Predicate Frequency Inverse Triple Frequency (PF-ITF)
    2. Jaccard Index & Exclusivity-based Relatedness (EBR)
    3. Jaccard Index & Pointwise Mutual Information (PMI)
    4. Wikipedia Link-based Measure (WLM) & PF-ITF
    5. WLM & EBR
    6. WLM & PMI

    The filename of a file that contains the ranked list of 50 relationship paths between a pair of entities has the following format:

    [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt

    Example 1: DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt
    Example 2: DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt

    The file in Example 1 contains the top-50 most relevant paths between Michael Jackson and Whitney Houston in DBpedia21M using search strategy number 2 (Jaccard Index & EBR). The file in Example 2 contains the top-50 most relevant paths between Paul Newman and Joanne Woodward in DBpedia45M using search strategy number 4 (WLM & PF-ITF).

    The data is split into two files, one for each dataset, compressed in .zip format:

    • DBpedia21M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia21M dataset.
    • DBpedia45M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia45M dataset.
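
    A small helper to unpack the documented filename format (the regex and names are illustrative; it assumes the two entity names are separated by the first hyphen):

    ```python
    import re

    PATTERN = re.compile(
        r"^(?P<dataset>[^.]+)\.(?P<pair_id>\d+)\.(?P<strategy>\d+)\.(?P<pair>.+)\.txt$"
    )

    def parse_name(filename):
        m = PATTERN.match(filename)
        entity1, entity2 = m.group("pair").split("-", 1)
        return {
            "dataset": m.group("dataset"),         # DBpedia21M or DBpedia45M
            "pair_id": int(m.group("pair_id")),
            "strategy": int(m.group("strategy")),  # 1-6, as listed above
            "entities": (entity1, entity2),
        }

    print(parse_name("DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt"))
    ```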

  14. ‘Path to a SAEIV line’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Path to a SAEIV line’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-europa-eu-path-to-a-saeiv-line-fc4f/e9e4f166/?iid=006-116&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Path to a SAEIV line’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/61a025be66bcd934d64ed79e on 17 January 2022.

    --- Dataset description provided by original source is as follows ---

    This non-graphic dataset from the Operating Assistance and Traveller Information System (SAEIV) represents the variants of line paths used by vehicles in the TBM network. A line path is an orderly sequence of consecutive sections, with a direction (outbound or return).

    This dataset can be linked to: Course of a vehicle on a path, SAEIV commercial line, Elementary Pathway Round, Deviation Round, Physical stop on the network, Bus schedules for the next 14 days, and Vehicle in service on the network.

    This dataset is refreshed every hour. Be careful: for performance reasons, this dataset (Table, Map, Analysis, and Export tabs) can be updated less frequently than the source, and a deviation may exist. We also invite you to use our web services (see the Webservices BM tab) to retrieve the freshest data.

    --- Original source retains full ownership of the source dataset ---

  15. No one true path: uncovering the interplay between geography, institutions,...

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    txt, xls
    Updated Dec 7, 2022
    Cite
    Chih Ming Tan; Chih Ming Tan (2022). No one true path: uncovering the interplay between geography, institutions, and fractionalization in economic development (replication data) [Dataset]. http://doi.org/10.15456/jae.2022320.0720867021
    Explore at:
    Available download formats: xls (148480), txt (4148), txt (25343)
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Chih Ming Tan; Chih Ming Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Do institutions rule when explaining cross-country divergence? By employing regression tree analysis to uncover the existence and nature of multiple development clubs and growth regimes, this paper finds that to a large extent they do. However, the role of ethnic fractionalization cannot be dismissed. The findings suggest that sufficiently high-quality institutions may be necessary for the negative impact on development from high levels of ethnic fractionalization to be mitigated. Interestingly, I find no role for geographic factors, neither those associated with climate nor those associated with physical isolation, in explaining divergence. There is also no evidence to suggest a role for religious fractionalization.

  16. climateset

    • huggingface.co
    Updated Mar 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ClimateSet (2024). climateset [Dataset]. https://huggingface.co/datasets/climateset/climateset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 6, 2024
    Authors
    ClimateSet
    License

    https://choosealicense.com/licenses/other/

    Description

    Terms of Use

    By using the dataset, you agree to comply with the dataset license (CC-by-4.0-Deed).

      Download Instructions
    

    To download one file, please use:

    ```python
    from huggingface_hub import hf_hub_download

    # Path of the directory where the data will be downloaded on your local machine
    local_directory = 'LOCAL_DIRECTORY'

    # Relative path of the file in the repository
    filepath = 'FILE_PATH'

    repo_id = "climateset/climateset"
    repo_type = "dataset"
    hf_hub_download(repo_id=repo_id…
    ```

    See the full description on the dataset page: https://huggingface.co/datasets/climateset/climateset.
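
    For completeness, a plausible completion of the truncated call above, assuming the standard hf_hub_download keyword arguments (filename, repo_type, local_dir); consult the dataset page for the authoritative snippet:

    ```python
    from huggingface_hub import hf_hub_download

    local_directory = 'LOCAL_DIRECTORY'  # where the file should be stored locally
    filepath = 'FILE_PATH'               # relative path of the file in the repository

    repo_id = "climateset/climateset"
    repo_type = "dataset"
    # Assumed completion of the truncated call above
    hf_hub_download(repo_id=repo_id, repo_type=repo_type,
                    filename=filepath, local_dir=local_directory)
    ```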

  17. Dataset about Port Research

    • data.4tu.nl
    zip
    Cite
    Zihui Yang, Dataset about Port Research [Dataset]. http://doi.org/10.4121/14298851.v2
    Explore at:
    Available download formats: zip
    Dataset provided by
    4TU.ResearchData
    Authors
    Zihui Yang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    After determining that there is no direct connection between ports in the network diagram, the direct connection distance between ports is obtained through the port.sol.com.cn, SeaRates.com, and McDistance shipping calculation tools. If the three queried values differ substantially, the average value method is used for optimization, producing the table Port Distance.
    Using the Floyd algorithm, the path between two ports in the port network graph is solved on the basis of the table Port Distance. There may be multiple shortest paths between two ports, but this situation is not considered here; the only result is the result of the Python simulation, producing the table Port Shortest Path.
    After obtaining the Port Shortest Path, the value of the shortest path between two ports is calculated, producing the table Port Shortest Path Value.
    According to the shortest paths between ports, the number of routes through each port is counted, and K-Medoids is then used to construct the model of strategic importance of ports, producing the table Number of ports are crossed by the shortest path.
    Following the principle of the Betweenness Centrality model, the betweenness centrality of each port in the whole network is obtained from the table Port Shortest Path, and K-Medoids is then applied to obtain the table Port Betweenness Centrality.
    The values and contents of the table Number of ports are crossed by the shortest path and the Betweenness Centrality Group table are combined to produce the Total Group table, to facilitate data search.

  18. Landsat 4-9 Tiling Grid Path/Row World Reference System (WRS-2) (USGS)

    • researchdata.edu.au
    Updated Nov 9, 2022
    Cite
    Landsat Missions (2022). Landsat 4-9 Tiling Grid Path/Row World Reference System (WRS-2) (USGS) [Dataset]. https://researchdata.edu.au/landsat-4-9-2-usgs/2973832
    Explore at:
    Dataset updated
    Nov 9, 2022
    Dataset provided by
    Australian Institute of Marine Science (AIMS)
    Australian Ocean Data Network
    Authors
    Landsat Missions
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    https://www.usgs.gov/information-policies-and-instructions/acknowledging-or-crediting-usgs

    Time period covered
    Jul 16, 1982 - Jan 1, 2036
    Area covered
    Description

    This dataset shows the tiling grid and the Path and Row IDs for Landsat 4 - 9 satellite imagery. The IDs are useful for selecting imagery of an area of interest. Landsat 4 - 9 are a series of Earth observation satellites, part of the US Landsat program, which has monitored Earth's land surfaces since 1982.

    The Worldwide Reference System (WRS) is a global notation system used for cataloging and indexing Landsat imagery. It employs a grid-based system consisting of path and row numbers, where the path indicates the longitude and the row indicates the latitude, allowing users to easily locate and identify specific scenes covering a particular area on Earth.

    Landsat satellites 4, 5, 7, 8, and 9 follow WRS-2, which this dataset describes.

    This dataset corresponds to the descending Path/Row identifiers, as these correspond to daytime scenes.

    eAtlas Notes: It should be noted that the extent boundaries of the scene polygons in this dataset are only indicative of the imagery extent. The individual Landsat 5 images move around by about 10 km, and the shapes of the Landsat 8 and 9 images do not match the shape of the WRS-2 polygons: the top and bottom edges are at a different angle from the imagery, which is more square in shape. The left and right edges of the polygons are also smaller than the imagery. As a result, this dataset is probably not suitable as a clipping mask for the imagery from these satellites.

    This dataset is suitable for determining the approximate extent of the imagery and the associated Row and Path IDs for a given scene.

    Why is this dataset in the eAtlas?: Landsat imagery is very useful for studying and mapping reef systems. Selecting imagery for study often requires knowing the Path and Row numbers for the area of interest. This dataset is intended as a reference layer, and this metadata record is included so the associated mapping layer can link to it. The eAtlas is not the custodian of this dataset, and copies of the data should be obtained from the original sources. The eAtlas does, however, keep a cached version of the dataset from the time this record was set up, to make available should the original dataset no longer be available.

    eAtlas Processing: The original data was sourced from USGS (See links). No modifications to the underlying data were performed.

    Location of the data: This dataset is filed in the eAtlas enduring data repository at: data\non-custodian\2020-2024\World_USGS_Landsat-WRS-2

  19. Inter-Domain Path Computation under Node-defined Domain Uniqueness...

    • data.mendeley.com
    • narcis.nl
    Updated Feb 16, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Thanh Pham Dinh (2022). Inter-Domain Path Computation under Node-defined Domain Uniqueness Constraint [Dataset]. http://doi.org/10.17632/tpg2nbcsc5.2
    Explore at:
    Dataset updated
    Feb 16, 2022
    Authors
    Thanh Pham Dinh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • The test data for Inter-Domain Path Computation under the Node-defined Domain Uniqueness constraint (IDPC-NDU).

    • Since no public dataset is available for the IDPC-NDU problem, two distinct types of instances were created based on the dataset of IDPC-EDU, which is also a shortest-path problem. We first generated three parameters for each instance: the number of nodes, the number of domains, and the number of edges. After that, an optimal path p is generated, where the weight of every edge on p is 1 and the number of domains on p is approximately the input graph's domain number. Next, noise is added to the instance: for every node in p, besides random-weight edges, several random weight-one edges from that node to nodes not in p are added, along with some random edges whose weights are greater than the total cost of p. These traps make it harder for simple greedy algorithms to find the optimal solution. In Type 2 instances especially, feasible paths whose length is less than three are removed. The datasets are categorized into two kinds regarding dimensionality: small instances, each of which has between 50 and 2,000 vertices, and large instances, each of which has over 2,000 vertices.

    • Filename idpc_

    The first line of a file contains two integers N and D, the number of nodes and the number of domains, respectively. The second line contains two integers s and t, the source node and the terminal node. Each subsequent line contains four integers u, v, w, d, representing an edge (u, v) with weight w belonging to domain d. A minimal reader for this format is sketched below.
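
    A minimal reader for the instance format described above, assuming whitespace-separated integers in a plain-text file (function and variable names are illustrative):

    ```python
    def read_idpc(path):
        with open(path) as f:
            n_nodes, n_domains = map(int, f.readline().split())
            source, terminal = map(int, f.readline().split())
            edges = []
            for line in f:
                if line.strip():
                    u, v, w, d = map(int, line.split())
                    edges.append((u, v, w, d))  # edge (u, v), weight w, domain d
        return n_nodes, n_domains, source, terminal, edges
    ```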

  20. Inspire data set BPL “See path, construction lines”

    • data.europa.eu
    wfs, wms
    + more versions
    Cite
    (2023). Inspire data set BPL “See path, construction lines” [Dataset]. https://data.europa.eu/88u/dataset/b80aa8e5-0113-45a0-9d3a-76906e858ea4
    Explore at:
    Available download formats: wms, wfs
    Description

    Development plan “See Path, Building Lines” of the city of Sachsenheim, transformed according to INSPIRE, based on an XPlanung dataset in version 5.0.
