12 datasets found
  1. bug-localization

    • huggingface.co
    Cite
    Maria Tigina, bug-localization [Dataset]. https://huggingface.co/datasets/tiginamaria/bug-localization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Maria Tigina
    License

    https://choosealicense.com/licenses/other/

    Description

    Bug Localization

    This is the data for Bug Localization benchmark.

      How-to
    

    Since the dataset is private, if you haven't used HF Hub before, add your token via huggingface-cli first: huggingface-cli login

    List all the available configs via datasets.get_dataset_config_names and choose an appropriate one

    Load the data via load_dataset: from datasets import load_dataset

    Select a configuration from ["py", "java", "kt", "mixed"]

    configuration = "py"

    Select a split from… See the full description on the dataset page: https://huggingface.co/datasets/tiginamaria/bug-localization.
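    Putting the steps above together, a minimal loading sketch (assuming the `datasets` package is installed and you have access to the private dataset; the helper name is illustrative, not part of the dataset's API):

```python
CONFIGURATIONS = ["py", "java", "kt", "mixed"]

def load_bug_localization(configuration="py", split=None):
    """Load one configuration of tiginamaria/bug-localization.

    Requires prior authentication via `huggingface-cli login`, since the
    dataset is private. `split` is passed through to load_dataset; see the
    dataset page for the available splits.
    """
    if configuration not in CONFIGURATIONS:
        raise ValueError(f"unknown configuration: {configuration!r}")
    from datasets import load_dataset  # lazy import; pip install datasets
    return load_dataset("tiginamaria/bug-localization", configuration, split=split)
```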

  2. bugfiles

    • zenodo.org
    • explore.openaire.eu
    txt, zip
    Updated Jan 24, 2020
    Cite
    Xin Ye (2020). bugfiles [Dataset]. http://doi.org/10.5281/zenodo.268486
    Explore at:
    Available download formats: txt, zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xin Ye
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reference

    Studies that use the data (in any form) are required to add the following reference to their report/paper:

    @inproceedings{Ye:2014,
      author    = {Ye, Xin and Bunescu, Razvan and Liu, Chang},
      title     = {Learning to Rank Relevant Files for Bug Reports Using Domain Knowledge},
      booktitle = {Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering},
      series    = {FSE 2014},
      year      = {2014},
      location  = {Hong Kong, China},
      pages     = {689--699},
      numpages  = {11},
    }

    About the Data

    Overview of Data

    This dataset contains bug reports, commit history, and API descriptions of six open source Java projects including Eclipse Platform UI, SWT, JDT, AspectJ, Birt, and Tomcat. This dataset was used to evaluate a learning to rank approach that recommends relevant files for bug reports.

    Dataset structure

    File list:

    • **AspectJ.[xlsxml]** – The bug reports and commit history of AspectJ.
    • **Birt.[xlsxml]** – The bug reports and commit history of Birt.
    • **Eclipse_Platform_UI.[xlsxml]** – The bug reports and commit history of Eclipse Platform UI.
    • **JDT.[xlsxml]** – The bug reports and commit history of JDT.
    • **SWT.[xlsxml]** – The bug reports and commit history of SWT.
    • **Tomcat.[xlsxml]** – The bug reports and commit history of Tomcat.

    Attribute Information

    • bug_id – refers to the bug report id.
    • summary – refers to the bug report summary.
    • description – refers to the bug report description.
    • report_time – refers to the bug report's report time.
    • report_timestamp – refers to the bug report's report timestamp.
    • status – refers to the status of the bug report.
    • commit – refers to the SHA-1 hash id of the commit that fixed the bug report.
    • commit_timestamp – refers to the commit timestamp.
    • files – contains the full path of every Java file that was fixed in this commit.
    • result – contains the position of every positive instance in our ranked list result.

    How to obtain the source code

    A before-fix version of the source code package needs to be checked out for each bug report. Taking Eclipse Bug 420972 as an example, this bug was fixed at commit 657bd90. To check out the before-fix version 2143203 of the source code package, use the command git checkout 657bd90~1.

    Efficient indexing of the code

    If bug 420972 is the first bug processed by the system, we check out its before-fix version 2143203 and index all the corresponding source files. To process another bug report 423588, we need to check out its before-fix version 602d549 of the source code package. For efficiency reasons, we do not need to index all the source files again. Instead, we index only the changed files, i.e., files that were “Added”, “Modified”, or “Deleted” between the two bug reports. The changed files can be obtained as follows:

    • Added: git diff --name-status 2143203 602d549 | grep ".java$" | grep "^A"
    • Modified: git diff --name-status 2143203 602d549 | grep ".java$" | grep "^M"
    • Deleted: git diff --name-status 2143203 602d549 | grep ".java$" | grep "^D"
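    The same classification can be scripted; a small Python sketch (hypothetical helper) that groups the tab-separated `git diff --name-status` output by status letter:

```python
def classify_changed_java_files(diff_output):
    """Group changed .java files from `git diff --name-status old new`
    output by status letter: A (added), M (modified), D (deleted)."""
    changes = {"A": [], "M": [], "D": []}
    for line in diff_output.splitlines():
        if not line.strip():
            continue
        status, path = line.split("\t", 1)
        # status[0] also covers rename/copy statuses like R100, which we skip
        if path.endswith(".java") and status[0] in changes:
            changes[status[0]].append(path)
    return changes
```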

    Paper abstract

    When a new bug report is received, developers usually need to reproduce the bug and perform code reviews to find the cause, a process that can be tedious and time consuming. A tool for ranking all the source files of a project with respect to how likely they are to contain the cause of the bug would enable developers to narrow down their search and potentially could lead to a substantial increase in productivity. This paper introduces an adaptive ranking approach that leverages domain knowledge through functional decompositions of source code files into methods, API descriptions of library components used in the code, the bug-fixing history, and the code change history. Given a bug report, the ranking score of each source file is computed as a weighted combination of an array of features encoding domain knowledge, where the weights are trained automatically on previously solved bug reports using a learning-to-rank technique. We evaluated our system on six large scale open source Java projects, using the before-fix version of the project for every bug report. The experimental results show that the newly introduced learning-to-rank approach significantly outperforms two recent state-of-the-art methods in recommending relevant files for bug reports. In particular, our method makes correct recommendations within the top 10 ranked source files for over 70% of the bug reports in the Eclipse Platform and Tomcat projects.
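    The weighted scoring described in the abstract can be sketched as follows (the feature names and weights here are illustrative, not the paper's actual feature set, whose weights are learned with a learning-to-rank technique):

```python
def ranking_score(features, weights):
    """Weighted combination of per-file feature values."""
    return sum(weights[name] * value for name, value in features.items())

def rank_files(file_features, weights, top_k=10):
    """Rank source files by descending score and keep the top_k candidates."""
    ranked = sorted(file_features,
                    key=lambda path: ranking_score(file_features[path], weights),
                    reverse=True)
    return ranked[:top_k]
```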

  3. Data from: Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code

    • data.niaid.nih.gov
    Updated Jan 25, 2024
    Cite
    Ibrahimzada, Ali Reza (2024). Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8190051
    Explore at:
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Ibrahimzada, Ali Reza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artifact repository for the paper Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code, accepted at ICSE 2024, Lisbon, Portugal. Authors are Rangeet Pan*, Ali Reza Ibrahimzada*, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.

    Install

    This repository contains the source code for reproducing the results in our paper. Please start by cloning this repository:

    git clone https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical

    We recommend using a virtual environment for running the scripts. Please download conda 23.11.0 from the official conda installer page. You can create a virtual environment using the following command:

    conda create -n plempirical python=3.10.13

    After creating the virtual environment, you can activate it using the following command:

    conda activate plempirical

    You can run the following command to make sure that you are using the correct version of Python:

    python3 --version && pip3 --version

    Dependencies

    To install all software dependencies, please execute the following command:

    pip3 install -r requirements.txt

    As for hardware dependencies, we used 16 NVIDIA A100 GPUs with 80 GB of memory each for model inference. The models can be run on any combination of GPUs as long as the reader can properly distribute the model weights across them. We did not perform weight distribution since we had enough memory (80 GB) per GPU.

    Moreover, for compiling and testing the generated translations, we used Python 3.10, g++ 11, GCC Clang 14.0, Java 11, Go 1.20, Rust 1.73, and .Net 7.0.14 for Python, C++, C, Java, Go, Rust, and C#, respectively. Overall, we recommend using a machine with Linux OS and at least 32GB of RAM for running the scripts.

    For running scripts of alternative approaches, you need to make sure you have installed C2Rust, CxGO, and Java2C# on your machine. Please refer to their repositories for installation instructions. For Java2C#, you need to create a .csproj file like below:

    <Project Sdk="Microsoft.NET.Sdk">
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net7.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
      </PropertyGroup>
    </Project>

    Dataset

    We uploaded the dataset we used in our empirical study to Zenodo. The dataset is organized as follows:

    CodeNet

    AVATAR

    Evalplus

    Apache Commons-CLI

    Click

    Please download and unzip the dataset.zip file from Zenodo. After unzipping, you should see the following directory structure:

    PLTranslationEmpirical
    ├── dataset
    │   ├── codenet
    │   ├── avatar
    │   ├── evalplus
    │   └── real-life-cli
    ├── ...

    The structure of each dataset is as follows:

    1. CodeNet & Avatar: Each directory in these datasets corresponds to a source language, and each includes two directories, Code and TestCases, for code snippets and test cases, respectively. Each code snippet has an id in its filename, and the id is used as a prefix for the test I/O files.

    2. Evalplus: The source language code snippets follow a similar structure to CodeNet and Avatar. However, as a one-time effort, we manually created the test cases in the target Java language inside a maven project, evalplus_java. To evaluate the translations from an LLM, we recommend moving the generated Java code snippets to the src/main/java directory of the maven project and then running the command mvn clean test surefire-report:report -Dmaven.test.failure.ignore=true to compile, test, and generate reports for the translations.

    3. Real-life Projects: The real-life-cli directory represents two real-life CLI projects from Java and Python. These datasets only contain code snippets as files and no test cases. As mentioned in the paper, the authors manually evaluated the translations for these datasets.

    Scripts

    We provide bash scripts for reproducing our results in this work. First, we discuss the translation script. To translate with a model and dataset, you first need to create a .env file in the repository and add the following:

    OPENAI_API_KEY=
    LLAMA2_AUTH_TOKEN=
    STARCODER_AUTH_TOKEN=

    1. Translation with GPT-4: You can run the following command to translate all Python -> Java code snippets in the codenet dataset with GPT-4, using top-k sampling k=50, top-p sampling p=0.95, and temperature 0.7:

    bash scripts/translate.sh GPT-4 codenet Python Java 50 0.95 0.7 0

    2. Translation with CodeGeeX: Prior to running the script, you need to clone the CodeGeeX repository and use the instructions from their artifacts to download their model weights. After cloning it inside PLTranslationEmpirical and downloading the model weights, your directory structure should look like the following:

    PLTranslationEmpirical
    ├── dataset
    │   ├── codenet
    │   ├── avatar
    │   ├── evalplus
    │   └── real-life-cli
    ├── CodeGeeX
    │   ├── codegeex
    │   ├── codegeex_13b.pt   # this file is the model weight
    │   ├── ...
    ├── ...

    You can run the following command to translate all Python -> Java code snippets in the codenet dataset with CodeGeeX, using top-k sampling k=50, top-p sampling p=0.95, and temperature 0.2 on GPU gpu_id=0:

    bash scripts/translate.sh CodeGeeX codenet Python Java 50 0.95 0.2 0

    3. For all other models (StarCoder, CodeGen, LLaMa, TB-Airoboros, TB-Vicuna), you can execute the following command to translate all Python -> Java code snippets in the codenet dataset with StarCoder|CodeGen|LLaMa|TB-Airoboros|TB-Vicuna, using top-k sampling k=50, top-p sampling p=0.95, and temperature 0.2 on GPU gpu_id=0:

    bash scripts/translate.sh StarCoder codenet Python Java 50 0.95 0.2 0

    4. For translating and testing pairs with traditional techniques (i.e., C2Rust, CxGO, Java2C#), you can run the following commands:

    bash scripts/translate_transpiler.sh codenet C Rust c2rust fix_report
    bash scripts/translate_transpiler.sh codenet C Go cxgo fix_reports
    bash scripts/translate_transpiler.sh codenet Java C# java2c# fix_reports
    bash scripts/translate_transpiler.sh avatar Java C# java2c# fix_reports

    5. For compiling and testing CodeNet, AVATAR, and Evalplus (Python to Java) translations from GPT-4, and generating fix reports, you can run the following commands:

    bash scripts/test_avatar.sh Python Java GPT-4 fix_reports 1
    bash scripts/test_codenet.sh Python Java GPT-4 fix_reports 1
    bash scripts/test_evalplus.sh Python Java GPT-4 fix_reports 1

    6. For repairing unsuccessful translations of Java -> Python in the CodeNet dataset with GPT-4, you can run the following commands:

    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 compile
    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 runtime
    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 incorrect

    7. For cleaning translations of open-source LLMs (i.e., StarCoder) in codenet, you can run the following command:

    bash scripts/clean_generations.sh StarCoder codenet

    Please note that for the above commands, you can change the dataset and model name to execute the same thing for other datasets and models. Moreover, you can refer to /prompts for different vanilla and repair prompts used in our study.

    Artifacts

    Please download the artifacts.zip file from our Zenodo repository. We have organized the artifacts as follows:

    RQ1 - Translations: This directory contains the translations from all LLMs and for all datasets. We have added an excel file to show a detailed breakdown of the translation results.

    RQ2 - Manual Labeling: This directory contains an excel file which includes the manual labeling results for all translation bugs.

    RQ3 - Alternative Approaches: This directory contains the translations from all alternative approaches (i.e., C2Rust, CxGO, Java2C#). We have added an excel file to show a detailed breakdown of the translation results.

    RQ4 - Mitigating Translation Bugs: This directory contains the fix results of GPT-4, StarCoder, CodeGen, and Llama 2. We have added an excel file to show a detailed breakdown of the fix results.

    Contact

    We look forward to hearing your feedback. Please contact Rangeet Pan or Ali Reza Ibrahimzada for any questions or comments 🙏.

  4. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository
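    Reading repositories.csv with the columns above might look like this (the sample rows and the "True"/"False" boolean serialization are assumptions for illustration):

```python
import csv
import io

# Hypothetical two-row sample in the documented column layout
# (only a few of the columns are shown).
SAMPLE = """ID,url,downloaded,redistributable
1,https://github.com/a/x,True,True
2,https://github.com/b/y,True,False
"""

def redistributable_urls(csv_text):
    """URLs of repositories whose licenses permit redistribution."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["url"] for row in reader if row["redistributable"] == "True"]
```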

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
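    The techniques column can be derived by scanning a testing file for the keywords above; a sketch using an illustrative subset of the keyword-to-technique mapping (the actual assignment used for the dataset is an assumption here):

```python
# Illustrative subset of a keyword-to-technique mapping; see the `keywords`
# column description for the full keyword list.
KEYWORD_TECHNIQUES = {
    "pulumi.runtime.setMocks(": "pulumi_unit_mocking",
    "@pulumi.runtime.test": "pulumi_unit",
    "aws_cdk.assertions": "awscdk_assert",
    "cdktf": "cdktf",
}

def detect_techniques(source):
    """Testing techniques whose keywords occur in the file contents."""
    return {tech for kw, tech in KEYWORD_TECHNIQUES.items() if kw in source}
```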

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
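    Under these conventions, the positional argument lists for the three project types can be assembled like this (the helper and the token placeholder are illustrative):

```python
# (filename, extensions) per PL-IaC solution, per the conventions above.
SEARCH_TARGETS = {
    "pulumi": ("Pulumi", "yml,yaml"),
    "awscdk": ("cdk", "json"),
    "cdktf": ("cdktf", "json"),
}

def build_search_args(token, out_csv, solution):
    """Positional argv for search-repositories.py: token, output CSV,
    filename, extensions, min file size (0 = all), max file size (* = unlimited)."""
    filename, extensions = SEARCH_TARGETS[solution]
    return ["search-repositories.py", token, out_csv, filename, extensions, "0", "*"]
```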

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.

  5. Data_Sheet_3_A Flexible, Extensible, Machine-Readable, Human-Intelligible, and Ontology-Agnostic Metadata Schema (OIMS)

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Gideon Kruseman (2023). Data_Sheet_3_A Flexible, Extensible, Machine-Readable, Human-Intelligible, and Ontology-Agnostic Metadata Schema (OIMS).DOCX [Dataset]. http://doi.org/10.3389/fsufs.2022.767863.s003
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Gideon Kruseman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper presents a lightweight, flexible, extensible, machine readable and human-intelligible metadata schema that does not depend on a specific ontology. The metadata schema for metadata of data files is based on the concept of data lakes where data is stored as they are. The purpose of the schema is to enhance data interoperability. The lack of interoperability of messy socio-economic datasets that contain a mixture of structured, semi-structured, and unstructured data means that many datasets are underutilized. Adding a minimum set of rich metadata and describing new and existing data dictionaries in a standardized way goes a long way to make these high-variety datasets interoperable and reusable and hence allows timely and actionable information to be gleaned from those datasets. The presented metadata schema OIMS can help to standardize the description of metadata. The paper introduces overall concepts of metadata, discusses design principles of metadata schemes, and presents the structure and an applied example of OIMS.

  6. Technical Leverage Dataset for Java Dependencies in Maven

    • data.niaid.nih.gov
    Updated Aug 8, 2022
    Cite
    Massacci, Fabio (2022). Technical Leverage Dataset for Java Dependencies in Maven [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6796848
    Explore at:
    Dataset updated
    Aug 8, 2022
    Dataset provided by
    Massacci, Fabio
    Pashchenko, Ivan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In finance, leverage is the ratio between assets borrowed from others and one's own assets. A matching situation is present in software: by using free open-source software (FOSS) libraries a developer leverages on other people's code to multiply the offered functionalities with a much smaller own codebase. In finance as in software, leverage magnifies profits when returns from borrowing exceed costs of integration, but it may also magnify losses, in particular in the presence of security vulnerabilities. We aim to understand the level of technical leverage in the FOSS ecosystem and whether it can be a potential source of security vulnerabilities. Also, we introduce two metrics change distance and change direction to capture the amount and the evolution of the dependency on third-party libraries. Our analysis published in [1] shows that small and medium libraries (less than 100KLoC) have disproportionately more leverage on FOSS dependencies in comparison to large libraries. We show that leverage pays off as leveraged libraries only add a 4% delay in the time interval between library releases while providing four times more code than their own. However, libraries with such leverage (i.e., 75% of libraries in our sample) also have 1.6 higher odds of being vulnerable in comparison to the libraries with lower leverage.
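    As a formula, the leverage ratio described above can be sketched as follows (variable names are illustrative):

```python
def technical_leverage(dependency_loc, own_loc):
    """Ratio of code borrowed from FOSS dependencies to a library's own code,
    per the finance analogy above."""
    if own_loc <= 0:
        raise ValueError("own_loc must be positive")
    return dependency_loc / own_loc
```

    For instance, a library with 100 KLoC of its own code depending on 400 KLoC of third-party code has a leverage of 4, matching the "four times more code than their own" observation above.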

    This dataset is the original dataset used in the publication [1]. It includes 8494 distinct library versions from FOSS Maven-based Java libraries. An online demo for computing the proposed metrics for real-world software libraries is also available at the following URL: https://techleverage.eu/.

    The original publication is [1]. An executive summary of the results is available as the publication [2]. This work has been funded by the European Union through the project AssureMOSS (https://www.assuremoss.eu).

    [1] Massacci, F., & Pashchenko, I. (2021, May). Technical leverage in a software ecosystem: Development opportunities and security risks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 1386-1397). IEEE.

    [2] Massacci, F., & Pashchenko, I. (2021). Technical Leverage: Dependencies Are a Mixed Blessing. IEEE Secur. Priv., 19(3), 58-62.

  7. (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Knack, Jascha (2020). (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1140260
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Baltes, Sebastian
    Knack, Jascha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    • used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
    • were active for at least one year (365 days) before the first build with Travis CI (before_ci),
    • used Travis CI for at least one year (during_ci),
    • had commit or merge activity on the default branch in both of these phases, and
    • used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
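    The file-extension filter described above can be sketched as follows (the extension set is an assumption matching Java and Ruby source files):

```python
import os

SOURCE_EXTENSIONS = {".java", ".rb"}

def touches_source(changed_files):
    """True if a commit changed at least one Java or Ruby source file."""
    return any(os.path.splitext(f)[1] in SOURCE_EXTENSIONS for f in changed_files)
```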

We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:

    have Java or Ruby as their project language

    used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)

    have commit activity for at least two years (730 days)

    are engineered software projects (at least 10 watchers)

    were not in the TravisTorrent dataset

In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.
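The median split described above can be sketched as follows. This is an illustrative sketch, not the study's actual code; the class and method names are made up for demonstration:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (not the study's code) of splitting a project's
// commit history at the median development date.
public class MedianSplit {

    // Median of the commit timestamps; for an even count this picks the
    // upper of the two middle elements (one of several conventions).
    static Instant medianDate(List<Instant> commitDates) {
        List<Instant> sorted = new ArrayList<>(commitDates);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }

    // Commits strictly before the median form the first phase,
    // all remaining commits the second.
    static int[] phaseSizes(List<Instant> commitDates) {
        Instant median = medianDate(commitDates);
        int before = 0;
        for (Instant d : commitDates) {
            if (d.isBefore(median)) before++;
        }
        return new int[] { before, commitDates.size() - before };
    }
}
```

Projects where either phase ends up empty would then be dropped, as described above.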

    This dataset contains the following files:

    tr_projects_sample_filtered_2.csv A CSV file with information about the 113 selected projects.

    tr_sample_commits_default_branch_before_ci.csv tr_sample_commits_default_branch_during_ci.csv One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_"). branch: The branch to which the commit was made. hash_value: The SHA1 hash value of the commit. author_name: The author name. author_email: The author email address. author_date: The authoring timestamp. commit_name: The committer name. commit_email: The committer email address. commit_date: The commit timestamp. log_message_length: The length of the git commit messages (in characters). file_count: Files changed with this commit. lines_added: Lines added to all files changed with this commit. lines_deleted: Lines deleted in all files changed with this commit. file_extensions: Distinct file extensions of files changed with this commit.

    tr_sample_merges_default_branch_before_ci.csv tr_sample_merges_default_branch_during_ci.csv One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    project: GitHub project name ("/" replaced by "_"). branch: The destination branch of the merge. hash_value: The SHA1 hash value of the merge commit. merged_commits: Unique hash value prefixes of the commits merged with this commit. author_name: The author name. author_email: The author email address. author_date: The authoring timestamp. commit_name: The committer name. commit_email: The committer email address. commit_date: The commit timestamp. log_message_length: The length of the git commit messages (in characters). file_count: Files changed with this commit. lines_added: Lines added to all files changed with this commit. lines_deleted: Lines deleted in all files changed with this commit. file_extensions: Distinct file extensions of files changed with this commit. pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message). source_user: GitHub login name of the user who initiated the pull request (extracted from log message). source_branch : Source branch of the pull request (extracted from log message).
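For illustration, the pull_request_id, source_user, and source_branch columns can be recovered from GitHub's default merge-commit message with a regular expression like the one below. This is a sketch under that assumption; the dataset's actual extraction lives in the git-log-parser repository linked above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract pull-request metadata from GitHub's default
// merge-commit message, e.g. "Merge pull request #42 from alice/fix-build".
public class MergeMessageParser {
    private static final Pattern MERGE_MSG =
        Pattern.compile("Merge pull request #(\\d+) from ([^/\\s]+)/(\\S+)");

    // Returns {pull_request_id, source_user, source_branch}, or null
    // if the message does not follow the default format.
    static String[] parse(String logMessage) {
        Matcher m = MERGE_MSG.matcher(logMessage);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

Merges created outside the GitHub UI (e.g. local `git merge` with a custom message) would yield null here, which is why these columns can be empty.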

    comparison_project_sample_800.csv A CSV file with information about the 800 projects in the comparison sample.

commits_default_branch_before_mid.csv commits_default_branch_after_mid.csv One CSV file with information about all commits to the default branch before and after the median date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.

merges_default_branch_before_mid.csv merges_default_branch_after_mid.csv One CSV file with information about all merges into the default branch before and after the median date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.

  8. c

    ckanext-os - Extensions - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-os - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-os
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The OS Widgets extension for CKAN enhances data portals with map-based search and preview capabilities, primarily developed by Ordnance Survey for data.gov.uk. The extension provides widgets, an API for preview lists, and a spatial datastore for spatial data previews. It integrates mapping functionality into CKAN so that users can discover and visualize geospatial datasets, and it shows a "shopping basket" preview list that gives users an idea of what data to preview.

    Key Features:

    Map-Based Search Widget: Enables users to search for datasets based on geographic location, improving dataset discoverability in specific areas of interest.

    Map Preview Widget: Allows users to visualize geospatial datasets directly within CKAN, providing an immediate understanding of the data's spatial extent and content.

    Preview List API: Provides an API to store and manage a list of selected datasets for previewing, acting as a "shopping basket" of packages to preview. It supports adding to, removing from, and listing the packages in the preview list.

    Spatial Database Integration: Supports previewing geo-referenced data (latitude/longitude coordinates, postcodes, etc.) through a spatial database (PostGIS).

    Spatial Ingester Wrapper: Includes a wrapper that calls the Spatial Ingester, a Java tool that converts data (typically CSV/XLS) and stores it in a PostGIS database. The converted data can then be served in WFS format for display in the Map Preview tool.

    Configurable Server Settings: Enables customization of server URLs and API keys for the widgets, allowing configuration of the geoserver, gazetteer, and libraries used in the widgets.

    Proxy Configuration Support: Provides guidance on configuring an Apache proxy to improve the performance of GeoServer WFS calls, ensuring quick retrieval of boundary information.

    Technical Integration: The extension integrates with CKAN through plugins (ossearch, ospreview, oswfsserver) that are enabled in the CKAN configuration file. Configuration involves adding the plugin names to the ckan.plugins setting and adjusting server URLs, spatial database connection details, and API keys for the specific environment, ensuring seamless integration with existing CKAN deployments. Besides the PostGIS database that has to be created, the extension also depends on external libraries such as jQuery, Underscore, and Backbone.

    Benefits & Impact: Map-based search significantly improves the discovery of geospatial datasets within CKAN, allowing users to easily find data relevant to their geographic area of interest. Map previews provide immediate visual context for geospatial datasets, leading to a better understanding of the data's spatial characteristics.
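A minimal configuration sketch, assuming a standard CKAN .ini deployment file. The ckan.plugins line uses the plugin names given above; the commented option keys and URLs are placeholders, not the extension's documented settings:

```ini
# Enable the OS Widgets plugins alongside any existing plugins.
ckan.plugins = stats text_view ossearch ospreview oswfsserver

# Server endpoints used by the widgets (placeholder keys and values --
# the actual option names must be taken from the ckanext-os documentation).
# ckanext-os.geoserver.url = https://geoserver.example.org/geoserver
# ckanext-os.gazetteer.url = https://gazetteer.example.org/
```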

  9. Z

    Data Set htwddKogRob-TSDChangesSim for Localization and Lifelong Mapping

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Bahrmann, Frank (2024). Data Set htwddKogRob-TSDChangesSim for Localization and Lifelong Mapping [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4270179
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Bahrmann, Frank
    License

    Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This dataset provides log files recorded in a changed indoor environment with 18 dynamic obstacles. The changes from the original map to the simulated world are highlighted in the figure htwddKogRob-TSDChangesSim_changesHighlighted.png. The total distance travelled in this dataset is 179.8 km. The prior-knowledge map given to the robot for localization and map updating is shown in htwddKogRob-TSDChangesSim_prior.png, and the simulated, now changed, map is shown in htwddKogRob-TSDChangesSim_groundTruth.png (for both maps: 1 px ≙ 0.1 m).

    The work was first presented in:

    A Fuzzy-based Adaptive Environment Model for Indoor Robot Localization

    Authors: Frank Bahrmann, Sven Hellbach, Hans-Joachim Böhme

    Date of Publication: 2016/10/6

    Conference: Telehealth and Assistive Technology / 847: Intelligent Systems and Robotics

    Publisher: ACTA Press

    Additionally, we present a video of the proposed algorithm and an insight into this dataset at:

    youtube.com/AugustDerSmarte

    https://www.youtube.com/watch?v=26NBFN_XeQg

    Instructions for use

    The zip archives contain ASCII files holding the log files of the robot observations and robot poses. Since this dataset was recorded in a simulated environment, the logfiles include both a changed starting position and a ground-truth pose. For further information, please refer to the header of the logfile. To simplify parsing of the files, you can use these two Java snippets:

    Laser Range Measurements:

      List<Double> ranges = new ArrayList<>(numOfLaserRays);
      List<Error> errors = new ArrayList<>(numOfLaserRays);

      String s = line.substring(4);
      String delimiter = "()";
      StringTokenizer tokenizer = new StringTokenizer(s, delimiter);

      while (tokenizer.hasMoreTokens()) {
        String[] arr = tokenizer.nextToken().split(";");
        boolean usable = !arr[0].equals("0");
        double range = Double.parseDouble(arr[1]);

        ranges.add(range);
        errors.add(usable ? Error.OKAY : Error.INVALID_MEASUREMENT);
      }
    

    Poses:

      String poseString = line.split(":")[2];
      String[] elements = poseString.substring(1, poseString.length()-1).split(";");
      double x = Double.parseDouble(elements[0]);
      double y = Double.parseDouble(elements[1]);
      double phi = Double.parseDouble(elements[2]);
    
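As a usage illustration, the pose snippet above can be wrapped in a small helper. The input line format used here ("POSE:odom:(x;y;phi)") is an assumption for demonstration only; the actual format is documented in each logfile's header:

```java
// Hypothetical demo of the pose-parsing snippet; the line format is
// assumed for illustration -- check the logfile header for the real one.
public class PoseParserDemo {

    // Extracts (x, y, phi) from a line such as "POSE:odom:(1.25;-0.40;1.57)".
    static double[] parsePose(String line) {
        String poseString = line.split(":")[2];
        String[] elements = poseString.substring(1, poseString.length() - 1).split(";");
        return new double[] {
            Double.parseDouble(elements[0]),
            Double.parseDouble(elements[1]),
            Double.parseDouble(elements[2])
        };
    }

    public static void main(String[] args) {
        double[] pose = parsePose("POSE:odom:(1.25;-0.40;1.57)");
        System.out.println("x=" + pose[0] + " y=" + pose[1] + " phi=" + pose[2]);
    }
}
```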
  10. Z

    Data Set htwddKogRob-InfReal for Localization and Lifelong Mapping

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Bahrmann, Frank (2024). Data Set htwddKogRob-InfReal for Localization and Lifelong Mapping [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4269556
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Bahrmann, Frank
    License

    Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This small dataset contains real-world log files from a 2.2 km long patrol between two points of a previously known map (see htwddKogRob-InfReal.png; 1 px ≙ 0.1 m). The environment changes slightly and there are some dynamic obstacles. The figure htwddKogRob-InfReal_path.jpg shows the path driven by the robot, according to the real kilometers driven, and the patrol points.

    The work was first presented in:

    A Fuzzy-based Adaptive Environment Model for Indoor Robot Localization

    Authors: Frank Bahrmann, Sven Hellbach, Hans-Joachim Böhme

    Date of Publication: 2016/10/6

    Conference: Telehealth and Assistive Technology / 847: Intelligent Systems and Robotics

    Publisher: ACTA Press

    Additionally, we present a video of the proposed algorithm and an insight into this dataset at:

    youtube.com/AugustDerSmarte

    https://www.youtube.com/watch?v=26NBFN_XeQg

    Instructions for use

    The zip archive contains ASCII files holding the log files of the robot observations and robot poses. Since this dataset was recorded in a real environment, the logfile provides only the odometry-based robot poses. For further information, please refer to the header of the logfile. To simplify parsing of the files, you can use these two Java snippets:

    Laser Range Measurements:

      List<Double> ranges = new ArrayList<>(numOfLaserRays);
      List<Error> errors = new ArrayList<>(numOfLaserRays);

      String s = line.substring(4);
      String delimiter = "()";
      StringTokenizer tokenizer = new StringTokenizer(s, delimiter);

      while (tokenizer.hasMoreTokens()) {
        String[] arr = tokenizer.nextToken().split(";");
        boolean usable = !arr[0].equals("0");
        double range = Double.parseDouble(arr[1]);

        ranges.add(range);
        errors.add(usable ? Error.OKAY : Error.INVALID_MEASUREMENT);
      }
    

    Poses:

      String poseString = line.split(":")[2];
      String[] elements = poseString.substring(1, poseString.length()-1).split(";");
      double x = Double.parseDouble(elements[0]);
      double y = Double.parseDouble(elements[1]);
      double phi = Double.parseDouble(elements[2]);
    
  11. Z

    Data Set htwddKogRob-TSDReal for Localization and Lifelong Mapping

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Bahrmann, Frank (2024). Data Set htwddKogRob-TSDReal for Localization and Lifelong Mapping [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4270151
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset authored and provided by
    Bahrmann, Frank
    License

    Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    This dataset represents a 4.7 km long tour (odometry path shown in htwddKogRob-TSDReal_path.png) in an environment whose map representation (see htwddKogRob-TSDReal.png; 1 px ≙ 0.1 m) is now obsolete. Several static objects have been moved or removed, and there are varying numbers of dynamic obstacles (people walking around).

    The work was first presented in:

    A Fuzzy-based Adaptive Environment Model for Indoor Robot Localization

    Authors: Frank Bahrmann, Sven Hellbach, Hans-Joachim Böhme

    Date of Publication: 2016/10/6

    Conference: Telehealth and Assistive Technology / 847: Intelligent Systems and Robotics

    Publisher: ACTA Press

    Additionally, we present a video of the proposed algorithm and an insight into this dataset at:

    youtube.com/AugustDerSmarte

    https://www.youtube.com/watch?v=26NBFN_XeQg

    Instructions for use

    The zip archive contains ASCII files holding the log files of the robot observations and robot poses. Since this dataset was recorded in a real environment, the logfiles hold only the odometry-based robot poses. For further information, please refer to the headers of the logfiles. To simplify parsing of the files, you can use these two Java snippets:

    Laser Range Measurements:

      List<Double> ranges = new ArrayList<>(numOfLaserRays);
      List<Error> errors = new ArrayList<>(numOfLaserRays);

      String s = line.substring(4);
      String delimiter = "()";
      StringTokenizer tokenizer = new StringTokenizer(s, delimiter);

      while (tokenizer.hasMoreTokens()) {
        String[] arr = tokenizer.nextToken().split(";");
        boolean usable = !arr[0].equals("0");
        double range = Double.parseDouble(arr[1]);

        ranges.add(range);
        errors.add(usable ? Error.OKAY : Error.INVALID_MEASUREMENT);
      }
    

    Poses:

      String poseString = line.split(":")[2];
      String[] elements = poseString.substring(1, poseString.length()-1).split(";");
      double x = Double.parseDouble(elements[0]);
      double y = Double.parseDouble(elements[1]);
      double phi = Double.parseDouble(elements[2]);
    
  12. r

    Cenozoic macroperforate planktonic foraminifera phylogeny of Aze & others...

    • researchdata.edu.au
    Updated 2018
    Cite
    Wade Bridget S.; Ogg James G.; Pearson Paul N.; Zehady Abdullah Khan; Haller Christian; Aze Tracy; Fordham Barry G.; Dr Barry Fordham (2018). Cenozoic macroperforate planktonic foraminifera phylogeny of Aze & others (2011). TimeScale Creator Evolutionary Tree. Corrected Version, July 2018. Five datapacks for Java software package. [Dataset]. http://doi.org/10.25911/5B8DF4DDB9497
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    The Australian National University
    The Australian National University Data Commons
    Authors
    Wade Bridget S.; Ogg James G.; Pearson Paul N.; Zehady Abdullah Khan; Haller Christian; Aze Tracy; Fordham Barry G.; Dr Barry Fordham
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    1826 - 2014
    Description

    Timescale Creator–database customization

    Features provided by Timescale Creator enhance the information which can be gleaned from the 2011 trees. These features can be provided either from functions already built into Timescale Creator, or via “in-house” programming within the database which has exploited the built-in functions to provide data and information on key issues of interest to the case study. It is this flexibility provided by the combination of Timescale Creator functions and datapacks programmed from the back-end relational database which is showcased below.

    Groups

    Colours were used in the original 2011 trees [1, Appendices 2, 3], and now in the Timescale Creator trees, to display eco- and morpho-groups (respectively). The Timescale Creator trees also add coloured group labels (rather than colouring the range labels as in the original trees), which allows identification of groups without recourse to the legend. These group labels are positioned on ancestor–descendant branches, but have here been programmed to display only when the group membership changes from ancestor to descendant. As a result, they have the added advantage of highlighting origins and reappearances of the selected groups or properties in a phylogenetic context. A handy use of this feature is, for example, to apply it to the generic assignment of morphospecies, making polyphyletic morphogenera, intentional or otherwise, easy to spot.

    Lineage labels

    To label range lines on the lineage tree, the Timescale Creator version has been programmed to augment each lineage code with its list of contained morphospecies, e.g., the listing appended to Lineage N1-N3 is “H. holmdelensis > G. archeocompressa > G. planocompressa > G. compressa“. The morphospecies series in these listings is ordered by lowest occurrence, and so the >’s denote stratigraphic succession. (The >’s do not necessarily represent ancestor–descendant relationships; of course only a single line of descent could be expressed in such a format.) This allows the lineage and its proposed morphological succession to be grasped much more easily, including a ready comparison with the morphospecies tree.

    Pop-ups

    Pop-ups provide the most ample opportunity within Timescale Creator to provide access to supporting information for trees. Because pop-up windows are flexibly resizable and are coded in html, textual content has in effect few quota limitations and, in fact, can be employed to view external sources such as Internet sites and image files without the need to store them in the pop-up itself. They can also be programmed to follow a format tailored for the subject matter, as is done here.

    Pop-ups for the morphospecies tree display the contents of the 2011 paper’s summary table [1, Appendix S1, Table S3], including decoding of eco- and morpho-group numbers, range statistics from the Neptune portal, and tailoring the reference list to each morphospecies. They also incorporate the ancestor [from 1, Appendix S5, worksheet aM], specify the type of cladogenetic event (all are, in fact, budding for this budding/bifurcating topology [2]), and level of support for the ancestor–descendant proposal (see § Branches). Lineages containing the morphospecies are listed, along with their morphospecies content and age range (for details, see § Linkages between morphospecies and lineage trees [3]). Also included are the binomen’s original assignation and, where available, links to portals, Chronos [4][5-7] and the World Register of Marine Species (WoRMS) [8].

    Range lines

    Range-line styles have been used for the Timescale Creator version of the 2011 trees to depict four levels of confidence for ranges. Apart from accepted ranges (lines of usual thickness), two less-confident records of stratigraphic occurrence are depicted: “questioned” (thin line) and “questioned-and-rare” (broken line). For extensions to ranges that are not based on stratigraphic occurrences but are hypothesized (for various reasons), a “conjectured” range is separately recognised (dotted line) to ensure that stratigraphic and hypothesized categories are not conflated. There is an option to attach age labels (in Ma) to range lines, providing the chart with an explicit deep-time positioning throughout.

    Branches

    Similarly to ranges, branch-line styles have been used to depict three levels of stratophenetic support for ancestry. Almost all ancestor–descendant proposals for the 2011 study are presumed to be “Well Supported” (correspondence between detailed stratigraphic sequences and plausible phyletic series; drawn as a broken line). A small number have been categorised as less or better supported than the usual: “Not Well Supported” (only broad correspondence between stratigraphic order and suggestive phyletic series; drawn as a dotted line); or “Strongly Supported” (detailed morphometric–stratigraphic sequences from ancestor to descendant; continuous line).

    Linkages between morphospecies and lineage trees

    Many range points of the lineages of the 2011 study are herein directly linked to those of included morphospecies: not quite half of start dates and almost all of end dates. Brief details of this linkage are displayed in the “Stratigraphic Range (continued)” section of the pop-up, where the linkage will usually result in the same precalibrated Ma value between lineage and morphospecies range points, but these values will differ where there has been a correction or amendment of the original Ma value. The reason for choosing the morphospecies range point is usually briefly indicated. Where the original Ma value of the lineage range point is retained and not directly linked to a morphospecies point, the morphospecies and its time scale that are employed nonetheless for calibration are indicated.

    Pop-ups are also employed to more easily appreciate the linkages between morphospecies and lineages, following from the morphospecies content of lineages. These are displayed both in terms of the lineages in which a morphospecies occurs and in terms of the morphospecies included in a lineage, along with other information to help track these interrelationships.

    1. Aze T, Ezard TH, Purvis A, Coxall HK, Stewart DR, Wade BS, et al. A phylogeny of Cenozoic macroperforate planktonic foraminifera from fossil data. Biological Reviews of the Cambridge Philosophical Society. 2011;86(4):900-27. doi: 10.1111/j.1469-185X.2011.00178.x.
    2. see § Data, topologies, and taxa of the 2011 study’s trees: Tree topologies, above
    3. a morphospecies contained in more than one lineage is depicted in Figure 20a
    4. Support for on-going activity on the foraminiferal section of Chronos [116] no longer appears viable; other portals may need to be linked in later versions e.g., pforams@mikrotax [117, 118]
    5. Huber BT. Foraminiferal databases (Mesozoic Paleocene, Eocene Planktonic Foraminifera Taxonomic databases), Chronos Portal Washington (DC, USA): Consortium for Ocean Leadership for the Chronos Internal Coordinating Committee (Iowa State University and the National Science Foundation). Available from: http://portal.chronos.org/gridsphere/gridsphere?cid=foram_working_group (not updated in recent years).
    6. Young J, Huber BT, Bown P, Wade BS. pforams@mikrotax (UK Natural Environment Research Council), within mikrotax.org London: University College London. Available from: http://www.mikrotax.org/pforams/index.html.
    7. Huber BT, Petrizzo MR, Young JR, Falzoni F, Gilardoni SE, Bown PR, et al. Pforams@microtax: a new online taxonomic database for planktonic foraminifera. Micropaleontology. 2017;62(6):429-38.
    8. Hayward BW, Le Coze F, Gross O. World Foraminifera Database, World Register of Marine Species (WoRMS) Vlaams Instituut voor de Zee (Flanders Marine Institute), Oostende (Belgium): WoRMS Editorial Board; 2018 [2018-01-09]. Available from: http://www.marinespecies.org/foraminifera, http://www.marinespecies.org doi:10.14284/170