26 datasets found
  1. Harmonized Tariff Schedule of the United States (2025)

    • catalog.data.gov
    Updated Jul 11, 2025
    + more versions
    Cite
    Office of Tariff Affairs and Trade Agreements (2025). Harmonized Tariff Schedule of the United States (2025) [Dataset]. https://catalog.data.gov/dataset/harmonized-tariff-schedule-of-the-united-states-2024
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Office of Tariff Affairs and Trade Agreements
    Description

    This dataset is the current 2025 Harmonized Tariff Schedule plus all revisions for the current year. It provides the applicable tariff rates and statistical categories for all merchandise imported into the United States; it is based on the international Harmonized System, the global system of nomenclature that is used to describe most world trade in goods.

  2. Mapping of Goods and Services Identification Number to United Nations...

    • open.canada.ca
    • datasets.ai
    • +1more
    csv, html, xml
    Updated Jan 21, 2025
    + more versions
    Cite
    Public Services and Procurement Canada (2025). Mapping of Goods and Services Identification Number to United Nations Standard Products and Services Code [Dataset]. https://open.canada.ca/data/en/dataset/588eab5b-7b16-4a26-b996-23b955965ffa
    Explore at:
    Available download formats: xml, csv, html
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Public Services and Procurement Canada (http://www.pwgsc.gc.ca/)
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    In 2021, an international goods and services classification for procurement, the United Nations Standard Products and Services Code (UNSPSC, v21), was implemented to replace the Government of Canada's Goods and Services Identification Number (GSIN) codes for categorizing procurement activities undertaken by the Government of Canada. For the transition from GSIN to UNSPSC, a subset of the entire version 21 UNSPSC list was created. The Mapping of GSIN-UNSPSC file below provides a suggested linkage between the subset of UNSPSC and higher levels of the GSIN code list. As procurement needs evolve, this file may be updated to include other UNSPSC v21 codes that are deemed to be required. In the interim, if the lowest-level values within the UNSPSC structure do not relate to a specific category of goods or services, the use of the higher (related) level code from within the UNSPSC structure is appropriate.

    Please note: This dataset is offered as a means to assist the user in finding specific UNSPSC codes, based on high-level comparisons to the legacy GSIN codes. It should not be considered a direct one-to-one mapping of the two categorization systems. For some categories, the linkages were only assessed at higher levels of the two structures and then simply carried through to the related lower categories beneath those values. Because the two systems do not necessarily group items in the same way throughout their structures, this can result in confusing connections in some cases. Please always select the UNSPSC code that best describes the applicable goods or services, even if the associated GSIN value as shown in this file is not directly relevant.

    The data is available in Comma Separated Values (CSV) format and can be downloaded to sort, filter, and search information. The United Nations Standard Products and Services Code (UNSPSC) page on CanadaBuys offers a comprehensive guide on how to use this reference file. The Finding and using UNSPSC Codes page from CanadaBuys also contains additional information which may be of use. This dataset was originally published on June 22, 2016. The format and contents of the CSV file were revised on May 12, 2021. A copy of the original file was archived as a secondary resource to this dataset at that time (labelled ARCHIVED - Mapping of GSIN-UNSPSC in the resource list below).

    As of March 23, 2023, the data dictionary linked below includes entries for both the current and archived versions of the datafile, as well as for the datafiles of the Goods and Services Identification Number (GSIN) dataset and the archived United Nations Standard Products and Services Codes (v10, released 2007) dataset.

  3. EU Customs Tariff (TARIC)

    • data.europa.eu
    html
    Cite
    Directorate-General for Taxation and Customs Union, EU Customs Tariff (TARIC) [Dataset]. https://data.europa.eu/data/datasets/eu-customs-tariff-taric?locale=en
    Explore at:
    Available download formats: html
    Dataset authored and provided by
    Directorate-General for Taxation and Customs Union
    License

    http://data.europa.eu/eli/dec/2011/833/oj

    Area covered
    European Union
    Description

    Multilingual database covering all measures relating to tariff, commercial and agricultural legislation. Provides a clear view of what to do when importing or exporting goods.

    TARIC, the integrated Tariff of the European Union, is a multilingual database in which are integrated all measures relating to EU customs tariff, commercial and agricultural legislation. By integrating and coding these measures, the TARIC secures their uniform application by all Member States and gives all economic operators a clear view of all measures to be undertaken when importing into the EU or exporting goods from the EU. It also makes it possible to collect EU-wide statistics for the measures concerned.

    The TARIC contains the following main categories of measures:

    • Tariff measures;

    • Agricultural measures;

    • Trade Defence instruments;

    • Prohibitions and restrictions to import and export;

    • Surveillance of movements of goods at import and export.

    For tariff information, more details can be found under the European Binding Tariff Information (EBTI).

  4. Data from: CSV file of names, times, and locations of images collected by an...

    • s.cnmilf.com
    • catalog.data.gov
    • +1more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). CSV file of names, times, and locations of images collected by an unmanned aerial system (UAS) flying over Black Beach, Falmouth, Massachusetts on 18 March 2016 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/csv-file-of-names-times-and-locations-of-images-collected-by-an-unmanned-aerial-system-uas
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Falmouth, Massachusetts, Black Beach
    Description

    Imagery acquired with unmanned aerial systems (UAS) and coupled with structure from motion (SfM) photogrammetry can produce high-resolution topographic and visual reflectance datasets that rival or exceed lidar and orthoimagery. These new techniques are particularly useful for data collection of coastal systems, which requires high temporal and spatial resolution datasets. The U.S. Geological Survey worked in collaboration with members of the Marine Biological Laboratory and Woods Hole Analytics at Black Beach, in Falmouth, Massachusetts to explore scientific research demands on UAS technology for topographic and habitat mapping applications. This project explored the application of consumer-grade UAS platforms as a cost-effective alternative to lidar and aerial/satellite imagery to support coastal studies requiring high-resolution elevation or remote sensing data. A small UAS was used to capture low-altitude photographs and GPS devices were used to survey reference points. These data were processed in an SfM workflow to create an elevation point cloud, an orthomosaic image, and a digital elevation model.

  5. journal-of-student-research-hs-articles

    • huggingface.co
    Cite
    Igor Katkov, journal-of-student-research-hs-articles [Dataset]. https://huggingface.co/datasets/ikatkov/journal-of-student-research-hs-articles
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Igor Katkov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    High school research articles crawled from the Journal of Student Research (https://www.jsr.org/).

      Dataset Structure

    CSV file with the following columns: title, URL, names, date, abstract

      Source Data
    

    https://www.jsr.org/hs/index.php/path/section/view/hs-research-articles/
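    A minimal loading sketch with the Hugging Face datasets library (the split name and default configuration are assumptions, not stated on the card):

      from datasets import load_dataset

      # Pulls the CSV-backed dataset from the Hugging Face Hub (split name assumed).
      ds = load_dataset("ikatkov/journal-of-student-research-hs-articles", split="train")

      # Columns per the dataset card: title, URL, names, date, abstract.
      print(ds.column_names)
      print(ds[0]["title"])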

  6. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
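    As an illustration only (not part of the artifact), the metadata tables can be joined with pandas once metadata.zip is extracted; the column names follow the tables above, and the file paths are assumptions:

      import pandas as pd

      repositories = pd.read_csv("repositories.csv")
      programs = pd.read_csv("programs.csv")

      # Keep only repositories whose licenses permit redistribution
      # (assumes the column is parsed as a boolean).
      redistributable = repositories[repositories["redistributable"]]

      # Attach repository metadata to each IaC program via the repository ID.
      merged = programs.merge(
          redistributable,
          left_on="repository",
          right_on="ID",
          suffixes=("_program", "_repository"),
      )
      print(merged.groupby(["solution", "language"]).size())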

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. Github access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
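    Based on the documented argument order, hypothetical invocations might look as follows (the token variable and output file names are placeholders, not part of the dataset):

      # Pulumi: project file name "Pulumi" with extensions yml or yaml, any file size
      python3 search-repositories.py $GITHUB_TOKEN pulumi-repos.csv Pulumi yml,yaml 0 '*'

      # AWS CDK: project file name "cdk" with extension json
      python3 search-repositories.py $GITHUB_TOKEN awscdk-repos.csv cdk json 0 '*'

      # CDKTF: project file name "cdktf" with extension json
      python3 search-repositories.py $GITHUB_TOKEN cdktf-repos.csv cdktf json 0 '*'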

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in the CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
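    A hypothetical invocation following that argument order (all file and directory names are placeholders):

      python3 download-repositories.py pulumi-repos.csv,awscdk-repos.csv,cdktf-repos.csv downloads/ download-overview.csv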

  7. Data from: Dataset from: Browsing is a strong filter for savanna tree...

    • data.niaid.nih.gov
    Updated Oct 1, 2021
    Cite
    Wayne Twine (2021). Dataset from: Browsing is a strong filter for savanna tree seedlings in their first growing season [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4972083
    Explore at:
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    Craddock Mthabini
    Archibald, Sally
    Wayne Twine
    Nicola Stevens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data presented here were used to produce the following paper:

    Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.

    The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588

    For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za

    Description of file(s):

    File 1: cleanedData_forAnalysis.csv (required to run the R code "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")

    The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.

    The data consist of one .csv file with the following column names:

    • treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
    • plot_rep: One of three randomised plots per treatment
    • matrix_no: Where in the plot the individual was placed
    • species_code: First three letters of the genus name and first three letters of the species name; uniquely identifies the species
    • species: Full species name
    • sample_period: Classification of sampling period into time since clip
    • status: Alive or Dead
    • standing.height: Vertical height above ground (in mm)
    • height.mm: Length of the longest branch (in mm)
    • total.branch.length: Total length of all the branches (in mm)
    • stemdiam.mm: Basal stem diameter (in mm)
    • maxSpineLength.mm: Length of the longest spine
    • postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
    • date.clipped: Date clipped
    • date.measured: Date measured
    • date.germinated: Date germinated
    • Age.of.plant: Date measured - Date germinated
    • newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)

    File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")

    The data consist of one .csv file with the following column names:

    • treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
    • plot_rep: One of three randomised plots per treatment
    • matrix_no: Where in the plot the individual was placed
    • species_code: First three letters of the genus name and first three letters of the species name; uniquely identifies the species
    • species: Full species name
    • sample_period: Classification of sampling period into time since clip
    • status: Alive or Dead
    • standing.height: Vertical height above ground (in mm)
    • height.mm: Length of the longest branch (in mm)
    • total.branch.length: Total length of all the branches (in mm)
    • stemdiam.mm: Basal stem diameter (in mm)
    • maxSpineLength.mm: Length of the longest spine
    • postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
    • date.clipped: Date clipped
    • date.measured: Date measured
    • date.germinated: Date germinated
    • Age.of.plant: Date measured - Date germinated
    • newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
    • genus: Genus
    • MAR: Mean Annual Rainfall for that species' distribution (mm)
    • rainclass: High/medium/low

    File 3: allModelParameters_byAge.csv (required to run the R code "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

    • Age.of.plant: Age in days
    • species_code: Species code
    • pred_SD_mm: Predicted stem diameter in mm
    • pred_SD_up: Top 75th quantile of stem diameter in mm
    • pred_SD_low: Bottom 25th quantile of stem diameter in mm
    • treatdate: Date when clipped
    • pred_surv: Predicted survival probability
    • pred_surv_low: Predicted 25th quantile survival probability
    • pred_surv_high: Predicted 75th quantile survival probability
    • Bite.probability: Daily probability of being eaten
    • max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
    • duiker_sd: Standard deviation of bite diameter for a duiker for this species
    • max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
    • kudu_sd: Standard deviation of bite diameter for a kudu for this species
    • mean_bite_diam_duiker_mm: Mean bite diameter of a duiker for this species
    • duiker_mean_sd: Standard deviation of the mean bite diameter for a duiker
    • mean_bite_diameter_kudu_mm: Mean bite diameter of a kudu for this species
    • kudu_mean_sd: Standard deviation of the mean bite diameter for a kudu
    • genus: Genus
    • rainclass: Low/med/high

    File 4: EatProbParameters_June2020.csv (required to run the R code "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

    • shtspec: Species name
    • species_code: Species code
    • genus: Genus
    • rainclass: Low/medium/high
    • seed mass: Mass of seed (g per 1000 seeds)
    • Surv_intercept: Coefficient of the model predicting survival from age of clip for this species
    • Surv_slope: Coefficient of the model predicting survival from age of clip for this species
    • GR_intercept: Coefficient of the model predicting stem diameter from seedling age for this species
    • GR_slope: Coefficient of the model predicting stem diameter from seedling age for this species
    • max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
    • duiker_sd: Standard deviation of bite diameter for a duiker for this species
    • max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
    • kudu_sd: Standard deviation of bite diameter for a kudu for this species
    • mean_bite_diam_duiker_mm: Mean bite diameter of a duiker for this species
    • duiker_mean_sd: Standard deviation of the mean bite diameter for a duiker
    • mean_bite_diameter_kudu_mm: Mean bite diameter of a kudu for this species
    • kudu_mean_sd: Standard deviation of the mean bite diameter for a kudu
    • AgeAtEscape_duiker[t]: Age of plant when its stem diameter is larger than a mean duiker bite
    • AgeAtEscape_duiker_min[t]: Age of plant when its stem diameter is larger than a min duiker bite
    • AgeAtEscape_duiker_max[t]: Age of plant when its stem diameter is larger than a max duiker bite
    • AgeAtEscape_kudu[t]: Age of plant when its stem diameter is larger than a mean kudu bite
    • AgeAtEscape_kudu_min[t]: Age of plant when its stem diameter is larger than a min kudu bite
    • AgeAtEscape_kudu_max[t]: Age of plant when its stem diameter is larger than a max kudu bite
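    For illustration only (the analysis scripts above are written in R), the main survival file can be inspected with pandas; the column names follow the File 1 description:

      import pandas as pd

      seedlings = pd.read_csv("cleanedData_forAnalysis.csv")

      # Proportion of records with status "Alive" per clipping treatment.
      survival = seedlings["status"].eq("Alive").groupby(seedlings["treatment"]).mean()
      print(survival)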

  8. MaRV Scripts and Dataset

    • zenodo.org
    zip
    Updated Dec 15, 2024
    Cite
    Anonymous; Anonymous (2024). MaRV Scripts and Dataset [Dataset]. http://doi.org/10.5281/zenodo.14450098
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

    Our dataset is located at the path dataset/MaRV.json

    The guidelines for replicating the study are provided below:

    Requirements

    1. Software Dependencies:

    • Python 3.10+ with packages in requirements.txt
    • Git: Required to clone repositories.
    • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
    • PHP 8.0: Required to host the Web tool.
    • MySQL 8: Required to store the Web tool data.

    2. Environment Variables:

    • Create a .env file based on .env.example in the src folder and set the variables:
      • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
      • CLONE_DIR: Directory where repositories will be cloned.
      • JAVA_PATH: Path to the Java executable.
      • REFACTORING_MINER_PATH: Path to RefactoringMiner.
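    A minimal .env sketch based on the variables above (all paths are placeholders):

      CSV_PATH=./repositories.csv
      CLONE_DIR=./cloned-repos
      JAVA_PATH=/usr/bin/java
      REFACTORING_MINER_PATH=/opt/RefactoringMiner/bin/RefactoringMiner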

    Refactoring Technique Selection

    1. Environment Setup:

    • Ensure all dependencies are installed. Install the required Python packages with:
      pip install -r requirements.txt
      

    2. Configuring the Repositories CSV:

    • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).

    3. Executing the Script:

    • Configure the environment variables in the .env file and set up the repositories CSV, then run:
      python3 src/run_rm.py
      
    • The RefactoringMiner output from the 126 repositories of our study is available at:
      https://zenodo.org/records/14395034

    4. Script Behavior:

    • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
    • Results and Logs:
      • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
      • Logs for each repository, including error messages, are saved as .log files in the same directory.

    5. Count Refactorings:

    • To count instances for each refactoring technique, run:
      python3 src/count_refactorings.py
      
    • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.

    Data Gathering

    • To collect snippets before and after refactoring and their metadata, run:

      python3 src/diff.py '[refactoring technique]'
      

      Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

    • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

    • Dataset Availability:

      • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
    • To generate the SQL file for the Web tool, run:

      python3 src/generate_refactorings_sql.py
      

    Web Tool for Manual Evaluation

    • The Web tool scripts are available in the web directory.
    • Populate the data/output/snippets folder with the output of src/diff.py.
    • Run the sql/create_database.sql script in your database.
    • Import the SQL file generated by src/generate_refactorings_sql.py.
    • Run dataset.php to generate the MaRV dataset file.
    • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
  9. Data and Code for High Throughput FTIR Analysis of Macro and Microplastics...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Apr 19, 2023
    Cite
    Win Cowger; Lisa Roscher; Hannah Jebens; Ali Chamas; Benjamin D Maurer; Lukas Gehrke; Gunnar Gerdts; Sebastian Primpke (2023). Data and Code for High Throughput FTIR Analysis of Macro and Microplastics with Plate Readers [Dataset]. http://doi.org/10.5281/zenodo.7772572
    Explore at:
    Dataset updated
    Apr 19, 2023
    Authors
    Win Cowger; Lisa Roscher; Hannah Jebens; Ali Chamas; Benjamin D Maurer; Lukas Gehrke; Gunnar Gerdts; Sebastian Primpke
    Description

    Data and source code for reproducing the analysis conducted in "High Throughput FTIR Analysis of Macro and Microplastics with Plate Readers". All materials are licensed for noncommercial purposes (https://creativecommons.org/licenses/by-nc/4.0/).

    HIDA_Publication.R has the source code for data cleanup and analysis of the data in database.zip. databasedata.zip holds all raw and analyzed data:

    • ATR, Reflectance, and Transmission folders have all data used in the manuscript, in a raw (.0) and combined (export.csv) format for each of the plates analyzed (folder numbers).
    • Plots folder has images of each spectrum.
    • cell_information.csv has the raw ids and comments made at the time the particles were assessed.
    • classes_reference_2.csv has the transformations used to standardize Open Specy's terms to polymer classes.
    • CleanedSpectra_raw.csv has the total cleaned-up database of all spectral intensities in long format.
    • joined_cell_metadata.csv has the metadata for each plate well analyzed.
    • library_metadata.csv has metadata for each spectrum in raw form for each particle id.
    • Lisa_Plate_6.csv has the metadata from Lisa Roscher used in this study.
    • Metadata_raw.csv has the conformed metadata that can be paired with the CleanedSpectra_raw.csv file.
    • OpenSpecy_Classification_Baseline.csv has the particle metadata combined with Open Specy's classes identified after baseline correcting and smoothing the spectra with the standard Open Specy routine.
    • OpenSpecy_Classification_Raw.csv has the particle metadata combined with Open Specy's identified classes if using the raw spectra.
    • particle_spectrum_match.csv converts particle ids to their reference in the Polymer_Material_Database_AWI_V2_Win.xlsx file.
    • Polymer_Material_Database_AWI_V2_Win.xlsx has metadata on materials from Primpke's database.
    • polymer_metadata_2.csv can be used to crosswalk polymer categories to more or less specific terminology.
    • spread_os.csv is the reference database used in CleanedSpectra_raw.csv that has been spread to wide format.
    • Top Correlation Data20221201-125621.csv is a download of results from Open Specy's beta tool that provides the top ids from the reference database.

  10. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

    • Drying2022.csv: drying rate data for the 2022 experimental run
    • Weather2022.csv: weather data for the 2022 experimental run
    • Drying2023.csv: drying rate data for the 2023 experimental run
    • Weather2023.csv: weather data for the 2023 experimental run
    • disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    • disinfestant_drying_analysis.html: rendered output of the notebook
    • MS_figures.R: additional R code to create figures formatted for journal requirements
    • fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022; this allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    • fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    • data_dictionary.xlsx: descriptions of each column in the CSV data files

  11. Student Performance

    • kaggle.com
    Updated Oct 7, 2022
    + more versions
    Cite
    Aman Chauhan (2022). Student Performance [Dataset]. https://www.kaggle.com/datasets/whenamancodes/student-performance
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 7, 2022
    Dataset provided by
    Kaggle
    Authors
    Aman Chauhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

    Attributes for both Maths.csv (Math course) and Portuguese.csv (Portuguese language course) datasets:

    • school: student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
    • sex: student's sex (binary: 'F' - female or 'M' - male)
    • age: student's age (numeric: from 15 to 22)
    • address: student's home address type (binary: 'U' - urban or 'R' - rural)
    • famsize: family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
    • Pstatus: parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
    • Medu: mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    • Fedu: father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    • Mjob: mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    • Fjob: father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    • reason: reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
    • guardian: student's guardian (nominal: 'mother', 'father' or 'other')
    • traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
    • studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
    • failures: number of past class failures (numeric: n if 1<=n<3, else 4)
    • schoolsup: extra educational support (binary: yes or no)
    • famsup: family educational support (binary: yes or no)
    • paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
    • activities: extra-curricular activities (binary: yes or no)
    • nursery: attended nursery school (binary: yes or no)
    • higher: wants to take higher education (binary: yes or no)
    • internet: Internet access at home (binary: yes or no)
    • romantic: with a romantic relationship (binary: yes or no)
    • famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
    • freetime: free time after school (numeric: from 1 - very low to 5 - very high)
    • goout: going out with friends (numeric: from 1 - very low to 5 - very high)
    • Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
    • Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
    • health: current health status (numeric: from 1 - very bad to 5 - very good)
    • absences: number of school absences (numeric: from 0 to 93)

    These grades are related with the course subject, Math or Portuguese:

    • G1: first period grade (numeric: from 0 to 20)
    • G2: second period grade (numeric: from 0 to 20)
    • G3: final grade (numeric: from 0 to 20, output target)
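    A minimal pandas sketch of the G1/G2/G3 correlation noted above (the file name follows the attribute list; adjust the separator if the file is semicolon-delimited):

      import pandas as pd

      # Math course file per the description above.
      math = pd.read_csv("Maths.csv")

      # G3 (final grade) correlates strongly with the earlier period grades G1 and G2.
      print(math[["G1", "G2", "G3"]].corr())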

    More - Find More Exciting🙀 Datasets Here - An Upvote👍 A Dayᕙ(`▿´)ᕗ , Keeps Aman Hurray Hurray..... ٩(˘◡˘)۶Haha

  12. JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Feb 11, 2022
    + more versions
    Cite
    Irene Garousi-Nejad; David Tarboton (2022). JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data [Dataset]. http://doi.org/10.4211/hs.d287f010b2dd48edb0573415a56d47f8
    Explore at:
    Available download formats: zip (52.2 KB)
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    HydroShare
    Authors
    Irene Garousi-Nejad; David Tarboton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you should have a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.

    The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for computing over all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save results as CSV files, open each time series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.

    Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly

    Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files (stored, for example, in a folder called output/from_GEE) into one single CSV file, merged.csv. The Jupyter Notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
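    A rough illustration of the merging step (the actual preprocessing lives in merge_downloaded_csv_files.ipynb; the paths follow the description above):

      from pathlib import Path
      import pandas as pd

      # Folder holding the per-period CSV files downloaded from the GEE Console.
      csv_dir = Path("output/from_GEE")

      # Concatenate all downloaded CSV files into a single table and save it.
      frames = [pd.read_csv(p) for p in sorted(csv_dir.glob("*.csv"))]
      pd.concat(frames, ignore_index=True).to_csv("merged.csv", index=False)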

  13. Code underlying the publication: Wind pattern clustering of high frequent...

    • data.4tu.nl
    zip
    Updated Feb 27, 2024
    Cite
    Marcus Becker (2024). Code underlying the publication: Wind pattern clustering of high frequent field measurements for dynamic wind farm flow control [Dataset]. http://doi.org/10.4121/02cbb452-4900-4c0a-95ae-5bdb5ce42ed7.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Marcus Becker
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Time period covered
    Nov 21, 2019 - Jul 7, 2023
    Area covered
    Description

    Code used to generate the wind direction time series used in the publication "Wind pattern clustering of high frequent field measurements for dynamic wind farm flow control" by M. Becker, D. Allaerts and J.W. van Wingerden (in preparation for the TORQUE conference 2024)


    The TenneT_BSA_* files convert the raw data from the KNMI [1] into one file with all data at 119 m height, which is equivalent to the hub height of the DTU 10MW reference turbine. Note that there is a channel switch in the data, which is why there are two functions to read in the data.


    The output dataset is given in the CombinedDataAt199m.csv file.


    The two hpc06_trajectories_* files are then used to segment the data into time series of requested length. This code also contains the filtering and interpolation of the data. The output are two .csv files, one with wind direction trajectories and one with wind speed trajectories.


    Two examples are given by WindDirTraj.csv and WindVelTraj.csv - they have been generated with a length of 30 data points and with an offset of 30 data points (no overlapping).


    The code of hpc06_cluster_dir* can then be used to cluster the given data.


    The remaining files are supplementary, used to plot data, calculate distances in radial data, and so on, including the kmeans360.m function, a modified version of the Matlab kmeans function that also works for radial data.


    [1] https://dataplatform.knmi.nl/dataset/windlidar-nz-wp-platform-1s-1

  14. Data_Sheet_2_High-income ZIP codes in New York City demonstrate higher case...

    • frontiersin.figshare.com
    application/csv
    Updated Jun 20, 2024
    + more versions
    Cite
    Steven T. L. Tung; Mosammat M. Perveen; Kirsten N. Wohlars; Robert A. Promisloff; Mary F. Lee-Wong; Anthony M. Szema (2024). Data_Sheet_2_High-income ZIP codes in New York City demonstrate higher case rates during off-peak COVID-19 waves.CSV [Dataset]. http://doi.org/10.3389/fpubh.2024.1384156.s002
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Jun 20, 2024
    Dataset provided by
    Frontiers
    Authors
    Steven T. L. Tung; Mosammat M. Perveen; Kirsten N. Wohlars; Robert A. Promisloff; Mary F. Lee-Wong; Anthony M. Szema
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Introduction: Our study explores how New York City (NYC) communities of various socioeconomic strata were uniquely impacted by the COVID-19 pandemic.

    Methods: New York City ZIP codes were stratified into three bins by median income: high-income, middle-income, and low-income. Case, hospitalization, and death rates obtained from NYC Health were compared for the period between March 2020 and April 2022.

    Results: COVID-19 transmission rates among high-income populations during off-peak waves were higher than transmission rates among low-income populations. Hospitalization rates among low-income populations were higher during off-peak waves despite a lower transmission rate. Death rates during both off-peak and peak waves were higher for low-income ZIP codes.

    Discussion: This study presents evidence that while high-income areas had higher transmission rates during off-peak periods, low-income areas suffered greater adverse outcomes in terms of hospitalization and death rates. The importance of this study is that it focuses on the social inequalities that were amplified by the pandemic.

  15. Spin-Split Materials

    • data.mendeley.com
    Updated Apr 13, 2023
    Cite
    Yu He (2023). Spin-Split Materials [Dataset]. http://doi.org/10.17632/638n79nnjj.1
    Explore at:
    Dataset updated
    Apr 13, 2023
    Authors
    Yu He
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data include a table file named "total_data.csv" and four folders named "Basic properties", "Lattice parameters", "Electronic structures (with SOC)", and "Spin-split data".

    The "total_data.csv" file lists the spin-splitting material information we have screened out, such as "Formula", "E-fermi (eV)", "Space group", "Spin-split type", "Max split energy (eV)", "Basic properties", "Lattice parameters", "Electronic structures (with SOC)", "Spin-split data", and "Data source". The "Spin-split type" can be Rashba, Dresselhaus, or Zeeman. One material may have multiple spin-split band structures; the "Max split energy (eV)" gives the maximum split energy among all the split energies of the material.

    The "Basic properties" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_bp.csv"; the corresponding csv file can be found in the folder "Basic properties" and contains information such as "Space group", "Site group", "Space group number", "Band gap (PBE) (eV)", and "Total energy/atom (eV)". The "Lattice parameters" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_lp.csv"; the corresponding csv file can be found in the folder "Lattice parameters" and contains the lattice constants a, b, c, α, β, and γ. The "Electronic structures (with SOC)" column provides a png file name, such as "icsd-100114-Be2Li2Sb2_band_SOC.png"; the corresponding png file can be found in the folder "Electronic structures (with SOC)" and shows the band structure (with SOC) of the material in the range of -3 eV to 3 eV, with the spin-split bands marked in the figure. The "Spin-split data" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_Es_SOC.csv"; the details of the spin-split properties of all marked spin-split bands can be found in the corresponding csv file in the "Spin-split data" folder. That csv file contains "Point" (the number of spin-split points marked in the png file), "Spin-split type", "K-point/K-path" (the high symmetry k-point/k-path with spin splitting), "Split energy (eV)", and "Spin split parameter" (the symbol of the split energy: Er, Ed, and Ez).
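    For example, the summary table can be filtered with pandas (column names as listed above; the 0.1 eV threshold is arbitrary and only for illustration):

      import pandas as pd

      materials = pd.read_csv("total_data.csv")

      # Rashba-type entries with a maximum split energy above 0.1 eV.
      rashba = materials[
          materials["Spin-split type"].astype(str).str.contains("Rashba")
          & (materials["Max split energy (eV)"] > 0.1)
      ]
      print(rashba[["Formula", "Space group", "Max split energy (eV)"]])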

  16. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (151045619431 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
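    A small sketch of that id-to-folder mapping (it follows the examples above; zero-padding and file extensions are not specified here and are left out):

      def kernel_version_folder(version_id: int) -> str:
          """Top/sub folder for a KernelVersions id, per the layout described above."""
          top = version_id // 1_000_000             # e.g. 123 for ids 123,000,000-123,999,999
          sub = (version_id % 1_000_000) // 1_000   # e.g. 456 for ids 123,456,000-123,456,999
          return f"{top}/{sub}"

      print(kernel_version_folder(123_456_789))  # -> 123/456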

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  17. Data and scripts associated with a manuscript investigating impacts of solid...

    • search.dataone.org
    Updated Aug 21, 2023
    Cite
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg (2023). Data and scripts associated with a manuscript investigating impacts of solid phase extraction on freshwater organic matter optical signatures and mass spectrometry pairing [Dataset]. http://doi.org/10.15485/1995543
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg
    Time period covered
    Aug 30, 2021 - Sep 15, 2021
    Area covered
    Description

    This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system,” submitted to Limnology and Oceanography: Methods. The data are an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914); other associated data and field metadata can be found at that link. The goal of the manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement, and the extraction efficiencies for DOC and for common optical indices were calculated. In addition, the SPE samples were subjected to ultra-high resolution mass spectrometry and compared with the ambient and SPE-generated optical data. Finally, in addition to this cross-platform inter-comparison, we performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of each sample (ranging from ~30% to ~75%), rather than following the common practice of assuming a 60% DOC extraction efficiency for freshwater samples.

    This data package consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) a readme; (2) a data dictionary (dd); (3) file-level metadata (flmd); (4) the final data summary output from the processing script; and (5) the processing script itself. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce the manuscript statistics and figures (with the exception of what is noted below). The Data_Input folder has two subfolders, FTICR and Optics, and additionally contains the dissolved organic carbon data (SPS_NPOC_Summary.csv) and the supporting solid phase extraction volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. The data dictionary (SPS_SPE_dd.csv), file-level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are also provided.

    The FTICR subfolder contains all raw FTICR data as well as instructions for processing; post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided and can be read directly into R with the associated R-markdown file. The Optics subfolder contains all absorbance and fluorescence spectra. Fluorescence spectra have been blank corrected and inner-filter corrected, and have undergone scatter removal. This folder also contains the Matlab code used to make a portion of Figure 1 of the manuscript, to derive the spectral parameters used in the manuscript, and to run the parallel factor analysis (PARAFAC) modeling. Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are read directly into the associated R-markdown file.
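    For readers who want a quick programmatic look at the tabulated outputs outside of R, the sketch below loads the NPOC summary, the SPE volume table, and the spectral indices with pandas and computes a mass-balance DOC extraction efficiency. The file names come from the description above, but every column name (Sample_ID, NPOC_ambient_mgL, NPOC_SPE_mgL, Volume_Extracted_mL, Volume_Eluted_mL) is a hypothetical placeholder; consult the data dictionary (SPS_SPE_dd.csv) for the real headers. This is a minimal sketch under those assumptions, not the authors' published workflow.

        # Hedged sketch: mass-balance DOC extraction efficiency from the package CSVs.
        # All column names are hypothetical placeholders; check SPS_SPE_dd.csv for the
        # real headers before running.
        import pandas as pd

        npoc = pd.read_csv("Data_Input/SPS_NPOC_Summary.csv")
        volumes = pd.read_csv("Data_Input/SPS_SPE_Volumes.csv")
        indices = pd.read_csv("Data_Input/Optics/SPS_SpectralIndices.csv")

        df = npoc.merge(volumes, on="Sample_ID").merge(indices, on="Sample_ID")

        # Extraction efficiency (%) = DOC mass recovered in the SPE eluate
        #                             / DOC mass in the water passed through the cartridge
        df["DOC_EE_pct"] = 100 * (df["NPOC_SPE_mgL"] * df["Volume_Eluted_mL"]) / (
            df["NPOC_ambient_mgL"] * df["Volume_Extracted_mL"]
        )

        print(df["DOC_EE_pct"].describe())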

  18. Assembly Shellcode Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Assembly Shellcode Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/assembly-shellcode-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Assembly Shellcode Dataset

    The Largest Collection of Linux Assembly Shellcodes

    By SoLID (From Huggingface) [source]

    About this dataset

    The dataset consists of multiple files, each serving a different purpose. The validation.csv file contains a carefully selected set of assembly shellcodes reserved for validation; these shellcodes are used to check the accuracy and integrity of any models or algorithms trained on this dataset.

    The train.csv file contains the intent column, which describes the purpose or objective behind each specific shellcode, together with the corresponding assembly code snippet, so it can be used directly for supervised learning. This file is valuable for researchers, practitioners, and developers studying malicious-code analysis or developing techniques for security-related tasks.

    The test.csv file provides a further collection of assembly shellcodes that can be used as test cases to assess the performance, robustness, and generalization capability of models or methods developed in this domain.

    How to use the dataset

    Understanding the Dataset

    The dataset consists of multiple files that serve different purposes:

    • train.csv: This file contains the intent and corresponding assembly code snippets for training purposes. It can be used to train machine learning models or develop algorithms based on shellcode analysis.

    • test.csv: The test.csv file in the dataset contains a collection of assembly shellcodes specifically designed for testing purposes. You can use these shellcodes to evaluate and validate your models or analysis techniques.

    • validation.csv: The validation.csv file includes a set of assembly shellcodes that are specifically reserved for validation purposes. These shellcodes can be used separately to ensure the accuracy and reliability of your models.

    Columns in the Dataset

    The columns available in each CSV file are as follows:

    • intent: The intent column describes the purpose or objective of each specific shellcode entry. It provides information regarding what action or achievement is intended by using that particular piece of code.

    • snippet: The snippet column contains the actual assembly code corresponding to each intent entry in its respective row. It includes all necessary instructions and data required to execute the desired action specified by that intent.

    Utilizing the Dataset

    To effectively utilize this dataset, follow these general steps:

    • Familiarize yourself with assembly language: Assembly language is essential when working with shellcodes since they consist of low-level machine instructions understood by processors directly.

    • Explore intents: Start by analyzing and understanding different intents present in the dataset entries thoroughly. Each intent represents a specific goal or purpose behind creating an individual piece of code.

    • Examine snippets: Review the assembly code snippet corresponding to each intent entry. Carefully study the instructions and data used in the shellcode, as they determine the action it performs.

    • Train your models: If you are working on machine learning or algorithm development, utilize the train.csv file to train your models based on the labeled intent and snippet data provided. This step will enable you to build powerful tools for analyzing or detecting shellcodes automatically.

    • Evaluate using test datasets: Use the assembly shellcodes in test.csv to evaluate and validate your trained models or analysis techniques. This evaluation helps you gauge how well your approach generalizes to unseen shellcode; a minimal loading-and-baseline example is sketched below.
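
    As a minimal, non-authoritative sketch of steps 4 and 5, the snippet below loads train.csv and test.csv with pandas and builds a simple TF-IDF nearest-neighbour baseline that retrieves a training snippet for each test intent. The intent and snippet column names come from the dataset description above; the file paths assume the CSVs sit in the working directory, and this is only one possible baseline, not a method prescribed by the dataset.

        # Hedged baseline sketch (not the dataset authors' method):
        # retrieve the training snippet whose intent is most similar to each test intent.
        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        train = pd.read_csv("train.csv")   # columns: intent, snippet (per the description)
        test = pd.read_csv("test.csv")

        vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
        train_vecs = vectorizer.fit_transform(train["intent"].astype(str))
        test_vecs = vectorizer.transform(test["intent"].astype(str))

        # For each test intent, find the closest training intent and reuse its snippet.
        similarity = cosine_similarity(test_vecs, train_vecs)
        best_match = similarity.argmax(axis=1)

        for i in range(min(3, len(test))):
            print("TEST INTENT :", test.loc[i, "intent"])
            print("RETRIEVED   :", str(train.loc[best_match[i], "snippet"])[:80])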

    Research Ideas

    • Malware analysis: The dataset can be used for studying and analyzing various shellcode techniques used in malware attacks. Researchers and security professionals can use this dataset to develop detection and prevention mechanisms against such attacks.
    • Penetration testing: Security experts can use this dataset to simulate real-world attack scenarios and test the effectiveness of their defensive measures. By having access to a diverse range of shellcodes, they can identify vulnerabilities in systems and patch them before malicious actors exploit them.
    • Machine learning training: This dataset can be used to train machine learning models for automatic detection or classification of shellcodes. By combining the intent column (which describes the objective of each shellcode) with the corresponding assembly code snippets, researchers can develop algorithms that automatically identify the purpose or ...
  19. ECG in High Intensity Exercise Dataset

    • opendatalab.com
    • ekoizpen-zientifikoa.ehu.eus
    • +3more
    zip
    Cite
    University of Lausanne, ECG in High Intensity Exercise Dataset [Dataset]. https://opendatalab.com/OpenDataLab/ECG_in_High_Intensity_Exercise_etc
    Explore at:
    zip(17746043 bytes)Available download formats
    Dataset provided by
    University of Lausanne
    École Polytechnique Fédérale de Lausanne
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data presented here were extracted from a larger dataset collected through a collaboration between the Embedded Systems Laboratory (ESL) of the Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland, and the Institute of Sports Sciences of the University of Lausanne (ISSUL). This dataset reports the extracted segments used for an analysis of R-peak detection algorithms during high-intensity exercise.

    Protocol of the experiments
    22 subjects performed a cardio-pulmonary maximal exercise test on a cycle ergometer while wearing a gas mask. A single-lead electrocardiogram (ECG) was measured using the BIOPAC system. An initial 3 min of rest was recorded. After this baseline, the subjects started cycling at a power of 60 W or 90 W, depending on their fitness level. The power of the cycle ergometer was then increased by 30 W every 3 min until exhaustion (in terms of maximum oxygen uptake, or VO2max). Finally, physiology experts assessed the so-called ventilatory thresholds and VO2max from the pulmonary data (volumes of oxygen and CO2).

    Description of the extracted dataset
    Only 20 of the 22 subjects are reported, because for two subjects (subjects 5 and 12) the signals were too corrupted or incomplete. The ECG signal was sampled at 500 Hz and then downsampled to 250 Hz. The original ECG signals were measured at a maximum of 10 mV and then scaled down by a factor of 1000, so the data are expressed in uV. For each subject, 5 segments of 20 s were extracted from the ECG recordings, chosen from different phases of the maximal exercise test (before and after the so-called second ventilatory threshold, VT2; before and in the middle of VO2max; and during the recovery after exhaustion) to represent different intensities of physical activity:

        seg1 --> [VT2-50, VT2-30]
        seg2 --> [VT2+60, VT2+80]
        seg3 --> [VO2max-50, VO2max-30]
        seg4 --> [VO2max-10, VO2max+10]
        seg5 --> [VO2max+60, VO2max+80]

    The R-peak locations were manually annotated in all segments and reviewed by a physician of the Lausanne University Hospital (CHUV). Only segment 5 of subject 9 could not be annotated because of a problem with the input signal, so the total number of extracted segments is 20 * 5 - 1 = 99.

    Format of the extracted dataset
    The dataset is divided into two main folders. The folder ecg_segments/ contains the ECG signals saved in two formats, .csv and .mat, and includes both raw (ecg_raw) and processed (ecg) signals. The processing consists of morphological filtering followed by a relative-energy-based enhancement of the R peaks. The .csv files contain only the signal, while the .mat files include the signal, the time vector within the maximal stress test, the sampling frequency, and the unit of the signal amplitude (uV, as mentioned above). The folder manual_annotations/ contains the sample indices of the annotated R peaks in .csv format. The annotations were made on the processed signals.
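
    As an illustrative, hedged starting point, the sketch below loads one processed segment and its manual annotations with pandas, runs a generic peak detector from SciPy, and counts how many annotated R peaks have a detection within 50 ms. The exact file names inside ecg_segments/ and manual_annotations/ are assumptions for illustration; the 250 Hz sampling rate comes from the description above, and the detector is a generic baseline, not one of the algorithms evaluated in the study.

        # Hedged sketch: compare a generic SciPy peak detector against the manual
        # R-peak annotations for one 20 s segment (file names are hypothetical).
        import numpy as np
        import pandas as pd
        from scipy.signal import find_peaks

        FS = 250  # Hz, per the dataset description

        ecg = pd.read_csv("ecg_segments/subject01_seg1.csv", header=None).to_numpy().ravel()
        annotated = pd.read_csv("manual_annotations/subject01_seg1.csv", header=None).to_numpy().ravel()

        # Generic detector: peaks at least 0.3 s apart and above an amplitude threshold.
        detected, _ = find_peaks(ecg, distance=int(0.3 * FS), height=np.percentile(ecg, 95))

        # Count annotated R peaks with a detection within +/- 50 ms.
        tolerance = int(0.05 * FS)
        hits = sum(np.any(np.abs(detected - a) <= tolerance) for a in annotated)
        print(f"{hits}/{len(annotated)} annotated R peaks have a detection within 50 ms")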

  20. MetaGraspNet Difficulty 1

    • kaggle.com
    zip
    Updated Mar 19, 2022
    + more versions
    Cite
    Yuhao Chen (2022). MetaGraspNet Difficulty 1 [Dataset]. https://www.kaggle.com/datasets/metagrasp/metagraspnetdifficulty1-easy
    Explore at:
    zip(4103890817 bytes)Available download formats
    Dataset updated
    Mar 19, 2022
    Authors
    Yuhao Chen
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    MetaGraspNet dataset

    This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663 ).

    There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particularly impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, and pick planning. While significant progress has been made in leveraging machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.

    To tackle this big, diverse data problem, we are inspired by the recent rise of the metaverse concept, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulty levels to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications than existing general-purpose metrics. This repository contains the first phase of the MetaGraspNet benchmark dataset, which includes detailed object detection, segmentation, and layout annotations, as well as a script for the layout-weighted performance metric (https://github.com/y2863/MetaGraspNet ).

    Figure: https://raw.githubusercontent.com/y2863/MetaGraspNet/main/.github/500.png

    Citing MetaGraspNet

    If you use the MetaGraspNet dataset or metric in your research, please use the following BibTeX entry:

        @article{chen2021metagraspnet,
          author  = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and Alexander Wong},
          title   = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis},
          journal = {arXiv preprint arXiv:2112.14663},
          year    = {2021}
        }

    File Structure

    This dataset is arranged in the following file structure:

    root
    |-- meta-grasp
      |-- scene0
        |-- 0_camera_params.json
        |-- 0_depth.png
        |-- 0_rgb.png
        |-- 0_order.csv
        ...
      |-- scene1
      ...
    |-- difficulty-n-coco-label.json
    

    Each scene is a unique arrangement of objects, which we then capture from various different angles. For each shot of a scene, we provide the camera parameters (x_camera_params.json), a depth image (x_depth.png), an RGB image (x_rgb.png), as well as a matrix representation of the ordering of the objects (x_order.csv). The full labels for the images are available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the COCO data format.
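
    As a hedged sketch of how one shot might be loaded given the file layout above, the snippet below reads the camera parameters, RGB image, and depth image for shot 0 of scene0. The paths follow the tree shown earlier; the choice of Pillow and NumPy, and the assumption that the dataset has been extracted into the working directory, are illustrative only.

        # Hedged sketch: load one shot of a scene using the file layout shown above.
        import json
        from pathlib import Path

        import numpy as np
        from PIL import Image

        shot_dir = Path("meta-grasp/scene0")
        idx = 0  # shot index within the scene

        with open(shot_dir / f"{idx}_camera_params.json") as f:
            camera_params = json.load(f)  # exact structure depends on the dataset release

        rgb = np.asarray(Image.open(shot_dir / f"{idx}_rgb.png"))      # H x W x 3
        depth = np.asarray(Image.open(shot_dir / f"{idx}_depth.png"))  # H x W, encoding per the docs

        print(rgb.shape, depth.shape, list(camera_params)[:5])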

    Understanding order.csv

    The matrix describes a pairwise obstruction relationship between each object within the image. Given a "parent" object covering a "child" object: relationship_matrix[child_id, parent_id] = -1
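
    To make the convention concrete, here is a small hedged sketch that reads one x_order.csv into a NumPy matrix and lists the (child, parent) pairs implied by the -1 entries. The file path and the assumption that the CSV is a plain numeric matrix with no header are illustrative only.

        # Hedged sketch: recover pairwise obstruction relationships from an order matrix.
        import numpy as np

        # Assumes the CSV is a plain numeric matrix with no header (illustrative only).
        relationship_matrix = np.loadtxt("meta-grasp/scene0/0_order.csv", delimiter=",")

        # Per the description above: relationship_matrix[child_id, parent_id] == -1
        # means the object with parent_id covers (obstructs) the object with child_id.
        children, parents = np.where(relationship_matrix == -1)
        for child_id, parent_id in zip(children, parents):
            print(f"object {parent_id} covers object {child_id}")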
