26 datasets found
  1. Harmonized Tariff Schedule of the United States (2025)

    • catalog.data.gov
    Updated Jul 11, 2025
    + more versions
    Cite
    Office of Tariff Affairs and Trade Agreements (2025). Harmonized Tariff Schedule of the United States (2025) [Dataset]. https://catalog.data.gov/dataset/harmonized-tariff-schedule-of-the-united-states-2024
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Office of Tariff Affairs and Trade Agreements
    Description

    This dataset is the current 2025 Harmonized Tariff Schedule plus all revisions for the current year. It provides the applicable tariff rates and statistical categories for all merchandise imported into the United States; it is based on the international Harmonized System, the global system of nomenclature that is used to describe most world trade in goods.

  2. Mapping of Goods and Services Identification Number to United Nations...

    • open.canada.ca
    • datasets.ai
    • +1more
    csv, html, xml
    Updated Jan 21, 2025
    + more versions
    Cite
    Public Services and Procurement Canada (2025). Mapping of Goods and Services Identification Number to United Nations Standard Products and Services Code [Dataset]. https://open.canada.ca/data/en/dataset/588eab5b-7b16-4a26-b996-23b955965ffa
    Explore at:
    Available download formats: xml, csv, html
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Public Services and Procurement Canada (http://www.pwgsc.gc.ca/)
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    In 2021, an international goods and services classification for procurement, the United Nations Standard Products and Services Code (UNSPSC, v21), was implemented to replace the Government of Canada's Goods and Services Identification Number (GSIN) codes for categorizing procurement activities undertaken by the Government of Canada. For the transition from GSIN to UNSPSC, a subset of the entire version 21 UNSPSC list was created. The Mapping of GSIN-UNSPSC file below provides a suggested linkage between the subset of UNSPSC and higher levels of the GSIN code list. As procurement needs evolve, this file may be updated to include other UNSPSC v21 codes that are deemed to be required. In the interim, if the lowest-level values within the UNSPSC structure do not relate to a specific category of goods or services, the use of the higher (related) level code from within the UNSPSC structure is appropriate.

    Please note: This dataset is offered as a means to assist the user in finding specific UNSPSC codes, based on high-level comparisons to the legacy GSIN codes. It should not be considered a direct one-to-one mapping of the two categorization systems. For some categories, the linkages were only assessed at higher levels of the two structures and then simply carried through to the related lower categories beneath those values. Because the two systems do not necessarily group items in the same way throughout their structures, this can result in confusing connections in some cases. Please always select the UNSPSC code that best describes the applicable goods or services, even if the associated GSIN value as shown in this file is not directly relevant.

    The data is available in Comma Separated Values (CSV) format and can be downloaded to sort, filter, and search information. The United Nations Standard Products and Services Code (UNSPSC) page on CanadaBuys offers a comprehensive guide on how to use this reference file. The Finding and using UNSPSC Codes page from CanadaBuys also contains additional information which may be of use. This dataset was originally published on June 22, 2016. The format and contents of the CSV file were revised on May 12, 2021. A copy of the original file was archived as a secondary resource to this dataset at that time (labelled ARCHIVED - Mapping of GSIN-UNSPSC in the resource list below).

    As of March 23, 2023, the data dictionary linked below includes entries for both the current and archived versions of the datafile, as well as for the datafiles of the Goods and Services Identification Number (GSIN) dataset and the archived United Nations Standard Products and Services Codes (v10, released 2007) dataset.

  3. EU Customs Tariff (TARIC)

    • data.europa.eu
    html
    Cite
    Directorate-General for Taxation and Customs Union, EU Customs Tariff (TARIC) [Dataset]. https://data.europa.eu/data/datasets/eu-customs-tariff-taric?locale=en
    Explore at:
    Available download formats: html
    Dataset authored and provided by
    Directorate-General for Taxation and Customs Union
    License

    http://data.europa.eu/eli/dec/2011/833/oj

    Area covered
    European Union
    Description

    Multilingual database covering all measures relating to tariff, commercial and agricultural legislation. Provides a clear view of what to do when importing or exporting goods.

    TARIC, the integrated Tariff of the European Union, is a multilingual database in which are integrated all measures relating to EU customs tariff, commercial and agricultural legislation. By integrating and coding these measures, the TARIC secures their uniform application by all Member States and gives all economic operators a clear view of all measures to be undertaken when importing into the EU or exporting goods from the EU. It also makes it possible to collect EU-wide statistics for the measures concerned.

    The TARIC contains the following main categories of measures:

    • Tariff measures;

    • Agricultural measures;

    • Trade Defence instruments;

    • Prohibitions and restrictions to import and export;

    • Surveillance of movements of goods at import and export.

    For tariff information, more details can be found under the European Binding Tariff Information (EBTI).

  4. Data from: CSV file of names, times, and locations of images collected by an...

    • s.cnmilf.com
    • catalog.data.gov
    • +1more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). CSV file of names, times, and locations of images collected by an unmanned aerial system (UAS) flying over Black Beach, Falmouth, Massachusetts on 18 March 2016 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/csv-file-of-names-times-and-locations-of-images-collected-by-an-unmanned-aerial-system-uas
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Falmouth, Massachusetts, Black Beach
    Description

    Imagery acquired with unmanned aerial systems (UAS) and coupled with structure from motion (SfM) photogrammetry can produce high-resolution topographic and visual reflectance datasets that rival or exceed lidar and orthoimagery. These new techniques are particularly useful for data collection of coastal systems, which requires high temporal and spatial resolution datasets. The U.S. Geological Survey worked in collaboration with members of the Marine Biological Laboratory and Woods Hole Analytics at Black Beach, in Falmouth, Massachusetts to explore scientific research demands on UAS technology for topographic and habitat mapping applications. This project explored the application of consumer-grade UAS platforms as a cost-effective alternative to lidar and aerial/satellite imagery to support coastal studies requiring high-resolution elevation or remote sensing data. A small UAS was used to capture low-altitude photographs and GPS devices were used to survey reference points. These data were processed in an SfM workflow to create an elevation point cloud, an orthomosaic image, and a digital elevation model.

  5. journal-of-student-research-hs-articles

    • huggingface.co
    Cite
    Igor Katkov, journal-of-student-research-hs-articles [Dataset]. https://huggingface.co/datasets/ikatkov/journal-of-student-research-hs-articles
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Igor Katkov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    High school research articles crawled from the Journal of Student Research (https://www.jsr.org/).

      Dataset Structure

    CSV file with the following columns: title, URL, names, date, abstract

      Source Data
    

    https://www.jsr.org/hs/index.php/path/section/view/hs-research-articles/
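    A minimal loading sketch with the Hugging Face datasets library (the split name and default configuration are assumptions, not stated on the card):

      from datasets import load_dataset

      # Pulls the CSV-backed dataset from the Hugging Face Hub (split name assumed).
      ds = load_dataset("ikatkov/journal-of-student-research-hs-articles", split="train")

      # Columns per the dataset card: title, URL, names, date, abstract.
      print(ds.column_names)
      print(ds[0]["title"])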

  6. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
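    As an illustration only (not part of the artifact), the metadata tables can be joined with pandas once metadata.zip is extracted; the column names follow the tables above, and the file paths are assumptions:

      import pandas as pd

      repositories = pd.read_csv("repositories.csv")
      programs = pd.read_csv("programs.csv")

      # Keep only repositories whose licenses permit redistribution
      # (assumes the column is parsed as a boolean).
      redistributable = repositories[repositories["redistributable"]]

      # Attach repository metadata to each IaC program via the repository ID.
      merged = programs.merge(
          redistributable,
          left_on="repository",
          right_on="ID",
          suffixes=("_program", "_repository"),
      )
      print(merged.groupby(["solution", "language"]).size())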

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. Github access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
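    Based on the documented argument order, hypothetical invocations might look as follows (the token variable and output file names are placeholders, not part of the dataset):

      # Pulumi: project file name "Pulumi" with extensions yml or yaml, any file size
      python3 search-repositories.py $GITHUB_TOKEN pulumi-repos.csv Pulumi yml,yaml 0 '*'

      # AWS CDK: project file name "cdk" with extension json
      python3 search-repositories.py $GITHUB_TOKEN awscdk-repos.csv cdk json 0 '*'

      # CDKTF: project file name "cdktf" with extension json
      python3 search-repositories.py $GITHUB_TOKEN cdktf-repos.csv cdktf json 0 '*'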

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in the CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
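    A hypothetical invocation following that argument order (all file and directory names are placeholders):

      python3 download-repositories.py pulumi-repos.csv,awscdk-repos.csv,cdktf-repos.csv downloads/ download-overview.csv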

  7. Data from: Dataset from: Browsing is a strong filter for savanna tree...

    • data.niaid.nih.gov
    Updated Oct 1, 2021
    Cite
    Wayne Twine (2021). Dataset from: Browsing is a strong filter for savanna tree seedlings in their first growing season [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4972083
    Explore at:
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    Craddock Mthabini
    Archibald, Sally
    Wayne Twine
    Nicola Stevens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data presented here were used to produce the following paper:

    Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.

    The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588

    For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za

    Description of file(s):

    File 1: cleanedData_forAnalysis.csv (required to run the R code "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")

    The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.

    The data consist of one .csv file with the following column names:

    • treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
    • plot_rep: One of three randomised plots per treatment
    • matrix_no: Where in the plot the individual was placed
    • species_code: First three letters of the genus name and first three letters of the species name; uniquely identifies the species
    • species: Full species name
    • sample_period: Classification of sampling period into time since clip
    • status: Alive or Dead
    • standing.height: Vertical height above ground (in mm)
    • height.mm: Length of the longest branch (in mm)
    • total.branch.length: Total length of all the branches (in mm)
    • stemdiam.mm: Basal stem diameter (in mm)
    • maxSpineLength.mm: Length of the longest spine
    • postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
    • date.clipped: Date clipped
    • date.measured: Date measured
    • date.germinated: Date germinated
    • Age.of.plant: Date measured - Date germinated
    • newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)

    File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")

    The data consist of one .csv file with the following column names:

    • treatment: Clipping treatment (1 - 5 months clip plus control unclipped)
    • plot_rep: One of three randomised plots per treatment
    • matrix_no: Where in the plot the individual was placed
    • species_code: First three letters of the genus name and first three letters of the species name; uniquely identifies the species
    • species: Full species name
    • sample_period: Classification of sampling period into time since clip
    • status: Alive or Dead
    • standing.height: Vertical height above ground (in mm)
    • height.mm: Length of the longest branch (in mm)
    • total.branch.length: Total length of all the branches (in mm)
    • stemdiam.mm: Basal stem diameter (in mm)
    • maxSpineLength.mm: Length of the longest spine
    • postclipStemNo: Number of resprouting stems (only recorded AFTER clipping)
    • date.clipped: Date clipped
    • date.measured: Date measured
    • date.germinated: Date germinated
    • Age.of.plant: Date measured - Date germinated
    • newtreat: Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
    • genus: Genus
    • MAR: Mean Annual Rainfall for that species' distribution (mm)
    • rainclass: High/medium/low

    File 3: allModelParameters_byAge.csv (required to run the R code "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

    • Age.of.plant: Age in days
    • species_code: Species code
    • pred_SD_mm: Predicted stem diameter in mm
    • pred_SD_up: Top 75th quantile of stem diameter in mm
    • pred_SD_low: Bottom 25th quantile of stem diameter in mm
    • treatdate: Date when clipped
    • pred_surv: Predicted survival probability
    • pred_surv_low: Predicted 25th quantile survival probability
    • pred_surv_high: Predicted 75th quantile survival probability
    • Bite.probability: Daily probability of being eaten
    • max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
    • duiker_sd: Standard deviation of bite diameter for a duiker for this species
    • max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
    • kudu_sd: Standard deviation of bite diameter for a kudu for this species
    • mean_bite_diam_duiker_mm: Mean bite diameter of a duiker for this species
    • duiker_mean_sd: Standard deviation of the mean bite diameter for a duiker
    • mean_bite_diameter_kudu_mm: Mean bite diameter of a kudu for this species
    • kudu_mean_sd: Standard deviation of the mean bite diameter for a kudu
    • genus: Genus
    • rainclass: Low/med/high

    File 4: EatProbParameters_June2020.csv (required to run the R code "FinalModelSeedlingSurvival_June2021_.R")

    Consists of a .csv file with the following column headings

    • shtspec: Species name
    • species_code: Species code
    • genus: Genus
    • rainclass: Low/medium/high
    • seed mass: Mass of seed (g per 1000 seeds)
    • Surv_intercept: Coefficient of the model predicting survival from age of clip for this species
    • Surv_slope: Coefficient of the model predicting survival from age of clip for this species
    • GR_intercept: Coefficient of the model predicting stem diameter from seedling age for this species
    • GR_slope: Coefficient of the model predicting stem diameter from seedling age for this species
    • max_bite_diam_duiker_mm: Maximum bite diameter of a duiker for this species
    • duiker_sd: Standard deviation of bite diameter for a duiker for this species
    • max_bite_diameter_kudu_mm: Maximum bite diameter of a kudu for this species
    • kudu_sd: Standard deviation of bite diameter for a kudu for this species
    • mean_bite_diam_duiker_mm: Mean bite diameter of a duiker for this species
    • duiker_mean_sd: Standard deviation of the mean bite diameter for a duiker
    • mean_bite_diameter_kudu_mm: Mean bite diameter of a kudu for this species
    • kudu_mean_sd: Standard deviation of the mean bite diameter for a kudu
    • AgeAtEscape_duiker[t]: Age of plant when its stem diameter is larger than a mean duiker bite
    • AgeAtEscape_duiker_min[t]: Age of plant when its stem diameter is larger than a min duiker bite
    • AgeAtEscape_duiker_max[t]: Age of plant when its stem diameter is larger than a max duiker bite
    • AgeAtEscape_kudu[t]: Age of plant when its stem diameter is larger than a mean kudu bite
    • AgeAtEscape_kudu_min[t]: Age of plant when its stem diameter is larger than a min kudu bite
    • AgeAtEscape_kudu_max[t]: Age of plant when its stem diameter is larger than a max kudu bite
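    For illustration only (the analysis scripts above are written in R), the main survival file can be inspected with pandas; the column names follow the File 1 description:

      import pandas as pd

      seedlings = pd.read_csv("cleanedData_forAnalysis.csv")

      # Proportion of records with status "Alive" per clipping treatment.
      survival = seedlings["status"].eq("Alive").groupby(seedlings["treatment"]).mean()
      print(survival)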

  8. MaRV Scripts and Dataset

    • zenodo.org
    zip
    Updated Dec 15, 2024
    Cite
    Anonymous; Anonymous (2024). MaRV Scripts and Dataset [Dataset]. http://doi.org/10.5281/zenodo.14450098
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

    Our dataset is located at the path dataset/MaRV.json

    The guidelines for replicating the study are provided below:

    Requirements

    1. Software Dependencies:

    • Python 3.10+ with packages in requirements.txt
    • Git: Required to clone repositories.
    • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
    • PHP 8.0: Required to host the Web tool.
    • MySQL 8: Required to store the Web tool data.

    2. Environment Variables:

    • Create a .env file based on .env.example in the src folder and set the variables:
      • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
      • CLONE_DIR: Directory where repositories will be cloned.
      • JAVA_PATH: Path to the Java executable.
      • REFACTORING_MINER_PATH: Path to RefactoringMiner.
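    A minimal .env sketch based on the variables above (all paths are placeholders):

      CSV_PATH=./repositories.csv
      CLONE_DIR=./cloned-repos
      JAVA_PATH=/usr/bin/java
      REFACTORING_MINER_PATH=/opt/RefactoringMiner/bin/RefactoringMiner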

    Refactoring Technique Selection

    1. Environment Setup:

    • Ensure all dependencies are installed. Install the required Python packages with:
      pip install -r requirements.txt
      

    2. Configuring the Repositories CSV:

    • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).

    3. Executing the Script:

    • Configure the environment variables in the .env file and set up the repositories CSV, then run:
      python3 src/run_rm.py
      
    • The RefactoringMiner output from the 126 repositories of our study is available at:
      https://zenodo.org/records/14395034

    4. Script Behavior:

    • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
    • Results and Logs:
      • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
      • Logs for each repository, including error messages, are saved as .log files in the same directory.

    5. Count Refactorings:

    • To count instances for each refactoring technique, run:
      python3 src/count_refactorings.py
      
    • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.

    Data Gathering

    • To collect snippets before and after refactoring and their metadata, run:

      python3 src/diff.py '[refactoring technique]'
      

      Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

    • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

    • Dataset Availability:

      • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
    • To generate the SQL file for the Web tool, run:

      python3 src/generate_refactorings_sql.py
      

    Web Tool for Manual Evaluation

    • The Web tool scripts are available in the web directory.
    • Populate the data/output/snippets folder with the output of src/diff.py.
    • Run the sql/create_database.sql script in your database.
    • Import the SQL file generated by src/generate_refactorings_sql.py.
    • Run dataset.php to generate the MaRV dataset file.
    • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
  9. Data and Code for High Throughput FTIR Analysis of Macro and Microplastics...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Apr 19, 2023
    Cite
    Win Cowger; Lisa Roscher; Hannah Jebens; Ali Chamas; Benjamin D Maurer; Lukas Gehrke; Gunnar Gerdts; Sebastian Primpke (2023). Data and Code for High Throughput FTIR Analysis of Macro and Microplastics with Plate Readers [Dataset]. http://doi.org/10.5281/zenodo.7772572
    Explore at:
    Dataset updated
    Apr 19, 2023
    Authors
    Win Cowger; Lisa Roscher; Hannah Jebens; Ali Chamas; Benjamin D Maurer; Lukas Gehrke; Gunnar Gerdts; Sebastian Primpke
    Description

    Data and source code for reproducing the analysis conducted in "High Throughput FTIR Analysis of Macro and Microplastics with Plate Readers". All materials are licensed for noncommercial purposes (https://creativecommons.org/licenses/by-nc/4.0/).

    HIDA_Publication.R has the source code for data cleanup and analysis of the data in database.zip. databasedata.zip holds all raw and analyzed data:

    • ATR, Reflectance, and Transmission folders have all data used in the manuscript, in a raw (.0) and combined (export.csv) format for each of the plates analyzed (folder numbers).
    • Plots folder has images of each spectrum.
    • cell_information.csv has the raw ids and comments made at the time the particles were assessed.
    • classes_reference_2.csv has the transformations used to standardize Open Specy's terms to polymer classes.
    • CleanedSpectra_raw.csv has the total cleaned-up database of all spectral intensities in long format.
    • joined_cell_metadata.csv has the metadata for each plate well analyzed.
    • library_metadata.csv has metadata for each spectrum in raw form for each particle id.
    • Lisa_Plate_6.csv has the metadata from Lisa Roscher used in this study.
    • Metadata_raw.csv has the conformed metadata that can be paired with the CleanedSpectra_raw.csv file.
    • OpenSpecy_Classification_Baseline.csv has the particle metadata combined with Open Specy's classes identified after baseline correcting and smoothing the spectra with the standard Open Specy routine.
    • OpenSpecy_Classification_Raw.csv has the particle metadata combined with Open Specy's identified classes if using the raw spectra.
    • particle_spectrum_match.csv converts particle ids to their reference in the Polymer_Material_Database_AWI_V2_Win.xlsx file.
    • Polymer_Material_Database_AWI_V2_Win.xlsx has metadata on materials from Primpke's database.
    • polymer_metadata_2.csv can be used to crosswalk polymer categories to more or less specific terminology.
    • spread_os.csv is the reference database used in CleanedSpectra_raw.csv that has been spread to wide format.
    • Top Correlation Data20221201-125621.csv is a download of results from Open Specy's beta tool that provides the top ids from the reference database.

  10. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

    • Drying2022.csv: drying rate data for the 2022 experimental run
    • Weather2022.csv: weather data for the 2022 experimental run
    • Drying2023.csv: drying rate data for the 2023 experimental run
    • Weather2023.csv: weather data for the 2023 experimental run
    • disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    • disinfestant_drying_analysis.html: rendered output of the notebook
    • MS_figures.R: additional R code to create figures formatted for journal requirements
    • fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022; this allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    • fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    • data_dictionary.xlsx: descriptions of each column in the CSV data files

  11. Student Performance

    • kaggle.com
    Updated Oct 7, 2022
    + more versions
    Cite
    Aman Chauhan (2022). Student Performance [Dataset]. https://www.kaggle.com/datasets/whenamancodes/student-performance
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 7, 2022
    Dataset provided by
    Kaggle
    Authors
    Aman Chauhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

    Attributes for both Maths.csv (Math course) and Portuguese.csv (Portuguese language course) datasets:

    • school: student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
    • sex: student's sex (binary: 'F' - female or 'M' - male)
    • age: student's age (numeric: from 15 to 22)
    • address: student's home address type (binary: 'U' - urban or 'R' - rural)
    • famsize: family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
    • Pstatus: parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
    • Medu: mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    • Fedu: father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    • Mjob: mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    • Fjob: father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    • reason: reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
    • guardian: student's guardian (nominal: 'mother', 'father' or 'other')
    • traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
    • studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
    • failures: number of past class failures (numeric: n if 1<=n<3, else 4)
    • schoolsup: extra educational support (binary: yes or no)
    • famsup: family educational support (binary: yes or no)
    • paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
    • activities: extra-curricular activities (binary: yes or no)
    • nursery: attended nursery school (binary: yes or no)
    • higher: wants to take higher education (binary: yes or no)
    • internet: Internet access at home (binary: yes or no)
    • romantic: with a romantic relationship (binary: yes or no)
    • famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
    • freetime: free time after school (numeric: from 1 - very low to 5 - very high)
    • goout: going out with friends (numeric: from 1 - very low to 5 - very high)
    • Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
    • Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
    • health: current health status (numeric: from 1 - very bad to 5 - very good)
    • absences: number of school absences (numeric: from 0 to 93)

    These grades are related with the course subject, Math or Portuguese:

    • G1: first period grade (numeric: from 0 to 20)
    • G2: second period grade (numeric: from 0 to 20)
    • G3: final grade (numeric: from 0 to 20, output target)
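    A minimal pandas sketch of the G1/G2/G3 correlation noted above (the file name follows the attribute list; adjust the separator if the file is semicolon-delimited):

      import pandas as pd

      # Math course file per the description above.
      math = pd.read_csv("Maths.csv")

      # G3 (final grade) correlates strongly with the earlier period grades G1 and G2.
      print(math[["G1", "G2", "G3"]].corr())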

    More - Find More Exciting🙀 Datasets Here - An Upvote👍 A Dayᕙ(`▿´)ᕗ , Keeps Aman Hurray Hurray..... ٩(˘◡˘)۶Haha

  12. JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Feb 11, 2022
    + more versions
    Cite
    Irene Garousi-Nejad; David Tarboton (2022). JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data [Dataset]. http://doi.org/10.4211/hs.d287f010b2dd48edb0573415a56d47f8
    Explore at:
    Available download formats: zip (52.2 KB)
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    HydroShare
    Authors
    Irene Garousi-Nejad; David Tarboton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you should have a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.

    The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for computing over all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save results as CSV files, open each time series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.

    Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly

    Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files (stored, for example, in a folder called output/from_GEE) into one single CSV file, merged.csv. The Jupyter Notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
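    A rough illustration of the merging step (the actual preprocessing lives in merge_downloaded_csv_files.ipynb; the paths follow the description above):

      from pathlib import Path
      import pandas as pd

      # Folder holding the per-period CSV files downloaded from the GEE Console.
      csv_dir = Path("output/from_GEE")

      # Concatenate all downloaded CSV files into a single table and save it.
      frames = [pd.read_csv(p) for p in sorted(csv_dir.glob("*.csv"))]
      pd.concat(frames, ignore_index=True).to_csv("merged.csv", index=False)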

  13. Code underlying the publication: Wind pattern clustering of high frequent...

    • data.4tu.nl
    zip
    Updated Feb 27, 2024
    Cite
    Marcus Becker (2024). Code underlying the publication: Wind pattern clustering of high frequent field measurements for dynamic wind farm flow control [Dataset]. http://doi.org/10.4121/02cbb452-4900-4c0a-95ae-5bdb5ce42ed7.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 27, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Marcus Becker
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Time period covered
    Nov 21, 2019 - Jul 7, 2023
    Area covered
    Description

    Code used to generate the wind direction time series used in the publication "Wind pattern clustering of high frequent field measurements for dynamic wind farm flow control" by M. Becker, D. Allaerts and J.W. van Wingerden (in preparation for the TORQUE conference 2024)


    The TenneT_BSA_* files convert the raw data from the KNMI [1] into one file with all data at 119 m height, which is equivalent to the hub height of the DTU 10MW reference turbine. Note that there is a channel switch in the data, which is why there are two functions to read in the data.


    The output dataset is given in the CombinedDataAt199m.csv file.


    The two hpc06_trajectories_* files are then used to segment the data into time series of requested length. This code also contains the filtering and interpolation of the data. The output are two .csv files, one with wind direction trajectories and one with wind speed trajectories.


    Two examples are given by WindDirTraj.csv and WindVelTraj.csv - they have been generated with a length of 30 data points and with an offset of 30 data points (no overlapping).


    The code of hpc06_cluster_dir* can then be used to cluster the given data.


    The remaining files are supplementary, used to plot data, calculate distances in radial data, and so on, including the kmeans360.m function, a modified version of the Matlab kmeans function that also works for radial data.


    [1] https://dataplatform.knmi.nl/dataset/windlidar-nz-wp-platform-1s-1

  14. Data_Sheet_2_High-income ZIP codes in New York City demonstrate higher case...

    • frontiersin.figshare.com
    application/csv
    Updated Jun 20, 2024
    + more versions
    Cite
    Steven T. L. Tung; Mosammat M. Perveen; Kirsten N. Wohlars; Robert A. Promisloff; Mary F. Lee-Wong; Anthony M. Szema (2024). Data_Sheet_2_High-income ZIP codes in New York City demonstrate higher case rates during off-peak COVID-19 waves.CSV [Dataset]. http://doi.org/10.3389/fpubh.2024.1384156.s002
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Jun 20, 2024
    Dataset provided by
    Frontiers
    Authors
    Steven T. L. Tung; Mosammat M. Perveen; Kirsten N. Wohlars; Robert A. Promisloff; Mary F. Lee-Wong; Anthony M. Szema
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Introduction: Our study explores how New York City (NYC) communities of various socioeconomic strata were uniquely impacted by the COVID-19 pandemic.

    Methods: New York City ZIP codes were stratified into three bins by median income: high-income, middle-income, and low-income. Case, hospitalization, and death rates obtained from NYC Health were compared for the period between March 2020 and April 2022.

    Results: COVID-19 transmission rates among high-income populations during off-peak waves were higher than transmission rates among low-income populations. Hospitalization rates among low-income populations were higher during off-peak waves despite a lower transmission rate. Death rates during both off-peak and peak waves were higher for low-income ZIP codes.

    Discussion: This study presents evidence that while high-income areas had higher transmission rates during off-peak periods, low-income areas suffered greater adverse outcomes in terms of hospitalization and death rates. The importance of this study is that it focuses on the social inequalities that were amplified by the pandemic.

  15. Spin-Split Materials

    • data.mendeley.com
    Updated Apr 13, 2023
    Cite
    Yu He (2023). Spin-Split Materials [Dataset]. http://doi.org/10.17632/638n79nnjj.1
    Explore at:
    Dataset updated
    Apr 13, 2023
    Authors
    Yu He
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data include a table file named "total_data.csv" and four folders named "Basic properties", "Lattice parameters", "Electronic structures (with SOC)", and "Spin-split data".

    The "total_data.csv" file lists the spin-splitting material information we have screened out, such as "Formula", "E-fermi (eV)", "Space group", "Spin-split type", "Max split energy (eV)", "Basic properties", "Lattice parameters", "Electronic structures (with SOC)", "Spin-split data", and "Data source". The "Spin-split type" can be Rashba, Dresselhaus, or Zeeman. One material may have multiple spin-split band structures; the "Max split energy (eV)" gives the maximum split energy among all the split energies of the material.

    The "Basic properties" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_bp.csv"; the corresponding csv file can be found in the folder "Basic properties" and contains information such as "Space group", "Site group", "Space group number", "Band gap (PBE) (eV)", and "Total energy/atom (eV)". The "Lattice parameters" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_lp.csv"; the corresponding csv file can be found in the folder "Lattice parameters" and contains the lattice constants a, b, c, α, β, and γ. The "Electronic structures (with SOC)" column provides a png file name, such as "icsd-100114-Be2Li2Sb2_band_SOC.png"; the corresponding png file can be found in the folder "Electronic structures (with SOC)" and shows the band structure (with SOC) of the material in the range of -3 eV to 3 eV, with the spin-split bands marked in the figure. The "Spin-split data" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_Es_SOC.csv"; the details of the spin-split properties of all marked spin-split bands can be found in the corresponding csv file in the "Spin-split data" folder. That csv file contains "Point" (the number of spin-split points marked in the png file), "Spin-split type", "K-point/K-path" (the high symmetry k-point/k-path with spin splitting), "Split energy (eV)", and "Spin split parameter" (the symbol of the split energy: Er, Ed, and Ez).
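    For example, the summary table can be filtered with pandas (column names as listed above; the 0.1 eV threshold is arbitrary and only for illustration):

      import pandas as pd

      materials = pd.read_csv("total_data.csv")

      # Rashba-type entries with a maximum split energy above 0.1 eV.
      rashba = materials[
          materials["Spin-split type"].astype(str).str.contains("Rashba")
          & (materials["Max split energy (eV)"] > 0.1)
      ]
      print(rashba[["Formula", "Space group", "Max split energy (eV)"]])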

  16. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (151045619431 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
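    A small sketch of that id-to-folder mapping (it follows the examples above; zero-padding and file extensions are not specified here and are left out):

      def kernel_version_folder(version_id: int) -> str:
          """Top/sub folder for a KernelVersions id, per the layout described above."""
          top = version_id // 1_000_000             # e.g. 123 for ids 123,000,000-123,999,999
          sub = (version_id % 1_000_000) // 1_000   # e.g. 456 for ids 123,456,000-123,456,999
          return f"{top}/{sub}"

      print(kernel_version_folder(123_456_789))  # -> 123/456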

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  17. Data and scripts associated with a manuscript investigating impacts of solid...

    • search.dataone.org
    Updated Aug 21, 2023
    Cite
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg (2023). Data and scripts associated with a manuscript investigating impacts of solid phase extraction on freshwater organic matter optical signatures and mass spectrometry pairing [Dataset]. http://doi.org/10.15485/1995543
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg
    Time period covered
    Aug 30, 2021 - Sep 15, 2021
    Area covered
    Description

    This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system,” submitted to Limnology and Oceanography: Methods. The data are an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914); other associated data and field metadata can be found at that link. The goal of the manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement, and the extraction efficiencies for DOC and for common optical indices were calculated. In addition, the SPE samples were subjected to ultra-high resolution mass spectrometry and compared with the ambient and SPE-generated optical data. Finally, in addition to this cross-platform inter-comparison, we performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of each sample (ranging from ~30% to ~75%), rather than following the common practice of assuming a 60% DOC extraction efficiency for freshwater samples.

    This data package consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) a readme; (2) a data dictionary (dd); (3) file-level metadata (flmd); (4) the final data summary output from the processing script; and (5) the processing script itself. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce the manuscript statistics and figures (with the exception of what is noted below). The Data_Input folder has two subfolders, FTICR and Optics, and additionally contains the dissolved organic carbon data (SPS_NPOC_Summary.csv) and the supporting solid phase extraction volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. The data dictionary (SPS_SPE_dd.csv), file-level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are also provided.

    The FTICR subfolder contains all raw FTICR data as well as instructions for processing; post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided and can be read directly into R with the associated R-markdown file. The Optics subfolder contains all absorbance and fluorescence spectra. Fluorescence spectra have been blank corrected and inner-filter corrected, and have undergone scatter removal. This folder also contains the Matlab code used to make a portion of Figure 1 of the manuscript, to derive the spectral parameters used in the manuscript, and to run the parallel factor analysis (PARAFAC) modeling. Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are read directly into the associated R-markdown file.
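    For readers who want a quick programmatic look at the tabulated outputs outside of R, the sketch below loads the NPOC summary, the SPE volume table, and the spectral indices with pandas and computes a mass-balance DOC extraction efficiency. The file names come from the description above, but every column name (Sample_ID, NPOC_ambient_mgL, NPOC_SPE_mgL, Volume_Extracted_mL, Volume_Eluted_mL) is a hypothetical placeholder; consult the data dictionary (SPS_SPE_dd.csv) for the real headers. This is a minimal sketch under those assumptions, not the authors' published workflow.

        # Hedged sketch: mass-balance DOC extraction efficiency from the package CSVs.
        # All column names are hypothetical placeholders; check SPS_SPE_dd.csv for the
        # real headers before running.
        import pandas as pd

        npoc = pd.read_csv("Data_Input/SPS_NPOC_Summary.csv")
        volumes = pd.read_csv("Data_Input/SPS_SPE_Volumes.csv")
        indices = pd.read_csv("Data_Input/Optics/SPS_SpectralIndices.csv")

        df = npoc.merge(volumes, on="Sample_ID").merge(indices, on="Sample_ID")

        # Extraction efficiency (%) = DOC mass recovered in the SPE eluate
        #                             / DOC mass in the water passed through the cartridge
        df["DOC_EE_pct"] = 100 * (df["NPOC_SPE_mgL"] * df["Volume_Eluted_mL"]) / (
            df["NPOC_ambient_mgL"] * df["Volume_Extracted_mL"]
        )

        print(df["DOC_EE_pct"].describe())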

  18. Assembly Shellcode Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Assembly Shellcode Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/assembly-shellcode-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Assembly Shellcode Dataset

    The Largest Collection of Linux Assembly Shellcodes

    By SoLID (From Huggingface) [source]

    About this dataset

    The dataset consists of multiple files, each serving a different purpose. The validation.csv file contains a carefully selected set of assembly shellcodes reserved for validation; these shellcodes are used to check the accuracy and integrity of any models or algorithms trained on this dataset.

    The train.csv file contains the intent column, which describes the purpose or objective behind each specific shellcode, together with the corresponding assembly code snippet, so it can be used directly for supervised learning. This file is valuable for researchers, practitioners, and developers studying malicious-code analysis or developing techniques for security-related tasks.

    The test.csv file provides a further collection of assembly shellcodes that can be used as test cases to assess the performance, robustness, and generalization capability of models or methods developed in this domain.

    How to use the dataset

    Understanding the Dataset

    The dataset consists of multiple files that serve different purposes:

    • train.csv: This file contains the intent and corresponding assembly code snippets for training purposes. It can be used to train machine learning models or develop algorithms based on shellcode analysis.

    • test.csv: The test.csv file in the dataset contains a collection of assembly shellcodes specifically designed for testing purposes. You can use these shellcodes to evaluate and validate your models or analysis techniques.

    • validation.csv: The validation.csv file includes a set of assembly shellcodes that are specifically reserved for validation purposes. These shellcodes can be used separately to ensure the accuracy and reliability of your models.

    Columns in the Dataset

    The columns available in each CSV file are as follows:

    • intent: The intent column describes the purpose or objective of each specific shellcode entry. It provides information regarding what action or achievement is intended by using that particular piece of code.

    • snippet: The snippet column contains the actual assembly code corresponding to each intent entry in its respective row. It includes all necessary instructions and data required to execute the desired action specified by that intent.

    Utilizing the Dataset

    To effectively utilize this dataset, follow these general steps:

    • Familiarize yourself with assembly language: Assembly language is essential when working with shellcodes since they consist of low-level machine instructions understood by processors directly.

    • Explore intents: Start by analyzing and understanding different intents present in the dataset entries thoroughly. Each intent represents a specific goal or purpose behind creating an individual piece of code.

    • Examine snippets: Review the assembly code snippet corresponding to each intent entry. Carefully study the instructions and data used in the shellcode, as they determine the action it performs.

    • Train your models: If you are working on machine learning or algorithm development, utilize the train.csv file to train your models based on the labeled intent and snippet data provided. This step will enable you to build powerful tools for analyzing or detecting shellcodes automatically.

    • Evaluate using test datasets: Use the assembly shellcodes in test.csv to evaluate and validate your trained models or analysis techniques. This evaluation helps you gauge how well your approach generalizes to unseen shellcode; a minimal loading-and-baseline example is sketched below.
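
    As a minimal, non-authoritative sketch of steps 4 and 5, the snippet below loads train.csv and test.csv with pandas and builds a simple TF-IDF nearest-neighbour baseline that retrieves a training snippet for each test intent. The intent and snippet column names come from the dataset description above; the file paths assume the CSVs sit in the working directory, and this is only one possible baseline, not a method prescribed by the dataset.

        # Hedged baseline sketch (not the dataset authors' method):
        # retrieve the training snippet whose intent is most similar to each test intent.
        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        train = pd.read_csv("train.csv")   # columns: intent, snippet (per the description)
        test = pd.read_csv("test.csv")

        vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
        train_vecs = vectorizer.fit_transform(train["intent"].astype(str))
        test_vecs = vectorizer.transform(test["intent"].astype(str))

        # For each test intent, find the closest training intent and reuse its snippet.
        similarity = cosine_similarity(test_vecs, train_vecs)
        best_match = similarity.argmax(axis=1)

        for i in range(min(3, len(test))):
            print("TEST INTENT :", test.loc[i, "intent"])
            print("RETRIEVED   :", str(train.loc[best_match[i], "snippet"])[:80])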

    Research Ideas

    • Malware analysis: The dataset can be used for studying and analyzing various shellcode techniques used in malware attacks. Researchers and security professionals can use this dataset to develop detection and prevention mechanisms against such attacks.
    • Penetration testing: Security experts can use this dataset to simulate real-world attack scenarios and test the effectiveness of their defensive measures. By having access to a diverse range of shellcodes, they can identify vulnerabilities in systems and patch them before malicious actors exploit them.
    • Machine learning training: This dataset can be used to train machine learning models for automatic detection or classification of shellcodes. By combining the intent column (which describes the objective of each shellcode) with the corresponding assembly code snippets, researchers can develop algorithms that automatically identify the purpose or ...
  19. ECG in High Intensity Exercise Dataset

    • opendatalab.com
    • ekoizpen-zientifikoa.ehu.eus
    • +3more
    zip
    Cite
    University of Lausanne, ECG in High Intensity Exercise Dataset [Dataset]. https://opendatalab.com/OpenDataLab/ECG_in_High_Intensity_Exercise_etc
    Explore at:
    zip(17746043 bytes)Available download formats
    Dataset provided by
    University of Lausanne
    École Polytechnique Fédérale de Lausanne
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data presented here were extracted from a larger dataset collected through a collaboration between the Embedded Systems Laboratory (ESL) of the Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland, and the Institute of Sports Sciences of the University of Lausanne (ISSUL). This dataset reports the extracted segments used for an analysis of R-peak detection algorithms during high-intensity exercise.

    Protocol of the experiments
    22 subjects performed a cardio-pulmonary maximal exercise test on a cycle ergometer while wearing a gas mask. A single-lead electrocardiogram (ECG) was measured using the BIOPAC system. An initial 3 min of rest was recorded. After this baseline, the subjects started cycling at a power of 60 W or 90 W, depending on their fitness level. The power of the cycle ergometer was then increased by 30 W every 3 min until exhaustion (in terms of maximum oxygen uptake, or VO2max). Finally, physiology experts assessed the so-called ventilatory thresholds and VO2max from the pulmonary data (volumes of oxygen and CO2).

    Description of the extracted dataset
    Only 20 of the 22 subjects are reported, because for two subjects (subjects 5 and 12) the signals were too corrupted or incomplete. The ECG signal was sampled at 500 Hz and then downsampled to 250 Hz. The original ECG signals were measured at a maximum of 10 mV and then scaled down by a factor of 1000, so the data are expressed in uV. For each subject, 5 segments of 20 s were extracted from the ECG recordings, chosen from different phases of the maximal exercise test (before and after the so-called second ventilatory threshold, VT2; before and in the middle of VO2max; and during the recovery after exhaustion) to represent different intensities of physical activity:

        seg1 --> [VT2-50, VT2-30]
        seg2 --> [VT2+60, VT2+80]
        seg3 --> [VO2max-50, VO2max-30]
        seg4 --> [VO2max-10, VO2max+10]
        seg5 --> [VO2max+60, VO2max+80]

    The R-peak locations were manually annotated in all segments and reviewed by a physician of the Lausanne University Hospital (CHUV). Only segment 5 of subject 9 could not be annotated because of a problem with the input signal, so the total number of extracted segments is 20 * 5 - 1 = 99.

    Format of the extracted dataset
    The dataset is divided into two main folders. The folder ecg_segments/ contains the ECG signals saved in two formats, .csv and .mat, and includes both raw (ecg_raw) and processed (ecg) signals. The processing consists of morphological filtering followed by a relative-energy-based enhancement of the R peaks. The .csv files contain only the signal, while the .mat files include the signal, the time vector within the maximal stress test, the sampling frequency, and the unit of the signal amplitude (uV, as mentioned above). The folder manual_annotations/ contains the sample indices of the annotated R peaks in .csv format. The annotations were made on the processed signals.
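
    As an illustrative, hedged starting point, the sketch below loads one processed segment and its manual annotations with pandas, runs a generic peak detector from SciPy, and counts how many annotated R peaks have a detection within 50 ms. The exact file names inside ecg_segments/ and manual_annotations/ are assumptions for illustration; the 250 Hz sampling rate comes from the description above, and the detector is a generic baseline, not one of the algorithms evaluated in the study.

        # Hedged sketch: compare a generic SciPy peak detector against the manual
        # R-peak annotations for one 20 s segment (file names are hypothetical).
        import numpy as np
        import pandas as pd
        from scipy.signal import find_peaks

        FS = 250  # Hz, per the dataset description

        ecg = pd.read_csv("ecg_segments/subject01_seg1.csv", header=None).to_numpy().ravel()
        annotated = pd.read_csv("manual_annotations/subject01_seg1.csv", header=None).to_numpy().ravel()

        # Generic detector: peaks at least 0.3 s apart and above an amplitude threshold.
        detected, _ = find_peaks(ecg, distance=int(0.3 * FS), height=np.percentile(ecg, 95))

        # Count annotated R peaks with a detection within +/- 50 ms.
        tolerance = int(0.05 * FS)
        hits = sum(np.any(np.abs(detected - a) <= tolerance) for a in annotated)
        print(f"{hits}/{len(annotated)} annotated R peaks have a detection within 50 ms")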

  20. MetaGraspNet Difficulty 1

    • kaggle.com
    zip
    Updated Mar 19, 2022
    + more versions
    Cite
    Yuhao Chen (2022). MetaGraspNet Difficulty 1 [Dataset]. https://www.kaggle.com/datasets/metagrasp/metagraspnetdifficulty1-easy
    Explore at:
    zip(4103890817 bytes)Available download formats
    Dataset updated
    Mar 19, 2022
    Authors
    Yuhao Chen
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    MetaGraspNet dataset

    This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663 ).

    There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particularly impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, and pick planning. While significant progress has been made in leveraging machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.

    To tackle this big, diverse data problem, we are inspired by the recent rise of the metaverse concept, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulty levels to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications than existing general-purpose metrics. This repository contains the first phase of the MetaGraspNet benchmark dataset, which includes detailed object detection, segmentation, and layout annotations, as well as a script for the layout-weighted performance metric (https://github.com/y2863/MetaGraspNet ).

    Figure: https://raw.githubusercontent.com/y2863/MetaGraspNet/main/.github/500.png

    Citing MetaGraspNet

    If you use the MetaGraspNet dataset or metric in your research, please use the following BibTeX entry:

        @article{chen2021metagraspnet,
          author  = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and Alexander Wong},
          title   = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis},
          journal = {arXiv preprint arXiv:2112.14663},
          year    = {2021}
        }

    File Structure

    This dataset is arranged in the following file structure:

    root
    |-- meta-grasp
      |-- scene0
        |-- 0_camera_params.json
        |-- 0_depth.png
        |-- 0_rgb.png
        |-- 0_order.csv
        ...
      |-- scene1
      ...
    |-- difficulty-n-coco-label.json
    

    Each scene is a unique arrangement of objects, which we then capture from various different angles. For each shot of a scene, we provide the camera parameters (x_camera_params.json), a depth image (x_depth.png), an RGB image (x_rgb.png), as well as a matrix representation of the ordering of the objects (x_order.csv). The full labels for the images are available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the COCO data format.
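
    As a hedged sketch of how one shot might be loaded given the file layout above, the snippet below reads the camera parameters, RGB image, and depth image for shot 0 of scene0. The paths follow the tree shown earlier; the choice of Pillow and NumPy, and the assumption that the dataset has been extracted into the working directory, are illustrative only.

        # Hedged sketch: load one shot of a scene using the file layout shown above.
        import json
        from pathlib import Path

        import numpy as np
        from PIL import Image

        shot_dir = Path("meta-grasp/scene0")
        idx = 0  # shot index within the scene

        with open(shot_dir / f"{idx}_camera_params.json") as f:
            camera_params = json.load(f)  # exact structure depends on the dataset release

        rgb = np.asarray(Image.open(shot_dir / f"{idx}_rgb.png"))      # H x W x 3
        depth = np.asarray(Image.open(shot_dir / f"{idx}_depth.png"))  # H x W, encoding per the docs

        print(rgb.shape, depth.shape, list(camera_params)[:5])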

    Understanding order.csv

    The matrix describes a pairwise obstruction relationship between each object within the image. Given a "parent" object covering a "child" object: relationship_matrix[child_id, parent_id] = -1
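
    To make the convention concrete, here is a small hedged sketch that reads one x_order.csv into a NumPy matrix and lists the (child, parent) pairs implied by the -1 entries. The file path and the assumption that the CSV is a plain numeric matrix with no header are illustrative only.

        # Hedged sketch: recover pairwise obstruction relationships from an order matrix.
        import numpy as np

        # Assumes the CSV is a plain numeric matrix with no header (illustrative only).
        relationship_matrix = np.loadtxt("meta-grasp/scene0/0_order.csv", delimiter=",")

        # Per the description above: relationship_matrix[child_id, parent_id] == -1
        # means the object with parent_id covers (obstructs) the object with child_id.
        children, parents = np.where(relationship_matrix == -1)
        for child_id, parent_id in zip(children, parents):
            print(f"object {parent_id} covers object {child_id}")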
