This dataset is the 2025 Harmonized Tariff Schedule plus all revisions issued during the current year. It provides the applicable tariff rates and statistical categories for all merchandise imported into the United States; it is based on the international Harmonized System, the global system of nomenclature used to describe most world trade in goods.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
In 2021, an international goods and services classification for procurement called the United Nations Standard Products and Services Code (UNSPSC, v21) was implemented to replace the Government of Canada's Goods and Services Identification Number (GSIN) codes for categorizing procurement activities undertaken by the Government of Canada. For the transition from GSIN to UNSPSC, a subset of the entire version 21 UNSPSC list was created. The Mapping of GSIN-UNSPSC file below provides a suggested linkage between this subset of UNSPSC and higher levels of the GSIN code list. As procurement needs evolve, this file may be updated to include other UNSPSC v21 codes that are deemed to be required. In the interim, if the lowest-level values within the UNSPSC structure do not relate to a specific category of goods or services, the use of the higher (related) level code from within the UNSPSC structure is appropriate.

Please note: This dataset is offered as a means to assist the user in finding specific UNSPSC codes, based on high-level comparisons to the legacy GSIN codes. It should not be considered a direct one-to-one mapping of the two categorization systems. For some categories, the linkages were only assessed at higher levels of the two structures (and then carried through to the related lower categories beneath those values). Because the two systems do not necessarily group items in the same way throughout their structures, this can result in confusing connections in some cases. Always select the UNSPSC code that best describes the applicable goods or services, even if the associated GSIN value shown in this file is not directly relevant.

The data is available in Comma Separated Values (CSV) file format and can be downloaded to sort, filter, and search information. The United Nations Standard Products and Services Code (UNSPSC) page on CanadaBuys offers a comprehensive guide on how to use this reference file. The Finding and using UNSPSC Codes page from CanadaBuys also contains additional information which may be of use.

This dataset was originally published on June 22, 2016. The format and contents of the CSV file were revised on May 12, 2021. A copy of the original file was archived as a secondary resource to this dataset at that time (labelled ARCHIVED - Mapping of GSIN-UNSPSC in the resource list below).

As of March 23, 2023, the data dictionary linked below includes entries for both the current and archived versions of the datafile, as well as for the datafiles of the Goods and Services Identification Number (GSIN) dataset and the archived United Nations Standard Products and Services Codes (v10, released 2007) dataset.
http://data.europa.eu/eli/dec/2011/833/oj
Multilingual database covering all measures relating to tariff, commercial and agricultural legislation. Provides a clear view of what to do when importing or exporting goods.
TARIC, the integrated Tariff of the European Union, is a multilingual database integrating all measures relating to EU customs tariff, commercial and agricultural legislation. By integrating and coding these measures, TARIC ensures their uniform application by all Member States and gives all economic operators a clear view of the measures to be undertaken when importing goods into the EU or exporting goods from the EU. It also makes it possible to collect EU-wide statistics for the measures concerned.
The TARIC contains the following main categories of measures:
Tariff measures;
Agricultural measures;
Trade Defence instruments;
Prohibitions and restrictions to import and export;
Surveillance of movements of goods at import and export.
For tariff information, more detail can be found under the European Binding Tariff Information (EBTI).
Imagery acquired with unmanned aerial systems (UAS) and coupled with structure from motion (SfM) photogrammetry can produce high-resolution topographic and visual reflectance datasets that rival or exceed lidar and orthoimagery. These new techniques are particularly useful for data collection of coastal systems, which requires high temporal and spatial resolution datasets. The U.S. Geological Survey worked in collaboration with members of the Marine Biological Laboratory and Woods Hole Analytics at Black Beach, in Falmouth, Massachusetts to explore scientific research demands on UAS technology for topographic and habitat mapping applications. This project explored the application of consumer-grade UAS platforms as a cost-effective alternative to lidar and aerial/satellite imagery to support coastal studies requiring high-resolution elevation or remote sensing data. A small UAS was used to capture low-altitude photographs and GPS devices were used to survey reference points. These data were processed in an SfM workflow to create an elevation point cloud, an orthomosaic image, and a digital elevation model.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Dataset Name
High school research articles crawled from the Journal of Student Research (https://www.jsr.org/).
Dataset Structure
A CSV file with the following columns: title, URL, names, date, abstract.
Source Data
https://www.jsr.org/hs/index.php/path/section/view/hs-research-articles/
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
programs.csv:
testing-files.csv:
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
The script uses the GitHub code search API and inherits its limitations:
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
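As a rough illustration of the download step described above, the following Python sketch performs a shallow, recursive clone of a repository's HEAD into a subfolder named after a repository ID; the URL and ID shown are placeholders, and the actual download-repositories.py may differ in detail.

```python
import subprocess
from pathlib import Path

def shallow_clone(repo_url: str, repo_id: str, target_dir: str = "downloads") -> None:
    """Clone only the most recent state of the default branch, including submodules."""
    dest = Path(target_dir) / repo_id
    dest.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "git", "clone",
            "--depth", "1",           # only the latest commit, no history
            "--recurse-submodules",   # include submodules
            "--shallow-submodules",   # submodules are also shallow
            repo_url, str(dest),
        ],
        check=True,
    )

# Example with placeholder values:
# shallow_clone("https://github.com/example/example-repo.git", "123456789")
```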
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented here were used to produce the following paper:
Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.
The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588
For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za
Description of file(s):
File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")
The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
genus Genus
MAR Mean Annual Rainfall for that species distribution (mm)
rainclass High/medium/low
File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
Age.of.plant Age in days
species_code Species
pred_SD_mm Predicted stem diameter in mm
pred_SD_up Top 75th quantile of stem diameter in mm
pred_SD_low Bottom 25th quantile of stem diameter in mm
treatdate Date when clipped
pred_surv Predicted survival probability
pred_surv_low Predicted 25th quantile survival probability
pred_surv_high Predicted 75th quantile survival probability
species_code Species code
Bite.probability Daily probability of being eaten
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd Standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd Standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm Mean bite diameter of a duiker for this species
duiker_mean_sd Standard deviation of mean bite diameter for a duiker for this species
mean_bite_diameter_kudu_mm Mean bite diameter of a kudu for this species
kudu_mean_sd Standard deviation of mean bite diameter for a kudu for this species
genus Genus
rainclass Low/med/high
File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
shtspec species name
species_code species code
genus genus
rainclass low/medium/high
seed mass mass of seed (g per 1000 seeds)
Surv_intercept coefficient of the model predicting survival from age of clip for this species
Surv_slope coefficient of the model predicting survival from age of clip for this species
GR_intercept coefficient of the model predicting stem diameter from seedling age for this species
GR_slope coefficient of the model predicting stem diameter from seedling age for this species
species_code species code
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm mean bite diameter of a duiker for this species
duiker_mean_sd standard deviation of mean bite diameter for a duiker for this species
mean_bite_diameter_kudu_mm mean bite diameter of a kudu for this species
kudu_mean_sd standard deviation of mean bite diameter for a kudu for this species
AgeAtEscape_duiker[t] age of plant when its stem diameter is larger than a mean duiker bite
AgeAtEscape_duiker_min[t] age of plant when its stem diameter is larger than a min duiker bite
AgeAtEscape_duiker_max[t] age of plant when its stem diameter is larger than a max duiker bite
AgeAtEscape_kudu[t] age of plant when its stem diameter is larger than a mean kudu bite
AgeAtEscape_kudu_min[t] age of plant when its stem diameter is larger than a min kudu bite
AgeAtEscape_kudu_max[t] age of plant when its stem diameter is larger than a max kudu bite
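For readers working in Python rather than the provided R scripts, a minimal sketch for loading the four files described above might look as follows; the file names come from this documentation, and the columns touched here are assumed to match the descriptions.

```python
import pandas as pd

# File names as described above; adjust paths to where the dataset is stored.
seedlings = pd.read_csv("cleanedData_forAnalysis.csv")
survival = pd.read_csv("Herbivory_SurvivalEndofSeason_march2017.csv")
model_params = pd.read_csv("allModelParameters_byAge.csv")
eat_params = pd.read_csv("EatProbParameters_June2020.csv")

# Quick sanity checks against the column descriptions above.
print(seedlings[["species_code", "treatment", "status"]].head())
print(survival.groupby("rainclass")["status"].value_counts())
```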
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.
Our dataset is located at the path dataset/MaRV.json
The guidelines for replicating the study are provided below:
Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Create a .env file based on .env.example in the src folder and set the variables:
- CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
- CLONE_DIR: Directory where repositories will be cloned.
- JAVA_PATH: Path to the Java executable.
- REFACTORING_MINER_PATH: Path to RefactoringMiner.

The file referenced by CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).

Once the .env file and the repositories CSV are set up, run:

python3 src/run_rm.py

The script clones each repository into CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it. The results are saved as .json files in CLONE_DIR, and logs are saved as .log files in the same directory.

To count the detected refactorings, run:

python3 src/count_refactorings.py

The output, refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.

To collect snippets before and after refactoring and their metadata, run:

python3 src/diff.py '[refactoring technique]'

Replace [refactoring technique] with the desired technique name (e.g., Extract Method).
The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.
Dataset Availability: the dataset is available in the dataset directory.

To generate the SQL file for the Web tool, run:

python3 src/generate_refactorings_sql.py

The Web tool is located in the web directory. Create a data/output/snippets folder with the output of src/diff.py, run the sql/create_database.sql script in your database, and import the SQL generated by src/generate_refactorings_sql.py. Then run dataset.php to generate the MaRV dataset file, which is stored in the dataset directory of the replication package.

Data and source code for reproducing the analysis conducted in "High Throughput FTIR Analysis of Macro and Microplastics with Plate Readers". All materials are licensed for noncommercial purposes (https://creativecommons.org/licenses/by-nc/4.0/).

HIDA_Publication.R has source code for doing data cleanup and analysis on data in database.zip. databasedata.zip holds all raw and analyzed data.
- The ATR, Reflectance, and Transmission folders have all data used in the manuscript, in a raw (.0) and combined (export.csv) format for each of the plates analyzed (folder numbers).
- The Plots folder has images of each spectrum.
- cell_information.csv has the raw ids and comments made at the time the particles were assessed.
- classes_reference_2.csv has the transformations used to standardize Open Specy's terms to polymer classes.
- CleanedSpectra_raw.csv has the total cleaned-up database of all spectral intensities in long format.
- joined_cell_metadata.csv has the metadata for each plate well analyzed.
- library_metadata.csv has metadata for each spectrum in raw form for each particle id.
- Lisa_Plate_6.csv has the metadata from Lisa Roscher used in this study.
- Metadata_raw.csv has the conformed metadata that can be paired with the CleanedSpectra_raw.csv file.
- OpenSpecy_Classification_Baseline.csv has the particle metadata combined with Open Specy's classes identified after baseline correcting and smoothing the spectra with the standard Open Specy routine.
- OpenSpecy_Classification_Raw.csv has the particle metadata combined with Open Specy's identified classes if using the raw spectra.
- particle_spectrum_match.csv converts particle ids to their reference in the Polymer_Material_Database_AWI_V2_Win.xlsx file.
- Polymer_Material_Database_AWI_V2_Win.xlsx has metadata on materials from Primpke's database.
- polymer_metadata_2.csv can be used to crosswalk polymer categories to more or less specific terminology.
- spread_os.csv is the reference database used in CleanedSpectra_raw.csv that has been spread to wide format.
- Top Correlation Data20221201-125621.csv is a download of results from Open Specy's beta tool that provides the top ids from the reference database.
This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:
- Drying2022.csv: drying rate data for the 2022 experimental run
- Weather2022.csv: weather data for the 2022 experimental run
- Drying2023.csv: drying rate data for the 2023 experimental run
- Weather2023.csv: weather data for the 2023 experimental run
- disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
- disinfestant_drying_analysis.html: rendered output of the notebook
- MS_figures.R: additional R code to create figures formatted for journal requirements
- fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
- fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
- data_dictionary.xlsx: descriptions of each column in the CSV data files
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Columns | Description |
---|---|
school | student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) |
sex | student's sex (binary: 'F' - female or 'M' - male) |
age | student's age (numeric: from 15 to 22) |
address | student's home address type (binary: 'U' - urban or 'R' - rural) |
famsize | family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) |
Pstatus | parent's cohabitation status (binary: 'T' - living together or 'A' - apart) |
Medu | mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
Fedu | father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
Mjob | mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
Fjob | father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
reason | reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') |
guardian | student's guardian (nominal: 'mother', 'father' or 'other') |
traveltime | home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) |
studytime | weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) |
failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
schoolsup | extra educational support (binary: yes or no) |
famsup | family educational support (binary: yes or no) |
paid | extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) |
activities | extra-curricular activities (binary: yes or no) |
nursery | attended nursery school (binary: yes or no) |
higher | wants to take higher education (binary: yes or no) |
internet | Internet access at home (binary: yes or no) |
romantic | with a romantic relationship (binary: yes or no) |
famrel | quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
freetime | free time after school (numeric: from 1 - very low to 5 - very high) |
goout | going out with friends (numeric: from 1 - very low to 5 - very high) |
Dalc | workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
Walc | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
health | current health status (numeric: from 1 - very bad to 5 - very good) |
absences | number of school absences (numeric: from 0 to 93) |
Grade | Description |
---|---|
G1 | first period grade (numeric: from 0 to 20) |
G2 | second period grade (numeric: from 0 to 20) |
G3 | final grade (numeric: from 0 to 20, output target) |
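To illustrate the note above about G3 correlating strongly with G1 and G2, a short pandas sketch could be used; the file name student-mat.csv and the ';' separator are assumptions based on the original UCI distribution and may differ in this copy.

```python
import pandas as pd

# File name and separator are assumptions; adjust to match the files in this dataset.
math_df = pd.read_csv("student-mat.csv", sep=";")

# G3 (final grade) correlates strongly with the earlier period grades G1 and G2.
print(math_df[["G1", "G2", "G3"]].corr())

# Predicting G3 without G1/G2 is harder but more useful in practice:
features_without_grades = math_df.drop(columns=["G1", "G2", "G3"])
print(features_without_grades.shape)
```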
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you should have a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.
The JavaScript code works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for doing the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 through five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save results as CSV files, open each time series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.
Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly
Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files (stored, for example, in a folder called output/from_GEE) into one single CSV file, merged.csv. The Jupyter Notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
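In essence, the merging step performed by merge_downloaded_csv_files.ipynb could look like the following Python sketch; the output/from_GEE folder name follows the example above, and the notebook's additional preprocessing steps are not reproduced here.

```python
import glob
import pandas as pd

# Concatenate the monthly CSV files exported from the Google Earth Engine console.
files = sorted(glob.glob("output/from_GEE/*.csv"))
merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
merged.to_csv("merged.csv", index=False)
print(f"Merged {len(files)} files into merged.csv with {len(merged)} rows")
```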
https://www.gnu.org/licenses/gpl-3.0.html
Code used to generate the wind direction time series used in the publication "Wind pattern clustering of high frequent field measurements for dynamic wind farm flow control" by M. Becker, D. Allaerts and J.W. van Wingerden (in preparation for the TORQUE conference 2024)
The TenneT_BSA_* files convert the raw data from the KNMI [1] into one file with all data at 119 m height, which is equivalent to the hub height of the DTU 10MW reference turbine. Note that there is a channel switch in the data; that is why there are two functions to read in the data.
The output dataset is given in the CombinedDataAt199m.csv file.
The two hpc06_trajectories_* files are then used to segment the data into time series of a requested length. This code also contains the filtering and interpolation of the data. The output is two .csv files, one with wind direction trajectories and one with wind speed trajectories.
Two examples are given by WindDirTraj.csv and WindVelTraj.csv - they have been generated with a length of 30 data points and with an offset of 30 data points (no overlapping).
The code of hpc06_cluster_dir* can then be used to cluster the given data.
The remaining files are supplementary, used to plot data, calculate distances in radial data, etc., including the kmeans360.m function, which is a modified version of Matlab's kmeans function that also works for radial data.
[1] https://dataplatform.knmi.nl/dataset/windlidar-nz-wp-platform-1s-1
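As a rough illustration of clustering directional data, the sketch below embeds wind directions on the unit circle and applies standard k-means in Python; this mirrors the idea behind the modified kmeans360.m function but is not a port of it, and the example directions are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_wind_directions(directions_deg: np.ndarray, n_clusters: int = 5):
    """Cluster wind directions (degrees) by embedding them on the unit circle."""
    radians = np.deg2rad(directions_deg)
    xy = np.column_stack([np.cos(radians), np.sin(radians)])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(xy)
    # Convert cluster centres back to angles in [0, 360).
    centres = np.rad2deg(np.arctan2(km.cluster_centers_[:, 1],
                                    km.cluster_centers_[:, 0])) % 360
    return km.labels_, centres

# Example with synthetic directions (degrees):
labels, centres = cluster_wind_directions(np.array([350.0, 10.0, 90.0, 95.0, 180.0]), 2)
print(labels, centres)
```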
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Our study explores how New York City (NYC) communities of various socioeconomic strata were uniquely impacted by the COVID-19 pandemic.

Methods: New York City ZIP codes were stratified into three bins by median income: high-income, middle-income, and low-income. Case, hospitalization, and death rates obtained from NYCHealth were compared for the period between March 2020 and April 2022.

Results: COVID-19 transmission rates among high-income populations during off-peak waves were higher than transmission rates among low-income populations. Hospitalization rates among low-income populations were higher during off-peak waves despite a lower transmission rate. Death rates during both off-peak and peak waves were higher for low-income ZIP codes.

Discussion: This study presents evidence that while high-income areas had higher transmission rates during off-peak periods, low-income areas suffered greater adverse outcomes in terms of hospitalization and death rates. The importance of this study is that it focuses on the social inequalities that were amplified by the pandemic.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data includes a table file named "total_data.csv" and four folders named "Basic properties", "Lattice parameters", "Electronic structures (with SOC)", and "Spin-split data". The "total_data.csv" file lists the spin-splitting material information we have screened out, such as "Formula", "E-fermi (eV)", "Space group", "Spin-split type", "Max split energy (eV)", "Basic properties", "Lattice parameters", "Electronic structures (with SOC)", "Spin-split data", and "Data source". The "Spin-split type" can be Rashba, Dresselhaus, or Zeeman. One material may have multiple spin-split band structures. The "Max split energy (eV)" shows the maximum split energy among all the split energies of the material.

The "Basic properties" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_bp.csv"; the corresponding csv file can be found in the folder "Basic properties" and contains information such as "Space group", "Site group", "Space group number", "Band gap (PBE) (eV)", and "Total energy/atom (eV)". The "Lattice parameters" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_lp.csv"; the corresponding csv file can be found in the folder "Lattice parameters" and contains the lattice constants a, b, c, α, β, and γ. The "Electronic structures (with SOC)" column provides a png file name, such as "icsd-100114-Be2Li2Sb2_band_SOC.png"; the corresponding png file can be found in the folder "Electronic structures (with SOC)" and shows the band structure (with SOC) of the material in the range of -3 eV to 3 eV, with the spin-split bands marked in the figure. The "Spin-split data" column provides a csv file name, such as "icsd-100114-Be2Li2Sb2_Es_SOC.csv"; the details of the spin-split properties of all marked spin-split bands can be found in the corresponding csv file in the "Spin-split data" folder. This csv file contains "Point" (the number of the spin-split point marked in the png file), "Spin-split type", "K-point/K-path" (the high-symmetry k-point/k-path with spin splitting), "Split energy (eV)", and "Spin split parameter" (the symbol of the split energy: Er, Ed, and Ez).
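As an illustration, the linked files could be navigated with a short pandas sketch like the one below; the column and folder names follow the description above, and the exact CSV layout is an assumption until checked against the actual files.

```python
import pandas as pd

total = pd.read_csv("total_data.csv")

# Each row names the per-material files in the folders described above.
row = total.iloc[0]
basic_props = pd.read_csv(f"Basic properties/{row['Basic properties']}")
lattice = pd.read_csv(f"Lattice parameters/{row['Lattice parameters']}")
splits = pd.read_csv(f"Spin-split data/{row['Spin-split data']}")
print(row["Formula"], row["Spin-split type"], row["Max split energy (eV)"])
```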
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebooks versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
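A small sketch of how the two-level layout described above maps a KernelVersions id to its folder; the helper name is ours, and the mapping simply follows the description of the directory structure.

```python
def kernel_version_path(version_id: int) -> str:
    """Map a KernelVersions id to its folder in the two-level layout described above."""
    top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
    sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
    return f"{top}/{sub}"

print(kernel_version_path(123_456_789))  # -> "123/456"
```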
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
This data package is associated with the publication "Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system" submitted to "Limnology and Oceanography: Methods." This data is an extension of the River Corridor and Watershed Biogeochemistry SFA's Spatial Study 2021 (https://doi.org/10.15485/1898914). Other associated data and field metadata can be found at the link provided.

The goal of this manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected from within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiency for the DOC and common optical indices was calculated. In addition, SPE samples were subject to ultra-high resolution mass spectrometry and compared with the ambient and SPE-generated optical data. Finally, in addition to this cross-platform inter-comparison, we further performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%), compared to the common practice of assuming the DOC extraction efficiency of freshwater samples to be 60%.

This data package folder consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) readme; (2) data dictionary (dd); (3) file-level metadata (flmd); (4) final data summary output from the processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception of that stated below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting solid phase extraction volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file-level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided. The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided that can be directly read into R with the associated R-markdown file. The Optics subfolder contains all absorbance and fluorescence spectra. Fluorescence spectra have been blank corrected, inner-filter corrected, and undergone scatter removal. In addition, this folder contains Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and perform parallel factor analysis (PARAFAC) modeling.
Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are directly read into the associated R-markdown file.
https://creativecommons.org/publicdomain/zero/1.0/
By SoLID (from Hugging Face)
The dataset consists of multiple files for different purposes. The validation.csv file contains a set of carefully selected assembly shellcodes that serve the purpose of validation. These shellcodes are used to ensure the accuracy and integrity of any models or algorithms trained on this dataset.
The train.csv file contains both the intent column, which describes the purpose or objective behind each specific shellcode, and its corresponding assembly code snippets in order to facilitate supervised learning during training procedures. This file proves to be immensely valuable for researchers, practitioners, and developers seeking to study or develop effective techniques for dealing with malicious code analysis or security-related tasks.
For testing purposes, the test.csv file provides yet another collection of assembly shellcodes that can be employed as test cases to assess the performance, robustness, and generalization capability of various models or methodologies developed within this domain.
Understanding the Dataset
The dataset consists of multiple files that serve different purposes:
- train.csv: This file contains the intent and corresponding assembly code snippets for training purposes. It can be used to train machine learning models or develop algorithms based on shellcode analysis.
- test.csv: The test.csv file in the dataset contains a collection of assembly shellcodes specifically designed for testing purposes. You can use these shellcodes to evaluate and validate your models or analysis techniques.
- validation.csv: The validation.csv file includes a set of assembly shellcodes that are specifically reserved for validation purposes. These shellcodes can be used separately to ensure the accuracy and reliability of your models.

Columns in the Dataset
The columns available in each CSV file are as follows:
intent: The intent column describes the purpose or objective of each specific shellcode entry. It provides information regarding what action or achievement is intended by using that particular piece of code.
snippet: The snippet column contains the actual assembly code corresponding to each intent entry in its respective row. It includes all necessary instructions and data required to execute the desired action specified by that intent.
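A minimal sketch for loading the training split and inspecting the two columns with pandas; it assumes the CSV files sit in the working directory.

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Each row pairs a natural-language intent with its assembly snippet.
print(train[["intent", "snippet"]].head())
print(f"{len(train)} training examples")
```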
Utilizing the Dataset
To effectively utilize this dataset, follow these general steps:
Familiarize yourself with assembly language: Assembly language is essential when working with shellcodes since they consist of low-level machine instructions understood by processors directly.
Explore intents: Start by analyzing and understanding different intents present in the dataset entries thoroughly. Each intent represents a specific goal or purpose behind creating an individual piece of code.
Examine snippets: Review the assembly code snippets corresponding to each intent entry. Carefully study the instructions and data used in the shellcode, as they directly influence their intended actions.
Train your models: If you are working on machine learning or algorithm development, utilize the train.csv file to train your models based on the labeled intent and snippet data provided. This step will enable you to build powerful tools for analyzing or detecting shellcodes automatically.

Evaluate using test datasets: Use the various assembly shellcodes present in test.csv to evaluate and validate your trained models or analysis techniques. This evaluation will help
- Malware analysis: The dataset can be used for studying and analyzing various shellcode techniques used in malware attacks. Researchers and security professionals can use this dataset to develop detection and prevention mechanisms against such attacks.
- Penetration testing: Security experts can use this dataset to simulate real-world attack scenarios and test the effectiveness of their defensive measures. By having access to a diverse range of shellcodes, they can identify vulnerabilities in systems and patch them before malicious actors exploit them.
- Machine learning training: This dataset can be used to train machine learning models for automatic detection or classification of shellcodes. By combining the intent column (which describes the objective of each shellcode) with the corresponding assembly code snippets, researchers can develop algorithms that automatically identify the purpose or ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented here was extracted from a larger dataset collected through a collaboration between the Embedded Systems Laboratory (ESL) of the Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland and the Institute of Sports Sciences of the University of Lausanne (ISSUL). In this dataset, we report the extracted segments used for an analysis of R peak detection algorithms during high intensity exercise.
Protocol of the experiments
The protocol of the experiment was the following.
22 subjects performed a cardio-pulmonary maximal exercise test on a cycle ergometer, using a gas mask. A single-lead electrocardiogram (ECG) was measured using the BIOPAC system.
An initial 3 min of rest were recorded.
After this baseline, the subjects started cycling at a power of 60W or 90W depending on their fitness level.
Then, the power of the cycle ergometer was increased by 30W every 3 min till exhaustion (in terms of maximum oxygen uptake or VO2max).
Finally, physiology experts assessed the so-called ventilatory thresholds and the VO2max based on the pulmonary data (volume of oxygen and CO2).
Description of the extracted dataset
The characteristics of the dataset are the following:
We report only 20 out of 22 subjects that were used for the analysis, because for two subjects the signals were too corrupted or not complete. Specifically, subjects 5 and 12 were discarded.
The ECG signal was sampled at 500 Hz and then downsampled to 250 Hz. The original ECG signals were measured at a maximum of 10 mV and were then scaled down by a factor of 1000; hence the data are represented in uV.
For each subject, 5 segments of 20 s were extracted from the ECG recordings and chosen based on different phases of the maximal exercise test (i.e., before and after the so-called second ventilatory threshold or VT2, before and in the middle of VO2max, and during the recovery after exhaustion) to represent different intensities of physical activity.
seg1 --> [VT2-50,VT2-30]
seg2 --> [VT2+60,VT2+80]
seg3 --> [VO2max-50,VO2max-30]
seg4 --> [VO2max-10,VO2max+10]
seg5 --> [VO2max+60,VO2max+80]
The R peak locations were manually annotated in all segments and reviewed by a physician of the Lausanne University Hospital (CHUV). Only segment 5 of subject 9 could not be annotated because there was a problem with the input signal. So, the total number of segments extracted was 20 * 5 - 1 = 99.
Format of the extracted dataset
The dataset is divided in two main folders:
The folder ecg_segments/ contains the ECG signals saved in two formats, .csv and .mat. This folder includes both raw (ecg_raw) and processed (ecg) signals. The processing consists of a morphological filtering and a relative-energy non-linear filtering method to enhance the R peaks. The .csv files contain only the signal, while the .mat files include the signal, the time vector within the maximal stress test, the sampling frequency, and the unit of the signal amplitude (uV, as mentioned before).
The folder manual_annotations/ contains the sample indices of the annotated R peaks in .csv format. The annotation was done on the processed signals.
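A short Python sketch for loading one processed segment and its annotations could look as follows; the exact file names inside ecg_segments/ and manual_annotations/ are assumptions for illustration, while the 250 Hz sampling rate and uV unit follow the description above.

```python
import numpy as np

FS = 250  # Hz, after downsampling

# Replace with actual files from ecg_segments/ and manual_annotations/;
# the file-name pattern below is an assumption.
ecg = np.loadtxt("ecg_segments/subject01_seg1_ecg.csv", delimiter=",")
r_peaks = np.loadtxt("manual_annotations/subject01_seg1_Rpeaks.csv", delimiter=",").astype(int)

t = np.arange(len(ecg)) / FS   # time axis in seconds (20 s segments)
r_peak_times = r_peaks / FS    # annotated R-peak locations in seconds
print(f"{len(ecg)} samples, {len(r_peaks)} annotated R peaks")
```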
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663).
There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particularly impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, pick planning, etc. While significant progress has been made in leveraging machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.
To tackle this big, diverse data problem, we are inspired by the recent rise of the concept of the metaverse, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulties to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications compared to existing general-purpose performance metrics. This repository contains the first phase of the MetaGraspNet benchmark dataset, which includes detailed object detection, segmentation, and layout annotations, and a script for the layout-weighted performance metric (https://github.com/y2863/MetaGraspNet).
(Example image: https://raw.githubusercontent.com/y2863/MetaGraspNet/main/.github/500.png)
If you use MetaGraspNet dataset or metric in your research, please use the following BibTeX entry.
BibTeX
@article{chen2021metagraspnet,
author = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and
Alexander Wong},
title = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis},
journal = {arXiv preprint arXiv:2112.14663},
year = {2021}
}
This dataset is arranged in the following file structure:
root
|-- meta-grasp
|   |-- scene0
|   |   |-- 0_camera_params.json
|   |   |-- 0_depth.png
|   |   |-- 0_rgb.png
|   |   |-- 0_order.csv
|   |   ...
|   |-- scene1
|   ...
|-- difficulty-n-coco-label.json
Each scene is a unique arrangement of objects, which we then display at various different angles. For each shot of a scene, we provide the camera parameters (x_camera_params.json), a depth image (x_depth.png), an RGB image (x_rgb.png), as well as a matrix representation of the ordering of each object (x_order.csv). The full labels for the image are all available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the COCO data format.
The matrix describes a pairwise obstruction relationship between each object within the image. Given a "parent" object covering a "child" object:
relationship_matrix[child_id, parent_id] = -1
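A short sketch of how the obstruction matrix could be read and turned into (parent, child) pairs; it assumes x_order.csv stores the square matrix described above without a header row.

```python
import numpy as np

# Load the pairwise obstruction matrix for one shot of a scene.
order = np.loadtxt("meta-grasp/scene0/0_order.csv", delimiter=",")

# relationship_matrix[child_id, parent_id] == -1 means `parent` covers `child`.
child_ids, parent_ids = np.where(order == -1)
for child, parent in zip(child_ids, parent_ids):
    print(f"object {parent} obstructs object {child}")
```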