Spot the Difference Corpus is a corpus of task-oriented spontaneous dialogues which contains 54 interactions between pairs of subjects interacting to find differences in two very similar scenes. The corpus includes rich transcriptions, annotations, audio and video.
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and their decompositions (decomposition of complex actions into simpler ones). The dataset has been integrated with DBpedia (259,568 links).
For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data on datahub: https://datahub.io/dataset/human-activities-and-instructions
* Quickstart: to experiment with the highest-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
* Data representation: based on the PROHOW vocabulary (http://w3id.org/prohow#). Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
* Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions can have multiple methods, steps can have further sub-steps, and complex requirements can be decomposed into sub-requirements.
Statistics:
* 211,696 instructions: 167,232 from wikiHow (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow) and 44,464 from Snapguide (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 2,609,236 RDF nodes within the instructions: 1,871,468 from wikiHow and 737,768 from Snapguide.
* 255,101 process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs).
* 4,467 process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs).
* 376,795 step links between 114,166 different sets of instructions (dataset Process - Step Links).
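For a first look at one of the downloaded files, a hedged sketch using rdflib (the serialization format of the dumps is an assumption and should be adjusted to the actual download; only the PROHOW namespace given above is taken from the description):

```python
from collections import Counter
from rdflib import Graph, Namespace

PROHOW = Namespace("http://w3id.org/prohow#")

g = Graph()
# File name from the quickstart note above; the format argument is an assumption.
g.parse("9of11_knowhow_wikihow", format="turtle")

print(len(g), "triples loaded")

# Count how often each PROHOW property occurs, without assuming specific property names
predicate_counts = Counter(p for _, p, _ in g if str(p).startswith(str(PROHOW)))
for predicate, n in predicate_counts.most_common(10):
    print(predicate, n)
```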
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our analyses are based on 148×148 time- and frequency-domain correlation matrices. A correlation matrix covers all possible combinations of the activity metrics listed in the article: with these activity metrics and different preprocessing methods, we were able to calculate 148 different activity signals from multiple datasets of a single measurement. Each cell of a correlation matrix contains the mean and standard deviation of the Pearson correlation coefficients between two types of activity signals, calculated across 42 subjects' 10-day-long motion recordings. The small correlation matrices presented in the article and in the appendixes are derived from these 148×148 correlation matrices. The published Excel workbook contains multiple sheets labelled according to their content; the mean and standard deviation values for the time- and frequency-domain correlations can each be found on their own separate sheet. Moreover, we reproduced the correlation matrices with an alternatively parametrized digital filter, which doubled the number of sheets to 8. In the Excel workbook we used the same notation for the datasets and activity metrics as presented in the article, with an extension to the PIM metric: PIMs denotes the PIM metric computed with Simpson's 3/8 rule integration, and PIMr denotes the PIM metric computed by simple numerical integration (Riemann sum). (XLSX)
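As an illustration of how one cell of such a matrix could be reproduced, a hedged sketch (the array shapes, signal names, and synthetic data below are assumptions, not the authors' code):

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_cell(signals_a, signals_b):
    """One cell of the correlation matrix: mean and SD of the per-subject
    Pearson correlation between two activity-signal variants."""
    r_values = [pearsonr(a, b)[0] for a, b in zip(signals_a, signals_b)]
    return np.mean(r_values), np.std(r_values, ddof=1)

# Synthetic stand-in for 42 subjects' activity signals (e.g., PIMs vs. PIMr)
rng = np.random.default_rng(0)
signals_a = [rng.normal(size=1440) for _ in range(42)]
signals_b = [a + rng.normal(scale=0.3, size=1440) for a in signals_a]

mean_r, sd_r = correlation_cell(signals_a, signals_b)
print(f"mean r = {mean_r:.3f}, SD = {sd_r:.3f}")
```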
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes the following columns:
A. a sequential student ID
B. an ID that defines a random group label and the notation
C. the notation used: Use Cases or User Stories
D. the case the subject was assigned to: IFA, Sim, or Hos
E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is at least 65 and less than 80, and L otherwise
G. the total number of classes in the student's conceptual model
H. the total number of relationships in the student's conceptual model
I. the total number of classes in the expert's conceptual model
J. the total number of relationships in the expert's conceptual model
K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
P. the researchers' judgement of how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present.
Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.
All the calculations and information provided in the following sheets
originate from that raw data.
Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,
including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.
Sheet 3 (Size-Ratio):
The size ratio is calculated as the number of classes within the student model divided by the number of classes within the expert model. We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes; however, we also provide the size ratio for the number of relationships between the student and expert models.
Sheet 4 (Overall):
Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in the expert model that are represented, correctly or incorrectly, in the student model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
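A minimal sketch restating these two ratios (the argument names simply mirror the counts in columns K-O of Sheet 1):

```python
def correctness(al, wr, so, om):
    # AL / (AL + OM + SO + WR): share of aligned classes among all tagged situations
    return al / (al + om + so + wr)

def completeness(al, wr, om):
    # (AL + WR) / (AL + WR + OM): share of expert classes represented at all in the student model
    return (al + wr) / (al + wr + om)

# Example: 12 aligned, 3 wrongly represented, 2 system-oriented, 5 omitted
print(correctness(al=12, wr=3, so=2, om=5))   # ~0.545
print(completeness(al=12, wr=3, om=5))        # 0.75
```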
For sheet 4 as well as for the following four sheets, diverging stacked bar
charts are provided to visualize the effect of each of the independent and moderating variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html (a reference sketch of the computation is given after the sheet list below). The independent and moderating variables can be found as follows:
Sheet 5 (By-Notation):
Model correctness and model completeness are compared by notation - UC, US.
Sheet 6 (By-Case):
Model correctness and model completeness are compared by case - SIM, HOS, IFA.
Sheet 7 (By-Process):
Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.
Sheet 8 (By-Grade):
Model correctness and model completeness are compared by the exam grades, converted to the categorical values Low, Medium, and High.
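The published values were obtained with the psychometrica.de tool linked above; for reference, a hedged sketch of the standard Hedges' g formula and the accompanying t-test (the group arrays are placeholders, not the study data):

```python
import numpy as np
from scipy.stats import ttest_ind

def hedges_g(x, y):
    """Hedges' g for two independent samples: Cohen's d with the
    small-sample bias correction J = 1 - 3 / (4*(n1 + n2) - 9)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    d = (x.mean() - y.mean()) / pooled_sd
    return d * (1 - 3 / (4 * (nx + ny) - 9))

# Placeholder correctness scores for the two notation groups (UC vs. US)
uc = np.array([0.55, 0.60, 0.48, 0.72, 0.66, 0.59])
us = np.array([0.50, 0.41, 0.58, 0.47, 0.52, 0.61])
t, p = ttest_ind(uc, us)
print(f"t = {t:.2f}, p = {p:.3f}, Hedges' g = {hedges_g(uc, us):.2f}")
```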
https://creativecommons.org/publicdomain/zero/1.0/
You will find three datasets containing heights of high school students.
All heights are in inches.
The data is simulated. The heights are generated from a normal distribution with different sets of mean and standard deviation for boys and girls.
| Height Statistics (inches) | Boys | Girls |
| --- | --- | --- |
| Mean | 67 | 62 |
| Standard Deviation | 2.9 | 2.2 |
There are 500 measurements for each gender.
Here are the datasets:
hs_heights.csv: contains a single column with heights for all boys and girls. There is no way to tell which of the values are for boys and which are for girls.
hs_heights_pair.csv: has two columns. The first column contains boys' heights. The second column contains girls' heights.
hs_heights_flag.csv: has two columns. The first column contains the flag is_girl. The second column contains a girl's height if the flag is 1. Otherwise, it contains a boy's height.
To see how I generated this dataset, check this out: https://github.com/ysk125103/datascience101/tree/main/datasets/high_school_heights
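The linked repository contains the actual generation script; the following is only a hedged sketch that mirrors the stated parameters (file layouts are as described above, but the column names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
boys = rng.normal(loc=67, scale=2.9, size=500)   # mean 67 in, SD 2.9 in
girls = rng.normal(loc=62, scale=2.2, size=500)  # mean 62 in, SD 2.2 in

# hs_heights.csv: a single unlabeled column with all heights mixed together
pd.DataFrame({"height": np.concatenate([boys, girls])}).to_csv("hs_heights.csv", index=False)

# hs_heights_pair.csv: boys' and girls' heights side by side
pd.DataFrame({"boys": boys, "girls": girls}).to_csv("hs_heights_pair.csv", index=False)

# hs_heights_flag.csv: is_girl flag plus the corresponding height
flags = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
heights = np.concatenate([boys, girls])
pd.DataFrame({"is_girl": flags, "height": heights}).to_csv("hs_heights_flag.csv", index=False)
```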
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Overview
This repository contains datasets used to evaluate potential improvements to flood detectability afforded by combining, for the first time globally, data collected by Landsat, Sentinel-2, and Sentinel-1. The datasets were produced as part of the manuscript "A multi-sensor approach for increased measurements of floods and their societal impacts from space", which is currently in review.
2. Dataset Descriptions
There are two datasets included here.
(a) A global grid of revisit periods of the Landsat, Sentinel-1, and Sentinel-2 satellites and their combination [GlobalMedianRevisits.zip]
A global dataset of revisit periods of individual satellites and their combination based on a 0.5-degree resolution grid.
Revisit periods are defined as the time between two consecutive observations of a particular point on the surface, for the satellite missions Landsat, Sentinel-2 and Sentinel-1. The grid was created using ArcMap 10.8.1 and intersections of the grid were used to create points. For each individual point, average revisit times were calculated for each individual satellite and for the composite of the three satellites (averaging accounts for irregular revisits and downlink issues). Averaged revisit times for each of these points were calculated based on the number of image tiles acquired between 01 Jan 2016 and 31 Dec 2020 that intersected a particular grid point with more than a 30-minute time difference between each other.
The following equation is used to calculate revisit periods:
Average revisit time for a grid point = (number of days between 01 Jan 2016 and 31 Dec 2020, i.e. 1827) / (total number of images captured)
Only land grid points between 82.5° N and 55° S are considered; Antarctica is omitted from the analysis. For satellite missions that consist of two spacecraft orbiting simultaneously (Sentinel-1 A/B and Sentinel-2 A/B), images acquired by both satellites were used in the average revisit period calculation for a given grid point. The sum total of image tiles of all three missions is used to calculate the composite point-based revisit times.
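A hedged sketch of the per-point calculation described above (the 30-minute de-duplication rule and the 1827-day window are taken from the text; function and variable names are illustrative):

```python
from datetime import datetime, timedelta

def average_revisit_days(acquisition_times, start="2016-01-01", end="2020-12-31"):
    """Average revisit time (days) for one grid point.

    acquisition_times: datetimes of all image tiles intersecting the point,
    across the missions being combined. Tiles within 30 minutes of the
    previous counted tile are not treated as separate revisits.
    """
    counted, last = 0, None
    for t in sorted(acquisition_times):
        if last is None or (t - last) > timedelta(minutes=30):
            counted += 1
            last = t
    n_days = (datetime.fromisoformat(end) - datetime.fromisoformat(start)).days + 1  # 1827
    return n_days / counted if counted else float("inf")
```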
(b) Average revisit periods of satellites for flood records in the DFO database [FloodInfo.zip]
Average Revisit Times of Landsat, Sentinel-1, Sentinel-2 and their ensemble are calculated for 5130 flood records in the Dartmouth Flood Observatory's (DFO) flood record database. These were appended to the already existing attributes of the database.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arithmetic Dataset
This dataset focuses on fundamental arithmetic operations. It includes examples covering:
Addition: problems involving the sum of two or more numbers.
Subtraction: problems involving finding the difference between two numbers.
Multiplication: problems involving the product of two or more numbers.
Roots: problems involving finding the square root and potentially other roots of numbers.
Cubes: problems involving calculating the cube of a number.
Squares: problems… See the full description on the dataset page: https://huggingface.co/datasets/evanto/arithmetic-qa-dataset-200M.
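The dataset can presumably be loaded with the Hugging Face datasets library; a hedged sketch (the split name and record fields are assumptions — check the dataset card at the URL above):

```python
from itertools import islice
from datasets import load_dataset

# Stream the dataset to avoid downloading all examples at once
ds = load_dataset("evanto/arithmetic-qa-dataset-200M", split="train", streaming=True)

for example in islice(ds, 3):
    print(example)  # the field names depend on the dataset card
```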
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
———————————————————————————————— ORIGINAL PAPERS ————————————————————————————————
Lioi, G., Cury, C., Perronnet, L., Mano, M., Bannier, E., Lécuyer, A., & Barillot, C. (2019). Simultaneous MRI-EEG during a motor imagery neurofeedback task: an open access brain imaging dataset for multi-modal data integration. bioRxiv. https://doi.org/10.1101/862375
Mano, Marsel, Anatole Lécuyer, Elise Bannier, Lorraine Perronnet, Saman Noorzadeh, and Christian Barillot. 2017. "How to Build a Hybrid Neurofeedback Platform Combining EEG and fMRI." Frontiers in Neuroscience 11 (140). https://doi.org/10.3389/fnins.2017.00140
Perronnet, Lorraine, Anatole Lécuyer, Marsel Mano, Elise Bannier, Maureen Clerc, Christian Barillot, et al. 2017. "Unimodal Versus Bimodal EEG-fMRI Neurofeedback of a Motor Imagery Task." Frontiers in Human Neuroscience 11 (193). https://doi.org/10.3389/fnhum.2017.00193
This dataset, named XP1, can be pooled together with the dataset XP2, available here: https://openneuro.org/datasets/ds002338. Data acquisition methods have been described in Perronnet et al. (2017, Frontiers in Human Neuroscience). Simultaneous 64-channel EEG and fMRI were acquired during right-hand motor imagery and neurofeedback (NF) in this study (as well as in XP2). For this study, 10 subjects performed three types of NF runs (bimodal EEG-fMRI NF, unimodal EEG-NF and unimodal fMRI-NF).
————————————————————————————————
EXPERIMENTAL PARADIGM
————————————————————————————————
Subjects were instructed to perform a kinaesthetic motor imagery of the right hand and to find their own strategy to control and bring the ball to the target.
The experimental protocol consisted of 6 EEG-fMRI runs with a 20 s block design alternating rest and task:
motor localizer run (task-motorloc) - 8 blocks X (20 s rest + 20 s task)
motor imagery run without NF (task-MIpre) - 5 blocks X (20 s rest + 20 s task)
three NF runs with different NF conditions (task-eegNF, task-fmriNF, task-eegfmriNF), occurring in random order - 10 blocks X (20 s rest + 20 s task)
motor imagery run without NF (task-MIpost) - 5 blocks X (20 s rest + 20 s task)
———————————————————————————————— EEG DATA ———————————————————————————————— EEG data was recorded using a 64-channel MR compatible solution from Brain Products (Brain Products GmbH, Gilching, Germany).
RAW EEG DATA
EEG was sampled at 5 kHz with FCz as the reference electrode and AFz as the ground electrode, and with a resolution of 0.5 microV. Following the BIDS folder structure, raw EEG data for each task can be found for each subject in
XP1/sub-xp1*/eeg
in Brain Vision Recorder format (File Version 1.0). Each raw EEG recording includes three files: the data file (*.eeg), the header file (*.vhdr) and the marker file (*.vmrk). The header file contains information about acquisition parameters and amplifier setup. For each electrode, the impedance at the beginning of the recording is also specified. For all subjects, channel 32 is the ECG channel. The 63 other channels are EEG channels.
The marker file contains the list of markers assigned to the EEG recordings and their properties (marker type, marker ID and position in data points). Three types of markers are relevant for the EEG processing:
R128 (Response): is the fMRI volume marker to correct for the gradient artifact
S 99 (Stimulus): is the protocol marker indicating the start of the Rest block
S 2 (Stimulus): is the protocol marker indicating the start of the Task (Motor Execution, Motor Imagery or Neurofeedback)
Warning: in a few EEG recordings, the first S 99 marker might be missing, but it can easily be "added" 20 s before the first S 2.
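A hedged sketch of how this fix could be applied when reading the raw BrainVision files with MNE-Python (the file path and the exact marker description strings are assumptions; the logic simply inserts a Rest onset 20 s before the first Task marker, as described above):

```python
import mne
import numpy as np

# Illustrative path; raw files live under XP1/sub-xp1*/eeg in the BIDS tree
raw = mne.io.read_raw_brainvision("sub-xp101_task-eegNF_eeg.vhdr", preload=False)
ann = raw.annotations

# Marker descriptions typically look like "Stimulus/S 99" and "Stimulus/S  2"; match loosely
is_s99 = np.array(["S 99" in d for d in ann.description])
is_s2 = np.array(["S  2" in d or d.endswith("S 2") for d in ann.description])

# If the first Rest marker (S 99) is missing, add one 20 s before the first Task marker (S 2)
if is_s2.any() and (not is_s99.any() or ann.onset[is_s99][0] > ann.onset[is_s2][0]):
    first_s2_onset = ann.onset[is_s2][0]
    ann.append(onset=first_s2_onset - 20.0, duration=0.0, description="Stimulus/S 99")
    raw.set_annotations(ann)
```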
PREPROCESSED EEG DATA
Following the BIDS folder structure, processed EEG data for each task and subject can be found in the pre-processed data folder:
XP1/derivatives/sub-xp1*/eeg_pp/*eeg_pp.*
in the BrainVision Analyzer format. Each processed EEG recording includes three files: the data file (*.dat), the header file (*.vhdr) and the marker file (*.vmrk), containing information similar to that described for the raw data. The header file of the preprocessed data also specifies the channel locations. The marker file additionally specifies the locations (in data points) of the identified heart pulses (R markers).
EEG data were pre-processed using BrainVision Analyzer II software, with the following steps:
- Automatic gradient artifact correction using the artifact template subtraction method (sliding average calculation with 21 intervals for the sliding average and all channels enabled for correction).
- Downsampling by a factor of 25 (to 200 Hz).
- Low-pass FIR filter with a cut-off frequency of 50 Hz.
- Ballistocardiogram (pulse) artifact correction using a semiautomatic procedure (pulse template searched between 40 s and 240 s in the ECG channel with the following parameters: coherence trigger = 0.5, minimal amplitude = 0.5, maximal amplitude = 1.3). The identified pulses were marked with R.
- Segmentation relative to the first block marker (S 99) for the whole length of the training protocol (last S 2 + 20 s).
EEG NF SCORES
Neurofeedback scores can be found in the .mat structures in
XP1/derivatives/sub-xp1*/NF_eeg/d_sub*NFeeg_scores.mat
The structures, named NF_eeg, are composed of the following subfields:
NF_eeg
→ .nf_laterality (NF score computed as for real-time calculation - equation (1))
→ .filteegpow_left (Bandpower of the filtered eeg signal in C1)
→ .filteegpow_right (Bandpower of the filtered eeg signal in C2)
→ .nf (vector of NF scores, 4 per second, computed as in equation (3), for comparison with XP2)
→ .smoothed
→ .eegdata (64 X 200 X 400 matrix, with the pre-processed EEG signals according to the steps described above)
→ .method
Where the subfield method contains information about the Laplacian filter used and the frequency band of interest.
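A hedged sketch for reading these structures in Python (the exact file name contains a subject-specific part, so the path below is illustrative; the field access mirrors the subfields listed above):

```python
from scipy.io import loadmat

# Illustrative path; actual files match XP1/derivatives/sub-xp1*/NF_eeg/d_sub*NFeeg_scores.mat
mat = loadmat("XP1/derivatives/sub-xp101/NF_eeg/d_sub-xp101_NFeeg_scores.mat",
              squeeze_me=True, struct_as_record=False)
nf_eeg = mat["NF_eeg"]          # assumes the struct is stored under this variable name

print(nf_eeg.nf_laterality)     # NF score as computed in real time (equation (1))
print(nf_eeg.eegdata.shape)     # expected (64, 200, 400) pre-processed EEG array
print(nf_eeg.method)            # Laplacian filter and frequency band of interest
```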
———————————————————————————————— BOLD fMRI DATA ———————————————————————————————— All DICOM files were converted to NIfTI-1 and then to BIDS format (version 2.1.4) using the software dcm2niix (version v1.0.20190720 GVV7.4.0).
fMRI acquisitions were performed using echo-planar imaging (EPI) covering the entire brain, with the following parameters:
3T Siemens Verio, EPI sequence, TR = 2 s, TE = 23 ms, resolution = 2x2x4 mm3, FOV = 210x210 mm2, number of slices: 32, no slice gap.
As specified by the onsets in the corresponding task event files (XP1/*events.tsv), the scanner began the EPI pulse sequence two seconds prior to the start of the protocol (first rest block), so the first two TRs should be discarded. The useful TRs for the runs are therefore:
task-motorloc: 320 s (2 to 322)
task-MIpre and task-MIpost: 200 s (2 to 202)
task-eegNF, task-fmriNF, task-eegfmriNF: 400 s (2 to 402)
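A hedged sketch for discarding those first two volumes with nibabel before analysis (the file name is illustrative; the functional runs live under XP1/sub-xp1*/func, see below):

```python
import nibabel as nib

# Illustrative BIDS-style file name
img = nib.load("sub-xp101_task-eegNF_bold.nii.gz")
data = img.get_fdata()

trimmed = data[..., 2:]   # drop the first two TRs acquired before the protocol started
nib.save(nib.Nifti1Image(trimmed, img.affine, img.header),
         "sub-xp101_task-eegNF_bold_trimmed.nii.gz")
```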
In task events files for the different tasks, each column represents:
Following the BIDS folder structure, the functional data and related metadata are found for each subject in the following directory:
XP1/sub-xp1*/func
BOLD-NF SCORES
For each subject and NF session, a Matlab structure with BOLD-NF features can be found in
XP1/derivatives/sub-xp1*/NF_bold/
For the computation of the BOLD-NF scores, fMRI data were preprocessed using SPM8 with the following steps: slice-time correction, spatial realignment and coregistration with the anatomical scan, spatial smoothing with a 6 mm Gaussian kernel, and normalization to the Montreal Neurological Institute (MNI) template. For each session, a first-level general linear model analysis was then performed. The resulting activation maps (voxel-wise family-wise error corrected at p < 0.05) were used to define two ROIs (9x9x3 voxels) around the maximum of activation in the left and right motor cortex. The BOLD-NF scores (fMRI laterality index) were calculated as the difference between the percentage signal change in the left and right motor ROIs, as for the online NF calculation. A smoothed and normalized version of the NF scores over the preceding three volumes was also computed. To allow for comparison and aggregation of the two datasets XP1 and XP2, we also computed NF scores considering the left motor cortex and a background slice, as for the online NF calculation in XP2.
In the NF_bold folder, the Matlab files sub-xp1*_task-*_NFbold_scores.mat therefore have the following structure:
NF_bold
→ .nf_laterality (calculated as for the online NF calculation)
→ .smoothnf_laterality
→ .normnf_laterality
→ .nf (calculated as for the online NF calculation in XP2)
→ .roimean_left (averaged BOLD signal in the left motor ROI)
→ .roimean_right (averaged BOLD signal in the right motor ROI)
→ .bgmean (averaged BOLD signal in the background slice)
→ .method
Where the subfield .method contains information about the ROI size (.roisize), the background mask (.bgmask) and the ROI masks (.roimask_left, .roimask_right). More details about signal processing and NF calculation can be found in the publications listed under ORIGINAL PAPERS above.
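A hedged sketch of the laterality score described above, using ROI-averaged time series such as those stored in .roimean_left and .roimean_right (the baseline definition and smoothing details are simplified assumptions):

```python
import numpy as np

def percent_signal_change(roi_ts, rest_idx):
    """Percentage signal change of an ROI-averaged BOLD time series
    relative to the mean over the rest volumes."""
    baseline = roi_ts[rest_idx].mean()
    return 100.0 * (roi_ts - baseline) / baseline

def nf_laterality(roimean_left, roimean_right, rest_idx):
    # fMRI laterality index: PSC(left motor ROI) minus PSC(right motor ROI)
    return (percent_signal_change(roimean_left, rest_idx)
            - percent_signal_change(roimean_right, rest_idx))

def smooth_over_previous(nf, window=3):
    # Causal moving average over the preceding volumes (edges use fewer samples)
    return np.array([nf[max(0, i - window + 1):i + 1].mean() for i in range(len(nf))])
```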
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]
The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.
Data Set description:
The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.
The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.
The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of performance when only the interfaces of the apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments.
Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities.
References:
[1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
[2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
[3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
[4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.
The two datasets provided here were used to provide inter-rater reliability statistics for the application of a metaphor identification procedure to texts written in English. Three experienced metaphor researchers applied the Metaphor Identification Procedure Vrije Universiteit (MIPVU) to approximately 1500 words of text from two English-language newspaper articles. The dataset Eng1 contains each researcher’s independent analysis of the lexical demarcation and metaphorical status of each word in the sample. The dataset Eng2 contains a second analysis of the same texts by the same three researchers, carried out after a comparison of our responses in Eng 1 and a troubleshooting session where we discussed our differences. The accompanying R-code was used to produce the three-way and pairwise inter-rater reliability data reported in Section 3.2 of the chapter: How do I determine what comprises a lexical unit? The headings in both datasets are identical, although the order of the columns differs in the two files. In both datasets, each line corresponds to one orthographic word from the newspaper texts. Chapter Abstract: The first part of this chapter discusses various ‘nitty-gritty’ practical aspects about the original MIPVU intended for the English language. Our focus in these first three sections is on common pitfalls for novice MIPVU users that we have encountered when teaching the procedure. First, we discuss how to determine what comprises a lexical unit (section 3.2). We then move on to how to determine a more basic meaning of a lexical unit (section 3.3), and subsequently discuss how to compare and contrast contextual and basic senses (section 3.4). We illustrate our points with actual examples taken from some of our teaching sessions, as well as with our own study into inter-rater reliability, conducted for the purposes of this new volume about MIPVU in multiple languages. Section 3.5 shifts to another topic that new MIPVU users ask about – namely, which practical tools they can use to annotate their data in an efficient way. Here we discuss some tools that we find useful, illustrating how we utilized them in our inter-rater reliability study. We close this part with section 3.6, a brief discussion about reliability testing. The second part of this chapter adopts more of a bird’s-eye view. Here we leave behind the more technical questions of how to operationalize MIPVU and its steps, and instead respond more directly to the question posed above: Do we really have to identify every metaphor in every bit of our data? We discuss possible approaches for research projects involving metaphor identification, by exploring a number of important questions that all researchers need to ask themselves (preferably before they embark on a major piece of research). Section 3.7 weighs some of the differences between quantitative and qualitative approaches in metaphor research projects, while section 3.8 talks about considerations when it comes to choosing which texts to investigate, as well as possible research areas where metaphor identification can play a useful role. We close this chapter in section 3.9 with a recap of our ‘take-away’ points – that is, a summary of the highlights from our entire discussion.
https://spdx.org/licenses/CC0-1.0.html
Fencing is a major anthropogenic feature affecting human relationships, ecological processes, and wildlife distributions and movements, but its impacts are difficult to quantify due to a widespread lack of spatial data. We created a fence model and compared outputs to a fence mapping approach using satellite imagery in two counties in southwest Montana, USA, to advance fence data development for use in research and management. The model incorporated road, land cover, ownership, and grazing boundary spatial layers to predict fence locations. We validated the model using data collected on randomized road transects (n = 330). The model predicted 34,706.4 km of fences with a mean fence density of 0.93 km/km2 and a maximum density of 14.9 km/km2. We also digitized fences using Google Earth Pro in a random subset of our study area in survey townships (n = 50). The Google Earth approach showed greater agreement (K = 0.76) with known samples than the fence model (K = 0.56) yet was unable to map fences in forests and was significantly more time intensive. We also compared fence attributes by land ownership and land cover variables to assess factors that may influence fence specifications (e.g., wire heights) and types (e.g., number of barbed wires). Private lands were more likely to have fences with lower bottom wires and higher top wires than those on public lands, with sample means at 22 cm and 26.4 cm, and 115.2 cm and 110.97 cm, respectively. Both bottom wire means were well below recommended heights for ungulates navigating underneath fencing (≥ 46 cm), while top wire means were closer to the 107 cm maximum fence height recommendation. We found that both fence type and land ownership were correlated (χ2 = 45.52, df = 5, p = 0.001), as well as fence type and land cover type (χ2 = 140.73, df = 15, p = 0.001). We provide tools for estimating fence locations, and our novel fence type assessment demonstrates an opportunity for updated policy to encourage the adoption of "wildlife-friendlier" fencing standards to facilitate wildlife movement in the western U.S. while supporting rural livelihoods.

Methods
For the fence model and fence density layers, the data were adapted from publicly available spatial layers informed by local expert opinion in Beaverhead and Madison Counties, MT. Data used included Montana Department of Transportation road layers, land ownership data from the Montana State Library cadastral database, land cover data from the 2019 Montana Department of Revenue Final Land Unit (FLU), and railroad data from the Montana State Library. The data were processed in ArcMap 10.6.1 to form a hierarchical predictive fence location and density GIS model. For the Google Earth mapped fences, data were collected by examining satellite imagery and tracing visible fence lines in Google Earth Pro version 7.3.3 (Google 2020) within the bounds of 50 random survey township polygons in Beaverhead and Madison Counties.
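A hedged sketch of the agreement and association statistics reported above (Cohen's kappa for map-versus-field agreement, chi-square for fence type versus ownership); the arrays below are placeholders, not the study data:

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Presence/absence of fences on validation transects: field observation vs. model prediction
observed = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])   # placeholder field data
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 1])  # placeholder model output
print("kappa:", cohen_kappa_score(observed, predicted))

# Contingency table of fence type (rows) by land ownership (columns), placeholder counts
table = np.array([[30, 12],
                  [18, 25],
                  [ 9, 16]])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```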
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Synthetic mass accumulation rates have been calculated for ODP Site 707 using depth-density and depth-porosity functions to estimate values for these parameters with increasing sediment thickness, at 1 Ma time intervals determined on the basis of published microfossil datums. These datums were the basis of the age model used by Peterson and Backman (1990, doi:10.2973/odp.proc.sr.115.163.1990) to calculate actual mass accumulation rate data using density and porosity measurements. A comparison is made between the synthetic and actual mass accumulation rate values at 1 Myr time intervals for the interval 37 Ma to the Recent. There is a correlation coefficient of 0.993 between the two data sets, with an absolute difference generally less than 0.1 g/cm**2/kyr. We have used the method to extend the mass accumulation rate analysis back to the Late Paleocene (60 Ma) for Site 707. Provided that age datums (e.g. fossil or magnetic anomaly data) are available, synthetic mass accumulation rates can be calculated for any sediment sequence.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Mass spectrometry imaging (MSI) is a technique that provides comprehensive molecular information with high spatial resolution from tissue. Today, there is a strong push toward sharing data sets through public repositories in many research fields where MSI is commonly applied; yet, there is no standardized protocol for analyzing these data sets in a reproducible manner. Shifts in the mass-to-charge ratio (m/z) of molecular peaks present a major obstacle that can make it impossible to distinguish one compound from another. Here, we present a label-free m/z alignment approach that is compatible with multiple instrument types and makes no assumptions on the sample’s molecular composition. Our approach, MSIWarp (https://github.com/horvatovichlab/MSIWarp), finds an m/z recalibration function by maximizing a similarity score that considers both the intensity and m/z position of peaks matched between two spectra. MSIWarp requires only centroid spectra to find the recalibration function and is thereby readily applicable to almost any MSI data set. To deal with particularly misaligned or peak-sparse spectra, we provide an option to detect and exclude spurious peak matches with a tailored random sample consensus (RANSAC) procedure. We evaluate our approach with four publicly available data sets from both time-of-flight (TOF) and Orbitrap instruments and demonstrate up to 88% improvement in m/z alignment.
The U.S. Department of Energy's National Renewable Energy Laboratory collaborates with the solar industry to establish high quality solar and meteorological measurements. This Solar Resource and Meteorological Assessment Project (SOLRMAP) provides high quality measurements to support deployment of power projects in the United States. The no-funds-exchanged collaboration brings NREL solar resource assessment expertise together with industry needs for measurements. The end result is high quality data sets to support the financing, design, and monitoring of large scale solar power projects for industry in addition to research-quality data for NREL model development. NREL provides consultation for instrumentation and station deployment, along with instrument calibrations, data acquisition, quality assessment, data distribution, and summary reports. Industry participants provide equipment, infrastructure, and station maintenance.
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
https://eidc.ceh.ac.uk/licences/historic-SPI/plain
5km gridded Standardised Precipitation Index (SPI) data for Great Britain, which is a drought index based on the probability of precipitation for a given accumulation period as defined by McKee et al [1]. There are seven accumulation periods: 1, 3, 6, 9, 12, 18, 24 months and for each period SPI is calculated for each of the twelve calendar months. Note that values in monthly (and for longer accumulation periods also annual) time series of the data therefore are likely to be autocorrelated. The standard period which was used to fit the gamma distribution is 1961-2010. The dataset covers the period from 1862 to 2015. This version supersedes previous versions (version 2 and 3) of the same dataset due to minor errors in the data files. NOTE: the difference between this dataset with the previously published dataset 'Gridded Standardized Precipitation Index (SPI) using gamma distribution with standard period 1961-2010 for Great Britain [SPIgamma61-10]' (Tanguy et al., 2015; https://doi.org/10.5285/94c9eaa3-a178-4de4-8905-dbfab03b69a0) , apart from the temporal and spatial extent, is the underlying rainfall data from which SPI was calculated. In the previously published dataset, CEH-GEAR (Tanguy et al., 2014; https://doi.org/10.5285/5dc179dc-f692-49ba-9326-a6893a503f6e) was used, whereas in this new version, Met Office 5km rainfall grids were used (see supporting information for more details). The methodology to calculate SPI is the same in the two datasets. [1] McKee, T. B., Doesken, N. J., Kleist, J. (1993). The Relationship of Drought Frequency and Duration to Time Scales. Eighth Conference on Applied Climatology, 17-22 January 1993, Anaheim, California.
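For reference, a hedged sketch of the SPI transform itself (fit a gamma distribution to the accumulated precipitation of a given calendar month over the 1961-2010 standard period, then map cumulative probabilities to a standard normal); the published grids were produced with the authors' own implementation, which may handle zero-precipitation months and distribution fitting differently:

```python
import numpy as np
from scipy import stats

def spi(accum_precip, standard_period_mask):
    """accum_precip: 1-D array of n-month accumulated precipitation for one
    calendar month at one grid cell; standard_period_mask selects 1961-2010."""
    calib = accum_precip[standard_period_mask]
    nonzero = calib[calib > 0]
    shape, loc, scale = stats.gamma.fit(nonzero, floc=0)    # gamma fit to non-zero totals
    p_zero = 1 - len(nonzero) / len(calib)                  # probability of a zero total
    cdf = p_zero + (1 - p_zero) * stats.gamma.cdf(accum_precip, shape, loc=loc, scale=scale)
    return stats.norm.ppf(cdf)                              # standard normal quantiles = SPI
```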
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains data files for 3 activities described in the accompanying PPT slides.
1. An Excel spreadsheet for analysing gain scores in a 2-group, 2-times data array. This activity requires access to https://campbellcollaboration.org/research-resources/effect-size-calculator.html to calculate effect size.
2. An AMOS path model and SPSS data set for an autoregressive, bivariate path model with cross-lagging. This activity is related to the following article: Brown, G. T. L., & Marshall, J. C. (2012). The impact of training students how to write introductions for academic essays: An exploratory, longitudinal study. Assessment & Evaluation in Higher Education, 37(6), 653-670. doi:10.1080/02602938.2011.563277
3. An AMOS latent curve model and SPSS data set for a 3-time latent factor model with an interaction mixed model that uses GPA as a predictor of the LCM start and slope or change factors. This activity makes use of data reported previously and a published data analysis case: Peterson, E. R., Brown, G. T. L., & Jun, M. C. (2015). Achievement emotions in higher education: A diary study exploring emotions across an assessment event. Contemporary Educational Psychology, 42, 82-96. doi:10.1016/j.cedpsych.2015.05.002 and Brown, G. T. L., & Peterson, E. R. (2018). Evaluating repeated diary study responses: Latent curve modeling. In SAGE Research Methods Cases Part 2. Retrieved from http://methods.sagepub.com/case/evaluating-repeated-diary-study-responses-latent-curve-modeling doi:10.4135/9781526431592
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (the pre-print is available in Open Access here: https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.
The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.
Methodology
To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those where these objects were primary research objects rather than only mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.
To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.
Test procedure: Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the protocol is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by the third researcher.
Description of the data in this data set
Protocol_HVD_SLR provides the structure of the protocol.
Spreadsheet #1 provides the filled protocol for the relevant studies.
Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e. before filtering out irrelevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter}
5) DOI / Website- a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim and the established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation of why these data are not shared
15) Period under investigation - the period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is it used in the study?
Quality- and relevance-related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - HVD is mentioned but not studied (e.g., as part of the discussion, future work etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx
Licenses or restrictions: CC-BY
For more info, see README.txt
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Note: Please use the following view to be able to see the entire Dataset Description: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Portal-Manifest-Metadata/x2z6-swxe
Dataset Description Outline (5 sections)
• INTRODUCTION
• WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF?
• WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA?
• HOW DOES THE PORTAL MANIFEST METADATA DATASET RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT OPEN DATA?
• IMPORTANT NOTES
INTRODUCTION • All of DEEP’s paper hazardous waste manifest records were recently scanned and “indexed”. • Indexing consisted of 6 basic pieces of information or “metadata” taken from each manifest about the Generator and stored with the scanned image. The metadata enables searches by: Site Town, Site Address, Generator Name, Generator ID Number, Manifest ID Number and Date of Shipment. • All of the metadata and scanned images are available electronically via DEEP’s Document Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/ • Therefore, it is no longer necessary to visit the DEEP Records Center in Hartford for manifest records or information. • This CT Data dataset “Hazardous Waste Portal Manifest Metadata” (or “Portal Manifest Metadata”) was copied from the DEEP Document Online Search Portal, and includes only the metadata – no images.
WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF? The Portal Manifest Metadata is a good search tool to use along with the Portal. Searching the Portal Manifest Metadata can provide the following advantages over searching the Portal: • faster searches, especially for "large searches" - those with a large number of search returns; • an unlimited number of search returns (the Portal is limited to 500); • a larger display of search returns; • search returns can be sorted and filtered online in CT Data; • search returns and the entire dataset can be downloaded from CT Data and used offline (e.g. downloaded to Excel format); and • metadata from searches can be copied from CT Data and pasted into the Portal search fields to quickly find single scanned images. The main advantages of the Portal are: • it provides access to scanned images of manifest documents (CT Data does not); and • images can be downloaded one or multiple at a time.
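For offline filtering, the dataset can likely also be pulled programmatically from data.ct.gov; a hedged sketch (the resource endpoint is inferred from the dataset identifier in the URL above and the column name used in the filter is an assumption):

```python
import pandas as pd

# Socrata-style endpoint inferred from the dataset URL (identifier x2z6-swxe);
# the $limit parameter is needed because the API defaults to returning 1,000 rows.
url = "https://data.ct.gov/resource/x2z6-swxe.csv?$limit=500000"
manifests = pd.read_csv(url)

# Example offline filter; the column name "site_town" is an assumption
hartford = manifests[manifests["site_town"].str.contains("Hartford", case=False, na=False)]
print(len(hartford))
```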
WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA? All hazardous waste manifest records received and maintained by the DEEP Manifest Program; including: • manifests originating from a Connecticut Generator or sent to a Connecticut Destination Facility including manifests accompanying an exported shipment • manifests with RCRA hazardous waste listed on them (such manifests may also have non-RCRA hazardous waste listed) • manifests from a Generator with a Connecticut Generator ID number (permanent or temporary number) • manifests with sufficient quantities of RCRA hazardous waste listed for DEEP to consider the Generator to be a Small or Large Quantity Generator • manifests with PCBs listed on them from 2016 to 6-29-2018. • Note: manifests sent to a CT Destination Facility were indexed by the Connecticut or Out of State Generator. Searches by CT Designated Facility are not possible unless such facility is the Generator for the purposes of manifesting.
All other manifests were considered “non-hazardous” manifests and not scanned. They were discarded after 2 years in accord with DEEP records retention schedule. Non-hazardous manifests include: • Manifests with only non-RCRA hazardous waste listed • Manifests from generators that did not have a permanent or temporary Generator ID number • Sometimes non-hazardous manifests were considered “Hazardous Manifests” and kept on file if DEEP had reason to believe the generator should have had a permanent or temporary Generator ID number. These manifests were scanned and included in the Portal.
Dates included: manifests with shipment dates from 1980 to present • States were the primary keepers of manifest records until June 29, 2018. Any manifest regarding a Connecticut Generator or Destination Facility should have been sent to DEEP, and should be present in the Portal and CT Data. • June 30, 2018 was the start of the EPA e-Manifest program. Most manifests with a shipment date on and after this date are sent to, and maintained by the EPA. • For information from EPA regarding these newer manifests: • Overview: https://rcrapublic.epa.gov/rcrainfoweb/action/modules/em/emoverview • To search by site, use EPA’s Sites List: https://rcrapublic.epa.gov/rcrainfoweb/action/modules/hd/handlerindex (Tip: Change the Location field from “National” to “Connecticut”) • Manifests still sent to DEEP on or after 6-30-2018 include: • manifests from exported shipments; and • manifest copies submitted pursuant to discrepancy reports and unmanifested shipments.
HOW DOES THE PORTAL MANIFEST METADATA RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT DATA?
• DEEP has posted in CT Data two other datasets about the same hazardous waste documents which are the subject of the Portal and the Portal Manifest Metadata Copy.
• There are likely some differences in the metadata between the Portal Manifest Metadata and the two others. DEEP recommends using all data sources for a complete search.
• These two datasets were the best search tool DEEP had available to the public prior to the Portal and the Metadata Copy:
• “Hazardous Waste Manifest Data (CT) 1984 – 2008”
https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008/h6d8-qiar; and
• “Hazardous Waste Manifest Data (CT) 1984 – 2008: Generator Summary View”
https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008-Generat/72mi-3f82.
• The only difference between these two datasets is:
• the first dataset includes all of the metadata transcribed from the manifests.
• the second “Generator Summary View” dataset is a smaller subset of the first, requested for convenience by the public.
Both of these datasets:
• Are copies of metadata from a manifest database maintained by DEEP. No scanned images are available as a companion to these datasets.
• The date range of the manifests for these datasets is 1984 to approximately 2008.
IMPORTANT NOTES (4): NOTE 1: Some manifest images are effectively unavailable via the Portal and the Portal Metadata due to incomplete or incorrect metadata. Such errors may be the result of unintentional data entry error, errors on the manifests or illegible manifests. • Incomplete or incorrect metadata may prevent a manifest from being found by a search. DEEP is currently working to complete the metadata as best it can. • Please report errors to the DEEP Manifest Program at deep.manifests@ct.gov. • DEEP will publish updates regarding this work here and through the DEEP Hazardous Waste Advisory Committee listserv. To sign up for this listserv, visit this webpage: https://portal.ct.gov/DEEP/Waste-Management-and-Disposal/Hazardous-Waste-Advisory-Committee/HWAC-Home. NOTE 2: This dataset does not replace the potential need for a full review of other files publicly available either on-line and/or at CT DEEP’s Records Center. For a complete review of agency records for this or other agency programs, you can perform your own search in our DEEP public file room located at 79 Elm Street, Hartford CT or at our DEEP Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/Home. NOTE 3: Other DEEP programs or state and federal agencies may maintain manifest records (e.g., DEEP Emergency Response, US Environmental Protection Agency, etc.) These other manifests were not scanned along with those from the Manifest Program files. However, most likely these other manifests are duplicate copies of manifests available via the Portal. NOTE 4: search tips for using the Portal and CT Data: • If your search will yield a small number of search returns, try using the Portal for your search. “Small” is meant to mean fewer than the 500 maximum search returns allowed using the Portal. • Start your search as broadly as possible – try entering just the town and the street name, or a portion of the street name that is likely to be spelled correctly • For searches yielding a large number of search returns, try using first the Portal Manifest Metadata in CT Data. • Try downloading the metadata and sorting, filtering, etc. the data to look for related spellings, etc. • Once you narrow down you research, copy the manifest number of a manifest you are interested in, and paste it into the Agency ID field of the Portal search page. • If you are using information from older information sources for consistency, you may want to search the two datasets copied from the older DEEP Manifest Database.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The data provided are two datasets obtained in one experimental campaign conducted as part of workpackage B of the
“Multi-fidelity probabilistic design framework for complex marine structures” project. Both datasets are from experiments
where a model was placed in a wave-current tank in the faculty of ME of the Delft University of Technology. The model
is stationary and the waves and current flow past the model. The modelled forward velocity of the model was 0.25 m/s,
the waves were irregular, and the model was free to heave and pitch. As this test facility has been used before, there is a previous
publication with an explanation. Please cite it as follows when using the data (bibtex at the bottom of this description):
Boon, A. D. and Wellens, P. R. (2023) ‘The effect of surge on extreme wave impacts and an insight into clustering’,
Journal of Marine Structures.
The data itself can also be referenced. Please cite it as follows (bibtex at the bottom of this description):
Boon, A.D. and Wellens, P.R. (2023) 'Large experimental data set for extreme wave impacts on S175 ship with and without
surge and with various bow drafts and freeboards (Repository)'. 4TU.ResearchData
The first dataset is focussed on finding the influence of surge on green water and slamming. The model was thus free to surge
but for half of the tests surge was restricted to find the difference. These experiments were conducted with the S175 model.
A publication has been made based on this dataset: "A. D. Boon and P. R. Wellens, ‘The effect of surge on extreme wave impacts
and an insight into clustering’, Journal of Ship Research, 2024.
The second dataset is focussed on finding the influence of freeboard height and draft on green water and slamming. To find the
influence, the back half of the S175 was kept but three different axe-like bows were attached to the back. For each of the three
different bow shapes, attachments were placed at the bow to test three different freeboard heights per bow. There is also a
publication based on this dataset: "A. D. Boon and P. R. Wellens, ‘How draft and freeboard affect green water: a probabilistic
analysis of a large experimental dataset’, Proceedings of the ASME 2024 43rd International Conference on Ocean, Offshore and
Arctic Engineering, June 9-14, 2024"
For further information you can reach us through our contact information:
Anna Boon (a.d.boon@tudelft.nl/adymfnaboon@gmail.com)
Peter Wellens (p.r.wellens@tudelft.nl)