Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a large-scale dataset of impedance and signal loss data recorded on volunteer test subjects using low-voltage alternating-current sine-wave signals. Signal frequencies range from 50 kHz to 20 MHz.
Applications: This dataset is intended to enable investigation of the human body as a signal propagation medium and to capture how the properties of the human body (age, sex, composition, etc.), the measurement locations, and the signal frequencies affect signal loss across the body.
Overview statistics:
Number of subjects: 30
Number of transmitter locations: 6
Number of receiver locations: 6
Number of measurement frequencies: 19
Input voltage: 1 V
Load resistance: 50 ohm and 1 megaohm
Measurement group statistics (mean, with standard deviation in parentheses):
Height: 174.10 cm (7.15)
Weight: 72.85 kg (16.26)
BMI: 23.94 (4.70)
Body fat %: 21.53 (7.55)
Age group: 29.00 (11.25)
Male/female ratio: 50%
Included files:
experiment_protocol_description.docx - protocol used in the experiments
electrode_placement_schematic.png - schematic of placement locations
electrode_placement_photo.jpg - photo of the experimental setup on a volunteer subject
RawData - the full measurement results and experiment info sheets
all_measurements.csv - the most important results extracted to .csv
all_measurements_filtered.csv - same, but after z-score filtering
all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row
all_measurements_by_freq_filtered.csv - same, but after z-score filtering
summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets
process_json_files.py - script that creates .csv from the raw data
filter_results.py - outlier removal based on z-score
plot_sample_curves.py - visualization of a randomly selected measurement result subset
plot_measurement_group.py - visualization of the measurement group
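For orientation, the z-score outlier removal used to produce the *_filtered.csv files can be sketched roughly as follows; this is an illustration only, not the dataset's filter_results.py, and the gain column name is hypothetical.
```python
# Rough sketch of z-score outlier filtering (illustrative only; not filter_results.py).
import pandas as pd

df = pd.read_csv("all_measurements_by_freq.csv")
col = "rx_gain_50"                           # hypothetical gain column name
z = (df[col] - df[col].mean()) / df[col].std()
filtered = df[z.abs() <= 3]                  # keep rows within 3 standard deviations
filtered.to_csv("all_measurements_by_freq_filtered_example.csv", index=False)
```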
CSV file columns:
subject_id - participant's random unique ID
experiment_id - measurement session's number for the participant
height - participant's height, cm
weight - participant's weight, kg
BMI - body mass index, computed from the values above
body_fat_% - body fat composition, as measured by bioimpedance scales
age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.
male - 1 if male, 0 if female
tx_point - transmitter point number
rx_point - receiver point number
distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!
tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.
rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.
total_fat_level - sum of rx and tx fat levels
bias - constant term to simplify data analytics, always equal to 1.0
CSV file columns, frequency-specific:
tx_abs_Z_... - transmitter-side impedance, as computed by the process_json_files.py script from the voltage drop
rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance
rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance
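As an illustration of how these columns can be combined, here is a minimal sketch (assuming pandas and the column naming documented above) that averages the 50-ohm receiver gain per transmitter/receiver location pair:
```python
# Minimal sketch: load the extracted measurements and summarize receiver gain
# (assumes pandas and the column naming documented above).
import pandas as pd

df = pd.read_csv("all_measurements.csv")

# Frequency-specific gain columns for the 50 ohm load (rx_gain_50_f_<frequency>).
gain_cols = [c for c in df.columns if c.startswith("rx_gain_50_f_")]

# Average gain per transmitter/receiver location pair across subjects and sessions.
mean_gain = df.groupby(["tx_point", "rx_point"])[gain_cols].mean()
print(mean_gain.head())
```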
Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.
References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.
Contact information: info@edi.lv
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
During the development of salient object detection algorithms, benchmark datasets have played a critical role. However, existing benchmark datasets commonly suffer from dataset bias, making it challenging to fully reflect the performance of different algorithms or capture the technical characteristics of certain typical applications. To address these limitations, we have undertaken two key initiatives: (1) We designed a new benchmark dataset, MTMS300 (Multiple Targets and Multiple Scales), tailored to reconnaissance and surveillance applications. This dataset contains 300 color visible-light images from land, sea, and air scenarios, featuring reduced center bias, a balanced distribution of target-to-image area ratios, diverse image sizes, and multiple targets per image. (2) We curated a new benchmark dataset, DSC (Difficult Scenes in Common), by identifying images from publicly available benchmarks that pose significant challenges (with low metric scores) for most non-deep-learning algorithms. The proposed datasets exhibit distinct characteristics, enabling more comprehensive evaluation of visual saliency algorithms. This advancement will drive the development of visual saliency algorithms toward task-specific applications.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides processed and normalized/standardized indices for the management activity 'Mergers and Acquisitions' (M&A). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding M&A dataset in the Management Tool Source Data (Raw Extracts) Dataverse.
Data Files and Processing Methodologies:
Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "mergers and acquisitions" + "mergers and acquisitions corporate". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Mergers and Acquisitions + Mergers & Acquisitions. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching M&A-related keywords [("mergers and acquisitions" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (M&A Count / Total Count). Monthly relative share series normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Mergers and Acquisitions (2006, 2008, 2010, 2012, 2014, 2017). Note: Not reported before 2006 or after 2017. Processing: Normalization: Original usability percentages normalized relative to their historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (Approx.).
Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Mergers and Acquisitions (2006-2017). Note: Not reported before 2006 or after 2017. Processing: Standardization (Z-scores): Using Z = (X - 3.0) / 0.891609. Index Scale Transformation: Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ≈ [1, 100]). Frequency: Biennial (Approx.).
File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer.
For original extraction details (specific keywords, URLs, etc.), refer to the corresponding M&A dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
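For reference, the satisfaction-score transformation stated above can be reproduced with a few lines; this is a sketch of the stated formulas only, not the project's processing scripts.
```python
# Sketch of the satisfaction standardization and index transformation stated above.
def satisfaction_index(score: float) -> float:
    """Map a 1-5 Bain satisfaction score to the standardized index (center 50)."""
    z = (score - 3.0) / 0.891609   # standardization (Z-score)
    return 50 + z * 22             # index scale transformation

# The scale endpoints land near 1 and 100, matching the stated range.
print(round(satisfaction_index(1.0), 2))  # ~0.65
print(round(satisfaction_index(5.0), 2))  # ~99.35
```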
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge Graph Construction Workshop 2024: challenge
Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics e.g. CPU or memory usage are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.
Task description
The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.
We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.
Track 1: Conformance
The set of new specifications for the RDF Mapping Language (RML), established by the W3C Community Group on Knowledge Graph Construction, provides a set of test-cases for each module:
RML-Core
RML-IO
RML-CC
RML-FNML
RML-Star
These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.
Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.
Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.
Track 2: Performance
Part 1: Knowledge Graph Construction Parameters
These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.
Data
Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
Number of input files: scaling the number of datasets (1, 5, 10, 15).
Mappings
Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)
Part 2: GTFS-Madrid-Bench
The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.
Scaling
GTFS-1 SQL
GTFS-10 SQL
GTFS-100 SQL
GTFS-1000 SQL
Heterogeneity
GTFS-100 XML + JSON
GTFS-100 CSV + XML
GTFS-100 CSV + JSON
GTFS-100 SQL + XML + JSON + CSV
Example pipeline
The ground truth dataset and baseline results are generated in different steps for each parameter:
The provided CSV files and SQL schema are loaded into a MySQL relational database.
Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format
The pipeline is executed 5 times from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. Knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool; you can adapt the execution plans for this example pipeline to your own needs.
Each parameter has its own directory in the ground truth dataset with the following files:
Input dataset as CSV.
Mapping file as RML.
Execution plan for the pipeline in metadata.json.
Datasets
Knowledge Graph Construction Parameters
The dataset consists of:
Input dataset as CSV for each parameter.
Mapping file as RML for each parameter.
Baseline results for each parameter with the example pipeline.
Ground truth dataset for each parameter generated with the example pipeline.
Format
All input datasets are provided as CSV; depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.
GTFS-Madrid-Bench
The dataset consists of:
Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.
Mapping file as RML for both scaling and heterogeneity.
SPARQL queries to retrieve the results.
Baseline results with the example pipeline.
Ground truth dataset generated with the example pipeline.
Format
CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.
Evaluation criteria
Submissions must evaluate the following metrics:
Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.
CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.
Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory usage measured during the execution of that step.
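The provided challenge tool already aggregates these metrics into CSV files; the following minimal sketch (with made-up sample values) only illustrates how the three definitions above translate into per-step computations.
```python
# Illustrative computation of the three metrics listed above from hypothetical
# per-step measurements (the challenge tool aggregates these for you).
from dataclasses import dataclass

@dataclass
class StepMeasurement:
    begin_time: float            # wall-clock seconds at step start
    end_time: float              # wall-clock seconds at step end
    begin_cpu: float             # cumulative CPU seconds at step start
    end_cpu: float               # cumulative CPU seconds at step end
    memory_samples: list[float]  # memory usage samples (MB) taken during the step

def step_metrics(m: StepMeasurement) -> dict[str, float]:
    return {
        "execution_time": m.end_time - m.begin_time,
        "cpu_time": m.end_cpu - m.begin_cpu,
        "min_memory": min(m.memory_samples),
        "max_memory": max(m.memory_samples),
    }

# Made-up sample values for one step of a pipeline.
step = StepMeasurement(0.0, 12.5, 0.0, 9.8, [512.0, 750.3, 698.1])
print(step_metrics(step))
```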
Expected output
Duplicate values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500020 triples |
50 percent | 1000020 triples |
75 percent | 500020 triples |
100 percent | 20 triples |
Empty values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500000 triples |
50 percent | 1000000 triples |
75 percent | 500000 triples |
100 percent | 0 triples |
Mappings
Scale | Number of Triples |
---|---|
1TM + 15POM | 1500000 triples |
3TM + 5POM | 1500000 triples |
5TM + 3POM | 1500000 triples |
15TM + 1POM | 1500000 triples |
Properties
Scale | Number of Triples |
---|---|
1M rows 1 column | 1000000 triples |
1M rows 10 columns | 10000000 triples |
1M rows 20 columns | 20000000 triples |
1M rows 30 columns | 30000000 triples |
Records
Scale | Number of Triples |
---|---|
10K rows 20 columns | 200000 triples |
100K rows 20 columns | 2000000 triples |
1M rows 20 columns | 20000000 triples |
10M rows 20 columns | 200000000 triples |
Joins
1-1 joins
Scale | Number of Triples |
---|---|
0 percent | 0 triples |
25 percent | 125000 triples |
50 percent | 250000 triples |
75 percent | 375000 triples |
100 percent | 500000 triples |
1-N joins
Scale | Number of Triples |
---|---|
1-10 0 percent | 0 triples |
1-10 25 percent | 125000 triples |
1-10 50 percent | 250000 triples |
1-10 75 percent | 375000 triples |
1-10 100 percent | 500000 triples |
1-5 50 percent | 250000 triples |
1-10 50 percent | 250000 triples |
1-15 50 percent | 250005 triples |
1-20 50 percent | 250000 triples |
N-1 joins
Scale | Number of Triples |
---|---|
10-1 0 percent | 0 triples |
10-1 25 percent | 125000 triples |
10-1 50 percent | 250000 triples |
10-1 75 percent | 375000 triples |
10-1 100 percent | 500000 triples |
5-1 50 percent | 250000 triples |
10-1 50 percent | 250000 triples |
15-1 50 percent | 250005 triples |
20-1 50 percent | 250000 triples |
N-M joins
Scale | Number of Triples |
---|---|
5-5 50 percent | 1374085 triples |
10-5 50 percent | 1375185 triples |
5-10 50 percent | 1375290 triples |
5-5 25 percent | 718785 triples |
5-5 50 percent | 1374085 triples |
5-5 75 percent | 1968100 triples |
5-5 100 percent | 2500000 triples |
5-10 25 percent | 719310 triples |
5-10 50 percent | 1375290 triples |
5-10 75 percent | 1967660 triples |
5-10 100 percent | 2500000 triples |
10-5 25 percent | 719370 triples |
10-5 50 percent | 1375185 triples |
10-5 75 percent | 1968235 triples |
10-5 100 percent | 2500000 triples |
GTFS Madrid Bench
Generated Knowledge Graph
Scale | Number of Triples |
---|---|
1 | 395953 triples |
10 | 3959530 triples |
100 | 39595300 triples |
1000 | 395953000 triples |
Queries
Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000 |
---|---|---|---|---|
Q1 | 58540 results | 585400 results | No results available | No results available |
Q2 | 636 results | 11998 results | 125565 results | 1261368 results |
Q3 | 421 results | 4207 results | 42067 results | 420667 results |
Q4 | 13 results | 130 results | 1300 results | 13000 results |
Q5 | 35 results | 350 results | 3500 results | 35000 results |
Q6 | 1 result | 1 result | 1 result | 1 result |
Q7 | 68 results | 67 results | 67 results | 53 results |
Q8 | 35460 results | 354600 results | No results available | No results available |
Q9 | 130 results | 1300
The study of spatial patterns in biotic compositional variability in deep time is key to understanding the macroecological response of species assemblages to global change. Globally warm climatic phases are marked by the expansion of megathermal climates into currently extra-tropical areas. However, there is currently little information on whether vegetation in these ‘paratropical’ regions resembled spatially modern tropical or extra-tropical biomes. In this paper we explore spatial heterogeneity in extra-tropical megathermal vegetation, using sporomorph (pollen and spore) data from the Late Paleocene Calvert Bluff and Tuscahoma formations of the formerly paratropical US Gulf Coast (Texas, Mississippi and Alabama). The dataset comprises 139 sporomorph taxa recorded from 56 samples. Additive diversity partitioning, non-metric multidimensional scaling, and cluster analysis show compositional heterogeneity both spatially and lithologically within the US Gulf Coastal Plain (GCP) microflora....
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides processed and normalized/standardized indices for the management tool group focused on 'Activity-Based Costing' (ABC) and 'Activity-Based Management' (ABM). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding ABC/ABM dataset in the Management Tool Source Data (Raw Extracts) Dataverse.
Data Files and Processing Methodologies:
Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "activity based costing" + "activity based management" + "activity based costing management". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Activity Based Management + Activity Based Costing. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching ABC/ABM-related keywords [("activity based costing" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (ABC/ABM Count / Total Count). Monthly relative share series normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Activity-Based Costing (1993); Activity-Based Management (1999, 2000, 2002, 2004). Note: Not reported after 2004. Processing: Semantic Grouping: Data points for "Activity-Based Costing" and "Activity-Based Management" were treated as a single conceptual series. Normalization: Combined series normalized relative to its historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (Approx.).
Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Activity-Based Costing (1993); Activity-Based Management (1999-2004). Note: Not reported after 2004. Processing: Semantic Grouping: Data points treated as a single conceptual series. Standardization (Z-scores): Using Z = (X - 3.0) / 0.891609. Index Scale Transformation: Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ≈ [1, 100]). Frequency: Biennial (Approx.).
File Naming Convention: Files generally follow the pattern: PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding ABC/ABM dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
https://spdx.org/licenses/CC0-1.0.html
Contains data and code for the manuscript 'Mean landscape-scale incidence of species in discrete habitats is patch size dependent'. Raw data consist of 202 published datasets collated from primary and secondary (e.g., government technical reports) sources. These sources summarise metacommunity structure for different taxonomic groups (birds, invertebrates, non-avian vertebrates or plants) in different types of discrete metacommunities including 'true' islands (i.e., inland, continental or oceanic archipelagos), habitat islands (e.g., ponds, wetlands, sky islands) and fragments (e.g., forest/woodland or grass/shrubland habitat remnants). The aim of the study was to test whether the size of a habitat patch influences the mean incidences of species within it, relative to the incidence of all species across the landscape. In other words, whether high-incidence (widespread) or low-incidence (narrow-range) species are found more often than expected in smaller or larger patches. To achieve this, a new standardized effect size metric was developed that quantifies the mean observed incidence of all species present in every patch (the geometric mean of the number of patches in which all species were observed) and compares this with an expectation based on re-sampling the incidences of all species in all patches. Meta-regression of the 202 datasets was used to test the relationship between this metric, the 'mean species landscape-scale incidences per patch' (MSLIP), and the size of habitat patches, and for differences in response among metacommunity types and taxonomic groups.
Methods: Details regarding keyword and other search strategies used to collate the raw database from published sources were presented in Deane, D. C. & He, F. (2018) Loss of only the smallest patches will reduce species diversity in most discrete habitat networks. Glob Chang Biol, 24, 5802-5814 and in Deane, D.C. (2022) Species accumulation in small-large vs large-small order: more species but not all species? Oecologia, 200, 273-284. Minimum data requirements were presence-absence records for all species in all patches and the area of each habitat patch. The database consists of 202 published datasets. The first column in each dataset is the area of the patch in question (in hectares); other columns record presence and absence of each species in each patch. In the study, a metric was calculated for every patch that quantifies how the incidence of species in each patch compares with an expectation derived from the occupancy of all species in all patches (called mean species landscape-scale incidences per patch or MSLIP). This value was regressed on patch size and other covariates to determine whether the representation of widespread (or narrowly distributed) species changes with patch size. In summary, the workflow proceeded in three steps.
1. Pre-processing. This stage consisted of calculating a standardized effect size (SES) for the MSLIP metric for every patch and extracting important covariates (taxon, patch type, total number of patches, total number of species, patch-level deviations from fitted island species area relationships, data quality) to be used in model building.
2. Model building. MSLIP SES was then modelled against patch area and other covariates using a multilevel Bayesian (meta-)regression model using Stan and brms in the statistical programming language R (Version 4.3.0).
3. Model analysis.
The final model was analysed by running different scenarios; the patterns were interpreted in light of the hypotheses under test, and figures were created to illustrate them.
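A rough sketch of the per-patch calculation described above is shown below; it is illustrative only, and the published method and code may differ in detail (e.g. in the exact resampling scheme).
```python
# Rough sketch (not the authors' code) of a per-patch MSLIP-style standardized
# effect size: the geometric mean landscape-scale incidence of the species
# observed in a focal patch, compared against a simple resampling expectation.
import numpy as np

def mslip_ses(presence: np.ndarray, patch: int, n_rand: int = 999, seed: int = 1) -> float:
    """presence: patches x species matrix of 0/1 records; patch: focal patch row index."""
    rng = np.random.default_rng(seed)
    incidence = presence.sum(axis=0)                  # patches occupied by each species
    present = presence[patch].astype(bool)
    k = int(present.sum())                            # species richness of the focal patch
    obs = np.exp(np.log(incidence[present]).mean())   # geometric mean incidence, observed

    # Null expectation: geometric mean incidence of k species drawn at random.
    null = np.empty(n_rand)
    for i in range(n_rand):
        draw = rng.choice(presence.shape[1], size=k, replace=False)
        null[i] = np.exp(np.log(incidence[draw]).mean())
    return float((obs - null.mean()) / null.std(ddof=1))
```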
Late Payment Index (LPI) helps organisations evaluate prospects and current business partners and their potential risk of future non-payment. Helping businesses to make strategic decisions ahead of time.
The Late Payment Index is a comprehensive metric developed by Coface to evaluate payment behaviour of companies globally. It reflects the age, frequency, severity of claims, and other vital factors to assess payment difficulties accurately.
Dataset Structure and Components:
Sample Size: 15 company records with unique identifiers
Assessment Date: All entries dated
Product Type: All categorized as "LPI" (Late Payment Index)
Evaluation Format: Numeric rating system (0-4) with corresponding descriptive explanations
Rating Classification System: The dataset employs a standardized 0-4 scale to categorize payment behavior:
0: Not available - Assessment data unavailable or insufficient
1: In progress - Assessment currently being conducted
2: Considerable negative experience - Significant payment issues detected
3: Some negative experience - Minor or occasional payment issues detected
4: No negative experience - Clean payment history with no detected issues
Application Context: This sample illustrates how payment behavior can be systematically tracked and categorized to support credit decision-making and business relationship management. The LPI provides an objective metric for evaluating the payment reliability of potential business partners or customers.
This structured assessment system allows organizations to:
Identify potential payment risks before entering business relationships
Support credit limit decisions with objective payment history data
Monitor changing payment behaviors across their business portfolio
Create consistent payment evaluation standards across departments
Note: This is sample data intended to demonstrate the structure and capabilities of a payment index system.
Learn More For a complete demonstration of our Late Payment Index capabilities or to discuss how our system can be integrated with your existing processes, please visit https://business-information.coface.com/what-is-urba360 to request additional information.
https://spdx.org/licenses/CC0-1.0.html
Taphonomic factors may significantly alter faunal assemblages at varying scales. An exceptional record of late Holocene (< 4000 years old) mammal faunas establishes a firm baseline to investigate the effects of scale on taphonomy. Our sample contains 73 sites within four contiguous states (North Dakota, South Dakota, Iowa, and Illinois, USA) that transect a strong modern and late Holocene environmental gradient, the prairie-forest ecotone. We performed Detrended Correspondence Analysis (DCA) and Non-metric Multidimensional Scaling (NMDS) analyses. Both DCA and NMDS analyses of the datasets produced virtually the same results, and both failed to reveal the known ecological gradient within each state. However, both DCA and NMDS analyses of the unfiltered multistate dataset across the entire gradient clearly reflect an environmental, rather than taphonomic, signal. DCA tended to provide better separation of some clusters than did NMDS in most of the analyses. We conclude that a large mammal dataset collected across a strong environmental gradient will document species turnover without the removal of taphonomic factors. In other words, taphonomy exhibits varying scale-dependent effects.
Reference level assumptions used to calculate the Atlantic meridional overturning circulation transports at the RAPID and MOVE observing arrays are revisited in an eddying ocean model. Observational transport calculation methods are complemented by several alternative approaches. At RAPID, the model transports from the observational method and the model truth (based on the actual model velocities) agree well in their mean and variability. There are substantial differences among the transport estimates obtained with various methods at the MOVE site. These differences result from relatively large and time-varying reference velocities at depth in the model, not supporting a level-of-no-motion assumption. The methods that account for these reference velocities properly at MOVE produce transports that are in good agreement with the model truth. In contrast with the observational estimates, the model transport trends at MOVE and RAPID largely agree with each other on pentadal to multi-decadal time scales. The datasets listed here are output fields from the Meridional ovErTurning ciRculation diagnostIC (METRIC) package, which enables consistent calculations of AMOC estimates at the MOVE and RAPID sections from observations and models. The METRIC package is available on GitHub at https://github.com/NCAR/metric. Citation for the code: Castruccio F. S., 2021: NCAR/metric: metric v0.1. doi:10.5281/zenodo.4708277. Citation for the method: Danabasoglu et al. (2021). Revisiting AMOC Transport Estimates from Observations and Models. Geophysical Research Letters.
https://spdx.org/licenses/CC0-1.0.html
Policies requiring biodiversity no net loss or net gain as an outcome of environmental planning have become more prominent worldwide, catalysing interest in biodiversity offsetting as a mechanism to compensate for development impacts on nature. Offsets rely on credible and evidence-based methods to quantify biodiversity losses and gains. Following the introduction of the United Kingdom’s Environment Act in November 2021, all new developments requiring planning permission in England are expected to demonstrate a 10% biodiversity net gain from 2024, calculated using the statutory biodiversity metric framework (Defra, 2023). The metric is used to calculate both baseline and proposed post-development biodiversity units, and is set to play an increasingly prominent role in nature conservation nationwide. The metric has so far received limited scientific scrutiny. This dataset comprises a database of statutory biodiversity metric unit values for terrestrial habitat samples across England. For each habitat sample, we present biodiversity units alongside five long-established single-attribute proxies for biodiversity (species richness, individual abundance, number of threatened species, mean species range or population, mean species range or population change). Data were compiled for species from three taxa (vascular plants, butterflies, birds), from sites across England. The dataset includes 24 sites within grassland, wetland, woodland and forest, sparsely vegetated land, cropland, heathland and shrub, i.e. all terrestrial broad habitats except urban and individual trees. Species data were reused from long-term ecological change monitoring datasets (mostly in the public domain), whilst biodiversity units were calculated following field visits. Fieldwork was carried out in April-October 2022 to calculate biodiversity units for the samples. Sites were initially assessed using metric version 3.1, which was current at the time of survey, and were subsequently updated to the statutory metric for analysis using field notes and species data. Species data were derived from 24 long-term ecological change monitoring sites across the Environmental Change Network (ECN), Long Term Monitoring Network (LTMN) and Ecological Continuity Trust (ECT), collected between 2010 and 2020. Methods Study sites We studied 24 sites across the Environmental Change Network (ECN), Long Term Monitoring Network (LTMN) and Ecological Continuity Trust (ECT). Biodiversity units were calculated following field visits by the authors, whilst species data (response variables) were derived from long-term ecological change monitoring datasets collected by the sites and mostly held in the public domain (Table S1). We used all seven ECN sites in England. We selected a complementary 13 LTMN sites to give good geographic and habitat representation across England. We included four datasets from sites supported by the ECT where 2 x 2m vascular plant quadrat data were available for reuse. The 24 sites included samples from all terrestrial broad habitats (sensu Defra 2023) in England, except urban and individual trees: grassland (8), wetland (6), woodland and forest (5), sparsely vegetated land (2), cropland (2), heathland and shrub (1). Non-terrestrial broad habitats (rivers and lakes, marine inlets and transitional waters) were excluded. Our samples ranged in biodiversity unit scores from 2 to 24, the full range of the metric. 
Not all 24 sites had long-term datasets from all taxa: 23 had vascular plant data, 8 had bird data, and 13 had butterfly data. We chose these three taxa as they are the most comprehensively surveyed taxa in England’s long-term biological datasets. Together they represent a taxonomically broad, although by no means representative, sample of English nature. Biodiversity unit calculation Baseline biodiversity units were attributed to each vegetation quadrat using the statutory biodiversity metric (Defra, 2023) (Equation 1). Sites were visited by the authors between April and October 2022, i.e. within the optimal survey period indicated in the metric guidance. Sites were assessed initially using metric version 3.1 (Panks et al., 2022), which was current at the time of survey, and were subsequently updated to the statutory metric for analysis using field notes and species data.. Following the biodiversity metric guidance, we calculated biodiversity units at the habitat parcel scale, such that polygons with consistent habitat type and condition are the unit of assessment. We assigned habitat type and condition score to all quadrats falling within the parcel. Where the current site conditions (2022) and quadrat data (2010 to 2020) differed from each other in habitat or condition, e.g. the % bracken cover, we deferred to the quadrat data in order to match our response and explanatory variables more fairly. Across all samples, area was set to 1 ha arbitrarily, and strategic significance set to 1 (no strategic significance), to allow comparison between sites. To assign biodiversity units to the bird and butterfly transects, we averaged the biodiversity units of plant quadrats within the transect routes plus a buffer of 500 m (birds) or 100 m (butterflies). Quadrats were positioned to represent the habitats present at each site proportionally, and transect routes were also positioned to represent the habitats present across each site. Although units have been calculated as precisely as possible for all taxa, we recognize that biodiversity units are calculated more precisely for the plant dataset than the bird and butterfly dataset: the size of transect buffer is subjective, and some transects run adjacent to offsite habitat that could not be accessed. Further detail about biodiversity unit calculation can be found in the Supporting Information. Equation 1. Biodiversity unit calculation following the statutory biodiversity metric (Defra, 2023) Size of habitat parcel × Distinctiveness × Condition × Strategic Significance = Biodiversity Units Species response variable calculation We reused species datasets for plants, birds and butterflies recorded by the sites to calculate our response variables (Table S1). Plant species presence data were recorded using 2 x 2m quadrats of all vascular plant species at approximately 50 sample locations per site (mean 48.1, sd 3.7), stratified to represent all habitat types on site. If the quadrat fell within woodland or scrub, trees and shrubs rooted within a 10 x 10 m plot centred on the quadrat were also counted and added to the quadrat species records, with any duplicate species records removed. We treated each quadrat as a sample point, and the most recent census year was analysed (ranging between 2011-2021). Bird data were collected annually using the Breeding Birds Survey method of the British Trust for Ornithology: two approximately parallel 1 km long transects were routed through representative habitat on each site. 
The five most recent census years were analysed (all fell between 2006-2019), treating each year as a sample point (Bateman et al., 2013). Butterfly data were collected annually using the Pollard Walk method of the UK Butterfly Monitoring Scheme: a fixed transect route taking 30 to 90 minutes to walk (c. 1-2 km) was established through representative habitat on each site. The five most recent census years were analysed (all fell between 2006-2019), treating each year as a sample point. Full detail of how these datasets were originally collected in the field can be found in Supporting Information. For species richness estimates we omitted any records with vague taxon names not resolved to species level. Subspecies records were put back to the species level, as infraspecific taxa were recorded inconsistently across sites. Species synonyms were standardised across all sites prior to analysis. For bird abundance we used the maximum count of individuals recorded per site per year for each species as per the standard approach (Bateman et al., 2013). For butterfly abundance we used sum abundance over 26 weekly visits each year for each species at each site, using a GAM to interpolate missing weekly values (Dennis et al., 2013). Designated taxa were identified using the Great Britain Red List data held by JNCC (2022); species with any Red List designation other than Data Deficient or Least Concern were summed. Plant species range and range change index data followed PLANTATT (Hill et al., 2004). Range was measured as the number of 10x10 km cells across Great Britain that a species is found in. The change index measures the relative magnitude of range size change in standardised residuals, comparing 1930-1960 with 1987-1999. For birds, species mean population size across Great Britain followed Musgrove et al., 2013. We used the breeding season population size estimates to match field surveys. Bird long-term population percentage change (generally 1970-2014) followed Defra (2017). For butterflies, range and change data followed Fox et al., 2015. Range data was occupancy of UK 10 km squares 2010-2014. Change was percent abundance change 1976-2014. For all taxa, mean range and mean change were averaged from all the species present in the sample, not weighted by the species’ abundance in the sample. · Bateman, I. J., Harwood, A. R., Mace, G. M., Watson, R. T., Abson, D. J., Andrews, B., et al. (2013). Bringing ecosystem services into economic decision-making: Land use in the United Kingdom. Science (80-. ). 341, 45–50. doi: 10.1126/science.1234379. · British Trust for Ornithology (BTO), 2022. Breeding Bird methodology and survey design. Available online at https://www.bto.org/our-science/projects/breeding-bird-survey/research-conservation/methodology-and-survey-design · Defra, 2023. Statutory biodiversity metric tools and guides. https://www.gov.uk/government/publications/statutory-biodiversity-metric-tools-and-guides. · Dennis, E. B., Freeman, S. N., Brereton, T., and
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
One of the most obvious difficulties in comparing the influence of landscape on crop production across studies is the choice of landscape metric. There exist countless metrics of landscape composition—the categories of land cover found on a landscape—and landscape configuration—the spatial organization of these categories. Common landscape composition metrics include measures of diversity—such as the Shannon Diversity Index or the Simpson Diversity Index—and measures of land cover composition—such as the percent of the landscape classified as natural cover. Common landscape configuration metrics include measures of patch size (contiguous areas of the same land cover) and mixing as well as edge length (linear length of patch boundaries/perimeter) and fragmentation. Even just considering diversity metrics, numerous options to select from can be found in the literature. Each one of these metrics has its own particularities in terms of sensitivity to scale, rare categories, and boundaries that can significantly alter the conclusions of studies examining the relationship between landscape characteristics and crop production. To address this challenge, we assess the sensitivity of our model results to a number of indicators of landscape composition and configuration using the USDA NASS Cropland Data Layer (CDL) as our indicator of land cover. This dataset classifies land cover at a 30-meter resolution nationwide from 2008 to present using satellite imagery and extensive ground truth data. While the 30-meter spatial resolution of this land cover data cannot accurately represent very small or narrow patches of land cover including shelterbelts and wildflower strips, given its relatively high resolution, full coverage, and historical availability, it is the best data for understanding land cover across agricultural landscapes in the U.S. We extract landscape indices from the CDL data using the landscapemetrics package in R, which considers all land cover in each county's bounding box with the exception of open water and null categories. We measure compositional complexity using a set of six common landscape metrics associated with the number or the predominance of land cover categories across a landscape. Five of these metrics—Shannon Diversity Index, Simpson Diversity Index, Richness, Shannon Evenness Index, and Simpson Evenness Index—can be considered measures of land cover diversity. The sixth metric, Percent Natural Cover, is a simple measure of the predominance of undeveloped and uncultivated land cover classes (such as wetlands, grasslands, and forests) on a landscape. All of the compositional complexity metrics are aspatial, in that their calculation is not contingent on how land cover categories are arranged within the landscape. Configurational complexity is measured using four landscape metrics associated with the size of land cover patches (continuous areas of a single land cover category), shape of land cover patches, or mixing of land cover categories across the landscape. The metrics Mean Patch Area and Largest Patch Index are most strongly associated with patch size, the Contagion metric is a measure of land cover category mixing and strongly related to patch size, and the Edge Density metric is related to patch size and shape. Unlike the landscape composition metrics, the four landscape configuration metrics are spatially explicit and depend on the arrangement of land cover categories across the landscape.
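For illustration only (the study itself computed its indices with the landscapemetrics package in R), here is a minimal sketch of two of the aspatial composition metrics named above, applied to a toy land cover grid:
```python
# Illustrative sketch of two aspatial composition metrics named above
# (not the study's code, which used the landscapemetrics R package).
import numpy as np

def shannon_diversity(landcover: np.ndarray) -> float:
    """Shannon Diversity Index from a 2-D array of land cover class codes."""
    _, counts = np.unique(landcover, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def shannon_evenness(landcover: np.ndarray) -> float:
    """Shannon Evenness Index: SHDI divided by its theoretical maximum ln(richness)."""
    richness = len(np.unique(landcover))
    return shannon_diversity(landcover) / np.log(richness) if richness > 1 else 0.0

# Example on a toy 4x4 landscape with three land cover classes.
toy = np.array([[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 2, 2], [3, 3, 2, 2]])
print(shannon_diversity(toy), shannon_evenness(toy))
```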
All code used to build data can be found here: https://github.com/katesnelson/aglandscapes-what-or-how Resources in this dataset:
Resource Title: County-level Estimates of Landscape Complexity and Configuration in the Coterminous US
File Name: landscape_panel.txt
Resource Description:
GEOID: State and county FIPS codes in format SSCCC
YEAR: Year in which CDL data was collected
VALUE: Index value
INDEX_NAME: Indices with _AG were computed for the subset of agricultural lands in a county. Indices with _ALL were computed for the entire landscape (agricultural and nonagricultural lands) in a county.
LSM_AREA_MN_AG/ALL: Mean patch area, a measure of patch structure. Approaches 0 if all patches are small. Increases, without limit, as the patch areas increase. Higher values generally indicate lower complexity.
LSM_CONTAG_AG/ALL: Contagion, a measure of dispersion and interspersion of land cover classes where a high proportion of like adjacencies and an uneven distribution of pairwise adjacencies produces a high contagion value. Range of 0 to 100. Higher values generally indicate lower complexity.
LSM_ED_AG/ALL: Edge density, a measure of the patchiness of the landscape. Equals 0 if only one land cover is present and increases without limit as more land cover patches are added. Higher values generally indicate higher complexity.
LSM_LPI_AG/ALL: Largest patch index, a measure of patch dominance representing the percentage of the landscape covered by the single largest patch. Approaches 0 when the largest patch is becoming small and equals 100 when only one patch is present. Higher values generally indicate lower complexity.
LSM_RICH_AG/ALL: Richness, a measure of the abundance of categories. Higher values generally indicate higher complexity.
LSM_SHDI_AG/ALL: Shannon Diversity Index, a measure of the abundance and evenness of land cover categories. This index is sensitive to rare land cover categories. Typical values are between 1.5 and 3. Higher values indicate higher complexity.
LSM_SHEI_ALL: Shannon Evenness Index, a measure of diversity or dominance calculated as the ratio between the Shannon Diversity Index and the theoretical maximum of the Shannon Diversity Index. Equals 0 when there is only one land cover on the landscape and equals 1 when all land cover classes are equally distributed. Higher values generally indicate higher complexity.
LSM_SIDI_ALL: Simpson Diversity Index, a diversity measure that considers the abundance and evenness of land cover categories. This index is not sensitive to rare land cover categories. Values range from 0 to 1. Higher values generally indicate higher complexity.
MODE_AG: Most dominant agricultural land use type found in the data (mode of agricultural CDL categories)
MODE_ALL: Most dominant land use type found in the data (mode of all land use categories)
PNC: Percent natural cover
Resource Title: Technical Validation File Name: technical_validation.txt
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
North Polar Layered Deposit spiral trough geomorphology metric data (2023Trough_AllData.xlsx) This dataset contains the trough metric data calculated from the 3,192 trough profiles analyzed in this study as an Excel file. The dataset is split into tabs: the All_Data tab, which contains the metric values calculated for all trough profiles, and regional tabs, based on the regions identified in Smith & Holt (2015), and labeled R1-7a (excluding R0, 6, and 7b, as they do not contain troughs that we mapped). Each profile is identified by a unique number ranging between 1-3,192, labeled as Trough_Profile in all tabs of the dataset. The All_Data tab contains location data for each trough profile analyzed in the study, recorded as the center latitude of the profile, labeled CENTROID_X, and the center longitude of the profile, labeled CENTROID_Y. The values of each trough metric for all trough profiles are also recorded in this tab, including equator-facing trough wall slope (labeled as EQ_Slope), pole-facing trough wall slope (labeled as Pole_Slope), equator-facing trough wall relief (labeled as EQ_Relief), pole-facing trough wall relief (labeled as Pole_Relief), the difference between the two relief values (labeled as Relief_Diff), trough width (labeled as Width), and trough depth (labeled as Depth). Each regional tab contains information on which trough each trough profile is connected to, labeled as Trough_Number, where profiles from the same trough have matching number values. The values of each trough metric for trough profiles present in the region identified by the tab's title are also recorded in this tab, including equator-facing trough wall slope (labeled as EQ_Slope), pole-facing trough wall slope (labeled as Pole_Slope), equator-facing trough wall relief (labeled as EQ_Relief), pole-facing trough wall relief (labeled as Pole_Relief), the difference between the two relief values (labeled as Relief_Diff), trough width (labeled as Width), and trough depth (labeled as Depth).
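For convenience, a minimal sketch for loading the metric data described above (assuming pandas with openpyxl installed; tab and column names as documented):
```python
# Minimal sketch for loading the trough metric data described above
# (assumes pandas with openpyxl; tab and column names as documented).
import pandas as pd

all_data = pd.read_excel("2023Trough_AllData.xlsx", sheet_name="All_Data")

# e.g. summarize equator-facing vs. pole-facing wall slopes and the relief difference.
print(all_data[["Trough_Profile", "EQ_Slope", "Pole_Slope", "Relief_Diff"]].describe())
```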
North Polar Layered Deposit spiral trough profiles (trough_profiles.zip) This dataset contains the trough profile figures of the 3,192 trough profiles analyzed in this study as .PNG images. Each figure is titled with a unique number ranging between 1-3,192, identifying which profile the figure corresponds to (labeled as Trough Profile Profile_XXXX.txt, where the XXXX is the unique profile number). The x-axis of each figure is the extent of the trough profile in meters (labeled Extent (m)) and the y-axis of each figure is the elevation of the trough profile in meters (labeled Elevation (m)). The legend of each figure labels the original trough profile extracted MOLA data as a blue line, the polynomial curve fit to the original data as an orange line, the calculated minimum point on the profile as a green dot, the calculated left shoulder point on the profile as a red dot, and the calculated right shoulder point on the profile as a purple dot.
A cloud atlas of the Martian North Polar Layered Deposits using THEMIS VIS imagery from Mars years 26-35 (All_VIS_Images.xlsx) This dataset contains our updated cloud atlas for the 13,857 THEMIS VIS images analyzed in this study as an Excel file. Each image is identified by its product ID (V########), labeled as file_ID in the dataset, which is used to request that specific image from the Planetary Data System (PDS). Each image also has its corresponding Mars year, ranging from a value of 25-35 and labeled as mars_year in the dataset. Each image also has its solar longitude (Ls), which is the Mars-Sun angle measured from the Northern Hemisphere of Mars where the northern spring equinox is Ls=0, the northern summer solstice is Ls=90, the northern autumn equinox is Ls=180, and the northern winter solstice is Ls=270. The solar longitude is labeled as solar_long in the dataset. The location data for each image is also recorded in the dataset as the center latitude of the image, labeled center_lat, and the center longitude of the image, labeled center_long. The classification done by this study is also recorded in the dataset. This includes which NPLD region, based on the regions identified in Smith & Holt (2015), ranging from 0-7b, based on which region most of the image lay in and labeled NPLD Region. It also includes the image noise ranking (labeled Image Noise Rank in the dataset), which was used to quantify image quality, where we assigned a ranking of 0, 1, or 2 to each image, gauging whether the image was visually clear enough to distinguish clouds. This ranking was based primarily on whether surface features could be visibly distinguished in the image (e.g., pitting, craters, trough wall layers, trough edges, striations, etc.). A ranking of 0 meant the surface features were clear and high-resolution; 1 meant surface features were visible but less resolved in some way (e.g., slightly blurred, washed out, blocked by some visual artifacts, etc.); 2 meant the surface features were not at all distinct or blocked out by large visual artifacts, and the image was classified as too noisy to credibly decide if clouds were or were not present. Images were then classified as either having cloud presence or absence, on a yes/no scale (labeled Clouds? (y/n?/n) in the dataset). To state "yes", the cloud's edge must be distinct from the NPLD surface, so as not to confuse the cloud with other surface features. If there was doubt whether the feature is a cloud (e.g., due to a soft or no cloud boundary with the surface and/or image defects on top of the potential cloud), that image is classified as "no?" in this analysis, indicating there may be clouds present but we did not feel confident enough in the image quality to state "yes". If clouds were identified, they were classified into three broad categories in the following section, labeled Cloud type, if visible: trough-parallel clouds (similar to the low-altitude clouds with an elongated structure located parallel to the NPLD troughs identified in Smith et al. (2013)), wispy clouds, and general cloudiness. When visible, other related cloud features were noted, such as the presence of undulations, or linear cloud structures. Images classified as "no?" had their possible cloud type identified as well, though the noise complicating the image was also noted.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge graph construction of heterogeneous data has seen a lot of uptake
in the last decade from compliance to performance optimizations with respect
to execution time. Besides execution time as a metric for comparing knowledge
graph construction, other metrics e.g. CPU or memory usage are not considered.
This challenge aims at benchmarking systems to find which RDF graph
construction system optimizes for metrics e.g. execution time, CPU,
memory usage, or a combination of these metrics.
Task description
The task is to reduce and report the execution time and computing resources
(CPU and memory usage) for the parameters listed in this challenge, compared
to the state-of-the-art of the existing tools and the baseline results provided
by this challenge. This challenge is not limited to execution times to create
the fastest pipeline, but also computing resources to achieve the most efficient
pipeline.
We provide a tool which can execute such pipelines end-to-end. The tool also
collects and aggregates the metrics necessary for this challenge, such as
execution time, CPU and memory usage, as CSV files. Moreover, information
about the hardware used during the execution of the pipeline is recorded as
well, to allow fair comparison of different pipelines. Your pipeline should
consist of Docker images which can be executed on Linux to run with the tool.
The tool has already been tested with existing systems, relational databases
(e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and
OpenLink Virtuoso), which can be combined in any configuration. It is strongly
encouraged to use this tool to participate in this challenge. If you prefer to
use a different tool, or our tool imposes technical requirements you cannot
meet, please contact us directly.
Part 1: Knowledge Graph Construction Parameters
These parameters are evaluated using synthetically generated data to gain more
insight into their influence on the pipeline.
Data
Mappings
Part 2: GTFS-Madrid-Bench
The GTFS-Madrid-Bench provides insight into the pipeline using real data from
the public transport domain in Madrid.
Scaling
Heterogeneity
Example pipeline
The ground truth dataset and baseline results are generated in several steps
for each parameter:
The pipeline is executed 5 times, and the median execution time of each step
is calculated and reported. Each step's run with the median execution time is
then reported in the baseline results with all its measured metrics.
The query timeout is set to 1 hour and the knowledge graph construction timeout
to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool;
you can adapt the execution plans for this example pipeline to your own needs.
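As a rough illustration of the median-selection step described above (not the challenge tool itself), the aggregated metrics CSV could be post-processed as follows; the file name and the column names (step, execution_time, cpu, memory) are assumptions for illustration only.

```python
# Rough illustration (not the challenge tool): for each pipeline step, pick
# the run whose execution time is the median of the 5 runs and report that
# run's metrics. File name and column names are assumed for illustration.
import pandas as pd

metrics = pd.read_csv("aggregated_metrics.csv")

baseline_rows = []
for step, runs in metrics.groupby("step"):
    ordered = runs.sort_values("execution_time").reset_index(drop=True)
    baseline_rows.append(ordered.iloc[len(ordered) // 2])  # median of 5 runs

baseline = pd.DataFrame(baseline_rows)
print(baseline[["step", "execution_time", "cpu", "memory"]])
```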
Each parameter has its own directory in the ground truth dataset with the
following files:
metadata.json
Datasets
Knowledge Graph Construction Parameters
The dataset consists of:
Format
All input datasets are provided as CSV; depending on the parameter that is
being evaluated, the number of rows and columns may differ. The first row is
always the header of the CSV.
GTFS-Madrid-Bench
The dataset consists of:
Format
CSV datasets always have a header as their first row.
JSON and XML datasets have their own schema.
Evaluation criteria
Submissions must evaluate the following metrics:
Expected output
Duplicate values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500020 triples |
50 percent | 1000020 triples |
75 percent | 500020 triples |
100 percent | 20 triples |
Empty values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500000 triples |
50 percent | 1000000 triples |
75 percent | 500000 triples |
100 percent | 0 triples |
Mappings
Scale | Number of Triples |
---|---|
1TM + 15POM | 1500000 triples |
3TM + 5POM | 1500000 triples |
5TM + 3POM | 1500000 triples |
15TM + 1POM | 1500000 triples |
Properties
Scale | Number of Triples |
---|---|
1M rows 1 column | 1000000 triples |
1M rows 10 columns | 10000000 triples |
1M rows 20 columns | 20000000 triples |
1M rows 30 columns | 30000000 triples |
Records
Scale | Number of Triples |
---|---|
10K rows 20 columns | 200000 triples |
100K rows 20 columns | 2000000 triples |
1M rows 20 columns | 20000000 triples |
10M rows 20 columns | 200000000 triples |
Joins
1-1 joins
Scale | Number of Triples |
---|---|
0 percent | 0 triples |
25 percent | 125000 triples |
50 percent | 250000 triples |
75 percent | 375000 triples |
100 percent | 500000 triples |
1-N joins
Scale | Number of Triples |
---|---|
1-10 0 percent | 0 triples |
1-10 25 percent | 125000 triples |
1-10 50 percent | 250000 triples |
1-10 75 percent | 375000 triples |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
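For orientation, the following is a minimal sketch of the pipeline described above, not the accompanying breast_cancer_classification_models.py script; the input file name (data.csv) and the id/diagnosis column names are assumptions based on the common Kaggle distribution of the WDBC dataset.

```python
# Minimal sketch of the described pipeline (not the accompanying
# breast_cancer_classification_models.py script): z-score standardization,
# label encoding, stratified 80/20 split, and four lightweight classifiers.
# The file name "data.csv" and the "id"/"diagnosis" column names are
# assumptions based on the common Kaggle distribution of the WDBC dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("data.csv").drop(columns=["id"])
df = df.dropna(axis=1, how="all")                  # drop any fully empty trailing column
y = (df["diagnosis"] == "M").astype(int)           # 1 = malignant, 0 = benign
X = StandardScaler().fit_transform(df.drop(columns=["diagnosis"]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name, accuracy_score(y_test, pred), f1_score(y_test, pred))
```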
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, in terms of immediate performance and learning retention, as well as the students' emotional response. This research artifact provides all the material required to replicate the study (except for the proprietary questionnaires administered to assess the emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.
Dataset
The artifact contains the resources described below.
Experiment resources
The resources needed for replicating the experiment, namely in directory experiment:
alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was provided in Portuguese due to the population of the experiment.
alloy_sheet_en.pdf: an English translation of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment.
docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.
api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.
Experiment data
The task database used in our application of the experiment, namely in directory data/experiment:
Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.
identifiers.txt: the list of all 104 available participant identifiers for the experiment.
Collected data
Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:
data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).
data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:
participant identification: participant's unique identifier (ID);
socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20; NA denotes preference not to disclose).
data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);
detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.
data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID);
user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).
participants.txt: the list of participant identifiers that have registered for the experiment.
Analysis scripts
The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:
analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.
requirements.r: An R script to install the required libraries for the analysis script.
normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.
normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.
Dockerfile: Docker script to automate running the analysis script on the collected data.
Setup
To replicate the experiment and the analysis of the results, only Docker is required.
If you wish to manually replicate the experiment and collect your own data, you'll need to install:
A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.
If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:
Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.
R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.
Usage
Experiment replication
This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.
To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.
cd experiment
docker-compose up
This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.
In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task becomes available after a correct submission to the current task or when a time-out occurs (5 mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:
Group N (no hints): http://localhost:3000/0CAN
Group L (error locations): http://localhost:3000/CA0L
Group E (counter-example): http://localhost:3000/350E
Group D (error description): http://localhost:3000/27AD
In the 2nd session, as in the 1st session, each permalink gave access to 12 sequential tasks, and the next task became available after a correct submission or a time-out (5 mins). The permalink is constructed by prepending P- to the participant's identifier, so participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
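For convenience, the identifier-to-permalink scheme described above can be expressed as a small helper; this is illustrative only and not part of the artifact.

```python
# Illustrative helper (not part of the artifact): derive the treatment group
# and the session permalinks from a participant identifier, following the
# scheme described above (group = last character; session 2 prepends "P-").
BASE_URL = "http://localhost:3000"
GROUPS = {"N": "no hints", "L": "error locations",
          "E": "counter-example", "D": "error description"}

def participant_links(identifier: str) -> dict:
    group = identifier[-1]
    if group not in GROUPS:
        raise ValueError(f"unexpected group character: {group}")
    return {
        "group": GROUPS[group],
        "session_1": f"{BASE_URL}/{identifier}",
        "session_2": f"{BASE_URL}/P-{identifier}",
    }

print(participant_links("0CAN"))
```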
Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.
Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.
After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user's unique identifier and answers to the standard 4 questions on a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric (a score ranging from 0 to 100) from the answers, please see the original paper:
Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.
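For convenience, one commonly described UMUX scoring scheme is sketched below; it is provided as an assumption and should be verified against Finstad (2010) before use.

```python
# Sketch of one commonly described UMUX scoring scheme (verify against
# Finstad, 2010, before use): items 1 and 3 are positively worded, items 2
# and 4 negatively worded; each 7-point answer contributes 0-6 points, and
# the sum is rescaled to a 0-100 score.
def umux_score(q1: int, q2: int, q3: int, q4: int) -> float:
    odd = (q1 - 1) + (q3 - 1)      # positively worded items
    even = (7 - q2) + (7 - q4)     # negatively worded items
    return (odd + even) / 24 * 100

print(umux_score(6, 2, 7, 1))  # e.g. a very positive response -> ~91.7
```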
Analysis of other applications of the experiment
This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.
The analysis script expects data in 4 CSV files,
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides processed and normalized/standardized indices for the management tool 'Customer Segmentation', including the closely related concept of Market Segmentation. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Customer Segmentation dataset in the Management Tool Source Data (Raw Extracts) Dataverse.
Data Files and Processing Methodologies:
Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "customer segmentation" + "market segmentation" + "customer segmentation marketing". Processing: None. Utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Customer Segmentation + Market Segmentation. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching Customer Segmentation-related keywords [("customer segmentation" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (Segmentation Count / Total Count). Monthly relative share series normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Customer Segmentation (1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Note: Not reported in 2022 survey data. Processing: Original usability percentages normalized relative to the historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (Approx.).
Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Customer Segmentation (1999-2017). Note: Not reported in 2022 survey data. Processing: Standardization (Z-scores) using Z = (X - 3.0) / 0.891609, followed by the index scale transformation Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range approx. [1, 100]). Frequency: Biennial (Approx.).
File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer.
For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Customer Segmentation dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
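The BS_ (satisfaction) transformation stated above can be written directly as a small function; the example input value is illustrative only.

```python
# Illustration of the stated BS_ (satisfaction) transformation:
# Z = (X - 3.0) / 0.891609, then Index = 50 + (Z * 22).
def satisfaction_index(avg_score: float) -> float:
    z = (avg_score - 3.0) / 0.891609
    return 50 + z * 22

# Example: an average satisfaction score of 3.9 on the 1-5 scale.
print(round(satisfaction_index(3.9), 1))  # ~72.2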
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The WIDER FACE dataset is a face detection benchmark dataset whose images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion, as depicted in the sample images. The WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% of the data as training, validation and testing sets. We adopt the same evaluation metric employed in the PASCAL VOC dataset. Similar to the MALF and Caltech datasets, we do not release bounding box ground truth for the test images. Users are required to submit final prediction files, which we shall proceed to evaluate.
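As a rough sketch (not the official WIDER FACE tooling), a per-event-class 40%/10%/50% split could be implemented as follows:

```python
# Illustrative sketch (not the official WIDER FACE tooling): split image IDs
# into 40%/10%/50% train/val/test sets within each event class.
import random
from collections import defaultdict

def split_by_event(images, seed=0):
    """images: iterable of (image_id, event_class) pairs."""
    rng = random.Random(seed)
    by_event = defaultdict(list)
    for image_id, event in images:
        by_event[event].append(image_id)
    splits = {"train": [], "val": [], "test": []}
    for ids in by_event.values():
        rng.shuffle(ids)
        n_train, n_val = int(0.4 * len(ids)), int(0.1 * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits

print({k: len(v) for k, v in split_by_event(
    [(f"img_{i}", i % 61) for i in range(1000)]).items()})
```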
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a collection of 124 global and free datasets allowing for spatial (and temporal) analyses of floods, droughts and their interactions with human societies. We have structured the datasets into seven categories: hydrographic baseline, hydrological dynamics, hydrological extremes, land cover & agriculture, human presence, water management, and vulnerability. Please refer to Lindersson et al. (accepted February 2020 in WIREs Water) for further information about the review methodology.
The collection is a descriptive list, holding the following information for each dataset:
NOTE: Carefully consult the data usage licenses as given by the data providers, to ensure that the exact permissions and restrictions are followed.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Decoding movement-related intentions is a key step to implement BMIs. Decoding EEG has been challenging due to its low spatial resolution and signal-to-noise ratio. Metric learning allows finding a representation of data in a way that captures a desired notion of similarity between data points. In this study, we investigate how metric learning can help find a representation of the data to efficiently classify EEG movement and pre-movement intentions. We evaluate the effectiveness of the obtained representation by comparing the classification performance of a Support Vector Machine (SVM) trained on the original representation, called Euclidean, and on representations obtained with three different metric learning algorithms: the Conditional Entropy Metric Learning (CEML), Neighborhood Component Analysis (NCA), and Entropy Gap Metric Learning (EGML) algorithms. We examine different types of features, such as time and frequency components, which are input to the metric learning algorithms, and both linear and non-linear SVMs are applied to compare the classification accuracies on a publicly available EEG data set for two subjects (Subjects B and C). Although the metric learning algorithms do not increase the classification accuracies, their interpretability, using an importance measure we define here, helps in understanding the data organization and how much each EEG channel contributes to the classification. In addition, among the metric learning algorithms we investigated, EGML shows the most robust performance due to its ability to compensate for differences in scale and correlations among variables. Furthermore, from the observed variations of the importance maps on the scalp and the classification accuracy, selecting an appropriate feature transformation, such as clipping the frequency range, has a significant effect on the outcome of metric learning and the subsequent classification. In our case, reducing the range of the frequency components to 0–5 Hz shows the best interpretability for both Subjects B and C and the best classification accuracy for Subject C. Our experiments support the potential benefits of using metric learning algorithms by providing a visual explanation of the data projections that explains the inter-class separations, using the importance measure. This visualizes the contribution of features that can be related to brain function.
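As a rough illustration of the kind of pipeline compared in this study (not the authors' code; scikit-learn provides NCA but not CEML or EGML), a learned metric can be chained with an SVM as follows, here on synthetic stand-in data:

```python
# Rough illustration (not the study's code): learn a metric with NCA and
# classify with an SVM, compared against an SVM on the original (Euclidean)
# representation. Synthetic data stands in for the EEG features.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

euclidean_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
nca_svm = make_pipeline(StandardScaler(),
                        NeighborhoodComponentsAnalysis(random_state=0),
                        SVC(kernel="linear"))

print("Euclidean SVM:", cross_val_score(euclidean_svm, X, y, cv=5).mean())
print("NCA + SVM:    ", cross_val_score(nca_svm, X, y, cv=5).mean())
```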
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage alternate current sine-shaped signals. The signal frequencies are from 50 kHz to 20 MHz.
Applications: The intention of this dataset is to allow to investigate the human body as a signal propagation medium, and capture information related to how the properties of the human body (age, sex, composition etc.), the measurement locations, and the signal frequencies impact the signal loss over the human body.
Overview statistics:
Number of subjects: 30
Number of transmitter locations: 6
Number of receiver locations: 6
Number of measurement frequencies: 19
Input voltage: 1 V
Load resistance: 50 ohm and 1 megaohm
Measurement group statistics:
Height: 174.10 (7.15)
Weight: 72.85 (16.26)
BMI: 23.94 (4.70)
Body fat %: 21.53 (7.55)
Age group: 29.00 (11.25)
Male/female ratio: 50%
Included files:
experiment_protocol_description.docx - protocol used in the experiments
electrode_placement_schematic.png - schematic of placement locations
electrode_placement_photo.jpg - visualization on the experiment, on a volunteer subject
RawData - the full measurement results and experiment info sheets
all_measurements.csv - the most important results extracted to .csv
all_measurements_filtered.csv - same, but after z-score filtering
all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row
all_measurements_by_freq_filtered.csv - same, but after z-score filtering
summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets
process_json_files.py - script that creates .csv from the raw data
filter_results.py - outlier removal based on z-score
plot_sample_curves.py - visualization of a randomly selected measurement result subset
plot_measurement_group.py - visualization of the measurement group
CSV file columns:
subject_id - participant's random unique ID
experiment_id - measurement session's number for the participant
height - participant's height, cm
weight - participant's weight, kg
BMI - body mass index, computed from the valued above
body_fat_% - body fat composition, as measured by bioimpedance scales
age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.
male - 1 if male, 0 if female
tx_point - transmitter point number
rx_point - receiver point number
distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!
tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.
rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.
total_fat_level - sum of rx and tx fat levels
bias - constant term to simplify data analytics, always equal to 1.0
CSV file columns, frequency-specific:
tx_abs_Z_... - transmitter-side impedance, as computed by the process_json_files.py
script from the voltage drop
rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance
rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance
Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.
References: For a more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.
Contact information: info@edi.lv