Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
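The dataset's own scripts are not reproduced here, but a minimal R sketch of the workflow it describes might look like the following; the data frame df and the variables count and group are hypothetical placeholders, not the dataset's actual names.

# Sketch: negative binomial regression with an overdispersion check (illustrative only).
# 'df', 'count', and 'group' are hypothetical placeholders.
library(MASS)

fit <- glm.nb(count ~ group, data = df)   # negative binomial regression
summary(fit)                              # summary statistics

# Pearson-based overdispersion statistic: values near 1 indicate adequate dispersion
overdispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
overdispersion

# Remove outliers flagged by large standardized residuals, then refit
keep <- abs(rstandard(fit)) < 3
fit_clean <- glm.nb(count ~ group, data = df[keep, ])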
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identifying errors or anomalous values, collectively considered outliers, assists in exploring data, and removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the 'shape' of entire cycles, although examining fewer points using a 'moving window' may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected in two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial-temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
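The authors supply Matlab code; purely as an illustration, stage 1 (median absolute deviation screening at each time point) might be sketched in R as follows. Here cycles is a hypothetical matrix with one row per cycle and one column per normalised time point, and the fixed threshold multiplier stands in for the t-statistic significance-level scaling used in the paper.

# Illustrative sketch of stage 1 only; not the authors' Matlab implementation.
# 'cycles' is a hypothetical matrix: rows = cycles, columns = time points.
flag_stage1 <- function(cycles, threshold = 3) {
  outlier <- matrix(FALSE, nrow(cycles), ncol(cycles))
  for (t in seq_len(ncol(cycles))) {
    x     <- cycles[, t]
    med   <- median(x)
    mad_t <- median(abs(x - med))          # median absolute deviation at time t
    outlier[, t] <- abs(x - med) > threshold * mad_t
  }
  which(rowSums(outlier) > 0)              # cycles flagged at any time point
}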
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data contain bathymetric data from the Namibian continental slope. The data were acquired on R/V Meteor research expedition M76/1 in 2008 and R/V Maria S. Merian expedition MSM19/1c in 2011. The purpose of the data was the exploration of the Namibian continental slope and especially the investigation of large seafloor depressions. The bathymetric data were acquired with the 191-beam, 12 kHz Kongsberg EM120 system. The data were processed using the open-source software package MB-System. The loaded data were cleaned semi-automatically and manually, removing outliers and other erroneous data. Initial velocity fields were adjusted to remove artifacts from the data. Gridding was done in 10x10 m grid cells for the MSM19-1c dataset and 50x50 m for the M76 dataset using the Gaussian Weighted Mean algorithm.
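MB-System implements this gridding step directly (for example in its mbgrid program); purely as an illustration of the Gaussian weighted mean idea, the depth assigned to one grid cell can be sketched in R as follows, with all variable names hypothetical.

# Illustration of a Gaussian weighted mean for a single grid cell; not MB-System code.
# x, y, z: hypothetical sounding coordinates and depths; cx, cy: cell centre; s: kernel width.
gaussian_cell_mean <- function(x, y, z, cx, cy, s) {
  w <- exp(-((x - cx)^2 + (y - cy)^2) / (2 * s^2))  # Gaussian distance weights
  sum(w * z) / sum(w)                               # weighted mean depth for the cell
}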
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset estimates the location and size of trees in the District of Columbia that are not managed by the Urban Forestry Division (https://opendata.dc.gov/datasets/urban-forestry-street-trees/explore). Trees are modeled using an automated feature-extraction process applied to 2022 LiDAR data. All data are estimates intended for general representation purposes.
The DC 2022 LiDAR data were processed using the "Extract Trees using Cluster Analysis" script, which is included as part of Esri's 3D Basemap solution. All LiDAR-derived trees within 2 meters of an Urban Forestry Division tree were removed as duplicates.
Tree diameter (DBH, in inches) was estimated for the LiDAR-derived trees from the calculated tree height (in feet) using the equation DBH = 0.4003 * height - 1.9557. This equation was derived from a statistical analysis of a detailed park-inventory tree data set and has an R^2 = 0.7418.
Extreme outliers were also modified, with any DBH larger than 80 inches being converted to a DBH of 80 inches. The combined data set was then processed using the USDA Forest Service i-Tree Eco software, where structure and environmental benefits were estimated.
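Taken together, the height-to-DBH conversion and the 80-inch cap amount to a one-line transformation; a minimal R sketch, in which the trees data frame and its height_ft column are hypothetical names rather than the published schema:

# Estimate DBH (inches) from LiDAR-derived tree height (feet), capping extreme values.
# 'trees' and 'height_ft' are hypothetical names, not the published schema.
trees$dbh_in <- 0.4003 * trees$height_ft - 1.9557
trees$dbh_in <- pmin(trees$dbh_in, 80)   # cap extreme outliers at 80 inches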
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Measurement Configuration Dataset
This is the anonymous reviewing version; the source code repository will be added after the review.
This dataset provides reproduction data for performance measurement configuration at source-code level in Java. The measurement data can be obtained yourself using the precision-experiments repository https://anonymous.4open.science/r/precision-experiments-C613/ (Examining Different Repetition Counts). The data contained here are those we obtained from execution on an i7-4770 CPU @ 3.40 GHz.
The analysis was tested on Ubuntu 20.04 and gnuplot 5.2.8. It will not work with older gnuplot versions.
To execute the analysis, extract the data by
tar -xvf basic-parameter-comparison.tar
tar -xvf parallel-sequential-comparison.tar
and afterwards build the precision-experiments repo and execute the analysis by
cd precision-experiments/precision-analysis/
../gradlew fatJar
cd scripts/configuration-analysis/
./executeCompleteAnalysis.sh ../../../../basic-parameter-comparison ../../../../parallel-sequential-comparison
Afterwards, the following files will be present:
precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_all_en.pdf (Heatmaps for different repetition counts)
precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_outlierRemoval_en.pdf (Heatmap with and without outlier removal for 1000 repetitions)
precision-experiments/precision-analysis/scripts/configuration-analysis/histogram_outliers_en.pdf (Histogram of the outliers)
precision-experiments/precision-analysis/scripts/configuration-analysis/heatmap_parallel_en.pdf (Heatmap with sequential and parallel execution)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Version 2 of the Depth of Regolith product of the Soil and Landscape Grid of Australia (produced 2015-06-01).
The Soil and Landscape Grid of Australia has produced a range of digital soil attribute products. The digital soil attribute maps are in raster format at a resolution of 3 arc sec (~90 x 90 m pixels).
Attribute definition: the regolith is the in situ and transported material overlying unweathered bedrock
Units: metres
Spatial prediction method: data mining using piecewise linear regression
Period (temporal coverage, approximate): 1900-2013
Spatial resolution: 3 arc seconds (approx. 90 m)
Total number of gridded maps for this attribute: 3
Number of pixels with coverage per layer: 2007M (49200 x 40800)
Total size before compression: about 8 GB
Total size after compression: about 4 GB
Data license: Creative Commons Attribution 4.0 (CC BY)
Variance explained (cross-validation): R^2 = 0.38
Target data standard: GlobalSoilMap specifications
Format: GeoTIFF
Lineage: The methodology consisted of the following steps: (i) drillhole data preparation, (ii) compilation and selection of the environmental covariate raster layers, and (iii) model implementation and evaluation.
Drillhole data preparation: Drillhole data was sourced from the National Groundwater Information System (NGIS) database. This spatial database holds nationally consistent information about bores that were drilled as part of the Bore Construction Licensing Framework (http://www.bom.gov.au/water/groundwater/ngis/). The database contains 357,834 bore locations with associated lithology, bore construction and hydrostratigraphy records. This information was loaded into a relational database to facilitate analysis.
Regolith depth extraction: The first step was to recognise and extract the boundary between the regolith and the bedrock within each drillhole record. This was done using a key-word look-up table of bedrock- or lithology-related words from the record descriptions; 1,910 unique descriptors were discovered. Using this list of standardised terms, the drillholes were analysed, and the depth value associated with the word in the description that unequivocally indicated fresh bedrock material had been reached was extracted from each record using a tool developed in C#.
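The extraction tool itself was written in C# and is not part of this dataset; as an illustration of the look-up idea only, a simplified version in R might match bedrock terms against each description. The logs data frame, its columns, and the term list below are hypothetical.

# Simplified illustration of the keyword look-up; the actual tool was written in C#.
# 'logs' (columns 'description', 'depth_m') and the term list are hypothetical.
bedrock_terms <- c("granite", "basalt", "fresh bedrock", "unweathered")
pattern <- paste(bedrock_terms, collapse = "|")
hits <- grepl(pattern, tolower(logs$description))
regolith_depth <- logs$depth_m[hits]   # depth at which fresh bedrock was logged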
The second step of regolith depth extraction involved the removal of drillhole bedrock depth records, which was deemed necessary because of the noisiness in depth records resulting from inconsistencies in drilling and description standards identified in the legacy database.
On completion of the filtering and removal of outliers, the drillhole database used in the model comprised 128,033 depth sites.
Selection and preparation of environmental covariates: The environmental-correlation style of DSM applies environmental covariate datasets to predict target variables, here regolith depth. Strongly performing environmental covariates operate as proxies for the factors that control regolith formation, including climate, relief, parent material, organisms and time.
Depth modelling was implemented using the PC-based R statistical software (R Core Team, 2014) and relied on the R Cubist package (Kuhn et al., 2013). To generate modelling uncertainty estimates, the following procedures were followed: (i) random withholding of a subset comprising 20% of the whole depth record dataset for external validation; (ii) bootstrap sampling of the remaining dataset 100 times to produce repeated model training datasets. The Cubist model was then run on each of these training sets to produce a unique rule set per set. Repeated model runs using different training sets, a procedure referred to as bagging or bootstrap aggregating, is a machine-learning ensemble procedure designed to improve the stability and accuracy of the model. The Cubist rule sets generated were then evaluated and applied spatially, calculating a mean predicted value (i.e. the final map). The 5% and 95% confidence intervals were estimated for each grid cell (pixel) in the prediction dataset by combining the variance from the bootstrapping process and the variance of the model residuals. Version 2 differs from version 1 in that the modelling of depths was performed on the log scale to better conform to assumptions of normality used in calculating the confidence intervals, and the method to estimate the confidence intervals was improved to better represent the full range of variability in the modelling process (Wilford et al., in press).
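As a rough sketch of the bagging procedure described above, using the R Cubist package: the covariates data frame and log_depth vector are hypothetical placeholders, and the simple bootstrap quantiles below stand in for the published interval method, which also folds in the variance of the model residuals.

# Sketch of the described bagging workflow with R's Cubist package; illustrative only.
# 'covariates' (data frame) and 'log_depth' (log-scale depths) are placeholders.
library(Cubist)

n <- nrow(covariates)
holdout <- sample(n, size = round(0.2 * n))        # withhold 20% for external validation
train_x <- covariates[-holdout, ]
train_y <- log_depth[-holdout]

preds <- replicate(100, {                          # 100 bootstrap resamples
  idx <- sample(length(train_y), replace = TRUE)
  m   <- cubist(x = train_x[idx, ], y = train_y[idx])
  predict(m, covariates[holdout, ])                # predict the withheld sites
})
mean_pred <- rowMeans(preds)                       # bagged (mean) prediction
ci <- apply(preds, 1, quantile, probs = c(0.05, 0.95))  # 5% and 95% bounds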
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid increase of large-scale datasets, biomedical data visualization faces challenges: the data may be large, span different orders of magnitude, contain extreme values, and have unclear distributions. Here we present an R package, ggbreak, that allows users to create broken axes using ggplot2 syntax. It makes effective use of the plotting area to deal with large datasets (especially long sequential data), data of different magnitudes, and data containing outliers. The ggbreak package increases the available visual space for better presentation of the data and detailed annotation, thus improving our ability to interpret the data. The ggbreak package is fully compatible with ggplot2, making it easy to superpose additional layers and apply scales and themes using the ggplot2 syntax. The ggbreak package is open-source software released under the Artistic-2.0 license, and it is freely available on CRAN (https://CRAN.R-project.org/package=ggbreak) and GitHub (https://github.com/YuLab-SMU/ggbreak).
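A minimal usage example of a broken axis; the data and break points here are arbitrary illustrations, not from the package documentation.

# Minimal ggbreak example: a y-axis break to accommodate one extreme value.
# The data and break points are arbitrary illustrations.
library(ggplot2)
library(ggbreak)

d <- data.frame(x = letters[1:5], y = c(3, 5, 4, 6, 120))
ggplot(d, aes(x, y)) +
  geom_col() +
  scale_y_break(c(10, 110))   # cut the axis between 10 and 110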