Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data, or, through removing outliers, improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of entire cycles, although exploring fewer points using a ‘moving window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected in two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving-window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and flagged outliers should be assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time-series data.
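Stage 1 of the procedure can be sketched as follows (a minimal Python illustration on synthetic cycles, not the supplied Matlab code; the actual method derives its threshold from significance levels of the t-statistic, whereas the fixed `scale` cutoff here is an assumption):

```python
import statistics

def mad_outlier_cycles(cycles, scale=3.5):
    """Stage 1 sketch: flag cycles that are one-dimensional (spatial)
    outliers at any time point, using the median absolute deviation."""
    flagged = set()
    for t in range(len(cycles[0])):
        values = [c[t] for c in cycles]
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread across cycles at this time point
        for i, v in enumerate(values):
            if abs(v - med) / (1.4826 * mad) > scale:
                flagged.add(i)
    return flagged

# Synthetic data: 10 similar cycles plus one with a spike mid-cycle.
cycles = [[0.1 * t + 0.01 * i for t in range(50)] for i in range(10)]
spiked = [0.1 * t for t in range(50)]
spiked[25] += 5.0
cycles.append(spiked)

print(sorted(mad_outlier_cycles(cycles)))  # → [10]: only the spiked cycle
```

Stage 2 would then pass a moving window over the surviving cycles and apply an analogous standard-deviation test in the spatial–temporal plane.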
ABSTRACT The considerable volume of data generated by sensors in the field contains systematic errors; thus, it is extremely important to exclude these errors to ensure mapping quality. The objective of this research was to develop and test a methodology to identify and exclude outliers in high-density spatial data sets, and to determine whether the developed filter process could help decrease the nugget effect and improve the characterization of spatial variability in high-density sampling data. We created a filter composed of a global analysis and an anisotropic local analysis of the data, which considered the respective neighborhood values. For that purpose, we used the median as the main statistical parameter to classify a given spatial point in the data set, taking into account its neighbors within a radius. The filter was tested using raw data sets of corn yield, soil electrical conductivity (ECa), and the sensor vegetation index (SVI) in sugarcane. The results showed an improvement in the accuracy of spatial variability characterization within the data sets. The methodology reduced RMSE by 85 %, 97 %, and 79 % in corn yield, soil ECa, and SVI, respectively, compared to interpolation errors of the raw data sets. The filter excluded the local outliers, which considerably reduced the nugget effects, reducing the estimation error of the interpolated data. The methodology proposed in this work performed better at removing outlier data than two other methodologies from the literature.
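The neighborhood-median idea can be sketched roughly as follows (an illustrative Python fragment, not the authors' filter: the real methodology also includes the global and anisotropic analyses, and the radius and threshold `k` used here are assumptions that would need tuning to each sensor):

```python
import math
import statistics

def local_median_filter(points, radius, k=3.0):
    """Keep points whose value agrees with the median of neighbours
    within `radius`; a simplified local-outlier sketch."""
    kept = []
    for i, (x, y, v) in enumerate(points):
        neigh = [pv for j, (px, py, pv) in enumerate(points)
                 if j != i and math.hypot(px - x, py - y) <= radius]
        if len(neigh) < 3:
            kept.append(points[i])   # too few neighbours to judge
            continue
        med = statistics.median(neigh)
        mad = statistics.median(abs(n - med) for n in neigh) or 1e-9
        if abs(v - med) <= k * 1.4826 * mad:
            kept.append(points[i])
    return kept

# A 5 x 5 grid of yield points, one of them a gross local outlier.
grid = [(x, y, 10.0) for x in range(5) for y in range(5)]
grid[12] = (2, 2, 100.0)             # centre point is anomalous
filtered = local_median_filter(grid, radius=1.5)
print(len(filtered))                 # → 24: the outlier was excluded
```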
Verbal and Quantitative Reasoning GRE scores and percentiles were collected by querying the student database for the appropriate information. Any student records that were missing data such as GRE scores or grade point average were removed from the study before the data were analyzed. The GRE scores of entering doctoral students from 2007-2012 were collected and analyzed. A total of 528 student records were reviewed. Ninety-six records were removed from the data because of a lack of GRE scores. Thirty-nine of these records belonged to MD/PhD applicants, who were not required to take the GRE to be reviewed for admission; the other fifty-seven records did not have an admissions committee score in the database. After 2011, the GRE’s scoring system was changed from a scale of 200-800 points per section to 130-170 points per section. As a result, 12 more records were removed because their scores reflected the new scoring system and therefore could not be compared with the older scores on a raw-score basis. After removal of these 108 records from our analyses, a total of 420 student records remained, which included students who were currently enrolled, left the doctoral program without a degree, or left the doctoral program with an MS degree. To maintain consistency among the participants, we removed 100 additional records so that our analyses only considered students who had graduated with a doctoral degree. In addition, thirty-nine admissions scores were identified as outliers by statistical analysis software and removed for a final data set of 286 (see Outliers below). Outliers. We used the automated ROUT method included in the PRISM software to test the data for the presence of outliers, which could skew our data. The false discovery rate for outlier detection (Q) was set to 1%. After removing the 96 students without a GRE score, 432 students were reviewed for the presence of outliers.
ROUT detected 39 outliers that were removed before statistical analysis was performed. Sample. See the detailed description in the Participants section. Linear regression analysis was used to examine potential trends between GRE scores, GRE percentiles, normalized admissions scores, or GPA and outcomes between selected student groups. The D’Agostino & Pearson omnibus and Shapiro-Wilk normality tests were used to test outcomes in the sample for normality. The Pearson correlation coefficient was calculated to determine the relationship between GRE scores, GRE percentiles, admissions scores, or GPA (undergraduate and graduate) and time to degree. Candidacy exam results were divided into students who either passed or failed the exam. A Mann-Whitney test was then used to test for statistically significant differences between mean GRE scores, percentiles, and undergraduate GPA and candidacy exam results. Other variables, such as gender, race, ethnicity, and citizenship status, were also observed within the samples. Predictive Metrics. The input variables used in this study were GPA and the scores and percentiles of applicants on both the Quantitative and Verbal Reasoning GRE sections. GRE scores and percentiles were examined to normalize variances that could occur between tests. Performance Metrics. The output variables used in the statistical analyses of each data set were either the amount of time it took each student to earn their doctoral degree, or the student’s candidacy examination result.
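The correlation step can be illustrated in a few lines (a generic sketch with made-up numbers, not the study's data; the study itself used PRISM for these analyses):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up illustration: quantitative GRE score vs years to doctoral degree.
gre = [670, 700, 640, 760, 710, 650]
years = [6.1, 5.8, 6.4, 5.2, 6.0, 6.3]
print(round(pearson_r(gre, years), 2))  # → -0.96
```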
Introduction
We are enclosing the database used in our research titled "Concentration and Geospatial Modelling of Health Development Offices' Accessibility for the Total and Elderly Populations in Hungary", along with our statistical calculations. For the sake of reproducibility, further information can be found in the files Short_Description_of_Data_Analysis.pdf and Statistical_formulas.pdf.
The sharing of data is part of our aim to strengthen the base of our scientific research. As of March 7, 2024, the detailed submission and analysis of our research findings to a scientific journal has not yet been completed.
The dataset was expanded on 23rd September 2024 to include SPSS statistical analysis data, a heatmap, and buffer zone analysis around the Health Development Offices (HDOs) created in QGIS software.
Short Description of Data Analysis and Attached Files (datasets):
Our research utilised data from 2022, serving as the basis for statistical standardisation. The 2022 Hungarian census provided an objective basis for our analysis, with age group data available at the county level from the Hungarian Central Statistical Office (KSH) website. The 2022 demographic data provided a more accurate picture than the data available from the 2023 microcensus. Our calculations are based on our standardisation of the 2022 data. For the xlsx files, we used MS Excel 2019 (version: 1808, build: 10406.20006) with the SOLVER add-in.
The Hungarian Central Statistical Office served as the data source for population by age group, county, and region: https://www.ksh.hu/stadat_files/nep/hu/nep0035.html, (accessed 04 Jan. 2024.) with data recorded in MS Excel in the Data_of_demography.xlsx file.
In 2022, 108 Health Development Offices (HDOs) were operational, and it's noteworthy that no developments have occurred in this area since 2022. The availability of these offices and the demographic data from the Central Statistical Office in Hungary are considered public interest data, freely usable for research purposes without requiring permission.
The contact details for the Health Development Offices were sourced from the following page (Hungarian National Population Centre (NNK)): https://www.nnk.gov.hu/index.php/efi (n=107). The Semmelweis University Health Development Centre was not listed by NNK, hence it was separately recorded as the 108th HDO. More information about the office can be found here: https://semmelweis.hu/egeszsegfejlesztes/en/ (n=1). (accessed 05 Dec. 2023.)
Geocoordinates were determined using Google Maps (N=108): https://www.google.com/maps. (accessed 02 Jan. 2024.) Recording of geocoordinates (latitude and longitude according to WGS 84 standard), address data (postal code, town name, street, and house number), and the name of each HDO was carried out in the: Geo_coordinates_and_names_of_Hungarian_Health_Development_Offices.csv file.
The foundational software for geospatial modelling and display (QGIS 3.34), an open-source software, can be downloaded from:
https://qgis.org/en/site/forusers/download.html. (accessed 04 Jan. 2024.)
The HDOs_GeoCoordinates.gpkg QGIS project file contains Hungary's administrative map and the recorded addresses of the HDOs from the
Geo_coordinates_and_names_of_Hungarian_Health_Development_Offices.csv file,
imported via .csv file.
The OpenStreetMap tileset is directly accessible from www.openstreetmap.org in QGIS. (accessed 04 Jan. 2024.)
The Hungarian county administrative boundaries were downloaded from the following website: https://data2.openstreetmap.hu/hatarok/index.php?admin=6 (accessed 04 Jan. 2024.)
HDO_Buffers.gpkg is a QGIS project file that includes the administrative map of Hungary, the county boundaries, as well as the HDO offices and their corresponding buffer zones with a radius of 7.5 km.
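As a rough illustration of the 7.5 km buffer idea (the coordinates below are hypothetical; QGIS computes buffers on a projected coordinate system, whereas this sketch uses a spherical great-circle approximation):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS 84 points
    (spherical Earth approximation, radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def within_buffer(hdo, point, radius_km=7.5):
    """True if `point` (lat, lon) falls inside the HDO's buffer zone."""
    return haversine_km(hdo[0], hdo[1], point[0], point[1]) <= radius_km

# Hypothetical HDO location and two test points at ~6.8 km and ~15 km.
hdo = (47.5, 19.0)
print(within_buffer(hdo, (47.5, 19.09)))  # → True
print(within_buffer(hdo, (47.5, 19.2)))   # → False
```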
Heatmap.gpkg is a QGIS project file that includes the administrative map of Hungary, the county boundaries, as well as the HDO offices and their corresponding heatmap (Kernel Density Estimation).
A brief description of the statistical formulas applied is included in the Statistical_formulas.pdf.
Recording of our base data for statistical concentration and diversification measurement was done using MS Excel 2019 (version: 1808, build: 10406.20006) in .xlsx format.
Using the SPSS 29.0.1.0 program, we performed our statistical calculations with the databases Data_HDOs_population_without_outliers.sav and Data_HDOs_population.sav.
For easier readability, the files have been provided in both SPV and PDF formats.
The translation of these supplementary files into English was completed on 23rd Sept. 2024.
If you have any further questions regarding the dataset, please contact the corresponding author: domjan.peter@phd.semmelweis.hu
The Digital Earth Africa Rates of Change of Coastlines dataset is a point dataset providing robust rates of coastal change (in metres per year) for every 30 m along Africa’s non-rocky (e.g. sandy and muddy) coastlines. These rates are calculated by linearly regressing annual shoreline positions against time, using the most recent shoreline as a baseline. Negative values (red points) indicate retreat (e.g. erosion), and positive values indicate growth (e.g. progradation) over time. By default, rates of change are shown only for points with a statistically significant trend over time.
Key Properties
Geographic Coverage: Continental Africa - approximately 37° North to 35° South
Temporal Coverage: 2000 to Present
Spatial Resolution: 30 x 30 metres
Update Frequency: Annual from 2000 to present; 6 months from end of previous year
Parent Dataset: Landsat Collection 2 Surface Reflectance
Source Data Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)
Service Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)
Rates of change statistics attributes:
Attribute Description
rate_time Annual rates of change (in metres per year) calculated by linearly regressing annual shoreline distances against time (excluding outliers). Negative values indicate retreat and positive values indicate growth.
sig_time Significance (p-value) of the linear relationship between annual shoreline distances and time. Small values (e.g. p-value < 0.01) may indicate a coastline is undergoing consistent coastal change through time.
se_time Standard error (in metres) of the linear relationship between annual shoreline distances and time. This can be used to generate confidence intervals around the rate of change given by rate_time (e.g. 95% confidence interval = se_time * 1.96).
outl_time Individual annual shorelines are noisy estimators of coastline position that can be influenced by environmental conditions (e.g. clouds, breaking waves, sea spray) or modelling issues (e.g. poor tidal modelling results or limited clear satellite observations). To obtain reliable rates of change, outlier shorelines are excluded using a robust Median Absolute Deviation outlier detection algorithm and recorded in this column.
sce Shoreline Change Envelope (SCE). A measure of the maximum change or variability across all annual shorelines, calculated by computing the maximum distance between any two annual shorelines (excluding outliers). This statistic excludes sub-annual shoreline variability.
nsm Net Shoreline Movement (NSM). The distance between the oldest (2000) and most recent annual shoreline (excluding outliers). Negative values indicate the coastline retreated between the oldest and most recent shoreline; positive values indicate growth. This statistic does not reflect sub-annual shoreline variability, so will underestimate the full extent of variability at any given location.
max_year, min_year The year that annual shorelines were at their maximum (i.e. located furthest towards the ocean) and their minimum (i.e. located furthest inland) respectively (excluding outliers). This statistic excludes sub-annual shoreline variability.
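The rate_time calculation can be sketched as follows (an illustrative approximation on synthetic distances: a MAD filter followed by ordinary least squares; the production pipeline's exact thresholds and implementation may differ):

```python
import statistics

def rate_of_change(years, dists, mad_scale=3.0):
    """Sketch of rate_time: drop MAD outlier shorelines, then fit an
    ordinary least-squares slope of distance (m) against time (yr)."""
    med = statistics.median(dists)
    mad = statistics.median(abs(d - med) for d in dists) or 1e-9
    pairs = [(y, d) for y, d in zip(years, dists)
             if abs(d - med) / (1.4826 * mad) <= mad_scale]
    my = sum(y for y, _ in pairs) / len(pairs)
    md = sum(d for _, d in pairs) / len(pairs)
    sxy = sum((y - my) * (d - md) for y, d in pairs)
    sxx = sum((y - my) ** 2 for y, _ in pairs)
    return sxy / sxx  # slope: metres per year

# Synthetic shoreline retreating 2 m/yr, with one bad annual shoreline.
years = list(range(2000, 2010))
dists = [-2.0 * (y - 2000) for y in years]
dists[5] = 150.0                      # e.g. a cloud-affected shoreline
print(round(rate_of_change(years, dists), 2))  # → -2.0: outlier ignored
```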
More details on this dataset can be found here.
Prevalence of anomalies from 2003 to 2012, the annual proportional change during this period and the adjusted annual proportional change after excluding outliers for the 17 anomaly subgroups with statistically significant trends identified in Fig 1.
A tracer breakthrough curve (BTC) for each sampling station is the ultimate goal of every quantitative hydrologic tracing study, and dataset size can critically affect the BTC. Groundwater-tracing data obtained using in situ automatic sampling or detection devices may result in very high-density data sets. Data-dense tracer BTCs obtained using in situ devices and stored in dataloggers can result in visually cluttered overlapping data points. The relatively large amounts of data detected by high-frequency settings available on in situ devices and stored in dataloggers ensure that important tracer BTC features, such as data peaks, are not missed. Alternatively, such dense datasets can also be difficult to interpret. Even more difficult is the application of such dense data sets in solute-transport models, which may not be able to adequately reproduce tracer BTC shapes due to the overwhelming mass of data. One solution to the difficulties associated with analyzing, interpreting, and modeling dense data sets is the selective removal of blocks of data from the total dataset. Although it is possible to skip blocks of tracer BTC data in a periodic sense (data decimation) so as to lessen the size and density of the dataset, skipping or deleting blocks of data may also result in missing the important features that the high-frequency detection settings were intended to capture. Rather than removing, reducing, or reformulating data overlap, signal filtering and smoothing may be utilized, but smoothing errors (e.g., averaging errors, outliers, and potential time shifts) need to be considered. Fitting appropriate probability distributions to tracer BTCs may be used to describe typical tracer BTC shapes, which usually include long tails. Recognizing the probability distributions applicable to tracer BTCs can help in understanding some aspects of tracer migration. This dataset is associated with the following publications: Field, M. 
Tracer-Test Results for the Central Chemical Superfund Site, Hagerstown, Md. May 2014 -- December 2015. U.S. Environmental Protection Agency, Washington, DC, USA, 2017. Field, M. On Tracer Breakthrough Curve Dataset Size, Shape, and Statistical Distribution. ADVANCES IN WATER RESOURCES. Elsevier Science Ltd, New York, NY, USA, 141: 1-19, (2020).
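The decimation-versus-smoothing trade-off described above can be seen in a small sketch (illustrative Python, not part of the dataset's workflow): periodic thinning can drop a narrow tracer peak entirely, while a moving average retains it but damps its height.

```python
def decimate(series, keep_every):
    """Periodic thinning (data decimation): keep every k-th sample."""
    return series[::keep_every]

def moving_average(series, window):
    """Simple centred moving average with shrinking edge windows."""
    half = window // 2
    return [sum(series[max(0, i - half):i + half + 1])
            / (min(len(series), i + half + 1) - max(0, i - half))
            for i in range(len(series))]

# A dense BTC that is flat except for one narrow tracer peak.
btc = [0.0] * 30
btc[13] = 100.0
print(max(decimate(btc, 5)))        # → 0.0: the peak is missed entirely
print(max(moving_average(btc, 5)))  # → 20.0: peak kept but flattened
```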
Abstract Digital Earth Australia Coastlines is a continental dataset that includes annual shorelines and rates of coastal change along the entire Australian coastline from 1988 to the present. The product combines satellite data from Geoscience Australia's Digital Earth Australia program with tidal modelling to map the most representative location of the shoreline at mean sea level for each year. The product enables trends of coastal retreat and growth to be examined annually at both a local and continental scale, and for patterns of coastal change to be mapped historically and updated regularly as data continues to be acquired. This allows current rates of coastal change to be compared with that observed in previous years or decades. The ability to map shoreline positions for each year provides valuable insights into whether changes to our coastline are the result of particular events or actions, or a process of more gradual change over time. This information can enable scientists, managers and policy makers to assess impacts from the range of drivers impacting our coastlines and potentially assist planning and forecasting for future scenarios. The DEA Coastlines product contains five layers:
Annual shorelines
Rates of change points
Coastal change hotspots (1 km)
Coastal change hotspots (5 km)
Coastal change hotspots (10 km)
Annual shorelines: Annual shoreline vectors that represent the median or ‘most representative’ position of the shoreline at approximately 0 m Above Mean Sea Level for each year since 1988. Dashed shorelines have low certainty.
Rates of change points: A point dataset providing robust rates of coastal change for every 30 m along Australia’s non-rocky coastlines. The most recent annual shoreline is used as a baseline for measuring rates of change. Points are shown only for locations with statistically significant rates of change (p-value <= 0.01; see sig_time below) and good quality data (certainty = "good"; see certainty below). Each point shows annual rates of change (in metres per year; see rate_time below) and an estimate of uncertainty in brackets (95% confidence interval; see se_time). For example, there is a 95% chance that a point labelled -10.0 m (±1.0 m) is retreating at a rate of between -9.0 and -11.0 metres per year.
Coastal change hotspots (1 km, 5 km, 10 km): Three point layers summarising coastal change within moving 1 km, 5 km and 10 km windows along the coastline. These layers are useful for visualising regional or continental-scale patterns of coastal change.
Currency
Date modified: August 2023
Modification frequency: Annually
Data extent
Spatial extent: North: -9°; South: -44°; East: 154°; West: 112°
Temporal extent: From 1988 to Present
Source information
Product description and metadata
Digital Earth Australia Coastlines catalog entry
Data download
Interactive Map
Lineage statement
The DEA Coastlines product is under active development. A full and current product description is best sourced from the DEA Coastlines website. For a full summary of changes made in previous versions, refer to Github.
Data dictionary
Layer attribute columns
Annual shorelines
Attribute name Description
OBJECTID Automatically generated system ID
year The year of each annual shoreline
certainty A column providing important data quality flags for each annual shoreline (see the Quality assurance section of the product description and metadata page for more detail about each data quality flag)
tide_datum The tide datum of each annual shoreline (e.g. "0 m AMSL")
id_primary The name of the annual shoreline's Primary sediment compartment from the Australian Coastal Sediment Compartments framework
Rates of change points and Coastal change hotspots
Attribute name Description
OBJECTID Automatically generated system ID
uid A unique geohash identifier for each point
rate_time Annual rates of change (in metres per year) calculated by linearly regressing annual shoreline distances against time (excluding outliers). Negative values indicate retreat and positive values indicate growth
sig_time Significance (p-value) of the linear relationship between annual shoreline distances and time. Small values (e.g. p-value < 0.01 or 0.05) may indicate a coastline is undergoing consistent coastal change through time
se_time Standard error (in metres) of the linear relationship between annual shoreline distances and time. This can be used to generate confidence intervals around the rate of change given by rate_time (e.g. 95% confidence interval = se_time * 1.96)
outl_time Individual annual shorelines are noisy estimators of coastline position that can be influenced by environmental conditions (e.g. clouds, breaking waves, sea spray) or modelling issues (e.g. poor tidal modelling results or limited clear satellite observations). To obtain reliable rates of change, outlier shorelines are excluded using a robust Median Absolute Deviation outlier detection algorithm and recorded in this column
dist_1990, dist_1991, etc Annual shoreline distances (in metres) relative to the most recent baseline shoreline. Negative values indicate that an annual shoreline was located inland of the baseline shoreline. By definition, the most recent baseline column will always have a distance of 0 m
angle_mean, angle_std The mean angle and standard deviation between the baseline point to all annual shorelines. This data is used to calculate how well shorelines fall along a consistent line; high angular standard deviation indicates that derived rates of change are unlikely to be correct
valid_obs, valid_span The total number of valid (i.e. non-outliers, non-missing) annual shoreline observations, and the maximum number of years between the first and last valid annual shoreline
sce Shoreline Change Envelope (SCE). A measure of the maximum change or variability across all annual shorelines, calculated by computing the maximum distance between any two annual shorelines (excluding outliers). This statistic excludes sub-annual shoreline variability like tides, storms and seasonal effects
nsm Net Shoreline Movement (NSM). The distance between the oldest (1988) and most recent annual shoreline (excluding outliers). Negative values indicate the coastline retreated between the oldest and most recent shoreline; positive values indicate growth. This statistic does not reflect sub-annual shoreline variability, so will underestimate the full extent of variability at any given location
max_year, min_year The year that annual shorelines were at their maximum (i.e. located furthest towards the ocean) and their minimum (i.e. located furthest inland) respectively (excluding outliers). This statistic excludes sub-annual shoreline variability
certainty A column providing important data quality flags for each annual shoreline (see the Quality assurance section of the product description and metadata page for more detail about each data quality flag)
id_primary The name of the point's Primary sediment compartment from the Australian Coastal Sediment Compartments framework
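The envelope and movement statistics follow directly from the annual distance columns; a minimal sketch with invented distances (relative to the most recent baseline, outliers assumed already excluded):

```python
def shoreline_stats(dist_by_year):
    """Derive SCE, NSM, max_year and min_year from annual shoreline
    distances (metres relative to the baseline shoreline)."""
    years = sorted(dist_by_year)
    d = dist_by_year
    return {
        "sce": max(d.values()) - min(d.values()),  # widest spread
        "nsm": d[years[-1]] - d[years[0]],         # newest minus oldest
        "max_year": max(years, key=d.get),         # furthest seaward
        "min_year": min(years, key=d.get),         # furthest inland
    }

# Invented distances: the shoreline has retreated 25 m since 1988.
stats = shoreline_stats({1988: 25.0, 2000: 10.0, 2023: 0.0})
print(stats["nsm"])       # → -25.0 (retreat)
print(stats["max_year"])  # → 1988
```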
Contact Geoscience Australia, clientservices@ga.gov.au
Introduction
--------------------------------------------------------------------------------
This is the RNA2023 dataset by the Richardson Lab at Duke University
These are high-quality residues from high-quality, low-redundancy RNA chains in the PDB.
For a similar set of quality-filtered protein residues, see the top2018 datasets at:
https://doi.org/10.5281/zenodo.4626149
https://doi.org/10.5281/zenodo.5115232
Corresponding authors
--------------------------------------------------------------------------------
dcrjsr at kinemage.biochem.duke.edu
christopher.sci.williams at gmail.com
Usage recommendations
--------------------------------------------------------------------------------
RNA residues that fail the filtering criteria described below have been removed from the files. As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.
Files already contain hydrogens added by Reduce in the context of the original full models.
Two datasets are provided. The standard dataset is rna2023_pruned. We recommend this version as the default. The RNA backbone conformational space is highly diverse, and some real conformations fall below the statistical threshold for recognition as suites. Therefore we do not recommend excluding suite outliers from the dataset except in specialty cases. We also provide a rna2023_nosuiteout dataset. In this case, no residues having "!!" outlier suite identifications are permitted. This set may be useful in specialist cases where suite outliers are undesirable or where losing some real conformations is an acceptable sacrifice for maximal filtering.
Each dataset also has a mmCIF version.
Note: Chains are named based on author chain ids, except for 8b0x, chain a. To avoid conflicts with 8b0x chain A in file systems that do not support case-sensitive file names, 8b0x chain a has been renamed to chain AB, matching its PDB/mmCIF designation.
Additional files
--------------------------------------------------------------------------------
rna2023_pdbmetadata.csv contains information on release date, resolution, title, authors, etc for each source pdb.
rna2023_chain_list contains a list of all included chains, plus statistics on the number of residues from each original chain that passed the quality filters.
rna2023_suitename_table.csv and rna2023_suitename_table_nosuiteout.csv contain suitename identifications of rotameric RNA backbone conformations (1a, 1c, 2u, 6d, etc) precomputed for convenience.
Filtering criteria: Chain level
--------------------------------------------------------------------------------
The chain list was derived from http://rna.bgsu.edu/rna3dhub/nrlist, version 3.150 as of 2020/10/28, with a 1.9Å resolution cutoff.
We added 6ugg chain A and two recent EM ribosome structures: 8a3d and 8b0x
After residue-level filtering, chains with no complete suites were removed.
Filtering criteria: Residue level
--------------------------------------------------------------------------------
Even excellent structures usually contain some poorly-resolved regions. Residue-level filtering helps avoid including these regions in otherwise high-quality data.
Residues are required to meet the following validation quality criteria:
No sugar pucker outliers
No steric overlaps or "clashes", as per Probe >= 0.5Å
No covalent bond or angle geometry outliers
Optionally, no !! suite outliers
Residues from xray structures are required to meet the following fit-to-map criteria:
Average of worst 2 atoms' 2Fo-Fc map values >= 1.2
Average of worst 2 atoms' RSCC scores >= 0.7
No atoms modeled at partial occupancy
Residues from em structures are required to meet the following fit-to-map criteria:
RSCC >= 0.7
Residue inclusion fraction = 1.0 or >= 0.95, depending on structure
No atoms modeled at partial occupancy
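The X-ray criteria above can be expressed as a single predicate (a sketch only; the dictionary keys below are hypothetical names for the validation metrics, not the dataset's actual schema or pipeline code):

```python
def passes_xray_filters(res):
    """Sketch of the X-ray residue-level filters; the keys are
    illustrative stand-ins for the validation metrics listed above."""
    return (not res["pucker_outlier"]            # sugar pucker
            and res["worst_clash"] < 0.5         # Probe overlap, Å
            and not res["geometry_outlier"]      # bonds and angles
            and res["worst2_2fofc"] >= 1.2       # map value support
            and res["worst2_rscc"] >= 0.7        # real-space correlation
            and res["min_occupancy"] == 1.0)     # no partial occupancy

good = {"pucker_outlier": False, "worst_clash": 0.2,
        "geometry_outlier": False, "worst2_2fofc": 1.5,
        "worst2_rscc": 0.85, "min_occupancy": 1.0}
bad = dict(good, worst2_rscc=0.6)   # fails the fit-to-map criterion

print(passes_xray_filters(good), passes_xray_filters(bad))  # → True False
```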
Filtering is documented in each pruned file. See USER DOC lines in .pdb and data_rna2023_dataset loops in .cif
Version history
--------------------------------------------------------------------------------
Version 1.0 Jun 30, 2023
Initial version
https://m0ve.com/terms-of-use
This Stone housing dataset, produced by M0VE.com, spans the period from 2021 to 2025 and delivers a focused view of residential pricing based on real, publicly recorded data. Land Registry sales, EPC classifications, and council-sourced records form the backbone of this dataset, but what sets it apart is how the raw inputs are refined. Each entry is rebalanced for inflation, filtered to exclude outliers, and modified to reflect nuanced variables like energy efficiency, build type, and whether the home is newly constructed or long established. Pricing is tracked annually and organised by property category, with helpful comparisons to nearby areas to frame each figure in meaningful context. The dataset is structured for clarity, not clutter, and is intended for use by people who need grounded, reliable pricing.
This dataset provides monthly summaries of evapotranspiration (ET) data from OpenET v2.0 image collections for the period 2008-2023 for all National Watershed Boundary Dataset subwatersheds (12-digit hydrologic unit codes [HUC12s]) in the US that overlap the spatial extent of OpenET datasets. For each HUC12, this dataset contains spatial aggregation statistics (minimum, mean, median, and maximum) for each of the ET variables from each of the publicly available image collections from OpenET for the six available models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop) and the Ensemble image collection, which is a pixel-wise ensemble of all six individual models after filtering and removal of outliers according to the median absolute deviation approach (Melton and others, 2022). Data are available in this data release in two formats: comma-separated values (CSV) and parquet, a high-performance format optimized for storage and processing of columnar data. CSV files containing data for each 4-digit HUC are grouped by 2-digit HUC for easier access to regional data, and the single parquet file provides convenient access to the entire dataset.
For each of the ET models (DisALEXI, eeMETRIC, geeSEBAL, PT-JPL, SIMS, SSEBop), variables in the model-specific CSV data files include:
- huc12: The 12-digit hydrologic unit code
- ET: Actual evapotranspiration (in millimeters) over the HUC12 area in the month, calculated as the sum of daily ET interpolated between Landsat overpasses
- statistic: Max, mean, median, or min; the statistic used in the spatial aggregation within each HUC12. For example, maximum ET is the maximum monthly pixel ET value occurring within the HUC12 boundary after summing daily ET in the month
- year: 4-digit year
- month: 2-digit month
- count: Number of Landsat overpasses included in the ET calculation in the month
- et_coverage_pct: Integer percentage of the HUC12 with ET data, which can be used to determine how representative the ET statistic is of the entire HUC12
- count_coverage_pct: Integer percentage of the HUC12 with count data, which can differ from the et_coverage_pct value because the “count” band in the source image collection extends beyond the “et” band in the eastern portion of the image collection extent
For the Ensemble data, these additional variables are included in the CSV files:
- et_mad: Ensemble ET value, computed as the mean of the ensemble after filtering outliers using the median absolute deviation (MAD)
- et_mad_count: The number of models used to compute the ensemble ET value after filtering for outliers using the MAD
- et_mad_max: The maximum value in the ensemble range, after filtering for outliers using the MAD
- et_mad_min: The minimum value in the ensemble range, after filtering for outliers using the MAD
- et_sam: A simple arithmetic mean (across the six models) of actual ET without outlier removal
Below are the locations of each OpenET image collection used in this summary:
- DisALEXI: https://developers.google.com/earth-engine/datasets/catalog/OpenET_DISALEXI_CONUS_GRIDMET_MONTHLY_v2_0
- eeMETRIC: https://developers.google.com/earth-engine/datasets/catalog/OpenET_EEMETRIC_CONUS_GRIDMET_MONTHLY_v2_0
- geeSEBAL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_GEESEBAL_CONUS_GRIDMET_MONTHLY_v2_0
- PT-JPL: https://developers.google.com/earth-engine/datasets/catalog/OpenET_PTJPL_CONUS_GRIDMET_MONTHLY_v2_0
- SIMS: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SIMS_CONUS_GRIDMET_MONTHLY_v2_0
- SSEBop: https://developers.google.com/earth-engine/datasets/catalog/OpenET_SSEBOP_CONUS_GRIDMET_MONTHLY_v2_0
- Ensemble: https://developers.google.com/earth-engine/datasets/catalog/OpenET_ENSEMBLE_CONUS_GRIDMET_MONTHLY_v2_0
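The MAD-based ensemble statistics described above (et_mad, et_mad_count, et_mad_min, et_mad_max) can be sketched as follows. This is a minimal illustration of the general technique, not the OpenET implementation, which operates pixel-wise in Earth Engine; the function name and the threshold of 2 MADs are assumptions for illustration only.

```python
from statistics import median

def mad_filtered_ensemble(model_values, threshold=2.0):
    """Illustrative MAD outlier filter over per-model ET values.

    Returns et_mad-style statistics: the mean, count, min, and max of the
    values kept after dropping those farther than `threshold` MADs from
    the ensemble median.
    """
    med = median(model_values)
    mad = median(abs(v - med) for v in model_values)
    if mad == 0:
        kept = list(model_values)  # values (near-)identical: keep everything
    else:
        kept = [v for v in model_values if abs(v - med) / mad <= threshold]
    return (sum(kept) / len(kept), len(kept), min(kept), max(kept))

# Example: six hypothetical monthly model values in mm; the 140 mm value
# is flagged as an outlier and excluded from the ensemble mean.
et_mad, n_kept, et_min, et_max = mad_filtered_ensemble([80, 82, 85, 83, 81, 140])
```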
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cell population heterogeneity can affect cellular response and is a major factor in drug resistance. However, few techniques are available to represent and explore how heterogeneity is linked to population response. Recent high-throughput genomic, proteomic, and cellomic approaches offer opportunities for profiling heterogeneity on several scales. We have recently examined heterogeneity in vascular endothelial growth factor receptor (VEGFR) membrane localization in endothelial cells. We and others processed the heterogeneous data through ensemble averaging and integrated the data into computational models of anti-angiogenic drug effects in breast cancer. Here we show that additional modeling insight can be gained when cellular heterogeneity is considered. We present comprehensive statistical and computational methods for analyzing cellomic data sets and integrating them into deterministic models. We present a novel method for optimizing the fit of statistical distributions to heterogeneous data sets in order to preserve important data and exclude outliers. We compare methods of representing heterogeneous data and show that the choice of method can affect model predictions by up to 3.9-fold. We find that VEGF levels, a target for tuning angiogenesis, are more sensitive to VEGFR1 cell surface levels than to VEGFR2; updating VEGFR1 levels in the tumor model gave a 64% change in free VEGF levels in the blood compartment, whereas updating VEGFR2 levels gave a 17% change. Furthermore, we find that subpopulations of tumor cells and tumor endothelial cells (tEC) expressing high levels of VEGFR (>35,000 VEGFR/cell) negate anti-VEGF treatments. We show that lowering the VEGFR membrane insertion rate for these subpopulations recovers the anti-angiogenic effect of anti-VEGF treatment, revealing new treatment targets for specific tumor cell subpopulations. 
This novel method of characterizing heterogeneous distributions shows for the first time how different representations of the same data set lead to different predictions of drug efficacy.
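The general idea of fitting a statistical distribution after excluding extreme values can be illustrated with a minimal sketch. This assumes a lognormal fit by moments of the log-transformed data with symmetric percentile trimming; it is a generic illustration, not the optimization method of the paper, and the function name and trim fraction are assumptions.

```python
from math import log
from statistics import fmean, stdev

def fit_lognormal_trimmed(values, trim_frac=0.025):
    """Fit a lognormal by moments of the log-transformed data after
    trimming a fraction `trim_frac` of points from each tail
    (generic sketch of trimming-then-fitting)."""
    data = sorted(values)
    k = int(len(data) * trim_frac)
    core = data[k:len(data) - k] if k > 0 else data
    logs = [log(v) for v in core]
    return fmean(logs), stdev(logs)  # (mu, sigma) of the log-values
```

A per-cell receptor-count distribution fitted this way could then be sampled to populate subpopulations in a deterministic model, rather than collapsing the data to a single ensemble average.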
Since MANOVA revealed that a negative relationship between biological composition (the RNA/DNA ratio and the Symbiodinium genome copy proportion [GCP]) and gene expression significantly distinguished outliers from non-outliers (Fig 4C), multivariate analysis of covariance (MANCOVA) and univariate ANCOVA were performed on the multivariate means (excluding size data) and individual gene expression means, respectively. Only genes for which statistically significant interaction effects were documented between a biological composition parameter and outlier status (analyzed as a categorical variable: outlier [yes] vs. non-outlier [no]) have been presented. For MANCOVA, Wilks’ lambda values are shown for comparisons between a continuous variable and a categorical one, while Exact F values are shown between two continuous variables. For the multivariate data, individual correlations were tested between canonical scores (first axis only) and 1) the Symbiodinium GCP and 2) the RNA/DNA ratio. t = linear regression test statistic. *p<0.05. **p<0.01. ***p<0.0001.
https://m0ve.com/terms-of-use
Compiled by M0VE.com and covering the years 2021 to 2025, this Evesham residential dataset is structured to remove common distortions and present a clear, honest picture of the local market. Source data includes Land Registry entries, EPC documentation, and council-maintained housing records. Every datapoint is passed through a deliberate refinement system that accounts for inflation, excludes statistical outliers, and adjusts pricing according to building age, energy classification, and home category. With year-on-year changes mapped out and prices grouped by property type, the dataset offers practical insights backed by comparable information from surrounding areas. It’s straightforward, consistent, and designed for actual use.
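The dataset's exact refinement rules are not published; a common way to exclude statistical outliers within each property-type group is the Tukey interquartile-range fence, sketched here purely as an illustration (function names and the 1.5 multiplier are standard conventions, not M0VE's method):

```python
from statistics import quantiles

def iqr_filter(prices, k=1.5):
    """Tukey fences: keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(prices, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [p for p in prices if lo <= p <= hi]

def filter_by_type(records):
    """Apply the fences separately within each property-type group,
    so a large detached price is not judged against flat prices."""
    groups = {}
    for ptype, price in records:
        groups.setdefault(ptype, []).append(price)
    return {ptype: iqr_filter(vals) for ptype, vals in groups.items()}
```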
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Robustness test by sequentially removing the outliers.
We propose new estimates of the secular aberration drift, mainly due to the rotation of the Solar System about the Galactic center, based on up-to-date VLBI observations and an improved method of outlier elimination. We fit degree-2 vector spherical harmonics to the extragalactic radio source proper motion field derived from geodetic VLBI observations spanning 1979-2013. We pay particular attention to the outlier elimination procedure to remove outliers from (i) radio source coordinate time series and (ii) the proper motion sample. We obtain more accurate values of the Solar System acceleration compared to those in our previous paper. The acceleration vector is oriented towards the Galactic center to within ~7°. The component perpendicular to the Galactic plane is statistically insignificant. We show that insufficient cleaning of the data set can lead to strong variations in the dipole amplitude and orientation, and to statistically biased results.
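Iterative outlier elimination from a coordinate time series can be sketched with simple sigma-clipping. This is a generic illustration of the technique, not the paper's actual procedure; the function name and clipping threshold are assumptions.

```python
from statistics import fmean, stdev

def sigma_clip(series, k=3.0, max_iter=10):
    """Iteratively drop points more than k sample standard deviations
    from the mean until the sample stabilizes (or max_iter passes)."""
    data = list(series)
    for _ in range(max_iter):
        if len(data) < 3:
            break
        m, s = fmean(data), stdev(data)
        kept = [x for x in data if abs(x - m) <= k * s]
        if len(kept) == len(data):
            break  # converged: no point exceeds the threshold
        data = kept
    return data
```

Because a single gross outlier inflates both the mean and the standard deviation, clipping must be iterated: points that survive the first pass can fall outside the tighter threshold computed on the cleaned sample.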
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
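The record supplies R code; as a language-agnostic illustration, the Pearson overdispersion statistic such code typically computes can be sketched in Python (the function name and the simple ratio form are assumptions, not the record's implementation):

```python
def pearson_overdispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom.
    Values well above 1 indicate counts more variable than a Poisson
    model assumes, motivating a negative binomial regression instead."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    dof = len(observed) - n_params
    return chi2 / dof
```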
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The etiology of diabetic kidney disease is complex, and the role of lipoproteins and their lipid components in the development of the disease cannot be ignored. However, phospholipids are an essential component, and no Mendelian randomization studies have yet been conducted to examine potential causal associations between phospholipids and diabetic kidney disease.
Methods: Relevant exposure and outcome datasets were obtained through the GWAS public database. The exposure datasets included various phospholipids, including those in LDL, IDL, VLDL, and HDL. IVW methods were the primary analytical approach. The accuracy of the results was validated by conducting heterogeneity, MR pleiotropy, and F-statistic tests. MR-PRESSO analysis was utilized to identify and exclude outliers.
Results: Phospholipids in intermediate-density lipoprotein (OR: 0.8439; 95% CI: 0.7268–0.9798), phospholipids in large low-density lipoprotein (OR: 0.7913; 95% CI: 0.6703–0.9341), phospholipids in low-density lipoprotein (after removing outliers, OR: 0.788; 95% CI: 0.6698–0.9271), phospholipids in medium low-density lipoprotein (OR: 0.7682; 95% CI: 0.634–0.931), and phospholipids in small low-density lipoprotein (after removing outliers, OR: 0.8044; 95% CI: 0.6952–0.9309) were found to be protective factors.
Conclusions: This study found that a higher proportion of phospholipids in intermediate-density lipoprotein and the various subfractions of low-density lipoprotein, including large LDL, medium LDL, and small LDL, is associated with a lower risk of developing diabetic kidney disease.
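The IVW method mentioned above combines per-variant ratio estimates weighted by their inverse variances. A minimal fixed-effect sketch follows (real analyses use dedicated MR packages such as TwoSampleMR; the function name here is an assumption):

```python
from math import sqrt

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) estimate of the
    causal effect from per-SNP summary statistics: each SNP's ratio
    estimate by/bx is weighted by (bx/se)^2."""
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    beta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = 1.0 / sqrt(sum(weights))
    return beta, se
```

For a binary outcome such as diabetic kidney disease, the summary effects are on the log-odds scale, so exponentiating the pooled beta gives odds ratios of the kind reported in the Results.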
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Full model output for statistical analyses (including and excluding outliers) and figures showing the relationship between phenological sensitivity and performance for flowering and vegetative phenology and proportional increase in dbh for species in the Harvard Forest Warming Experiment.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Validation of the GPR model: Comparison between the estimates obtained with the GPR method, with and without outliers, and the ECMWF ERA5 reanalysis data.