Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Dataset provided to help users interpret the correction made to the detailed Census 2021 sexual orientation estimates. More information in quality notice.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Companion data for the creation of a banksia plot:
Background: In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.
Methods: The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
Results: In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.
Conclusions: The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.
This collection of files allows the user to create the images used in the companion paper and amend this code to create their own banksia plots using either Stata version 17 or R version 4.3.1.
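The centring and scaling described in the Methods can be illustrated with a minimal Python sketch. This is not the companion Stata or R code: the `Interval` class, the `banksia_transform` function and the example numbers are hypothetical, and the sketch assumes that "centring" means subtracting the reference point estimate and "scaling" means dividing by the reference CI width.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """A point estimate with its confidence interval limits (hypothetical helper)."""
    estimate: float
    lower: float
    upper: float


def banksia_transform(reference: Interval, comparator: Interval):
    """Centre and scale a reference/comparator pair.

    Assumption: the shift is the reference point estimate and the scale is the
    reference CI width, so the reference ends up at 0 with a CI spanning 1.
    """
    shift = reference.estimate
    scale = reference.upper - reference.lower
    if scale <= 0:
        raise ValueError("reference CI must have positive width")

    def adjust(iv: Interval) -> Interval:
        # Apply the same shift and scale to every value of the interval.
        return Interval(
            (iv.estimate - shift) / scale,
            (iv.lower - shift) / scale,
            (iv.upper - shift) / scale,
        )

    return adjust(reference), adjust(comparator)


# Made-up numbers for illustration only.
ref, comp = banksia_transform(Interval(10.0, 6.0, 14.0), Interval(12.0, 10.0, 14.0))
print(ref)   # reference: estimate 0, CI width 1
print(comp)  # comparator: shifted by 0.25 reference-widths, CI half as wide
```

Because both analyses are adjusted by the same amounts, the comparator's transformed estimate and CI width can be read directly as multiples of the reference CI width, regardless of the original outcome scale.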
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for "Comparing Transaction Logs to ILL requests to Determine the Persistence of Library Patrons In Obtaining Materials" article. Excel file contains all data in four worksheets. Zip file contains four csv files, one for each worksheet:
- Comparing Transaction Logs to ILL - 2016 ILL Raw ...Data.csv
- Comparing Transaction Logs to ILL - 2015 ILL Raw Data.csv
- Comparing Transaction Logs to ILL - 2016 Zero Search Raw Data.csv
- Comparing Transaction Logs to ILL - 2015 Zero Search Raw Data.csv
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data correspond to quantitative (raw) effort assessments/predictions during the maintenance process for a sample of 1000 possible instances of the general selection problem between Visitor and Inheritance Based Implementation over the Composite design pattern (CIBI vs CVP).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from experiment "Sample-comparison mapping and joint stimulus control"
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.
A comma separated values (csv) file that is a snapshot of percent difference between November 19, 2008 and November 14, 2016 peak streamflow. The file lists station identification, water year, original (2008) peak Q, current (2016) peak Q and percent difference calculated per water year. The percent difference was calculated as the absolute value of [(current peak Q - original peak Q)/(original peak Q) x 100], where current peak Q is the 2016 peak and the original peak Q is the 2008 peak. When an original peak Q value is 0, the resultant percent difference calculation is undefined because of division by 0. In these cases, the percent difference field is populated with NA. Those entries are included in the data file so that users can make their own comparisons between the 2008 and 2016 peaks for those cases where the original peak value was 0.
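The percent difference calculation described above, including the NA case, can be sketched in Python as follows. The function and argument names are hypothetical and the example values are made up; only the formula and the NA rule come from the description.

```python
from typing import Optional


def percent_difference(original_peak_q: float, current_peak_q: float) -> Optional[float]:
    """Absolute percent difference between the 2008 (original) and 2016 (current) peak Q.

    Returns None when the original peak is 0, because the calculation would
    divide by zero; the data file records these cases as NA.
    """
    if original_peak_q == 0:
        return None
    return abs((current_peak_q - original_peak_q) / original_peak_q * 100)


def format_field(value: Optional[float]) -> str:
    """Render the value as it might appear in a csv field (NA when undefined)."""
    return "NA" if value is None else f"{value:g}"


print(format_field(percent_difference(200.0, 150.0)))  # 25
print(format_field(percent_difference(0.0, 120.0)))    # NA
```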
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons of MI vs. original, EM vs. original, and CIM vs. original
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pathway Multi-Omics Simulated Data
These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".
There are 100 sets (stored as 100 sub-folders, the first 50 in "pt1" and the second 50 in "pt2") of random modifications to centred and scaled copy number, gene expression, and proteomics data saved as compressed data files for the R programming language. These data sets are stored in subfolders labelled "sim001", "sim002", ..., "sim100". Each folder contains the following contents: (1) "indicatorMatricesXXX_ls.RDS" is a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment (where XXX is the simulation run label: 001, 002, ...), (2) "CNV_partitionA_deltaB.RDS" is the synthetically modified copy number variation data (where A represents the proportion of genes in each gene set to receive the synthetic treatment [partition 1 is 20%, 2 is 40%, 3 is 60% and 4 is 80%] and B is the signal strength in units of standard deviations), (3) "RNAseq_partitionA_deltaB.RDS" is the synthetically modified gene expression data (same parameter legend as CNV), and (4) "Prot_partitionA_deltaB.RDS" is the synthetically modified protein expression data (same parameter legend as CNV).
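The file naming legend above can be decoded with a short Python sketch. The `describe_file` helper and the example file name are hypothetical. The exact formatting of the B (signal strength) part is not stated in the description, so the sketch keeps it as an unparsed string instead of guessing a numeric format.

```python
import re

# Partition label -> proportion of genes in each gene set receiving the treatment,
# as given in the description above.
PARTITION_TO_PROPORTION = {1: 0.20, 2: 0.40, 3: 0.60, 4: 0.80}

# Assumed file name layout: <platform>_partition<A>_delta<B>.RDS
_FILE_PATTERN = re.compile(r"^(CNV|RNAseq|Prot)_partition(\d+)_delta(.+)\.RDS$")


def describe_file(name: str) -> dict:
    """Decode platform, treated-gene proportion and signal strength from a file name."""
    match = _FILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name}")
    platform, partition, delta = match.groups()
    proportion = PARTITION_TO_PROPORTION.get(int(partition))
    if proportion is None:
        raise ValueError(f"unknown partition label: {partition}")
    # delta is left as text because its exact formatting is an unknown here.
    return {"platform": platform, "proportion": proportion, "delta": delta}


# Hypothetical file name, for illustration only.
print(describe_file("CNV_partition2_delta0.3.RDS"))
```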
Supplemental Files
The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study in Gene Matrix Transpose format. Scripts to create and analyze these data sets available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement
https://data.gov.tw/license
Comparison table of old and new section and plot numbers for areas that have undergone cadastral map re-survey or cadastral reorganization.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LLM Similarity Comparison Dataset
This dataset is based on the original Alpaca dataset and was synthetically generated for LLM similarity comparison using the ConSCompF framework, as described in the original paper. The script used for generating data is available on Kaggle. It is divided into 3 subsets:
quantization - contains 156,000 samples (5,200 for each model) generated by the original Tinyllama and its 8-bit, 4-bit, and 2-bit GGUF quantized versions. comparison - contains 28,600… See the full description on the dataset page: https://huggingface.co/datasets/alex-karev/llm-comparison.
https://rightsstatements.org/page/InC/1.0/
The dataset is divided into four subfolders:
1) "SEM experiment data" contains Scanning Electron Microscopy data, epifluorescence microscopy data and flow cytometry data of cultured Synechococcus, Chroococcus and Snowella.
2) "raw data" contains epifluorescence microscopy and flow cytometry data of picophytoplankton from Finnish lakes. This has two subfolders: "flow cytometry raw" and "microscopy raw".
3) "flow cytometry calibration data" contains data for cell size calibration with latex beads and volumetric calibration for the flow cytometer.
4) "processed flow and microscopy data" contains Excel workbooks for the figures shown in the manuscript.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Comparison of outturn information with final plans by department for 2009-10, taken from snapshots 31 and 11 (Main Estimate outturn snapshot April 2010 and Spring Supplementary Estimates plans snapshot February 2010). The 2009-10 data are consistent with the raw COINS data published in June 2010. The 2009-10 data will not match the provisional outturn for 2009-10 published by the Treasury on 26 July 2010. These datasets, and the COINS raw data will be updated at the end of September, to reflect the latest outturn for 2009-10, once all related national statistic releases have taken place.
There is interest in using social media content to supplement or even substitute for survey data. O’Connor et al. (2010) report reasonably high correlations between the sentiment of tweets containing the word “jobs” and survey-based measures of consumer confidence in 2008-2009. Other researchers report a similar relationship through 2011 but after that time it is no longer observed, suggesting such tweets may not be as promising an alternative to survey responses as originally hoped. But, it’s possible that with the right analytic techniques, the sentiment of “jobs” tweets might still be an acceptable alternative. We explore this possibility by attempting to strengthen the original relationship and then extending the most successful approaches to more recent years. We classify “jobs” tweets into categories whose content is related to employment and categories whose content is not, to see if sentiment of the former correlates more highly with a survey-based measure of consumer sentiment. We use five sentiment-scoring tools, calculate daily sentiment three different ways, and use a measure of association less sensitive to outliers than correlation. None of these approaches improved the size of the relationship in the original or more recent data. We discuss the possibility that weighting and better understanding why users tweet might help recover the original relationship between the sentiment of tweets and survey responses. However, despite the earlier promise of tweets as an alternative to survey responses, we find no evidence that the original relationship was more than a chance occurrence.
Raw data, computed data and statistical code for all main analyses and subgroup analyses presented in JAMA Netw Open. 2020;3(8):e2015009. doi:10.1001/jamanetworkopen.2020.15009 Data sharing statement: Access to The English Longitudinal Study of Ageing (ELSA) dataset is publicly available via the UK Data Service (https://www.ukdataservice.ac.uk) Note: Statistical code to create the subcategories of some demographic variables included in the analyses (e.g. age categories of participants) may not be available in the current dataset. Additional statistical code is available from the corresponding author upon reasonable request at: dialechti.tsimpida@manchester.ac.uk
This data set contains the originally-submitted observation measurement data, terrestrial biosphere model output data, and inverse model simulations that various investigator teams contributed to the North American Carbon Program (NACP) Regional Synthesis activities. The data set provides nine (9) data packages of remote sensing and ground observation measurements (OM) (MODIS gross primary productivity (GPP), MODIS net primary production (NPP), MODIS fraction of photosynthetically active radiation (fPar), MODIS leaf area index (LAI), MODIS enhanced vegetation index (EVI), MODIS normalized difference vegetation index (NDVI), Forest Inventory and Analysis (FIA) forest biomass, National Agricultural Statistics Service (NASS) crop NPP, and Flux Anomaly). The data set also provides data packages of simulation results from 19 terrestrial biosphere models (TBM) and eight (8) inverse models (IM). The data packages are respectively OM, TBM, and IM data files listed in Tables 4-6. Each OM, TBM, and IM data package contains all of the original data (and documentation, if any) that the NACP Modeling and Synthesis Thematic Data Center (MAST-DC) acquired or received. These originally-submitted data were processed by the MAST-DC to produce the three standardized gridded data sets of carbon flux for inter-comparison purposes (see Related Data Products below). These original data and documentation are provided to allow users of the standardized gridded data products to trace back to the data origins when needed. The Data Center (ORNL DAAC) transformed some of the originally-submitted data files to file formats that are more suitable for long-term archiving. For example, .xlsx files were saved as .csv, ERDAS Imagine files were converted to GeoTIFFs, and MATLAB files were converted to GeoTIFF and NetCDF formats as appropriate. Files received in NetCDF, GeoTIFF, and HDF formats were not transformed.
https://data.gov.tw/license
Comparison of new and old land numbers from historical re-survey work in Bade District (through the end of 2015).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets are used in a controlled experiment, where two classifiers should be compared. train_a.csv and explain.csv are slices from the original data set. train_b.csv contains the same instances as in train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.
The original data set was created and split using this Python code:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate a two-feature binary classification problem and rescale the features.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

# Classifier A is trained on both features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

# Classifier B is trained on the same split, but with feature x1 set to 0.
clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

# The held-out half is the slice used for explanation.
X_explain = X_test
y_explain = y_test
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File descriptions:
All csv files contain results from the different models (PAMM, AARs, linear models, MRPPs) for each iteration of the simulation, with one row per iteration.
"results_perfect_detection.csv" contains the results from the first simulation part with all the observations.
"results_imperfect_detection.csv" contains the results from the first simulation part with randomly thinned observations to mimic imperfect detection.
- ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
- PAMM30: p-value of the PAMM run on the 30-day survey.
- PAMM7: p-value of the PAMM run on the 7-day survey.
- AAR1: ratio value for the Avoidance-Attraction-Ratio calculating AB/BA.
- AAR2: ratio value for the Avoidance-Attraction-Ratio calculating BAB/BB.
- Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).
- Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
- Karanth_permA: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
- MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
- MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
- Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
- MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
- MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
"results_int_dir_perf_det.csv" contains the results from the second simulation part, with all the observations.
"results_int_dir_imperf_det.csv" contains the results from the second simulation part, with randomly thinned observations to mimic imperfect detection.
- ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
- p_pamm7_AB: p-value of the PAMM run on the 7-day survey testing for the effect of A on B.
- p_pamm7_AB: p-value of the PAMM run on the 7-day survey testing for the effect of B on A.
- AAR1: ratio value for the Avoidance-Attraction-Ratio calculating AB/BA.
- AAR2_BAB: ratio value for the Avoidance-Attraction-Ratio calculating BAB/BB.
- AAR2_ABA: ratio value for the Avoidance-Attraction-Ratio calculating ABA/AA.
- Harmsen_P: p-value from the linear model with interaction Species1*Species2 from Harmsen et al. (2009).
- Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
- Karanth_permA: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
- MurphyAB_permA: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
- MurphyBA_permA: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
- Karanth_permB: rank of the observed interval duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
- MurphyAB_permB: rank of the observed AB interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
- MurphyBA_permB: rank of the observed BA interval duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
Script file descriptions:
1_Functions: R script containing the functions:
- MRPP from Karanth et al. (2017), adapted here for time efficiency.
- MRPP from Murphy et al. (2021), adapted here for time efficiency.
- A version of the ct_to_recurrent() function from the recurrent package, adapted to run in parallel on the simulation datasets.
- The simulation() function used to simulate observations of two species with reciprocal effects on each other.
2_Simulations: R script containing the parameter definitions for all iterations (for the two parts of the simulations), the simulation parallelization and the random thinning mimicking imperfect detection.
3_Approaches comparison: R script containing the fit of the different models tested on the simulated data.
3_1_Real data comparison: R script containing the fit of the different models tested on the real data example from Murphy et al. 2021.
4_Graphs: R script containing the code for plotting results from the simulation part and appendices.
5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing the code from Karanth et al. (2017) and Murphy et al. (2021), the versions adapted for time efficiency, and a comparison to verify that the results are similar.
5_2_Appendix - Multi-response procedure permutation difference: R script testing for differences between the MRPP approaches according to the species on which the permutations are done.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw data for the paper titled "Modular comparison of untargeted metabolomics processing steps". The dataset encompasses 42 samples, with 3 solvent blanks, 7 QC samples, and 32 biological samples (4 biological replicates: Banane, Bergrose, Narbe, Ricky) spiked with 42 compounds in different concentrations (0 ng/mL, 30 ng/mL, 100 ng/mL, 300 ng/mL).