100+ datasets found

Column Correlation Data Dataset
paperswithcode.com
Updated Sep 12, 2023
Cite
Immanuel Trummer (2023). Column Correlation Data Dataset [Dataset]. https://paperswithcode.com/dataset/column-correlation-data
Dataset updated
Sep 12, 2023
Authors
Immanuel Trummer
Description
Contains correlation data for 119,384 column pairs, taken from 3,952 data sets, including Pearson correlation, Spearman correlation, and Theil's U. This data can be used, e.g., for approaches that predict column correlation based on column properties, including column names.
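The three measures named in this description can be sketched in plain Python. This is a minimal illustration, not the dataset author's code: the function names are hypothetical, and Theil's U is computed here in its usual entropy-based form U(x|y) = (H(x) - H(x|y)) / H(x), which the listing itself does not spell out.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def entropy(values):
    """Shannon entropy (natural log) of a sequence of categorical values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def theils_u(x, y):
    """Theil's U of x given y (assumed form): (H(x) - H(x|y)) / H(x)."""
    hx = entropy(x)
    if hx == 0:
        return 1.0  # x is constant, so it is trivially predictable
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # H(x|y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
    h_x_given_y = -sum(
        c / n * math.log(c / y_counts[yv]) for (_, yv), c in xy_counts.items()
    )
    return (hx - h_x_given_y) / hx
```

Unlike the two correlation coefficients, Theil's U is asymmetric and works on categorical columns, which may be why the dataset reports all three.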
Evaluating Correlation Between Measurement Samples in Reverberation Chambers...
data.nist.gov
datasets.ai
+1more
Updated Apr 6, 2023
+ more versions
Cite
National Institute of Standards and Technology (2023). Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering [Dataset]. http://doi.org/10.18434/mds2-2986
Unique identifier
https://doi.org/10.18434/mds2-2986, https://identifiers.org/ark:/88434/mds2-2986
Dataset updated
Apr 6, 2023
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
License
https://www.nist.gov/open/license
Description
Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering. Abstract: Traditionally, in reverberation chambers (RC), measurement autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields and applying decreasing angular steps between consecutive paddle positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, and clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time. Keywords: Correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.
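One way to picture threshold-based clustering of correlated measurements is the sketch below. The abstract does not define "correlative distance" or the clustering procedure, so everything here is an assumption: two measurements are linked when the absolute Pearson correlation reaches a threshold (equivalently, when the distance 1 - |r| is small enough), and clusters are the connected components of the resulting adjacency matrix. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_clusters(samples, threshold):
    """Group sample indices into connected components of the adjacency matrix.

    Assumption: samples i and j are adjacent when |r_ij| >= threshold.
    """
    n = len(samples)
    adjacency = [
        [i != j and abs(pearson(samples[i], samples[j])) >= threshold
         for j in range(n)]
        for i in range(n)
    ]
    unvisited = set(range(n))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, component = [start], [start]
        while stack:
            i = stack.pop()
            for j in sorted(unvisited):
                if adjacency[i][j]:
                    unvisited.remove(j)
                    stack.append(j)
                    component.append(j)
        clusters.append(sorted(component))
    return clusters
```

Under these assumptions, single-member clusters play the role of uncorrelated outliers, and the number of clusters gives a rough count of effective samples at the chosen threshold.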
D-CCA: A Decomposition-Based Canonical Correlation Analysis for...
tandf.figshare.com
zip
Updated Feb 9, 2024
Cite
Hai Shu; Xiao Wang; Hongtu Zhu (2024). D-CCA: A Decomposition-Based Canonical Correlation Analysis for High-Dimensional Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7461734.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7461734.v2
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francis
Authors
Hai Shu; Xiao Wang; Hongtu Zhu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A typical approach to the joint analysis of two high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within a single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider the more necessary orthogonal relationship between the two distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the ℓ2 space of random variables rather than the conventionally used Euclidean space, with a careful construction of the orthogonal relationship between distinctive matrices. D-CCA represents a natural generalization of the traditional canonical correlation analysis. The proposed estimators of common and distinctive matrices are shown to be consistent and have reasonably better performance than some state-of-the-art methods in both simulated data and the real data analysis of breast cancer data obtained from The Cancer Genome Atlas. Supplementary materials for this article are available online.
Music & Affect 2020 Dataset Study 1.csv
psycharchives.org
Updated Sep 17, 2020
+ more versions
Cite
(2020). Music & Affect 2020 Dataset Study 1.csv [Dataset]. https://www.psycharchives.org/handle/20.500.12034/3089
Dataset updated
Sep 17, 2020
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset for: Leipold, B. & Loepthien, T. (2021). Attentive and emotional listening to music: The role of positive and negative affect. Jahrbuch Musikpsychologie, 30. https://doi.org/10.5964/jbdgm.78 In a cross-sectional study, associations of global affect with two ways of listening to music, attentive-analytical listening (AL) and emotional listening (EL), were examined. More specifically, the degrees to which AL and EL are differentially correlated with positive and negative affect were examined. In Study 1, a sample of 1,291 individuals responded to questionnaires on listening to music, positive affect (PA), and negative affect (NA). We used the PANAS, which measures PA and NA as high-arousal dimensions. AL was positively correlated with PA, EL with NA. Moderation analyses showed stronger associations between PA and AL when NA was low. Study 2 (499 participants) differentiated between three facets of affect and focused, in addition to PA and NA, on the role of relaxation. Similar to the findings of Study 1, AL was correlated with PA, EL with NA and PA. Moderation analyses indicated that the degree to which PA is associated with an individual's tendency to listen to music attentively depends on their degree of relaxation. In addition, the correlation between pleasant activation and EL was stronger for individuals who were more relaxed; for individuals who were less relaxed, the correlation between unpleasant activation and EL was stronger. In sum, the results demonstrate not only simple bivariate correlations, but also that the expected associations vary depending on the different affective states. We argue that the results reflect a dual function of listening to music, which includes emotional regulation and information processing. Dataset: Study 1
Data from: Correlation matrices.
plos.figshare.com
xlsx
Updated May 31, 2023
Cite
Bálint Maczák; Gergely Vadai; András Dér; István Szendi; Zoltán Gingl (2023). Correlation matrices. [Dataset]. http://doi.org/10.1371/journal.pone.0261718.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0261718.s001
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Bálint Maczák; Gergely Vadai; András Dér; István Szendi; Zoltán Gingl
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Our analyses are based on 148×148 time- and frequency-domain correlation matrices. A correlation matrix covers all the possible use cases of every activity metric listed in the article. With these activity metrics and different preprocessing methods, we were able to calculate 148 different activity signals from multiple datasets of a single measurement. Each cell of a correlation matrix contains the mean and standard deviation of the calculated Pearson’s correlation coefficients between two types of activity signals based on 42 different subjects’ 10-days-long motion. The small correlation matrices presented both in the article and in the appendixes are derived from these 148 × 148 correlation matrices. This published Excel workbook contains multiple sheets labelled according to their content. The mean and standard deviation values for both time- and frequency-domain correlations can be found on their own separate sheet. Moreover, we reproduced the correlation matrix with an alternatively parametrized digital filter, which doubled the number of sheets to 8. In the Excel workbook, we used the same notation for both the datasets and activity metrics as presented in this article with an extension to the PIM metric: PIMs denotes the PIM metric where we used Simpson’s 3/8 rule integration method, PIMr indicates the PIM metric where we calculated the integral by simple numerical integration (Riemann sum). (XLSX)
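The two integration schemes mentioned for the PIM metric (PIMs and PIMr) differ only in how equally spaced samples are weighted. The sketch below shows the textbook forms of both. It is an illustration under assumptions: the function names are hypothetical, a left Riemann sum is used because the description does not say which variant the authors chose, and how the rule is applied to the activity signals is not stated in the listing.

```python
def riemann_sum(samples, h):
    """Left Riemann sum of equally spaced samples with spacing h."""
    return h * sum(samples[:-1])


def simpson_38(samples, h):
    """Composite Simpson's 3/8 rule.

    Requires 3k + 1 equally spaced samples (k >= 1), i.e. a number of
    intervals that is a positive multiple of three.
    """
    n = len(samples) - 1
    if n < 3 or n % 3 != 0:
        raise ValueError("number of intervals must be a positive multiple of 3")
    total = samples[0] + samples[-1]
    for i in range(1, n):
        # Interior points shared by two panels get weight 2, the rest weight 3.
        total += (2 if i % 3 == 0 else 3) * samples[i]
    return 3 * h / 8 * total
```

Simpson's 3/8 rule is exact for cubic polynomials, so integrating x^3 over [0, 3] from seven samples returns 20.25, while the left Riemann sum on the same samples gives 14.0625.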
Example Groundwater-Level Datasets and Benchmarking Results for the...
catalog.data.gov
data.usgs.gov
+1more
Updated Oct 13, 2024
+ more versions
Cite
U.S. Geological Survey (2024). Example Groundwater-Level Datasets and Benchmarking Results for the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) Software Package [Dataset]. https://catalog.data.gov/dataset/example-groundwater-level-datasets-and-benchmarking-results-for-the-automated-regional-cor
Dataset updated
Oct 13, 2024
Dataset provided by
U.S. Geological Survey
Description
This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY) and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012). The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein, referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR) defined as the proportion of holdout observations falling within the estimated PI. 
The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one. This data release contains ten tables formatted as tab-delimited text files. The “CA_data.txt” and “NY_data.txt” tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for CA and NY datasets, respectively. The “CA_sites.txt” and “NY_sites.txt” tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The “CA_NRMSE.txt” and “NY_NRMSE.txt” tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using CA and NY datasets, respectively. The “CA_CR.txt” and “NY_CR.txt” tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The “CA_p_per_n.txt” and “NY_p_per_n.txt” tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models compared to training error for the same models on the entire CA and NY datasets, respectively. References Cited Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE. Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118. 
https://doi.org/10.1093/bioinformatics/btr597.
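The two accuracy measures defined in this description, NRMSE and the coverage rate (CR), reduce to a few lines of arithmetic. The sketch below is not taken from the ARCHI package. The function names are hypothetical, and it assumes the sample standard deviation (n - 1 denominator) and inclusive interval bounds, neither of which the description specifies.

```python
import math
import statistics


def nrmse(observed, imputed):
    """Root mean square error divided by the standard deviation of the
    observed holdout values (sample standard deviation assumed)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, imputed)]
    rmse = math.sqrt(sum(squared_errors) / len(observed))
    return rmse / statistics.stdev(observed)


def coverage_rate(observed, lower, upper):
    """Proportion of holdout observations inside their prediction interval
    (bounds treated as inclusive, an assumption)."""
    inside = sum(lo <= o <= hi for o, lo, hi in zip(observed, lower, upper))
    return inside / len(observed)
```

An NRMSE below 1 means the imputation error is smaller than the natural spread of the held-out values, and a well-calibrated 95 percent prediction interval should give a CR near 0.95.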
Data sets of the study.
plos.figshare.com
xls
Updated May 31, 2023
+ more versions
Cite
Shouxi Zhu; Hongbin Gu (2023). Data sets of the study. [Dataset]. http://doi.org/10.1371/journal.pone.0283577.s001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0283577.s001
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Shouxi Zhu; Hongbin Gu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: This study aimed to explore the adverse influences of mobile phone usage on pilots’ status, so as to improve flight safety. Methods: A questionnaire was designed, and a cluster random sampling method was adopted. Pilots of Shandong Airlines were investigated on the use of mobile phones. The data was analyzed by frequency statistics, linear regression and other statistical methods. Results: A total of 340 questionnaires were distributed and 317 were returned, 315 of which were valid. The results showed that 239 pilots (75.87%) used mobile phones as the main means of entertainment in their leisure time. There was a significant negative correlation between age of pilots and playing mobile games (p
Simulated data set for reproduction of the MGIDI index - High correlation
narcis.nl
data.mendeley.com
Updated Oct 19, 2020
Cite
Olivoto, T (via Mendeley Data) (2020). Simulated data set for reproduction of the MGIDI index - High correlation [Dataset]. http://doi.org/10.17632/vzzkmrkrrr.1
Unique identifier
https://doi.org/10.17632/vzzkmrkrrr.1
Dataset updated
Oct 19, 2020
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Olivoto, T (via Mendeley Data)
Description
This is a simulated data set containing 1000 genotypes and 25 highly correlated traits, to be used in the Monte Carlo simulation of the draft paper "MGIDI: towards an effective multivariate selection in biological experiments" by Tiago Olivoto and Maicon Nardino
Replication Data for:"Real-World Considerations for Deep Learning in...
dataverse.harvard.edu
Updated Jan 28, 2019
Cite
Kürşat Tekbıyık; Özkan Akbunar; Ali Rıza Ekti; Ali Görçin (2019). Replication Data for:"Real-World Considerations for Deep Learning in Wireless Signal Identification Based on Spectral Correlation Function" [Dataset]. http://doi.org/10.7910/DVN/KNEEVY
Unique identifier
https://doi.org/10.7910/DVN/KNEEVY
Dataset updated
Jan 28, 2019
Dataset provided by
Harvard Dataverse
Authors
Kürşat Tekbıyık; Özkan Akbunar; Ali Rıza Ekti; Ali Görçin
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
World
Description
The dataset includes spectral correlation function (SCF) estimates computed by the FFT accumulation method (FAM) for a total of 4500 signals with 20000 I/Q samples. The signals belong to three different cellular communication standards: GSM, WCDMA, and LTE. The signals were received from different channels with multipath, fading, and noise. The dataset can be used to validate a classifier model designed to identify cellular communication signals. For each signal, the dimension of the SCF estimate is 8193*16. There are two train sets, which must be used together (concatenate train_data_wo_mapping1 and train_data_wo_mapping2). The two train sets contain 3000 signals in total, and the test set contains 1500.
The labels of the cellular communication standards are given in the dataset as follows:
WCDMA -> 0
LTE -> 1
GSM -> 2
The dataset includes:
1. SCFDatatrain1.mat
2. SCFDatatrain2.mat
3. SCFDatatest.mat
The contents of the .mat files:
train_class: class labels of the train set; its dimension is 3000*1 double
train_data_wo_mapping1: the first half of the training data; its dimension is 1500*1 cell
train_data_wo_mapping2: the second half of the training data; its dimension is 1500*1 cell
Note: concatenate the two cells given above (i.e., [train_data_wo_mapping1; train_data_wo_mapping2])
test_class: class labels of the test set; its dimension is 1500*1 double
test_data_without_mapping: the test data; its dimension is 1500*1 cell
Each cell contains 1500 SCF estimates (8193*16). The dataset has been used for the paper "Real-World Considerations for Deep Learning in Wireless Signal Identification Based on Spectral Correlation Function", submitted for possible publication in IEEE Wireless Communication Letters. Please cite this paper if you use the dataset.
Results of the Pearson correlation analysis between training load variables,...
plos.figshare.com
xls
Updated May 30, 2023
Cite
Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs (2023). Results of the Pearson correlation analysis between training load variables, starting fitness and end fitness. [Dataset]. http://doi.org/10.1371/journal.pone.0211776.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0211776.t003
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Results of the Pearson correlation analysis between training load variables, starting fitness and end fitness.
CI-MNIST Dataset
paperswithcode.com
Updated Mar 23, 2022
+ more versions
Cite
Charan Reddy; Soroush Mehri; Deepak Sharma; Samira Shabanian; Sina Honari (2022). CI-MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/ci-mnist
Dataset updated
Mar 23, 2022
Authors
Charan Reddy; Soroush Mehri; Deepak Sharma; Samira Shabanian; Sina Honari
Description
CI-MNIST (Correlated and Imbalanced MNIST) is a variant of the MNIST dataset that introduces different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, given that $x$ is even or odd. The dataset defines the background colors as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed to evaluate bias-mitigation approaches in challenging setups and to allow control over different dataset configurations.
Data From: The 1014F knockdown resistance mutation is not a strong correlate...
catalog.data.gov
datasets.ai
+2more
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Data From: The 1014F knockdown resistance mutation is not a strong correlate of phenotypic resistance to pyrethroids in Florida populations of Culex quinquefasciatus [Dataset]. https://catalog.data.gov/dataset/data-from-the-1014f-knockdown-resistance-mutation-is-not-a-strong-correlate-of-phenotypic--78b35
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
Culex quinquefasciatus is an important target for vector control because of its ability to transmit pathogens that cause disease. Most populations are resistant to pyrethroids and often to organophosphates, the two most common classes of active ingredients used by public health agencies. A knockdown resistance (kdr) mutation, resulting in a change from a leucine to phenylalanine in the voltage gated sodium channel, is one mechanism contributing to the pyrethroid resistant phenotype. Enzymatic resistance has also been shown to play a very important role. Recent studies have shown strong resistance in populations even when kdr is relatively low which indicates factors other than kdr may be larger contributors to resistance. In this study, we examined on a statewide scale (over 70 populations), the strength of the correlation between resistance in the CDC bottle bioassay and the kdr genotypes and allele frequencies. Spearman correlation analysis showed only moderate (-0.51) and weak (-0.29) correlation between the kdr genotype and permethrin and deltamethrin respectively. The frequency of the kdr allele was an even weaker correlate. These results indicate, in contrast to Aedes aegypti, assessing kdr in populations of Culex quinquefasciatus is not a good surrogate for phenotypic resistance testing.
Processing and Pearson Correlation Scripts for the C&RL Article on the...
databank.illinois.edu
Cite
William Mischo; Mary C. Schlembach, Processing and Pearson Correlation Scripts for the C&RL Article on the Relationships between Publication, Citation, and Usage Metrics at the University of Illinois at Urbana-Champaign Library [Dataset]. http://doi.org/10.13012/B2IDB-0931140_V1
Unique identifier
https://doi.org/10.13012/B2IDB-0931140_V1
Authors
William Mischo; Mary C. Schlembach
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Illinois
Description
These processing and Pearson correlational scripts were developed to support the study that examined the correlational relationships between local journal authorship, local and external citation counts, full-text downloads, link-resolver clicks, and four global journal impact factor indices within an all-disciplines journal collection of 12,200 titles and six subject subsets at the University of Illinois at Urbana-Champaign (UIUC) Library. This study shows strong correlations in the all-disciplines set and most subject subsets. Special processing scripts and web site dashboards were created, including Pearson correlational analysis scripts for reading values from relational databases and displaying tabular results. The raw data used in this analysis, in the form of relational database tables with multiple columns, is available at https://doi.org/10.13012/B2IDB-6810203_V1.
Data set of correlations between stocks world wide
explore.openaire.eu
Updated Mar 6, 2022
Cite
Burkard (2022). Data set of correlations between stocks world wide [Dataset]. http://doi.org/10.5281/zenodo.6331463
Unique identifier
https://doi.org/10.5281/zenodo.6331463
Dataset updated
Mar 6, 2022
Authors
Burkard
Area covered
World
Description
This data set contains intraday (1 hour format) correlations for one month (December 2021) from more than 2000 Stocks, Indices, Forex and Futures of major Stock exchanges world wide. It is an example of the outcome from data processing inside Infore project. The data set contains more than 2 million files.
Data sets to demonstrate the inappropriate use of correlation coefficient in...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Rafdzah Zaki; Awang Bulgiba; Roshidi Ismail; Noor Azina Ismail (2023). Data sets to demonstrate the inappropriate use of correlation coefficient in testing agreement. [Dataset]. http://doi.org/10.1371/journal.pone.0037908.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0037908.t004
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Rafdzah Zaki; Awang Bulgiba; Roshidi Ismail; Noor Azina Ismail
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data sets to demonstrate the inappropriate use of correlation coefficient in testing agreement.
Testing for correlation in error‐component models (replication data)
journaldata.zbw.eu
txt
Updated Dec 7, 2022
Cite
Koen Jochmans (2022). Testing for correlation in error‐component models (replication data) [Dataset]. http://doi.org/10.15456/jae.2022327.0715375800
Available download formats: txt(23401432), txt(3356)
Unique identifier
https://doi.org/10.15456/jae.2022327.0715375800
Dataset updated
Dec 7, 2022
Dataset provided by
ZBW - Leibniz Informationszentrum Wirtschaft
Authors
Koen Jochmans
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper concerns linear models for grouped data with group-specific effects. We construct a portmanteau test for the null of no within-group correlation beyond that induced by the group-specific effect. The approach allows for heteroskedasticity and is applicable to models with exogenous, predetermined, or endogenous regressors. The test can be implemented as soon as three observations per group are available and is applicable to unbalanced data. A test with such general applicability is not available elsewhere. We provide theoretical results on size and power under asymptotics where the number of groups grows but their size is held fixed. Extensive power comparisons with other tests available in the literature for special cases of our setup reveal that our test compares favorably. In a simulation study we find that, under heteroskedasticity, only our procedure yields a test that is both size correct and powerful. In a large data set on mothers with multiple births we find that infant birthweight is correlated across children even after controlling for mother fixed effects and a variety of prenatal care factors. This suggests that such a strategy may be inadequate to take care of all confounding factors that correlate with the mother's decision to engage in activities that are detrimental to the infant's health, such as smoking.
Data from: WiBB: An integrated method for quantifying the relative...
data.niaid.nih.gov
datadryad.org
zip
Updated Aug 20, 2021
Cite
Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.xsj3tx9g1
Dataset updated
Aug 20, 2021
Dataset provided by
Beijing Normal University
Field Museum of Natural History
Authors
Qin Li; Xiaojun Kou
License
https://spdx.org/licenses/CC0-1.0.html
Description
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and a bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of weight (SWi) and standardized beta (ß*), to compare their performance with the WiBB method in ranking predictor importance under various scenarios. We also applied it to an empirical dataset in the plant genus Mimulus to select bioclimatic predictors of species’ presence across the landscape. Results in the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB in the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculation of the new metric over more sophisticated statistical procedures makes it a handy method in the statistical toolbox.

Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.’s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to be zero. We simulated datasets with three levels of differences of correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of three preset correlation structures (600 datasets in total), for LM fitting later. For GLM fitting, we modified the simulation procedures with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, relative sum of weight (SWi) and standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species consists of their occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors for their geographical distributions.
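The idea of simulating a response with preset correlations to four predictors can be sketched in Python. This is not the authors' R code (which relies on rmvnorm from mvtnorm); the function names are hypothetical. The sketch assumes mutually independent standard-normal predictors, a simplification the description does not state. Under that assumption the squared target correlations must sum to at most 1, so the sketch covers (0.3, 0.2, 0.1, 0.0) and (0.6, 0.4, 0.2, 0.0) but not (0.8, 0.6, 0.3, 0.0), which would need correlated predictors and a full correlation matrix.

```python
import math
import random


def simulate(correlations, n, rng):
    """Draw n rows (y, x1, ..., xk) where corr(y, xi) equals correlations[i].

    Assumes independent standard-normal predictors, so y is built as a
    weighted sum of the predictors plus independent noise with unit total
    variance.
    """
    residual = 1.0 - sum(r * r for r in correlations)
    if residual < 0:
        raise ValueError("targets not attainable with independent predictors")
    rows = []
    for _ in range(n):
        xs = [rng.gauss(0.0, 1.0) for _ in correlations]
        noise = math.sqrt(residual) * rng.gauss(0.0, 1.0)
        y = sum(r * x for r, x in zip(correlations, xs)) + noise
        rows.append((y, *xs))
    return rows


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Calling simulate((0.6, 0.4, 0.2, 0.0), 1000, random.Random(1)) yields one dataset of the size used in the study; the sample correlations will scatter around the targets, which is why the study repeats the simulation 200 times per structure.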
Data from: Data Set S7 - Tie points for correlation of IODP Site 342-U1408...
doi.pangaea.de
html, tsv
Updated Sep 2, 2019
Carlotta Cappelli; Paul R Bown; Thomas Westerhold; Yuhji Yamamoto; Claudia Agnini; Steven M Bohaty; Martina de Riu; Veronica Lobba (2019). Data Set S7 - Tie points for correlation of IODP Site 342-U1408 to IODP Site 342-U1410 [Dataset]. http://doi.org/10.1594/PANGAEA.905422
Explore at:
Available download formats: tsv, html
Unique identifier
https://doi.org/10.1594/PANGAEA.905422
Dataset updated
Sep 2, 2019
Dataset provided by
PANGAEA
Authors
Carlotta Cappelli; Paul R Bown; Thomas Westerhold; Yuhji Yamamoto; Claudia Agnini; Steven M Bohaty; Martina de Riu; Veronica Lobba
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Variables measured
Tie point; Sample code/label; DEPTH, sediment/rock; Depth, composite revised, corrected
Description
This dataset is about: Data Set S7 - Tie points for correlation of IODP Site 342-U1408 to IODP Site 342-U1410. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.905432 for more information.
Data from: DATASET FOR: A multimodal spectroscopic approach combining...
zenodo.org
bin, csv, zip
Updated Aug 2, 2024
David Perez Guaita (2024). DATASET FOR: A multimodal spectroscopic approach combining mid-infrared and near-infrared for discriminating Gram-positive and Gram-negative bacteria [Dataset]. http://doi.org/10.5281/zenodo.10523185
Explore at:
Available download formats: bin, zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.10523185
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
David Perez Guaita
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset comprises a comprehensive set of files designed for the analysis and 2D correlation of spectral data, specifically focusing on ATR and NIR spectra. It includes MATLAB scripts and supporting functions necessary to replicate the analysis, as well as the raw datasets used in the study. Below is a detailed description of the included files:

Data Analysis:

File Name: Data_Analysis.mlx

Description: This MATLAB Live Script file contains the main script used for the classification analysis of the spectral data. It includes steps for preprocessing, analysis, and visualization of the ATR and NIR spectra.

2D Correlation Data Analysis:

File Name: Data_Analysis_2Dcorr.mlx

Description: This MATLAB Live Script file is similar to the primary analysis script but is specifically tailored for performing 2D correlation analysis on the spectral data. It includes detailed steps and code for executing the 2D correlation.

Functions:

Folder Name: Functions

Description: This folder contains all the necessary MATLAB function files required to replicate the analyses presented in the scripts. These functions handle various preprocessing steps, calculations, and visualizations.

Datasets:

File Names: ATR_dataset.xlsx, NIR_dataset.xlsx, Reference_data.csv

Description: These files contain the raw spectral data for the ATR and NIR analyses, as well as the reference data. The Excel files include multiple sheets with detailed measurements and metadata.

Usage Notes:

Software Requirements:

MATLAB is required to run the .mlx files and utilize the functions.

PLS_Toolbox: Necessary for certain preprocessing and analysis steps.

MIDAS 2010: Required for the 2D correlation analysis.

Replication: Users can replicate the analyses by running the Data_Analysis.mlx and Data_Analysis_2Dcorr.mlx scripts in MATLAB, ensuring that the Functions folder is in the MATLAB path.

Data Handling: The datasets are provided in .xlsx and .csv formats, which can be easily imported into MATLAB or other data analysis software.
GLO climate data stats summary
data.gov.au
cloud.csiss.gmu.edu
+2 more
zip
Updated Apr 13, 2022
Bioregional Assessment Program (2022). GLO climate data stats summary [Dataset]. https://data.gov.au/data/dataset/afed85e0-7819-493d-a847-ec00a318e657
Explore at:
Available download formats: zip (8810)
Dataset updated
Apr 13, 2022
Dataset authored and provided by
Bioregional Assessment Program
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Abstract

The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

Summary of various climate variables for all 15 subregions, based on Bureau of Meteorology Australian Water Availability Project (BAWAP) climate grids, including:

Time series of mean annual BAWAP rainfall from 1900 to 2012.

Long-term average BAWAP rainfall and Penman Potential Evapotranspiration (PET) from Jan 1981 to Dec 2012 for each month.

Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P (precipitation); (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net radiation); and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009).

As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
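The seven summary values listed above, (a) to (g), can be illustrated with a short Python sketch for a single variable and time period. This is a minimal sketch under stated assumptions: the description does not say how the trend or the standard deviation was computed, so the sketch assumes a least-squares linear slope per year and the sample standard deviation. The function name `summarise_series` and the rainfall values are hypothetical.

```python
import numpy as np

def summarise_series(years, values):
    """Return the seven summary statistics for one variable and one time period."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    average = values.mean()
    stddev = values.std(ddof=1)  # assumption: sample standard deviation
    # Assumption: trend is the least-squares slope (units per year).
    # Years are centred first to keep the fit well conditioned.
    trend = np.polyfit(years - years.mean(), values, 1)[0]
    return {
        "average": average,
        "maximum": values.max(),
        "minimum": values.min(),
        "average_plus_stddev": average + stddev,
        "average_minus_stddev": average - stddev,
        "stddev": stddev,
        "trend": trend,
    }

# Synthetic annual series for 1981 to 2012 (not BAWAP data).
years = np.arange(1981, 2013)
rainfall = 500.0 + 2.0 * (years - 1981)
summary = summarise_series(years, rainfall)
print(summary)
```

Applying the function to each of the 17 time periods for each of the 8 variables would give one row of the climatology table per combination.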

There are 4 csv files here:

BAWAP_P_annual_BA_SYB_GLO.csv

Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.

Source data: annual BILO rainfall

P_PET_monthly_BA_SYB_GLO.csv

Desc: Long-term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month.

Climatology_Trend_BA_SYB_GLO.csv

Desc: Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv

Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).

Dataset History

The dataset was created from various BAWAP source data, including monthly BAWAP rainfall, Tmax, Tmin, VPD, etc., and from other source data, including monthly Penman PET and correlation coefficient data. Data were extracted from national datasets for the GLO subregion.


Dataset Citation

Bioregional Assessment Programme (2014) GLO climate data stats summary. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/afed85e0-7819-493d-a847-ec00a318e657.

Dataset Ancestors

Derived From Natural Resource Management (NRM) Regions 2010

Derived From Bioregional Assessment areas v03

Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012

Derived From Bioregional Assessment areas v01

Derived From Bioregional Assessment areas v02

Derived From GEODATA TOPO 250K Series 3

Derived From NSW Catchment Management Authority Boundaries 20130917

Derived From Geological Provinces - Full Extent

Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)

