100+ datasets found
  1. Column Correlation Data Dataset

    • paperswithcode.com
    Updated Sep 12, 2023
    Cite
    Immanuel Trummer (2023). Column Correlation Data Dataset [Dataset]. https://paperswithcode.com/dataset/column-correlation-data
    Dataset updated
    Sep 12, 2023
    Authors
    Immanuel Trummer
    Description

    Contains correlation data for 119,384 column pairs, taken from 3,952 data sets, including Pearson correlation, Spearman correlation, and Theil's U. This data can be used, e.g., for approaches that predict column correlation based on column properties, including column names.
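A minimal sketch of the three measures named in this description, using only the Python standard library. The function names are illustrative and are not taken from the dataset or its accompanying paper; Theil's U is computed here as the uncertainty coefficient (H(x) - H(x|y)) / H(x), and returning 1.0 for a constant column is a convention assumed for this sketch.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def theils_u(x, y):
    """Theil's U of x given y: (H(x) - H(x|y)) / H(x), a value in [0, 1]."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # assumed convention: a constant column is fully determined
    n = len(x)
    h_x_given_y = 0.0
    for y_value, count in Counter(y).items():
        subset = [a for a, b in zip(x, y) if b == y_value]
        h_x_given_y += (count / n) * entropy(subset)
    return (h_x - h_x_given_y) / h_x
```

Unlike the two correlation coefficients, Theil's U is asymmetric: `theils_u(x, y)` and `theils_u(y, x)` can differ, which is why it suits categorical column pairs where one column may determine the other but not the reverse.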

  2. Evaluating Correlation Between Measurement Samples in Reverberation Chambers...

    • data.nist.gov
    • datasets.ai
    • +1 more
    Updated Apr 6, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering [Dataset]. http://doi.org/10.18434/mds2-2986
    Dataset updated
    Apr 6, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    Evaluating Correlation Between Measurement Samples in Reverberation Chambers Using Clustering Abstract: Traditionally, in reverberation chambers (RC) measurement autocorrelation or correlation-matrix methods have been applied to evaluate measurement correlation. In this article, we introduce the use of clustering based on correlative distance to group correlated measurements. We apply the method to measurements taken in an RC using one and two paddles to stir the electromagnetic fields and applying decreasing angular steps between consecutive paddles positions. The results using varying correlation threshold values demonstrate that the method calculates the number of effective samples and allows discerning outliers, i.e., uncorrelated measurements, and clusters of correlated measurements. This calculation method, if verified, will allow non-sequential stir sequence design and, thereby, reduce testing time. Keywords: Correlation, Pearson correlation coefficient (PCC), reverberation chambers (RC), mode-stirring samples, correlative distance, clustering analysis, adjacency matrix.
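The abstract does not spell out the clustering procedure, so the following is only a rough sketch of one way to group correlated measurements from a correlation matrix. It assumes that two measurements count as correlated when the absolute value of their Pearson coefficient reaches a chosen threshold, builds the corresponding adjacency matrix, and takes connected components as clusters. The function name and the thresholding rule are assumptions for illustration, not the article's method.

```python
def correlation_clusters(corr, threshold):
    """Group measurement indices whose |correlation| meets the threshold.

    corr is a symmetric matrix (list of lists) of pairwise Pearson coefficients.
    Returns clusters as sorted lists; singletons are uncorrelated measurements.
    """
    n = len(corr)
    # Adjacency matrix: True where two distinct measurements count as correlated.
    adjacent = [[i != j and abs(corr[i][j]) >= threshold for j in range(n)]
                for i in range(n)]
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if adjacent[node][other] and other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(component))
    return clusters
```

Under these assumptions, singleton clusters correspond to outliers (uncorrelated measurements), and the number of clusters could serve as a rough count of effective samples, though the article's exact definition may differ.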

  3. D-CCA: A Decomposition-Based Canonical Correlation Analysis for...

    • tandf.figshare.com
    zip
    Updated Feb 9, 2024
    Cite
    Hai Shu; Xiao Wang; Hongtu Zhu (2024). D-CCA: A Decomposition-Based Canonical Correlation Analysis for High-Dimensional Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7461734.v2
    Available download formats: zip
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Hai Shu; Xiao Wang; Hongtu Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A typical approach to the joint analysis of two high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within a single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider the more necessary orthogonal relationship between the two distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the ℓ2 space of random variables rather than the conventionally used Euclidean space, with a careful construction of the orthogonal relationship between distinctive matrices. D-CCA represents a natural generalization of the traditional canonical correlation analysis. The proposed estimators of common and distinctive matrices are shown to be consistent and have reasonably better performance than some state-of-the-art methods in both simulated data and the real data analysis of breast cancer data obtained from The Cancer Genome Atlas. Supplementary materials for this article are available online.

  4. Music & Affect 2020 Dataset Study 1.csv

    • psycharchives.org
    Updated Sep 17, 2020
    + more versions
    Cite
    (2020). Music & Affect 2020 Dataset Study 1.csv [Dataset]. https://www.psycharchives.org/handle/20.500.12034/3089
    Dataset updated
    Sep 17, 2020
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset for: Leipold, B. & Loepthien, T. (2021). Attentive and emotional listening to music: The role of positive and negative affect. Jahrbuch Musikpsychologie, 30. https://doi.org/10.5964/jbdgm.78 In a cross-sectional study, associations of global affect with two ways of listening to music, attentive–analytical listening (AL) and emotional listening (EL), were examined. More specifically, the degrees to which AL and EL are differentially correlated with positive and negative affect were examined. In Study 1, a sample of 1,291 individuals responded to questionnaires on listening to music, positive affect (PA), and negative affect (NA). We used the PANAS, which measures PA and NA as high-arousal dimensions. AL was positively correlated with PA, EL with NA. Moderation analyses showed stronger associations between PA and AL when NA was low. Study 2 (499 participants) differentiated between three facets of affect and focused, in addition to PA and NA, on the role of relaxation. Similar to the findings of Study 1, AL was correlated with PA, EL with NA and PA. Moderation analyses indicated that the degree to which PA is associated with an individual's tendency to listen to music attentively depends on their degree of relaxation. In addition, the correlation between pleasant activation and EL was stronger for individuals who were more relaxed; for individuals who were less relaxed, the correlation between unpleasant activation and EL was stronger. In sum, the results demonstrate not only simple bivariate correlations, but also that the expected associations vary, depending on the different affective states. We argue that the results reflect a dual function of listening to music, which includes emotional regulation and information processing. (Dataset Study 1)

  5. Data from: Correlation matrices.

    • plos.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Bálint Maczák; Gergely Vadai; András Dér; István Szendi; Zoltán Gingl (2023). Correlation matrices. [Dataset]. http://doi.org/10.1371/journal.pone.0261718.s001
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Bálint Maczák; Gergely Vadai; András Dér; István Szendi; Zoltán Gingl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our analyses are based on 148×148 time- and frequency-domain correlation matrices. A correlation matrix covers all the possible use cases of every activity metric listed in the article. With these activity metrics and different preprocessing methods, we were able to calculate 148 different activity signals from multiple datasets of a single measurement. Each cell of a correlation matrix contains the mean and standard deviation of the calculated Pearson’s correlation coefficients between two types of activity signals based on 42 different subjects’ 10-days-long motion. The small correlation matrices presented both in the article and in the appendixes are derived from these 148 × 148 correlation matrices. This published Excel workbook contains multiple sheets labelled according to their content. The mean and standard deviation values for both time- and frequency-domain correlations can be found on their own separate sheet. Moreover, we reproduced the correlation matrix with an alternatively parametrized digital filter, which doubled the number of sheets to 8. In the Excel workbook, we used the same notation for both the datasets and activity metrics as presented in this article with an extension to the PIM metric: PIMs denotes the PIM metric where we used Simpson’s 3/8 rule integration method, PIMr indicates the PIM metric where we calculated the integral by simple numerical integration (Riemann sum). (XLSX)

  6. Example Groundwater-Level Datasets and Benchmarking Results for the...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Oct 13, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Example Groundwater-Level Datasets and Benchmarking Results for the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) Software Package [Dataset]. https://catalog.data.gov/dataset/example-groundwater-level-datasets-and-benchmarking-results-for-the-automated-regional-cor
    Explore at:
    Dataset updated
    Oct 13, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY) and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012). The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein, referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR) defined as the proportion of holdout observations falling within the estimated PI. 
The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one. This data release contains ten tables formatted as tab-delimited text files. The “CA_data.txt” and “NY_data.txt” tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for CA and NY datasets, respectively. The “CA_sites.txt” and “NY_sites.txt” tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The “CA_NRMSE.txt” and “NY_NRMSE.txt” tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using CA and NY datasets, respectively. The “CA_CR.txt” and “NY_CR.txt” tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The “CA_p_per_n.txt” and “NY_p_per_n.txt” tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models compared to training error for the same models on the entire CA and NY datasets, respectively. References Cited Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE. Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118. 
https://doi.org/10.1093/bioinformatics/btr597.
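The two accuracy measures defined in this description can be sketched in a few lines of Python. This is an illustration of the stated definitions only, not code from the ARCHI package (which runs in R); the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the description does not say which variant is used.

```python
import math
import statistics


def nrmse(observed, imputed):
    """RMSE of imputed vs. observed holdouts, divided by the observed std. dev."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, imputed)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    # Assumption: sample standard deviation (n - 1 denominator).
    return rmse / statistics.stdev(observed)


def coverage_rate(observed, lower, upper):
    """Proportion of holdout observations inside their prediction interval."""
    inside = sum(lo <= o <= hi for o, lo, hi in zip(observed, lower, upper))
    return inside / len(observed)
```

An NRMSE near 0 means the imputed values track the holdouts closely, while a value near 1 means the error is about as large as the natural spread of the observations. A coverage rate close to the nominal level of the prediction interval suggests the intervals are well calibrated.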

  7. Data sets of the study.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    + more versions
    Cite
    Shouxi Zhu; Hongbin Gu (2023). Data sets of the study. [Dataset]. http://doi.org/10.1371/journal.pone.0283577.s001
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shouxi Zhu; Hongbin Gu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: This study aimed to explore the adverse influences of mobile phone usage on pilots’ status, so as to improve flight safety. Methods: A questionnaire was designed, and a cluster random sampling method was adopted. Pilots of Shandong Airlines were investigated on the use of mobile phones. The data was analyzed by frequency statistics, linear regression and other statistical methods. Results: A total of 340 questionnaires were distributed and 317 were returned, 315 of which were valid. The results showed that 239 pilots (75.87%) used mobile phones as the main means of entertainment in their leisure time. There was a significant negative correlation between age of pilots and playing mobile games (p

  8. Simulated data set for reproduction of the MGIDI index - High correlation

    • narcis.nl
    • data.mendeley.com
    Updated Oct 19, 2020
    Cite
    Olivoto, T (via Mendeley Data) (2020). Simulated data set for reproduction of the MGIDI index - High correlation [Dataset]. http://doi.org/10.17632/vzzkmrkrrr.1
    Explore at:
    Dataset updated
    Oct 19, 2020
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Olivoto, T (via Mendeley Data)
    Description

    This is a simulated data set containing 1000 genotypes and 25 highly correlated traits, to be used in the Monte Carlo simulation of the draft paper "MGIDI: towards an effective multivariate selection in biological experiments" by Tiago Olivoto and Maicon Nardino

  9. Replication Data for:"Real-World Considerations for Deep Learning in...

    • dataverse.harvard.edu
    Updated Jan 28, 2019
    Cite
    Kürşat Tekbıyık; Özkan Akbunar; Ali Rıza Ekti; Ali Görçin (2019). Replication Data for:"Real-World Considerations for Deep Learning in Wireless Signal Identification Based on Spectral Correlation Function" [Dataset]. http://doi.org/10.7910/DVN/KNEEVY
    Dataset updated
    Jan 28, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Kürşat Tekbıyık; Özkan Akbunar; Ali Rıza Ekti; Ali Görçin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    World
    Description

    The dataset includes spectral correlation function (SCF) estimates computed with the FFT accumulation method (FAM) for a total of 4500 signals with 20000 I/Q samples. The signals belong to three different cellular communication standards: GSM, WCDMA, and LTE. The signals were received over different channels with multipath, fading, and noise. The dataset can be used to validate a classifier model designed to identify cellular communication signals. For each signal, the dimension of the SCF estimate is 8193*16. There are two train sets, which must be used together (concatenate train_data_wo_mapping1 and train_data_wo_mapping2). The two train sets have 3000 signals in total, and the test set has 1500.

    The labels of the cellular communication standards are given in the dataset as follows:
    WCDMA -> 0
    LTE -> 1
    GSM -> 2

    The dataset includes:
    1. SCFDatatrain1.mat
    2. SCFDatatrain2.mat
    3. SCFDatatest.mat

    The contents of the .mat files:
    train_class: class labels of the train set; its dimension is 3000*1 double
    train_data_wo_mapping1: the first half of the training data; its dimension is 1500*1 cell
    train_data_wo_mapping2: the second half of the training data; its dimension is 1500*1 cell
    Note: concatenate the two cells given above (i.e., [train_data_wo_mapping1; train_data_wo_mapping2])
    test_class: class labels of the test set; its dimension is 1500*1 double
    test_data_without_mapping: the test data; its dimension is 1500*1 cell

    Each cell contains 1500 SCF estimates (8193*16). The dataset has been used for the paper "Real-World Considerations for Deep Learning in Wireless Signal Identification Based on Spectral Correlation Function" submitted for possible publication in IEEE Wireless Communication Letters. Please cite this paper if you use the dataset.

  10. Results of the Pearson correlation analysis between training load variables,...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs (2023). Results of the Pearson correlation analysis between training load variables, starting fitness and end fitness. [Dataset]. http://doi.org/10.1371/journal.pone.0211776.t003
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of the Pearson correlation analysis between training load variables, starting fitness and end fitness.

  11. CI-MNIST Dataset

    • paperswithcode.com
    Updated Mar 23, 2022
    + more versions
    Cite
    Charan Reddy; Soroush Mehri; Deepak Sharma; Samira Shabanian; Sina Honari (2022). CI-MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/ci-mnist
    Dataset updated
    Mar 23, 2022
    Authors
    Charan Reddy; Soroush Mehri; Deepak Sharma; Samira Shabanian; Sina Honari
    Description

    CI-MNIST (Correlated and Imbalanced MNIST) is a variant of the MNIST dataset that introduces different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, given that $x$ is even or odd. The dataset defines the background colors as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed to evaluate bias-mitigation approaches in challenging setups and to allow control over different dataset configurations.

  12. Data From: The 1014F knockdown resistance mutation is not a strong correlate...

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data From: The 1014F knockdown resistance mutation is not a strong correlate of phenotypic resistance to pyrethroids in Florida populations of Culex quinquefasciatus [Dataset]. https://catalog.data.gov/dataset/data-from-the-1014f-knockdown-resistance-mutation-is-not-a-strong-correlate-of-phenotypic--78b35
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Culex quinquefasciatus is an important target for vector control because of its ability to transmit pathogens that cause disease. Most populations are resistant to pyrethroids and often to organophosphates, the two most common classes of active ingredients used by public health agencies. A knockdown resistance (kdr) mutation, resulting in a change from a leucine to phenylalanine in the voltage gated sodium channel, is one mechanism contributing to the pyrethroid resistant phenotype. Enzymatic resistance has also been shown to play a very important role. Recent studies have shown strong resistance in populations even when kdr is relatively low which indicates factors other than kdr may be larger contributors to resistance. In this study, we examined on a statewide scale (over 70 populations), the strength of the correlation between resistance in the CDC bottle bioassay and the kdr genotypes and allele frequencies. Spearman correlation analysis showed only moderate (-0.51) and weak (-0.29) correlation between the kdr genotype and permethrin and deltamethrin respectively. The frequency of the kdr allele was an even weaker correlate. These results indicate, in contrast to Aedes aegypti, assessing kdr in populations of Culex quinquefasciatus is not a good surrogate for phenotypic resistance testing.

  13. Processing and Pearson Correlation Scripts for the C&RL Article on the...

    • databank.illinois.edu
    Cite
    William Mischo; Mary C. Schlembach, Processing and Pearson Correlation Scripts for the C&RL Article on the Relationships between Publication, Citation, and Usage Metrics at the University of Illinois at Urbana-Champaign Library [Dataset]. http://doi.org/10.13012/B2IDB-0931140_V1
    Authors
    William Mischo; Mary C. Schlembach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Illinois
    Description

    These processing and Pearson correlational scripts were developed to support the study that examined the correlational relationships between local journal authorship, local and external citation counts, full-text downloads, link-resolver clicks, and four global journal impact factor indices within an all-disciplines journal collection of 12,200 titles and six subject subsets at the University of Illinois at Urbana-Champaign (UIUC) Library. This study shows strong correlations in the all-disciplines set and most subject subsets. Special processing scripts and web site dashboards were created, including Pearson correlational analysis scripts for reading values from relational databases and displaying tabular results. The raw data used in this analysis, in the form of relational database tables with multiple columns, is available at https://doi.org/10.13012/B2IDB-6810203_V1.

  14. Data set of correlations between stocks world wide

    • explore.openaire.eu
    Updated Mar 6, 2022
    Cite
    Burkard (2022). Data set of correlations between stocks world wide [Dataset]. http://doi.org/10.5281/zenodo.6331463
    Explore at:
    Dataset updated
    Mar 6, 2022
    Authors
    Burkard
    Area covered
    World
    Description

    This data set contains intraday (1 hour format) correlations for one month (December 2021) from more than 2000 Stocks, Indices, Forex and Futures of major Stock exchanges world wide. It is an example of the outcome from data processing inside Infore project. The data set contains more than 2 million files.

  15. Data sets to demonstrate the inappropriate use of correlation coefficient in...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Rafdzah Zaki; Awang Bulgiba; Roshidi Ismail; Noor Azina Ismail (2023). Data sets to demonstrate the inappropriate use of correlation coefficient in testing agreement. [Dataset]. http://doi.org/10.1371/journal.pone.0037908.t004
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rafdzah Zaki; Awang Bulgiba; Roshidi Ismail; Noor Azina Ismail
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sets to demonstrate the inappropriate use of correlation coefficient in testing agreement.

  16. Testing for correlation in error‐component models (replication data)

    • journaldata.zbw.eu
    txt
    Updated Dec 7, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Koen Jochmans (2022). Testing for correlation in error‐component models (replication data) [Dataset]. http://doi.org/10.15456/jae.2022327.0715375800
    Available download formats: txt (23401432), txt (3356)
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    Koen Jochmans
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper concerns linear models for grouped data with group-specific effects. We construct a portmanteau test for the null of no within-group correlation beyond that induced by the group-specific effect. The approach allows for heteroskedasticity and is applicable to models with exogenous, predetermined, or endogenous regressors. The test can be implemented as soon as three observations per group are available and is applicable to unbalanced data. A test with such general applicability is not available elsewhere. We provide theoretical results on size and power under asymptotics where the number of groups grows but their size is held fixed. Extensive power comparisons with other tests available in the literature for special cases of our setup reveal that our test compares favorably. In a simulation study we find that, under heteroskedasticity, only our procedure yields a test that is both size correct and powerful. In a large data set on mothers with multiple births we find that infant birthweight is correlated across children even after controlling for mother fixed effects and a variety of prenatal care factors. This suggests that such a strategy may be inadequate to take care of all confounding factors that correlate with the mother's decision to engage in activities that are detrimental to the infant's health, such as smoking.

  17. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Available download formats: zip
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Beijing Normal University
    Field Museum of Natural History
    Authors
    Qin Li; Xiaojun Kou
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and a bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of weight (SWi) and standardized beta (ß*), to compare their performance with the WiBB method in ranking predictor importance under various scenarios. We also applied WiBB to an empirical dataset in the plant genus Mimulus to select bioclimatic predictors of species’ presence across the landscape. Results in the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB in the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.’s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to be zero. We simulated datasets with three levels of differences of correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of three preset correlation structures (600 datasets in total), for LM fitting later. For GLM fitting, we modified the simulation procedures with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, relative sum of weight (SWi) and standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species consisted of occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors for their geographical distributions.

  18. Data from: Data Set S7 - Tie points for correlation of IODP Site 342-U1408...

    • doi.pangaea.de
    html, tsv
    Updated Sep 2, 2019
    Carlotta Cappelli; Paul R Bown; Thomas Westerhold; Yuhji Yamamoto; Claudia Agnini; Steven M Bohaty; Martina de Riu; Veronica Lobba (2019). Data Set S7 - Tie points for correlation of IODP Site 342-U1408 to IODP Site 342-U1410 [Dataset]. http://doi.org/10.1594/PANGAEA.905422
    Explore at:
    Available download formats: tsv, html
    Dataset updated
    Sep 2, 2019
    Dataset provided by
    PANGAEA
    Authors
    Carlotta Cappelli; Paul R Bown; Thomas Westerhold; Yuhji Yamamoto; Claudia Agnini; Steven M Bohaty; Martina de Riu; Veronica Lobba
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Variables measured
    Tie point, Sample code/label, DEPTH, sediment/rock, Depth, composite revised, corrected
    Description

    This dataset is about: Data Set S7 - Tie points for correlation of IODP Site 342-U1408 to IODP Site 342-U1410. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.905432 for more information.

  19. Data from: DATASET FOR: A multimodal spectroscopic approach combining...

    • zenodo.org
    bin, csv, zip
    Updated Aug 2, 2024
    David Perez Guaita (2024). DATASET FOR: A multimodal spectroscopic approach combining mid-infrared and near-infrared for discriminating Gram-positive and Gram-negative bacteria [Dataset]. http://doi.org/10.5281/zenodo.10523185
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Perez Guaita
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset comprises a comprehensive set of files designed for the analysis and 2D correlation of spectral data, specifically focusing on ATR and NIR spectra. It includes MATLAB scripts and supporting functions necessary to replicate the analysis, as well as the raw datasets used in the study. Below is a detailed description of the included files:

    1. Data Analysis:

      • File Name: Data_Analysis.mlx
      • Description: This MATLAB Live Script file contains the main script used for the classification analysis of the spectral data. It includes steps for preprocessing, analysis, and visualization of the ATR and NIR spectra.
    2. 2D Correlation Data Analysis:

      • File Name: Data_Analysis_2Dcorr.mlx
      • Description: This MATLAB Live Script file is similar to the primary analysis script but is specifically tailored for performing 2D correlation analysis on the spectral data. It includes detailed steps and code for executing the 2D correlation.
    3. Functions:

      • Folder Name: Functions
      • Description: This folder contains all the necessary MATLAB function files required to replicate the analyses presented in the scripts. These functions handle various preprocessing steps, calculations, and visualizations.
    4. Datasets:

      • File Names: ATR_dataset.xlsx, NIR_dataset.xlsx, Reference_data.csv
      • Description: These files contain the raw spectral data for the ATR and NIR analyses (Excel workbooks) and the reference data (CSV). Each Excel file includes multiple sheets with detailed measurements and metadata.

    Usage Notes:

    • Software Requirements:
      • MATLAB is required to run the .mlx files and utilize the functions.
      • PLS_Toolbox: Necessary for certain preprocessing and analysis steps.
      • MIDAS 2010: Required for the 2D correlation analysis.
    • Replication: Users can replicate the analyses by running the Data_Analysis.mlx and Data_Analysis_2Dcorr.mlx scripts in MATLAB, ensuring that the Functions folder is in the MATLAB path.
    • Data Handling: The datasets are provided in .xlsx format, which can be easily imported into MATLAB or other data analysis software.
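For readers without MATLAB, the core of a generalized 2D correlation analysis can be sketched in Python. The sketch uses the standard synchronous and asynchronous spectra computed with the Hilbert-Noda matrix and the mean spectrum as reference. It is not the code from Data_Analysis_2Dcorr.mlx or MIDAS 2010, and the function name and argument layout are assumptions. Passing two matrices measured over the same perturbation steps (for example ATR and NIR) gives a heterospectral map.

```python
import numpy as np


def two_d_correlation(spectra_a, spectra_b=None):
    """Return (synchronous, asynchronous) 2D correlation spectra.

    spectra_a: array of shape (m, n_a), m perturbation steps by n_a spectral variables.
    spectra_b: optional array of shape (m, n_b); defaults to spectra_a.
    """
    a = np.asarray(spectra_a, dtype=float)
    b = a if spectra_b is None else np.asarray(spectra_b, dtype=float)
    if a.shape[0] != b.shape[0]:
        raise ValueError("both matrices need the same number of perturbation steps")
    m = a.shape[0]

    # Dynamic spectra: subtract the mean spectrum (assumed reference).
    dyn_a = a - a.mean(axis=0)
    dyn_b = b - b.mean(axis=0)

    # Hilbert-Noda matrix: 0 on the diagonal, 1 / (pi * (k - j)) elsewhere.
    j, k = np.indices((m, m))
    diff = k - j
    noda = np.zeros((m, m))
    off_diag = diff != 0
    noda[off_diag] = 1.0 / (np.pi * diff[off_diag])

    synchronous = dyn_a.T @ dyn_b / (m - 1)
    asynchronous = dyn_a.T @ noda @ dyn_b / (m - 1)
    return synchronous, asynchronous
```

When a single matrix is passed, the synchronous map is symmetric and the asynchronous map is antisymmetric, which is a quick sanity check on any implementation.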
  20. GLO climate data stats summary

    • data.gov.au
    • cloud.csiss.gmu.edu
    • +2 more
    zip
    Updated Apr 13, 2022
    Bioregional Assessment Program (2022). GLO climate data stats summary [Dataset]. https://data.gov.au/data/dataset/afed85e0-7819-493d-a847-ec00a318e657
    Explore at:
    Available download formats: zip (8810)
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    Summary of various climate variables for all 15 subregions, based on Bureau of Meteorology Australian Water Availability Project (BAWAP) climate grids, including:

    1. Time series mean annual BAWAP rainfall from 1900 - 2012.

    2. Long-term average BAWAP rainfall and Penman Potential Evapotranspiration (PET) from Jan 1981 - Dec 2012 for each month.

    3. Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P (precipitation); (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net radiation); and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

    4. Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009).

    As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
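The seven statistics (a) to (g) listed in item 3 can be illustrated for a single variable and time period as below. This is a minimal sketch, not the Programme's processing code: it assumes the sample standard deviation and takes the trend to be the ordinary least-squares slope per year, neither of which is specified in the description. The function name and dictionary keys are hypothetical.

```python
import numpy as np


def summarize_series(years, values):
    """Return statistics (a)-(g) for one variable over one time period."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    average = values.mean()
    stddev = values.std(ddof=1)  # sample standard deviation (assumed)
    # Trend as the OLS slope in units per year (assumed definition);
    # centring the years keeps the fit well conditioned.
    trend = np.polyfit(years - years.mean(), values, 1)[0]
    return {
        "average": float(average),
        "maximum": float(values.max()),
        "minimum": float(values.min()),
        "average_plus_stddev": float(average + stddev),
        "average_minus_stddev": float(average - stddev),
        "stddev": float(stddev),
        "trend": float(trend),
    }


# Made-up annual values for 1981-2012, used only to exercise the function.
demo_years = np.arange(1981, 2013)
demo_values = 5.0 + 2.0 * (demo_years - 1981)
print(summarize_series(demo_years, demo_values))
```

Applying the function to each of the 17 time periods and 8 variables would yield a table of the kind described for the climatology file.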

    There are 4 csv files here:

    BAWAP_P_annual_BA_SYB_GLO.csv

    Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.

    Source data: annual BILO rainfall

    P_PET_monthly_BA_SYB_GLO.csv

    Long-term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month

    Climatology_Trend_BA_SYB_GLO.csv

    Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend

    Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv

    Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
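A seasonal correlation of the kind stored in the last file can be sketched as follows. The sketch computes a plain Pearson coefficient between seasonal means of rainfall and of one driver index, assigning each December to the following year's DJF season and dropping incomplete seasons. It is an illustration only and does not reproduce the methodology of Risbey et al. (2009); the function name and input layout are assumptions.

```python
from collections import defaultdict

import numpy as np

SEASON_OF_MONTH = {12: "DJF", 1: "DJF", 2: "DJF",
                   3: "MAM", 4: "MAM", 5: "MAM",
                   6: "JJA", 7: "JJA", 8: "JJA",
                   9: "SON", 10: "SON", 11: "SON"}


def seasonal_correlations(years, months, rainfall, driver):
    """Pearson r between seasonal-mean rainfall and a driver index, per season."""
    groups = defaultdict(list)
    for yr, mo, rain, drv in zip(years, months, rainfall, driver):
        season = SEASON_OF_MONTH[int(mo)]
        season_year = int(yr) + 1 if int(mo) == 12 else int(yr)
        groups[(season, season_year)].append((float(rain), float(drv)))

    per_season = defaultdict(list)
    for (season, _), vals in sorted(groups.items()):
        if len(vals) == 3:  # keep only complete three-month seasons
            per_season[season].append(np.mean(vals, axis=0))

    return {season: float(np.corrcoef(np.array(pairs).T)[0, 1])
            for season, pairs in per_season.items() if len(pairs) >= 3}
```

Calling the function once per driver (BLK, DMI, SAM, SOI) on monthly series for 1957-2006 would give one coefficient per driver and season, each between -1 and 1.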

    Dataset History

    Dataset was created from various BAWAP source data, including Monthly BAWAP rainfall, Tmax, Tmin, VPD, etc, and other source data including monthly Penman PET, Correlation coefficient data. Data were extracted from national datasets for the GLO subregion.

    Dataset Citation

    Bioregional Assessment Programme (2014) GLO climate data stats summary. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/afed85e0-7819-493d-a847-ec00a318e657.

    Dataset Ancestors

Immanuel Trummer (2023). Column Correlation Data Dataset [Dataset]. https://paperswithcode.com/dataset/column-correlation-data

Column Correlation Data Dataset

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Sep 12, 2023
Authors
Immanuel Trummer
Description

Contains correlation data for 119,384 column pairs, taken from 3,952 data sets, including Pearson correlation, Spearman correlation, and Theil's U. This data can be used, e.g., for approaches that predict column correlation based on column properties, including column names.