100+ datasets found
  1. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures for describing data. We ran an EDA to provide statistical facts and inform conclusions; the mined facts yield arguments that inform the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to recover information that was missing after the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models (a minimal sketch of this step is given after the stage descriptions).

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be to uncover hidden relationships among the extracted features (correlations and association rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by reasoning over the data mining outcomes, and this reasoning process produces an argument support analysis (see this link).
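
    As a rough illustration of the Transformation and clustering steps above, the sketch below runs PCA and k-means with scikit-learn on a placeholder papers-by-features matrix. It shows the idea only; the authors' actual pipelines were built in RapidMiner and published in their repository.

    ```python
    # Illustrative PCA + clustering sketch (scikit-learn), not the authors' RapidMiner pipeline.
    # X stands in for the papers-by-features matrix obtained after encoding the 35 nominal attributes.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(128, 35)).astype(float)  # placeholder for the encoded SLR data

    # Project the 35 features onto 2 principal components for visualization.
    coords = PCA(n_components=2).fit_transform(X)

    # Choose the number of clusters by looking for the largest drop in within-cluster variance.
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
    drops = np.diff(inertias)            # drops[i] is the change going from k=i+1 to k=i+2 clusters
    best_k = int(np.argmin(drops)) + 2   # most negative difference = largest variance reduction
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

    print(coords.shape, best_k, np.bincount(labels))
    ```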

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful association rules: rectangles represent both premises and conclusions, and an arrow connecting a premise to a conclusion means that, given that premise, the conclusion is associated with it. For example, given that an author used supervised learning, we can conclude that their approach is irreproducible, with a certain support and confidence.

    Support = the number of occurrences in which the statement is true, divided by the total number of statements.
    Confidence = the number of occurrences in which the statement is true, divided by the number of occurrences of the premise (i.e., the support of the statement divided by the support of the premise).
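
    As a concrete illustration of these definitions, the snippet below computes the support and confidence of a single rule over a small boolean paper-by-attribute table; the column names and values are hypothetical placeholders, not the actual SLR attributes.

    ```python
    # Support and confidence of one rule (premise -> conclusion) over a boolean
    # paper-by-attribute table, following the definitions above.
    import pandas as pd

    papers = pd.DataFrame({
        "supervised_learning": [True, True, True, False, True],
        "irreproducible":      [True, True, False, False, True],
    })

    premise = papers["supervised_learning"]
    conclusion = papers["irreproducible"]
    statement = premise & conclusion                 # premise and conclusion hold together

    support = statement.sum() / len(papers)          # occurrences of the statement / all papers
    confidence = statement.sum() / premise.sum()     # support(statement) / support(premise)

    print(f"support={support:.2f}, confidence={confidence:.2f}")
    ```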

  2. Data from: D-CCA: A Decomposition-based Canonical Correlation Analysis for...

    • figshare.com
    zip
    Updated Feb 9, 2024
    Cite
    Hai Shu; Xiao Wang; Hongtu Zhu (2024). D-CCA: A Decomposition-based Canonical Correlation Analysis for High-Dimensional Datasets* [Dataset]. http://doi.org/10.6084/m9.figshare.7461734.v1
    Explore at:
    zip
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Hai Shu; Xiao Wang; Hongtu Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A typical approach to the joint analysis of two high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within a single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider the more necessary orthogonal relationship between the two distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the ℒ2 space of random variables rather than the conventionally used Euclidean space, with a careful construction of the orthogonal relationship between distinctive matrices. D-CCA represents a natural generalization of the traditional canonical correlation analysis. The proposed estimators of common and distinctive matrices are shown to be consistent and have reasonably better performance than some state-of-the-art methods in both simulated data and the real data analysis of breast cancer data obtained from The Cancer Genome Atlas.
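
    The snippet below gives a much-simplified illustration of the general idea (a CCA-based common part plus a remainder for each view); it is not the D-CCA estimator proposed in the paper, and the simulated matrices stand in for real data.

    ```python
    # Simplified split of two views into a CCA-based "common" part plus a remainder.
    # NOT the paper's D-CCA estimator; data are synthetic placeholders.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)
    shared = rng.normal(size=(200, 2))                               # latent signal shared by both views
    X1 = shared @ rng.normal(size=(2, 30)) + 0.3 * rng.normal(size=(200, 30))
    X2 = shared @ rng.normal(size=(2, 40)) + 0.3 * rng.normal(size=(200, 40))

    cca = CCA(n_components=2).fit(X1, X2)
    U, V = cca.transform(X1, X2)                                     # canonical variables of each view

    # Common part of X1: least-squares reconstruction of X1 from its canonical variables;
    # the remainder mixes distinctive structure and noise.
    beta1, *_ = np.linalg.lstsq(U, X1, rcond=None)
    C1 = U @ beta1
    D1 = X1 - C1

    print(C1.shape, D1.shape, round(float(np.corrcoef(U[:, 0], V[:, 0])[0, 1]), 3))
    ```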

  3. Statistical Dataset Supporting the Review Paper of International Trade...

    • figshare.com
    xlsx
    Updated Jul 14, 2024
    Cite
    Donghai Liu; Lingli Xing (2024). Statistical Dataset Supporting the Review Paper of International Trade Network Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.26300167.v1
    Explore at:
    xlsx
    Dataset updated
    Jul 14, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Donghai Liu; Lingli Xing
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains statistical data of International Trade Network (ITN) literature from 2003 to 2023. It includes the data sources, research content, and citation counts for each piece of literature (01_Comprehensive Statistics.xlsx). Additionally, for structure prediction (02_Structure Prediction.xlsx) and correlation analysis (03_Correlation Analysis.xlsx), a detailed classification of methodologies and analytical perspectives is provided. Finally, for each data source, we have compiled the total citation counts (04_citations_of_data.xlsx) and the total number of publications (05_publications_of_data.xlsx).

  4. DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

    • kaggle.com
    zip
    Updated Oct 31, 2023
    + more versions
    Cite
    Ali Noranian (2023). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. https://www.kaggle.com/datasets/alinoranianesfahani/dataco-smart-supply-chain-for-big-data-analysis/data
    Explore at:
    zip (26920609 bytes)
    Dataset updated
    Oct 31, 2023
    Authors
    Ali Noranian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A dataset of supply chains used by the company DataCo Global was used for the analysis. The dataset allows the use of machine learning algorithms and R software. Areas of important registered activities: provisioning, production, sales, and commercial distribution. It also allows the correlation of structured data with unstructured data for knowledge generation.

    Data types: structured data in DataCoSupplyChainDataset.csv; unstructured data in tokenized_access_logs.csv (clickstream).

    Types of products: clothing, sports, and electronic supplies.

    Additionally, a separate file, DescriptionDataCoSupplyChain.csv, describes each of the variables of DataCoSupplyChainDataset.csv. Categories: Data Mining, Supply Chain Management, Machine Learning, Big Data Analytics.

  5. Data Sheet 1_Non-linear correlation analysis between internet searches and...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Apr 4, 2025
    Cite
    He, Yongzhang; Xia, Yixue; Wang, Yang; Huang, Fengxiang; Ran, Lingshi (2025). Data Sheet 1_Non-linear correlation analysis between internet searches and epidemic trends.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002036779
    Explore at:
    Dataset updated
    Apr 4, 2025
    Authors
    He, Yongzhang; Xia, Yixue; Wang, Yang; Huang, Fengxiang; Ran, Lingshi
    Description

    Introduction: This study uses a non-linear model to explore the impact mechanism of change rates between internet search behavior and confirmed COVID-19 cases. The research background focuses on epidemic monitoring, leveraging internet search data as a real-time tool to capture public interest and predict epidemic development. The goal is to establish a widely applicable mathematical framework through the analysis of long-term disease data.

    Methods: Data were sourced from the Baidu Index for COVID-19-related search behavior and confirmed COVID-19 case data from the National Health Commission of China. A logistic-based non-linear differential equation model was employed to analyze the mutual influence mechanism between confirmed case numbers and the rate of change in search behavior. Structural and operator relationships between variables were determined through segmented data fitting and regression analysis.

    Results: The results indicated a significant non-linear correlation between search behavior and confirmed COVID-19 cases. The non-linear differential equation model constructed in this study successfully passed both structural and correlation tests, with dynamic data fitting showing a high degree of consistency. The study further quantified the mutual influence between search behavior and confirmed cases, revealing a strong feedback loop between the two: changes in search behavior significantly drove the growth of confirmed cases, while the increase in confirmed cases also stimulated the public's search behavior. This finding suggests that search behavior not only reflects the development trend of the epidemic but can also serve as an effective indicator for predicting the evolution of the pandemic.

    Discussion: This study enriches the understanding of epidemic transmission mechanisms by quantifying the dynamic interaction between public search behavior and epidemic spread. Compared to simple prediction models, this study focuses more on stable common mechanisms and structural analysis, laying a foundation for future research on public health events.
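
    For orientation, the sketch below fits a generic logistic growth curve with SciPy. It is only a stand-in for the kind of logistic-based non-linear model described above; the paper's actual coupled case/search-index equations are not reproduced, and the data are synthetic placeholders.

    ```python
    # Generic logistic-growth fit; synthetic data, not the paper's coupled model.
    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(t, K, r, t0):
        """Cumulative cases following dC/dt = r * C * (1 - C / K)."""
        return K / (1.0 + np.exp(-r * (t - t0)))

    t = np.arange(0, 60, 1.0)
    true_curve = logistic(t, K=10_000, r=0.25, t0=30)
    cases = true_curve + np.random.default_rng(2).normal(scale=150, size=t.size)

    (K_hat, r_hat, t0_hat), _ = curve_fit(logistic, t, cases, p0=[cases.max(), 0.1, t.mean()])
    print(f"K={K_hat:.0f}, r={r_hat:.3f}, t0={t0_hat:.1f}")
    ```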

  6. Data from: Approaches for the utilization of multiple criteria to select a...

    • tandf.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Philip M. Westgate (2023). Approaches for the utilization of multiple criteria to select a working correlation structure for use within generalized estimating equations [Dataset]. http://doi.org/10.6084/m9.figshare.7422707.v2
    Explore at:
    pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Philip M. Westgate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generalized estimating equations (GEE) incorporate a working correlation structure that is important because the more accurately this structure reflects the true structure, the more efficiently regression parameters may be estimated. Numerous criteria have therefore been proposed to select a working structure, although no criterion will always work better than all other criteria. In practice, it will be unknown which criterion will work best. Therefore, in this manuscript we propose how to utilize information from multiple criteria. We demonstrate the benefits of our proposed approach via a simulation study in a variety of settings and then in an application example.
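
    The sketch below illustrates the underlying selection problem with statsmodels: fitting GEE under two candidate working correlation structures and comparing QIC. It does not implement the multi-criterion combination approach proposed in the paper, and the clustered data are simulated.

    ```python
    # GEE under two candidate working correlation structures, compared by QIC.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n_clusters, n_per = 100, 4
    cluster = np.repeat(np.arange(n_clusters), n_per)
    x = rng.normal(size=n_clusters * n_per)
    cluster_effect = np.repeat(rng.normal(scale=0.8, size=n_clusters), n_per)  # induces within-cluster correlation
    y = 1.0 + 0.5 * x + cluster_effect + rng.normal(size=n_clusters * n_per)
    data = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

    for name, cov in [("independence", sm.cov_struct.Independence()),
                      ("exchangeable", sm.cov_struct.Exchangeable())]:
        res = smf.gee("y ~ x", groups="cluster", data=data, cov_struct=cov).fit()
        # qic() returns (QIC, QICu) in recent statsmodels versions
        print(name, float(res.params["x"]), res.qic())
    ```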

  7. Data_Sheet_1_Interpretive JIVE: Connections with CCA and an application to...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Raphiel J. Murden; Zhengwu Zhang; Ying Guo; Benjamin B. Risk (2023). Data_Sheet_1_Interpretive JIVE: Connections with CCA and an application to brain connectivity.PDF [Dataset]. http://doi.org/10.3389/fnins.2022.969510.s001
    Explore at:
    pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Raphiel J. Murden; Zhengwu Zhang; Ying Guo; Benjamin B. Risk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Joint and Individual Variation Explained (JIVE) is a model that decomposes multiple datasets obtained on the same subjects into shared structure, structure unique to each dataset, and noise. JIVE is an important tool for multimodal data integration in neuroimaging. The two most common algorithms are R.JIVE, an iterative approach, and AJIVE, which uses principal angle analysis. The joint structure in JIVE is defined by shared subspaces, but interpreting these subspaces can be challenging. In this paper, we reinterpret AJIVE as a canonical correlation analysis of principal component scores. This reformulation, which we call CJIVE, (1) provides an intuitive view of AJIVE; (2) uses a permutation test for the number of joint components; (3) can be used to predict subject scores for out-of-sample observations; and (4) is computationally fast. We conduct simulation studies that show CJIVE and AJIVE are accurate when the total signal ranks are correctly specified but generally inaccurate when the total ranks are too large. CJIVE and AJIVE can still extract joint signal even when the joint signal variance is relatively small. JIVE methods are applied to integrate functional connectivity (resting-state fMRI) and structural connectivity (diffusion MRI) from the Human Connectome Project. Surprisingly, the edges with the largest loadings in the joint component in functional connectivity do not coincide with the same edges in the structural connectivity, indicating more complex patterns than assumed in spatial priors. Using these loadings, we accurately predict joint subject scores in new participants. We also find joint scores are associated with fluid intelligence, highlighting the potential for JIVE to reveal important shared structure.
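
    The following sketch mirrors the CJIVE reformulation described above (PCA on each dataset, then CCA on the principal component scores); the ranks, data, and variable names are illustrative placeholders rather than the authors' implementation.

    ```python
    # PCA on each dataset, then CCA on the PC scores to estimate the joint subspace.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(4)
    joint = rng.normal(size=(150, 3))                     # structure shared across the two modalities
    X = joint @ rng.normal(size=(3, 60)) + 0.5 * rng.normal(size=(150, 60))
    Y = joint @ rng.normal(size=(3, 80)) + 0.5 * rng.normal(size=(150, 80))

    rank_x, rank_y, rank_joint = 10, 10, 3                # assumed total and joint ranks
    scores_x = PCA(n_components=rank_x).fit_transform(X)
    scores_y = PCA(n_components=rank_y).fit_transform(Y)

    cca = CCA(n_components=rank_joint).fit(scores_x, scores_y)
    jx, jy = cca.transform(scores_x, scores_y)            # joint subject scores per modality
    canonical_corrs = [float(np.corrcoef(jx[:, k], jy[:, k])[0, 1]) for k in range(rank_joint)]
    print(np.round(canonical_corrs, 3))
    ```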

  8. Alterations of gray and white matter networks in patients with...

    • search.dataone.org
    • data.niaid.nih.gov
    • +3more
    Updated Apr 2, 2025
    Cite
    Seung-Goo Kim; Wi Hoon Jung; Sung Nyun Kim; Joon Hwan Jang; Jun Soo Kwon (2025). Alterations of gray and white matter networks in patients with obsessive-compulsive disorder: a multimodal fusion analysis of structural MRI and DTI using mCCA+jICA [Dataset]. http://doi.org/10.5061/dryad.5jv56
    Explore at:
    Dataset updated
    Apr 2, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Seung-Goo Kim; Wi Hoon Jung; Sung Nyun Kim; Joon Hwan Jang; Jun Soo Kwon
    Time period covered
    Apr 24, 2016
    Description

    Many previous neuroimaging studies on neuronal structures in patients with obsessive-compulsive disorder (OCD) used univariate statistical tests on unimodal imaging measurements. Although the univariate methods revealed important aberrance of local morphometry in OCD patients, the covariance structure of the anatomical alterations remains unclear. Motivated by recent developments of multivariate techniques in the neuroimaging field, we applied a fusion method called “mCCA+jICA” on multimodal structural data of T1-weighted magnetic resonance imaging (MRI) and diffusion tensor imaging (DTI) of 30 unmedicated patients with OCD and 34 healthy controls. Amongst six highly correlated multimodal networks (p < 0.0001), we found significant alterations of the interrelated gray and white matter networks over occipital and parietal cortices, frontal interhemispheric connections and cerebella (False Discovery Rate q ≤ 0.05). In addition, we found white matter networks around basal ganglia tha...

  9. Expression reflects population structure

    • plos.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Brielin C. Brown; Nicolas L. Bray; Lior Pachter (2023). Expression reflects population structure [Dataset]. http://doi.org/10.1371/journal.pgen.1007841
    Explore at:
    pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Brielin C. Brown; Nicolas L. Bray; Lior Pachter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Population structure in genotype data has been extensively studied, and is revealed by looking at the principal components of the genotype matrix. However, no similar analysis of population structure in gene expression data has been conducted, in part because a naïve principal components analysis of the gene expression matrix does not cluster by population. We identify a linear projection that reveals population structure in gene expression data. Our approach relies on the coupling of the principal components of genotype to the principal components of gene expression via canonical correlation analysis. Our method is able to determine the significance of the variance in the canonical correlation projection explained by each gene. We identify 3,571 significant genes, only 837 of which had been previously reported to have an associated eQTL in the GEUVADIS results. We show that our projections are not primarily driven by differences in allele frequency at known cis-eQTLs and that similar projections can be recovered using only several hundred randomly selected genes and SNPs. Finally, we present preliminary work on the consequences for eQTL analysis. We observe that using our projection co-ordinates as covariates results in the discovery of slightly fewer genes with eQTLs, but that these genes replicate in GTEx matched tissue at a slightly higher rate.

  10. Replication data for: Diverse Correlation Structures in Microarray Gene...

    • datamed.org
    • dataverse.harvard.edu
    Updated Oct 8, 2007
    Cite
    (2007). Replication data for: Diverse Correlation Structures in Microarray Gene Expression Data [Dataset]. https://datamed.org/display-item.php?repository=0012&idName=ID&id=56d4b887e4b0e644d313513b
    Explore at:
    Dataset updated
    Oct 8, 2007
    Description

    It is well-known that correlations in microarray data represent a serious nuisance deteriorating the performance of gene selection procedures. This paper is intended to demonstrate that the correlation structure of microarray data provides a rich source of useful information. We discuss distinct correlation substructures revealed in microarray gene expression data by an appropriate ordering of genes. These substructures include stochastic proportionality of expression signals in a large percentage of all gene pairs, negative correlations hidden in ordered gene triples, and a long sequence of weakly dependent random variables associated with ordered pairs of genes. The reported striking regularities are of general biological interest and they also have far-reaching implications for theory and practice of statistical methods of microarray data analysis. We illustrate the latter point with a method for testing differential expression of non-overlapping gene pairs. While designed for testing a different null hypothesis, this method provides an order of magnitude more accurate control of type 1 error rate compared to conventional methods of individual gene expression profiling. In addition, this method is robust to the technical noise. Quantitative inference of the correlation structure has the potential to extend the analysis of microarray data far beyond currently practiced methods.

  11. Data sets for "Structure of molten NaCl and the decay of the...

    • researchdata.bath.ac.uk
    jpeg, txt
    Updated Aug 26, 2022
    Cite
    Philip Salmon; Anita Zeidler (2022). Data sets for "Structure of molten NaCl and the decay of the pair-correlations" [Dataset]. http://doi.org/10.15125/BATH-01165
    Explore at:
    jpeg, txt
    Dataset updated
    Aug 26, 2022
    Dataset provided by
    University of Bath
    Authors
    Philip Salmon; Anita Zeidler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Japan Society for the Promotion of Science
    Engineering and Physical Sciences Research Council
    Description

    Data sets used to prepare Figures 1-14 in the Journal of Chemical Physics article entitled "Structure of molten NaCl and the decay of the pair-correlations." The data sets refer to the measured and simulated structure and thermodynamic properties of molten NaCl.

  12. Structural estimates of the intergenerational education correlation...

    • resodate.org
    Updated Oct 6, 2025
    Cite
    Christian Belzil (2025). Structural estimates of the intergenerational education correlation (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zdHJ1Y3R1cmFsLWVzdGltYXRlcy1vZi10aGUtaW50ZXJnZW5lcmF0aW9uYWwtZWR1Y2F0aW9uLWNvcnJlbGF0aW9u
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    ZBW Journal Data Archive
    ZBW
    Journal of Applied Econometrics
    Authors
    Christian Belzil
    Description

    Using a structural dynamic programming model, we investigate the relative importance of family background variables and individual specific abilities in explaining cross-sectional differences in schooling attainments and wages. Each type of ability is the sum of one component correlated with family background variables and a residual (orthogonal) component which is purely individual specific. Household background variables (especially parents' education) account for 68% of the explained cross-sectional variations in schooling attainments, while ability correlated with background variables accounts for 17% and pure individual specific ability accounts for 15%. Interestingly, individual differences in wages are mostly explained by pure individual specific abilities as they account for as much as 73% of the explained variations in wages. Family background variables account for only 19%, while ability endowments correlated with family background account for 8%.

  13. Dataset of books called The ovary : a correlation of structure and function...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called The ovary : a correlation of structure and function in mammals [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+ovary+%3A+a+correlation+of+structure+and+function+in+mammals
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is The ovary : a correlation of structure and function in mammals. It features 7 columns including author, publication date, language, and book publisher.

  14. Data from: Parameterizing the LISREL Model as a Correlation Structure Model...

    • tandf.figshare.com
    txt
    Updated May 14, 2025
    Cite
    Ke-Hai Yuan; Zhiyong Zhang (2025). Parameterizing the LISREL Model as a Correlation Structure Model for More Efficient Parameter Estimates and More Powerful Statistical Tests [Dataset]. http://doi.org/10.6084/m9.figshare.28410180.v1
    Explore at:
    txt
    Dataset updated
    May 14, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Ke-Hai Yuan; Zhiyong Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Most methods for structural equation modeling (SEM) have focused on the analysis of covariance matrices. However, “Historically, interesting psychological theories have been phrased in terms of correlation coefficients.” This might be because data in the social and behavioral sciences typically do not have predefined metrics. While proper methods for conducting correlation structure analysis have been developed, they emphasized either how to obtain consistent standard errors of parameter estimates or how to ensure that the model-implied matrix remains a correlation matrix. Motivated by the fundamental need for more efficient/accurate parameter estimates and greater power in statistical tests, this article explores advantages of correlation structure analysis over its conventional covariance counterpart. Issues related to reparameterization and placement of parameters are discussed. A new concept is introduced for comparing the efficiency/accuracy of parameter estimates that are not on the same scale. Via the analysis of many real datasets, meta results show that correlation structure analysis yields uniformly more accurate parameter estimates and more powerful statistical tests than its covariance-structure-analysis counterpart on parameters that are of substantive interest. The same pattern of results between the two model parameterizations is also found by Monte Carlo simulation. Issues related to correlation structure analysis and substantive elaboration of models that are not scale-invariant are discussed as well. The results are expected to promote technical and software developments of correlation structure analysis as well as its adoption in data analysis.

  15. A Bayesian Monte-Carlo Inversion of Spatial Auto-Correlation (SPAC) for...

    • data.usgs.gov
    • datasets.ai
    • +2more
    Updated Apr 5, 2019
    + more versions
    Cite
    Zhang Huajun; Pankow Kristine; Stephenson William J (2019). A Bayesian Monte-Carlo Inversion of Spatial Auto-Correlation (SPAC) for Near-Surface Vs Structure Applied to both Broadband and Geophone Data - Data release [Dataset]. http://doi.org/10.5066/P9OXYQST
    Explore at:
    Dataset updated
    Apr 5, 2019
    Dataset provided by
    United States Geological Survey
    Authors
    Zhang Huajun; Pankow Kristine; Stephenson William J
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Sep 3, 2007 - Aug 18, 2017
    Description

    The datasets for this investigation consist of microtremor array data collected at: 1) 18 sites in Salt Lake and Utah valleys, Utah, and 2) two sites as part of the Frontier Observatory for Research in Geothermal Energy (FORGE) near Milford, Utah. Each of the 18 sites in the Salt Lake and Utah valleys were acquired with four-sensor arrays with three-component (3C) sensors having flat response from 0.033 Hz to 50 Hz. The data acquired as part of the FORGE investigation used both 3C broadband and 5-Hz geophone sensors. Additional information on these datasets can be found in the supporting documentation provided in this data release as well as in the paper by Zhang and others (2019) that utilized these data.

  16. Data from: Can creative thinking predict academic success in medical...

    • search.dataone.org
    • datadryad.org
    Updated Apr 10, 2025
    Cite
    Marcellus Nealy; Takeo Higuchi; Hiroyuki Daida; Yuichi Tomiki; Dennis Dew (2025). Can creative thinking predict academic success in medical education? Correlating Torrance Test of Creative Thinking scores and five-year GPAs of Japanese medical students [Dataset]. http://doi.org/10.5061/dryad.79cnp5j6p
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marcellus Nealy; Takeo Higuchi; Hiroyuki Daida; Yuichi Tomiki; Dennis Dew
    Description

    This study determined the correlation between creative thinking aptitude, measured by the Torrance Test of Creative Thinking–Figural (TTCT–F), and five-year academic achievement. The TTCT–F was administered to 135 first-year medical students at a Tokyo-based medical school in 2018. Participants’ academic records (annual GPAs over five years) were averaged, and data were analyzed in 2023. Pearson correlation coefficients examined the relationship between the TTCT–F Creativity Index and the five-year average GPA; multiple linear regression assessed the predictive value of TTCT–F components on GPA; canonical correlation analysis explored multivariate relationships. The Creativity Index demonstrated a weak, non-significant correlation with the five-year average GPA. Fluency, Originality, and Elaboration components were not significantly correlated, while Abstractness of Titles demonstrated a moderate positive correlation. Linear regression indicated that Abstractness of Titles signi...

    Participants: We conducted a retrospective cohort study, administering the Torrance Test of Creative Thinking Figural (TTCT–F) in 2018 as a proctored and timed test to a cohort of 135 first-year medical students at Juntendo University Faculty of Medicine in Chiba, Japan. The participants took the test simultaneously and were between the ages of 18 and 23 years old at the time. The cohort comprised 42 women (31.1%) and 93 men (68.9%) (see Table 1). The study was approved by the Juntendo University Institutional Review Board and conducted in accordance with ethical guidelines to ensure participant confidentiality and data anonymity. All participants provided informed consent before participation and allowed access to their academic records for research purposes. No exclusion criteria were applied, and all first-year medical students in the cohort were eligible to participate. Data were collected and stored in compliance with ethical guidelines to ensure confidentiality and anonymity. Instr...

    Can creative thinking predict academic success in medical education? Correlating Torrance Test of Creative Thinking scores and five-year GPAs of Japanese medical students

    https://doi.org/10.5061/dryad.79cnp5j6p

    Description of the data and file structure

    Data Set

    The data set shows the following:

    1. An anonymous student number is assigned to each student from 1 to 135.
    2. Gender
    3. Torrance Test of Creative Thinking Figural (TTCT-F) Fluency Score
    4. TTCT-F Originality Score
    5. TTCT-F Elaboration Score
    6. TTCT-F Abstractness of Titles Score
    7. TTCT-F Premature Closure Score
    8. TTCT-F Sum Score
    9. TTCT-F Average Standard Score
    10. TTCT-F Creativity Index Score
    11. First year of medical school (M1) GPA - Fifth year of medical school (M5) GPA
    12. Average 5-year GPA

    TTCT-F scores and categories are described above in the methods section.

    GPA = Grade Point Average.

  17. Data from: Digital image correlation data from analogue modelling...

    • dataservices.gfz-potsdam.de
    Updated 2021
    Cite
    Maria Michail; Michael Rudolf; Matthias Rosenau; Alberto Riva; Piero Gianolla; Massimo Coltorti (2021). Digital image correlation data from analogue modelling experiments addressing magma emplacement along simple shear and transtensional fault zones [Dataset]. http://doi.org/10.5880/gfz.4.1.2021.004
    Explore at:
    Dataset updated
    2021
    Dataset provided by
    datacite
    GFZ Data Services
    Authors
    Maria Michail; Michael Rudolf; Matthias Rosenau; Alberto Riva; Piero Gianolla; Massimo Coltorti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set includes the results of digital image correlation analysis applied to nine experiments (Table 1) on magma-tectonic interaction performed at the Helmholtz Laboratory for Tectonic Modelling (HelTec) of the GFZ German Research Centre for Geosciences in Potsdam in the framework of EPOS transnational access activities in 2017. The models use silicone oil (PDMS G30M, Rudolf et al., 2016) and quartz sand (G12, Rosenau et al., 2018) to simulate pre-, syn- and post-tectonic intrusion of granitic magma into upper crustal shear zones of simple shear and transtensional (15° obliquity) kinematics. Three reference experiments (simple shear, transtension, intrusion) are also reported. Detailed descriptions of the experiments can be found in Michail et al. (submitted), to which this data set is a supplement. The models have been monitored by means of digital image correlation (DIC) analysis including Particle Image Velocimetry (PIV; Adam et al., 2005) and Structure from Motion photogrammetry (SfM; Donnadieu et al., 2003; Westoby et al., 2012). DIC analysis yields quantitative model surface deformation information by means of 3D surface topography and displacements from which surface strain has been calculated. The data presented here are visualized as surface deformation maps and movies, as well as digital elevation and intrusion models. The results of a shape analysis of the model plutons are also provided.

  18. Pilot study of gas production analysis methods applied to Cottageville field...

    • data.wu.ac.at
    html
    Updated Sep 29, 2016
    Cite
    (2016). Pilot study of gas production analysis methods applied to Cottageville field [Dataset]. https://data.wu.ac.at/odso/edx_netl_doe_gov/MDA2M2I4NmYtNTBkNS00MGM1LTlkYWQtZjQ5NTcwMjMyZGFj
    Explore at:
    html
    Dataset updated
    Sep 29, 2016
    Description

    Gas production data from 63 wells in the Cottageville Gas Field, producing from Devonian shales, are studied in relationship to structure above and below producing horizons, isopach data and dip of producing shales, and basement structure trends. Gas production data are studied from several aspects including highest accumulated production, mean annual production, initial well pressure, and calculated loss ratio values for four different time periods. A trend correlation of these parameters is presented. The initial pressure trends correlate with all geological parameters, i.e., Devonian shale dip and strike, the 40 to 50° NE fracture facies trend, structure on the base of the Huron, structure on the top of the Onondaga, and the basement magnetic density data. Production data trends show greatest correlation with structure on the top of the Onondaga and with fracture facies trends from the Baler well. Production decline data in terms of loss ratio values show trends correlating with all geologic parameters except the Onondaga. Two loss ratio maps correlate with the structure on the bottom of the Huron. The strike of Onondaga structure correlates with the 40 to 50° NE fracture facies trend. These parameters may be generally viewed as follows: the production maps represent free gas pockets and migration-accumulation trends; the loss ratios are possible permeability and migration trend indicators; and the geologic parameters are possible constraints or causative agents. The lack of correlation of geologic parameters with production data trends a few degrees west of north may be suggestive of a fault or faults in that direction, providing the correlative causative agent. This is not an unreasonable possibility from the production data maps. It is concluded that this approach could be useful in gas exploration and development evaluation of Appalachian Devonian shale gas fields. Similar relationships will be examined in the Eastern Kentucky Gas Field(s) Study presently in progress.

  19. Experimental data on plastered rubble stone masonry walls

    • experiments.builtenvdata.eu
    • zenodo.org
    Updated Oct 24, 2024
    Cite
    Radhakrishna Achanta; Katrin Beyer; Michele Godio; Amir Rezaie (2024). Experimental data on plastered rubble stone masonry walls [Dataset]. http://doi.org/10.5281/zenodo.5052675
    Explore at:
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    EUCENTRE
    Authors
    Radhakrishna Achanta; Katrin Beyer; Michele Godio; Amir Rezaie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains data from experimental tests on plastered rubble stone masonry walls conducted at École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland.

  20. Data from: Single-particle structure determination by correlations of...

    • cxidb.org
    Updated Mar 25, 2013
    Cite
    D. Starodub (2013). Single-particle structure determination by correlations of snapshot X-ray diffraction patterns [Dataset]. http://doi.org/10.11577/1096925
    Explore at:
    Dataset updated
    Mar 25, 2013
    Authors
    D. Starodub
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This deposition includes the diffraction images generated by the paired polystyrene spheres in random orientations. These images were used to determine and phase the single particle diffraction volume from their autocorrelation functions.
