31 datasets found
  1. Additional file 2 of Modelling count, bounded and skewed continuous outcomes...

    • springernature.figshare.com
    text/x-diff
    Updated Jun 2, 2023
    Cite
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 2 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774297.v1
    Available download formats: text/x-diff
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Authors
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material 2: A supplementary file with examples of STATA script for all models that have been fitted in this paper.

  2. Additional file 3 of Modelling count, bounded and skewed continuous outcomes...

    • springernature.figshare.com
    txt
    Updated Jun 2, 2023
    + more versions
    Cite
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 3 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774300.v1
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material 3: A supplementary file with examples of SAS script for all models that have been fitted in this paper.

  3. EasyGSH-DB: Skew (1996, 2006, 2016)

    • ckan.mobidatalab.eu
    Updated Jun 16, 2023
    Cite
    Bundesanstalt für Wasserbau (2023). EasyGSH-DB: Skew (1996, 2006, 2016) [Dataset]. https://ckan.mobidatalab.eu/dataset/easygsh-db-skewed-1996-2006-2016
    Available download formats: http://publications.europa.eu/resource/authority/file-type/wfs_srvc, http://publications.europa.eu/resource/authority/file-type/tiff
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Bundesanstalt für Wasserbau
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 30, 1996 - Jun 30, 2016
    Description

    Definition: The skewness "Sk1" is a measure of the symmetry of the cumulative curve and indicates the ratio of coarse to fine fractions in the particle size distribution. Folk & Ward (1957) quantify this symmetry on a scale from -1 to 1. Positive values (greater than 0, up to 1) indicate "left skewing" for metric cumulative curves, i.e. fine grain fractions predominate over coarse fractions. Negative values (less than 0, down to -1) indicate "right skewing" for metric cumulative curves, i.e. coarse fractions predominate over fine fractions. Sk1 = 0 indicates a perfectly symmetrical cumulative curve. Conclusions about the deposition environment can be drawn from the skewness.

    Data generation: Sedimentological evaluations are based on surface sediment samples. Within the EasyGSH project, individual samples from different years were interpolated onto a grid valid for a single year, using anisotropic interpolation methods and taking into account hydrodynamic factors as well as erosion and sedimentation processes. The sediment distribution is therefore available as a cumulative curve at each grid node. For the German Bight, this basic product is available for the years 1996, 2006 and 2016 on a 100 m grid; for the Exclusive Economic Zone of Germany, it is available for the year 1996 on a 250 m grid. The percentile values ϕ5, ϕ16, ϕ50, ϕ84 and ϕ95 required by the Folk & Ward (1957) skewness formula can be read directly from these cumulative curves, and the skewness parameter Sk1 can then be calculated.

    Product: 100 m grid of the German Bight (1996, 2006, 2016) or 250 m grid of the Exclusive Economic Zone (1996), with the skewness Sk1 according to Folk & Ward (1957) stored at each grid node. The product is provided in GeoTiff format.

    Literature: Folk, R.L., & Ward, W.C. (1957). A study in the significance of grain size parameters. Journal of Petrology, 37, 327-354.

    For further information, please refer to the information portal (http://easygsh.wb.tu-harburg.de/) and the download portal (https://mdi-de.baw.de/easygsh/).

    English download: The data can be found under References ("further references"), where they can be downloaded directly or via redirection to the EasyGSH-DB portal. For further information, please refer to the download portal (https://mdi-de.baw.de/easygsh/EasyEN_index.html).
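
    The description above refers to the Folk & Ward (1957) calculation rule without writing it out. A minimal sketch of the standard inclusive graphic skewness formula is given below; the function name and the sample percentile values are illustrative assumptions, not part of the EasyGSH product.

```python
def folk_ward_skewness(phi5, phi16, phi50, phi84, phi95):
    """Inclusive graphic skewness Sk1 (Folk & Ward, 1957).

    Arguments are grain sizes in phi units read from the cumulative
    curve at the 5th, 16th, 50th, 84th and 95th percentiles.
    """
    # Asymmetry of the central 68% of the distribution.
    central = (phi16 + phi84 - 2.0 * phi50) / (2.0 * (phi84 - phi16))
    # Asymmetry of the tails (central 90% of the distribution).
    tails = (phi5 + phi95 - 2.0 * phi50) / (2.0 * (phi95 - phi5))
    return central + tails


# Hypothetical percentile values, chosen only to illustrate the call.
symmetric = folk_ward_skewness(-2.0, -1.0, 0.0, 1.0, 2.0)
skewed = folk_ward_skewness(0.0, 1.0, 2.0, 5.0, 8.0)
print(symmetric, skewed)
```

    Each of the two terms lies between -0.5 and 0.5 for a monotone cumulative curve, which is why Sk1 stays within the range -1 to 1 quoted in the definition.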

  4. Supplementary data for the paper "Why psychologists should not default to...

    • data.4tu.nl
    zip
    Updated Apr 28, 2025
    + more versions
    Cite
    Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
    Available download formats: zip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Joost de Winter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
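
    To make the difference between the two tests concrete, the sketch below computes the test statistic and degrees of freedom for Student's (pooled-variance) and Welch's t-tests using only the Python standard library. It is an illustration of the textbook formulas, not the simulation code from the dataset; the function names and sample values are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Student's independent-samples t statistic and degrees of freedom."""
    n1, n2 = len(a), len(b)
    # Pooled variance assumes both groups share one population variance.
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples with equal sizes but unequal variances.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(student_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

    With equal group sizes the two statistics coincide and only the degrees of freedom differ; the tests diverge more strongly once sample sizes are unequal as well.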

  5. Data from: Uneven missing data skew phylogenomic relationships within the...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 21, 2022
    Cite
    Brian Smith; William M. Mauck III; Brett W. Benz; Michael J. Andersen (2022). Uneven missing data skew phylogenomic relationships within the lories and lorikeets [Dataset]. http://doi.org/10.5061/dryad.n5tb2rbsp
    Available download formats: zip
    Dataset updated
    Jul 21, 2022
    Dataset provided by
    American Museum of Natural History
    University of New Mexico
    University of Michigan
    New York Genome Center
    Authors
    Brian Smith; William M. Mauck III; Brett W. Benz; Michael J. Andersen
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Included is the supplementary data for Smith, B. T., Mauck, W. M., Benz, B., & Andersen, M. J. (2018). Uneven missing data skews phylogenomic relationships within the lories and lorikeets. BioRxiv, 398297. The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impacts phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9x more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.
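
    The data reduction step mentioned above (excluding sites based on data completeness) can be sketched as a simple column filter over an alignment. This is a hedged illustration only: the authors' actual pipeline is not described here, and the function name, the set of missing-data symbols and the toy alignment are all assumptions. The 0.7 default mirrors the 70% completeness figure in the abstract.

```python
# Assumed symbols for missing or ambiguous data in an alignment.
MISSING = set("-?Nn")


def filter_by_completeness(alignment, min_completeness=0.7):
    """Keep alignment columns whose share of non-missing characters
    is at least `min_completeness`.

    `alignment` maps taxon name -> aligned sequence (equal lengths).
    """
    seqs = list(alignment.values())
    n_taxa = len(seqs)
    keep = [
        i
        for i in range(len(seqs[0]))
        if sum(s[i] not in MISSING for s in seqs) / n_taxa >= min_completeness
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment with hypothetical taxon names.
toy = {"modern_1": "ACGT", "modern_2": "AC-T", "hist_1": "A-GT", "hist_2": "A-G?"}
print(filter_by_completeness(toy))
```

    In the toy alignment the second column is only 50% complete and is dropped, while the other three columns pass the 70% threshold.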

  6. Training data, trained neural network models, trajectories and PLUMED input...

    • zenodo.org
    zip
    Updated Apr 26, 2025
    Cite
    Zhikun Zhang; GiovanniMaria Piccini (2025). Training data, trained neural network models, trajectories and PLUMED input files for manuscript "Exploring Chemistry and Catalysis by Biasing Skewed Distributions via Deep Learning" [Dataset]. http://doi.org/10.26434/chemrxiv-2025-cvb1v-v2
    Available download formats: zip
    Dataset updated
    Apr 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zhikun Zhang; GiovanniMaria Piccini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 26, 2025
    Description

    This dataset contains two main branches: the dataset for the neural network (NN) and the trajectories of the simulations demonstrated in both the main body and the supporting information of the corresponding preprint.

    Dataset for Neural Network (NN):

    All datasets related to the NN training procedure are located in the "NN-models-and-training-data" directory. Within this parent directory, each subfolder corresponds to a specific case study presented in the manuscript. Each subfolder for test cases contains:

    • Training Datasets: COLVAR files used for training.
    • Trained Models: Skewencoder models (.pt files) from each biased iteration of the simulation.
    • PLUMED Files: Used for generating the COLVAR files.
    • Lightning Logs: Logs generated during training.

    For example, consider the SN2 subfolder. The structure of this folder is as follows:

    ├───Reverse
    │   ├───unbiased
    │   ├───results
    │   │   ├───iter_0
    │   │   │   └───data
    │   │   ├───iter_1
    │   │   │   └───data
    │   │   ├───iter_10
    │   │   │   └───data
    │   │   ├───iter_2
    │   │   │   └───data
    │   │   ├───iter_3
    │   │   │   └───data
    │   │   ├───iter_4
    │   │   │   └───data
    │   │   ├───iter_5
    │   │   │   └───data
    │   │   ├───iter_6
    │   │   │   └───data
    │   │   ├───iter_7
    │   │   │   └───data
    │   │   ├───iter_8
    │   │   │   └───data
    │   │   └───iter_9
    │   │       └───data
    │   └───lightning_logs
    │       ├───version_0
    │       ├───version_1
    │       ├───version_10
    │       ├───version_2
    │       ├───version_3
    │       ├───version_4
    │       ├───version_5
    │       ├───version_6
    │       ├───version_7
    │       ├───version_8
    │       └───version_9
    └───Forward
        ├───results
        │   ├───iter_0
        │   │   └───data
        │   ├───iter_1
        │   │   └───data
        │   ├───iter_10
        │   │   └───data
        │   ├───iter_2
        │   │   └───data
        │   ├───iter_3
        │   │   └───data
        │   ├───iter_4
        │   │   └───data
        │   ├───iter_5
        │   │   └───data
        │   ├───iter_6
        │   │   └───data
        │   ├───iter_7
        │   │   └───data
        │   ├───iter_8
        │   │   └───data
        │   └───iter_9
        │       └───data
        └───unbiased

    The reverse and forward folders correspond to specific reaction directions described in the manuscript. The unbiased folder contains the unbiased simulation training data along with the PLUMED input file used for data generation. In the results folder, each subfolder represents a biased simulation iteration and includes:

    • The trained model.
    • The PLUMED input file for the simulation.
    • The generated COLVAR file.
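
    In the folder listing, iter_10 appears between iter_1 and iter_2 because the names are sorted as text. When processing the biased iterations in their simulation order, a numeric sort key helps. The sketch below assumes only the iter_<number> naming shown in the listing; the helper name is hypothetical.

```python
def iteration_index(name):
    """Return the integer after the last underscore, e.g. 'iter_10' -> 10."""
    return int(name.rsplit("_", 1)[1])


# Folder names as they appear in the listing (lexicographic order).
folders = ["iter_0", "iter_1", "iter_10", "iter_2", "iter_9"]

# Sort by the numeric suffix instead of by text.
ordered = sorted(folders, key=iteration_index)
print(ordered)
```

    The same key works for the version_<number> folders under lightning_logs.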


    Dataset for Trajectories:

    All generated trajectory files are included in this directory. They are organized into subdirectories named after the test cases presented in the manuscript. Below is an overview of the file structure within this folder:

    ├───chaba
    │   ├───concerted2
    │   ├───stepwise
    │   └───concerted1
    ├───DA
    │   ├───Backwards
    │   ├───Forwards
    │   └───shallow
    └───SN2
        ├───Backwards
        └───Forwards

    The model system trajectories are not included in this directory because the related simulations were run directly using PLUMED, as described in the manuscript. Therefore, all relevant files are part of the NN-related datasets.

  7. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
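
    As a small illustration of the statistic discussed above, the sketch below computes the clustering R² as one minus the ratio of the within-cluster sum of squares to the total sum of squares, and shows how stretching one coordinate changes it for a fixed partition. The function name and the toy points are assumptions made for this example and are not taken from the dataset.

```python
def clustering_r2(points, labels):
    """R^2 = 1 - SSW / SST for a given partition of the points."""
    dim = len(points[0])

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(p, c):
        return sum((p[d] - c[d]) ** 2 for d in range(dim))

    grand_mean = centroid(points)
    sst = sum(sq_dist(p, grand_mean) for p in points)  # total sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        c = centroid(members)
        ssw += sum(sq_dist(p, c) for p in members)
    return 1.0 - ssw / sst


# Two clusters separated along x, with most of the spread along y.
points = [(0.0, 0.0), (0.0, 4.0), (1.0, 0.0), (1.0, 4.0)]
labels = [0, 0, 1, 1]
r2_original = clustering_r2(points, labels)

# Stretch the x coordinate by a factor of 10; the partition is unchanged.
stretched = [(10.0 * x, y) for x, y in points]
r2_stretched = clustering_r2(stretched, labels)
print(r2_original, r2_stretched)
```

    The partition is identical in both cases, yet R² rises from about 0.06 to about 0.86 after stretching, which is the kind of inflation the note warns about.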

  8. Data from: Evaluating the contributions of purifying selection and...

    • zenodo.org
    • datadryad.org
    txt
    Updated Jun 2, 2022
    Cite
    Ana Y. Morales-Arce; Rebecca Harris; Anne Stone; Jeffrey Jensen (2022). Evaluating the contributions of purifying selection and progeny-skew in dictating within-host Mycobacterium tuberculosis evolution [Dataset]. http://doi.org/10.5061/dryad.1ns1rn8qq
    Available download formats: txt
    Dataset updated
    Jun 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ana Y. Morales-Arce; Rebecca Harris; Anne Stone; Jeffrey Jensen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The within-host evolutionary dynamics of TB remain unclear, and underlying biological characteristics render standard population genetic approaches based upon the Wright-Fisher model largely inappropriate. In addition, the compact genome combined with an absence of recombination is expected to result in strong purifying selection effects. Thus, it is imperative to establish a biologically-relevant evolutionary framework incorporating these factors in order to enable an accurate study of this important human pathogen. Further, such a model is critical for inferring fundamental evolutionary parameters related to patient treatment, including mutation rates and the severity of infection bottlenecks. We here implement such a model and infer the underlying evolutionary parameters governing within-patient evolutionary dynamics. Results demonstrate that the progeny skew associated with the clonal nature of TB severely reduces genetic diversity and that the neglect of this parameter in previous studies has led to significant mis-inference of mutation rates. As such, our results suggest an underlying de novo mutation rate that is considerably faster than previously inferred, and a progeny distribution differing significantly from Wright-Fisher assumptions. This inference represents a more appropriate evolutionary null model, against which the periodic effects of positive selection, associated with drug-resistance for example, may be better assessed.

  9. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 4, 2024
    + more versions
    Cite
    Vilhelm von Ehrenheim (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957401
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Richard Anselmo Stahl
    Vilhelm von Ehrenheim
    Drew McCornack
    Armin Catovic
    Mark Granroth-Wilding
    Lele Cao
    Dhiana Deva Cavacanti Rocha
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

    Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set contains 76 distinct target companies, each of which has 5.3 annotated competitors on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.

    Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

    Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
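
    The embedding-space search described above can be sketched in a few lines: compute the cosine similarity between a seed company's embedding and every other embedding, then rank. The company names, the two-dimensional vectors and the function names below are hypothetical; real embeddings from the models named in the description have far more dimensions.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def most_similar(seed, embeddings, top_k=2):
    """Rank all other companies by cosine similarity to the seed."""
    scores = {
        name: cosine_similarity(embeddings[seed], vec)
        for name, vec in embeddings.items()
        if name != seed
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy embeddings for hypothetical companies.
embeddings = {
    "seed_co": [1.0, 0.0],
    "close_co": [0.9, 0.1],
    "unrelated_co": [0.0, 1.0],
    "opposite_co": [-1.0, 0.0],
}
print(most_similar("seed_co", embeddings))
```

    An exhaustive scan like this is adequate for a handful of companies; at the scale of the full graph an approximate nearest-neighbour index would normally replace it.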

    However, a graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial: https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  10. DiscoFuse Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Mor Geva; Eric Malmi; Idan Szpektor; Jonathan Berant, DiscoFuse Dataset [Dataset]. https://paperswithcode.com/dataset/discofuse
    Authors
    Mor Geva; Eric Malmi; Idan Szpektor; Jonathan Berant
    Description

    DiscoFuse was created by applying a rule-based splitting method on two corpora - sports articles crawled from the Web, and Wikipedia. See the paper for a detailed description of the dataset generation process and evaluation.

    DiscoFuse has two parts with 44,177,443 and 16,642,323 examples sourced from Sports articles and Wikipedia, respectively.

    For each part, a random split into train (98% of the examples), development (1%) and test (1%) sets is provided. In addition, as the original data distribution is highly skewed (see details in the paper), a balanced version of each part is also provided.
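
    A 98/1/1 random split of the kind described above can be sketched as follows. This is not the procedure used to build the released splits; the function name, the fixed seed and the rounding rule (development and test each get one hundredth of the examples, rounded down) are assumptions for illustration.

```python
import random


def split_examples(examples, seed=0):
    """Shuffle and split into train (about 98%), development (1%) and test (1%)."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_dev = n_test = len(items) // 100
    n_train = len(items) - n_dev - n_test
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_examples(range(1000))
print(len(train), len(dev), len(test))
```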

  11. Data from: Sediment particle size analysis for stations from the Western...

    • data-search.nerc.ac.uk
    http
    Updated Jul 25, 2020
    + more versions
    Cite
    UK Polar Data Centre, Natural Environment Research Council, UK Research & Innovation (2020). Sediment particle size analysis for stations from the Western Barents Sea for summer 2017 and 2018 [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/api/records/GB_NERC_BAS_PDC_01373
    Available download formats: http
    Dataset updated
    Jul 25, 2020
    Dataset provided by
    Natural Environment Research Council (https://www.ukri.org/councils/nerc)
    Authors
    UK Polar Data Centre, Natural Environment Research Council, UK Research & Innovation
    Time period covered
    Jul 19, 2018 - Jul 28, 2018
    Description

    Sediment particle size frequency distributions from the USNL (United States Naval Laboratory) box cores were determined optically using a Malvern Mastersizer 2000 He-Ne LASER diffraction sizer and were used to resolve mean particle size, sorting, skewness and kurtosis.

    Samples were collected on cruises JR16006 and JR17007.

    Funding was provided by "The Changing Arctic Ocean Seafloor (ChAOS) - how changing sea ice conditions impact biological communities, biogeochemical processes and ecosystems" project (NE/N015894/1 and NE/P006426/1, 2017-2021), part of the NERC funded Changing Arctic Ocean programme.

  12. Misleading characterization of data.

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Eckhard Limpert; Werner A. Stahel (2023). Misleading characterization of data. [Dataset]. http://doi.org/10.1371/journal.pone.0021403.t001
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Eckhard Limpert; Werner A. Stahel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    a, Frequently, variation in data from across the sciences is characterized with the arithmetic mean and the standard deviation SD. Often, it is evident from the numbers that the data have to be skewed. This becomes clear if the lower end of the 95% interval of normal variation, mean - 2 SD, extends below zero, thus failing the "95% range check", as is the case for all cited examples. Values in bold contradict the positive nature of the data. b, More often, variation is described with the standard error of the mean, SEM (SD = SEM · √n, with n = sample size). Such distributions are often even more skewed, and their original characterization as being symmetric is even more misleading. Original values are given in italics (°estimated from graphs). Most often, each reference cited contains several examples, in addition to the case(s) considered here. Table 2 collects further examples.
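
    The "95% range check" and the SEM-to-SD conversion described above amount to two short calculations. The sketch below is a minimal illustration with made-up numbers; the function names are assumptions and none of the values come from the table.

```python
from math import sqrt


def sd_from_sem(sem, n):
    """Recover the standard deviation from the standard error: SD = SEM * sqrt(n)."""
    return sem * sqrt(n)


def fails_range_check(mean, sd):
    """True if mean - 2 SD falls below zero, which cannot describe
    strictly positive data under a symmetric (normal) model."""
    return mean - 2.0 * sd < 0.0


# Hypothetical summary of a positive quantity reported as mean and SEM.
reported_mean, reported_sem, sample_size = 10.0, 2.0, 9
sd = sd_from_sem(reported_sem, sample_size)
print(sd, fails_range_check(reported_mean, sd))
```

    Here the reported SEM of 2 looks small, but the implied SD of 6 puts the lower end of the interval at -2, so the summary fails the check.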

  13. Gender

    • sandbox.municipal.osim.link
    • oxbow.ca
    • +74 more
    Updated Jul 14, 2023
    Cite
    (2023). Gender [Dataset]. https://sandbox.municipal.osim.link/work/bylaw
    Dataset updated
    Jul 14, 2023
    Description

    Age-sex charts emphasize the gap between the numbers of males and females in a specific age group. They also illustrate the age and gender trends across all age and gender groupings. A chart skewed heavily to the left describes a very young population, while a chart skewed heavily to the right illustrates an aging population.

  14. Data and Code for: Experience-based Discrimination

    • openicpsr.org
    Updated Jun 22, 2023
    Cite
    Louis-Pierre Lepage (2023). Data and Code for: Experience-based Discrimination [Dataset]. http://doi.org/10.3886/E192292V1
    Dataset updated
    Jun 22, 2023
    Dataset provided by
    American Economic Association (http://www.aeaweb.org/)
    Authors
    Louis-Pierre Lepage
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2018 - Mar 31, 2022
    Area covered
    US
    Description

    I present and test a mechanism through which discrimination arises from individual experiences of employers with worker groups. I propose a model in which employers are initially uncertain about the productivity of one of two groups, for example a minority group, and learn through hiring. Learning is endogenous, because hiring experiences of an employer shape their subsequent decisions to hire from the group and therefore learn about its productivity. Positive experiences with the uncertain group lead to positive biases which correct themselves by leading employers to hire more from the group and learn more. In contrast, negative experiences decrease hiring and learning which preserves negative biases, leads to a negatively-skewed belief distribution about the group's productivity across employers, and can cause persistent discrimination in the form of a wage gap. The model explains apparent prejudice as "inaccurate" statistical discrimination and generates novel predictions and policy implications. I then illustrate the formation of biased beliefs from experience in an experimental labor market and find support for key model predictions.

  15. Fit index and LRT false positive rate (of 500 samples) for all models and...

    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Kiero Guerra-Peña; Zoilo Emilio García-Batista; Sarah Depaoli; Luis Eduardo Garrido (2023). Fit index and LRT false positive rate (of 500 samples) for all models and normal data (skew and kurtosis = 0). [Dataset]. http://doi.org/10.1371/journal.pone.0231525.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kiero Guerra-Peña; Zoilo Emilio García-Batista; Sarah Depaoli; Luis Eduardo Garrido
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit index and LRT false positive rate (of 500 samples) for all models and normal data (skew and kurtosis = 0).

  16. Northern Ireland Annual Descriptive House Price Statistics (LGD Level) -...

    • admin.opendatani.gov.uk
    Updated Feb 19, 2020
    + more versions
    Cite
    (2020). Northern Ireland Annual Descriptive House Price Statistics (LGD Level) - Dataset - Open Data NI [Dataset]. https://admin.opendatani.gov.uk/dataset/northern-ireland-annual-descriptive-house-price-statistics-lgd-level
    Explore at:
    Dataset updated
    Feb 19, 2020
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    Ireland, Northern Ireland
    Description

    Annual descriptive price statistics for each calendar year 2005 – 2023 for 11 Local Government Districts in Northern Ireland. The statistics include:
    • Minimum sale price
    • Lower quartile sale price
    • Median sale price
    • Simple mean sale price
    • Upper quartile sale price
    • Maximum sale price
    • Number of verified sales

    Prices are available where at least 30 sales that could be included in the regression model were recorded in the area within the calendar year. The following sales are excluded:
    • Non-arm's-length sales
    • Sales of properties where the habitable space is less than 30 m² or greater than 1000 m²
    • Sales of less than £20,000

    Annual median or simple mean prices should not be used to calculate the property price change over time. The quality of the properties sold (where quality refers to the combination of all characteristics of a residential property, both physical and locational) may differ from one time period to another. For example, sales in one quarter could be disproportionately skewed towards low-quality properties, producing a biased estimate of average price. The median and simple mean prices are not 'standardised', so the varying mix of properties sold in each quarter could give a false impression of the actual change in prices. To calculate the pure property price change over time it is necessary to compare like with like, which can only be achieved if the 'characteristics-mix' of properties traded is standardised. For this purpose, please use the standardised prices in the NI House Price Index Detailed Statistics file.
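
The exclusion rules and the 30-sale threshold described above can be sketched in Python as follows. This is only an illustration, not the publisher's method: the record fields (`price`, `floor_area_m2`, `arms_length`) and both function names are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the description does not say how quartiles are computed.

```python
from statistics import mean, median, quantiles

MIN_SALES = 30  # threshold stated in the dataset description


def is_eligible(sale):
    """Apply the stated exclusions; the field names are hypothetical."""
    return (
        sale["arms_length"]                      # drop non-arm's-length sales
        and 30 <= sale["floor_area_m2"] <= 1000  # drop < 30 m2 or > 1000 m2
        and sale["price"] >= 20_000              # drop sales under 20,000 pounds
    )


def describe_prices(sales):
    """Return descriptive statistics for one district and year, or None
    when fewer than MIN_SALES eligible sales remain."""
    prices = sorted(s["price"] for s in sales if is_eligible(s))
    if len(prices) < MIN_SALES:
        return None
    # Assumed quartile convention; the source does not specify one.
    lower_q, _, upper_q = quantiles(prices, n=4, method="inclusive")
    return {
        "min": prices[0],
        "lower_quartile": lower_q,
        "median": median(prices),
        "mean": mean(prices),
        "upper_quartile": upper_q,
        "max": prices[-1],
        "count": len(prices),
    }
```

As the description warns, the unstandardised median and mean returned by a summary like this should not be compared across periods to measure price change.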

  17. Grain-size distribution of sediments from DSDP Leg 65 Holes

    • search.dataone.org
    • doi.pangaea.de
    Updated Jan 6, 2018
    Cite
    Gutiérrez-Estrada, Mario (2018). Grain-size distribution of sediments from DSDP Leg 65 Holes [Dataset]. http://doi.org/10.1594/PANGAEA.818016
    Explore at:
    Dataset updated
    Jan 6, 2018
    Dataset provided by
    PANGAEA Data Publisher for Earth and Environmental Science
    Authors
    Gutiérrez-Estrada, Mario
    Time period covered
    Jan 24, 1979 - Mar 5, 1979
    Area covered
    Description

    The grain-size distribution of 223 unconsolidated sediment samples from four DSDP sites at the mouth of the Gulf of California was determined using sieve and pipette techniques. Shepard's (1954) and Inman's (1952) classification schemes were used for all samples. Most of the sediments are hemipelagic with minor turbidites of terrigenous origin. Sediment texture ranges from silty sand to silty clay. On the basis of grain-size parameters, the sediments can be divided into the following groups: (1) poorly to very poorly sorted coarse and medium sand; and (2) poorly to very poorly sorted fine to very fine sand and clay.

  18. Data from: Structured Variational Approximations with Skew Normal...

    • tandf.figshare.com
    pdf
    Updated Jun 11, 2025
    Cite
    Robert Salomone; Xuejun Yu; David J. Nott; Robert Kohn (2025). Structured Variational Approximations with Skew Normal Decomposable Graphical Models and Implicit Copulas [Dataset]. http://doi.org/10.6084/m9.figshare.25222258.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Robert Salomone; Xuejun Yu; David J. Nott; Robert Kohn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although there is much recent work developing flexible variational methods for Bayesian computation, Gaussian approximations with structured covariance matrices are often preferred computationally in high-dimensional settings. This article considers approximate inference methods for complex latent variable models where the posterior is close to Gaussian, but with some skewness in the posterior marginals. We consider skew decomposable graphical models (SDGMs), which are based on the closed skew normal family of distributions, as variational approximations. These approximations can reflect the true posterior conditional independence structure and capture posterior skewness. To increase flexibility, implicit copula SDGM approximations are also developed, where elementwise transformations of an approximately standardized SDGM random vector are considered. This implicit copula extension is an important contribution of our work, and improves the accuracy of SDGM approximations for only a modest increase in computational cost. Our parameterization of the copula approximation is novel, even in the Gaussian case. Performance of the methods is examined in a number of real examples involving generalized linear mixed models and state space models. Supplemental materials including code and appendix are available online.

  19. Fit index false positive rate (of 500 samples) for all models and slightly...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Kiero Guerra-Peña; Zoilo Emilio García-Batista; Sarah Depaoli; Luis Eduardo Garrido (2023). Fit index false positive rate (of 500 samples) for all models and slightly nonnormal data (skew = 1 and kurtosis = 2). [Dataset]. http://doi.org/10.1371/journal.pone.0231525.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kiero Guerra-Peña; Zoilo Emilio García-Batista; Sarah Depaoli; Luis Eduardo Garrido
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit index false positive rate (of 500 samples) for all models and slightly nonnormal data (skew = 1 and kurtosis = 2).

  20. Results based on 250 replications of skew generalized t-link samples (probit...

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    + more versions
    Cite
    Chénangnon Frédéric Tovissodé; Aliou Diop; Romain Glèlè Kakaï (2023). Results based on 250 replications of skew generalized t-link samples (probit and skew-probit fits). [Dataset]. http://doi.org/10.1371/journal.pone.0249604.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Chénangnon Frédéric Tovissodé; Aliou Diop; Romain Glèlè Kakaï
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results based on 250 replications of skew generalized t-link samples (probit and skew-probit fits).
