31 datasets found

f
A Novel Generalized Normal Distribution for Human Longevity and other...
plos.figshare.com
docx
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Henry T. Robertson; David B. Allison (2023). A Novel Generalized Normal Distribution for Human Longevity and other Negatively Skewed Data [Dataset]. http://doi.org/10.1371/journal.pone.0037025
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0037025
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Henry T. Robertson; David B. Allison
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Negatively skewed data arise occasionally in statistical practice; perhaps the most familiar example is the distribution of human longevity. Although other generalizations of the normal distribution exist, we demonstrate a new alternative that apparently fits human longevity data better. We propose an alternative approach of a normal distribution whose scale parameter is conditioned on attained age. This approach is consistent with previous findings that longevity conditioned on survival to the modal age behaves like a normal distribution. We derive such a distribution and demonstrate its accuracy in modeling human longevity data from life tables. The new distribution is characterized by 1. An intuitively straightforward genesis; 2. Closed forms for the pdf, cdf, mode, quantile, and hazard functions; and 3. Accessibility to non-statisticians, based on its close relationship to the normal distribution.
f
Dataset for: Some Remarks on the R2 for Clustering
wiley.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6124508.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Nicola Loperfido; Thaddeus Tarpey
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by ``stretching'' and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how that $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
f
Data Sheet 1_The impact of distribution properties on sampling behavior.docx...
figshare.com
docx
Updated Sep 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thai Quoc Cao; Benjamin Scheibehenne (2025). Data Sheet 1_The impact of distribution properties on sampling behavior.docx [Dataset]. http://doi.org/10.3389/fpsyg.2025.1597227.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fpsyg.2025.1597227.s001
Dataset updated
Sep 30, 2025
Dataset provided by
Frontiers
Authors
Thai Quoc Cao; Benjamin Scheibehenne
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ObjectivePeople often have their decisions influenced by rare outcomes, such as buying a lottery and believing they will win, or not buying a product because of a few negative reviews. Previous research has pointed out that this tendency is due to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people’s search behavior is biased by rare outcomes, and they can adjust the estimation of option value to be closer to the true mean, reflecting cognitive processes to adjust for sampling bias.MethodsWe recruited 180 participants through Prolific to take part in an online shopping task. On each trial, participants saw a histogram with five bins, representing the percentage of one- to five-star ratings of previous customers on a product. They could click on each bin of the histogram to examine an individual review that gave that product the corresponding star; the review was represented using a number from 0–100 called the positivity score. The goal of the participants was to sample the bins so that they could get the closest estimate of the average positivity score as possible, and they were incentivized based on accuracy of estimation. We varied the shape of the histograms within subject and the number of samples they had between subjects to examine how rare outcomes in skewed distributions influenced sampling behavior and whether having more samples would help people adjust their estimation to be closer to the true mean.ResultsBinomial tests confirmed sampling biases toward rare outcomes. Compared with 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and samples on estimation adjustment, defined as the difference between experienced /observed means and participants’ estimates. In the negatively skewed distribution, estimates were on average 7% closer to the true mean compared with the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants’ estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.ConclusionThese findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human–algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.
n
Data from: Selection on skewed characters and the paradox of stasis
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 8, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin (2017). Selection on skewed characters and the paradox of stasis [Dataset]. http://doi.org/10.5061/dryad.pt07g
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.pt07g
Dataset updated
Sep 8, 2017
Dataset provided by
Centre National de la Recherche Scientifique
Authors
Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold’s (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or additive genetic covariances between relative fitness and trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate samples size. Ecologically relevant traits may commonly have skewed distributions, as we here exemplify with avian laying date – repeatedly described as more evolutionarily stable than expected –, so this skewness should be accounted for when investigating evolutionary dynamics in the wild.
Additional file 3 of Modelling count, bounded and skewed continuous outcomes...
springernature.figshare.com
txt
Updated Jun 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 3 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774300.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22774300.v1
Dataset updated
Jun 2, 2023
Dataset provided by
figshare
Authors
Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Material 3: A supplementary file with examples of SAS script for all models that have been fitted in this paper.
Ames Housing Engineered Dataset
kaggle.com
Updated Sep 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Atefeh Amjadian (2025). Ames Housing Engineered Dataset [Dataset]. https://www.kaggle.com/datasets/atefehamjadian/ameshousing-engineered
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 27, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Atefeh Amjadian
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
Ames
Description
This dataset is an engineered version of the original Ames Housing dataset from the "House Prices: Advanced Regression Techniques" Kaggle competition. The goal of this engineering was to clean the data, handle missing values, encode categorical features, scale numeric features, manage outliers, reduce skewness, select useful features, and create new features to improve model performance for house price prediction.

The original dataset contains information on 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, with the target variable being SalePrice. This engineered version has undergone several preprocessing steps to make it ready for machine learning models.

Preprocessing Steps Applied

Missing Value Handling: Missing values in categorical columns with meaningful absence (e.g., no pool for PoolQC) were filled with "None". Numeric columns were filled with median, and other categorical columns with mode.

Correlation-based Feature Selection: Numeric features with absolute correlation < 0.1 with SalePrice were removed.

Encoding Categorical Variables: Ordinal features (e.g., quality ratings) were encoded using OrdinalEncoder, and nominal features (e.g., neighborhoods) using OneHotEncoder.

Outlier Handling: Outliers in numeric features were detected using IQR and capped (Winsorized) to IQR bounds to preserve data while reducing extreme values.

Skewness Handling: Highly skewed numeric features (|skew| > 1) were transformed using Yeo-Johnson to make distributions more normal-like.

Additional Feature Selection: Low-variance one-hot features (variance < 0.01) and highly collinear features (|corr| > 0.8) were removed.

Feature Scaling: Numeric features were scaled using RobustScaler to handle outliers.

Duplicate Removal: Duplicate rows were checked and removed if found (none in this dataset).

The final dataset has fewer columns than the original (reduced from 81 to approximately 250 after one-hot encoding, then further reduced by feature selection), with improved quality for modeling.

New Features Created

To add more predictive power, the following new features were created based on domain knowledge: 1. HouseAge: Age of the house at the time of sale. Calculated as YrSold - YearBuilt. This captures how old the house is, which can negatively affect price due to depreciation. - Example: A house built in 2000 and sold in 2008 has HouseAge = 8. 2. Quality_x_Size: Interaction term between overall quality and living area. Calculated as OverallQual * GrLivArea. This combines quality and size to capture the value of high-quality large homes. - Example: A house with OverallQual = 7 and GrLivArea = 1500 has Quality_x_Size = 10500. 3. TotalSF: Total square footage of the house. Calculated as GrLivArea + TotalBsmtSF + 1stFlrSF + 2ndFlrSF (if available). This aggregates area features into a single metric for better price prediction. - Example: If GrLivArea = 1500 and TotalBsmtSF = 1000, TotalSF = 2500. 4. Log_LotArea: Log-transformed lot area to reduce skewness. Calculated as np.log1p(LotArea). This makes the distribution of lot sizes more normal, helping models handle extreme values. - Example: A lot area of 10000 becomes Log_LotArea ≈ 9.21.

These new features were created using the original (unscaled) values to maintain interpretability, then scaled with RobustScaler to match the rest of the dataset.

Data Dictionary

Original Numeric Features: Kept features with |corr| > 0.1 with SalePrice, such as:

OverallQual: Material and finish quality (scaled, 1-10).

GrLivArea: Above grade (ground) living area square feet (scaled).

GarageCars: Size of garage in car capacity (scaled).

TotalBsmtSF: Total square feet of basement area (scaled).

And others like FullBath, YearBuilt, etc. (see the code for the full list).

Ordinal Encoded Features: Quality and condition ratings, e.g.:

ExterQual: Exterior material quality (encoded as 0=Po to 4=Ex).

BsmtQual: Basement quality (encoded as 0=None to 5=Ex).

One-Hot Encoded Features: Nominal categorical features, e.g.:

MSZoning_RL: 1 if residential low density, 0 otherwise.

Neighborhood_NAmes: 1 if in NAmes neighborhood, 0 otherwise.

New Engineered Features (as described above):

HouseAge: Age of the house (scaled).

Quality_x_Size: Overall quality times living area (scaled).

TotalSF: Total square footage (scaled).

Log_LotArea: Log-transformed lot area (scaled).

Target: SalePrice - The property's sale price in dollars (not scaled, as it's the target).

Total columns: Approximately 200-250 (after one-hot encoding and feature selection).

License

This dataset is derived from the Ames Housing...
n
Data from: Uneven missing data skew phylogenomic relationships within the...
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
+1more
zip
Updated Jul 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Brian Smith; William M. Mauck III; Brett W. Benz; Michael J. Andersen (2022). Uneven missing data skew phylogenomic relationships within the lories and lorikeets [Dataset]. http://doi.org/10.5061/dryad.n5tb2rbsp
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.n5tb2rbsp
Dataset updated
Jul 21, 2022
Dataset provided by
American Museum of Natural History
University of Michigan
New York Genome Center
University of New Mexico
Authors
Brian Smith; William M. Mauck III; Brett W. Benz; Michael J. Andersen
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Inlcuded is the supplementary data for Smith, B. T., Mauck, W. M., Benz, B., & Andersen, M. J. (2018). Uneven missing data skews phylogenomic relationships within the lories and lorikeets. BioRxiv, 398297. The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impacts phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9x more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.
C
EasyGSH-DB: Skew (1996, 2006, 2016)
ckan.mobidatalab.eu
Updated Jun 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bundesanstalt für Wasserbau (2023). EasyGSH-DB: Skew (1996, 2006, 2016) [Dataset]. https://ckan.mobidatalab.eu/dataset/easygsh-db-skewed-1996-2006-2016
Explore at:
http://publications.europa.eu/resource/authority/file-type/tiff, http://publications.europa.eu/resource/authority/file-type/wfs_srvcAvailable download formats
Dataset updated
Jun 16, 2023
Dataset provided by
Bundesanstalt für Wasserbau
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 30, 1996 - Jun 30, 2016
Description
Definition: The skewness "Sk1" is a measure of the symmetry of the cumulative curve, which indicates the ratio of coarse to fine parts in the particle size distribution. Folk & Ward (1957) quantify this symmetry in a value range from -1 to 1. Positive values greater than 0 to 1 indicate a "left skewing" for metric cumulative curves, i.e. fine grain fractions predominate in comparison to coarse fractions. Negative values of less than 0 to -1 indicate a "right-skewing" for metric cumulative curves, which correspondingly indicates a predominance of coarse compared to fine fractions. Sk1 = 0 indicates a perfectly symmetrical cumulative curve. Conclusions about the deposition environment can be drawn from the skewness. Data generation: The basis for sedimentological evaluations are surface sediment samples, which were interpolated within the framework of the EasyGSH project using anisotropic interpolation methods and taking into account hydrodynamic factors and erosion and sedimentation processes from individual samples from different years to a grid valid for one year. The sediment distribution is therefore available as a cumulative curve at each of these grid nodes. For the German Bight, this basic product is available for the years 1996, 2006 and 2016 in a 100 m grid, for the exclusive economic zone of Germany for the year 1996 in a 250 m grid. The parts for ϕ5, ϕ16, ϕ50, ϕ84 and ϕ95 required for the calculation rule for the skewness according to Folk & Ward (1957) can be determined directly from these cumulative curves and the skewness parameter Sk1 can be calculated. Product: 100 m grid of the German Bight (1996, 2006, 2016) or 250 m grid of the Exclusive Economic Zone (1996), on which the skewness Sk1 according to Folk & Ward (1957) is stored at each grid node. The product is provided in GeoTiff format. Literature: Folk, R.L., & Ward, W.C. (1957). A study in the significance of grain size parameters. Journal of Petrology, 37, 327-354. For further information, please refer to the information portal (http://easygsh.wb.tu-harburg.de/) and the download portal (https://mdi-de.baw.de/easygsh/). English Download: The data for download can be found under References ("further references"), where the data can be downloaded directly or via the web page redirection to the EasyGSH-DB portal. For further information, please refer to the download portal (https://mdi-de.baw.de/easygsh/EasyEN_index.html).
r
Data from: STCSSP: A FORTRAN 77 routine to compute a structured staircase...
resodate.org
Updated Dec 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tobias Brüll; Volker Mehrmann (2021). STCSSP: A FORTRAN 77 routine to compute a structured staircase form for a (skew-)symmetric/(skew-)symmetric matrix pencil [Dataset]. http://doi.org/10.14279/depositonce-14386
Explore at:
Unique identifier
https://doi.org/10.14279/depositonce-14386
Dataset updated
Dec 17, 2021
Dataset provided by
Technische Universität Berlin
DepositOnce
Authors
Tobias Brüll; Volker Mehrmann
Description
This paper contains the description of the algorithm STCSSP and its interface. STCSSP is a FORTRAN subroutine that computes a structured staircase form for a real (skew-) symmetric / (skew-) symmetric matrix pencil, i.e., a pencil where each of the two matrices is either symmetric or skew-symmetric. An example how to call the subroutine is given.
Training data, trained neural network models, trajectories and PLUMED input...
zenodo.org
zip
Updated Apr 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhikun Zhang; Zhikun Zhang; GiovanniMaria Piccini; GiovanniMaria Piccini (2025). Training data, trained neural network models, trajectories and PLUMED input files for manuscript "Exploring Chemistry and Catalysis by Biasing Skewed Distributions via Deep Learning" [Dataset]. http://doi.org/10.26434/chemrxiv-2025-cvb1v-v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.26434/chemrxiv-2025-cvb1v-v2
Dataset updated
Apr 26, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Zhikun Zhang; Zhikun Zhang; GiovanniMaria Piccini; GiovanniMaria Piccini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 26, 2025
Description
The following datasets contains two main branches: dataset for neural network (NN) and trajectories of the simulations demontrated in both the main body and the supporting information of the corresponding preprint.

Dataset for Neural Network (NN):

All datasets related to the NN training procedure are located in the "NN-models-and-training-data" directory. Within this parent directory, each subfolder corresponds to a specific case study presented in the manuscript. Each subfolder for test cases contains:

Training Datasets: COLVAR files used for training.

Trained Models: Skewencoder models (.pt files) from each biased iteration of the simulation.

PLUMED Files: Used for generating the COLVAR files.

Lightning Logs: Logs generated during training.

For example, consider the SN2 subfolder. The structure of this folder is as follows:

├───Reverse
│ ├───unbiased
│ ├───results
│ │ ├───iter_0
│ │ │ └───data
│ │ ├───iter_1
│ │ │ └───data
│ │ ├───iter_10
│ │ │ └───data
│ │ ├───iter_2
│ │ │ └───data
│ │ ├───iter_3
│ │ │ └───data
│ │ ├───iter_4
│ │ │ └───data
│ │ ├───iter_5
│ │ │ └───data
│ │ ├───iter_6
│ │ │ └───data
│ │ ├───iter_7
│ │ │ └───data
│ │ ├───iter_8
│ │ │ └───data
│ │ └───iter_9
│ │ └───data
│ └───lightning_logs
│ ├───version_0
│ ├───version_1
│ ├───version_10
│ ├───version_2
│ ├───version_3
│ ├───version_4
│ ├───version_5
│ ├───version_6
│ ├───version_7
│ ├───version_8
│ └───version_9
└───Forward
├───results
│ ├───iter_0
│ │ └───data
│ ├───iter_1
│ │ └───data
│ ├───iter_10
│ │ └───data
│ ├───iter_2
│ │ └───data
│ ├───iter_3
│ │ └───data
│ ├───iter_4
│ │ └───data
│ ├───iter_5
│ │ └───data
│ ├───iter_6
│ │ └───data
│ ├───iter_7
│ │ └───data
│ ├───iter_8
│ │ └───data
│ └───iter_9
│ └───data
└───unbiased

The reverse and forward folders correspond to specific reaction directions described in the manuscript. The unbiased folder contains the unbiased simulation training data along with the PLUMED input file used for data generation. In the results folder, each subfolder represents a biased simulation iteration and includes:

The trained model.

The PLUMED input file for the simulation.

The generated COLVAR file.

Dataset for Trajectories:

All generated trajectory files are included in this directory. They are organized into subdirectories named after the test cases presented in the manuscript. Below is an overview of the file structure within this folder:

├───chaba
│ ├───concerted2
│ ├───stepwise
│ └───concerted1
├───DA
│ ├───Backwards
│ ├───Forwards
│ └───shallow
└───SN2
├───Backwards
└───Forwards

The model system trajectories are not included in this directory because the related simulations were run directly using PLUMED, as described in the manuscript. Therefore, all relevant files are part of the NN-related datasets.
Data from: Sediment particle size analysis for stations from the Western...
data-search.nerc.ac.uk
http
Updated Jul 25, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UK Polar Data Centre, Natural Environment Research Council, UK Research & Innovation (2020). Sediment particle size analysis for stations from the Western Barents Sea for summer 2017 and 2018 [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/api/records/GB_NERC_BAS_PDC_01373
Explore at:
httpAvailable download formats
Dataset updated
Jul 25, 2020
Dataset provided by
Natural Environment Research Councilhttps://www.ukri.org/councils/nerc
Authors
UK Polar Data Centre, Natural Environment Research Council, UK Research & Innovation
Time period covered
Jul 19, 2018 - Jul 28, 2018
Area covered

Description
Sediment particle size frequency distributions from the USNL (Unites States Naval Laboratory) box cores were determined optically using a Malvern Mastersizer 2000 He-Ne LASER diffraction sizer and were used to resolve mean particle size, sorting, skewness and kurtosis.

Samples were collected on cruises JR16006 and JR17007.

Funding was provided by ''The Changing Arctic Ocean Seafloor (ChAOS) - how changing sea ice conditions impact biological communities, biogeochemical processes and ecosystems'' project (NE/N015894/1 and NE/P006426/1, 2017-2021), part of the NERC funded Changing Arctic Ocean programme.
f
Supplementary Material for: Modeling Caries Experience: Advantages of the...
karger.figshare.com
datasetcatalog.nlm.nih.gov
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hofstetter H.; Dusseldorp E.; Zeileis A.; Schuller A.A. (2023). Supplementary Material for: Modeling Caries Experience: Advantages of the Use of the Hurdle Model [Dataset]. http://doi.org/10.6084/m9.figshare.3833571.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3833571.v1
Dataset updated
May 30, 2023
Dataset provided by
Karger Publishers
Authors
Hofstetter H.; Dusseldorp E.; Zeileis A.; Schuller A.A.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In dental epidemiology, the decayed (D), missing (M), and filled (F) teeth or surfaces index (DFM index) is a frequently used measure. The DMF index is characterized by a strongly positive skewed distribution with a large stack of zero counts for those individuals without caries experience. Therefore, standard generalized linear models often lead to a poor fit. The hurdle regression model is a highly suitable class to model a DMF index, but its use is subordinated. We aim to overcome the gap between the suitability of the hurdle model to fit DMF indices and the frequency of its use in caries research. A theoretical introduction to the hurdle model is provided, and an extensive comparison with the zero-inflated model is given. Using an illustrative data example, both types of models are compared, with a special focus on interpretation of their parameters. Accompanying R code and example data are provided as online supplementary material.
Additional file 2 of Modelling count, bounded and skewed continuous outcomes...
springernature.figshare.com
text/x-diff
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 2 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774297.v1
Explore at:
text/x-diffAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22774297.v1
Dataset updated
Jun 2, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Material 2: A supplementary file with examples of STATA script for all models that have been fitted in this paper.
Data and Code for: Experience-based Discrimination
openicpsr.org
Updated Jun 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Louis-Pierre Lepage (2023). Data and Code for: Experience-based Discrimination [Dataset]. http://doi.org/10.3886/E192292V1
Explore at:
Unique identifier
https://doi.org/10.3886/E192292V1
Dataset updated
Jun 22, 2023
Dataset provided by
American Economic Associationhttp://www.aeaweb.org/
Authors
Louis-Pierre Lepage
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 1, 2018 - Mar 31, 2022
Area covered
US
Description
I present and test a mechanism through which discrimination arises from individual experiences of employers with worker groups. I propose a model in which employers are initially uncertain about the productivity of one of two groups, for example a minority group, and learn through hiring. Learning is endogenous, because hiring experiences of an employer shape their subsequent decisions to hire from the group and therefore learn about its productivity. Positive experiences with the uncertain group lead to positive biases which correct themselves by leading employers to hire more from the group and learn more. In contrast, negative experiences decrease hiring and learning which preserves negative biases, leads to a negatively-skewed belief distribution about the group's productivity across employers, and can cause persistent discrimination in the form of a wage gap. The model explains apparent prejudice as "inaccurate" statistical discrimination and generates novel predictions and policy implications. I then illustrate the formation of biased beliefs from experience in an experimental labor market and find support for key model predictions.
Additional file 1 of Modelling count, bounded and skewed continuous outcomes...
springernature.figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 1 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774294.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22774294.v1
Dataset updated
May 30, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Material 1: A supplementary file with examples of R script for all models that have been fitted in this paper.
s
Northern Ireland Annual Descriptive House Price Statistics (Electoral Ward...
ckan.publishing.service.gov.uk
Updated Feb 29, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Northern Ireland Annual Descriptive House Price Statistics (Electoral Ward Level) - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/northern-ireland-annual-descriptive-house-price-statistics-electoral-ward-level
Explore at:
Dataset updated
Feb 29, 2020
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Area covered
Ireland, Northern Ireland
Description
Annual descriptive price statistics for each calendar year 2005 – 2024 for 462 electoral wards within 11 Local Government Districts. The statistics include: • Minimum sale price • Lower quartile sale price • Median sale price • Simple Mean sale price • Upper Quartile sale price • Maximum sale price • Number of verified sales Prices are available where at least 30 sales were recorded in the area within the calendar year which could be included in the regression model i.e. the following sales are excluded: • Non Arms-Length sales • sales of properties where the habitable space are less than 30m2 or greater than 1000m2 • sales less than £20,000. Annual median or simple mean prices should not be used to calculate the property price change over time. The quality (where quality refers to the combination of all characteristics of a residential property, both physical and locational) of the properties that are sold may differ from one time period to another. For example, sales in one quarter could be disproportionately skewed towards low-quality properties, therefore producing a biased estimate of average price. The median and simple mean prices are not ‘standardised’ and so the varying mix of properties sold in each quarter could give a false impression of the actual change in prices. In order to calculate the pure property price change over time it is necessary to compare like with like, and this can only be achieved if the ‘characteristics-mix’ of properties traded is standardised. To calculate pure property change over time please use the standardised prices in the NI House Price Index Detailed Statistics file.
d
Grain-size distribution of sediments from DSDP Leg 65 Holes
search.dataone.org
doi.pangaea.de
Updated Jan 6, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gutiérrez-Estrada, Mario (2018). Grain-size distribution of sediments from DSDP Leg 65 Holes [Dataset]. http://doi.org/10.1594/PANGAEA.818016
Explore at:
Unique identifier
https://doi.org/10.1594/PANGAEA.818016
Dataset updated
Jan 6, 2018
Dataset provided by
PANGAEA Data Publisher for Earth and Environmental Science
Authors
Gutiérrez-Estrada, Mario
Time period covered
Jan 24, 1979 - Mar 5, 1979
Area covered

Description
The grain-size distribution of 223 unconsolidated sediment samples from four DSDP sites at the mouth of the Gulf of California was determined using sieve and pipette techniques. Shepard's (1954) and Inman's (1952) classification schemes were used for all samples. Most of the sediments are hemipelagic with minor turbidites of terrigenous origin. Sediment texture ranges from silty sand to silty clay. On the basis of grain-size parameters, the sediments can be divided into the following groups: (1) poorly to very poorly sorted coarse and medium sand; and (2) poorly to very poorly sorted fine to very fine sand and clay.
f
Data from: Integrating Data Transformation in Principal Components Analysis
tandf.figshare.com
pdf
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mehdi Maadooliat; Jianhua Z. Huang; Jianhua Hu (2023). Integrating Data Transformation in Principal Components Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.960499.v3
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.960499.v3
Dataset updated
Jun 4, 2023
Dataset provided by
Taylor & Francis
Authors
Mehdi Maadooliat; Jianhua Z. Huang; Jianhua Hu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Principal component analysis (PCA) is a popular dimension-reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
Z
CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...
data.niaid.nih.gov
Updated Jun 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Armin Catovic (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957401
Explore at:
Dataset updated
Jun 4, 2024
Dataset provided by
Lele Cao
Richard Anselmo Stahl
Vilhelm von Ehrenheim
Dhiana Deva Cavacanti Rocha
Armin Catovic
Drew McCornack
Mark Granroth-Wilding
Description
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. It contains 76 distinct target companies, each of which has 5.3 competitors annotated in average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.

Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

Background and Motivation

In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

While there is no universally agreed definition of company similarity, researchers and practitioners in PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customers' review, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

Source Code and Tutorial:https://github.com/llcresearch/CompanyKG2

Paper: to be published
f
Fit index false positive rate (of 500 samples) for all models and nonnormal...
figshare.com
xls
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kiero Guerra-Peña; Zoilo Emilio García-Batista; Sarah Depaoli; Luis Eduardo Garrido (2023). Fit index false positive rate (of 500 samples) for all models and nonnormal data (skew = 1.6 and kurtosis = 4). [Dataset]. http://doi.org/10.1371/journal.pone.0231525.t006
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0231525.t006
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Kiero Guerra-Peña; Zoilo Emilio García-Batista; Sarah Depaoli; Luis Eduardo Garrido
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Fit index false positive rate (of 500 samples) for all models and nonnormal data (skew = 1.6 and kurtosis = 4).

Facebook

Twitter

Click to copy link

Link copied

Cite

Henry T. Robertson; David B. Allison (2023). A Novel Generalized Normal Distribution for Human Longevity and other Negatively Skewed Data [Dataset]. http://doi.org/10.1371/journal.pone.0037025

A Novel Generalized Normal Distribution for Human Longevity and other Negatively Skewed Data

Explore at:

14 scholarly articles cite this dataset (View in Google Scholar)

docxAvailable download formats

Unique identifier

https://doi.org/10.1371/journal.pone.0037025

Dataset updated

May 30, 2023

Dataset provided by

PLOS ONE

Authors

Henry T. Robertson; David B. Allison

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Negatively skewed data arise occasionally in statistical practice; perhaps the most familiar example is the distribution of human longevity. Although other generalizations of the normal distribution exist, we demonstrate a new alternative that apparently fits human longevity data better. We propose an alternative approach of a normal distribution whose scale parameter is conditioned on attained age. This approach is consistent with previous findings that longevity conditioned on survival to the modal age behaves like a normal distribution. We derive such a distribution and demonstrate its accuracy in modeling human longevity data from life tables. The new distribution is characterized by 1. An intuitively straightforward genesis; 2. Closed forms for the pdf, cdf, mode, quantile, and hazard functions; and 3. Accessibility to non-statisticians, based on its close relationship to the normal distribution.

Clear search

Close search

Google apps

Main menu

A Novel Generalized Normal Distribution for Human Longevity and other...

Dataset for: Some Remarks on the R2 for Clustering

Data Sheet 1_The impact of distribution properties on sampling behavior.docx...

Data from: Selection on skewed characters and the paradox of stasis

Additional file 3 of Modelling count, bounded and skewed continuous outcomes...

Ames Housing Engineered Dataset

Preprocessing Steps Applied

New Features Created

Data Dictionary

License

Data from: Uneven missing data skew phylogenomic relationships within the...

EasyGSH-DB: Skew (1996, 2006, 2016)

Data from: STCSSP: A FORTRAN 77 routine to compute a structured staircase...

Training data, trained neural network models, trajectories and PLUMED input...

Data from: Sediment particle size analysis for stations from the Western...

Supplementary Material for: Modeling Caries Experience: Advantages of the...

Additional file 2 of Modelling count, bounded and skewed continuous outcomes...

Data and Code for: Experience-based Discrimination

Additional file 1 of Modelling count, bounded and skewed continuous outcomes...

Northern Ireland Annual Descriptive House Price Statistics (Electoral Ward...

Grain-size distribution of sediments from DSDP Leg 65 Holes

Data from: Integrating Data Transformation in Principal Components Analysis

CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

Fit index false positive rate (of 500 samples) for all models and nonnormal...

A Novel Generalized Normal Distribution for Human Longevity and other Negatively Skewed Data