Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reproducibility package for the article "Reaction times and other skewed distributions: problems with the mean and the median" by Guillaume A. Rousselet & Rand R. Wilcox. Preprint: https://psyarxiv.com/3y54r. DOI: 10.31234/osf.io/3y54r. This package contains all the code and data to reproduce the figures and analyses in the article.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-19 prediction has been essential in aiding the prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, compiled from 5 data sources on Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part applies machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand which factors are most relevant to COVID-19's occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.
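A minimal sketch of such a two-part (hurdle-style) setup, under assumptions not stated in the abstract: scikit-learn's logistic regression for the occurrence part and XGBoost for the magnitude part, with a log transform to tame the skew. The study's actual features, target handling, and hyperparameters are not reproduced here.

```python
# Hedged sketch of a two-part ("hurdle") model: a logistic classifier for whether
# any cases/deaths occur, and an XGBoost regressor for the size of the positive,
# highly skewed counts. Feature/target handling and hyperparameters are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRegressor

def fit_two_part(X_train, y_train):
    occurred = (y_train > 0).astype(int)
    part1 = LogisticRegression(max_iter=1000).fit(X_train, occurred)        # P(count > 0)
    part2 = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    part2.fit(X_train[y_train > 0], np.log1p(y_train[y_train > 0]))         # fit positives on log scale
    return part1, part2

def predict_two_part(part1, part2, X):
    p_positive = part1.predict_proba(X)[:, 1]
    conditional_count = np.expm1(part2.predict(X))
    return p_positive * conditional_count   # expected count = P(>0) * E[count | count > 0]

# Predictions would then be scored against held-out counties with RMSE and MAE.
```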
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
https://spdx.org/licenses/CC0-1.0.html
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold's (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or on additive genetic covariances between relative fitness and trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate sample sizes. Ecologically relevant traits may commonly have skewed distributions, as we exemplify here with avian laying date (repeatedly described as more evolutionarily stable than expected), so this skewness should be accounted for when investigating evolutionary dynamics in the wild.
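For reference, the selection estimates contrasted above can be written in their standard forms (the notation below is assumed here, not taken from the paper): z is the trait, w relative fitness, P the phenotypic (co)variance, s the selection differential, and sigma_A the additive genetic covariance.

```latex
% Standard forms of the selection estimates referenced above (assumed notation):
s = \operatorname{cov}(w, z), \qquad
\beta_{\mathrm{LA}} = P^{-1} s \quad \text{(Lande--Arnold directional gradient)},
\qquad
\beta_{\mathrm{AD}} = \mathbb{E}\!\left[\frac{\partial w(z)}{\partial z}\right]
  \quad \text{(average-derivative gradient)},
\qquad
\Delta\bar{z} = \sigma_A(w, z) \quad \text{(Robertson--Price identity)}.
```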
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This dataset contains site information, basin characteristics, results of flood-frequency analysis, and a generalized (regional) flood skew for 76 selected streamgages operated by the U.S. Geological Survey (USGS) in the upper White River basin (4-digit hydrologic unit 1101) in southern Missouri and northern Arkansas. The Little Rock District U.S. Army Corps of Engineers (USACE) needed updated estimates of streamflows corresponding to selected annual exceedance probabilities (AEPs) and a basin-specific regional flood skew. USGS selected 111 candidate streamgages in the study area that had 20 or more years of gaged annual peak-flow data available through the 2020 water year. After screening for regulation, urbanization, redundant/nested basins, drainage areas greater than 2,500 square miles, and streamgage basins located in the Mississippi Alluvial Plain (8-digit hydrologic unit 11010013), 77 candidate streamgages remained. After conducting the initial flood-frequency analysis ...
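Purely as an illustration of the screening criteria listed above (not the USGS workflow itself), a hypothetical site table could be filtered as follows; every column name here is an assumption, not a field of the published dataset.

```python
# Hypothetical screening of candidate streamgages, mirroring the criteria above.
# Column names (peak_years, regulated, urbanized, redundant, drainage_area_sqmi, huc8)
# are illustrative assumptions.
import pandas as pd

def screen_candidates(sites: pd.DataFrame) -> pd.DataFrame:
    keep = (
        (sites["peak_years"] >= 20)               # 20 or more years of annual peak-flow record
        & ~sites["regulated"]                     # exclude regulated basins
        & ~sites["urbanized"]                     # exclude urbanized basins
        & ~sites["redundant"]                     # exclude redundant/nested basins
        & (sites["drainage_area_sqmi"] <= 2500)   # drainage area limit
        & (sites["huc8"] != "11010013")           # exclude the Mississippi Alluvial Plain
    )
    return sites.loc[keep]
```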
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
As global climate continues to change, so too will phenology of a wide range of insects. Changes in flight season usually are characterised as shifts to earlier dates or means, with attention less often paid to flight season breadth or whether seasons are now skewed. We amassed flight season data for the insect order Odonata, the dragonflies and damselflies, for Norway over the past century-and-a-half to examine the form of flight season change. By means of Bayesian analyses that incorporated uncertainty relative to annual variability in survey effort, we estimated shifts in flight season mean, breadth, and skew. We focussed on flight season breadth, positing that it will track documented growing season expansion. A specific mechanism explored was shifts in voltinism, the number of generations per year, which tends to increase with warming. We found strong evidence for an increase in flight season breadth but much less for a shift in mean, with any shift of the latter tending toward a later mean. Skew has become rightward for suborder Zygoptera, the damselflies, but not for Anisoptera, the dragonflies, or for the Odonata as a whole. We found weak support for voltinism as a predictor of broader flight season; instead, voltinism acted interactively with use of human-modified habitats, including decrease in shading (e.g., from timber extraction). Other potential mechanisms that link warming with broadening of flight season include protracted emergence and cohort splitting, both of which have been documented in the Odonata. It is likely that warming-induced broadening of flight seasons of these widespread insect predators will have wide-ranging consequences for freshwater ecosystems.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The reliability and consistency of the many measures proposed to quantify sexual selection have been questioned for decades. Realized selection on quantitative characters measured by the selection differential i was approximated by metrics based on variance in breeding success, using either the opportunity for sexual selection Is or indices of inequality. There is no consensus about which metric best approximates realized selection on sexual characters. Recently, the opportunity for selection on character mean OSM was proposed to quantify the maximum potential selection on characters. Using 21 years of data on bighorn sheep (Ovis canadensis), we investigated the correlations between seven indices of inequality, Is, OSM and i on horn length of males. Bighorn sheep are ideal for this comparison because they are highly polygynous, sexually dimorphic, ram horn length is under strong sexual selection, and we have detailed knowledge of individual breeding success. Different metrics provided conflicting information, potentially leading to spurious conclusions about selection patterns. Iδ, an index of breeding inequality, and to a lesser extent Is, showed the highest correlation with i on horn length, suggesting that these indices document breeding inequality in a selection context. OSM on horn length was strongly correlated with i, Is, and indices of inequality. By integrating information on both realized sexual selection and breeding inequality, OSM appeared to be the best proxy of sexual selection and may be best suited to explore its ecological bases.
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.
Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.
Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.
Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated four evaluation tasks:
Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.
Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The CR evaluation set contains 76 distinct target companies, each with 5.3 annotated competitors on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.
Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. This resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.
Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).
Background and Motivation
In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.
While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.
In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies similar to the seed companies can then be retrieved in the embedding space using distance metrics such as cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar-company discovery and quantification in a Q&A format.
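A minimal sketch of this embedding-based retrieval, assuming the precomputed node embeddings have already been loaded into a NumPy array aligned with node IDs; the variable names and the loading step are assumptions, and the repository tutorial documents the actual API.

```python
# Hedged sketch: expand a seed list of companies to their nearest neighbours in
# embedding space using cosine similarity. `embeddings` is assumed to be an
# (n_companies, dim) array of one of the provided text embeddings (e.g. mSBERT).
import numpy as np

def expand_seed_companies(embeddings: np.ndarray, seed_ids, top_k: int = 10):
    # L2-normalise so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = normed[list(seed_ids)].mean(axis=0)   # centroid of the seed companies
    query /= np.linalg.norm(query)
    scores = normed @ query                       # cosine similarity to the centroid
    scores[list(seed_ids)] = -np.inf              # exclude the seeds themselves
    return np.argsort(-scores)[:top_k]            # node IDs of the most similar companies
```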
However, a graph remains the most natural choice for representing and learning diverse company relations, owing to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns that facilitate similar-company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.
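As a small illustration of forming such a KG from an edge list, the sketch below uses networkx; the (source, target, edge_type, weight) tuple layout is an assumption for illustration, and the repository itself targets graph-learning frameworks rather than networkx.

```python
# Hedged sketch: assemble an undirected, weighted, typed company graph with networkx.
# A MultiGraph is used because a company pair may be linked by several relation types.
import networkx as nx

def build_company_kg(edge_list):
    kg = nx.MultiGraph()
    for src, dst, edge_type, weight in edge_list:
        kg.add_edge(src, dst, key=edge_type, weight=weight)
    return kg

# Example (illustrative IDs, types, and weights):
# kg = build_company_kg([(0, 1, "industry_sector", 0.8), (0, 1, "news_cooccurrence", 0.3)])
```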
Source Code and Tutorial: https://github.com/llcresearch/CompanyKG2
Paper: to be published
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The data show the station codes of all 20 sites, identified as K1 to K20. The values Ø5, Ø16, Ø25, Ø50, Ø75, Ø84, Ø95 and Ø99 for all 20 stations are shown in the table, along with the statistical parameters MEAN, STANDARD DEVIATION, SKEWNESS and KURTOSIS for each station.
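The tabulated statistics are presumably the standard graphic measures computed from these Ø (phi) percentiles; if so (an assumption, since the table does not state the method), the Folk and Ward (1957) formulas are:

```latex
% Folk and Ward (1957) graphic grain-size measures from phi percentiles
% (assumed method; the dataset description does not state how the statistics were computed).
M_z = \frac{\phi_{16} + \phi_{50} + \phi_{84}}{3}, \qquad
\sigma_I = \frac{\phi_{84} - \phi_{16}}{4} + \frac{\phi_{95} - \phi_{5}}{6.6}, \qquad
Sk_I = \frac{\phi_{16} + \phi_{84} - 2\phi_{50}}{2(\phi_{84} - \phi_{16})}
     + \frac{\phi_{5} + \phi_{95} - 2\phi_{50}}{2(\phi_{95} - \phi_{5})}, \qquad
K_G = \frac{\phi_{95} - \phi_{5}}{2.44\,(\phi_{75} - \phi_{25})}.
```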
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cumulative COVID-19 cases and deaths summary statistics.
https://vocab.nerc.ac.uk/collection/L08/current/UN/
This database, and the accompanying website called 'SurgeWatch' (http://surgewatch.stg.rlp.io), provides a systematic UK-wide record of high sea level and coastal flood events over the last 100 years (1915-2014). Derived using records from the National Tide Gauge Network, a dataset of exceedance probabilities from the Environment Agency and meteorological fields from the 20th Century Reanalysis, the database captures information on 96 storm events that generated the highest sea levels around the UK since 1915. For each event, the database contains information about: (1) the storm that generated that event; (2) the sea levels recorded around the UK during the event; and (3) the occurrence and severity of coastal flooding as a consequence of the event. The data are presented to be easily accessible and understandable to a wide range of interested parties.

The database contains 100 files: four CSV files and 96 PDF files. Two CSV files contain the meteorological and sea level data for each of the 96 events. A third file contains the list of the top 20 largest skew surges at each of the 40 study tide gauge sites. In the file containing the sea level and skew surge data, the tide gauge sites are numbered 1 to 40. A fourth accompanying CSV file lists, for reference, the site name and location (longitude and latitude). A description of the parameters in each of the four CSV files is given in the table below. There are also 96 separate PDF files containing the event commentaries. For each event these contain a concise narrative of the meteorological and sea level conditions experienced during the event, and a succinct description of the evidence available in support of coastal flooding, with a brief account of the recorded consequences to people and property. In addition, these contain a graphical representation of the storm track and mean sea level pressure and wind fields at the time of maximum high water, the return period and skew surge magnitudes at sites around the UK, and a table of the date and time, offset return period, water level, predicted tide and skew surge for each site where the 1 in 5 year threshold was reached or exceeded for each event. A detailed description of how the database was created is given in Haigh et al. (2015).

Coastal flooding caused by extreme sea levels can be devastating, with long-lasting and diverse consequences. The UK has a long history of severe coastal flooding, and the 2013-14 winter in particular produced a sequence of some of the worst coastal flooding the UK has experienced in the last 100 years. At present 2.5 million properties and £150 billion of assets are potentially exposed to coastal flooding. Yet despite these concerns, there is no formal, national framework in the UK to record flood severity and consequences and thus support an understanding of coastal flooding mechanisms and consequences. Without a systematic record of flood events, assessment of coastal flooding around the UK coast is limited.

The database was created at the School of Ocean and Earth Science, National Oceanography Centre, University of Southampton, with help from the Faculty of Engineering and the Environment, University of Southampton, the National Oceanography Centre and the British Oceanographic Data Centre. Collation of the database and the development of the website were funded through a Natural Environment Research Council (NERC) impact acceleration grant.
The database contributes to the objectives of UK Engineering and Physical Sciences Research Council (EPSRC) consortium project FLOOD Memory (EP/K013513/1).
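For readers unfamiliar with the quantity tabulated per site and event, skew surge follows the standard definition used in this line of work (consistent with Haigh et al., 2015):

```latex
% Standard definition of skew surge:
\text{skew surge} = \max(\text{observed water level}) - \max(\text{predicted astronomical tide}),
% with both maxima taken over the same tidal cycle, regardless of when each occurs.
```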
The grain-size distribution of 223 unconsolidated sediment samples from four DSDP sites at the mouth of the Gulf of California was determined using sieve and pipette techniques. Shepard's (1954) and Inman's (1952) classification schemes were used for all samples. Most of the sediments are hemipelagic with minor turbidites of terrigenous origin. Sediment texture ranges from silty sand to silty clay. On the basis of grain-size parameters, the sediments can be divided into the following groups: (1) poorly to very poorly sorted coarse and medium sand; and (2) poorly to very poorly sorted fine to very fine sand and clay.
Hunting returns data (hunting.returns.txt). Delimited plain text file with the following columns: date, age_cat, sex, foraging_activity, total_foraging_time, total_kcal. These correspond to: (1) date of focal follow, (2) age category of focal forager, (3) sex of focal forager, (4) type of hunt, (5) total hunt time in min, (6) total energetic returns in kcal.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Estimated confidence intervals and lengths for the common mean of chloride concentration (in mg/litre) in water.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean skewness and kurtosis for simulated data scenarios.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid advancement of additive manufacturing (AM) requires researchers to keep pace by continually improving AM processes. Improving manufacturing processes involves evaluating the process outputs and their conformity to the required specifications. Process capability indices, calculated using critical quality characteristics (QCs), have long been used in the evaluation process due to their proven effectiveness. AM processes typically involve multiple correlated critical QCs, indicating the need for a multivariate process capability index (MPCI); a univariate capability index may lead to misleading results. In this regard, this study proposes a general methodological framework for evaluating AM processes using an MPCI. The proposed framework starts by identifying the AM process and product design. Fused Deposition Modeling (FDM) is chosen for this investigation. Then, the specification limits associated with the critical QCs are established. To ensure that the MPCI assumptions are met, the critical QC data are examined for normality, stability, and correlation. Additionally, the MPCI is estimated by simulating a large sample using the properties of the collected QC data and determining the percent nonconforming (PNC). Furthermore, the FDM process and its capable tolerance limits are then assessed using the proposed MPCI. Finally, the study presents a sensitivity analysis of the FDM process and suggestions for improvement based on the analysis of assignable causes of variation. The results revealed that the process mean is shifted for all QCs, and that most of the variation is associated with the part diameter data. Moreover, the process data are not normally distributed, and the proposed transformation algorithm performs well in reducing data skewness. The performance of the FDM process was also estimated for different designations of the specification limits. The results showed that the FDM process is not capable for the designs considered, except under very coarse specifications.
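A minimal sketch of the simulation step described above, under assumptions not stated in the abstract: a large sample is drawn from a multivariate normal fitted to the (possibly transformed) QC data, PNC is the fraction of simulated parts outside any specification limit, and one common PNC-based index converts the conforming proportion back to a capability value. The spec limits, QC values, and the particular index formula below are illustrative assumptions.

```python
# Hedged sketch: Monte Carlo estimate of percent nonconforming (PNC) and a
# PNC-based multivariate capability index (MCp = Phi^{-1}((1 + p_conform)/2) / 3,
# an assumed formula for illustration).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Fitted properties of the collected QC data (illustrative numbers only).
mean = np.array([10.02, 5.01, 2.49])           # e.g. length, width, diameter
cov = np.array([[0.010, 0.002, 0.001],
                [0.002, 0.008, 0.001],
                [0.001, 0.001, 0.012]])
lsl = np.array([9.90, 4.90, 2.40])             # lower specification limits
usl = np.array([10.10, 5.10, 2.60])            # upper specification limits

# Simulate a large sample and count parts with every QC inside its limits.
sample = rng.multivariate_normal(mean, cov, size=1_000_000)
conforming = np.all((sample >= lsl) & (sample <= usl), axis=1)
p_conform = conforming.mean()
pnc = 1.0 - p_conform

# PNC-based index: capability of a centred normal process with the same PNC.
mcp = norm.ppf((1.0 + p_conform) / 2.0) / 3.0
print(f"PNC = {pnc:.4%}, MCp = {mcp:.3f}")
```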
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics of the response style measures: the overall mean, median, reliability measures alpha and omega, and skewness of all response styles.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this paper, we investigate the distributional properties of the estimated tangency portfolio (TP) weights, assuming that the asset returns follow a matrix variate closed skew-normal distribution. We establish a stochastic representation of a linear combination of the estimated TP weights that fully characterizes its distribution. Using this stochastic representation, we derive the mean and variance of the estimated TP weights, which are of key importance in portfolio analysis. Furthermore, we provide the asymptotic distribution of a linear combination of the estimated TP weights under the high-dimensional asymptotic regime, i.e., when the portfolio dimension p and the sample size n tend to infinity such that p/n → c ∈ (0,1). The good performance of the theoretical findings is documented in a simulation study. In an empirical study, we apply the theoretical results to real data on the stocks included in the S&P 500 index.
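For context, the tangency portfolio weights referenced above are conventionally defined as below, with the sample estimator obtained by plugging in the sample mean vector and sample covariance matrix; the notation (risk aversion γ, risk-free rate r_f, vector of ones 1) is assumed here and may differ from the paper's.

```latex
% Conventional definition of tangency portfolio (TP) weights and their sample estimator
% (assumed notation, for context only).
w_{\mathrm{TP}} = \frac{1}{\gamma}\,\Sigma^{-1}\left(\mu - r_f \mathbf{1}\right),
\qquad
\hat{w}_{\mathrm{TP}} = \frac{1}{\gamma}\,S^{-1}\left(\bar{x} - r_f \mathbf{1}\right).
```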
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Posterior mean odds ratios and 95% credible intervals of the regression coefficients for the binary longitudinal models with response: CD4 counts ≥500 cells/μL.