63 datasets found
  1. Normal and Skewed Example Data

    • figshare.com
    txt
    Updated Dec 21, 2021
    Cite
    Jesus Rogel-Salazar (2021). Normal and Skewed Example Data [Dataset]. http://doi.org/10.6084/m9.figshare.17306285.v1
    Available download formats: txt
    Dataset updated
    Dec 21, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data for normally distributed and skewed datasets.

  2. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
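
The stretching effect mentioned in this description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clustering $R^2$ is taken as one minus the ratio of within-cluster to total sum of squares, the cluster labels are held fixed, and the toy data, the labels, and the function name `clustering_r2` are hypothetical rather than taken from the paper or its dataset.

```python
def clustering_r2(points, labels):
    """R^2 = 1 - (within-cluster sum of squares) / (total sum of squares)."""
    dim = len(points[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    def ss(rows, center):
        return sum((r[j] - center[j]) ** 2 for r in rows for j in range(dim))

    total = ss(points, mean(points))
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        within += ss(members, mean(members))
    return 1.0 - within / total


# Hypothetical toy data: two clusters separated along x, with spread in y.
points = [(0.0, -2.0), (1.0, 2.0), (0.5, 0.0), (3.0, -2.0), (4.0, 2.0), (3.5, 0.0)]
labels = [0, 0, 0, 1, 1, 1]

# Stretch the x coordinate by a factor of 10 (a linear transformation).
stretched = [(10.0 * x, y) for x, y in points]

r2_original = clustering_r2(points, labels)
r2_stretched = clustering_r2(stretched, labels)
```

With the same labels, stretching the coordinate along which the clusters separate makes the unexplained spread in y negligible by comparison, so `r2_stretched` comes out larger than `r2_original` even though the grouping itself has not changed.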

  3. A Novel Generalized Normal Distribution for Human Longevity and other...

    • plos.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Henry T. Robertson; David B. Allison (2023). A Novel Generalized Normal Distribution for Human Longevity and other Negatively Skewed Data [Dataset]. http://doi.org/10.1371/journal.pone.0037025
    Available download formats: docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Henry T. Robertson; David B. Allison
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Negatively skewed data arise occasionally in statistical practice; perhaps the most familiar example is the distribution of human longevity. Although other generalizations of the normal distribution exist, we demonstrate a new alternative that apparently fits human longevity data better. We propose an alternative approach of a normal distribution whose scale parameter is conditioned on attained age. This approach is consistent with previous findings that longevity conditioned on survival to the modal age behaves like a normal distribution. We derive such a distribution and demonstrate its accuracy in modeling human longevity data from life tables. The new distribution is characterized by 1. An intuitively straightforward genesis; 2. Closed forms for the pdf, cdf, mode, quantile, and hazard functions; and 3. Accessibility to non-statisticians, based on its close relationship to the normal distribution.

  4. Normal but skewed? (replication data)

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Dante Amengual (2025). Normal but skewed? (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9ub3JtYWwtYnV0LXNrZXdlZA==
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW Journal Data Archive
    ZBW
    Authors
    Dante Amengual
    Description

    We propose a multivariate normality test against skew normal distributions using higher-order log-likelihood derivatives, which is asymptotically equivalent to the likelihood ratio but only requires estimation under the null. Numerically, it is the supremum of the univariate skewness coefficient test over all linear combinations of the variables. We can simulate its exact finite sample distribution for any multivariate dimension and sample size. Our Monte Carlo exercises confirm its power advantages over alternative approaches. Finally, we apply it to the joint distribution of US city sizes in two consecutive censuses finding that non-normality is very clearly seen in their growth rates.

  5. Data from: Body temperature distributions of active diurnal lizards in three...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Aug 4, 2018
    Cite
    Raymond B. Huey; Eric R. Pianka (2018). Body temperature distributions of active diurnal lizards in three deserts: skewed up or skewed down? [Dataset]. http://doi.org/10.5061/dryad.45g3s
    Available download formats: zip
    Dataset updated
    Aug 4, 2018
    Dataset provided by
    The University of Texas at Austin
    University of Washington
    Authors
    Raymond B. Huey; Eric R. Pianka
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Africa, North America, Australia
    Description
    1. The performance of ectotherms integrated over time depends in part on the position and shape of the distribution of body temperatures (Tb) experienced during activity. For several complementary reasons, physiological ecologists have long expected that Tb distributions during activity should have a long left tail (left-skewed); but only infrequently have they quantified the magnitude and direction of Tb skewness in nature.
    2. To evaluate whether left-skewed Tb distributions are general for diurnal desert lizards, we compiled and analyzed Tb (∑ = 9,023 temperatures) from our own prior studies of active desert lizards on three continents (25 species in Western Australia, 10 in the Kalahari Desert of Africa, and 10 species in western North America). We gathered these data over several decades, using standardized techniques.
    3. Many species showed significantly left-skewed Tb distributions, even when records were restricted to summer months. However, magnitudes of skewness were always small, such that mean Tb were never more than 1°C lower than median Tb. The significance of Tb skewness was sensitive to sample size, and power tests reinforced this sensitivity.
    4. The magnitude of skewness was not obviously related to phylogeny, desert, body size, or median body temperature. Moreover, formal phylogenetic analysis is inappropriate because geography and phylogeny are confounded (that is, are highly collinear).
    5. Skewness might be limited if lizards pre-warm inside retreats before emerging in the morning, emerge only when operative temperatures are high enough to speed warming to activity Tb, or if cold lizards are especially wary and difficult to spot or catch. Telemetry studies may help evaluate these possibilities.
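
The two quantities discussed in points 2 and 3, the sign of the skewness and the gap between mean and median, can be computed as in the following rough sketch. The temperatures are invented for illustration and are not from the dataset; the moment-based skewness formula and the function name `sample_skewness` are assumptions, since the description does not say which estimator the authors used.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical body temperatures (deg C) with a long left tail; not real data.
tb = [28.0, 33.5, 35.0, 35.5, 36.0, 36.0, 36.5, 37.0, 37.0, 37.5]

skew = sample_skewness(tb)
mean_minus_median = statistics.mean(tb) - statistics.median(tb)
```

For a left-skewed sample like this one, `skew` is negative and the mean falls below the median, which is the pattern the description reports (with small magnitudes) for many species.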

  6. Data Sheet 1_The impact of distribution properties on sampling behavior.docx...

    • figshare.com
    docx
    Updated Sep 30, 2025
    Cite
    Thai Quoc Cao; Benjamin Scheibehenne (2025). Data Sheet 1_The impact of distribution properties on sampling behavior.docx [Dataset]. http://doi.org/10.3389/fpsyg.2025.1597227.s001
    Available download formats: docx
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Frontiers
    Authors
    Thai Quoc Cao; Benjamin Scheibehenne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: People often have their decisions influenced by rare outcomes, such as buying a lottery ticket and believing they will win, or not buying a product because of a few negative reviews. Previous research has pointed out that this tendency is due to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people’s search behavior is biased by rare outcomes, and they can adjust the estimation of option value to be closer to the true mean, reflecting cognitive processes to adjust for sampling bias.

    Methods: We recruited 180 participants through Prolific to take part in an online shopping task. On each trial, participants saw a histogram with five bins, representing the percentage of one- to five-star ratings of previous customers on a product. They could click on each bin of the histogram to examine an individual review that gave that product the corresponding star; the review was represented using a number from 0–100 called the positivity score. The goal of the participants was to sample the bins so that they could get as close an estimate of the average positivity score as possible, and they were incentivized based on accuracy of estimation. We varied the shape of the histograms within subject and the number of samples they had between subjects to examine how rare outcomes in skewed distributions influenced sampling behavior and whether having more samples would help people adjust their estimation to be closer to the true mean.

    Results: Binomial tests confirmed sampling biases toward rare outcomes. Compared with 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and samples on estimation adjustment, defined as the difference between experienced/observed means and participants’ estimates. In the negatively skewed distribution, estimates were on average 7% closer to the true mean compared with the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants’ estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.

    Conclusion: These findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human–algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.

  7. Alternative technical efficiency measures: Skew, bias and scale (replication...

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Qu Feng (2025). Alternative technical efficiency measures: Skew, bias and scale (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9hbHRlcm5hdGl2ZS10ZWNobmljYWwtZWZmaWNpZW5jeS1tZWFzdXJlcy1za2V3LWJpYXMtYW5kLXNjYWxl
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW Journal Data Archive
    ZBW
    Authors
    Qu Feng
    Description

    In the fixed-effects stochastic frontier model an efficiency measure relative to the best firm in the sample is universally employed. This paper considers a new measure relative to the worst firm in the sample. We find that estimates of this measure have smaller bias than those of the traditional measure when the sample consists of many firms near the efficient frontier. Moreover, a two-sided measure relative to both the best and the worst firms is proposed. Simulations suggest that the new measures may be preferred depending on the skewness of the inefficiency distribution and the scale of efficiency differences.

  8. Data from: Selection on skewed characters and the paradox of stasis

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 8, 2017
    Cite
    Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin (2017). Selection on skewed characters and the paradox of stasis [Dataset]. http://doi.org/10.5061/dryad.pt07g
    Available download formats: zip
    Dataset updated
    Sep 8, 2017
    Dataset provided by
    Centre National de la Recherche Scientifique
    Authors
    Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold’s (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or additive genetic covariances between relative fitness and trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate sample size. Ecologically relevant traits may commonly have skewed distributions, as we here exemplify with avian laying date (repeatedly described as more evolutionarily stable than expected), so this skewness should be accounted for when investigating evolutionary dynamics in the wild.

  9. Data from: bicycle store dataset

    • kaggle.com
    zip
    Updated Sep 11, 2020
    Cite
    Rohit Sahoo (2020). bicycle store dataset [Dataset]. https://www.kaggle.com/rohitsahoo/bicycle-store-dataset
    Available download formats: zip (682639 bytes)
    Dataset updated
    Sep 11, 2020
    Authors
    Rohit Sahoo
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Perform Exploratory Data Analysis on the Bicycle Store Dataset!

    DATA EXPLORATION: Understand the characteristics of given fields in the underlying data such as variable distributions, whether the dataset is skewed towards a certain demographic, and the data validity of the fields. For example, a training dataset may be highly skewed towards the younger age bracket. If so, how will this impact your results when using it to predict over the remaining customer base? Identify limitations surrounding the data and gather external data which may be useful for modelling purposes. This may include bringing in ABS data at different geographic levels and creating additional features for the model. For example, the geographic remoteness of different postcodes may be used as an indicator of proximity when considering whether a customer is in need of a bike to ride to work.

    MODEL DEVELOPMENT: Determine a hypothesis related to the business question that can be answered with the data. Perform statistical testing to determine if the hypothesis is valid or not. Create calculated fields based on existing data, for example, convert the D.O.B into an age bracket. Other fields that may be engineered include ‘High Margin Product’, which may be an indicator of whether the product purchased by the customer is in a high margin category in the past three months based on the fields ‘list_price’ and ‘standard cost’. Other examples include calculating the distance from office to home address as a factor in determining whether customers may purchase a bicycle for transportation purposes. Additionally, this may include thoughts around determining what the predicted variable actually is. For example, are results predicted in ordinal buckets, nominal, binary or continuous? Test the performance of the model using factors relevant for the given model chosen (i.e. residual deviance, AIC, ROC curves, R Squared). Appropriately document model performance, assumptions and limitations.
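
One of the calculated fields suggested here, converting a date of birth into an age bracket, might look like the sketch below. The function names, the ten-year bracket width, and the example dates are assumptions for illustration; the dataset description does not specify a bracket scheme or a reference date.

```python
from datetime import date


def age_on(dob, as_of):
    """Whole years between a date of birth and a reference date."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


def age_bracket(age, width=10):
    """Label such as '30-39' for a bracket of the given width (an assumed scheme)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical customer: the dates here are made up for illustration.
age = age_on(date(1985, 7, 15), date(2020, 9, 11))
bracket = age_bracket(age)
```

Bracketing turns a continuous age into an ordinal field, which makes it easier to check whether the data are skewed towards a particular age group.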

    INTERPRETATION AND REPORTING: Visualisation and presentation of findings. This may involve interpreting the significant variables and coefficients from a business perspective. These slides should tell a compelling story around the business issue and support your case with quantitative and qualitative observations. Please refer to the module below for further details.

    Content

    The dataset is easy to understand and self-explanatory!

    Inspiration

    It is important to keep in mind the business context when presenting your findings: 1. What are the trends in the underlying data? 2. Which customer segment has the highest customer value? 3. What do you propose should be the marketing and growth strategy?

  10. Impact of limited data availability on the accuracy of project duration...

    • data.mendeley.com
    Updated Nov 22, 2022
    + more versions
    Cite
    Naimeh Sadeghi (2022). Impact of limited data availability on the accuracy of project duration estimation in project networks [Dataset]. http://doi.org/10.17632/bjfdw6xbxw.3
    Dataset updated
    Nov 22, 2022
    Authors
    Naimeh Sadeghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations.

    Number of variables: 20
    Number of cases/rows: 114240

    Variable list:
    • Experiment ID: The ID of the experiment
    • Experiment for network: The ID of the experiment for each of the synthesized networks
    • Network ID: ID of the synthesized network
    • #Activities: Number of activities in the network, including start and finish activities
    • Variability: Variance of the activities in the network (this value can be either high, low, medium or rand, where rand shows a random combination of low, high and medium variance in the network activities)
    • Skewness: Skewness of the activities in the network (skewness can be either right, left, None or rand, where rand shows a random combination of right-, left-, and non-skewed network activities)
    • Fitted distribution type: Distribution type used to fit on sampled data
    • Sample size: Number of sampled data used for the experiment resembling limited data condition
    • Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network
    • Benchmark 50th percentile: 50th percentile of project duration in the benchmark stochastic project network
    • Benchmark 90th percentile: 90th percentile of project duration in the benchmark stochastic project network
    • Benchmark mean: Mean project duration in the benchmark stochastic project network
    • Benchmark variance: Variance of project duration in the benchmark stochastic project network
    • Experiment 10th percentile: 10th percentile of project duration distribution for the experiment
    • Experiment 50th percentile: 50th percentile of project duration distribution for the experiment
    • Experiment 90th percentile: 90th percentile of project duration distribution for the experiment
    • Experiment mean: Mean of project duration distribution for the experiment
    • Experiment variance: Variance of project duration distribution for the experiment
    • K-S: Kolmogorov–Smirnov test comparing the benchmark distribution and the project duration distribution of the experiment
    • P_value: The P-value based on the distance calculated in the K-S test
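
The K-S variable above refers to a distance between two distributions. As a minimal sketch, assuming the standard two-sample definition (the largest gap between the two empirical CDFs), it could be computed as follows. The function name `ks_statistic` and the duration samples are hypothetical, and the sketch does not reproduce how the dataset's P_value column was derived.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The maximum gap is always attained at one of the observed values.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Hypothetical project durations (days) for a benchmark and an experiment.
benchmark = [10, 12, 13, 15, 18]
experiment = [11, 14, 16, 19, 21]

distance = ks_statistic(benchmark, experiment)
```

A distance of 0 means the two empirical distributions coincide, while values near 1 mean they barely overlap.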

  11. Supplementary data for the paper "Why psychologists should not default to...

    • data.4tu.nl
    zip
    Updated Apr 28, 2025
    Cite
    Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
    Available download formats: zip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Joost de Winter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
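
The two tests compared in this description differ only in how they estimate the standard error and the degrees of freedom. The sketch below uses the textbook formulas for the pooled-variance (Student) statistic and the Welch statistic with Welch-Satterthwaite degrees of freedom. The function names and the sample values are hypothetical and are not taken from the paper's simulations; p-values are omitted because they would need a t-distribution CDF, which the standard library does not provide.

```python
import math
import statistics


def student_t(x, y):
    """Pooled-variance (Student) t statistic and degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a, b = statistics.variance(x) / n1, statistics.variance(y) / n2
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples with unequal sizes and spreads; not data from the paper.
group_a = [4.1, 5.0, 5.2, 4.8, 5.5, 4.9]
group_b = [6.0, 7.5, 5.1, 8.2]

t_student, df_student = student_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
```

When sample sizes and variances are equal the two statistics coincide; they diverge as sizes and spreads become unequal, which is the setting the simulations in the paper examine.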

  12. Ames Housing Engineered Dataset

    • kaggle.com
    Updated Sep 27, 2025
    Cite
    Atefeh Amjadian (2025). Ames Housing Engineered Dataset [Dataset]. https://www.kaggle.com/datasets/atefehamjadian/ameshousing-engineered
    Croissant: a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Sep 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atefeh Amjadian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Ames
    Description

    This dataset is an engineered version of the original Ames Housing dataset from the "House Prices: Advanced Regression Techniques" Kaggle competition. The goal of this engineering was to clean the data, handle missing values, encode categorical features, scale numeric features, manage outliers, reduce skewness, select useful features, and create new features to improve model performance for house price prediction.

    The original dataset contains information on 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, with the target variable being SalePrice. This engineered version has undergone several preprocessing steps to make it ready for machine learning models.

    Preprocessing Steps Applied

    1. Missing Value Handling: Missing values in categorical columns with meaningful absence (e.g., no pool for PoolQC) were filled with "None". Numeric columns were filled with median, and other categorical columns with mode.
    2. Correlation-based Feature Selection: Numeric features with absolute correlation < 0.1 with SalePrice were removed.
    3. Encoding Categorical Variables: Ordinal features (e.g., quality ratings) were encoded using OrdinalEncoder, and nominal features (e.g., neighborhoods) using OneHotEncoder.
    4. Outlier Handling: Outliers in numeric features were detected using IQR and capped (Winsorized) to IQR bounds to preserve data while reducing extreme values.
    5. Skewness Handling: Highly skewed numeric features (|skew| > 1) were transformed using Yeo-Johnson to make distributions more normal-like.
    6. Additional Feature Selection: Low-variance one-hot features (variance < 0.01) and highly collinear features (|corr| > 0.8) were removed.
    7. Feature Scaling: Numeric features were scaled using RobustScaler to handle outliers.
    8. Duplicate Removal: Duplicate rows were checked and removed if found (none in this dataset).
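
Step 4 above (capping outliers to IQR bounds) can be sketched as follows. The multiplier k = 1.5, the "inclusive" quartile method, the function name `cap_to_iqr_bounds`, and the example values are all assumptions; the description does not state which bounds or quartile convention the dataset author used.

```python
import statistics


def cap_to_iqr_bounds(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical feature column with one extreme value.
living_area = [1000, 1200, 1300, 1500, 1600, 1800, 9000]
capped = cap_to_iqr_bounds(living_area)
```

Capping keeps every row in the dataset, unlike dropping outliers, but pulls extreme values in to the nearest bound; here only the last value changes.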

    The original 81 columns grow to approximately 250 after one-hot encoding and are then reduced by feature selection, giving a final dataset with improved quality for modeling.

    New Features Created

    To add more predictive power, the following new features were created based on domain knowledge:

    1. HouseAge: Age of the house at the time of sale. Calculated as YrSold - YearBuilt. This captures how old the house is, which can negatively affect price due to depreciation.
      • Example: A house built in 2000 and sold in 2008 has HouseAge = 8.
    2. Quality_x_Size: Interaction term between overall quality and living area. Calculated as OverallQual * GrLivArea. This combines quality and size to capture the value of high-quality large homes.
      • Example: A house with OverallQual = 7 and GrLivArea = 1500 has Quality_x_Size = 10500.
    3. TotalSF: Total square footage of the house. Calculated as GrLivArea + TotalBsmtSF + 1stFlrSF + 2ndFlrSF (if available). This aggregates area features into a single metric for better price prediction.
      • Example: If GrLivArea = 1500 and TotalBsmtSF = 1000, TotalSF = 2500.
    4. Log_LotArea: Log-transformed lot area to reduce skewness. Calculated as np.log1p(LotArea). This makes the distribution of lot sizes more normal, helping models handle extreme values.
      • Example: A lot area of 10000 becomes Log_LotArea ≈ 9.21.

    These new features were created using the original (unscaled) values to maintain interpretability, then scaled with RobustScaler to match the rest of the dataset.
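
The four engineered features can be reproduced from the formulas and example values given above, as in this minimal sketch. It is not the dataset author's code: the function name `engineer_features` is hypothetical, `math.log1p` stands in for `np.log1p` to avoid a NumPy dependency, TotalSF is simplified to the two-term form used in the worked example (GrLivArea + TotalBsmtSF), and the RobustScaler step is left out.

```python
import math


def engineer_features(row):
    """Add the four derived features to a dict of raw (unscaled) values."""
    out = dict(row)
    out["HouseAge"] = row["YrSold"] - row["YearBuilt"]
    out["Quality_x_Size"] = row["OverallQual"] * row["GrLivArea"]
    # Simplified to the two-term form used in the worked example in the text.
    out["TotalSF"] = row["GrLivArea"] + row["TotalBsmtSF"]
    out["Log_LotArea"] = math.log1p(row["LotArea"])
    return out


# One hypothetical house assembled from the example values in the text.
house = {"YrSold": 2008, "YearBuilt": 2000, "OverallQual": 7,
         "GrLivArea": 1500, "TotalBsmtSF": 1000, "LotArea": 10000}
features = engineer_features(house)
```

Computing the features on unscaled values first, as the description notes, keeps quantities like HouseAge interpretable in years before any scaling is applied.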

    Data Dictionary

    • Original Numeric Features: Kept features with |corr| > 0.1 with SalePrice, such as:
      • OverallQual: Material and finish quality (scaled, 1-10).
      • GrLivArea: Above grade (ground) living area square feet (scaled).
      • GarageCars: Size of garage in car capacity (scaled).
      • TotalBsmtSF: Total square feet of basement area (scaled).
      • And others like FullBath, YearBuilt, etc. (see the code for the full list).
    • Ordinal Encoded Features: Quality and condition ratings, e.g.:
      • ExterQual: Exterior material quality (encoded as 0=Po to 4=Ex).
      • BsmtQual: Basement quality (encoded as 0=None to 5=Ex).
    • One-Hot Encoded Features: Nominal categorical features, e.g.:
      • MSZoning_RL: 1 if residential low density, 0 otherwise.
      • Neighborhood_NAmes: 1 if in NAmes neighborhood, 0 otherwise.
    • New Engineered Features (as described above):
      • HouseAge: Age of the house (scaled).
      • Quality_x_Size: Overall quality times living area (scaled).
      • TotalSF: Total square footage (scaled).
      • Log_LotArea: Log-transformed lot area (scaled).
    • Target: SalePrice - The property's sale price in dollars (not scaled, as it's the target).

    Total columns: Approximately 200-250 (after one-hot encoding and feature selection).

    License

    This dataset is derived from the Ames Housing...

  13. Data from: Uneven missing data skew phylogenomic relationships within the...

    • datasetcatalog.nlm.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Jul 21, 2022
    Cite
    Smith, Brian; Andersen, Michael J.; Benz, Brett W.; Mauck III, William M. (2022). Uneven missing data skew phylogenomic relationships within the lories and lorikeets [Dataset]. http://doi.org/10.5061/dryad.n5tb2rbsp
    Dataset updated
    Jul 21, 2022
    Authors
    Smith, Brian; Andersen, Michael J.; Benz, Brett W.; Mauck III, William M.
    Description

    Included is the supplementary data for Smith, B. T., Mauck, W. M., Benz, B., & Andersen, M. J. (2018). Uneven missing data skews phylogenomic relationships within the lories and lorikeets. BioRxiv, 398297.

    The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impacts phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9x more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.

  14. Data publication: Demand-pull and technology-push: What drives the direction...

    • pub.uni-bielefeld.de
    Updated Mar 10, 2023
    Cite
    Kerstin Hötte (2023). Data publication: Demand-pull and technology-push: What drives the direction of technological change? Empirical data on a coupled two-layer input-output and patent-citation network [Dataset]. https://pub.uni-bielefeld.de/record/2952814
    Explore at:
    Dataset updated
    Mar 10, 2023
    Authors
    Kerstin Hötte
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The core data in this publication are empirical data on two coupled network layers inferred from cross-industrial citation links (patent citation network) and input-output flows among industries (input-output (IO) network). These data are available in quinquennial time steps for the years 1977, 1982, 1987, 1992, 1997, 2002, 2006. The data are available at the 6-digit level. The analyses in the paper mainly refer to 4-digit level results of a balanced panel of industries, i.e. industries for which both patent and IO data are available for the full time horizon. This publication also contains a sample of panel data on industry characteristics (mainly industry size by patent stock and output and network indicators). The data are available in RData format and supplemented by the R-scripts used to compile and analyze the data.

    This data publication contains all data and R-code used for the paper:

    Hötte, Kerstin (2021): "Demand-pull and technology-push: What drives the direction of technological change? An empirical network-based approach".

    [Forthcoming. If you use the data, please cite the most recent version of the paper.]

    The paper also offers a description of the data and its compilation.

    Abstract:

    Demand-pull and technology-push are linked to an empirical two-layer network based on coupled cross-industrial input-output (IO) and patent-citation links among 155 4-digit (NAICS) US-industries in 1976-2006. I study the evolution of industry hierarchies and link formation. Both layers co-evolve, but differently: The patent network became denser and increasingly skewed, while market hierarchies are balanced and sluggish in change. Industries became more similar by patent-citations, but less by input use. Similar R&D capabilities as other big industries is beneficial for innovation providing access to knowledge but relying on the same market inputs is unfavorable if it intensifies competition. This may incite industries to explore other technological pathways. Growth in the market is constrained by scarcity and competition, but knowledge as innovation input is non-rival leading to increasing returns and a skewed distribution. This may strengthen existing R&D trajectories while market pressure may trigger a re-direction in both layers. This work is limited by its reliance on endogenously evolving classifications.

    To reproduce the results and the data from the raw data, you must run the code provided in the following order:

    (1) CREATING THE DATA: (a) The patent data cannot be fully reconstructed from the data that are available in this data publication because one of the intermediate steps relies on proprietary data that cannot be provided here. For the remainder: You can compile parts of the patent and the IO data from the raw data. To do so, please use the code and raw data provided in the folders io_data_R_files and patent_data_R_files. Further detail is provided below.
    (b) The folder R_scripts_both provides all code needed to create the merged panel data that is used in the analysis.

    (2) REPRODUCING THE ANALYSES: The folder R_scripts_both provides all code needed to reproduce the figures, tables, descriptive statistics and regression analyses. Further detail is provided below.

    This data publication also provides additional results on the analyses at different levels of data aggregation. You will find it in the folder statistical_output but you can also produce additional results running the code provided.

    This data publication consists of 6 folders:

    (1) patent_data_R_files (2) io_data_R_files (3) R_scripts_both (4) data_combined (5) statistical_output

    Details:

    (1) patent_data_R_files

    This folder contains 2 subfolders: code, data

    • code: This subfolder contains the R-scripts of all single steps executed to process the patent raw data. These steps are explained in detail in the Supplementary Material of the paper Hötte, K (2021): "Demand-pull and technology-push [forthcoming]"

    • data: This subfolder contains the processed data at different levels of aggregation and a folder where to put the source files, i.e. the original NBER patent data used in this analysis that need to be downloaded from https://sites.google.com/site/patentdataproject/Home [accessed on Mar 17, 2021]. To us

  15. Additional file 2 of Modelling count, bounded and skewed continuous outcomes...

    • springernature.figshare.com
    text/x-diff
    Updated Jun 2, 2023
    Cite
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 2 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774297.v1
    Explore at:
    Available download formats: text/x-diff
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material 2: A supplementary file with examples of STATA script for all models that have been fitted in this paper.

  16. Data from: STCSSP: A FORTRAN 77 routine to compute a structured staircase...

    • resodate.org
    Updated Dec 17, 2021
    Cite
    Tobias Brüll; Volker Mehrmann (2021). STCSSP: A FORTRAN 77 routine to compute a structured staircase form for a (skew-)symmetric/(skew-)symmetric matrix pencil [Dataset]. http://doi.org/10.14279/depositonce-14386
    Explore at:
    Dataset updated
    Dec 17, 2021
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Tobias Brüll; Volker Mehrmann
    Description

    This paper contains the description of the algorithm STCSSP and its interface. STCSSP is a FORTRAN subroutine that computes a structured staircase form for a real (skew-)symmetric/(skew-)symmetric matrix pencil, i.e., a pencil where each of the two matrices is either symmetric or skew-symmetric. An example of how to call the subroutine is given.

  17. Earnings by Workplace, Borough - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Jun 9, 2025
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2025). Earnings by Workplace, Borough - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/earnings-by-workplace-borough
    Explore at:
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    This dataset provides information about earnings of employees who are working in an area, who are on adult rates and whose pay for the survey pay-period was not affected by absence. Tables provided here include total gross weekly earnings, and full time weekly earnings with breakdowns by gender, and annual median, mean and lower quartile earnings by borough and UK region. These are provided both in nominal and real terms. Real earnings figures are on sheets labelled "real", are in 2016 prices, and are calculated by applying ONS’s annual CPI index series for April to ASHE data. The Annual Survey of Hours and Earnings (ASHE) is based on a sample of employee jobs taken from HM Revenue & Customs PAYE records. Information on earnings and hours is obtained in confidence from employers. ASHE does not cover the self-employed nor does it cover employees not paid during the reference period. The earnings information presented relates to gross pay before tax, National Insurance or other deductions, and excludes payments in kind. The confidence figure is the coefficient of variation (CV) of that estimate. The CV is the ratio of the standard error of an estimate to the estimate itself and is expressed as a percentage. The smaller the coefficient of variation, the greater the accuracy of the estimate. The true value is likely to lie within +/- twice the CV. Results for 2003 and earlier exclude supplementary surveys. In 2006 a number of methodological changes were made. For further details, go to: http://www.nomisweb.co.uk/articles/341.aspx. The headline statistics for ASHE are based on the median rather than the mean. The median is the value below which 50 per cent of employees fall. It is ONS's preferred measure of average earnings as it is less affected by a relatively small number of very high earners and the skewed distribution of earnings. It therefore gives a better indication of typical pay than the mean.
    Survey data from a sample frame; use caution if using for performance measurement and trend analysis. '#' These figures are suppressed as statistically unreliable. '!' Estimate and confidence interval not available since the group sample size is zero or disclosive (0-2). Furthermore, data from Abstract of Regional Statistics, New Earnings Survey and ASHE have been combined to create a long-run historical series of full-time weekly earnings data for London and Great Britain, stretching back to 1965 and broken down by sex.
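
The CV definition and the "+/- twice the CV" rule of thumb in the description above can be illustrated with a short calculation. This is only a sketch: the function names and the figures (an estimate of 600 with a standard error of 12) are hypothetical and are not taken from the dataset.

```python
def cv_percent(estimate: float, standard_error: float) -> float:
    """Coefficient of variation: the standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return 100.0 * standard_error / abs(estimate)


def likely_range(estimate: float, cv: float) -> tuple[float, float]:
    """Range of +/- twice the CV (in percent) around the estimate."""
    half_width = abs(estimate) * 2.0 * cv / 100.0
    return estimate - half_width, estimate + half_width


# Hypothetical figures, for illustration only.
estimate = 600.0        # e.g. a median weekly earnings estimate
standard_error = 12.0   # assumed standard error of that estimate

cv = cv_percent(estimate, standard_error)
low, high = likely_range(estimate, cv)
print(f"CV = {cv:.1f}%, likely range = {low:.0f} to {high:.0f}")
```

With these assumed inputs the CV is 2%, so the likely range spans 4% of the estimate on either side.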

  18. Data from: Polynomial Eigenvalue Problems with Hamiltonian Structure

    • resodate.org
    Updated Dec 17, 2021
    Cite
    Volker Mehrmann; David Watkins (2021). Polynomial Eigenvalue Problems with Hamiltonian Structure [Dataset]. http://doi.org/10.14279/depositonce-14247
    Explore at:
    Dataset updated
    Dec 17, 2021
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Volker Mehrmann; David Watkins
    Description

    We discuss the numerical solution of eigenvalue problems for matrix polynomials, where the coefficient matrices are alternating symmetric and skew symmetric or Hamiltonian and skew Hamiltonian. We discuss several applications that lead to such structures. Matrix polynomials of this type have a symmetry in the spectrum that is the same as that of Hamiltonian matrices or skew-Hamiltonian/Hamiltonian pencils. The numerical methods that we derive are designed to preserve this eigenvalue symmetry. We also discuss linearization techniques that transform the polynomial into a skew-Hamiltonian/Hamiltonian linear eigenvalue problem with a specific substructure. For this linear eigenvalue problem we discuss special factorizations that are useful in shift-and-invert Krylov subspace methods for the solution of the eigenvalue problem. We present a numerical example that demonstrates the effectiveness of our approach.

  19. Supplementary Material for: Modeling Caries Experience: Advantages of the...

    • datasetcatalog.nlm.nih.gov
    • karger.figshare.com
    Updated Sep 16, 2016
    Cite
    H. Hofstetter; A. Zeileis; E. Dusseldorp; A. A. Schuller (2016). Supplementary Material for: Modeling Caries Experience: Advantages of the Use of the Hurdle Model [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001582109
    Explore at:
    Dataset updated
    Sep 16, 2016
    Authors
    H. Hofstetter; A. Zeileis; E. Dusseldorp; A. A. Schuller
    Description

    In dental epidemiology, the decayed (D), missing (M), and filled (F) teeth or surfaces index (DMF index) is a frequently used measure. The DMF index is characterized by a strongly positively skewed distribution with a large stack of zero counts for those individuals without caries experience. Therefore, standard generalized linear models often lead to a poor fit. The hurdle regression model is a highly suitable class to model a DMF index, but it is comparatively rarely used. We aim to overcome the gap between the suitability of the hurdle model to fit DMF indices and the frequency of its use in caries research. A theoretical introduction to the hurdle model is provided, and an extensive comparison with the zero-inflated model is given. Using an illustrative data example, both types of models are compared, with a special focus on interpretation of their parameters. Accompanying R code and example data are provided as online supplementary material.
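
To make the idea of a hurdle model for zero-heavy counts concrete, here is a minimal sketch of a hurdle probability mass function. It assumes a Poisson count component, which is one common choice and not necessarily the one used in the supplementary material; the function name and parameter values are hypothetical.

```python
import math


def hurdle_poisson_pmf(k: int, p_zero: float, lam: float) -> float:
    """P(Y = k) under a hurdle model with a zero-truncated Poisson count part.

    p_zero is the probability of a zero count (not crossing the hurdle);
    lam is the rate of the Poisson distribution that is truncated at zero.
    """
    if k < 0:
        return 0.0
    if k == 0:
        return p_zero
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    # Rescale so the positive counts share the remaining mass 1 - p_zero.
    return (1.0 - p_zero) * poisson_k / (1.0 - math.exp(-lam))


# Hypothetical parameters: 40% of individuals have a zero count.
probs = [hurdle_poisson_pmf(k, p_zero=0.4, lam=2.5) for k in range(60)]
print(f"P(Y=0) = {probs[0]:.2f}, total mass over 0..59 = {sum(probs):.6f}")
```

The zero probability is set directly by the hurdle part, independently of the count part, which is what lets the model accommodate a large stack of zeros alongside a skewed positive tail.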

  20. Additional file 3 of Modelling count, bounded and skewed continuous outcomes...

    • springernature.figshare.com
    txt
    Updated Jun 2, 2023
    + more versions
    Cite
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White (2023). Additional file 3 of Modelling count, bounded and skewed continuous outcomes in physical activity research: beyond linear regression models [Dataset]. http://doi.org/10.6084/m9.figshare.22774300.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Muhammad Akram; Ester Cerin; Karen E. Lamb; Simon R. White
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material 3: A supplementary file with examples of SAS script for all models that have been fitted in this paper.
