Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean absolute percentage errors of gold loss for error variances of 1, 3, 5, 7, and 9; a sample size of 48; and missing value percentages of 5, 10, 15, 20, 30, and 40% with SRS in missing value estimation methods.
The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. All datasets were taken from Kaggle and modified by adding missing values, which are denoted by the symbol (?). The experiment evaluates methods for imputing missing nominal values. The missing values in the three datasets are in the range of 10%-80%.
The Arabic dataset was modified as follows: 1. Columns containing English values, such as Id, poem_link, and poet link, were deleted, so that the ERAR method could be evaluated on a purely Arabic dataset. 2. Diacritical marks were added to some records to check their effect during frequent itemset generation. Note: the results of the experiment on the Arabic dataset can be found in the paper titled "Missing values imputation in Arabic datasets using enhanced robust association rules".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean absolute percentage errors for the scenario where error variance was 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40% with RSS in missing value estimation methods.
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean absolute percentage errors of platinum loss for the scenarios with error variances of 1, 3, 5, 7, and 9; a sample size of 48; and missing value percentages of 5, 10, 15, 20, 30, and 40% with SRS in missing value estimation methods.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may not be suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Overview
Dive into the Extrovert vs. Introvert Personality Traits Dataset, a rich collection of behavioral and social data designed to explore the spectrum of human personality. This dataset captures key indicators of extroversion and introversion, making it a valuable resource for psychologists, data scientists, and researchers studying social behavior, personality prediction, or data preprocessing techniques.
Context
Personality traits like extroversion and introversion shape how individuals interact with their social environments. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to uncovering fascinating insights.
Dataset Details
Size: The dataset contains 2,900 rows and 8 columns.
Features:
- Time_spent_Alone: Hours spent alone daily (0–11).
- Stage_fear: Presence of stage fright (Yes/No).
- Social_event_attendance: Frequency of social events (0–10).
- Going_outside: Frequency of going outside (0–7).
- Drained_after_socializing: Feeling drained after socializing (Yes/No).
- Friends_circle_size: Number of close friends (0–15).
- Post_frequency: Social media post frequency (0–10).
- Personality: Target variable (Extrovert/Introvert).
Data Quality: Includes some missing values, ideal for practicing imputation and preprocessing.
Format: Single CSV file, compatible with Python, R, and other tools.
Data Quality Notes
Potential Use Cases
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Aim
Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.
Location
Global
Time period
Contemporary
Major taxa studied
Seabirds
Methods
We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.
Results
We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.
Main conclusions
Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.
Methods Trait Selection and Data
We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.
Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
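As an illustration only (the function and the sample scores below are hypothetical, not taken from the study), the "assign to a guild when its score exceeds 50%, otherwise fall back to generalist/omnivore" rule could be sketched as:

```python
# Hypothetical sketch of the guild-assignment rule: a species is assigned
# to the guild whose usage score exceeds 50%; otherwise it falls back to
# "generalist" (foraging guild) or "omnivore" (diet guild).
def assign_guild(scores, fallback):
    """scores maps guild name -> relative usage in percent (0-100)."""
    top = max(scores, key=scores.get)
    return top if scores[top] > 50 else fallback

foraging_scores = {"diver": 70, "surface": 20, "ground": 10}
diet_scores = {"omnivore": 40, "invertebrate": 35, "vertebrate & scavenger": 25}

print(assign_guild(foraging_scores, "generalist"))  # diver
print(assign_guild(diet_scores, "omnivore"))        # omnivore (no score > 50%)
```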
Multiple Imputation
All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).
We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships, including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from the package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees, a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square root of the number of variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4), a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20) to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4–10 iterations).
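The study used the missForest function in R; as a hedged analogue only (not the authors' code), scikit-learn's IterativeImputer with a random-forest estimator captures the same idea, with max_features="sqrt" playing the role of mtry. A much smaller forest than the study's 1,000 trees is used here to keep the example fast:

```python
# Python analogue of missForest-style imputation (a sketch, not the study's
# R code): iteratively model each variable with missing entries as a
# random-forest function of the remaining variables.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=200)   # a correlated "trait"
X[rng.random(X.shape) < 0.1] = np.nan            # ~10% missing at random

imputer = IterativeImputer(
    estimator=RandomForestRegressor(
        n_estimators=50, max_features="sqrt", random_state=0),
    max_iter=20,  # generous cap; stops earlier on the convergence criterion
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False: all gaps filled
```

Unlike this sketch, missForest also handles categorical variables natively via classification trees, which is how the categorical traits (diet guild, foraging guild, pelagic specialism, migration status) were imputed.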
Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
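Pooling the repeated imputations reduces to a mean for continuous traits and a mode for categorical ones; a toy sketch (values invented, and only three repeats shown rather than 15):

```python
# Sketch of pooling repeated imputations for one species: mean across
# datasets for a continuous trait, mode for a categorical trait.
from statistics import mean, mode

clutch_size_imputations = [1.9, 2.1, 2.0]                          # continuous
diet_guild_imputations = ["omnivore", "invertebrate", "omnivore"]  # categorical

clutch_size_final = mean(clutch_size_imputations)
diet_guild_final = mode(diet_guild_imputations)
print(clutch_size_final, diet_guild_final)  # 2.0 omnivore
```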
Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Jose Alexandre F. Diniz-Filho et al., 2012; José Alexandre Felizola Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. Still, these eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012), however the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.
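As a generic illustration of principal-coordinate eigenvectors (the MPSEM functions used in the study build them from a directed phylogenetic graph, so this classical PCoA on a toy distance matrix shows only the underlying idea):

```python
# Classical PCoA sketch: Gower double-centre the squared distance matrix,
# eigendecompose, and keep the top-k axes scaled by sqrt(eigenvalue).
import numpy as np

def pcoa_eigenvectors(D, k):
    """Return the first k principal-coordinate axes of distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gower double-centring
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return vecs[:, :k] * np.sqrt(np.maximum(vals[:k], 0))

# toy distance matrix for 4 "species" forming two close pairs
D = np.array([[0, 2, 4, 4],
              [2, 0, 4, 4],
              [4, 4, 0, 2],
              [4, 4, 2, 0]], dtype=float)
E = pcoa_eigenvectors(D, 2)
print(E.shape)  # (4, 2); the first axis separates the two pairs
```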
To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root mean squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35 %, generation length = 0.6 ± 0.02 %). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97 %, foraging guild = 18.0 ± 1.05 %, pelagic specialism = 11.2 ± 0.66 %, migration status = 18.8 ± 0.58 %). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. High out-of-bag error values reflect low imputation accuracy; diet guild had the lowest imputation accuracy, with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
A fully synthetic but realistic dataset created for Machine Learning practice, focusing on how lifestyle, health habits, and medical indicators influence the risk of diabetes.
This dataset is ideal for:
This dataset simulates 5000 individuals with various health and lifestyle attributes. Each record represents one person and includes both clinical metrics (Glucose, BMI, Insulin) and behavioral factors (Diet, Exercise, Smoking, etc.).
The target variable is Diabetes_Status (0 = No Diabetes, 1 = Diabetes).
Missing values (~5%) are intentionally added to practice data cleaning, imputation, and analysis.
| Column | Description |
|---|---|
| Glucose | Blood glucose level (mg/dL), influenced by diet and heredity |
| BMI | Body Mass Index, realistically correlated with diet & exercise |
| Insulin | Insulin level (µU/mL), correlated with glucose |
| Age | Age of the individual (18–80 years) |
| Column | Description |
|---|---|
| Gender | Male or Female |
| Column | Description |
|---|---|
| Diet_Type | Healthy / Moderate / Unhealthy |
| Exercise_Frequency | Daily, 3–5 / week, 1–2 / week, Rarely |
| Heredity | Family history of diabetes (Yes/No) |
| Smoking | Smoking habit (Yes/No) |
| Alcohol | None / Low / Moderate / High |
| Stress_Score | Stress rating from 1 to 10 |
| Sleep_Hours | Average daily sleep duration |
| Column | Description |
|---|---|
| Diabetes_Status | 0 = Non-diabetic, 1 = Diabetic |
The probability of diabetes is influenced by:
This creates realistic correlations suitable for ML models.
Approximately 5% missing values are added randomly across all columns to help learners practice:
You can use this dataset for:
Many diabetes datasets (e.g., Pima) are small and limited. This dataset provides:
Perfect for students, beginners, and ML practitioners.
This dataset is fully synthetic and AI-generated, created exclusively for educational and machine learning purposes.
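A hedged sketch of how roughly 5% of cells might be blanked uniformly at random across all columns (this is not the actual generator used to build the dataset; the toy columns follow the tables above):

```python
# Hedged sketch (not the dataset's real generator): blank ~5% of cells
# independently at random across every column of a toy health-style table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Glucose": rng.normal(110, 25, 1000),
    "BMI": rng.normal(27, 5, 1000),
    "Stress_Score": rng.integers(1, 11, 1000).astype(float),
})
mask = rng.random(df.shape) < 0.05  # each cell missing with probability 0.05
df = df.mask(mask)
print(df.isna().mean().mean())  # overall missing fraction, close to 0.05
```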
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This panel dataset presents information on the impact of democracy and political stability on economic growth in 15 MENA countries for the period 1983-2022. The data are collected from five different sources: the World Bank Development Indicators (WDI), the World Bank Governance Indicators (WGI), the Penn World Table (PWT), Polity5 from the Integrated Network for Societal Conflict Research (INSCR), and the Varieties of Democracy (V-Dem). The dataset includes ten variables related to economic growth, democracy, and political stability. Data analysis was performed in R, including the imputation of missing data to ensure data reliability, enabling future researchers to explore the impact of political factors on growth in various contexts. The data are presented in two sheets, before and after the imputation of missing values. The potential reuse of this dataset lies in the ability to examine the impact of different political factors on economic growth in the region.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.
This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:
The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.
The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.
A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.
A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:
Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.
One-Hot Encoding: All categorical features have been converted to a numerical format.
Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.
The result is a clean, comprehensive dataset ready for modeling.
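The three preprocessing steps can be illustrated with a standard scikit-learn pipeline. Note that IterativeImputer with its default estimator stands in here for the GMM-powered imputer the authors describe, and all column names below are invented:

```python
# Illustrative preprocessing pipeline mirroring the three described steps:
# imputation, one-hot encoding of categoricals, and standard scaling.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "total_clicks": [120.0, np.nan, 340.0, 80.0],
    "avg_score": [55.0, 71.0, np.nan, 40.0],
    "gender": ["M", "F", "F", np.nan],
})

numeric = Pipeline([("impute", IterativeImputer(random_state=0)),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num", numeric, ["total_clicks", "avg_score"]),
                          ("cat", categorical, ["gender"])])
X = prep.fit_transform(df)
print(X.shape)  # (4, 4): 2 scaled numeric columns + 2 one-hot gender columns
```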
Each row represents a student profile, and the columns are the features and the target.
Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).
The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.
Key Columns:
Target Variable:
had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
0: The student passed (Pass or Distinction).
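The target engineering described above amounts to a simple mapping; a minimal sketch:

```python
# Sketch of the described target engineering: OULAD's final_result is
# collapsed into the binary had_difficulty label.
def had_difficulty(final_result: str) -> int:
    """1 if the student failed or withdrew, 0 if they passed."""
    return 1 if final_result in {"Fail", "Withdrawn"} else 0

print([had_difficulty(r) for r in ["Pass", "Distinction", "Fail", "Withdrawn"]])
# [0, 0, 1, 1]
```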
Feature Groups:
OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
(Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)
This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:
OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:
Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify).
Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).
Can you build separate models for each data origin (origem_dado_*) and compare ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predictive values are measures of the clinical accuracy of a binary diagnostic test, and depend on the sensitivity and the specificity of the diagnostic test and on the disease prevalence among the population being studied. This article studies hypothesis tests to simultaneously compare the predictive values of two binary diagnostic tests in the presence of missing data. The hypothesis tests were solved applying two computational methods: the expectation maximization and the supplemented expectation maximization algorithms, and multiple imputation. Simulation experiments were carried out to study the sizes and the powers of the hypothesis tests, giving some general rules of application. Two R programmes were written to apply each method, and they are available as supplementary material for the manuscript. The results were applied to the diagnosis of Alzheimer’s disease.
Data were collected from a network of UK farms using a density-structured survey method outlined in Queensborough (2011).
The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data is known to be often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on peptide and sample levels, followed by an evaluation of several popular normalization methods and visualization of missing values. The user then selects an adequate normalization method and one of several imputation methods used for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and the interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data sets. The three data sets reveal how the normalization methods perform differently on different experimental designs, and the need to evaluate normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.
The main objective of the Tanzania NPS is to provide high-quality household-level data to the Tanzanian government and other stakeholders for monitoring poverty dynamics, tracking the progress of the Mkukuta poverty reduction strategy, and evaluating the impact of other major, national-level government policy initiatives. As an integrated survey covering a number of different socioeconomic factors, it complements other more narrowly focused survey efforts, such as the Demographic and Health Survey on health, the Integrated Labour Force Survey on labour markets, the Household Budget Survey on expenditure, and the National Sample Census of Agriculture. Secondly, as a panel household survey in which the same households are revisited over time, the Tanzania NPS allows for the study of poverty and welfare transitions and the determinants of living standard changes.
National
Households
Sample survey data [ssd]
The sample design for the second round of the NPS revisits all the households interviewed in the first round of the panel, as well as tracking adult split-off household members. The original sample size of 3,265 households was designed to be representative at the national, urban/rural, and major agro-ecological zone levels. The total sample size was 3,265 households in 409 Enumeration Areas (2,063 households in rural areas and 1,202 in urban areas). It is also possible in the final analysis to produce disaggregated poverty rates for 4 different strata: Dar es Salaam, other urban areas on mainland Tanzania, rural mainland Tanzania, and Zanzibar.
Since the NPS is a panel survey, the second round of fieldwork revisits all households originally interviewed during round one. If a household has moved from its original location, the members were interviewed in their new location. If that location was within one hour of the original location, the field team did the interview at the time of their visit to the enumeration area. If the household had relocated more than an hour from the original location, details of the new location were recorded on specialized forms, and the information was passed to a dedicated tracking team for follow-up.
If a member of the original household had split from their original location to form or join a new household, information was recorded on the current whereabouts of this member. All adult former household members (those over the age of 15) were tracked to their new location. Similar to the protocol for the re-located households, if the new household is within one hour of the original location, the new household was interviewed by the main field team at the time of the visit to the enumeration area. For those that have moved more than one hour away, their information was passed to the dedicated tracking team for follow-up. Once the tracking targets have been found, teams are required to interview them and any new members of the household.
The total sample size for the second round of the NPS was 3,924 households. This includes 3,168 round-one households, a re-interview rate of over 97 percent. In addition, of the 10,420 eligible adults (over age 15 in 2010), 9,338 were re-interviewed, a re-interview rate of approximately 90 percent.
To obtain the attrition adjustment factor, the probability that a sample household was successfully re-interviewed in the second round of surveys is modelled with a linear logistic model at the level of the individual. A binary response variable is created by coding the response disposition for eligible households that do not respond in the second round as 0, and households that do respond as 1.
Face-to-face [f2f]
A CSPro-based data entry/editing system was used. A cross-comparison between the values captured by the field-based data entry and by double entry was conducted, and any differences in values between the two were flagged for manual inspection of the physical questionnaire. Corrections based on this inspection exercise were ultimately encoded in the dataset.
Additionally, an extensive review of data files was conducted, including interviewer errors such as missing values, ranges and outliers. Observations were returned for manual inspection of the physical questionnaires if continuous values fell outside five standard deviations of the mean, categorical values were not eligible responses, or there were internal inconsistencies within the dataset (for example, the age of an individual was not consistent with their educational status, there was more than one head of household listed, an individual was engaged in multiple primary activities, the quantity of crops and their by-products produced, harvested, and sold not listed, the distance from the market and an individual's plot was not listed, the number of weeks, days per week, and hours per day an individual engaged in fishery activity was not recorded, the species and quantity of fish caught, bought, sold, or traded was not listed, etc). When it was determined that these values were the result of data-entry error, the values were corrected. In addition, cases deemed to reflect obvious enumerator error were also corrected in this cleaning process. The majority of such cases involved the use of incorrect measurement units, e.g. recording grams as kilograms or vice versa.
Approximately 95 percent
To reduce the overall standard errors and to weight the population totals up to the known population figures, a post-stratification correction is applied. Based on the projected number of households in the urban and rural segments of each region, adjustment factors are calculated.
The estimated logistic model is used to obtain a predicted probability of response for each household member in the 2010/2011 survey. These response probabilities were then aggregated to the household level by taking the mean. Using the household-level predicted response probabilities as the ranking variable, all households were ranked into 10 equal groups (deciles). An attrition adjustment factor was then defined as the reciprocal of the empirical response rate within each household-level propensity decile.
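As a rough sketch of this decile-based adjustment, with simulated propensities and response indicators rather than the survey's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical household-level predicted response propensities (from a
# fitted logistic model) and observed response indicators.
hh = pd.DataFrame({"p_hat": rng.uniform(0.2, 0.95, 1000)})
hh["responded"] = rng.uniform(size=1000) < hh["p_hat"]

# Rank all households into 10 equal groups (deciles) by propensity.
hh["decile"] = pd.qcut(hh["p_hat"], 10, labels=False)

# Attrition adjustment factor: reciprocal of the empirical response rate
# within each propensity decile.
decile_rate = hh.groupby("decile")["responded"].mean()
hh["attrition_factor"] = 1.0 / hh["decile"].map(decile_rate)
```

Multiplying each responding household's weight by its attrition factor restores the weight mass of the non-respondents in the same decile.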
A logistic response propensity model was then fitted, using 2005 UNHS household and individual characteristics measured in the first wave as covariates. In a few limited cases, values of unit-level variables were missing from the 2008/2009 household dataset. These values were imputed using multivariate regression and logistic regression techniques. Imputations were done using the 'impute' command in Stata at the level of the UNPS strata (urban/rural and region). Overall, less than one percent of the variables required imputation to replace missing values.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010, based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fraction of NH4+ (and NO3-) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH4+- and/or NO3--forming N fertilizers. During data processing, we applied a statistical data imputation method for the missing data (19 % of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values using covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. We also applied the multiple imputation method to the available chemical fertilizer species consumption, allowing estimation of the NH4+/NO3- ratio in national fertilizer consumption. In this study, the synthetic N fertilizer inputs in 2000 showed general consistency with the existing N fertilizer map (Potter et al., 2010, doi:10.1175/2009EI288.1) with respect to the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 Tg-N to 110 Tg-N during 1961-2010. On the other hand, the global NO3- input started to decline after the late 1980s, and the fraction of NO3- in global N fertilizer decreased consistently from 35 % to 13 % over the 50-year period. NH4+-based fertilizers are dominant in most countries; however, the NH4+/NO3- ratio in N fertilizer inputs shows clear differences temporally and geographically. This new map can be utilized as input data for global model studies and can bring new insights to the assessment of historical terrestrial N cycling changes.
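Gap-filling by multiple imputation with covariates can be sketched as below, using scikit-learn's IterativeImputer as a stand-in for the authors' actual MI tooling; the data, coefficients, and missingness pattern are all invented:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
# Hypothetical country-year records: covariates (year, population, GDP,
# crop area) plus N fertilizer consumption with ~19 % of values missing.
n = 200
year = rng.uniform(1961, 2010, n)
pop = rng.uniform(1, 100, n)
gdp = pop * rng.uniform(0.5, 2.0, n)
crop = pop * rng.uniform(0.1, 0.5, n)
fert = 0.3 * crop + 0.1 * gdp + rng.normal(0, 0.5, n)
X = np.column_stack([year, pop, gdp, crop, fert])
missing = rng.uniform(size=n) < 0.19
X[missing, 4] = np.nan

# Multiple imputation: generate several plausible completed datasets by
# sampling from the posterior predictive distribution of the gaps.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(5)
]
fert_filled = np.mean([m[:, 4] for m in imputations], axis=0)
```

Keeping several completed datasets, rather than a single fill, is what lets the downstream analysis reflect the uncertainty of the imputed values.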
Phylogenetic imputation has recently emerged as a potentially powerful tool for predicting missing data in functional trait datasets. As such, understanding the limitations of phylogenetic modelling in predicting trait values is critical if we are to use them in subsequent analyses. Previous studies have focused on the relationship between phylogenetic signal and clade-level prediction accuracy, yet variability in prediction accuracy among individual tips of phylogenies remains largely unexplored. Here, we used simulations of trait evolution along the branches of phylogenetic trees to show how the accuracy of phylogenetic imputations is influenced by the combined effects of (1) the amount of phylogenetic signal in the traits and (2) the branch length of the tips to be imputed. Specifically, we conducted cross-validation trials to estimate the variability in prediction accuracy among individual tips on the phylogenies (hereafter "tip-level accuracy"). We found that under a Brownian motion model of evolution (BM, Pagel's λ = 1), tip-level accuracy rapidly decreased with increasing tip branch-lengths, and only tips of approximately 10% or less of the total height of the trees showed consistently accurate predictions (i.e. cross-validation R-squared > 0.75). When phylogenetic signal was weak, the effect of tip branch-length was reduced, becoming negligible for traits simulated with λ < 0.7, where accuracy was in any case low. Our study shows that variability in prediction accuracy among individual tips of the phylogeny should be considered when evaluating the reliability of phylogenetically imputed trait values. To address this challenge, we describe a Monte Carlo-based method that allows one to estimate the expected tip-level accuracy of phylogenetic predictions for continuous traits.
Our approach identifies gaps in functional trait datasets for which phylogenetic imputation performs poorly, and will help ecologists to design more efficient trait collection campaigns by focusing resources on lineages whose trait values are more uncertain.
This dataset contains QSAR data (from ChEMBL version 17) showing activity values (unit: pseudo-pCI50) of several compounds on drug target ChEMBL_ID: CHEMBL2010624 (TID: 104492). It has 11 rows and 294 features (not including the molecule ID and class features, molecule_id and pXC50). The features represent molecular descriptors generated from SMILES strings. Missing value imputation was applied to this dataset using the column median. Feature selection was also applied.
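Median imputation of this kind can be illustrated on a toy descriptor table; the values and column names below are invented, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical descriptor matrix with gaps (marked as NaN).
X = pd.DataFrame({
    "desc_1": [0.4, np.nan, 0.9, 1.1],
    "desc_2": [2.0, 3.0, np.nan, 5.0],
})

# Median imputation: each missing entry is replaced by its column median.
X_imputed = X.fillna(X.median())
print(X_imputed)
```

The median is often preferred over the mean here because molecular descriptors can be heavily skewed, and the median is robust to such outliers.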
CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time, in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r = 0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbour and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA to each other, and better PA than the generalized ridge regression heteroscedastic effect model, for the traits evaluated.
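The imputation comparison can be sketched as follows, using scikit-learn's KNNImputer on a simulated 0/1/2-coded genotype matrix; this illustrates the general technique, not the study's actual pipeline:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
# Hypothetical genotype matrix: 50 trees x 30 SNPs coded 0/1/2, with
# 60 % of entries masked out, mirroring the missingness level above.
G = rng.integers(0, 3, size=(50, 30)).astype(float)
mask = rng.uniform(size=G.shape) < 0.6
G_obs = G.copy()
G_obs[mask] = np.nan

# k-nearest-neighbour imputation: each gap is filled from the k most
# similar trees at the SNPs they share.
G_knn = KNNImputer(n_neighbors=5).fit_transform(G_obs)

# Mean imputation baseline: fill each gap with the SNP's observed mean.
G_mean = np.where(mask, np.nanmean(G_obs, axis=0), G_obs)
```

Unlike the column mean, the kNN fill can exploit linkage between nearby trees' genotypes, which is one reason it (and SVD-based imputation) outperformed mean imputation in the study.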
This dataset includes imputation for missing data in key variables in the ten percent sample of the 2001 South African Census. Researchers at the Centre for the Analysis of South African Social Policy (CASASP) at the University of Oxford used sequential multiple regression techniques to impute income, education, age, gender, population group, occupation and employment status in the dataset. The main focus of the work was to impute income where it was missing or recorded as zero. The imputed results are similar to previous imputation work on the 2001 South African Census, including the single 'hot-deck' imputation carried out by Statistics South Africa.
Sample survey data [ssd]
Face-to-face [f2f]