Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean absolute percentage errors of gold loss for error variances of 1, 3, 5, 7, and 9; a sample size of 48; and missing value percentages of 5, 10, 15, 20, 30, and 40% with SRS in missing value estimation methods.
The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. All datasets were taken from Kaggle and modified by adding missing values, which are denoted by the symbol (?). The experiment evaluates methods for imputing missing nominal values. The missing values in the three datasets are in the range of 10%-80%.
The Arabic dataset was modified as follows: 1. Columns containing English values, such as Id, poem_link, and poet link, were deleted, so that the ERAR method could be evaluated on a purely Arabic dataset. 2. Diacritical marks were added to some records to check their effect during frequent itemset generation. Note: the results of the experiment on the Arabic dataset can be found in the paper titled "Missing values imputation in Arabic datasets using enhanced robust association rules".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean absolute percentage errors for the scenario where error variance was 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40% with RSS in missing value estimation methods.
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean absolute percentage errors of platinum loss for the scenarios with error variances of 1, 3, 5, 7, and 9; a sample size of 48; and missing value percentages of 5, 10, 15, 20, 30, and 40% with SRS in missing value estimation methods.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may not be suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Overview
Dive into the Extrovert vs. Introvert Personality Traits Dataset, a rich collection of behavioral and social data designed to explore the spectrum of human personality. This dataset captures key indicators of extroversion and introversion, making it a valuable resource for psychologists, data scientists, and researchers studying social behavior, personality prediction, or data preprocessing techniques.
Context
Personality traits like extroversion and introversion shape how individuals interact with their social environments. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to uncovering fascinating insights.
Dataset Details
Size: The dataset contains 2,900 rows and 8 columns.
Features:
- Time_spent_Alone: Hours spent alone daily (0–11).
- Stage_fear: Presence of stage fright (Yes/No).
- Social_event_attendance: Frequency of social events (0–10).
- Going_outside: Frequency of going outside (0–7).
- Drained_after_socializing: Feeling drained after socializing (Yes/No).
- Friends_circle_size: Number of close friends (0–15).
- Post_frequency: Social media post frequency (0–10).
- Personality: Target variable (Extrovert/Introvert).
Data Quality: Includes some missing values, ideal for practicing imputation and preprocessing.
Format: Single CSV file, compatible with Python, R, and other tools.
Data Quality Notes
Potential Use Cases
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Aim
Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.
Location
Global
Time period
Contemporary
Major taxa studied
Seabirds
Methods
We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.
Results
We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.
Main conclusions
Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.
Methods Trait Selection and Data
We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.
Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
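As an illustration only (the function and the sample scores below are hypothetical, not taken from the study), the "assign to a guild when its score exceeds 50%, otherwise fall back to generalist/omnivore" rule could be sketched as:

```python
# Hypothetical sketch of the guild-assignment rule: a species is assigned
# to the guild whose usage score exceeds 50%; otherwise it falls back to
# "generalist" (foraging guild) or "omnivore" (diet guild).
def assign_guild(scores, fallback):
    """scores maps guild name -> relative usage in percent (0-100)."""
    top = max(scores, key=scores.get)
    return top if scores[top] > 50 else fallback

foraging_scores = {"diver": 70, "surface": 20, "ground": 10}
diet_scores = {"omnivore": 40, "invertebrate": 35, "vertebrate & scavenger": 25}

print(assign_guild(foraging_scores, "generalist"))  # diver
print(assign_guild(diet_scores, "omnivore"))        # omnivore (no score > 50%)
```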
Multiple Imputation
All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).
We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships, including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from the package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees, a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square root of the number of variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4), a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20) to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4–10 iterations).
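The study used the missForest function in R; as a hedged analogue only (not the authors' code), scikit-learn's IterativeImputer with a random-forest estimator captures the same idea, with max_features="sqrt" playing the role of mtry. A much smaller forest than the study's 1,000 trees is used here to keep the example fast:

```python
# Python analogue of missForest-style imputation (a sketch, not the study's
# R code): iteratively model each variable with missing entries as a
# random-forest function of the remaining variables.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=200)   # a correlated "trait"
X[rng.random(X.shape) < 0.1] = np.nan            # ~10% missing at random

imputer = IterativeImputer(
    estimator=RandomForestRegressor(
        n_estimators=50, max_features="sqrt", random_state=0),
    max_iter=20,  # generous cap; stops earlier on the convergence criterion
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False: all gaps filled
```

Unlike this sketch, missForest also handles categorical variables natively via classification trees, which is how the categorical traits (diet guild, foraging guild, pelagic specialism, migration status) were imputed.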
Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
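Pooling the repeated imputations reduces to a mean for continuous traits and a mode for categorical ones; a toy sketch (values invented, and only three repeats shown rather than 15):

```python
# Sketch of pooling repeated imputations for one species: mean across
# datasets for a continuous trait, mode for a categorical trait.
from statistics import mean, mode

clutch_size_imputations = [1.9, 2.1, 2.0]                          # continuous
diet_guild_imputations = ["omnivore", "invertebrate", "omnivore"]  # categorical

clutch_size_final = mean(clutch_size_imputations)
diet_guild_final = mode(diet_guild_imputations)
print(clutch_size_final, diet_guild_final)  # 2.0 omnivore
```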
Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Jose Alexandre F. Diniz-Filho et al., 2012; José Alexandre Felizola Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. Still, these eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012), however the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.
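As a generic illustration of principal-coordinate eigenvectors (the MPSEM functions used in the study build them from a directed phylogenetic graph, so this classical PCoA on a toy distance matrix shows only the underlying idea):

```python
# Classical PCoA sketch: Gower double-centre the squared distance matrix,
# eigendecompose, and keep the top-k axes scaled by sqrt(eigenvalue).
import numpy as np

def pcoa_eigenvectors(D, k):
    """Return the first k principal-coordinate axes of distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gower double-centring
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return vecs[:, :k] * np.sqrt(np.maximum(vals[:k], 0))

# toy distance matrix for 4 "species" forming two close pairs
D = np.array([[0, 2, 4, 4],
              [2, 0, 4, 4],
              [4, 4, 0, 2],
              [4, 4, 2, 0]], dtype=float)
E = pcoa_eigenvectors(D, 2)
print(E.shape)  # (4, 2); the first axis separates the two pairs
```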
To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root mean squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35 %, generation length = 0.6 ± 0.02 %). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97 %, foraging guild = 18.0 ± 1.05 %, pelagic specialism = 11.2 ± 0.66 %, migration status = 18.8 ± 0.58 %). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. High out-of-bag error values reflect low imputation accuracy; diet guild had the lowest imputation accuracy, with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
A fully synthetic but realistic dataset created for Machine Learning practice, focusing on how lifestyle, health habits, and medical indicators influence the risk of diabetes.
This dataset is ideal for:
This dataset simulates 5000 individuals with various health and lifestyle attributes. Each record represents one person and includes both clinical metrics (Glucose, BMI, Insulin) and behavioral factors (Diet, Exercise, Smoking, etc.).
The target variable is Diabetes_Status (0 = No Diabetes, 1 = Diabetes).
Missing values (~5%) are intentionally added to practice data cleaning, imputation, and analysis.
| Column | Description |
|---|---|
| Glucose | Blood glucose level (mg/dL), influenced by diet and heredity |
| BMI | Body Mass Index, realistically correlated with diet & exercise |
| Insulin | Insulin level (µU/mL), correlated with glucose |
| Age | Age of the individual (18–80 years) |
| Column | Description |
|---|---|
| Gender | Male or Female |
| Column | Description |
|---|---|
| Diet_Type | Healthy / Moderate / Unhealthy |
| Exercise_Frequency | Daily, 3–5 / week, 1–2 / week, Rarely |
| Heredity | Family history of diabetes (Yes/No) |
| Smoking | Smoking habit (Yes/No) |
| Alcohol | None / Low / Moderate / High |
| Stress_Score | Stress rating from 1 to 10 |
| Sleep_Hours | Average daily sleep duration |
| Column | Description |
|---|---|
| Diabetes_Status | 0 = Non-diabetic, 1 = Diabetic |
The probability of diabetes is influenced by:
This creates realistic correlations suitable for ML models.
Approximately 5% missing values are added randomly across all columns to help learners practice:
You can use this dataset for:
Many diabetes datasets (e.g., Pima) are small and limited. This dataset provides:
Perfect for students, beginners, and ML practitioners.
This dataset is fully synthetic and AI-generated, created exclusively for educational and machine learning purposes.
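A hedged sketch of how roughly 5% of cells might be blanked uniformly at random across all columns (this is not the actual generator used to build the dataset; the toy columns follow the tables above):

```python
# Hedged sketch (not the dataset's real generator): blank ~5% of cells
# independently at random across every column of a toy health-style table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Glucose": rng.normal(110, 25, 1000),
    "BMI": rng.normal(27, 5, 1000),
    "Stress_Score": rng.integers(1, 11, 1000).astype(float),
})
mask = rng.random(df.shape) < 0.05  # each cell missing with probability 0.05
df = df.mask(mask)
print(df.isna().mean().mean())  # overall missing fraction, close to 0.05
```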
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This panel dataset presents information on the impact of democracy and political stability on economic growth in 15 MENA countries for the period 1983-2022. The data are collected from five different sources: the World Bank Development Indicators (WDI), the World Bank Governance Indicators (WGI), the Penn World Table (PWT), Polity5 from the Integrated Network for Societal Conflict Research (INSCR), and the Varieties of Democracy (V-Dem). The dataset includes ten variables related to economic growth, democracy, and political stability. Data analysis was performed in R, including the imputation of missing data to ensure data reliability, enabling future researchers to explore the impact of political factors on growth in various contexts. The data are presented in two sheets, before and after the imputation of missing values. The potential reuse of this dataset lies in the ability to examine the impact of different political factors on economic growth in the region.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.
This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:
The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.
The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.
A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.
A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:
Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.
One-Hot Encoding: All categorical features have been converted to a numerical format.
Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.
The result is a clean, comprehensive dataset ready for modeling.
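The three preprocessing steps can be illustrated with a standard scikit-learn pipeline. Note that IterativeImputer with its default estimator stands in here for the GMM-powered imputer the authors describe, and all column names below are invented:

```python
# Illustrative preprocessing pipeline mirroring the three described steps:
# imputation, one-hot encoding of categoricals, and standard scaling.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "total_clicks": [120.0, np.nan, 340.0, 80.0],
    "avg_score": [55.0, 71.0, np.nan, 40.0],
    "gender": ["M", "F", "F", np.nan],
})

numeric = Pipeline([("impute", IterativeImputer(random_state=0)),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num", numeric, ["total_clicks", "avg_score"]),
                          ("cat", categorical, ["gender"])])
X = prep.fit_transform(df)
print(X.shape)  # (4, 4): 2 scaled numeric columns + 2 one-hot gender columns
```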
Each row represents a student profile, and the columns are the features and the target.
Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).
The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.
Key Columns:
Target Variable:
had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
0: The student passed (Pass or Distinction).
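The target engineering described above amounts to a simple mapping; a minimal sketch:

```python
# Sketch of the described target engineering: OULAD's final_result is
# collapsed into the binary had_difficulty label.
def had_difficulty(final_result: str) -> int:
    """1 if the student failed or withdrew, 0 if they passed."""
    return 1 if final_result in {"Fail", "Withdrawn"} else 0

print([had_difficulty(r) for r in ["Pass", "Distinction", "Fail", "Withdrawn"]])
# [0, 0, 1, 1]
```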
Feature Groups:
OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
(Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)
This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:
OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:
Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify).
Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).
Can you build separate models for each data origin (origem_dado_*) and compare ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Predictive values are measures of the clinical accuracy of a binary diagnostic test, and depend on the sensitivity and the specificity of the diagnostic test and on the disease prevalence among the population being studied. This article studies hypothesis tests to simultaneously compare the predictive values of two binary diagnostic tests in the presence of missing data. The hypothesis tests were solved applying two computational methods: the expectation maximization and the supplemented expectation maximization algorithms, and multiple imputation. Simulation experiments were carried out to study the sizes and the powers of the hypothesis tests, giving some general rules of application. Two R programmes were written to apply each method, and they are available as supplementary material for the manuscript. The results were applied to the diagnosis of Alzheimer’s disease.
Data were collected from a network of UK farms using a density-structured survey method outlined in Queensborough (2011).
The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data is known to be often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on peptide and sample levels, followed by an evaluation of several popular normalization methods and visualization of missing values. The user then selects an adequate normalization method and one of several imputation methods used for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and the interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data sets. The three data sets reveal how the normalization methods perform differently on different experimental designs, and the need to evaluate normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.
The main objective of the Tanzania NPS is to provide high-quality household-level data to the Tanzanian government and other stakeholders for monitoring poverty dynamics, tracking the progress of the Mkukuta poverty reduction strategy, and evaluating the impact of other major, national-level government policy initiatives. As an integrated survey covering a number of different socioeconomic factors, it complements other more narrowly focused survey efforts, such as the Demographic and Health Survey on health, the Integrated Labour Force Survey on labour markets, the Household Budget Survey on expenditure, and the National Sample Census of Agriculture. Secondly, as a panel household survey in which the same households are revisited over time, the Tanzania NPS allows for the study of poverty and welfare transitions and the determinants of living standard changes.
National
Households
Sample survey data [ssd]
The sample design for the second round of the NPS revisits all the households interviewed in the first round of the panel, as well as tracking adult split-off household members. The original sample size of 3,265 households was designed to be representative at the national, urban/rural, and major agro-ecological zone levels. The total sample size was 3,265 households in 409 Enumeration Areas (2,063 households in rural areas and 1,202 in urban areas). It is also possible in the final analysis to produce disaggregated poverty rates for 4 different strata: Dar es Salaam, other urban areas on mainland Tanzania, rural mainland Tanzania, and Zanzibar.
Since the NPS is a panel survey, the second round of fieldwork revisits all households originally interviewed during round one. If a household has moved from its original location, the members were interviewed in their new location. If that location was within one hour of the original location, the field team did the interview at the time of their visit to the enumeration area. If the household had relocated more than an hour from the original location, details of the new location were recorded on specialized forms, and the information was passed to a dedicated tracking team for follow-up.
If a member of the original household had split from their original location to form or join a new household, information was recorded on the current whereabouts of this member. All adult former household members (those over the age of 15) were tracked to their new location. Similar to the protocol for the re-located households, if the new household is within one hour of the original location, the new household was interviewed by the main field team at the time of the visit to the enumeration area. For those that have moved more than one hour away, their information was passed to the dedicated tracking team for follow-up. Once the tracking targets have been found, teams are required to interview them and any new members of the household.
The total sample size for the second round of the NPS was 3,924 households. This includes 3,168 round-one households, a re-interview rate of over 97 percent. In addition, of the 10,420 eligible adults (over age 15 in 2010), 9,338 were re-interviewed, a re-interview rate of approximately 90 percent.
To obtain the attrition adjustment factor, the probability that a sample household was successfully re-interviewed in the second round of surveys is modelled with a linear logistic model at the level of the individual. A binary response variable is created by coding the response disposition for eligible households that do not respond in the second round as 0, and households that do respond as 1.
Face-to-face [f2f]
A CSPro-based data entry/editing system was used. A cross-comparison between the values captured by the field-based data entry and by double entry was conducted, and any differences in values between the two were flagged for manual inspection of the physical questionnaire. Corrections based on this inspection exercise were ultimately encoded in the dataset.
Additionally, an extensive review of data files was conducted, including interviewer errors such as missing values, ranges and outliers. Observations were returned for manual inspection of the physical questionnaires if continuous values fell outside five standard deviations of the mean, categorical values were not eligible responses, or there were internal inconsistencies within the dataset (for example, the age of an individual was not consistent with their educational status, there was more than one head of household listed, an individual was engaged in multiple primary activities, the quantity of crops and their by-products produced, harvested, and sold not listed, the distance from the market and an individual's plot was not listed, the number of weeks, days per week, and hours per day an individual engaged in fishery activity was not recorded, the species and quantity of fish caught, bought, sold, or traded was not listed, etc). When it was determined that these values were the result of data-entry error, the values were corrected. In addition, cases deemed to reflect obvious enumerator error were also corrected in this cleaning process. The majority of such cases involved the use of incorrect measurement units, e.g. recording grams as kilograms or vice versa.
Approximately 95 percent
To reduce the overall standard errors and to weight the population totals up to the known population figures, a post-stratification correction is applied. Based on the projected number of households in the urban and rural segments of each region, adjustment factors are calculated.
The estimated logistic model is used to obtain a predicted probability of response for each household member in the 2010/2011 survey. These response probabilities were then aggregated to the household level by taking the mean. Using the household-level predicted response probabilities as the ranking variable, all households were ranked into 10 equal groups (deciles). An attrition adjustment factor was then defined as the reciprocal of the empirical response rate within each household-level propensity decile.
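As a rough sketch of this decile-based adjustment, with simulated propensities and response indicators rather than the survey's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical household-level predicted response propensities (from a
# fitted logistic model) and observed response indicators.
hh = pd.DataFrame({"p_hat": rng.uniform(0.2, 0.95, 1000)})
hh["responded"] = rng.uniform(size=1000) < hh["p_hat"]

# Rank all households into 10 equal groups (deciles) by propensity.
hh["decile"] = pd.qcut(hh["p_hat"], 10, labels=False)

# Attrition adjustment factor: reciprocal of the empirical response rate
# within each propensity decile.
decile_rate = hh.groupby("decile")["responded"].mean()
hh["attrition_factor"] = 1.0 / hh["decile"].map(decile_rate)
```

Multiplying each responding household's weight by its attrition factor restores the weight mass of the non-respondents in the same decile.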
A logistic response propensity model was then fitted, using 2005 UNHS household and individual characteristics measured in the first wave as covariates. In a few limited cases, values of unit-level variables were missing from the 2008/2009 household dataset. These values were imputed using multivariate regression and logistic regression techniques. Imputations were done using the 'impute' command in Stata at the level of the UNPS strata (urban/rural and region). Overall, less than one percent of the variables required imputation to replace missing values.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010, based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fraction of NH4+ (and NO3-) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH4+- and/or NO3--forming N fertilizers. During data processing, we applied a statistical data imputation method for the missing data (19 % of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values using covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. We also applied the multiple imputation method to the available chemical fertilizer species consumption, allowing estimation of the NH4+/NO3- ratio in national fertilizer consumption. In this study, the synthetic N fertilizer inputs in 2000 showed general consistency with the existing N fertilizer map (Potter et al., 2010, doi:10.1175/2009EI288.1) with respect to the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 Tg-N to 110 Tg-N during 1961-2010. On the other hand, the global NO3- input started to decline after the late 1980s, and the fraction of NO3- in global N fertilizer decreased consistently from 35 % to 13 % over the 50-year period. NH4+-based fertilizers are dominant in most countries; however, the NH4+/NO3- ratio in N fertilizer inputs shows clear differences temporally and geographically. This new map can be utilized as input data for global model studies and can bring new insights to the assessment of historical terrestrial N cycling changes.
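Gap-filling by multiple imputation with covariates can be sketched as below, using scikit-learn's IterativeImputer as a stand-in for the authors' actual MI tooling; the data, coefficients, and missingness pattern are all invented:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
# Hypothetical country-year records: covariates (year, population, GDP,
# crop area) plus N fertilizer consumption with ~19 % of values missing.
n = 200
year = rng.uniform(1961, 2010, n)
pop = rng.uniform(1, 100, n)
gdp = pop * rng.uniform(0.5, 2.0, n)
crop = pop * rng.uniform(0.1, 0.5, n)
fert = 0.3 * crop + 0.1 * gdp + rng.normal(0, 0.5, n)
X = np.column_stack([year, pop, gdp, crop, fert])
missing = rng.uniform(size=n) < 0.19
X[missing, 4] = np.nan

# Multiple imputation: generate several plausible completed datasets by
# sampling from the posterior predictive distribution of the gaps.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(5)
]
fert_filled = np.mean([m[:, 4] for m in imputations], axis=0)
```

Keeping several completed datasets, rather than a single fill, is what lets the downstream analysis reflect the uncertainty of the imputed values.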
Phylogenetic imputation has recently emerged as a potentially powerful tool for predicting missing data in functional trait datasets. As such, understanding the limitations of phylogenetic modelling in predicting trait values is critical if we are to use them in subsequent analyses. Previous studies have focused on the relationship between phylogenetic signal and clade-level prediction accuracy, yet variability in prediction accuracy among individual tips of phylogenies remains largely unexplored. Here, we used simulations of trait evolution along the branches of phylogenetic trees to show how the accuracy of phylogenetic imputations is influenced by the combined effects of (1) the amount of phylogenetic signal in the traits and (2) the branch length of the tips to be imputed. Specifically, we conducted cross-validation trials to estimate the variability in prediction accuracy among individual tips on the phylogenies (hereafter "tip-level accuracy"). We found that under a Brownian motion model of evolution (BM, Pagel's λ = 1), tip-level accuracy rapidly decreased with increasing tip branch-lengths, and only tips of approximately 10% or less of the total height of the trees showed consistently accurate predictions (i.e. cross-validation R-squared > 0.75). When phylogenetic signal was weak, the effect of tip branch-length was reduced, becoming negligible for traits simulated with λ < 0.7, where accuracy was in any case low. Our study shows that variability in prediction accuracy among individual tips of the phylogeny should be considered when evaluating the reliability of phylogenetically imputed trait values. To address this challenge, we describe a Monte Carlo-based method that allows one to estimate the expected tip-level accuracy of phylogenetic predictions for continuous traits.
Our approach identifies gaps in functional trait datasets for which phylogenetic imputation performs poorly, and will help ecologists to design more efficient trait collection campaigns by focusing resources on lineages whose trait values are more uncertain.
This dataset contains QSAR data (from ChEMBL version 17) showing activity values (unit: pseudo-pCI50) of several compounds on drug target ChEMBL_ID: CHEMBL2010624 (TID: 104492). It has 11 rows and 294 features (not including the molecule ID and class features, molecule_id and pXC50). The features represent molecular descriptors generated from SMILES strings. Missing value imputation was applied to this dataset using the column median. Feature selection was also applied.
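Median imputation of this kind can be illustrated on a toy descriptor table; the values and column names below are invented, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical descriptor matrix with gaps (marked as NaN).
X = pd.DataFrame({
    "desc_1": [0.4, np.nan, 0.9, 1.1],
    "desc_2": [2.0, 3.0, np.nan, 5.0],
})

# Median imputation: each missing entry is replaced by its column median.
X_imputed = X.fillna(X.median())
print(X_imputed)
```

The median is often preferred over the mean here because molecular descriptors can be heavily skewed, and the median is robust to such outliers.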
CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html
Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time, in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r = 0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbour and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA to each other, and better PA than the generalized ridge regression heteroscedastic effect model, for the traits evaluated.
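The imputation comparison can be sketched as follows, using scikit-learn's KNNImputer on a simulated 0/1/2-coded genotype matrix; this illustrates the general technique, not the study's actual pipeline:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
# Hypothetical genotype matrix: 50 trees x 30 SNPs coded 0/1/2, with
# 60 % of entries masked out, mirroring the missingness level above.
G = rng.integers(0, 3, size=(50, 30)).astype(float)
mask = rng.uniform(size=G.shape) < 0.6
G_obs = G.copy()
G_obs[mask] = np.nan

# k-nearest-neighbour imputation: each gap is filled from the k most
# similar trees at the SNPs they share.
G_knn = KNNImputer(n_neighbors=5).fit_transform(G_obs)

# Mean imputation baseline: fill each gap with the SNP's observed mean.
G_mean = np.where(mask, np.nanmean(G_obs, axis=0), G_obs)
```

Unlike the column mean, the kNN fill can exploit linkage between nearby trees' genotypes, which is one reason it (and SVD-based imputation) outperformed mean imputation in the study.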
This dataset includes imputation for missing data in key variables in the ten percent sample of the 2001 South African Census. Researchers at the Centre for the Analysis of South African Social Policy (CASASP) at the University of Oxford used sequential multiple regression techniques to impute income, education, age, gender, population group, occupation and employment status in the dataset. The main focus of the work was to impute income where it was missing or recorded as zero. The imputed results are similar to previous imputation work on the 2001 South African Census, including the single 'hot-deck' imputation carried out by Statistics South Africa.
Sample survey data [ssd]
Face-to-face [f2f]