100+ datasets found

f
Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...
plos.figshare.com
figshare.com
pdf
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0164464
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Nikolaj Bak; Lars K. Hansen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.
Data from: Water-quality data imputation with a high percentage of missing...
zenodo.org
explore.openaire.eu
+1more
csv
Updated Jun 8, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. http://doi.org/10.5281/zenodo.4731169
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4731169
Dataset updated
Jun 8, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rafael Rodríguez; Rafael Rodríguez; Marcos Pastorini; Marcos Pastorini; Lorena Etcheverry; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Alberto Castro; Angela Gorgoglione; Angela Gorgoglione; Christian Chreties; Mónica Fossati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms implemented belonged to both univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR)).

IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)

Dissolved oxygen (DO)

Electrical conductivity (EC)

pH

Turbidity (Turb)

Nitrite (NO2-)

Nitrate (NO3-)

Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
f
Data from: Variable Selection with Multiply-Imputed Datasets: Choosing...
tandf.figshare.com
pdf
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee (2023). Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked and Grouped Methods [Dataset]. http://doi.org/10.6084/m9.figshare.19111441.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19111441.v2
Dataset updated
Jun 3, 2023
Dataset provided by
Taylor & Francis
Authors
Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.
n
Data from: Using multiple imputation to estimate missing data in...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Nov 25, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray (2015). Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.m2v4m
Dataset updated
Nov 25, 2015
Dataset provided by
Trent University
University of Prince Edward Island
Authors
E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data is a serious problem for any synthetic efforts and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified. 2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets. 3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, was dependent on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data which was then used to calculate effect size and the weighting variable. 4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.
n
Data from: Missing data estimation in morphometrics: how much is too much?
data.niaid.nih.gov
datadryad.org
+2more
zip
Updated Dec 5, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julien Clavel; Gildas Merceron; Gilles Escarguel (2013). Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.f0b50
Dataset updated
Dec 5, 2013
Dataset provided by
Centre National de la Recherche Scientifique
Authors
Julien Clavel; Gildas Merceron; Gilles Escarguel
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is by no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
J
MAXIMUM LIKELIHOOD ESTIMATION OF FACTOR MODELS ON DATASETS WITH ARBITRARY...
jda-test.zbw.eu
journaldata.zbw.eu
txt
Updated Nov 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marta Banbura; Michele Modugno; Marta Banbura; Michele Modugno (2022). MAXIMUM LIKELIHOOD ESTIMATION OF FACTOR MODELS ON DATASETS WITH ARBITRARY PATTERN OF MISSING DATA (replication data) [Dataset]. https://jda-test.zbw.eu/dataset/maximum-likelihood-estimation-of-factor-models-on-datasets-with-arbitrary-pattern-of-missing-data
Explore at:
txt(2114), txt(94822), txt(6719)Available download formats
Dataset updated
Nov 4, 2022
Dataset provided by
ZBW - Leibniz Informationszentrum Wirtschaft
Authors
Marta Banbura; Michele Modugno; Marta Banbura; Michele Modugno
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this paper we modify the expectation maximization algorithm in order to estimate the parameters of the dynamic factor model on a dataset with an arbitrary pattern of missing data. We also extend the model to the case with a serially correlated idiosyncratic component. The framework allows us to handle efficiently and in an automatic manner sets of indicators characterized by different publication delays, frequencies and sample lengths. This can be relevant, for example, for young economies for which many indicators have been compiled only recently. We evaluate the methodology in a Monte Carlo experiment and we apply it to nowcasting of the euro area gross domestic product.
f
MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...
tandf.figshare.com
pdf
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche (2023). MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as Cellwise and Rowwise Outliers [Dataset]. http://doi.org/10.6084/m9.figshare.7624424.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7624424.v2
Dataset updated
Jun 2, 2023
Dataset provided by
Taylor & Francis
Authors
Mia Hubert; Peter J. Rousseeuw; Wannes Van den Bossche
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, that is, rows that deviate from the majority of the rows in the data (e.g., they might belong to a different population). In recent years also cellwise outliers are receiving attention. These are suspicious cells (entries) that can occur anywhere in the table. Even a relatively small proportion of outlying cells can contaminate over half the rows, which causes rowwise robust methods to break down. In this article, a new PCA method is constructed which combines the strengths of two existing robust methods to be robust against both cellwise and rowwise outliers. At the same time, the algorithm can cope with missing values. As of yet it is the only PCA method that can deal with all three problems simultaneously. Its name MacroPCA stands for PCA allowing for Missingness And Cellwise & Rowwise Outliers. Several simulations and real datasets illustrate its robustness. New residual maps are introduced, which help to determine which variables are responsible for the outlying behavior. The method is well-suited for online process control.
d
A Correction for Structural Equation Modeling Fit Indices Under Missingness:...
dataone.org
dataverse.harvard.edu
Updated Nov 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fitzgerald, Cailey E. (2023). A Correction for Structural Equation Modeling Fit Indices Under Missingness: Adapting the Root Mean Squared Error of Approximation to Conditions of Missing Data [Dataset]. http://doi.org/10.7910/DVN/28657
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/28657
Dataset updated
Nov 20, 2023
Dataset provided by
Harvard Dataverse
Authors
Fitzgerald, Cailey E.
Description
Missing data is a frequent occurrence in both small and large datasets. Among other things, missingness may be a result of coding or computer error, participant absences, or it may be intentional, as in a planned missing design. Whatever the cause, the problem of how to approach a dataset with holes is of much relevance in scientific research. First, missingness is approached as a theoretical construct, and its impacts on data analysis are encountered. I discuss missingness as it relates to structural equation modeling and model fit indices, specifically its interaction with the Root Mean Square Error of Approximation (RMSEA). Data simulation is used to show that RMSEA has a downward bias with missing data, yielding skewed fit indices. Two alternative formulas for RMSEA calculation are proposed: one correcting degrees of freedom and one using Kullback-Leibler divergence to result in an RMSEA calculation which is relatively independent of missingness. Simulations are conducted in Java, with results indicating that the Kullback-Leibler divergence provides a better correction for RMSEA calculation. Next, I approach missingness in an applied manner with an existing large dataset examining ideology measures. The researchers assessed ideology using a planned missingness design, resulting in high proportions of missing data. Factor analysis was performed to gauge uniqueness of ideology measures.
Z
Empathy dataset
data.niaid.nih.gov
zenodo.org
Updated Dec 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mathematical Research Data Initiative (2024). Empathy dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7683906
Explore at:
Dataset updated
Dec 18, 2024
Dataset authored and provided by
Mathematical Research Data Initiative
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The database for this study (Briganti et al. 2018; the same for the Braun study analysis) was composed of 1973 French-speaking students in several universities or schools for higher education in the following fields: engineering (31%), medicine (18%), nursing school (16%), economic sciences (15%), physiotherapy, (4%), psychology (11%), law school (4%) and dietetics (1%). The subjects were 17 to 25 years old (M = 19.6 years, SD = 1.6 years), 57% were females and 43% were males. Even though the full dataset was composed of 1973 participants, only 1270 answered the full questionnaire: missing data are handled using pairwise complete observations in estimating a Gaussian Graphical Model, meaning that all available information from every subject are used.

The feature set is composed of 28 items meant to assess the four following components: fantasy, perspective taking, empathic concern and personal distress. In the questionnaire, the items are mixed; reversed items (items 3, 4, 7, 12, 13, 14, 15, 18, 19) are present. Items are scored from 0 to 4, where “0” means “Doesn’t describe me very well” and “4” means “Describes me very well”; reverse-scoring is calculated afterwards. The questionnaires were anonymized. The reanalysis of the database in this retrospective study was approved by the ethical committee of the Erasmus Hospital.

Size: A dataset of size 1973*28

Number of features: 28

Ground truth: No

Type of Graph: Mixed graph

The following gives the description of the variables:

Feature FeatureLabel Domain Item meaning from Davis 1980

001 1FS Green I daydream and fantasize, with some regularity, about things that might happen to me.

002 2EC Purple I often have tender, concerned feelings for people less fortunate than me.

003 3PT_R Yellow I sometimes find it difficult to see things from the “other guy’s” point of view.

004 4EC_R Purple Sometimes I don’t feel very sorry for other people when they are having problems.

005 5FS Green I really get involved with the feelings of the characters in a novel.

006 6PD Red In emergency situations, I feel apprehensive and ill-at-ease.

007 7FS_R Green I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it.(Reversed)

008 8PT Yellow I try to look at everybody’s side of a disagreement before I make a decision.

009 9EC Purple When I see someone being taken advantage of, I feel kind of protective towards them.

010 10PD Red I sometimes feel helpless when I am in the middle of a very emotional situation.

011 11PT Yellow sometimes try to understand my friends better by imagining how things look from their perspective

012 12FS_R Green Becoming extremely involved in a good book or movie is somewhat rare for me. (Reversed)

013 13PD_R Red When I see someone get hurt, I tend to remain calm. (Reversed)

014 14EC_R Purple Other people’s misfortunes do not usually disturb me a great deal. (Reversed)

015 15PT_R Yellow If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments. (Reversed)

016 16FS Green After seeing a play or movie, I have felt as though I were one of the characters.

017 17PD Red Being in a tense emotional situation scares me.

018 18EC_R Purple When I see someone being treated unfairly, I sometimes don’t feel very much pity for them. (Reversed)

019 19PD_R Red I am usually pretty effective in dealing with emergencies. (Reversed)

020 20FS Green I am often quite touched by things that I see happen.

021 21PT Yellow I believe that there are two sides to every question and try to look at them both.

022 22EC Purple I would describe myself as a pretty soft-hearted person.

023 23FS Green When I watch a good movie, I can very easily put myself in the place of a leading character.

024 24PD Red I tend to lose control during emergencies.

025 25PT Yellow When I’m upset at someone, I usually try to “put myself in his shoes” for a while.

026 26FS Green When I am reading an interesting story or novel, I imagine how I would feel if the events in the story were happening to me.

027 27PD Red When I see someone who badly needs help in an emergency, I go to pieces.

028 28PT Yellow Before criticizing somebody, I try to imagine how I would feel if I were in their place

More information about the dataset is contained in empathy_description.html file.
National Missing and Unidentified Persons System (NamUs)
catalog.data.gov
datasets.ai
Updated Mar 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office of Justice Programs (2025). National Missing and Unidentified Persons System (NamUs) [Dataset]. https://catalog.data.gov/dataset/national-missing-and-unidentified-persons-system-namus
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
Office of Justice Programshttps://ojp.gov/
Description
NamUs is the only national repository for missing, unidentified, and unclaimed persons cases. The program provides a singular resource hub for law enforcement, medical examiners, coroners, and investigating professionals. It is the only national database for missing, unidentified, and unclaimed persons that allows limited access to the public, empowering family members to take a more proactive role in the search for their missing loved ones.
n
Data from: Macaques preferentially attend to intermediately surprising...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated Apr 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6078/D15Q7Q
Dataset updated
Apr 26, 2022
Dataset provided by
Klaviyo
University of California, Berkeley
Yale University
University of Minnesota
Authors
Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning. Methods How the data were collected: In this project, we collected gaze data of 5 macaques when they watched sequential visual displays designed to elicit probabilistic expectations using the Eyelink Toolbox and were sampled at 1000 Hz by an infrared eye-monitoring camera system. Dataset:

"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

subj: subject_ID = {"B":104, "C":102,"H":101,"J":103,"K":203} trialtime: start time of current trial in second trial: current trial number (each trial featured one of 80 possible visual-event sequences)(in order) seq current: sequence number (one of 80 sequences) seq_item: current item number in a seq (in order) active_item: pop-up item (active box) pre_active: prior pop-up item (actve box) {-1: "the first active object in the sequence/ no active object before the currently active object in the sequence"} next_active: next pop-up item (active box) {-1: "the last active object in the sequence/ no active object after the currently active object in the sequence"} firstappear: {0: "not first", 1: "first appear in the seq"} looks_blank: csv: total amount of time look at blank space for current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"} looks_offscreen: csv: total amount of time look offscreen for current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"} time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"} looks target: csv: time spent to look at the target object (ms);csv_timestamp: look at the target or not at current timestamp (1 or 0) look1,2,3: time spent look at each object (ms) location 123X, 123Y: location of each box (location of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence) item123id: pop-up item ID (remained static throughout a sequence) event time: total time spent for the whole event (pop-up and go back) (ms) eyeposX,Y: eye position at current timestamp

"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0. already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and already_there is not NA. looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"} prob: the probability of the occurrence of object surprisal: unigram surprisal value bisurprisal: transitional surprisal value std_surprisal: standardized unigram surprisal value std_bisurprisal: standardized transitional surprisal value binned_surprisal_means: the means of unigram surprisal values binned to three groups of evenly spaced intervals according to surprisal values. binned_bisurprisal_means: the means of transitional surprisal values binned to three groups of evenly spaced intervals according to surprisal values.

"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences

Empty Values in Datasets:

There is no missing value in the original dataset "csv-combined.csv". Missing values (marked as NA in datasets) happen in columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in columns "prev_active" and "next_active" mean that the first or the last active object in the sequence/no active object before or after the currently active object in the sequence. When we analyzed the variable "already_there", we eliminated data that their "prev_active" variable is NA. NAs in column "already there" mean that the subject never looks at the target object in the current event. When we analyzed the variable "already there", we eliminated data that their "already_there" variable is NA. Missing values happen in columns "bisurprisal", "std_bisurprisal", "sq_std_bisurprisal" when it is the first event in the sequence and the transitional probability of the event cannot be computed because there's no event happening before in this sequence. When we fitted models for transitional statistics, we eliminated data that their "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NAs.

Codes:

In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd
o
Data from: Incomplete specimens in geometric morphometric analyses
explore.openaire.eu
search.dataone.org
+2more
Updated Jan 1, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jessica H. Arbour; Caleb M. Brown (2013). Data from: Incomplete specimens in geometric morphometric analyses [Dataset]. http://doi.org/10.5061/dryad.mp713
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.mp713
Dataset updated
Jan 1, 2013
Authors
Jessica H. Arbour; Caleb M. Brown
Description
1.The analysis of morphological diversity frequently relies on the use of multivariate methods for characterizing biological shape. However, many of these methods are intolerant of missing data, which can limit the use of rare taxa and hinder the study of broad patterns of ecological diversity and morphological evolution. This study applied a mutli-dataset approach to compare variation in missing data estimation and its effect on geometric morphometric analysis across taxonomically-variable groups, landmark position and sample sizes. 2.Missing morphometric landmark data was simulated from five real, complete datasets, including modern fish, primates and extinct theropod dinosaurs. Missing landmarks were then estimated using several standard approaches and a geometric-morphometric-specific method. The accuracy of missing data estimation was determined for each estimation method, landmark position, and morphological dataset. Procrustes superimposition was used to compare the eigenvectors and principal component scores of a geometric morphometric analysis of the original landmark data, to datasets with A) missing values estimated, or B) simulated incomplete specimens excluded, for varying levels of specimens incompleteness and sample sizes. 3.Standard estimation techniques were more reliable estimators and had lower impacts on morphometric analysis compared to a geometric-morphometric-specific estimator. For most datasets and estimation techniques, estimating missing data produced a better fit to the structure of the original data than exclusion of incomplete specimens, and this was maintained even at considerably reduced sample sizes. The impact of missing data on geometric morphometric analysis was disproportionately affected by the most fragmentary specimens. 4.Missing data estimation was influenced by variability of specific anatomical features, and may be improved by a better understanding of shape variation present in a dataset. Our results suggest that the inclusion of incomplete specimens through the use of effective missing data estimators better reflects the patterns of shape variation within a dataset than using only complete specimens, however the effectiveness of missing data estimation can be maximized by excluding only the most incomplete specimens. It is advised that missing data estimators be evaluated for each dataset and landmark independently, as the effectiveness of estimators can vary strongly and unpredictably between different taxa and structures.
Number of missing persons files in the U.S. 2022, by race
statista.com
Updated Jul 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Number of missing persons files in the U.S. 2022, by race [Dataset]. https://www.statista.com/statistics/240396/number-of-missing-persons-files-in-the-us-by-race/
Explore at:
Dataset updated
Jul 5, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2022
Area covered
United States
Description
In 2022, there were 313,017 cases filed by the NCIC where the race of the reported missing was White. In the same year, 18,928 people were missing whose race was unknown.

What is the NCIC?

The National Crime Information Center (NCIC) is a digital database that stores crime data for the United States, so criminal justice agencies can access it. As a part of the FBI, it helps criminal justice professionals find criminals, missing people, stolen property, and terrorists. The NCIC database is broken down into 21 files. Seven files belong to stolen property and items, and 14 belong to persons, including the National Sex Offender Register, Missing Person, and Identify Theft. It works alongside federal, tribal, state, and local agencies. The NCIC’s goal is to maintain a centralized information system between local branches and offices, so information is easily accessible nationwide.

Missing people in the United States

A person is considered missing when they have disappeared and their location is unknown. A person who is considered missing might have left voluntarily, but that is not always the case. The number of the NCIC unidentified person files in the United States has fluctuated since 1990, and in 2022, there were slightly more NCIC missing person files for males as compared to females. Fortunately, the number of NCIC missing person files has been mostly decreasing since 1998.
f
Dataset information.
plos.figshare.com
xls
Updated Nov 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Min-Wei Huang; Chih-Fong Tsai; Shu-Ching Tsui; Wei-Chao Lin (2023). Dataset information. [Dataset]. http://doi.org/10.1371/journal.pone.0295032.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0295032.t001
Dataset updated
Nov 30, 2023
Dataset provided by
PLOS ONE
Authors
Min-Wei Huang; Chih-Fong Tsai; Shu-Ching Tsui; Wei-Chao Lin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data discretization aims to transform a set of continuous features into discrete features, thus simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of the discretization process to improve knowledge discovery and data analysis on medical domain problem datasets containing continuous features. However, certain feature values were frequently missing. Many data-mining algorithms cannot handle incomplete datasets. In this study, we considered the use of both discretization and missing-value imputation to process incomplete medical datasets, examining how the order of discretization and missing-value imputation combined influenced performance. The experimental results were obtained using seven different medical domain problem datasets: two discretizers, including the minimum description length principle (MDLP) and ChiMerge; three imputation methods, including the mean/mode, classification and regression tree (CART), and k-nearest neighbor (KNN) methods; and two classifiers, including support vector machines (SVM) and the C4.5 decision tree. The results show that a better performance can be obtained by first performing discretization followed by imputation, rather than vice versa. Furthermore, the highest classification accuracy rate was achieved by combining ChiMerge and KNN with SVM.
t
PV Generation and Consumption Dataset of an Estonian Residential Dwelling
data.taltech.ee
Updated Mar 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sayeed Hasan; Sayeed Hasan; Andrei Blinov; Andrei Blinov; Andrii Chub; Andrii Chub; Dmitri Vinnikov; Dmitri Vinnikov (2025). PV Generation and Consumption Dataset of an Estonian Residential Dwelling [Dataset]. http://doi.org/10.48726/6hayh-x0h25
Explore at:
Unique identifier
https://doi.org/10.48726/6hayh-x0h25
Dataset updated
Mar 22, 2025
Dataset provided by
TalTech Data Repository
Authors
Sayeed Hasan; Sayeed Hasan; Andrei Blinov; Andrei Blinov; Andrii Chub; Andrii Chub; Dmitri Vinnikov; Dmitri Vinnikov
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Estonia
Description
This is a Residential PV generation and consumption data set from an Estonian house. At the time of submission, one year (2023) of data was available. The data was logged at a 10-second resolution. The untouched dataset can be found in the raw data folder, which is separated month-wise. A few missing points in the dataset were filled with a simple KNN algorithm. However, improved data imputation methods based on machine learning are also possible. To carry out the imputing, run the scripts in the script folder one by one in the numerical serial order (SC1..py, SC2..py, etc.).

Data Descriptor (Scientific Data): https://doi.org/10.1038/s41597-025-04747-w">https://doi.org/10.1038/s41597-025-04747-w

General Information:

Duration: January 2023 – December 2023

Resolution: 10 seconds

Dataset Type: Aggregated consumption and PV generation data

Logging Device: Camile Bauer PQ1000 (×2)

Load/Appliance Information:

5 kW Rooftop PV array connected to AC Bus via 4.2kW 3-ϕ Inverter

Air conditioner: 0.44 kW (Cooling), 0.62 kW (Heating)

Air to Water (ATW) Heat Pump: 2.5kW (Cooling), 2.6 kW (Heating)

ATW Cylinder unit: 0.21 kW (Controller), 9 kW (Booster Heater)

Microwave oven: 0.9 kW

Coffee Maker: 1 kW

Cooktop Hot Plate: 4.6 kW

TV: 0.103 kW

Vacuum Cleaner: 1.5 kW

Ventilation: 0.1 kW

Washing Machine: 2.2 kW

Electric Sauna: 10 kW

Lighting: 0.25 kW

EV charger: 2.4 kW 1-ϕ

Measurement Points:

PV converter-side current transformer, potential transformer (Measurement of PV generation).

Utility meter-side current transformer, potential transformer (Measurement of power exchange with the grid).

Measured Parameters:

Per-phase mean power recorded within the sampling period

Per-phase Minimum power recorded within the sampling period

Per-phase maximum power recorded within the sampling period

Quadrant-wise mean power recorded within the sampling period (1st + 3rd), (2nd + 4th)

Quadrant-wise minimum power recorded within the sampling period (1st + 3rd), (2nd + 4th)

Quadrant-wise maximum power recorded within the sampling period (1st + 3rd), (2nd + 4th)

mean power Factor recorded within the sampling period

Minimum power Factor recorded within the sampling period

Maximum power Factor recorded within the sampling period

System Voltage

Minimum system Voltage

Maximum system Voltage

Mean Voltage between phase and neutral

Minimum voltage between phase and neutral

Maximum voltage between phase and neutral

Zero displacement voltage 4-wire systems (mean, min, max)

Script Description:

SC1_PV_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for PV generation data.

SC2_L2_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for meter-side measurement data.

SC3_PV_KNN_impute.py : Filling missing data points by simple KNN for PV generation data.

SC4_L2_KNN_impute.py : Filling missing data points by simple KNN for meter-side measurement data.

SC5_Final_data_gen.py : Merge PV and meter-side measurement data, and calculate load consumption.

The dataset provides all the outcomes (CSV files) from the scripts. All processed variables (PV generation, load, power import, and export) are expressed in kW units.

Update: 'SC1_PV_auto_sort.py' & 'SC2_L2_auto_sort.py' are adequate for cleaning up data and making the missing point visible. 'SC3_PV_KNN_impute.py' & 'SC4_L2_KNN_impute.py' work fine for short-range missing data points; however, these two scripts won't help much for missing data points for a longer period. They are provided as examples of one method of processing data. Future updates will include proper ML-based forecasting to predict missing data points.

Funding Agency and Grant Number:

European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 955614.

Estonian Research Council under Grant PRG1086.

Estonian Centre of Excellence in Energy Efficiency, ENER, funded by the Estonian Ministry of Education and Research under Grant TK230.
A
‘Missing Migrants Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Apr 23, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2019). ‘Missing Migrants Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-missing-migrants-dataset-c736/2e62d69f/?v=grid
Explore at:
Dataset updated
Apr 23, 2019
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Missing Migrants Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jmataya/missingmigrants on 14 February 2022.

--- Dataset description provided by original source is as follows ---

About the Missing Migrants Data

This data is sourced from the International Organization for Migration. The data is part of a specific project called the Missing Migrants Project which tracks deaths of migrants, including refugees , who have gone missing along mixed migration routes worldwide. The research behind this project began with the October 2013 tragedies, when at least 368 individuals died in two shipwrecks near the Italian island of Lampedusa. Since then, Missing Migrants Project has developed into an important hub and advocacy source of information that media, researchers, and the general public access for the latest information.

Where is the data from?

Missing Migrants Project data are compiled from a variety of sources. Sources vary depending on the region and broadly include data from national authorities, such as Coast Guards and Medical Examiners; media reports; NGOs; and interviews with survivors of shipwrecks. In the Mediterranean region, data are relayed from relevant national authorities to IOM field missions, who then share it with the Missing Migrants Project team. Data are also obtained by IOM and other organizations that receive survivors at landing points in Italy and Greece. In other cases, media reports are used. IOM and UNHCR also regularly coordinate on such data to ensure consistency. Data on the U.S./Mexico border are compiled based on data from U.S. county medical examiners and sheriff’s offices, as well as media reports for deaths occurring on the Mexico side of the border. Estimates within Mexico and Central America are based primarily on media and year-end government reports. Data on the Bay of Bengal are drawn from reports by UNHCR and NGOs. In the Horn of Africa, data are obtained from media and NGOs. Data for other regions is drawn from a combination of sources, including media and grassroots organizations. In all regions, Missing Migrants Projectdata represents minimum estimates and are potentially lower than in actuality.

Updated data and visuals can be found here: https://missingmigrants.iom.int/

Who is included in Missing Migrants Project data?

IOM defines a migrant as any person who is moving or has moved across an international border or within a State away from his/her habitual place of residence, regardless of

(1) the person’s legal status; (2) whether the movement is voluntary or involuntary; (3) what the causes for the movement are; or (4) what the length of the stay is.[1]

Missing Migrants Project counts migrants who have died or gone missing at the external borders of states, or in the process of migration towards an international destination. The count excludes deaths that occur in immigration detention facilities, during deportation, or after forced return to a migrant’s homeland, as well as deaths more loosely connected with migrants’ irregular status, such as those resulting from labour exploitation. Migrants who die or go missing after they are established in a new home are also not included in the data, so deaths in refugee camps or housing are excluded. This approach is chosen because deaths that occur at physical borders and while en route represent a more clearly definable category, and inform what migration routes are most dangerous. Data and knowledge of the risks and vulnerabilities faced by migrants in destination countries, including death, should not be neglected, rather tracked as a distinct category.

How complete is the data on dead and missing migrants?

Data on fatalities during the migration process are challenging to collect for a number of reasons, most stemming from the irregular nature of migratory journeys on which deaths tend to occur. For one, deaths often occur in remote areas on routes chosen with the explicit aim of evading detection. Countless bodies are never found, and rarely do these deaths come to the attention of authorities or the media. Furthermore, when deaths occur at sea, frequently not all bodies are recovered - sometimes with hundreds missing from one shipwreck - and the precise number of missing is often unknown. In 2015, over 50 per cent of deaths recorded by the Missing Migrants Project refer to migrants who are presumed dead and whose bodies have not been found, mainly at sea.

Data are also challenging to collect as reporting on deaths is poor, and the data that does exist are highly scattered. Few official sources are collecting data systematically. Many counts of death rely on media as a source. Coverage can be spotty and incomplete. In addition, the involvement of criminal actors in incidents means there may be fear among survivors to report deaths and some deaths may be actively covered-up. The irregular immigration status of many migrants, and at times their families as well, also impedes reporting of missing persons or deaths.

The varying quality and comprehensiveness of data by region in attempting to estimate deaths globally may exaggerate the share of deaths that occur in some regions, while under-representing the share occurring in others.

What can be understood through this data?

The available data can give an indication of changing conditions and trends related to migration routes and the people travelling on them, which can be relevant for policy making and protection plans. Data can be useful to determine the relative risks of irregular migration routes. For example, Missing Migrants Project data show that despite the increase in migrant flows through the eastern Mediterranean in 2015, the central Mediterranean remained the more deadly route. In 2015, nearly two people died out of every 100 travellers (1.85%) crossing the Central route, as opposed to one out of every 1,000 that crossed from Turkey to Greece (0.095%). From the data, we can also get a sense of whether groups like women and children face additional vulnerabilities on migration routes.

However, it is important to note that because of the challenges in data collection for the missing and dead, basic demographic information on the deceased is rarely known. Often migrants in mixed migration flows do not carry appropriate identification. When bodies are found it may not be possible to identify them or to determine basic demographic information. In the data compiled by Missing Migrants Project, sex of the deceased is unknown in over 80% of cases. Region of origin has been determined for the majority of the deceased. Even this information is at times extrapolated based on available information – for instance if all survivors of a shipwreck are of one origin it was assumed those missing also came from the same region.

The Missing Migrants Project dataset includes coordinates for where incidents of death took place, which indicates where the risks to migrants may be highest. However, it should be noted that all coordinates are estimates.

Why collect data on missing and dead migrants?

By counting lives lost during migration, even if the result is only an informed estimate, we at least acknowledge the fact of these deaths. What before was vague and ill-defined is now a quantified tragedy that must be addressed. Politically, the availability of official data is important. The lack of political commitment at national and international levels to record and account for migrant deaths reflects and contributes to a lack of concern more broadly for the safety and well-being of migrants, including asylum-seekers. Further, it drives public apathy, ignorance, and the dehumanization of these groups.

Data are crucial to better understand the profiles of those who are most at risk and to tailor policies to better assist migrants and prevent loss of life. Ultimately, improved data should contribute to efforts to better understand the causes, both direct and indirect, of fatalities and their potential links to broader migration control policies and practices.

Counting and recording the dead can also be an initial step to encourage improved systems of identification of those who die. Identifying the dead is a moral imperative that respects and acknowledges those who have died. This process can also provide a some sense of closure for families who may otherwise be left without ever knowing the fate of missing loved ones.

Identification and tracing of the dead and missing

As mentioned above, the challenge remains to count the numbers of dead and also identify those counted. Globally, the majority of those who die during migration remain unidentified. Even in cases in which a body is found identification rates are low. Families may search for years or a lifetime to find conclusive news of their loved one. In the meantime, they may face psychological, practical, financial, and legal problems.

Ultimately Missing Migrants Project would like to see that every unidentified body, for which it is possible to recover, is adequately “managed”, analysed and tracked to ensure proper documentation, traceability and dignity. Common forensic protocols and standards should be agreed upon, and used within and between States. Furthermore, data relating to the dead and missing should be held in searchable and open databases at local, national and international levels to facilitate identification.

For more in-depth analysis and discussion of the numbers of missing and dead migrants around the world, and the challenges involved in identification and tracing, read our two reports on the issue, Fatal Journeys: Tracking Lives Lost during Migration (2014) and Fatal Journeys Volume 2, Identification and Tracing of Dead and Missing Migrants

Content

The data set records
General Social Survey 2014 Cross-Section and Panel Combined - Instructional...
thearda.com
Updated 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tom W. Smith (2014). General Social Survey 2014 Cross-Section and Panel Combined - Instructional Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/ZFRD2
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/ZFRD2
Dataset updated
2014
Dataset provided by
Association of Religion Data Archives
Authors
Tom W. Smith
Dataset funded by
National Science Foundation
Description
This file contains all of the cases and variables that are in the original 2014 General Social Survey, but is prepared for easier use in the classroom. Changes have been made in two areas. First, to avoid confusion when constructing tables or interpreting basic analysis, all missing data codes have been set to system missing. Second, many of the continuous variables have been categorized into fewer categories, and added as additional variables to the file.

The General Social Surveys (GSS) have been conducted by the National Opinion Research Center (NORC) annually since 1972, except for the years 1979, 1981, and 1992 (a supplement was added in 1992), and biennially beginning in 1994. The GSS are designed to be part of a program of social indicator research, replicating questionnaire items and wording in order to facilitate time-trend studies. This data file has all cases and variables asked on the 2014 GSS. There are a total of 3,842 cases in the data set but their initial sampling years vary because the GSS now contains panel cases. Sampling years can be identified with the variable SAMPTYPE.

To download syntax files for the GSS that reproduce well-known religious group recodes, including RELTRAD, please visit the "/research/syntax-repository-list" Target="_blank">ARDA's Syntax Repository.
Cybersecurity Incidents Dataset
kaggle.com
Updated Jan 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
huzpsb (2025). Cybersecurity Incidents Dataset [Dataset]. http://doi.org/10.34740/kaggle/ds/6538710
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/ds/6538710
Dataset updated
Jan 24, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
huzpsb
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The number of cybersecurity incident reports filed with local authorities and the estimated loss. Loss is calculated in USD. Some missing fields are imputed. Overall, it's consistent with the IC3 report.
A
‘Young People Survey’ analyzed by Analyst-2
analyst-2.ai
Updated Aug 27, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘Young People Survey’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-young-people-survey-40db/latest
Explore at:
Dataset updated
Aug 27, 2016
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Young People Survey’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/miroslavsabo/young-people-survey on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Introduction

In 2013, students of the Statistics class at "https://fses.uniba.sk/en/">FSEV UK were asked to invite their friends to participate in this survey.

The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).

For convenience, the original variable names were shortened in the data file. See the columns.csv file if you want to match the data with the original names.

The data contain missing values.

The survey was presented to participants in both electronic and written form.

The original questionnaire was in Slovak language and was later translated into English.

All participants were of Slovakian nationality, aged between 15-30.

The variables can be split into the following groups:

Music preferences (19 items)

Movie preferences (12 items)

Hobbies & interests (32 items)

Phobias (10 items)

Health habits (3 items)

Personality traits, views on life, & opinions (57 items)

Spending habits (7 items)

Demographics (10 items)

Research questions

Many different techniques can be used to answer many questions, e.g.

Clustering: Given the music preferences, do people make up any clusters of similar behavior?

Hypothesis testing: Do women fear certain phenomena significantly more than men? Do the left handed people have different interests than right handed?

Predictive modeling: Can we predict spending habits of a person from his/her interests and movie or music preferences?

Dimension reduction: Can we describe a large number of human interests by a smaller number of latent concepts?

Correlation analysis: Are there any connections between music and movie preferences?

Visualization: How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?

(Multivariate) Outlier detection: Small number of participants often cheats and randomly answers the questions. Can you identify them? Hint: [Local outlier factor][1] may help.

Missing values analysis: Are there any patterns in missing responses? What is the optimal way of imputing the values in surveys?

Recommendations: If some of user's interests are known, can we predict the other? Or, if we know what a person listen, can we predict which kind of movies he/she might like?

Past research

(in slovak) Sleziak, P. - Sabo, M.: Gender differences in the prevalence of specific phobias. Forum Statisticum Slovacum. 2014, Vol. 10, No. 6. [Differences (gender + whether people lived in village/town) in the prevalence of phobias.]

Sabo, Miroslav. Multivariate Statistical Methods with Applications. Diss. Slovak University of Technology in Bratislava, 2014. [Clustering of variables (music preferences, movie preferences, phobias) + Clustering of people w.r.t. their interests.]

Questionnaire

MUSIC PREFERENCES

I enjoy listening to music.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I prefer.: Slow paced music 1-2-3-4-5 Fast paced music (integer)

Dance, Disco, Funk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Folk music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Country: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Classical: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Musicals: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Pop: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Metal, Hard rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Punk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Hip hop, Rap: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Reggae, Ska: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Swing, Jazz: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Rock n Roll: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Alternative music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Latin: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Techno, Trance: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Opera: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

MOVIE PREFERENCES

I really enjoy watching movies.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

Horror movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Thriller movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Comedies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Romantic movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Sci-fi movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

War movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Tales: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Cartoons: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Documentaries: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Western movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Action movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

HOBBIES & INTERESTS

History: Not interested 1-2-3-4-5 Very interested (integer)

Psychology: Not interested 1-2-3-4-5 Very interested (integer)

Politics: Not interested 1-2-3-4-5 Very interested (integer)

Mathematics: Not interested 1-2-3-4-5 Very interested (integer)

Physics: Not interested 1-2-3-4-5 Very interested (integer)

Internet: Not interested 1-2-3-4-5 Very interested (integer)

PC Software, Hardware: Not interested 1-2-3-4-5 Very interested (integer)

Economy, Management: Not interested 1-2-3-4-5 Very interested (integer)

Biology: Not interested 1-2-3-4-5 Very interested (integer)

Chemistry: Not interested 1-2-3-4-5 Very interested (integer)

Poetry reading: Not interested 1-2-3-4-5 Very interested (integer)

Geography: Not interested 1-2-3-4-5 Very interested (integer)

Foreign languages: Not interested 1-2-3-4-5 Very interested (integer)

Medicine: Not interested 1-2-3-4-5 Very interested (integer)

Law: Not interested 1-2-3-4-5 Very interested (integer)

Cars: Not interested 1-2-3-4-5 Very interested (integer)

Art: Not interested 1-2-3-4-5 Very interested (integer)

Religion: Not interested 1-2-3-4-5 Very interested (integer)

Outdoor activities: Not interested 1-2-3-4-5 Very interested (integer)

Dancing: Not interested 1-2-3-4-5 Very interested (integer)

Playing musical instruments: Not interested 1-2-3-4-5 Very interested (integer)

Poetry writing: Not interested 1-2-3-4-5 Very interested (integer)

Sport and leisure activities: Not interested 1-2-3-4-5 Very interested (integer)

Sport at competitive level: Not interested 1-2-3-4-5 Very interested (integer)

Gardening: Not interested 1-2-3-4-5 Very interested (integer)

Celebrity lifestyle: Not interested 1-2-3-4-5 Very interested (integer)

Shopping: Not interested 1-2-3-4-5 Very interested (integer)

Science and technology: Not interested 1-2-3-4-5 Very interested (integer)

Theatre: Not interested 1-2-3-4-5 Very interested (integer)

Socializing: Not interested 1-2-3-4-5 Very interested (integer)

Adrenaline sports: Not interested 1-2-3-4-5 Very interested (integer)

Pets: Not interested 1-2-3-4-5 Very interested (integer)

PHOBIAS

Flying: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Thunder, lightning: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Darkness: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Heights: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Spiders: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Snakes: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Rats, mice: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Ageing: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Dangerous dogs: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

Public speaking: Not afraid at all 1-2-3-4-5 Very afraid of (integer)

HEALTH HABITS

Smoking habits: Never smoked - Tried smoking - Former smoker - Current smoker (categorical)

Drinking: Never - Social drinker - Drink a lot (categorical)

I live a very healthy lifestyle.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

PERSONALITY TRAITS, VIEWS ON LIFE & OPINIONS

I take notice of what goes on around me.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I try to do tasks as soon as possible and not leave them until last minute.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I always make a list so I don't forget anything.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I often study or work even in my spare time.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I look at things from all different angles before I go ahead.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I believe that bad people will suffer one day and good people will be rewarded.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I am reliable at work and always complete all tasks given to me.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

I always keep my promises.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)

**I can fall for someone very quickly and then
Dataset for: Latent trait shared parameter mixed-models for missing...
wiley.figshare.com
bin
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John Cursio; Robin J. Mermelstein; Donald Hedeker (2023). Dataset for: Latent trait shared parameter mixed-models for missing ecological momentary assessment data [Dataset]. http://doi.org/10.6084/m9.figshare.7087064.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7087064.v1
Dataset updated
Jun 3, 2023
Dataset provided by
Wileyhttps://www.wiley.com/
Authors
John Cursio; Robin J. Mermelstein; Donald Hedeker
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Latent trait shared-parameter mixed-models (LTSPMM) for ecological momentary assessment (EMA) data containing missing values are developed in which data are collected in an intermittent manner. In such studies, data are often missing due to unanswered prompts. Using item response theory (IRT) models, a latent trait is used to represent the missing prompts and modeled jointly with a mixed-model for bivariate longitudinal outcomes. Both one- and two-parameter LTSPMMs are presented. These new models offer a unique way to analyze missing EMA data with many response patterns. Here, the proposed models represent missingness via a latent trait that corresponds to the students' "ability" to respond to the prompting device. Data containing more than 10,300 observations from an EMA study involving high-school students' positive and negative affect are presented. The latent trait representing missingness was a significant predictor of both positive affect and negative affect outcomes. The models are compared to a missing at random (MAR) mixed-model. A simulation study indicates that the proposed models can provide lower bias and increased efficiency compared to the standard MAR approach commonly used with intermittently missing longitudinal data.

Facebook

Twitter

Click to copy link

Link copied

Cite

Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

pdfAvailable download formats

Unique identifier

https://doi.org/10.1371/journal.pone.0164464

Dataset updated

Jun 1, 2023

Dataset provided by

PLOS ONE

Authors

Nikolaj Bak; Lars K. Hansen

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Missing data is a common problem in many research fields and is a challenge that always needs careful considerations. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought to be a practical approach to help users using imputation after the informed choice to impute the missing data has been made. To do this all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighing the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method. The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding to conventions that might not be warranted in the specific dataset.

Clear search

Close search

Google apps

Main menu

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a...

Data from: Water-quality data imputation with a high percentage of missing...

Data from: Variable Selection with Multiply-Imputed Datasets: Choosing...

Data from: Using multiple imputation to estimate missing data in...

Data from: Missing data estimation in morphometrics: how much is too much?

MAXIMUM LIKELIHOOD ESTIMATION OF FACTOR MODELS ON DATASETS WITH ARBITRARY...

MacroPCA: An All-in-One PCA Method Allowing for Missing Values as Well as...

A Correction for Structural Equation Modeling Fit Indices Under Missingness:...

Empathy dataset

National Missing and Unidentified Persons System (NamUs)

Data from: Macaques preferentially attend to intermediately surprising...

Data from: Incomplete specimens in geometric morphometric analyses

Number of missing persons files in the U.S. 2022, by race

Dataset information.

PV Generation and Consumption Dataset of an Estonian Residential Dwelling

‘Missing Migrants Dataset’ analyzed by Analyst-2

About the Missing Migrants Data

Where is the data from?

Who is included in Missing Migrants Project data?

How complete is the data on dead and missing migrants?

What can be understood through this data?

Why collect data on missing and dead migrants?

Identification and tracing of the dead and missing

Content

General Social Survey 2014 Cross-Section and Panel Combined - Instructional...

Cybersecurity Incidents Dataset

‘Young People Survey’ analyzed by Analyst-2

Introduction

Research questions

Past research

Questionnaire

MUSIC PREFERENCES

MOVIE PREFERENCES

HOBBIES & INTERESTS

PHOBIAS

HEALTH HABITS

PERSONALITY TRAITS, VIEWS ON LIFE & OPINIONS

Dataset for: Latent trait shared parameter mixed-models for missing...

Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option