License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables, together with the high rate of missing values (between 50% and 70%), raises significant challenges.
To deal with the missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms covered both univariate and multivariate imputation: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-Nearest Neighbors Regressor (KNNR).
IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
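The following R sketch illustrates the general idea of IDW imputation in this setting: a missing value at one station is estimated from the values observed at the other stations on the same date, weighted by inverse distance. It is a minimal illustration, not the code used in the paper; the observations, distances, and power parameter p are assumptions made for the example.

# Minimal IDW imputation sketch (illustrative only, not the authors' code).
# 'obs' holds the values measured at the other stations on the same date;
# 'dist' holds the distances (km) from those stations to the target station.
idw_impute <- function(obs, dist, p = 2) {
  ok <- !is.na(obs)                  # use only stations with an observation
  if (!any(ok)) return(NA_real_)     # nothing to interpolate from
  w <- 1 / dist[ok]^p                # inverse-distance weights
  sum(w * obs[ok]) / sum(w)          # weighted average
}

# Example: impute dissolved oxygen (DO) at a station with a missing value
idw_impute(obs = c(8.1, NA, 7.4, 8.9), dist = c(12.3, 5.0, 20.1, 7.8))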
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Example data sets for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the data sets used in both example analyses (Examples 1 and 2) in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
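A minimal R sketch of loading the plain-text version of one of the example data sets; the file name "example1.dat" and the presence of a header row are assumptions made for illustration, and missing values are read as NA as stated above.

# Read the plain-text version of an example data set (file name assumed);
# missing values are coded as "NA" in the files.
dat <- read.table("example1.dat", header = TRUE, na.strings = "NA")
str(dat)              # ID, x, y, w
colSums(is.na(dat))   # count missing values per variable

# Alternatively, load the binary R version:
# load("example1.rda")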
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
We share a complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth product (MAIAC AOD) (https://lpdaac.usgs.gov/products/mcd19a2v006/), which has a similar spatiotemporal resolution and uses the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a single large AOD image covering the entire area of mainland China. Because of clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage per image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product.
We used the method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity, and wind speed), and a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of the interpolation.
Overall, our daily imputation models achieved an average training R^2 of 0.90 (range 0.75-0.97; average RMSE 0.075, range 0.026-0.32) and an average test R^2 of 0.90 (range 0.75-0.97; average RMSE 0.075, range 0.026-0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE indicate the reliability of the AOD imputation. In an evaluation against ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, further illustrating the reliability of the method.
This database contains four datasets:
- Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
- The table "CHN_AOD_INFO.xlsx" describes the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median, and maximum predicted AOD.
- The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx" describes the statistics of the performance metrics for the interpolation of the high-resolution meteorological dataset.
- The table "Evaluation_Using_AERONET_AOD.xlsx" shows the evaluation results against AERONET, including R^2, RMSE, and the monitoring information used in this study.
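As a minimal illustration of the R^2 and RMSE evaluation reported above (not the authors' evaluation script), the R sketch below computes both metrics between predicted AOD and ground-based AERONET AOD; the example vectors are assumptions.

# Illustrative computation of R^2 and RMSE between predicted and ground AOD
# (assumed example vectors; not the authors' evaluation code).
aod_pred   <- c(0.31, 0.45, 0.22, 0.58, 0.40)
aod_ground <- c(0.29, 0.50, 0.20, 0.61, 0.37)

r2   <- cor(aod_pred, aod_ground)^2               # coefficient of determination
rmse <- sqrt(mean((aod_pred - aod_ground)^2))     # root mean squared error
c(R2 = r2, RMSE = rmse)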
The original dataset shared on GitHub can be found here. These are hands-on practice datasets that were linked through the Coursera Guided Project Certificate Course for Handling Missing Values in R, part of the Coursera Project Network. The dataset links were shared by the original author and instructor of the course, Arimoro Olayinka Imisioluwa.
Things you could do with this dataset: As a beginner in R, these datasets helped me get the hang of making data clean and tidy and of handling missing values (numeric only) using R. They are good for anyone looking to build a beginner-to-intermediate understanding of these topics.
Here are my notebooks as kernels using these datasets, plus a few more preloaded datasets in R, as suggested by the instructor: TidY DatA Practice; MissinG DatA HandlinG - NumeriC.
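A minimal R sketch of the kind of numeric missing-value handling practiced with these datasets (my own illustration, not the course material); the example data frame and column names are assumptions.

# Count missing values per column, then mean-impute the numeric columns
# (assumed example data; illustration only).
df <- data.frame(age = c(23, NA, 31, 45), score = c(88, 92, NA, 79))
colSums(is.na(df))                           # how many NAs per column
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)       # replace NAs with the column mean
  x
})
df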
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDFs is widespread in statistics, and the applications are numerous, from abundance data in ecology to recommender systems. In many cases, an MDF simultaneously exhibits main effects, such as row, column, or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions existing methods do not allow one to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method that recovers the main effects and the interactions simultaneously. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly larger than the interactions and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.
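A plausible way to write the estimator described in this abstract, assuming that sparsity of the main effects is encouraged with an l1 penalty and low rank of the interactions with a nuclear-norm penalty (these penalty choices are an inference from the abstract, not a statement of the package's exact objective):

\hat{\alpha},\, \hat{\Theta} \;=\; \arg\min_{\alpha,\,\Theta} \; \mathcal{L}_{\mathrm{obs}}\!\left(Y;\; X\alpha + \Theta\right) \;+\; \lambda_1 \lVert \alpha \rVert_1 \;+\; \lambda_2 \lVert \Theta \rVert_*,

where \mathcal{L}_{\mathrm{obs}} is a (quasi-)log-likelihood loss evaluated over the observed entries only, with the loss for each column chosen according to its type (e.g., Gaussian for numerical, Bernoulli for categorical indicators, Poisson for counts); X\alpha gathers the main (row, column, or group) effects, \Theta is the interaction matrix, the l1 penalty promotes sparse main effects, and the nuclear norm promotes a low-rank interaction structure.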
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
R code to impute a continuous outcome. (R, 1 kb)
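The supplementary script itself is not reproduced here. As a hedged illustration of how a continuous outcome can be imputed in R, the sketch below uses the mice package with predictive mean matching on its bundled nhanes example data; this is an assumption about the general approach, not the authors' actual code.

# Illustrative multiple imputation of continuous outcomes with mice
# (not the authors' script; uses the nhanes example data shipped with mice).
library(mice)
data(nhanes)                                          # small data set with missing values
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123)  # 5 imputed data sets
complete(imp, 1)                                      # first completed data set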
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description:
Welcome to the Zenodo repository for the publication "Benchmarking imputation methods for categorical biological data", a comprehensive collection of the datasets and scripts used in our research. This repository serves as a resource for researchers interested in exploring the empirical and simulated analyses conducted in our study.
Contents:
empirical_analysis:
simulation_analysis:
TDIP_package:
Purpose:
This repository aims to provide transparency and reproducibility to our research findings by making the datasets and scripts publicly accessible. Researchers interested in understanding our methodologies, replicating our analyses, or building upon our work can utilize this repository as a valuable reference.
Citation:
When using the datasets or scripts from this repository, we kindly request citing the publication "Benchmarking imputation methods for categorical biological data" and acknowledging the use of this Zenodo repository.
Thank you for your interest in our research, and we hope this repository serves as a valuable resource in your scholarly pursuits.
Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions, such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
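As a rough illustration of the final step described above (not the authors' published function), the R sketch below superimposes the PCA ordination of an imputed data set onto the ordination of a reference data set using a Procrustes rotation from the vegan package; the example matrices are assumptions.

# Illustrative Procrustean superimposition of two PCA ordinations
# (assumed example data; not the authors' R function).
library(vegan)
set.seed(1)
ref     <- matrix(rnorm(100), ncol = 4)                   # reference morphometric data
imputed <- ref + matrix(rnorm(100, sd = 0.1), ncol = 4)   # e.g., data after imputation
pca_ref <- prcomp(ref)$x[, 1:2]                           # reference ordination
pca_imp <- prcomp(imputed)$x[, 1:2]                       # ordination after imputation
fit <- procrustes(pca_ref, pca_imp, symmetric = FALSE)
plot(fit)        # visualize how imputation shifts specimens in the ordinated space
residuals(fit)   # per-specimen displacement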
This dataset was collected from K-12 teachers via online surveys (Qualtrics). The statistical analyses were conducted in R.
In the present research, we tested whether a combination of getting perspective and exposure to relevant incremental theories can mitigate the consequences of bias on discipline decisions. We call this combination of approaches a “Bias-Consequence Alleviation” (BCA) intervention. The present research sought to determine how the following components can be integrated to reduce the process by which bias contributes to racial inequality in discipline decisions: (1) getting a misbehaving student’s perspective, “student-perspective”; (2) belief that others’ personalities can change, “student-growth”; and (3) belief that one’s own ability to sustain positive relationships can change, “relationship-growth.” Can a combination of these three components curb troublemaker-labeling and pattern-prediction responses to a Black student’s misbehavior (Exp...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product was available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments in which satellite-like gaps were introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004); they account for gap size and local vegetation conditions as factors that affect the gap-filling performance.
You can use command-line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), with each subdirectory containing one netCDF image file per day (DD) and month (MM) on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). File names follow the convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
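A minimal R sketch for reading one daily file with the ncdf4 package. The file name follows the convention above for an assumed date, and the coordinate and soil moisture variable names used here ("lon", "lat", "sm") are assumptions; check names(nc$var) and the netCDF attributes of the actual files.

# Read one daily gap-filled file with the ncdf4 package (variable names assumed).
library(ncdf4)
fname <- "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
nc  <- nc_open(fname)
print(names(nc$var))            # list the data variables contained in the file
lon <- ncvar_get(nc, "lon")     # assumed coordinate variable names
lat <- ncvar_get(nc, "lat")
sm  <- ncvar_get(nc, "sm")      # assumed surface soil moisture variable name
ncatt_get(nc, "sm", "units")    # read a variable attribute
nc_close(nc)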
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, such as:
The following records are all part of the ESA CCI Soil Moisture science data records community:
| 1 | ESA CCI SM MODELFREE Surface Soil Moisture Record | https://doi.org/10.48436/svr1r-27j77 |
Estimates of captives carried in the Atlantic slave trade by decade, 1650s to 1860s. Data: routes of voyages and recorded numbers of captives (10 variables and 33,345 cases of slave voyages). Data are organized into 40 routes linking African regions to overseas regions. Purpose: estimation of missing data and totals of captive flows. Method: techniques of Bayesian statistics to estimate missing data on routes and flows of captives. Also included is R-language code for simulating routes and populations.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
R scripts used for Monte Carlo simulations and data analyses.
STAD-R is a set of R programs that performs descriptive statistics and produces boxplots and histograms. STAD-R was designed because, before any other analysis, it is necessary to check whether the dataset has the expected number of repetitions, blocks, genotypes, and environments; whether there are missing values, and if so, where and how many; and to review the distributions and outliers. It is important to be sure that the dataset is complete and has the correct structure before carrying out other kinds of analysis.
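A generic R sketch of the kind of pre-analysis checks described above (illustration only, not STAD-R itself); the example data frame and column names are assumptions.

# Generic pre-analysis checks (illustration only; not STAD-R code).
# 'dat' is an assumed data frame with genotype, environment, block and a trait 'yield'.
dat <- data.frame(
  genotype    = rep(c("G1", "G2", "G3"), each = 4),
  environment = rep(c("E1", "E2"), times = 6),
  block       = rep(1:2, times = 6),
  yield       = c(5.1, 4.8, NA, 5.6, 6.0, NA, 5.4, 5.9, 4.7, 5.2, 5.0, NA)
)
table(dat$genotype, dat$environment)        # repetitions per genotype x environment
colSums(is.na(dat))                         # how many missing values, in which columns
which(is.na(dat$yield))                     # where the missing trait values are
boxplot(yield ~ environment, data = dat)    # distributions and outliers by environment
hist(dat$yield)                             # overall distribution of the trait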
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The objective behind this dataset was to understand the predictors that contribute to life expectancy around the world. I used Linear Regression, Decision Tree, and Random Forest for this purpose.
Steps involved:
1) Read the csv file.
2) Data cleaning:
- The variables Country and Status had character data types and had to be converted to factors.
- 2563 missing values were encountered, with the Population variable having the most missing values (652).
- Rows with missing values were dropped before running the analysis.
3) Run linear regression:
- Before running the linear regression, 3 variables were dropped because they did not have much of an effect on the dependent variable (Life Expectancy): Country, Year, and Status. This left 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION. We check for outliers using the IQR and find 54 outliers. These outliers are removed before running the regression once again; multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY. We check for multicollinearity using the VIF (Variance Inflation Factor). This is done because two or more independent variables may show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 variables with a VIF higher than 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths have strong collinearity, so we drop infant deaths (which has the higher VIF).
- When we run the linear regression model again, the VIF of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. The variable thinness1.19 is then dropped and the regression is run once more.
- The variable thinness5.9, whose absolute VIF value was 7.61, drops to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them because I consider them important independent variables.
4) Set the seed and split the data into train and test sets. We fit the model on the training data, obtaining a multiple R-squared of 86% and a p-value smaller than alpha, i.e., statistically significant. We use the trained model to predict on the test data and compute RMSE and MAPE with library(Metrics).
- In linear regression, RMSE (Root Mean Squared Error) is 3.2, meaning that on average the predicted values are off by 3.2 years compared with the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037, which corresponds to a prediction accuracy of about 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55, meaning that on average the predicted values deviate by approximately 2.55 years from the actual values.
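A condensed R sketch of the linear-regression part of this workflow (a sketch under assumptions about the file and column names, not the original notebook): drop missing rows, fit the model, check VIF with the car package, split the data, and score the predictions with the Metrics package.

# Condensed sketch of the workflow described above (not the original code).
# Assumptions: a CSV file "life_expectancy.csv" with a column "Life.expectancy".
library(car)       # vif()
library(Metrics)   # rmse(), mape(), mae()

life <- read.csv("life_expectancy.csv")
life$Country <- NULL; life$Year <- NULL; life$Status <- NULL   # drop variables as described
life <- na.omit(life)                                          # drop rows with missing values

set.seed(123)
idx   <- sample(nrow(life), 0.8 * nrow(life))                  # train/test split
train <- life[idx, ]
test  <- life[-idx, ]

fit <- lm(Life.expectancy ~ ., data = train)                   # linear regression
summary(fit)$r.squared                                         # multiple R-squared
vif(fit)                                                       # drop predictors with VIF > 5

pred <- predict(fit, newdata = test)
rmse(test$Life.expectancy, pred)                               # root mean squared error
mape(test$Life.expectancy, pred)                               # mean absolute percentage error
mae(test$Life.expectancy, pred)                                # mean absolute error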
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description of scenario analyses to test assumptions related to missing data and loss-to-follow-up.
A MOCK dataset used to show how to import Qualtrics metadata into the codebook R package.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
| name | label | n_missing |
|---|---|---|
| ResponseSet | NA | 0 |
| Q7 | NA | 0 |
| Q10 | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
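A rough sketch of how such a mock dataset might be summarised with the codebook package; the file name and the choice of codebook_table() as the helper are assumptions, so consult the codebook package documentation for the actual Qualtrics import workflow.

# Rough sketch (assumptions: the file name and that codebook_table() is the right helper).
library(codebook)
mock <- readRDS("mock_qualtrics.rds")   # assumed file name for the mock data
codebook_table(mock)                    # variable names, labels, and n_missing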
Paleontological investigations into morphological diversity, or disparity, are often confronted with large amounts of missing data. We illustrate how missing discrete data affects disparity using a novel simulation for removing data based on parameters from published datasets that contain both extinct and extant taxa. We develop an algorithm that assesses the distribution of missing characters in extinct taxa and simulates data loss by applying that distribution to extant taxa. We term this technique ‘linkage’. We compare differences in disparity metrics and ordination spaces produced by linkage and by random character removal. When we incorporated linkage among characters, disparity metrics declined and ordination spaces shrank at a slower rate with increasing missing data, indicating that correlations among characters govern the sensitivity of disparity analysis. We also present and test a new disparity method that uses the linkage algorithm to correct for the bias caused by missing data...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset contains 2674 intermittent monthly time series that represent car parts sales from January 1998 to March 2002. It was extracted from the R expsmooth package.
The original dataset contains missing values, which have been replaced by zeros.
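A minimal R sketch of how the source data can be pulled from the expsmooth package and how the zero-replacement described above can be reproduced; the carparts object is the dataset provided by expsmooth, but the replacement step shown here is an assumption about how this dataset was prepared.

# Load the car parts series from the expsmooth package and replace NAs with zeros,
# as described above (sketch; not necessarily the exact preparation script).
library(expsmooth)
data("carparts")              # 2674 monthly series, Jan 1998 - Mar 2002
sum(is.na(carparts))          # count missing values in the original data
carparts_filled <- carparts
carparts_filled[is.na(carparts_filled)] <- 0   # replace missing values with zeros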
These data describe the percent of cropland harvested as wheat, corn, and soybean within each basin (basins 1-8; see accompanying shapefiles). Data are available for other crops; however, these three were chosen because wheat is a traditional crop that has been grown in the Basin for a long time, and corn and soybeans have increased in recent times because of wetter conditions, the demand for biofuels, and advances in breeding short-season, drought-tolerant crops. The data come from the National Agricultural Statistics Service (NASS) Census of Agriculture (COA) and have estimates for 1974, 1978, 1982, 1986, 1992, 1997, 2002, 2007, and 2012. Years with missing data were estimated using multivariate imputation of missing values with principal components analysis (PCA) via the function imputePCA in the R (R Core Team, 2015) package missMDA (Husson and Josse, 2015). In the interest of dimension reduction, the scores of the first principal component of a principal component analysis, by basin, of the wheat, corn, and soy variables are included. Husson, F., and Josse, J., 2015, missMDA—Handling missing values with multivariate data analysis: R package version 1.9, https://CRAN.R-project.org/package=missMDA. R Core Team, 2015, R: A language and environment for statistical computing: R Foundation for Statistical Computing, Vienna, http://www.R-project.org.
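A small R sketch of the missMDA workflow referenced above (illustrative only; the crop-percentage matrix is simulated here rather than taken from the Census data, and the number of components is chosen with estim_ncpPCA rather than fixed):

# Illustrative missMDA workflow for imputing crop percentages (not the original script).
library(missMDA)
set.seed(1)
crops <- matrix(runif(40, 0, 100), nrow = 10,
                dimnames = list(NULL, c("wheat", "corn", "soy", "other")))
crops[sample(length(crops), 6)] <- NA            # introduce some missing census years
ncp <- estim_ncpPCA(crops, ncp.max = 2)$ncp      # choose the number of PCA dimensions
imp <- imputePCA(crops, ncp = ncp)               # impute the missing values
crops_complete <- imp$completeObs                # completed data matrix
pc1 <- prcomp(crops_complete, scale. = TRUE)$x[, 1]   # scores of the first principal component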
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The genotypic file requires a header row, which shows the names of the individuals. The first column shows the names of the molecular markers or SNPs. For each SNP, the genotypes (e.g., AA, AT, TT) must be transformed to integer format (0, 1, or 2) based on allelic frequency. For example, if A is the major allele, AA and TT are transformed to 0 and 2, and AT is transformed to 1. The value represents the copy number of the minor allele for a line at a SNP locus, relative to the major allele. The phenotype file must contain a header row, which gives the names of the traits. The first column shows the names of the individuals (Table 2). The trait values are numeric, and missing values are indicated as NA, following the R coding convention.
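A small R sketch of the 0/1/2 coding described above (illustration only; the input format, with one vector of letter genotypes per SNP, is an assumption):

# Convert letter genotypes (e.g., "AA", "AT", "TT") to minor-allele counts (0/1/2),
# relative to the major allele, as described above. Illustration only.
geno_to_count <- function(g) {
  alleles <- unlist(strsplit(g[!is.na(g)], ""))   # split genotypes into alleles
  major   <- names(which.max(table(alleles)))     # most frequent allele = major allele
  sapply(g, function(x) {
    if (is.na(x)) return(NA_integer_)
    sum(strsplit(x, "")[[1]] != major)            # count minor-allele copies
  })
}

geno_to_count(c("AA", "AT", "TT", NA, "AA"))   # -> 0 1 2 NA 0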