Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PA: physical activity. Here we show only the first interview data for variables used as time-fixed in the model (height, education and smoking—following the change suggested by IDA) and remove the observations missing by design.
Missing values in proteomic data sets have real consequences for downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic level of protein quantification (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the most accurate methods for the larger proteomic data set. This strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
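As a hedged illustration of this kind of evaluation (not the authors' pipeline), the following R sketch masks a fraction of observed values in a toy log-intensity matrix, imputes with LLS and BPCA (pcaMethods) and random forest (missForest), and compares normalized RMSE on the masked entries. The matrix and masking fraction are hypothetical.
# Minimal sketch: compare imputation methods by masking observed values in a toy
# samples x proteins log-intensity matrix (hypothetical data, not the study's).
library(pcaMethods)   # llsImpute(), pca(method = "bpca")
library(missForest)   # missForest()
set.seed(1)
prot <- matrix(rnorm(60 * 40, mean = 20, sd = 2), nrow = 60,
               dimnames = list(paste0("sample", 1:60), paste0("prot", 1:40)))
prot[sample(length(prot), 200)] <- NA          # pre-existing missingness
obs  <- which(!is.na(prot))
mask <- sample(obs, round(0.05 * length(obs))) # hide 5% of observed entries
truth <- prot[mask]
prot_masked <- prot
prot_masked[mask] <- NA
nrmse <- function(est) sqrt(mean((est - truth)^2)) / sd(truth)
lls_fit  <- completeObs(llsImpute(prot_masked, k = 10, allVariables = TRUE)) # local structure
bpca_fit <- completeObs(pca(prot_masked, method = "bpca", nPcs = 5))         # global structure
rf_fit   <- missForest(prot_masked)$ximp                                     # random forest
c(LLS  = nrmse(lls_fit[mask]),
  BPCA = nrmse(bpca_fit[mask]),
  RF   = nrmse(as.matrix(rf_fit)[mask]))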
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Owing to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following commands will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
mkdir -p "$DOWNLOAD_DIR"  # ensure the download directory exists
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY); each subdirectory contains one netCDF image file per day (DD) of each month (MM), on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). The file names follow this convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
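For users working in R, the following is a minimal sketch for reading one daily file with the ncdf4 package; the example path and the soil moisture variable name ("sm") are assumptions — check names(nc$var) and the netCDF attributes of the downloaded files for the actual variable names.
# Minimal R sketch: read one gap-filled daily image (variable name "sm" is assumed).
library(ncdf4)
f <- "~/Downloads/2020/ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
nc <- nc_open(f)
print(names(nc$var))                  # list available data variables
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
sm  <- ncvar_get(nc, "sm")            # assumed surface soil moisture variable (lon x lat)
nc_close(nc)
# Quick look at the global field; flip latitude if it is stored in descending order
if (lat[1] > lat[length(lat)]) { sm <- sm[, rev(seq_along(lat))]; lat <- rev(lat) }
image(lon, lat, sm, main = "ESA CCI SM GAPFILLED 2020-01-01")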
Changes in v09.1r1 (previous version was v09.1):
These data can be read by any software that supports the Climate and Forecast (CF) metadata conventions for netCDF files, such as:
The following records are all part of the ESA CCI Soil Moisture science data record community:
- ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General information
The script runs with R (version 3.1.1, 2014-07-10) and the packages plyr (1.8.1), XLConnect (0.2-9), utilsMPIO (0.0.25), sp (1.0-15), rgdal (0.8-16), tools (3.1.1) and lattice (0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com).
Data collection and the derivation of the individual variables are described in:
- Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): 20131016.
- Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015.
Data are available as an RData file. Missing values are NA. For better readability, the subsections of the script can be collapsed.
Description of the method
1 - The data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
2 - A red rectangle indicates the active field; clicking with the mouse on the depicted light signal within that field generates a data point that is automatically saved in the csv file (via a custom-made function). For this data extraction it is recommended to always click on the bottom line of the red rectangle, as data are always available there thanks to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. A data point is captured only if a greenish vertical bar appears and a new line of data appears in the R console.
3 - To extract incubation bouts, the first click in a new plot has to be the start of incubation, the next click marks the end of that incubation bout, and a further click at the same position marks the start of incubation for the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout of a given plot is always assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.
4 - Once all information from one day (panel) has been extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.
5 - To end the extraction before going through all the rectangles, press "escape".
Annotations of data files from turnstone_2009_Barrow_nest-t401_transmitter.RData
dfr -- raw data on signal strength from the radio tags attached to the rumps of the female and male, and information about when the birds were captured and the incubation stage of the nest:
1. who: identifies whether the recording refers to female, male, capture or start of hatching
2. datetime_: date and time of each recording
3. logger: unique identity of the radio tag
4. signal_: signal strength of the radio tag
5. sex: sex of the bird (f = female, m = male)
6. nest: unique identity of the nest
7. day: datetime_ variable truncated to year-month-day format
8. time: time of day in hours
9. datetime_utc: date and time of each recording, but in UTC time
10. cols: colors assigned to "who"
m -- metadata for a given nest:
1. sp: identifies species (RUTU = Ruddy Turnstone)
2. nest: unique identity of the nest
3. year_: year of observation
4. IDfemale: unique identity of the female
5. IDmale: unique identity of the male
6. lat: latitude coordinate of the nest
7. lon: longitude coordinate of the nest
8. hatch_start: date and time when the hatching of the eggs started
9. scinam: scientific name of the species
10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
11. logger: type of device used to record incubation (IT - radio tag)
12. sampling: mean incubation sampling interval in seconds
s -- metadata for the incubating parents:
1. year_: year of capture
2. species: identifies species (RUTU = Ruddy Turnstone)
3. author: identifies the author who measured the bird
4. nest: unique identity of the nest
5. caught_date_time: date and time when the bird was captured
6. recapture: was the bird captured before? (0 - no, 1 - yes)
7. sex: sex of the bird (f = female, m = male)
8. bird_ID: unique identity of the bird
9. logger: unique identity of the radio tag
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability; it will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
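The fuzzy matching step can be illustrated with a hedged R sketch using the stringdist package; whether the original scripts use this exact package, and the cut-off values shown, are assumptions for illustration only.
# Minimal sketch of fuzzy title matching with two string metrics (cosine and OSA);
# the candidate titles and thresholds are illustrative.
library(stringdist)
core_title <- "Portrait of a Lady on Fire"
candidates <- c("Portrait of a Lady on Fire",
                "Portrait of a Ladyy on Fire",        # typo variant
                "Portrait de la jeune fille en feu")  # alternative-language title
cosine_sim <- stringsim(core_title, candidates, method = "cosine", q = 2)
osa_sim    <- stringsim(core_title, candidates, method = "osa")
# Keep candidates that pass either metric (illustrative cut-offs)
data.frame(candidate = candidates,
           cosine = round(cosine_sim, 2),
           osa    = round(osa_sim, 2),
           match  = cosine_sim > 0.9 | osa_sim > 0.8)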
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does this for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset and scripts used for manuscript: High consistency and repeatability in the breeding migrations of a benthic shark.
Project title: High consistency and repeatability in the breeding migrations of a benthic shark
Date: 23/04/2024
Folders:
- 1_Raw_data
- Perpendicular_Point_068151, Sanctuary_Point_068088, SST raw data, sst_nc_files, IMOS_animal_measurements, IMOS_detections, PS&Syd&JB tags, rainfall_raw, sample_size, Point_Perpendicular_2013_2019, Sanctuary_Point_2013_2019, EAC_transport
- 2_Processed_data
- SST (anomaly, historic_sst, mean_sst_31_years, week_1992_sst:week_2022_sst including week_2019_complete_sst)
- Rain (weekly_rain, weekly_rainfall_completed)
- Clean (clean, cleaned_data, cleaned_gam, cleaned_pj_data)
- 3_Script_processing_data
- Plots (dual_axis_plot (Fig. 1 & Fig. 4).R, period_plot (Fig. 2).R, sd_plot (Fig. 5).R, sex_plot (Fig. 3).R)
- cleaned_data.R, cleaned_data_gam.R, weekly_rainfall_completed.R, descriptive_stats.R, sst.R, sst_2019b.R, sst_anomaly.R
- 4_Script_analyses
- gam.R, gam_eac.R, glm.R, lme.R, Repeatability.R
- 5_Output_doc
- Plots (arrival_dual_plot_with_anomaly (Fig. 1).png, period_plot (Fig.2).png, sex_arrival_departure (Fig. 3).png, departure_dual_plot_with_anomaly (Fig. 4).png, standard deviation plot (Fig. 5).png)
- Tables (gam_arrival_eac_selection_table.csv (Table S2), gam_departure_eac_selection_table (Table S5), gam_arrival_selection_table (Table. S3), gam_departure_selection_table (Table. S6), glm_arrival_selection_table, glm_departure_selection_table, lme_arrival_anova_table, lme_arrival_selection_table (Table S4), lme_departure_anova_table, lme_departure_selection_table (Table. S8))
Descriptions of scripts and files used:
- cleaned_data.R: script to extract detections of sharks at Jervis Bay. Calculate arrival and departure dates over the seven breeding seasons. Add sex and length for each individual. Extract moon phase (numerical value) and period of the day from arrival and departure times.
- IMOS_detections.csv: raw data file with detections of Port Jackson sharks over different sites in Australia.
- IMOS_animal_measurements.csv: raw data file with morphological data of Port Jackson sharks
- PS&Syd&JB tags: file with measurements and sex identification of sharks (different from IMOS, it was used to complete missing sex and length).
- cleaned_data.csv: file with arrival and departure dates of the final sample size of sharks (N=49) with missing sex and length for some individuals.
- clean.csv: completed file using PS&Syd&JB tags; note: tag ID 117393679 was wrongly identified as a male in IMOS and correctly identified as a female in the PS&Syd&JB tags file, as indicated by its large size.
- cleaned_pj_data: Final data file with arrival and departure dates, sex, length, moon phase (numerical) and period of the day.
- weekly_rainfall_completed.R: script to calculate average weekly rainfall and correlation between the two weather stations used (Point perpendicular and Sanctuary point).
- weekly_rain.csv: file with the corresponding week number (1-28) for each date (01-06-2013 to 13-12-2019)
- weekly_rainfall_completed.csv: file with week number (1-28), year (2013-2019) and weekly rainfall average completed with Sanctuary Point for week 2 of 2017
- Point_Perpendicular_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Point Perpendicular weather station
- Sanctuary_Point_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Sanctuary Point weather station
- IDCJAC0009_068088_2017_Data.csv: Rainfall (mm) from 01-01-2017 to 31-12-2017 at the Sanctuary Point weather station (to fill in missing value for average rainfall of week 2 of 2017)
- cleaned_data_gam.R: script to calculate weekly counts of sharks to run gam models and add weekly averages of rainfall and sst anomaly
- cleaned_pj_data.csv
- anomaly.csv: weekly (1-28) average sst anomalies for Jervis Bay (2013-2019)
- weekly_rainfall_completed.csv: weekly (1-28) average rainfall for Jervis Bay (2013-2019)
- sample_size.csv: file with the number of sharks tagged (13-49) for each year (2013-2019)
- sst.R: script to extract daily and weekly sst from IMOS nc files from 01-05 until 31-12 for the following years: 1992:2022 for Jervis Bay
- sst_raw_data: folder with all the raw weekly (1:28) csv files for each year (1992:2022) to fill in with sst data using the sst script
- sst_nc_files: folder with all the nc files downloaded from IMOS from the last 31 years (1992-2022) at the sensor (IMOS - SRS - SST - L3S-Single Sensor - 1 day - night time – Australia).
- SST: folder with the average weekly (1-28) sst data extracted from the nc files using the sst script for each of the 31 years (to calculate temperature anomaly).
- sst_2019b.R: script to extract daily and weekly sst from IMOS nc file for 2019 (missing value for week 19) for Jervis Bay
- week_2019_sst: weekly average sst 2019 with a missing value for week 19
- week_2019b_sst: sst data from 2019 with another sensor (IMOS – SRS – MODIS - 01 day - Ocean Colour-SST) to fill in the gap of week 19
- week_2019_complete_sst: completed average weekly sst data from the year 2019 for weeks 1-28.
- sst_anomaly.R: script to calculate mean weekly sst anomaly for the study period (2013-2019) using mean historic weekly sst (1992-2022)
- historic_sst.csv: mean weekly (1-28) and yearly (1992-2022) sst for Jervis Bay
- mean_sst_31_years.csv: mean weekly (1-28) sst across all years (1992-2022) for Jervis Bay
- anomaly.csv: mean weekly and yearly sst anomalies for the study period (2013-2019)
- Descriptive_stats.R: script to calculate minimum and maximum length of sharks, mean Julian arrival and departure dates per individual per year, mean Julian arrival and departure dates per year for all sharks (Table. S10), summary of standard deviation of julian arrival dates (Table. S9)
- cleaned_pj_data.csv
- gam.R: script used to run the Generalized additive model for rainfall and sea surface temperature
- cleaned_gam.csv
- glm.R: script used to run the Generalized linear mixed models for the period of the day and moon phase
- cleaned_pj_data.csv
- sample_size.csv
- lme.R: script used to run the Linear mixed model for sex and size
- cleaned_pj_data.csv
- Repeatability.R: script used to run the repeatability analysis for Julian arrival and Julian departure dates (a hedged example sketch follows the file list below)
- cleaned_pj_data.csv
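As flagged above, here is a hedged R sketch of a repeatability analysis for Julian arrival date using the rptR package; the package choice, the toy data, and the model structure are illustrative assumptions, not a reproduction of Repeatability.R or of cleaned_pj_data.csv.
# Minimal sketch: repeatability of Julian arrival date across years, with toy data
# (30 hypothetical sharks, 5 seasons each); variable names are placeholders.
library(rptR)
set.seed(1)
pj <- data.frame(
  bird_ID        = factor(rep(1:30, each = 5)),   # individual identity
  year           = rep(2013:2017, times = 30),
  julian_arrival = rep(rnorm(30, 200, 10), each = 5) + rnorm(150, 0, 5)
)
rep_arrival <- rpt(julian_arrival ~ (1 | bird_ID), grname = "bird_ID",
                   data = pj, datatype = "Gaussian", nboot = 100, npermut = 0)
print(rep_arrival)   # repeatability estimate with bootstrap confidence interval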
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Included are:
1. The raw data (before cleaning and preprocessing) can be found in the files ending "Raw3". The codebooks for each of these data files end in "codebook". This will enable the user to identify the statements that are associated with the items EU1 … 7, Eco1 … 7, Cul1 … 7, AD1 and AD2 that are used in the manuscript.
2. The R codes ending cleaning_plus.R are used to a) clean the datasets according to the procedure outlined in the online Appendix and b) remove entries with missing values for any of the variables that are used in the calibration process to produce balanced datasets (age, education, gender, political interest). Because of step b), the new datasets generated will be smaller than the clean datasets listed in Table 1 of the Appendix.
3. For the balancing and calibrating (pre-processing), we use a) the datasets for each country generated by 2 above (the files that are followed by the suffix "_clean"), b) the file drop.py, which is the code (in Python) for the balancing algorithm based on the principle of raking (see the online Appendix; a hedged raking sketch follows this list), c) the R files that are used to generate the new calibrated datasets that will be used in the Mokken Scale analysis in 5 below (followed by the suffix "balCode"), and d) a set of files ending in the suffix "estimates" that contain the joint distributions derived from the ESS data (i) for age, below versus above the median age, and (ii) for education, degree versus no degree, as well as the marginal distributions for gender and political interest. The median ages of the voting population derived from the ESS are as follows: Austria: 50; Bulgaria: 52; Croatia: 52; Cyprus: 47; Czech Republic: 50; Denmark: 50; England: 53; Estonia: 50; Finland: 54; France: 55; Germany: 53; Greece: 50; Hungary: 49; Ireland: 50; Italy: 50; Lithuania: 53; Poland: 50; Portugal: 52; Romania: 46; Slovakia: 52; Slovenia: 52; Spain: 50.
4. A set of data files with the suffix myBal, which contain the new calibrated datasets that will be used in the Mokken Scale analysis in 5 (below).
5. A set of R codes for each country, beginning with the prefix "RCodes", that are used to generate the findings on dimensionality that are presented in the manuscript.
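As mentioned in item 3 above, here is a hedged R sketch of the raking principle (iterative proportional fitting) using survey::rake; it is an R analogue for illustration only, not the drop.py algorithm used in the study, and all data and margin values are hypothetical.
# Minimal sketch of raking: calibrate weights so the sample reproduces assumed
# marginal distributions for gender and political interest (hypothetical margins).
library(survey)
set.seed(1)
dat <- data.frame(
  gender = sample(c("f", "m"), 500, replace = TRUE),
  polint = sample(c("low", "high"), 500, replace = TRUE)
)
dsgn <- svydesign(ids = ~1, data = dat, weights = rep(1, nrow(dat)))
# Target (population) counts for each margin, e.g. derived from ESS estimates
pop_gender <- data.frame(gender = c("f", "m"),      Freq = c(260, 240))
pop_polint <- data.frame(polint = c("low", "high"), Freq = c(300, 200))
raked <- rake(dsgn,
              sample.margins     = list(~gender, ~polint),
              population.margins = list(pop_gender, pop_polint))
summary(weights(raked))   # calibrated weights reproduce both margins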
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Average performance of imputation approaches across performance measures for the 27-item MCQ.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference
Sullivan, J.A., Samii, C., Brown, D., Moyo, F., Agrawal, A. 2023. Large-scale land acquisitions exacerbate local farmland inequalities in Tanzania. Proceedings of the National Academy of Sciences 120, e2207398120. https://doi.org/10.1073/pnas.2207398120
Abstract
Land inequality stalls economic development, entrenches poverty, and is associated with environmental degradation. Yet, rigorous assessments of land-use interventions attend to inequality only rarely. A land inequality lens is especially important to understand how recent large-scale land acquisitions (LSLAs) affect smallholder and indigenous communities across as much as 100 million hectares around the world. This paper studies inequalities in land assets, specifically landholdings and farm size, to derive insights into the distributional outcomes of LSLAs. Using a household survey covering four pairs of land acquisition and control sites in Tanzania, we use a quasi-experimental design to characterize changes in land inequality and subsequent impacts on well-being. We find convincing evidence that LSLAs in Tanzania lead to both reduced landholdings and greater farmland inequality among smallholders. Households in proximity to LSLAs are associated with 21.1% (P = 0.02) smaller landholdings, while evidence, although insignificant, is suggestive that farm sizes are also declining. Aggregate estimates, however, hide that households in the bottom quartiles of farm size suffer the brunt of landlessness and land loss induced by LSLAs, which combine to generate greater farmland inequality. Additional analyses find that land inequality is not offset by improvements in other livelihood dimensions; rather, farm size decreases among households near LSLAs are associated with no income improvements, lower wealth, increased poverty, and higher food insecurity. The results demonstrate that without explicit consideration of distributional outcomes, land-use policies can systematically reinforce existing inequalities.
Replication Data
We include anonymized household survey data from our analysis to support open and reproducible science. In particular, we provide i) an anonymized household dataset collected in 2018 (n=994) for households nearby (treatment) and far away from (control) LSLAs and ii) a household dataset collected in 2019 (n=165) within the same sites. For the 2018 surveys, several anonymized extracts are provided, including an imputed (n=10) dataset used to fill in missing data for the main analysis. These data can be found in the hh_data folder and include:
- hh_imputed10_2018: anonymized household dataset for 2018 with variables used for the main analysis, where missing data was imputed 10 times
- hh_compensation_2018: anonymized household extract for 2018 representing household benefits and compensation directly received from LSLAs
- hh_migration_2018: anonymized household extract for 2018 representing household migration behavior following LSLAs
- hh_rsdata_2018: extracted remote sensing data at the household geo-location for 2018
- hh_land_2019: anonymized household extract for 2019 of land variables
Our analysis also incorporates data from the Living Standards Measurement Survey (LSMS) collected by the World Bank (found in the lsms_data folder). We provide the sub-modules from the LSMS dataset relevant to our analysis, but the full datasets can be accessed through the World Bank's Microdata Library (https://microdata.worldbank.org/index.php/home). Across several analyses we use the LSLA boundaries for our four selected sites; we provide a shapefile for these boundaries in the gis_data folder. Finally, our data replication includes several model outputs (found in mod_outputs), particularly those that are lengthy to run in R. These datasets can optionally be loaded into R rather than re-running the analysis using our main_analysis.Rmd script.
Replication Code
We provide replication code in the form of R Markdown (.Rmd) or R (.R) files. Alongside the replication data, this can be used to reproduce the main figures, tables, supplementary materials, and results reported in our article. Scripts include:
- main_analysis.Rmd: main analysis supporting the findings, graphs, and tables reported in our main manuscript
- compensation.R: analysis of benefits and compensation received directly by households from LSLAs
- landvalue.R: analysis of household land values as a function of distance from LSLAs
- migration.R: analysis of migration behavior following LSLAs
- selection_bias.R: analysis of LSLA selection bias between control and treatment enumeration areas
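The main analysis uses a household dataset imputed 10 times; as a hedged illustration of that kind of workflow (not the authors' exact code), the following R sketch runs multiple imputation with mice (m = 10) and pools a simple regression across the imputations. All variable names and the toy data are hypothetical.
# Minimal sketch of multiple imputation (m = 10) and pooled estimation with mice;
# farm_size, dist_lsla and hh_size are hypothetical placeholders.
library(mice)
set.seed(123)
hh <- data.frame(
  dist_lsla = runif(60, 0, 5),
  hh_size   = sample(2:8, 60, replace = TRUE)
)
hh$farm_size <- 2 - 0.2 * hh$dist_lsla + 0.1 * hh$hh_size + rnorm(60, 0, 0.5)
hh$farm_size[sample(60, 10)] <- NA              # introduce missingness
imp <- mice(hh, m = 10, method = "pmm", seed = 123, printFlag = FALSE)
fit <- with(imp, lm(farm_size ~ dist_lsla + hh_size))
summary(pool(fit))   # Rubin's rules combine estimates across the 10 imputations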
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annual hourly air quality and meteorological data by monitoring site for the 2014 calendar year. For more information on air quality, including live air data, please visit environment.des.qld.gov.au/air.
Data resolution: One-hour average values (one-hour sum for rainfall)
Data row timestamp: Start of averaging period
Missing data/not monitored: Blank cell
Calm conditions: No hourly average wind direction is reported when the hourly average wind speed is zero
Barometric pressure: Values are at monitoring station elevation, not corrected to mean sea level
Daily zero/span response check: Automated instrument zero/span response checks are conducted daily between midnight and 1am at Queensland Government sites (this can differ at industry sites). Where this takes place, an ambient hourly value cannot be reported.
Sampling height: Four metres above ground (unless otherwise indicated)
PLEASE NOTE:
* The Townsville Coast Guard 2014 air quality monitoring site data was updated on 26/10/2015 because the wind direction sensor was misaligned; the reported wind direction values have now been corrected.
* The Auckland Point 2014 air quality monitoring site data was updated on 24/04/2018 to remove invalid wind data due to a sensor fault.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.
The I-AMICA project was planned to increase the observational capacity of the monitoring of marine coastal ecosystems, which are particularly vulnerable in the sensitive Mediterranean area and strictly connected to the natural and anthropic continental system. For this reason, research activities were mainly focused on the neritic environment adjacent to the continental shelf in front of the Volturno river mouth, within a bathymetric range of 5-50 m. Advanced knowledge on the dynamics in time of marine coastal ecosystems, in relation to the physical, chemical and biological processes that characterize their habitat, was acquired, while new methods of integrated monitoring, tailored to the specific characteristics of the study area, were tested. Particular attention was given to the identification of bio-indicators in the water column and in the sediment at the sea floor.
During each survey, 20-25 hydrological casts along five transects perpendicular to the coast were collected at a depth range of 9-50 m. A quasi-regular grid of about 2 km in longitude and 3 km in latitude represents the classic strategy for a synoptic ocean sampling. Data on pressure, conductivity, temperature, dissolved oxygen, pH, beam transmission and attenuation, and chlorophyll-a fluorescence (Chl-a) were acquired by sensors installed on a SBE11plus (firmware version 5.0) multiparametric probe by Sea-Bird Inc. The beam transmission and attenuation were used to estimate "turbidity", i.e. to measure the attenuation of the infrared beam from the emitter to the receiver; the sensor reports both % and 1/m, but only the % was used. The sensors were calibrated in 2013 for pressure, conductivity, temperature and oxygen (at Sea-Bird Inc.) and in 2011 for pH, the transmissometer and the fluorometer. The vertical profiles of all parameters were obtained by sampling the signals at 24 Hz, with the CTD/rosette going down at a speed of 1 m/s. The probe was used on board the R/V Astrea of ISPRA, a vessel with a length overall of 24 m, a breadth extreme of 6 m and a draught of 3 m, which can carry any type of instrumentation and perform oceanographic research (biological, chemical and physical) in coastal and high-seas areas. Each survey was performed in 1-2 days and in good weather and sea conditions.
A quality-check control on the acquired CTD data was performed in order to remove possible spikes along the profiles. The raw data collected were converted and processed using the SBE Data Processing software (version 7.26), while the Ocean Data View software [Schlitzer, R. (2019). Ocean Data View. https://odv.awi.de/] was used for the representation of the sections of the sampled transects in the paper. The data set is provided per cruise as ODV Spreadsheet files in TXT format, where missing data values are set to -1.e10.
Meta variables: Cruise name; Station; Type of acquisition (here C); Date in mon/day/yr and Time in hh:mm; Longitude [degrees_east]; Latitude [degrees_north]; Bot. Depth [m].
Data variables: Pressure, Digiquartz with TC [db]; Temperature [deg C]; Conductivity [mS/cm]; Oxygen, SBE43 [ml/l]; Fluorescence, Turner Cyclops fluorometer (no data in I-AMICA7); Beam transmission (%) and attenuation (1/m), Transmissometer, WET Labs C-Star.
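A hedged R sketch for reading one of these ODV Spreadsheet TXT files and converting the -1.e10 fill value to NA is shown below; a toy file is written first so the sketch is self-contained, and the exact column layout of the real exports may differ.
# Minimal sketch: read an ODV Spreadsheet TXT export and set the fill value to NA.
# A two-station toy file stands in for a real per-cruise export.
toy <- c("//ODV Spreadsheet export (toy example)",
         paste("Cruise", "Station", "Longitude [degrees_east]",
               "Latitude [degrees_north]", "Temperature [deg C]", sep = "\t"),
         paste("I-AMICA1", "1", "13.95", "41.05", "18.2", sep = "\t"),
         paste("I-AMICA1", "2", "13.90", "41.00", "-1.e10", sep = "\t"))
writeLines(toy, "iamica_toy.txt")
ctd <- read.delim("iamica_toy.txt", comment.char = "/", check.names = FALSE)
num_cols <- vapply(ctd, is.numeric, logical(1))
ctd[num_cols] <- lapply(ctd[num_cols], function(x) replace(x, x <= -1e9, NA))
ctd   # the -1.e10 fill value now appears as NA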
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated within the research project Constructing AcTive CitizensHip with European Youth: Policies, Practices, Challenges and Solutions (CATCH-EyoU), funded by the European Union, Horizon 2020 Programme, Grant Agreement No 649538 (http://www.catcheyou.eu/).
The data set consists of:
- 1 data file saved in .sav format: “CATCH-EyoU Processes in Youth’s Construction of Active EU Citizenship Cross-national Wave 1 Questionnaires Italy, Sweden, Germany, Greece, Portugal, Czech Republic, UK, and Estonia - EXTRACT.sav”
- 1 README file
The file was generated through IBM SPSS software. Discrete missing values: 88, 99. The .sav file (SPSS) can be processed using R (library “foreign”): https://cran.r-project.org
This dataset relates to the following paper: Ekaterina Enchikova, Tiago Neves, Sam Mejias, Veronika Kalmus, Elvira Cicognani, Pedro Ferreira (2019) Civic and Political Participation of European Youth: fair measurement in different cultural and social contexts. Frontiers in Education.
Data Set Contact Person: Ekaterina Enchikova [UP-CIIE]; mail: enchicova@gmail.com
Data Set License: this data set is distributed under a Creative Commons Attribution (CC-BY) license: http://creativecommons.org/licenses
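A minimal sketch for reading the .sav extract in R with the foreign package and recoding the declared discrete missing values (88, 99) to NA is shown below; the shortened file name is a placeholder for the full .sav name listed above.
# Minimal sketch: read the SPSS extract and recode discrete missing values (88, 99) to NA.
# Replace the placeholder file name with the full .sav name from the dataset.
library(foreign)
dat <- read.spss("CATCH-EyoU_Wave1_EXTRACT.sav", to.data.frame = TRUE,
                 use.value.labels = FALSE)
num_cols <- vapply(dat, is.numeric, logical(1))
dat[num_cols] <- lapply(dat[num_cols], function(x) replace(x, x %in% c(88, 99), NA))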
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
unofficial mirror of FPT Open Speech Dataset (FOSD)
Released publicly in 2018 by FPT Corporation; 100 h of audio, 25.9k samples.
Official link (dead): https://fpt.ai/fpt-open-speech-data/
Mirror: https://data.mendeley.com/datasets/k9sxg2twv4/4 (DOI: 10.17632/k9sxg2twv4.4)
Pre-processing:
- remove nonsense strings: -N \r
- remove 4 files because of missing transcriptions: Set001_V0.1_008210.mp3, Set001_V0.1_010753.mp3, Set001_V0.1_011477.mp3, Set001_V0.1_011841.mp3
- still to do: check misspelling usage… See the full description on the dataset page: https://huggingface.co/datasets/doof-ferb/fpt_fosd.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings is an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.
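As a small, hedged illustration of one data-screening step from this framework (the evaluation of missing data), the R sketch below tabulates the percentage of missing grip strength values by wave and sex; the data frame and variable names are hypothetical and are not taken from the survey data or the authors' repository code.
# Hypothetical illustration of one IDA screening step: percentage of missing
# outcome values (grip strength) by wave and sex. All names are placeholders.
library(dplyr)
set.seed(1)
d <- data.frame(
  wave = rep(1:4, each = 250),
  sex  = sample(c("female", "male"), 1000, replace = TRUE),
  grip = ifelse(runif(1000) < 0.15, NA, rnorm(1000, 30, 8))
)
d %>%
  group_by(wave, sex) %>%
  summarise(n           = n(),
            n_missing   = sum(is.na(grip)),
            pct_missing = round(100 * mean(is.na(grip)), 1),
            .groups = "drop")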
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Initial data analysis checklist for data screening in longitudinal studies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Background
The Infinium EPIC array measures the methylation status of more than 850,000 CpG sites. The EPIC BeadChip uses two probe designs: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results
The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC based on pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that poor probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
Methods
Study Participants and Samples
The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Ben-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one; and 76.41±6.17 at time point two and comprised 13 men and 11 women.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).
Blood Collection and Processing
Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) due to discontinuation of the equipment, but using the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/uL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole genome sequencing data are also available for the samples described above.
Characterization of DNA Methylation using the EPIC array
Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).
Processing and Analysis of DNA Methylation Data
The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the 64 samples and compared the inferred sex to reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the out-of-band probes empirical distribution with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in both analyses were combined and removed from the data.
Normalization Methods Evaluated
The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
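As a hedged illustration of the replicate-based comparison described above (not the authors' exact code), the following R sketch computes the mean absolute beta-value difference per probe between replicate pairs and a per-probe ICC with the irr package; the beta matrix and the pairing structure are hypothetical.
# Minimal sketch: compare replicates. 'beta' is a hypothetical probes x samples matrix
# of beta values; the 16 replicate pairs are assumed to sit in adjacent columns.
library(irr)
set.seed(42)
beta <- matrix(runif(1000 * 32), nrow = 1000,
               dimnames = list(paste0("cg", 1:1000), paste0("s", 1:32)))
pair_a <- seq(1, 32, by = 2)   # first member of each of the 16 replicate pairs
pair_b <- pair_a + 1           # second member
# Mean absolute beta-value difference per probe across the 16 pairs
abs_diff <- rowMeans(abs(beta[, pair_a] - beta[, pair_b]))
summary(abs_diff)
# Per-probe ICC across replicate pairs (two-way model, single measure, agreement)
icc_one_probe <- icc(cbind(beta[1, pair_a], beta[1, pair_b]),
                     model = "twoway", type = "agreement", unit = "single")
icc_one_probe$value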
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List
SSM_PACKAGE_Feb_2009.zip -- A package containing all files needed, including instructions, to implement the models described in this paper. Listed below are the individual files contained in SSM_PACKAGE_Feb_2009.zip:
- SSM_instructions.pdf -- A set of instructions for implementing the SSM for Argos tracking data using the files provided here. Detailed descriptions of each file and how to use them may be found in this document.
- .RData -- An R workspace with all needed scripts and functions already loaded.
- 1diff2stateM.bug -- The WinBUGS CRW state–space model file.
- add.missing.dates.R -- Small subroutine for handling days with no Argos locations.
- calcj.R -- Small subroutine for indexing irregular data to regular timesteps.
- dat2bugslite.R -- Major subroutine for data preparation.
- find.missing.dates.R -- Small subroutine for handling days with no Argos locations (needs add.missing.dates.R).
- prepdat.R -- Function called to select, extract, and prepare data for WinBUGS from the sample data set testdata.csv.
- runSSM.R -- Simple script that allows for easy adjustment of important MCMC parameters and executes the call to WinBUGS via wbs.R.
- saveresults.R -- Function that saves the means and medians of lat, long, and behavioral state as a small text file for easy import into a mapping program of the user's choice for inspection.
- seald.R -- Small subroutine that extracts raw data from the testdata.csv datafile.
- step.time.R -- Small subroutine needed to index the irregular data to regular timesteps.
- testdata.csv -- A sample data set including three complete grey seal tracks from the North Atlantic.
- wbs.R -- The main function which calls WinBUGS from R; includes all the information to create MCMC initials.
Description
SSM_PACKAGE_Feb_2009.zip contains all scripts, functions, and sample data needed to fit the state–space correlated random walk models presented in this paper. Following the instructions (SSM_instructions.pdf) should allow readers to reproduce the results and/or fit their own Argos tracking data.
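A hedged sketch of the general mechanism that wbs.R relies on (calling WinBUGS from R with the R2WinBUGS package) is shown below; the toy normal model is NOT the CRW state–space model in 1diff2stateM.bug, and the data, initial values and install path are placeholders that only illustrate the calling convention.
# Generic illustration of running a .bug model from R with R2WinBUGS; requires a
# local WinBUGS installation. Everything model-specific here is a placeholder.
library(R2WinBUGS)
model_txt <- "
model {
  for (i in 1:N) { y[i] ~ dnorm(mu, tau) }
  mu ~ dnorm(0, 0.0001)
  tau ~ dgamma(0.001, 0.001)
}"
writeLines(model_txt, "toy_model.bug")
fit <- bugs(data = list(y = rnorm(20, 5, 1), N = 20),
            inits = function() list(mu = 0, tau = 1),
            parameters.to.save = c("mu", "tau"),
            model.file = "toy_model.bug",
            n.chains = 2, n.iter = 2000, n.burnin = 1000,
            bugs.directory = "c:/Program Files/WinBUGS14/")  # adjust to local install
print(fit)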