20 datasets found
  1. Percentage (%) and number (n) of missing values in the explanatory variables...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the explanatory variables and outcome by measurement occasion and sex. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PA: physical activity. Here we show only the first interview data for variables used as time-fixed in the model (height, education and smoking—following the change suggested by IDA) and remove the observations missing by design.

  2. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated May 3, 2021
    Cite
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J. (2021). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000907442
    Explore at:
    Dataset updated
    May 3, 2021
    Authors
    Dabke, Kruttika; Jones, Michelle R.; Kreimer, Simion; Parker, Sarah J.
    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set, and it provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
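
    As a schematic illustration of this kind of evaluation (a base-R sketch with invented data, not the authors' protocol; the row-mean stand-in is not one of the eight methods assessed):

    # Hide a fraction of observed values, impute them, and score the error
    # against the known truth. 'mat' is an invented proteins-x-samples matrix.
    set.seed(1)
    mat <- matrix(rnorm(20 * 50, mean = 20, sd = 2), nrow = 20)

    truth <- mat
    mask <- matrix(runif(length(mat)) < 0.1, nrow = nrow(mat)) # mask 10% of entries
    mat[mask] <- NA

    # Stand-in imputation: per-protein (row) mean; a real evaluation would
    # swap in LLS, RF, BPCA, etc.
    imputed <- t(apply(mat, 1, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))

    # Normalized RMSE on the artificially masked entries only
    sqrt(mean((imputed[mask] - truth[mask])^2)) / sd(truth[mask])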

  3. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    • b2find.eudat.eu
    zip
    Updated Sep 5, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset Paper (Open Access)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Owing to this intrinsic challenge, no global, long-term, univariate gap-filled product was available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that filled satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and they consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023): https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following commands will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023: download, extract, then remove the archive
    for year in {1991..2023}; do
        echo "Downloading $year.zip..."
        wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped into subdirectories by year (YYYY); each subdirectory contains one netCDF image file per day (DD) of each month (MM), on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). The file names follow this convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`).
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In such cases, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.
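
    As a minimal sketch of reading one daily file in R (assuming the ncdf4 package; the file name is the 1 January 2023 instance of the convention above):

    # Read the gap-filled field and its observation mask from one daily file
    library(ncdf4)

    nc <- nc_open("ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20230101000000-fv09.1r1.nc")
    sm <- ncvar_get(nc, "sm")           # volumetric soil moisture (m3/m3)
    gapmask <- ncvar_get(nc, "gapmask") # 1 = satellite observation, 0 = gap-filled
    nc_close(nc)

    # Fraction of grid cells that are gap-filled on this day
    mean(gapmask == 0, na.rm = TRUE)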

    Version Changelog

    Changes in v09.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, such as the ncdf4 or RNetCDF packages in R, the netCDF4 and xarray libraries in Python, or the Panoply viewer.

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the ESA CCI Soil Moisture science data records community:

    1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  4. Example of how to manually extract incubation bouts from interactive plots...

    • figshare.com
    txt
    Updated Jan 22, 2016
    Cite
    Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 22, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Bulla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General information: The script runs with R (version 3.1.1; 2014-07-10) and the packages plyr (1.8.1), XLConnect (0.2-9), utilsMPIO (0.0.25), sp (1.0-15), rgdal (0.8-16), tools (3.1.1) and lattice (0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com). Data collection and how the individual variables were derived are described in: Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016; and Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015. Data are available as an Rdata file; missing values are NA. For better readability, the subsections of the script can be collapsed.

    Description of the method:

    1. Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
    2. A red rectangle indicates the active field; clicking with the mouse on the depicted light signal in that field generates a data point that is automatically saved to the csv file (via a custom-made function). For this data extraction it is recommended to always click on the bottom line of the red rectangle, as data are always available there thanks to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. A data point is captured only if a greenish vertical bar appears and a new line of data appears in the R console.
    3. To extract incubation bouts, the first click in a new plot has to mark the start of incubation, the next click the end of incubation, and a click on the same spot the start of incubation for the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout of a given plot is always assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct, and adjusting it manually if not.
    4. Once all information from one day (panel) is extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.
    5. To end the extraction before going through all the rectangles, press "escape".

    Annotations of the data files from turnstone_2009_Barrow_nest-t401_transmitter.RData:

    dfr -- raw data on signal strength from the radio tags attached to the rumps of the female and male, and information about when the birds were captured and the incubation stage of the nest:
    1. who: whether the recording refers to female, male, capture or start of hatching
    2. datetime_: date and time of each recording
    3. logger: unique identity of the radio tag
    4. signal_: signal strength of the radio tag
    5. sex: sex of the bird (f = female, m = male)
    6. nest: unique identity of the nest
    7. day: datetime_ variable truncated to year-month-day format
    8. time: time of day in hours
    9. datetime_utc: date and time of each recording, in UTC
    10. cols: colors assigned to "who"

    m -- metadata for a given nest:
    1. sp: species (RUTU = Ruddy Turnstone)
    2. nest: unique identity of the nest
    3. year_: year of observation
    4. IDfemale: unique identity of the female
    5. IDmale: unique identity of the male
    6. lat: latitude coordinate of the nest
    7. lon: longitude coordinate of the nest
    8. hatch_start: date and time when the hatching of the eggs started
    9. scinam: scientific name of the species
    10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
    11. logger: type of device used to record incubation (IT = radio tag)
    12. sampling: mean incubation sampling interval in seconds

    s -- metadata for the incubating parents:
    1. year_: year of capture
    2. species: species (RUTU = Ruddy Turnstone)
    3. author: the author who measured the bird
    4. nest: unique identity of the nest
    5. caught_date_time: date and time when the bird was captured
    6. recapture: was the bird captured before? (0 = no, 1 = yes)
    7. sex: sex of the bird (f = female, m = male)
    8. bird_ID: unique identity of the bird
    9. logger: unique identity of the radio tag

  5. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Samoilova, Evgenia (Zhenya)
    Loist, Skadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review at NECSUS (European Journal of Media Studies), an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
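
    As an illustrative sketch of such two-method fuzzy matching (the stringdist package is an assumption, not necessarily what the scripts use; the titles are invented):

    # Compare one core-dataset title against candidate IMDb titles using the
    # two named methods; 'stringdist' implements both "cosine" and "osa".
    library(stringdist)

    core_title <- "The Hours and Times"
    imdb_titles <- c("The Hours and Times", "The Hours & Times", "Hours and Times, The")

    stringsim(core_title, imdb_titles, method = "cosine", q = 3) # similarity, 1 = identical
    stringdist(core_title, imdb_titles, method = "osa")          # edit distance incl. transpositions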

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours; a test with a subsample of 100 films is therefore advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.

  6. Dataset from: High consistency and repeatability in the breeding migrations...

    • zenodo.org
    Updated Jun 4, 2024
    Cite
    Anonymous; Anonymous (2024). Dataset from: High consistency and repeatability in the breeding migrations of a benthic shark [Dataset]. http://doi.org/10.5281/zenodo.11467089
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 4, 2024
    Description

    Dataset and scripts used for manuscript: High consistency and repeatability in the breeding migrations of a benthic shark.

    Project title: High consistency and repeatability in the breeding migrations of a benthic shark
    Date: 23/04/2024

    Folders:
    - 1_Raw_data
    - Perpendicular_Point_068151, Sanctuary_Point_068088, SST raw data, sst_nc_files, IMOS_animal_measurements, IMOS_detections, PS&Syd&JB tags, rainfall_raw, sample_size, Point_Perpendicular_2013_2019, Sanctuary_Point_2013_2019, EAC_transport
    - 2_Processed_data
    - SST (anomaly, historic_sst, mean_sst_31_years, week_1992_sst:week_2022_sst including week_2019_complete_sst)
    - Rain (weekly_rain, weekly_rainfall_completed)
    - Clean (clean, cleaned_data, cleaned_gam, cleaned_pj_data)
    - 3_Script_processing_data
    - Plots (dual_axis_plot (Fig. 1 & Fig. 4).R, period_plot (Fig. 2).R, sd_plot (Fig. 5).R, sex_plot (Fig. 3).R)
    - cleaned_data.R, cleaned_data_gam.R, weekly_rainfall_completed.R, descriptive_stats.R, sst.R, sst_2019b.R, sst_anomaly.R
    - 4_Script_analyses
    - gam.R, gam_eac.R, glm.R, lme.R, Repeatability.R
    - 5_Output_doc
    - Plots (arrival_dual_plot_with_anomaly (Fig. 1).png, period_plot (Fig. 2).png, sex_arrival_departure (Fig. 3).png, departure_dual_plot_with_anomaly (Fig. 4).png, standard deviation plot (Fig. 5).png)
    - Tables (gam_arrival_eac_selection_table.csv (Table S2), gam_departure_eac_selection_table (Table S5), gam_arrival_selection_table (Table S3), gam_departure_selection_table (Table S6), glm_arrival_selection_table, glm_departure_selection_table, lme_arrival_anova_table, lme_arrival_selection_table (Table S4), lme_departure_anova_table, lme_departure_selection_table (Table S8))


    Descriptions of scripts and files used:
    - cleaned_data.R: script to extract detections of sharks at Jervis Bay. Calculate arrival and departure dates over the seven breeding seasons. Add sex and length for each individual. Extract moon phase (numerical value) and period of the day from arrival and departure times.
    - IMOS_detections.csv: raw data file with detections of Port Jackson sharks over different sites in Australia.
    - IMOS_animal_measurements.csv: raw data file with morphological data of Port Jackson sharks
    - PS&Syd&JB tags: file with measurements and sex identification of sharks (different from IMOS, it was used to complete missing sex and length).
    - cleaned_data.csv: file with arrival and departure dates of the final sample size of sharks (N=49) with missing sex and length for some individuals.
    - clean.csv: completed file using the PS&Syd&JB tags file. Note: tag ID 117393679 was wrongly identified as a male in IMOS and correctly identified as a female in the PS&Syd&JB tags file, as indicated by its large size.
    - cleaned_pj_data: Final data file with arrival and departure dates, sex, length, moon phase (numerical) and period of the day.

    - weekly_rainfall_completed.R: script to calculate average weekly rainfall and correlation between the two weather stations used (Point perpendicular and Sanctuary point).
    - weekly_rain.csv: file with the corresponding week number (1-28) for each date (01-06-2013 to 13-12-2019)
    - weekly_rainfall_completed.csv: file with week number (1-28), year (2013-2019) and weekly rainfall average completed with Sanctuary Point for week 2 of 2017
    - Point_Perpendicular_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Point Perpendicular weather station
    - Sanctuary_Point_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Sanctuary Point weather station
    - IDCJAC0009_068088_2017_Data.csv: Rainfall (mm) from 01-01-2017 to 31-12-2017 at the Sanctuary Point weather station (to fill in missing value for average rainfall of week 2 of 2017)

    - cleaned_data_gam.R: script to calculate weekly counts of sharks to run gam models and add weekly averages of rainfall and sst anomaly
    - cleaned_pj_data.csv
    - anomaly.csv: weekly (1-28) average sst anomalies for Jervis Bay (2013-2019)
    - weekly_rainfall_completed.csv: weekly (1-28) average rainfall for Jervis Bay (2013-2019)
    - sample_size.csv: file with the number of sharks tagged (13-49) for each year (2013-2019)

    - sst.R: script to extract daily and weekly sst from IMOS nc files from 01-05 until 31-12 for the following years: 1992:2022 for Jervis Bay
    - sst_raw_data: folder with all the raw weekly (1:28) csv files for each year (1992:2022) to fill in with sst data using the sst script
    - sst_nc_files: folder with all the nc files downloaded from IMOS from the last 31 years (1992-2022) at the sensor (IMOS - SRS - SST - L3S-Single Sensor - 1 day - night time – Australia).
    - SST: folder with the average weekly (1-28) sst data extracted from the nc files using the sst script for each of the 31 years (to calculate temperature anomaly).

    - sst_2019b.R: script to extract daily and weekly sst from IMOS nc file for 2019 (missing value for week 19) for Jervis Bay
    - week_2019_sst: weekly average sst 2019 with a missing value for week 19
    - week_2019b_sst: sst data from 2019 with another sensor (IMOS – SRS – MODIS - 01 day - Ocean Colour-SST) to fill in the gap of week 19
    - week_2019_complete_sst: completed average weekly sst data from the year 2019 for weeks 1-28.

    - sst_anomaly.R: script to calculate mean weekly sst anomaly for the study period (2013-2019) using mean historic weekly sst (1992-2022)
    - historic_sst.csv: mean weekly (1-28) and yearly (1992-2022) sst for Jervis Bay
    - mean_sst_31_years.csv: mean weekly (1-28) sst across all years (1992-2022) for Jervis Bay
    - anomaly.csv: mean weekly and yearly sst anomalies for the study period (2013-2019)

    - Descriptive_stats.R: script to calculate minimum and maximum length of sharks, mean Julian arrival and departure dates per individual per year, mean Julian arrival and departure dates per year for all sharks (Table. S10), summary of standard deviation of julian arrival dates (Table. S9)
    - cleaned_pj_data.csv

    - gam.R: script used to run the Generalized additive model for rainfall and sea surface temperature
    - cleaned_gam.csv

    - glm.R: script used to run the Generalized linear mixed models for the period of the day and moon phase
    - cleaned_pj_data.csv
    - sample_size.csv

    - lme.R: script used to run the Linear mixed model for sex and size
    - cleaned_pj_data.csv

    - Repeatability.R: script used to run the Repeatability for Julian arrival and Julian departure dates
    - cleaned_pj_data.csv
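
    As a hedged sketch of the repeatability step (the rptR package and the column names julian_arrival and bird_id are assumptions, not necessarily what Repeatability.R uses):

    # Repeatability of Julian arrival dates across years for individual sharks
    library(rptR)

    pj <- read.csv("cleaned_pj_data.csv")
    rpt(julian_arrival ~ (1 | bird_id), grname = "bird_id",
        data = pj, datatype = "Gaussian", nboot = 1000, npermut = 0)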

  7. Replication Data for Reconceptualising dimensions of political competition...

    • dataverse.harvard.edu
    Updated Feb 26, 2019
    Cite
    Jonathan Wheatley; Fernando Mendez (2019). Replication Data for Reconceptualising dimensions of political competition in Europe: A demand side approach [Dataset]. http://doi.org/10.7910/DVN/1B1MXY
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 26, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Jonathan Wheatley; Fernando Mendez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Europe
    Description

    Included are:

    1. The raw data (before cleaning and preprocessing), in the files ending "Raw3". The codebooks for each of these data files end in "codebook". These enable the user to identify the statements that are associated with the items EU1 … 7, Eco1 … 7, Cul1 … 7, AD1 and AD2 that are used in the manuscript.

    2. The R codes ending cleaning_plus.R, which are used to a) clean the datasets according to the procedure outlined in the online Appendix and b) remove entries with missing values for any of the variables that are used in the calibration process to produce balanced datasets (age, education, gender, political interest). Because of step b), the new datasets generated will be smaller than the clean datasets listed in Table 1 of the Appendix.

    3. For the balancing and calibrating (pre-processing), we use a) the datasets for each country generated in step 2 above (files with the suffix "_clean"); b) the file drop.py, the Python code for the balancing algorithm based on the principle of raking (see the online Appendix); c) the R files used to generate the new calibrated datasets for the Mokken Scale analysis in step 5 below (suffix "balCode"); and d) a set of files ending in "estimates" that contain the joint distributions derived from the ESS data (i) for age, below versus above the median age, and (ii) for education, degree versus no degree, as well as the marginal distributions for gender and political interest. The median ages of the voting population derived from the ESS are: Austria 50, Bulgaria 52, Croatia 52, Cyprus 47, Czech Republic 50, Denmark 50, England 53, Estonia 50, Finland 54, France 55, Germany 53, Greece 50, Hungary 49, Ireland 50, Italy 50, Lithuania 53, Poland 50, Portugal 52, Romania 46, Slovakia 52, Slovenia 52, Spain 50.

    4. A set of data files with the suffix myBal, which contain the new calibrated datasets used in the Mokken Scale analysis in step 5 below.

    5. A set of R codes for each country, beginning with the prefix "RCodes", which are used to generate the findings on dimensionality presented in the manuscript.
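
    As an illustrative sketch of the Mokken Scale analysis step (the mokken package is an assumption; the file name is invented and the item names follow the EU1 … 7 convention above):

    # Scalability of the EU items on one calibrated country dataset
    library(mokken)

    dat <- read.csv("austria_myBal.csv")  # hypothetical calibrated dataset
    items <- dat[, paste0("EU", 1:7)]     # assumes integer item scores

    coefH(items)                  # Loevinger's scalability coefficients
    aisp(items, lowerbound = 0.3) # automated item selection procedure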

  8. Average performance of imputation approaches across performance measures for...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 16, 2023
    Cite
    Yu-Hua Yeh; Allison N. Tegge; Roberta Freitas-Lemos; Joel Myerson; Leonard Green; Warren K. Bickel (2023). Average performance of imputation approaches across performance measures for the 27-item MCQ. [Dataset]. http://doi.org/10.1371/journal.pone.0292258.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yu-Hua Yeh; Allison N. Tegge; Roberta Freitas-Lemos; Joel Myerson; Leonard Green; Warren K. Bickel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Average performance of imputation approaches across performance measures for the 27-item MCQ.

  9. Replication Data & Code - Large-scale land acquisitions exacerbate local...

    • data.niaid.nih.gov
    Updated Nov 17, 2023
    Cite
    Jonathan A. Sullivan (2023). Replication Data & Code - Large-scale land acquisitions exacerbate local land inequalities in Tanzania [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6512229
    Explore at:
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Daniel G. Brown
    Arun Agrawal
    Cyrus Samii
    Francis Moyo
    Jonathan A. Sullivan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tanzania
    Description

    Reference

    Sullivan J.A., Samii, C., Brown, D., Moyo, F., Agrawal, A. 2023. Large-scale land acquisitions exacerbate local farmland inequalities in Tanzania. Proceedings of the National Academy of Sciences 120, e2207398120. https://doi.org/10.1073/pnas.2207398120

    Abstract

    Land inequality stalls economic development, entrenches poverty, and is associated with environmental degradation. Yet, rigorous assessments of land-use interventions attend to inequality only rarely. A land inequality lens is especially important to understand how recent large-scale land acquisitions (LSLAs) affect smallholder and indigenous communities across as much as 100 million hectares around the world. This paper studies inequalities in land assets, specifically landholdings and farm size, to derive insights into the distributional outcomes of LSLAs. Using a household survey covering four pairs of land acquisition and control sites in Tanzania, we use a quasi-experimental design to characterize changes in land inequality and subsequent impacts on well-being. We find convincing evidence that LSLAs in Tanzania lead to both reduced landholdings and greater farmland inequality among smallholders. Households in proximity to LSLAs are associated with 21.1% (P = 0.02) smaller landholdings, while evidence, although insignificant, is suggestive that farm sizes are also declining. Aggregate estimates, however, hide that households in the bottom quartiles of farm size suffer the brunt of landlessness and land loss induced by LSLAs, which combine to generate greater farmland inequality. Additional analyses find that land inequality is not offset by improvements in other livelihood dimensions; rather, farm size decreases among households near LSLAs are associated with no income improvements, lower wealth, increased poverty, and higher food insecurity. The results demonstrate that without explicit consideration of distributional outcomes, land-use policies can systematically reinforce existing inequalities.

    Replication Data

    We include anonymized household survey data from our analysis to support open and reproducible science. In particular, we provide i) an anonymized household dataset collected in 2018 (n=994) for households nearby (treatment) and far away from (control) LSLAs, and ii) a household dataset collected in 2019 (n=165) within the same sites. For the 2018 surveys, several anonymized extracts are provided, including an imputed (n=10) dataset to fill in missing data that was used for the main analysis. These data can be found in the hh_data folder and include:

    - hh_imputed10_2018: anonymized household dataset for 2018 with the variables used for the main analysis, where missing data were imputed 10 times
    - hh_compensation_2018: anonymized household extract for 2018 representing household benefits and compensation directly received from LSLAs
    - hh_migration_2018: anonymized household extract for 2018 representing household migration behavior following LSLAs
    - hh_rsdata_2018: extracted remote sensing data at the household geo-location for 2018
    - hh_land_2019: anonymized household extract for 2019 of land variables

    Our analysis also incorporates data from the Living Standards Measurement Survey (LSMS) collected by the World Bank (found in the lsms_data folder). We provide the sub-modules from the LSMS dataset relevant to our analysis, but the full datasets can be accessed through the World Bank's Microdata Library (https://microdata.worldbank.org/index.php/home). Across several analyses we use the LSLA boundaries for our four selected sites; we provide a shapefile of these boundaries in the gis_data folder. Finally, our data replication includes several model outputs (found in mod_outputs), particularly those that are lengthy to run in R. These outputs can optionally be loaded into R rather than re-running the analysis with our main_analysis.Rmd script.

    Replication Code

    We provide replication code in the form of R Markdown (.Rmd) and R (.R) files. Alongside the replication data, it can be used to reproduce the main figures, tables, supplementary materials, and results reported in our article. Scripts include:

    - main_analysis.Rmd: main analysis supporting the findings, graphs, and tables reported in our main manuscript
    - compensation.R: analysis of benefits and compensation received directly by households from LSLAs
    - landvalue.R: analysis of household land values as a function of distance from LSLAs
    - migration.R: analysis of migration behavior following LSLAs
    - selection_bias.R: analysis of LSLA selection bias between control and treatment enumeration areas
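
    As a hedged sketch of how the 10-fold imputed extract is typically analysed (the mice package, the stacked long format with .imp/.id columns, and the variable names are all assumptions, not the replication package's actual structure):

    # Fit the same model on each of the 10 completed datasets and pool the
    # estimates (Rubin's rules)
    library(mice)

    hh <- read.csv("hh_imputed10_2018.csv") # assumed: stacked long format
    imp <- as.mids(hh)                      # assumes .imp and .id columns

    fits <- with(imp, lm(log(farm_size) ~ near_lsla)) # illustrative variables
    summary(pool(fits))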

  10. Air Quality Monitoring - 2014

    • researchdata.edu.au
    • data.qld.gov.au
    • +1more
    Updated Jun 4, 2015
    + more versions
    Cite
    Environment, Tourism, Science and Innovation (2015). Air Quality Monitoring - 2014 [Dataset]. https://researchdata.edu.au/air-quality-monitoring-2014/658145
    Explore at:
    Dataset updated
    Jun 4, 2015
    Dataset provided by
    data.qld.gov.au
    Authors
    Environment, Tourism, Science and Innovation
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annual hourly air quality and meteorological data by monitoring site for the 2014 calendar year. For more information on air quality, including live air data, please visit environment.des.qld.gov.au/air.

    Data resolution: One-hour average values (one-hour sum for rainfall)
    Data row timestamp: Start of averaging period
    Missing data/not monitored: Blank cell
    Calm conditions: No hourly average wind direction is reported when the hourly average wind speed is zero
    Barometric pressure: Values are at monitoring station elevation, not corrected to mean sea level
    Daily zero/span response check: Automated instrument zero/span response checks are conducted daily between midnight and 1am at Queensland Government sites (timing can differ at industry sites). Where this takes place, an ambient hourly value cannot be reported.
    Sampling height: Four metres above ground (unless otherwise indicated)

    PLEASE NOTE:

    * The Townsville Coast Guard 2014 air quality monitoring site data was updated on 26/10/2015 because the wind direction sensor was misaligned; the reported wind direction values have now been corrected.
    * The Auckland Point 2014 air quality monitoring site data was updated on 24/04/2018 to remove invalid wind data caused by a sensor fault.
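
    As a minimal sketch of loading one site's hourly file in R (the file name and csv layout are assumptions; blank cells are read as NA, matching the convention above):

    # Read hourly data, treating blank cells (missing/not monitored) as NA
    air <- read.csv("townsville-coast-guard-2014.csv",
                    na.strings = "", check.names = FALSE)

    # Share of missing hourly values per column
    colMeans(is.na(air))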

  11. Percentage (%) and number (n) of missing values in the outcome (maximum grip...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.

  12. I-AMICA coastal hydrological surveys in the eastern Tyrrhenian Sea - Dataset...

    • b2find.eudat.eu
    Updated Nov 16, 2019
    Cite
    (2019). I-AMICA coastal hydrological surveys in the eastern Tyrrhenian Sea - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/0f0f8c20-9403-5273-bc29-62b3e06a8880
    Explore at:
    Dataset updated
    Nov 16, 2019
    Area covered
    Tyrrhenian Sea
    Description

    The I-AMICA project was planned to increase the observational capacity of the monitoring of marine coastal ecosystems, which are particularly vulnerable in the sensitive Mediterranean area and strictly connected to the natural and anthropic continental system. For this reason, research activities were mainly focused on the neritic environment adjacent to the continental shelf in front of the Volturno river mouth, within a bathymetric range of 5-50 m. Advanced knowledge on the dynamics in time of marine coastal ecosystems, in relation to the physical, chemical and biological processes that characterize their habitat, was acquired, while new methods of integrated monitoring, tailored to the specific characteristics of the study area, were tested. Particular attention was given to the identification of bio-indicators in the water column and in the sediment at the sea floor.

    During each survey, 20-25 hydrological casts along five transects perpendicular to the coast were collected at a depth range of 9-50 m. A quasi-regular grid, of about 2 km in longitude and 3 km in latitude, represents the classic strategy for a synoptic ocean sampling. Data on pressure, conductivity, temperature, dissolved oxygen, pH, beam transmission and attenuation, and Chlorophyll-a fluorescence (Chl-a) were acquired by sensors installed on an SBE11 plus (firmware version 5.0) multiparametric probe by Sea-Bird Inc. The beam transmission and attenuation was used to estimate "turbidity", i.e. to measure the attenuation of the infrared beam from the emitter to the receiver; the instrument reports both % and 1/m, but only the % was used. The sensors were calibrated at Sea-Bird Inc. in 2013 (pressure, conductivity, temperature, oxygen) and in 2011 (pH, transmissometer, fluorometer). The vertical profiles of all parameters were obtained by sampling the signals at 24 Hz, with the CTD/rosette going down at a speed of 1 m/s. The probe was used on board the R/V Astrea of ISPRA, a vessel with a length overall of 24 m, a breadth extreme of 6 m and a draught of 3 m, which can operate any type of instrumentation and perform oceanographic research (biological, chemical and physical) in coastal and high-seas areas. Each survey was performed in 1-2 days, in good weather and sea conditions.

    A quality check was applied to the acquired CTD data in order to remove possible spikes along the profiles. The raw data collected were converted and processed using the SBE Data Processing software (version 7.26), while the Ocean Data View software [Schlitzer R. (2019). Ocean Data View. https://odv.awi.de/] was used for the representation of the sections of the sampled transects. The data set is provided per cruise as ODV Spreadsheet files in TXT format, where missing data values are set to -1.e10.

    Meta variables: cruise name; station; type of acquisition (here C); date (mon/day/yr) and time (hh:mm); longitude [degrees_east]; latitude [degrees_north]; bottom depth [m].

    Data variables: pressure, Digiquartz with TC [db]; temperature [deg C]; conductivity [mS/cm]; oxygen, SBE43 [ml/l]; fluorescence, Turner Cyclops fluorometer (no data in I-AMICA7); beam transmission [%] and attenuation [1/m], WET Labs C-Star transmissometer.
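
    As a minimal sketch of loading one cruise's ODV spreadsheet in R (the file name is invented; ODV spreadsheet exports are tab-separated with comment lines starting with //):

    # Read the TXT export and recode the -1.e10 missing-value marker to NA
    odv <- read.delim("iamica_cruise1.txt", comment.char = "/", check.names = FALSE)
    odv[odv == -1e10] <- NA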

  13. Data from: CATCH-EyoU Processes in Youth's Construction of Active EU...

    • data.europa.eu
    unknown
    Updated Jan 24, 2020
    Cite
    Zenodo (2020). CATCH-EyoU Processes in Youth's Construction of Active EU Citizenship Cross-national Wave 1 Questionnaires Italy, Sweden, Germany, Greece, Portugal, Czech Republic, UK, and Estonia - EXTRACT [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2557710?locale=bg
    Explore at:
    Available download formats: unknown (234704)
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sweden, Italy, Czechia, Germany, Portugal, European Union, Estonia, Greece, United Kingdom
    Description

    The dataset was generated within the research project Constructing AcTive CitizensHip with European Youth: Policies, Practices, Challenges and Solutions (CATCH-EyoU), funded by the European Union, Horizon 2020 Programme, Grant Agreement No 649538 (http://www.catcheyou.eu/).

    The data set consists of:
    - 1 data file saved in .sav format: “CATCH-EyoU Processes in Youth’s Construction of Active EU Citizenship Cross-national Wave 1 Questionnaires Italy, Sweden, Germany, Greece, Portugal, Czech Republic, UK, and Estonia - EXTRACT.sav”
    - 1 README file

    The file was generated through IBM SPSS software. Discrete missing values: 88, 99. The .sav file (SPSS) can be processed using R (library “foreign”): https://cran.r-project.org

    This dataset relates to the following paper: Ekaterina Enchikova, Tiago Neves, Sam Mejias, Veronika Kalmus, Elvira Cicognani, Pedro Ferreira (2019). Civic and Political Participation of European Youth: fair measurement in different cultural and social contexts. Frontiers in Education.

    Data Set Contact Person: Ekaterina Enchikova [UP-CIIE]; mail: enchicova@gmail.com
    Data Set License: this data set is distributed under a Creative Commons Attribution (CC-BY) license: http://creativecommons.org/licenses
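
    As a minimal sketch of reading the extract with the foreign package named above (the shortened file name is illustrative), recoding the discrete missing-value codes 88 and 99 to NA:

    # Read the SPSS file and recode the documented missing-value codes
    library(foreign)

    dat <- read.spss("CATCH-EyoU_Wave1_EXTRACT.sav", # hypothetical short name
                     to.data.frame = TRUE, use.value.labels = FALSE)

    num <- sapply(dat, is.numeric)
    dat[num] <- lapply(dat[num], function(x) replace(x, x %in% c(88, 99), NA))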

  14. fpt_fosd

    • huggingface.co
    Updated Apr 24, 2022
    Cite
    Phan Tuấn Anh (2022). fpt_fosd [Dataset]. https://huggingface.co/datasets/doof-ferb/fpt_fosd
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 24, 2022
    Authors
    Phan Tuấn Anh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Unofficial mirror of the FPT Open Speech Dataset (FOSD), released publicly in 2018 by FPT Corporation; 100 h of audio, 25.9k samples. Official link (dead): https://fpt.ai/fpt-open-speech-data/. Mirror: https://data.mendeley.com/datasets/k9sxg2twv4/4 (DOI: 10.17632/k9sxg2twv4.4).

    Pre-processing:

    - remove non-sense strings: -N \r
    - remove 4 files because of missing transcriptions: Set001_V0.1_008210.mp3, Set001_V0.1_010753.mp3, Set001_V0.1_011477.mp3, Set001_V0.1_011841.mp3

    Still to do: check misspelling usage… See the full description on the dataset page: https://huggingface.co/datasets/doof-ferb/fpt_fosd.

  15. Potential consequences of data screening.

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Potential consequences of data screening. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t007
    Explore at:
    xls
    Available download formats
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings is an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.
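As an illustration of the missing-data element of this screening step, a hedged R sketch that tabulates the percentage and number of missing outcome values by measurement occasion and sex, matching the layout of the companion table datasets. The names `dat`, `wave`, `sex`, and `grip` are hypothetical and not taken from the authors' repository:

```r
# Percentage and count of missing grip strength values by wave and sex,
# assuming a long-format data frame `dat` with one row per participant-wave.
miss_tab <- aggregate(
  is.na(dat$grip),
  by = list(wave = dat$wave, sex = dat$sex),
  FUN = function(m) c(n = sum(m), pct = round(100 * mean(m), 1))
)
miss_tab
```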

  16. Number of interviews per participant.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 29, 2024
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Number of interviews per participant. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings is an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.

  17. Initial data analysis checklist for data screening in longitudinal studies.

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Initial data analysis checklist for data screening in longitudinal studies. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t001
    Explore at:
    xls
    Available download formats
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Initial data analysis checklist for data screening in longitudinal studies.

  18. Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t006
    Explore at:
    xls
    Available download formats
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.
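A hedged R sketch that assembles a summary matrix with this layout (correlations above the diagonal, standard deviations on the diagonal, covariances below), assuming a hypothetical wide-format data frame `grip_wide` with one column of grip strength values per wave:

```r
# Combined correlation / SD / covariance summary across waves.
summ <- cor(grip_wide, use = "pairwise.complete.obs")   # correlations everywhere
covm <- cov(grip_wide, use = "pairwise.complete.obs")
summ[lower.tri(summ)] <- covm[lower.tri(covm)]          # covariances below the diagonal
diag(summ) <- apply(grip_wide, 2, sd, na.rm = TRUE)     # SDs on the diagonal
round(summ, 2)
```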

  19. Data from: A systematic evaluation of normalization methods and probe replicability using Infinium EPIC methylation data

    • data.niaid.nih.gov
    • dataone.org
    • +3more
    zip
    Updated May 30, 2023
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    zip
    Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    Hospital for Sick Children
    University of Toronto
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background
    The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses two probe designs: Infinium Type I and Type II. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.
    Methods
    This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
    Results
    The method we define as SeSAMe 2, the regular SeSAMe pipeline with an additional round of QC based on pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods performed worst. Whole-array Pearson’s correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that poor probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of elderly residents of the city of São Paulo, Brazil, drawn from the census and followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012; the second was set in 2020, in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethic protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point, owing to discontinuation of the equipment) with the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000 ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of the samples and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged in the two analyses were combined and removed from the data.
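A sketch of the pOOBAH masking step described above, for a single sample. Function names follow sesame around version 1.14 and may differ in other releases; the IDAT prefix is hypothetical:

```r
# Detection masking with pOOBAH (out-of-band probe empirical distribution),
# p-value threshold 0.05 and combine.neg = TRUE, as in the text.
library(sesame)

sdf <- readIDATpair("idats/sample_01")                 # one sample's IDAT pair
sdf <- pOOBAH(sdf, pval.threshold = 0.05, combine.neg = TRUE)
betas <- getBetas(sdf)                                 # masked probes return NA
```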

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out in the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using as input minfi’s Noob normalized data. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out in the unfiltered dataset, and masked probes were removed. This removal was followed by further removal of probes that did not pass previous QC and that had not been removed by pOOBAH. Therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out in the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|β|) between replicated samples.
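To make the replicate-based comparison concrete, a hedged sketch that scores a normalized data set by mean absolute beta difference and per-probe ICC. `rep1` and `rep2` are hypothetical probe-by-sample matrices of paired technical replicates, and the exact ICC formulation used by the authors is not specified here:

```r
library(irr)

# Per-probe mean absolute beta-value difference between replicate pairs
mean_abs_diff <- rowMeans(abs(rep1 - rep2), na.rm = TRUE)

# Per-probe ICC across the replicate pairs (the ICC form is an assumption)
probe_icc <- vapply(seq_len(nrow(rep1)), function(i) {
  icc(cbind(rep1[i, ], rep2[i, ]),
      model = "oneway", type = "agreement", unit = "single")$value
}, numeric(1))

mean(probe_icc > 0.50, na.rm = TRUE)  # proportion of probes with ICC > 0.50
```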

  20. Supplement 1. A compressed package of scripts, functions, sample data, and instructions required to implement the state–space model described in the text.

    • wiley.figshare.com
    html
    Updated Jun 1, 2023
    Greg A. Breed; Ian D. Jonsen; Ransom A. Myers; W. Don Bowen; Marty L. Leonard (2023). Supplement 1. A compressed package of scripts, functions, sample data, and instructions required to implement the state–space model described in the text. [Dataset]. http://doi.org/10.6084/m9.figshare.3532229.v1
    Explore at:
    html
    Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Greg A. Breed; Ian D. Jonsen; Ransom A. Myers; W. Don Bowen; Marty L. Leonard
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List
    SSM_PACKAGE_Feb_2009.zip -- A package containing all files, including instructions, needed to implement the models described in this paper. The individual files contained in SSM_PACKAGE_Feb_2009.zip are:
    SSM_instructions.pdf -- A set of instructions for implementing the SSM for Argos tracking data using the files provided here. Detailed descriptions of each file and how to use them may be found in this document.
    .RData -- An R workspace with all scripts and functions needed already loaded.
    1diff2stateM.bug -- The WinBUGS CRW state–space model file.
    add.missing.dates.R -- Small subroutine for handling days with no Argos locations.
    calcj.R -- Small subroutine for indexing irregular data to regular timesteps.
    dat2bugslite.R -- Major subroutine for data preparation.
    find.missing.dates.R -- Small subroutine for handling days with no Argos locations (needs add.missing.dates.R).
    prepdat.R -- Function called to select, extract, and prepare data for WinBUGS from the sample data set testdata.csv.
    runSSM.R -- Simple script that allows for easy adjustment of important MCMC parameters and executes the call to WinBUGS via wbs.R.
    saveresults.R -- Function that saves the means and medians of lat, long, and behavioral state as a small text file for easy import into a mapping program of the user's choice for inspection.
    seald.R -- Small subroutine that extracts raw data from the testdata.csv datafile.
    step.time.R -- Small subroutine needed to index the irregular data to regular timesteps.
    testdata.csv -- A sample data set including three complete grey seal tracks from the North Atlantic.
    wbs.R -- The main function that calls WinBUGS from R; includes all the information to create MCMC initials.
    Description
    SSM_PACKAGE_Feb_2009.zip contains all scripts, functions, and sample data needed to fit the state–space correlated random walk models presented in this paper. Following the instructions (SSM_instructions.pdf) should allow readers to reproduce the results and/or fit their own Argos tracking data.
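The call that wbs.R issues to WinBUGS can be sketched with the R2WinBUGS package. The data list, initials, and monitored parameter names below are placeholders (the real ones are built by prepdat.R and wbs.R); only the model file name is taken from the file list above:

```r
# Fit the CRW state-space model via WinBUGS from R.
library(R2WinBUGS)

fit <- bugs(data = bugs_data,                  # placeholder list built from testdata.csv
            inits = bugs_inits,                # placeholder function generating MCMC initials
            parameters.to.save = c("x", "b"),  # e.g. locations and behavioural states
            model.file = "1diff2stateM.bug",   # the WinBUGS CRW state-space model file
            n.chains = 2, n.iter = 10000, n.burnin = 5000, n.thin = 5)
```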
