16 datasets found
  1. Replication Data & Code - Large-scale land acquisitions exacerbate local...

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    zip
    Updated Nov 17, 2023
    Cite
    Jonathan A. Sullivan; Cyrus Samii; Daniel G. Brown; Francis Moyo; Arun Agrawal (2023). Replication Data & Code - Large-scale land acquisitions exacerbate local land inequalities in Tanzania [Dataset]. http://doi.org/10.5281/zenodo.10152116
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonathan A. Sullivan; Cyrus Samii; Daniel G. Brown; Francis Moyo; Arun Agrawal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tanzania
    Description

    Reference

    Sullivan J.A., Samii, C., Brown, D., Moyo, F., Agrawal, A. 2023. Large-scale land acquisitions exacerbate local farmland inequalities in Tanzania. Proceedings of the National Academy of Sciences 120, e2207398120. https://doi.org/10.1073/pnas.2207398120

    Abstract

    Land inequality stalls economic development, entrenches poverty, and is associated with environmental degradation. Yet, rigorous assessments of land-use interventions attend to inequality only rarely. A land inequality lens is especially important to understand how recent large-scale land acquisitions (LSLAs) affect smallholder and indigenous communities across as much as 100 million hectares around the world. This paper studies inequalities in land assets, specifically landholdings and farm size, to derive insights into the distributional outcomes of LSLAs. Using a household survey covering four pairs of land acquisition and control sites in Tanzania, we use a quasi-experimental design to characterize changes in land inequality and subsequent impacts on well-being. We find convincing evidence that LSLAs in Tanzania lead to both reduced landholdings and greater farmland inequality among smallholders. Households in proximity to LSLAs are associated with 21.1% (P = 0.02) smaller landholdings while evidence, although insignificant, is suggestive that farm sizes are also declining. Aggregate estimates, however, hide that households in the bottom quartiles of farm size suffer the brunt of landlessness and land loss induced by LSLAs that combine to generate greater farmland inequality. Additional analyses find that land inequality is not offset by improvements in other livelihood dimensions, rather farm size decreases among households near LSLAs are associated with no income improvements, lower wealth, increased poverty, and higher food insecurity. The results demonstrate that without explicit consideration of distributional outcomes, land-use policies can systematically reinforce existing inequalities.

    Replication Data

    We include anonymized household survey data from our analysis to support open and reproducible science. In particular, we provide i) an anonymized household dataset collected in 2018 (n=994) for households nearby (treatment) and far away from (control) LSLAs and ii) a household dataset collected in 2019 (n=165) within the same sites. For the 2018 surveys, several anonymized extracts are provided, including a multiply imputed (10 imputations) dataset, used for the main analysis, in which missing data were filled in. These data can be found in the hh_data folder and include:

    • hh_imputed10_2018: anonymized household dataset for 2018 with the variables used for the main analysis, where missing data were imputed 10 times
    • hh_compensation_2018: anonymized household extract for 2018 representing household benefits and compensation directly received from LSLAs
    • hh_migration_2018: anonymized household extract for 2018 representing household migration behavior following LSLAs
    • hh_rsdata_2018: extracted remote sensing data at the household geo-location for 2018
    • hh_land_2019: anonymized household extract for 2019 of land variables
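    For readers unfamiliar with multiply imputed data: a quantity is estimated on each of the 10 imputed datasets separately and the results are then pooled, commonly with Rubin's rules. The sketch below is a generic Python illustration of that pooling step, not the authors' replication code; the estimate and variance values are invented:

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool per-imputation point estimates and variances (Rubin's rules)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    w_bar = statistics.fmean(variances)        # mean within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance
    t = w_bar + (1 + 1 / m) * b                # total variance of q_bar
    return q_bar, t

# Hypothetical coefficient estimated on each of 10 imputed datasets
# (values invented for illustration):
est = [1.02, 0.98, 1.05, 0.97, 1.01, 1.03, 0.99, 1.00, 1.04, 0.96]
var = [0.004] * 10
q_hat, t_hat = pool_rubin(est, var)
```

    The total variance inflates the within-imputation variance by the between-imputation spread, so uncertainty from the imputation itself is carried into the final standard errors.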

    Our analysis also incorporates data from the Living Standards Measurement Survey (LSMS) collected by the World Bank (found in the lsms_data folder). We provide sub-modules from the LSMS dataset relevant to our analysis, but the full datasets can be accessed through the World Bank's Microdata Library (https://microdata.worldbank.org/index.php/home).

    Across several analyses we use the LSLA boundaries for our four selected sites. We provide a shapefile for the LSLA boundaries in the gis_data folder.

    Finally, our data replication includes several model outputs (found in mod_outputs), particularly those that are lengthy to run in R. These outputs can optionally be loaded into R from our main_analysis.Rmd script rather than re-running the analysis.

    Replication Code

    We provide replication code in the form of R Markdown (.Rmd) or R (.R) files. Alongside the replication data, this code can be used to reproduce the main figures, tables, supplementary materials, and results reported in our article. Scripts include:

    • main_analysis.Rmd: main analysis supporting the findings, graphs, and tables reported in our main manuscript
    • compensation.R: analysis of benefits and compensation received directly by households from LSLAs
    • landvalue.R: analysis of household land values as a function of distance from LSLAs
    • migration.R: analysis of migration behavior following LSLAs
    • selection_bias.R: analysis of LSLA selection bias between control and treatment enumeration areas

  2. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Samoilova, Evgenia (Zhenya)
    Loist, Skadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and whether festival run information is available through the IMDb data.
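    The long-to-wide collapse described above (one row per unique film, attributed to the first sample festival it appeared at) can be sketched generically; this is a Python illustration, not the project's code, and the column names are invented, not the dataset's actual headers:

```python
def long_to_wide(rows):
    """Collapse long-format rows to one row per film, keeping the first."""
    wide = {}
    for row in rows:                              # rows in sampling order
        wide.setdefault(row["film_id"], row)      # keep first appearance only
    return list(wide.values())

long_rows = [
    {"film_id": 1, "title": "Film A", "fest": "Berlinale"},
    {"film_id": 1, "title": "Film A", "fest": "Frameline"},
    {"film_id": 2, "title": "Film B", "fest": "Sundance"},
]
wide_rows = long_to_wide(long_rows)
```

    With these toy rows, film 1 collapses to a single row attributed to Berlinale, mirroring the Berlinale/Frameline example above.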

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, tv, dvd/blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods: “cosine”, where cosine similarity is used to match titles with a high degree of similarity, and “osa” (optimal string alignment), used to match titles that may have typos or minor variations.
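    As an illustration of the first of these two string metrics, cosine similarity over character trigrams can be sketched as follows. This is a generic Python sketch of the metric, not the project's R implementation; the function names are our own:

```python
from collections import Counter
from math import sqrt

def trigrams(title):
    """Character-trigram profile of a title (case-insensitive)."""
    t = title.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine_sim(a, b):
    """Cosine similarity between the trigram profiles of two titles."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

    Identical titles score 1.0, unrelated titles score near 0, and a title with a small typo still scores high, which is why a second, edit-distance-based metric such as OSA is useful for catching transpositions the trigram profile treats as larger changes.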

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” defines functions for scraping the data from the identified matches (based on the scripts described above and the manual check). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, as a check that everything works; scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts. It reports the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location, festival name and festival categories, as well as units of measurement, data sources, coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.

  3. Data from: A simple approach for maximizing the overlap of phylogenetic and...

    • figshare.mq.edu.au
    • borealisdata.ca
    • +7 more
    bin
    Updated May 30, 2023
    + more versions
    Cite
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell (2023). Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data [Dataset]. http://doi.org/10.5061/dryad.5d3rq
    Explore at:
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Macquarie University
    Authors
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient such that taxon swaps can be quickly computed, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online data bases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.

    Usage Notes

    Land plant taxonomic lookup table: This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus.

    plant_lookup.csv

  4. Data from: Reconstructing phylogeny from reduced-representation genome...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    bin
    Updated May 30, 2022
    Cite
    Huan Fan; Anthony R. Ives; Yann Surget-Groba (2022). Data from: Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment [Dataset]. http://doi.org/10.5061/dryad.r0hq0
    Explore at:
    Available download formats: bin
    Dataset updated
    May 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Huan Fan; Anthony R. Ives; Yann Surget-Groba
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Reduced-representation genome sequencing such as RADseq aids the analysis of genomes by reducing the quantity of data, thereby lowering both sequencing costs and computational burdens. RADseq was initially designed for studying genetic variation across genomes at the population level, but has also proved to be suitable for interspecific phylogeny reconstruction. RADseq data pose challenges for standard phylogenomic methods, however, due to incomplete coverage of the genome and large amounts of missing data. Alignment-free methods are both efficient and accurate for phylogenetic reconstructions with whole genomes and are especially practical for non-model organisms; nonetheless, alignment-free methods have not been applied with reduced genome sequencing data. Here, we test a full-genome assembly and alignment-free method, AAF, in application to RADseq data and propose two procedures for reads selection to remove reads from restriction sites that were not found in taxa being compared. We validate these methods using both simulations and real datasets. Reads selection improved the accuracy of phylogenetic construction in every simulated scenario and the two real datasets, making AAF as good or better than a comparable alignment-based method, even though AAF had much lower computational burdens. We also investigated the sources of missing data in RADseq and their effects on phylogeny reconstruction using AAF. The AAF pipeline modified for RADseq or other reduced-representation sequencing data, phyloRAD, is available on github (https://github.com/fanhuan/phyloRAD).
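    The core idea behind alignment-free comparison (represent each sequence by its k-mers and derive a distance from how many k-mers are shared) can be illustrated with a simple Jaccard distance. This is a generic Python sketch of the principle, not the AAF or phyloRAD algorithm itself:

```python
def kmer_set(seq, k=5):
    """All k-length substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b, k=5):
    """Alignment-free distance: 1 minus the Jaccard similarity of k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)

s1 = "ACGTACGTGGAACCTTACGT"
s2 = "ACGTACGTGGAACCTTTCGT"   # one substitution near the end
s3 = "TTTTGGGGCCCCAAAATTTT"   # unrelated sequence
d_close = jaccard_distance(s1, s2)
d_far = jaccard_distance(s1, s3)
```

    A single substitution disturbs only the k k-mers that overlap it, so similar sequences remain close while unrelated ones approach the maximum distance; no alignment step is ever performed.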

  5. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Jun 11, 2023
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set’s most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
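    The mask-and-score idea behind such evaluations (hide values that were actually observed, impute them, and measure the error against the hidden truth) can be sketched with a deliberately simple mean imputer. This is a generic Python illustration of the evaluation scheme, not the authors' pipeline or any of the eight methods they compared:

```python
import random
import statistics

def mean_impute(values):
    """Fill None entries with the mean of the observed values."""
    mu = statistics.fmean(v for v in values if v is not None)
    return [mu if v is None else v for v in values]

def rmse_on_mask(truth, imputed, mask):
    """Root-mean-square error over the artificially hidden positions."""
    errs = [(truth[i] - imputed[i]) ** 2 for i in mask]
    return (sum(errs) / len(errs)) ** 0.5

random.seed(0)
truth = [random.gauss(10.0, 2.0) for _ in range(100)]   # fully observed values
mask = random.sample(range(100), 20)                    # hide 20 of them
observed = [None if i in mask else v for i, v in enumerate(truth)]
score = rmse_on_mask(truth, mean_impute(observed), mask)
```

    Running the same masking protocol with several imputers and comparing the scores is what lets methods be ranked without any external ground truth.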

  6. Average performance of imputation approaches across performance measures for...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 16, 2023
    Cite
    Yu-Hua Yeh; Allison N. Tegge; Roberta Freitas-Lemos; Joel Myerson; Leonard Green; Warren K. Bickel (2023). Average performance of imputation approaches across performance measures for the 27-item MCQ. [Dataset]. http://doi.org/10.1371/journal.pone.0292258.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yu-Hua Yeh; Allison N. Tegge; Roberta Freitas-Lemos; Joel Myerson; Leonard Green; Warren K. Bickel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Average performance of imputation approaches across performance measures for the 27-item MCQ.

  7. A global database of long-term changes in insect assemblages

    • knb.ecoinformatics.org
    • search-dev.test.dataone.org
    • +4 more
    Updated Jan 26, 2022
    Cite
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2022). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F1ZC817H
    Explore at:
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
    Time period covered
    Jan 1, 1925 - Jan 1, 2018
    Area covered
    Pacific Ocean, North Pacific Ocean
    Variables measured
    End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 63 more
    Description

    UPDATED on October 15, 2020: After some mistakes in the data were found, we updated this data set. The changes to the data are detailed on Zenodo (http://doi.org/10.5281/zenodo.4061807), and an Erratum has been submitted.

    This data set, under a CC-BY license, contains time series of total abundance and/or biomass of assemblages of insects, arachnids and Entognatha (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 165 data sources, representing a total of 1668 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, the sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses. This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

    The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to the original research.

    In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found; it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

    In the table 'PlotData', more details on each site within each dataset are provided: the exact location of each plot, whether the plots were experimentally manipulated, and any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables and protection status.

    The table 'SampleData' describes the exact source of the data (table X, figure X, etc.), the extraction methods, and the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Any calculations we did on the original data (e.g. reverse log transformations) are also detailed here, with more details provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each data source may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

    The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as the year of sampling, a descriptor of the period within the year of sampling (used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add. Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by the column 'Plot_ID', and with 'DataSources.csv' by the column 'DataSource_ID', will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

    WARNING: Because of the disparate sampling methods and the various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow temporal comparison, but not necessarily among plots (even within one dataset).
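    The linking scheme described above (join the measurement table to 'PlotData' by 'Plot_ID' and to 'DataSources' by 'DataSource_ID') can be sketched generically. The toy rows below use a few of the listed variables (Realm, Location, Year, Number) with invented values; this is an illustration of the join, not the dataset's actual contents:

```python
# Toy rows illustrating the linking scheme; values are invented.
data_sources = {1: {"DataSource_ID": 1, "Realm": "Terrestrial"}}
plot_data = {"a": {"Plot_ID": "a", "Location": "group-1"}}
abundance = [
    {"DataSource_ID": 1, "Plot_ID": "a", "Year": 1990, "Number": 154.0},
    {"DataSource_ID": 1, "Plot_ID": "a", "Year": 1991, "Number": None},  # NA kept
]

# Join each measurement row to its plot-level and dataset-level metadata.
full = [
    {**data_sources[row["DataSource_ID"]], **plot_data[row["Plot_ID"]], **row}
    for row in abundance
]
```

    Each measurement row ends up carrying its plot- and dataset-level metadata, which is the "full dataframe" shape the analyses use; note the NA (None) year survives the join, as described above.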

  8. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    • b2find.dkrz.de
    zip
    Updated Feb 14, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset paper (public preprint)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is that they rely only on the original observational record, without the need for ancillary variables or model-based information. Owing to this intrinsic challenge, no global, long-term, univariate gap-filled product was available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that fill satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect gap-filling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023): https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads
    mkdir -p "$DOWNLOAD_DIR"

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
        echo "Downloading $year.zip..."
        wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), with each subdirectory containing one netCDF image file per day (DD) of each month (MM), on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). The file names follow this convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
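As a convenience, the path of the file for a given day can be derived from this convention together with the per-year subdirectories; the helper below is an illustrative sketch, not part of the dataset:

```python
from datetime import date

# Build the expected file name for a given day, following the convention
# ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
def gapfilled_filename(day: date) -> str:
    stamp = day.strftime("%Y%m%d")
    return (f"ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-"
            f"{stamp}000000-fv09.1r1.nc")

# Daily files are grouped in one subdirectory per year (YYYY).
def gapfilled_path(day: date) -> str:
    return f"{day.year}/{gapfilled_filename(day)}"
```

For example, gapfilled_path(date(1991, 8, 5)) yields "1991/ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-19910805000000-fv09.1r1.nc".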

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also available where an observation was initially available (compare `gapmask`); in that case, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (0), and where the interpolated value is used (1) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.
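For example, the `gapmask` variable can be used to separate original satellite observations from gap-filled values. A minimal numpy sketch with toy arrays (in practice, the fields would be read from the netCDF files):

```python
import numpy as np

# Toy stand-ins for one day's fields (in practice, read from the netCDF file).
sm = np.array([[0.20, 0.35],
               [0.10, 0.42]])          # volumetric soil moisture, m3/m3
gapmask = np.array([[0, 1],
                    [0, 1]])           # 0 = satellite obs, 1 = gap-filled

# Keep only the original satellite observations, masking interpolated values.
observed_only = np.where(gapmask == 0, sm, np.nan)
```

The same pattern applies to `frozenmask` for excluding linearly interpolated frozen-soil periods.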

    Version Changelog

    Changes in v09.1r1 (previous version: v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports the Climate and Forecast (CF) metadata conventions for netCDF files.

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the Soil Moisture Climate Data Records from Satellites community:

    • ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  9. Dataset on potential risk factors to COVID19 disease among Health Workers in...

    • data.mendeley.com
    Updated Jun 26, 2023
    + more versions
    Cite
    John Kiragu (2023). Dataset on potential risk factors to COVID19 disease among Health Workers in Kenya [Dataset]. http://doi.org/10.17632/x47k6prsv8.4
    Explore at:
    Dataset updated
    Jun 26, 2023
    Authors
    John Kiragu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Kenya
    Description

    Occupational characteristics, as well as personal and health-system characteristics of the health workers, were hypothesized to be associated with an increased risk of COVID-19 disease within the Kenyan tertiary-level hospital. Data were therefore collected using a researcher-administered, literature-based questionnaire via phone interviews on self-reported occupational and personal characteristics of health workers who worked in Kenyatta National Hospital between November 2021 and December 2021. The responses in the dataset were treated as potential explanatory exposure variables for the study, while COVID-19 status was the study outcome.

    The participants consented to participation, and their consent was documented before questionnaire administration. The collection of the data was approved by the Kenyatta National Hospital-University of Nairobi Ethics Review Committee (P462/06/2021); permission to conduct the study was given by the administration of Kenyatta National Hospital, and the study licence was issued by the National Commission for Science, Technology and Innovation for Kenya. Participant identifier information was removed and de-identified: first, the questionnaire responses were anonymized; second, the contact-information database used during phone interviews was kept strictly confidential, restricted, password-protected, and used for the purposes of this study only.

    The dataset was then cleaned in MS Excel to remove obvious errors and exported into R statistical software for analysis. Missingness of data was assessed prior to analysis. Aggregate variables of interest were derived from the primary variables, and multiple imputation was applied to address missing-data bias. The data were analysed by regression methods, and future researchers can apply similar methods to prove or disprove their hypotheses based on the dataset.

  10. Correlations (above diagonal), standard deviations (diagonal) and...

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.

  11. Data from: QA/QC-ed Groundwater Level Time Series in PLM-1 and PLM-6...

    • osti.gov
    • knb.ecoinformatics.org
    • +1more
    Updated Jan 1, 2023
    + more versions
    Cite
    Carroll, Rosemary; Dong, Wenming; Faybishenko, Boris; O'Ryan, Dylan; Tokunaga, Tetsu; Versteeg, Roelof; Williams, Kenneth (2023). QA/QC-ed Groundwater Level Time Series in PLM-1 and PLM-6 Monitoring Wells, East River, Colorado (2016-2022) [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1866836-qa-qc-ed-groundwater-level-time-series-plm-plm-monitoring-wells-east-river-colorado
    Explore at:
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    United States Department of Energyhttp://energy.gov/
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)
    39.034,-106.88|38.88,-106.88|38.88,-107.05|39.034,-107.05|39.034,-106.88
    Authors
    Carroll, Rosemary; Dong, Wenming; Faybishenko, Boris; O'Ryan, Dylan; Tokunaga, Tetsu; Versteeg, Roelof; Williams, Kenneth
    Area covered
    East River
    Description

    This data set contains QA/QC-ed (Quality Assurance and Quality Control) water level data for the PLM1 and PLM6 wells. PLM1 and PLM6 are location identifiers used by the Watershed Function SFA project for two groundwater monitoring wells along an elevation gradient in the lower montane life zone of a hillslope near the Pumphouse location in the East River Watershed, Colorado, USA. These wells are used to monitor subsurface water and carbon inventories and fluxes, and to determine the seasonally dependent flow of groundwater under the PLM hillslope. The downslope flow of groundwater, in combination with data on groundwater chemistry (see related references), can be used to estimate rates of solute export from the hillslope to the floodplain and river.

    QA/QC analysis of the measured groundwater levels in monitoring wells PLM-1 and PLM-6 included identification and flagging of duplicated timestamps, gap filling of missing timestamps and water levels, and removal of abnormal/bad values and outliers in the measured water levels. The QA/QC analysis also tested the application of different QA/QC methods and developed regular (5-minute, 1-hour, and 1-day) time series datasets, which can serve as a benchmark for testing other QA/QC techniques and will be applicable for ecohydrological modeling. The package includes a Readme file, one R code file used to perform QA/QC, a series of 8 data csv files (six QA/QC-ed regular time series datasets of varying intervals (5-min, 1-hr, 1-day) and two files with QA/QC flagging of the original data), and three files for the reporting-format adoption of this dataset (InstallationMethods, file-level metadata (flmd), and data dictionary (dd) files). QA/QC-ed data herein were derived from the original/raw data publication available at Williams et al., 2020 (DOI: 10.15485/1818367). For more information about running the R code file (10.15485_1866836_QAQC_PLM1_PLM6.R) to reproduce the QA/QC output files, see the README (QAQC_PLM_readme.docx).

    This dataset replaces the previously published raw data time series and is the final groundwater data product for the PLM wells in the East River. Complete metadata information on the PLM1 and PLM6 wells is available in a related dataset on ESS-DIVE: Varadharajan C, et al (2022). https://doi.org/10.15485/1660962. These data products are part of the Watershed Function Scientific Focus Area collection effort to further scientific understanding of biogeochemical dynamics from genome to watershed scales.

    2022/09/09 Update: Converted data files using ESS-DIVE's Hydrological Monitoring Reporting Format. With the adoption of this reporting format, three new files (v1_20220909_flmd.csv, v1_20220909_dd.csv, and InstallationMethods.csv) were added. The file-level metadata file (v1_20220909_flmd.csv) contains information specific to the files contained within the dataset. The data dictionary file (v1_20220909_dd.csv) contains definitions of column headers and other terms across the dataset. The installation methods file (InstallationMethods.csv) contains a description of methods associated with installation and deployment at the PLM1 and PLM6 wells. Additionally, eight data files were re-formatted to follow the reporting-format guidance (er_plm1_waterlevel_2016-2020.csv, er_plm1_waterlevel_1-hour_2016-2020.csv, er_plm1_waterlevel_daily_2016-2020.csv, QA_PLM1_Flagging.csv, er_plm6_waterlevel_2016-2020.csv, er_plm6_waterlevel_1-hour_2016-2020.csv, er_plm6_waterlevel_daily_2016-2020.csv, QA_PLM6_Flagging.csv). The major changes to the data files include the addition of header_rows above the data containing metadata about the particular well, units, and sensor description.

    2023/01/18 Update: Dataset updated to include additional QA/QC-ed water level data up until 2022-10-12 for ER-PLM1 and 2022-10-13 for ER-PLM6. Reporting-format-specific files (v2_20230118_flmd.csv, v2_20230118_dd.csv, v2_20230118_InstallationMethods.csv) were updated to reflect the additional data. The R code file (QAQC_PLM1_PLM6.R) was added to replace the previously uploaded HTML files, enabling execution of the associated code. The R code file (QAQC_PLM1_PLM6.R) and ReadMe file (QAQC_PLM_readme.docx) were revised to clarify where the original data were retrieved from and to remove local file paths.
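As an illustration of how regular time series can be derived from irregular sensor readings, the pandas sketch below aggregates toy readings to a fixed 1-hour interval; the column name and timestamps are illustrative, not the exact headers of the published CSVs:

```python
import pandas as pd

# Toy irregular water-level readings (illustrative values and column name).
ts = pd.DataFrame(
    {"water_level_m": [2.10, 2.12, 2.08, 2.20]},
    index=pd.to_datetime(["2019-06-01 00:02", "2019-06-01 00:31",
                          "2019-06-01 01:05", "2019-06-01 02:40"]),
)

# Aggregate to a regular 1-hour series; hours with no reading become NaN
# and can then be gap-filled or flagged, as in a QA/QC workflow.
hourly = ts["water_level_m"].resample("1h").mean()
```

The same pattern with "5min" or "1D" offsets yields the other regular intervals provided in the package.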

  12. LimnoSat-US: A Remote Sensing Dataset for U.S. Lakes from 1984-2020

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2020
    Cite
    Xiao Yang (2020). LimnoSat-US: A Remote Sensing Dataset for U.S. Lakes from 1984-2020 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4139694
    Explore at:
    Dataset updated
    Oct 29, 2020
    Dataset provided by
    Matthew R.V. Ross
    Simon Topp
    Xiao Yang
    Tamlin Pavelsky
    John Gardner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    LimnoSat-US is an analysis-ready remote sensing database that includes reflectance values spanning 36 years for 56,792 lakes across > 328,000 Landsat scenes. The database comes pre-processed with cross-sensor standardization and the effects of clouds, cloud shadows, snow, ice, and macrophytes removed. In total, it contains over 22 million individual lake observations with an average of 393 +/- 233 (mean +/- standard deviation) observations per lake over the 36 year period. The data and code contained within this repository are as follows:

    HydroLakes_DP.shp: A shapefile containing the deepest points for all U.S. lakes within HydroLakes. For more information on the deepest point see https://doi.org/10.5281/zenodo.4136754 and Shen et al (2015).

    LakeExport.py: Python code to extract reflectance values for U.S. lakes from Google Earth Engine.

    GEE_pull_functions.py: Functions called within LakeExport.py

    01_LakeExtractor.Rmd: An R Markdown file that takes the raw data from LakeExport.py and processes it for the final database.

    SceneMetadata.csv: A file containing additional information such as scene cloud cover and sun angle for all Landsat scenes within the database. Can be joined to the final database using LandsatID.

    srCorrected_us_hydrolakes_dp_20200628: The final LimnoSat-US database containing all cloud-free observations of U.S. lakes from 1984-2020. Missing values for bands not shared between sensors (Aerosol and TIR2) are denoted by -99. dWL is the dominant wavelength calculated following Wang et al. (2015). pCount_dswe1 represents the number of high-confidence water pixels within 120 meters of the deepest point. pCount_dswe3 represents the number of vegetated water pixels within 120 meters and can be used as a flag for potential reflectance noise. All reflectance values represent the median value of high-confidence water pixels within 120 meters. The final database is provided in both .csv and .feather formats. It can be linked to SceneMetadata.csv using LandsatID. All reflectance values are derived from USGS T1-SR Landsat scenes.
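A minimal pandas sketch of two steps implied above: recoding the -99 sentinel as missing and joining the scene metadata via LandsatID. The LandsatID values shown are placeholders, not real scene identifiers:

```python
import pandas as pd

# Toy rows standing in for the LimnoSat table and SceneMetadata.csv.
limnosat = pd.DataFrame({
    "LandsatID": ["SCENE_A", "SCENE_B"],   # placeholder identifiers
    "Aerosol": [-99, 412],                 # -99 marks bands not shared between sensors
    "dWL": [495.0, 560.0],
})
scenes = pd.DataFrame({
    "LandsatID": ["SCENE_A", "SCENE_B"],
    "CloudCover": [12.0, 3.0],
})

# Recode the -99 sentinel as missing, then attach scene-level metadata.
limnosat = limnosat.replace(-99, pd.NA)
merged = limnosat.merge(scenes, on="LandsatID")
```

With the real files, the same merge attaches scene cloud cover and sun angle to each lake observation.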

  13. ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Feb 2, 2022
    + more versions
    Cite
    M. M. De Felice; K. K. Kavvadias (2022). ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions (1980-2021) [Dataset]. http://doi.org/10.5281/zenodo.2650190
    Explore at:
    Dataset updated
    Feb 2, 2022
    Authors
    M. M. De Felice; K. K. Kavvadias
    Description

    ERA-NUTS (1980-2021)

    This dataset contains a set of time-series of meteorological variables based on the Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here, while notebooks and other files can be found in the associated GitHub repository. These data have been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more generally, for studies on energy systems. An example of the analysis that can be performed with ERA-NUTS is shown in this video. Important: this dataset is still a work in progress; we will add more analysis and variables in the near future. If you spot an error or something strange in the data, please tell us by sending an email or opening an Issue in the associated GitHub repository.

    Data

    The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries, and EFTA countries). This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets, the name of the variable on the Copernicus Data Store and its unit of measure):

    - t2m: 2-meter temperature (2m_temperature, Celsius degrees)
    - ssrd: Surface solar radiation (surface_solar_radiation_downwards, Watt per square meter)
    - ssrdc: Surface solar radiation clear-sky (surface_solar_radiation_downward_clear_sky, Watt per square meter)
    - ro: Runoff (runoff, millimeters)
    - sd: Snow depth (sd, meters)

    There is also a set of derived variables:

    - ws10: Wind speed at 10 meters (derived from 10m_u_component_of_wind and 10m_v_component_of_wind, meters per second)
    - ws100: Wind speed at 100 meters (derived from 100m_u_component_of_wind and 100m_v_component_of_wind, meters per second)
    - CS: Clear-sky index (the ratio between the solar radiation and the clear-sky solar radiation)
    - RH: Relative humidity (computed following Lawrence, BAMS 2005 and Alduchov & Eskridge, 1996)
    - HDD/CDD: Heating/Cooling Degree Days (derived from 2-meter temperature following the EUROSTAT definition)

    For each variable there are 367,440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2). The data are provided in two formats:

    - NetCDF version 4 (all the variables hourly, and CDD/HDD daily). NOTE: the variables are stored as int16 type using a scale_factor to minimise the size of the files.
    - Comma Separated Value ("single index" format for all the variables and time frequencies, and "stacked" only for daily and monthly). All the CSV files are stored in a zipped file for each variable.

    Methodology

    The time-series have been generated using the following workflow:

    1. The NetCDF files are downloaded from the Copernicus Data Store from the "ERA5 hourly data on single levels from 1979 to present" dataset.
    2. The data are read in R with the climate4R packages and aggregated using the function get_ts_from_shp from panas. All the variables are aggregated at the NUTS boundaries using the average, except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
    3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R.
    4. The NetCDF files are created using xarray in Python 3.8.

    Example notebooks

    In the folder notebooks on the associated GitHub repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in xarray and how to visualise them in several ways using matplotlib or the enlopy package:

    - exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
    - ERA-NUTS-explore-with-widget: explores the datasets interactively with Jupyter and ipywidgets.

    The notebook exploring-ERA-NUTS is also available rendered as HTML.

    Additional files

    In the folder additional files on the associated GitHub repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points with respect to each NUTS0/1/2 region.

    License

    This dataset is released under the CC-BY-4.0 license.

    Changelog

    - 2022-04-08: Added Relative Humidity (RH)
    - 2022-03-07: Added the missing month in CDD/HDD
    - 2022-02-08: Updated the wind speed and temperature data due to missing months
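As a worked example of the derived wind-speed variables, ws10 is simply the magnitude of the 10-meter u/v wind components (toy values shown):

```python
import numpy as np

# Wind speed from u/v components, as done for the derived ws10/ws100 series.
u10 = np.array([3.0, -1.5])   # 10m_u_component_of_wind, m/s (toy values)
v10 = np.array([4.0, 2.0])    # 10m_v_component_of_wind, m/s (toy values)
ws10 = np.sqrt(u10**2 + v10**2)
# e.g. u = 3, v = 4 gives ws = 5 m/s
```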

  14. Sarnet Search And Rescue Dataset

    • universe.roboflow.com
    zip
    Updated Jun 16, 2022
    Cite
    Roboflow Public (2022). Sarnet Search And Rescue Dataset [Dataset]. https://universe.roboflow.com/roboflow-public/sarnet-search-and-rescue
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 16, 2022
    Dataset provided by
    Roboflow
    Authors
    Roboflow Public
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    SaR Bounding Boxes
    Description

    Description from the SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery GitHub Repository * The "Note" was added by the Roboflow team.

    Satellite Imagery for Search And Rescue Dataset - ArXiv

    This is a single-class dataset consisting of tiles of satellite imagery labeled with potential 'targets'. Labelers were instructed to draw boxes around anything they suspected may be a paraglider wing, missing in a remote area of Nevada. Volunteers were shown examples of similar objects already in the environment for comparison. The missing wing, as it was found after 3 weeks, is shown below.

    [Image: the anomaly — https://michaeltpublic.s3.amazonaws.com/images/anomaly_small.jpg]

    The dataset contains the following:

    Set        Images   Annotations
    Train      1808     3048
    Validate   490      747
    Test       254      411
    Total      2552     4206

    The data is in the COCO format and is directly compatible with Faster R-CNN as implemented in Facebook's Detectron2.
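To illustrate the COCO layout, the toy structure below mimics its three top-level lists and groups bounding boxes by image, as a detection data loader typically does; the file name and ids are placeholders:

```python
import json

# Minimal COCO-style annotation structure (toy values, single 'target' class).
coco = {
    "images": [{"id": 1, "file_name": "tile_0001.jpg",
                "width": 512, "height": 512}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [100, 150, 32, 24]}],   # [x, y, width, height]
    "categories": [{"id": 1, "name": "target"}],
}

# In practice: coco = json.load(open("annotations.json"))

# Index bounding boxes by image id, as a detector's data loader typically does.
by_image = {}
for ann in coco["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(ann["bbox"])
```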

    Getting hold of the Data

    Download the data here: sarnet.zip

    Or follow these steps

    # download the dataset
    wget https://michaeltpublic.s3.amazonaws.com/sarnet.zip
    
    # extract the files
    unzip sarnet.zip
    

    Note: with Roboflow, you can download the data here (original, raw images, with annotations): https://universe.roboflow.com/roboflow-public/sarnet-search-and-rescue/ (download v1, original_raw-images). Download the dataset in COCO JSON format, or another format of your choice, and import it to Roboflow after unzipping the folder to get started on your project.

    Getting started

    Get started with a Faster R-CNN model pretrained on SaRNet: SaRNet_Demo.ipynb

    Source Code for Paper

    Source code for the paper is located here: SaRNet_train_test.ipynb

    Cite this dataset

    @misc{thoreau2021sarnet,
       title={SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery}, 
       author={Michael Thoreau and Frazer Wilson},
       year={2021},
       eprint={2107.12469},
       archivePrefix={arXiv},
       primaryClass={eess.IV}
    }
    

    Acknowledgment

    The source data was generously provided by Planet Labs, Airbus Defence and Space, and Maxar Technologies.

  15. Data from: A systematic evaluation of normalization methods and probe...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 30, 2023
    + more versions
    Cite
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Universidade de São Paulo
    University of Toronto
    Hospital for Sick Children
    Authors
    H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses two probe designs: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.

    Methods: This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson's correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.

    Results: The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson's correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

    Methods

    Study Participants and Samples

    The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a census-based cohort of elderly residents of the city of São Paulo, Brazil, followed up every five years since 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012; the second was set in 2020, in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.

    All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).

    Blood Collection and Processing

    Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer's recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point, after the equipment was discontinued), with the same commercial reagents in both cases. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing data are also available for the samples described above.

    Characterization of DNA Methylation using the EPIC array

    Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

    Processing and Analysis of DNA Methylation Data

    The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Using the 59 SNP probes included on the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not provide probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values from the empirical distribution of out-of-band probes using the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. Where pOOBAH filtering was carried out, it was done in parallel with the QC steps above, and the probes flagged by the two analyses were combined and removed from the data.
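    The bead-number and detection p-value rules above amount to simple per-probe thresholds. A minimal, language-agnostic sketch of those two rules (the study itself performs these steps in R with RnBeads, wateRmelon and SeSAMe, and Greedycut is iterative rather than the single flat cut shown here):

```python
import numpy as np

def probe_qc_mask(beadcounts, detection_p,
                  min_beads=3, max_fail_frac=0.05, p_thresh=0.01):
    """Return a boolean mask of probes to keep.

    beadcounts, detection_p: (n_probes, n_samples) arrays.
    A probe is dropped when more than max_fail_frac of samples have
    bead number < min_beads, or detection p-value > p_thresh.
    """
    low_bead_frac = (beadcounts < min_beads).mean(axis=1)
    fail_det_frac = (detection_p > p_thresh).mean(axis=1)
    return (low_bead_frac <= max_fail_frac) & (fail_det_frac <= max_fail_frac)
```

    Probes surviving the mask would then proceed to normalization; the real pipeline additionally removes SNP-overlapping and cross-reactive probes.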

    Normalization Methods Evaluated

    The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi's read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi's preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization used minfi's Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested; for both, the inputs were unmasked SigDF Sets converted from minfi's RG Channel Sets. In the first, which we call "SeSAMe 1", SeSAMe's pOOBAH masking was not executed, and the only probes filtered out prior to normalization were those that did not pass QC in the previous analyses. In the second scenario, which we call "SeSAMe 2", pOOBAH masking was carried out on the unfiltered dataset and masked probes were removed; this was followed by removal of any probes that did not pass the previous QC and had not already been removed by pOOBAH. SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects of the different normalization methods on the absolute difference of beta values (|Δβ|) between replicated samples.
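    The replicate-agreement metrics used in the comparison (absolute beta-value difference, and the ICC analyses mentioned earlier) are straightforward to state in code. A minimal Python sketch, assuming two aligned beta matrices with one column per replicate pair; the study's actual computations were done in R, and its exact ICC formulation may differ from the one-way random-effects model shown here:

```python
import numpy as np

def abs_beta_diff(beta_a, beta_b):
    """Per-probe mean absolute beta-value difference across replicate pairs."""
    return np.abs(np.asarray(beta_a) - np.asarray(beta_b)).mean(axis=1)

def icc_oneway(values):
    """One-way random-effects ICC(1,1) for a single probe.

    values: (n_pairs, k) array, one row per replicate pair (k = 2 here).
    """
    values = np.asarray(values, dtype=float)
    n, k = values.shape
    row_means = values.mean(axis=1)
    # Between-pair and within-pair mean squares
    msb = k * np.sum((row_means - values.mean()) ** 2) / (n - 1)
    msw = np.sum((values - row_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

    Probes with ICC < 0.50 would then be flagged as poorly reproducible, matching the threshold used in the Results.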

  16. UAV Canyelles Vineyard Dataset 2024-04-26

    • zenodo.org
    tar
    Updated Mar 17, 2025
    + more versions
    Cite
    Esther Vera; Aldo Sollazzo; Chirag Rangholia; Esther Vera; Aldo Sollazzo; Chirag Rangholia (2025). UAV Canyelles Vineyard Dataset 2024-04-26 [Dataset]. http://doi.org/10.5281/zenodo.15025154
    Explore at:
    Available download formats: tar
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Esther Vera; Aldo Sollazzo; Chirag Rangholia; Esther Vera; Aldo Sollazzo; Chirag Rangholia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Canyelles
    Description

    The dataset includes the following components:

    • RGB: contains 479 RGB images captured using a DJI Mavic 3M UAV. The images are stored in JPG format, with metadata that includes location information, camera settings and capture dates.
    • SHAPE: contains the shapefile used in Agisoft Metashape to crop the pointcloud, orthomosaics and DEM, ensuring a clear visualization of the vineyard rows.
    • POINTCLOUDS: includes both the raw and cropped pointclouds in RGB, NIR, G, RE, R and NDVI. All were processed using Agisoft Metashape and stored in XYZ format (.txt).
    • ORTHOMOSAICS: includes the original and cropped orthomosaics in RGB, NIR, G, RE, R and NDVI. All were generated using Agisoft Metashape and are in TIF format.
    • DEM: contains the original and cropped DEM images, also processed using Agisoft Metashape and stored in TIF format.

    This data is aligned with the rest of the UAV Canyelles Vineyard Datasets uploaded by Noumena (UC1), so orthomosaics, pointclouds and DEMs from different dates can be analyzed jointly, for both the cropped and uncropped files.

    Data collection took place on April 26th, 2024, in Canyelles, Catalonia, Spain. The UAV was flown automatically at an altitude of 12 meters, ensuring sufficient frontal and side overlap between images. In this dataset, part of the vineyard area is missing.
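    The XYZ-format (.txt) pointclouds can be loaded with standard tools. A minimal Python sketch; the column layout beyond x, y, z (e.g. RGB or NDVI bands) should be checked against the specific file, and the filename in the usage line is hypothetical:

```python
import numpy as np

def load_xyz(path):
    """Load an XYZ-format (.txt) pointcloud: one point per line,
    x y z first, any remaining columns kept as per-point attributes."""
    pts = np.loadtxt(path)
    return pts[:, :3], pts[:, 3:]

# Hypothetical usage:
# xyz, bands = load_xyz("pointcloud_ndvi_cropped.txt")
```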

Replication Data & Code - Large-scale land acquisitions exacerbate local land inequalities in Tanzania


Replication Data

We include anonymized household survey data from our analysis to support open and reproducible science. In particular, we provide i) an anonymized household dataset collected in 2018 (n=994) for households nearby (treatment) and far away from (control) LSLAs and ii) a household dataset collected in 2019 (n=165) within the same sites. For the 2018 surveys, several anonymized extracts are provided, including a multiply imputed (n=10) dataset, used for the main analysis, in which missing data were filled in. This data can be found in the hh_data folder and includes:

  • hh_imputed10_2018: anonymized household dataset for 2018 with variables used for the main analysis where missing data was imputed 10 times
  • hh_compensation_2018: anonymized household extract for 2018 representing household benefits and compensation directly received from LSLAs
  • hh_migration_2018: anonymized household extract for 2018 representing household migration behavior following LSLAs
  • hh_rsdata_2018: extracted remote sensing data at the household geo-location for 2018
  • hh_land_2019: anonymized household extract for 2019 of land variables
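Because the main 2018 dataset is multiply imputed (m = 10), per-imputation estimates must be pooled before reporting. The replication scripts handle this in R; purely as an illustrative sketch, Rubin's rules for pooling a coefficient across imputations are:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool a coefficient across m imputed datasets via Rubin's rules.

    estimates, variances: length-m sequences of per-imputation point
    estimates and squared standard errors.
    Returns (pooled estimate, pooled standard error).
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = est.size
    q_bar = est.mean()              # pooled point estimate
    w = var.mean()                  # within-imputation variance
    b = est.var(ddof=1)             # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance
    return q_bar, np.sqrt(t)
```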

Our analysis also incorporates data from the Living Standards Measurement Survey (LSMS) collected by the World Bank (found in the lsms_data folder). We provide the sub-modules of the LSMS dataset relevant to our analysis, but the full datasets can be accessed through the World Bank's Microdata Library (https://microdata.worldbank.org/index.php/home).

Across several analyses we use the LSLA boundaries for our four selected sites. We provide a shapefile for the LSLA boundaries in the gis_data folder.

Finally, our data replication includes several model outputs (found in mod_outputs), particularly those that are lengthy to run in R. These outputs can optionally be loaded into R rather than re-running the analysis with our main_analysis.Rmd script.

Replication Code

We provide replication code in the form of R Markdown (.Rmd) or R (.R) files. Alongside the replication data, this can be used to reproduce the main figures, tables, supplementary materials, and results reported in our article. Scripts include:

  • main_analysis.Rmd: main analysis supporting the findings, graphs, and tables reported in our main manuscript
  • compensation.R: analysis of benefits and compensation received directly by households from LSLAs
  • landvalue.R: analysis of household land values as a function of distance from LSLAs
  • migration.R: analysis of migration behavior following LSLAs
  • selection_bias.R: analysis of LSLA selection bias between control and treatment enumeration areas
