71 datasets found
  1. dataset-pinkball-first-merge

    • huggingface.co
    Updated Dec 1, 2025
    Cite
    Thomas R (2025). dataset-pinkball-first-merge [Dataset]. https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge
    Explore at:
    Dataset updated
    Dec 1, 2025
    Authors
    Thomas R
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset was created using LeRobot.

      Dataset Structure
    

    meta/info.json:
    {
      "codebase_version": "v3.0",
      "robot_type": "so101_follower",
      "total_episodes": 40,
      "total_frames": 10385,
      "total_tasks": 1,
      "chunks_size": 1000,
      "data_files_size_in_mb": 100,
      "video_files_size_in_mb": 200,
      "fps": 30,
      "splits": { "train": "0:40" },
      "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
      "video_path": …
    }
    See the full description on the dataset page: https://huggingface.co/datasets/treitz/dataset-pinkball-first-merge.
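    A minimal sketch of how these meta/info.json fields can be used once the file is downloaded locally. The JSON literal below is reconstructed from the fields quoted above (the truncated video_path is omitted); everything else is standard-library Python.

```python
import json

# Reconstructed from the fields quoted in the description above.
info = json.loads("""
{
  "codebase_version": "v3.0",
  "robot_type": "so101_follower",
  "total_episodes": 40,
  "total_frames": 10385,
  "total_tasks": 1,
  "chunks_size": 1000,
  "fps": 30,
  "splits": {"train": "0:40"},
  "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
}
""")

# Resolve the parquet path for a given chunk/file index via the template.
path = info["data_path"].format(chunk_index=0, file_index=0)
print(path)  # data/chunk-000/file-000.parquet

# Total recording length in seconds follows from total_frames / fps.
seconds = info["total_frames"] / info["fps"]
```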

  2. Reddit's /r/Gamestop

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit's /r/Gamestop [Dataset]. https://www.kaggle.com/datasets/thedevastator/gamestop-inc-stock-prices-and-social-media-senti
    Explore at:
    zip (186464492 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit's /r/Gamestop

    Merge this dataset with GameStop price data to study how the chatter impacted the stock

    By SocialGrep [source]

    About this dataset

    The stonks movement is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy, yet here we are.

    This dataset contains a collection of posts and comments mentioning GME in their title and body text respectively. The data is procured using SocialGrep. The posts and the comments are labelled with their score.

    With this new dataset, it will be interesting to see how this sentiment affected stock prices in the aftermath.

    More Datasets

    For more datasets, click here.


    How to use the dataset

    The files contain Reddit posts and comments mentioning GME, each labelled with a score. They can be used to analyze how sentiment around GME related to its stock price in the aftermath.

    Research Ideas

    • To study how social media affects stock prices
    • To study how Reddit affects stock prices
    • To study how the sentiment of a subreddit affects stock prices

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication ("No Copyright"). You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: six-months-of-gme-on-reddit-comments.csv

    | Column name    | Description                                            |
    |:---------------|:-------------------------------------------------------|
    | type           | The type of post or comment. (String)                  |
    | subreddit.name | The name of the subreddit. (String)                    |
    | subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
    | created_utc    | The time the post or comment was created. (Timestamp)  |
    | permalink      | The permalink of the post or comment. (String)         |
    | body           | The body of the post or comment. (String)              |
    | sentiment      | The sentiment of the post or comment. (String)         |
    | score          | The score of the post or comment. (Integer)            |

    File: six-months-of-gme-on-reddit-posts.csv

    | Column name    | Description                                            |
    |:---------------|:-------------------------------------------------------|
    | type           | The type of post or comment. (String)                  |
    | subreddit.name | The name of the subreddit. (String)                    |
    | subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
    | created_utc    | The time the post or comment was created. (Timestamp)  |
    | permalink      | The permalink of the post or comment. (String)         |
    | score          | The score of the post or comment. (Integer)            |
    | domain         | The domain of the post or comment. (String)            |
    | url            | The URL of the post or comment. (String)               |
    | selftext       | The selftext of the post or comment. (String)          |
    | title          | The title of the post or comment. (String)             |
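    To follow the suggestion above of merging these files with price data, here is a hedged pandas sketch. The column names created_utc and score come from the tables above; the price frame is a hypothetical stand-in for whatever price source you merge in, with made-up closing values.

```python
import pandas as pd

# Tiny inline stand-ins: a few GME-mentioning posts and two days of
# hypothetical closing prices (not real market data).
posts = pd.DataFrame({
    "created_utc": pd.to_datetime(
        ["2021-01-27 14:00", "2021-01-27 18:30", "2021-01-28 09:15"]),
    "score": [120, 85, 40],
})
prices = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-27", "2021-01-28"]),
    "close": [347.51, 193.60],
})

# Aggregate posts to daily score totals, then merge on the calendar date.
daily = (posts.assign(date=posts["created_utc"].dt.normalize())
              .groupby("date", as_index=False)["score"].sum())
merged = daily.merge(prices, on="date", how="inner")
print(merged)
```

    The same join works against the real CSVs after reading them with pd.read_csv and parsing created_utc as a timestamp.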

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

  3. ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci): COMBINED...

    • catalogue.ceda.ac.uk
    Updated Oct 16, 2024
    + more versions
    Cite
    Wouter Dorigo; Wolfgang Preimesberger; S. Hahn; R. Van der Schalie; R. De Jeu; R. Kidd; N. Rodriguez-Fernandez; M. Hirschi; P. Stradiotti; T. Frederikse; A. Gruber; D. Duchemin (2024). ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci): COMBINED product, Version 09.1 [Dataset]. https://catalogue.ceda.ac.uk/uuid/0e346e1e1e164ac99c60098848537a29
    Explore at:
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Wouter Dorigo; Wolfgang Preimesberger; S. Hahn; R. Van der Schalie; R. De Jeu; R. Kidd; N. Rodriguez-Fernandez; M. Hirschi; P. Stradiotti; T. Frederikse; A. Gruber; D. Duchemin
    License

    https://artefacts.ceda.ac.uk/licences/specific_licences/esacci_soilmoisture_terms_and_conditions_v2.pdf

    Time period covered
    Nov 1, 1978 - Dec 31, 2023
    Area covered
    Earth
    Variables measured
    time, latitude, longitude
    Description

    The Soil Moisture CCI COMBINED dataset is one of three datasets created as part of the European Space Agency's (ESA) Soil Moisture Essential Climate Variable (ECV) Climate Change Initiative (CCI) project. The COMBINED product has been created by directly merging Level 2 scatterometer ('active' remote sensing) and radiometer ('passive' remote sensing) soil moisture products derived from the AMI-WS, ASCAT, SMMR, SSM/I, TMI, AMSR-E, WindSat, FY-3B, FY-3C, FY3D, AMSR2, SMOS, GPM and SMAP satellite instruments. PASSIVE and ACTIVE products have also been created.

    The v09.1 COMBINED product, provided as global daily images in NetCDF-4 classic file format, presents a global coverage of surface soil moisture at a spatial resolution of 0.25 degrees. It is provided in volumetric units [m3 m-3] and covers the period (yyyy-mm-dd) 1978-11-01 to 2023-12-31. For information regarding the theoretical and algorithmic base of the product, please see the Algorithm Theoretical Baseline Document. Additional reference documents and information relating to the dataset can also be found on the CCI Soil Moisture project website.
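    As a quick sanity check on what daily 0.25-degree global coverage over 1978-11-01 to 2023-12-31 implies, the grid and time dimensions follow directly from the description. The size figure assumes uncompressed 4-byte floats for a single variable, which the actual compressed NetCDF-4 files will undercut.

```python
from datetime import date

# Grid implied by a 0.25-degree global product.
n_lat, n_lon = int(180 / 0.25), int(360 / 0.25)   # 720 x 1440 cells

# Daily time steps covering 1978-11-01 through 2023-12-31, inclusive.
n_days = (date(2023, 12, 31) - date(1978, 11, 1)).days + 1

# Rough uncompressed size of one float32 variable, in GB.
size_gb = n_days * n_lat * n_lon * 4 / 1e9
print(n_lat, n_lon, n_days, round(size_gb, 1))
```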

    The data set should be cited using the following references:

    1. Gruber, A., Scanlon, T., van der Schalie, R., Wagner, W., and Dorigo, W. (2019). Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology, Earth Syst. Sci. Data, 11, 717–739, https://doi.org/10.5194/essd-11-717-2019

    2. Dorigo, W.A., Wagner, W., Albergel, C., Albrecht, F., Balsamo, G., Brocca, L., Chung, D., Ertl, M., Forkel, M., Gruber, A., Haas, E., Hamer, D. P., Hirschi, M., Ikonen, J., De Jeu, R., Kidd, R., Lahoz, W., Liu, Y.Y., Miralles, D., Lecomte, P. (2017). ESA CCI Soil Moisture for improved Earth system understanding: State-of-the-art and future directions. Remote Sensing of Environment, 2017, ISSN 0034-4257, https://doi.org/10.1016/j.rse.2017.07.001

    3. Preimesberger, W., Scanlon, T., Su, C. -H., Gruber, A. and Dorigo, W., "Homogenization of Structural Breaks in the Global ESA CCI Soil Moisture Multisatellite Climate Data Record," in IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 2845-2862, April 2021, doi: 10.1109/TGRS.2020.3012896.

  4. KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-aircraft-merge-data-files-9bba5
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea’s National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covered the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was conducted primarily from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  5. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, these data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:
    • demographics (281 variables),
    • dietary consumption (324 variables),
    • physiological functions (1,040 variables),
    • occupation (61 variables),
    • questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
    • medications (29 variables),
    • mortality information linked from the National Death Index (15 variables),
    • survey weights (857 variables),
    • environmental exposure biomarker measurements (598 variables), and
    • chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
    • "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    • "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    • "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
    • "nhanes_inconsistencies_documentation.xlsx" is an Excel file containing the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file that includes an .RData file and an .R file.
    • "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts with the customized functions that were written to curate the data.
    • "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.
    • "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
    • "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    • "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design.
    • "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
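    A minimal sketch of the kind of module merge the starter code demonstrates, shown here in pandas with tiny inline frames. The assumption that modules join on the NHANES participant identifier SEQN is ours, as are the column values; the real merge is done in R by "example_0 - merge_datasets_together.Rmd".

```python
import pandas as pd

# Hypothetical slices of two cleaned modules, keyed on SEQN
# (the NHANES respondent sequence number; values are made up).
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "age": [34, 51, 29]})
chemicals = pd.DataFrame({"SEQN": [1, 3], "blood_lead": [1.2, 0.8]})

# A left merge keeps every participant; those without a chemical
# measurement get NaN in the merged frame.
merged = demographics.merge(chemicals, on="SEQN", how="left")
print(merged)
```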

  6. Data from: Automated Annotation of Untargeted All-Ion Fragmentation LC–MS...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Jun 11, 2023
    Cite
    Gonçalo Graça; Yuheng Cai; Chung-Ho E. Lau; Panagiotis A. Vorkas; Matthew R. Lewis; Elizabeth J. Want; David Herrington; Timothy M. D. Ebbels (2023). Automated Annotation of Untargeted All-Ion Fragmentation LC–MS Metabolomics Data with MetaboAnnotatoR [Dataset]. http://doi.org/10.1021/acs.analchem.1c03032.s003
    Explore at:
    xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Gonçalo Graça; Yuheng Cai; Chung-Ho E. Lau; Panagiotis A. Vorkas; Matthew R. Lewis; Elizabeth J. Want; David Herrington; Timothy M. D. Ebbels
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Untargeted metabolomics and lipidomics LC–MS experiments produce complex datasets, usually containing tens of thousands of features from thousands of metabolites whose annotation requires additional MS/MS experiments and expert knowledge. All-ion fragmentation (AIF) LC–MS/MS acquisition provides fragmentation data at no additional experimental time cost. However, analysis of such datasets requires reconstruction of parent–fragment relationships and annotation of the resulting pseudo-MS/MS spectra. Here, we propose a novel approach for automated annotation of isotopologues, adducts, and in-source fragments from AIF LC–MS datasets by combining correlation-based parent–fragment linking with molecular fragment matching. Our workflow focuses on a subset of features rather than trying to annotate the full dataset, saving time and simplifying the process. We demonstrate the workflow in three human serum datasets containing 599 features manually annotated by experts. Precision and recall values of 82–92% and 82–85%, respectively, were obtained for features found in the highest-rank scores (1–5). These results equal or outperform those obtained using MS-DIAL software, the current state of the art for AIF data annotation. Further validation for other biological matrices and different instrument types showed variable precision (60–89%) and recall (10–88%) particularly for datasets dominated by nonlipid metabolites. The workflow is freely available as an open-source R package, MetaboAnnotatoR, together with the fragment libraries from Github (https://github.com/gggraca/MetaboAnnotatoR).

  7. ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci):...

    • catalogue.ceda.ac.uk
    Updated Sep 11, 2024
    + more versions
    Cite
    Wouter Dorigo; Wolfgang Preimesberger; L Moesinger; Adam Pasik; T. Scanlon; S. Hahn; R. Van der Schalie; M. Van der Vliet; R. De Jeu; R. Kidd; N. Rodriguez-Fernandez; M. Hirschi (2024). ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci): Experimental Break-Adjusted COMBINED Product, Version 07.1 [Dataset]. https://catalogue.ceda.ac.uk/uuid/0ae6b18caf8a4aeba7359f11b8ad49ae
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Wouter Dorigo; Wolfgang Preimesberger; L Moesinger; Adam Pasik; T. Scanlon; S. Hahn; R. Van der Schalie; M. Van der Vliet; R. De Jeu; R. Kidd; N. Rodriguez-Fernandez; M. Hirschi
    License

    https://artefacts.ceda.ac.uk/licences/specific_licences/esacci_soilmoisture_terms_and_conditions.pdf

    Time period covered
    Nov 1, 1978 - Dec 31, 2021
    Area covered
    Earth
    Variables measured
    latitude, longitude, soil_moisture_content, soil_moisture_content status_flag
    Description

    An experimental break-adjusted soil-moisture product has been generated by the ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci) project for their v07.1 data release. The product attempts to reduce breaks in the final CCI product by matching the statistics of the datasets between merging periods. At v07.1, the break-adjustment process (explained in Preimesberger et al. 2020) is applied only to the COMBINED product, using ERA5 soil moisture as a reference. The Soil Moisture CCI COMBINED dataset is one of three datasets created as part of the European Space Agency's (ESA) Soil Moisture Essential Climate Variable (ECV) Climate Change Initiative (CCI) project. The product has been created by directly merging Level 2 scatterometer and radiometer soil moisture products derived from the AMI-WS, ASCAT, SMMR, SSM/I, TMI, AMSR-E, WindSat, FY-3B, FY-3C, FY3D, AMSR2, SMOS, GPM and SMAP satellite instruments. PASSIVE and ACTIVE products have also been created.

    The v07.1 COMBINED break-adjusted product, provided as global daily images in NetCDF-4 classic file format, presents a global coverage of surface soil moisture at a spatial resolution of 0.25 degrees. It is provided in volumetric units [m3 m-3] and covers the period (yyyy-mm-dd) 1978-11-01 to 2021-12-31. For information regarding the theoretical and algorithmic base of the product, please see the Algorithm Theoretical Baseline Document and Preimesberger et al. 2020. Additional reference documents and information relating to the dataset can also be found on the CCI Soil Moisture project website.

    The data set should be cited using all of the following references:

    1. Gruber, A., Scanlon, T., van der Schalie, R., Wagner, W., and Dorigo, W. (2019). Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology, Earth Syst. Sci. Data, 11, 717–739, https://doi.org/10.5194/essd-11-717-2019

    2. Dorigo, W.A., Wagner, W., Albergel, C., Albrecht, F., Balsamo, G., Brocca, L., Chung, D., Ertl, M., Forkel, M., Gruber, A., Haas, E., Hamer, D. P., Hirschi, M., Ikonen, J., De Jeu, R., Kidd, R., Lahoz, W., Liu, Y.Y., Miralles, D., Lecomte, P. (2017). ESA CCI Soil Moisture for improved Earth system understanding: State-of-the-art and future directions. Remote Sensing of Environment, 2017, ISSN 0034-4257, https://doi.org/10.1016/j.rse.2017.07.001

    3. Preimesberger, W., Scanlon, T., Su, C. -H., Gruber, A. and Dorigo, W., "Homogenization of Structural Breaks in the Global ESA CCI Soil Moisture Multisatellite Climate Data Record," in IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 2845-2862, April 2021, doi: 10.1109/TGRS.2020.3012896.

  8. Data from: Data release for Linking land and sea through an...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Nov 19, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data release for Linking land and sea through an ecological-economic model of coral reef recreation [Dataset]. https://catalog.data.gov/dataset/data-release-for-linking-land-and-sea-through-an-ecological-economic-model-of-coral-reef-r
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    Coastal zones are popular recreational areas that substantially contribute to social welfare. Managers can use information about specific environmental features that people appreciate, and how these might change under different management scenarios, to spatially target actions to areas of high current or potential value. We explored how snorkelers’ experience would be affected by separate and combined land and marine management actions in West Maui, Hawaiʻi, using a Bayesian Belief Network (BBN) and a spatially explicit ecosystem services model. The BBN simulates recreational attractiveness by combining snorkelers’ preferences for coastal features with experts’ opinions on ecological dynamics, snorkeler behavior, and management actions. A choice experiment with snorkelers elicited their preferences for sites with better ecological and water-quality conditions. Linking the economic elicitation to the spatially explicit BBN to evaluate land-sea management scenarios provides specific guidance on where and how to act in West Maui to maximize ecosystem service returns. Improving coastal water quality through sediment runoff and cesspool effluent reductions, and enhancing coral reef ecosystem conditions, positively affected overall snorkeling attractiveness across the study area, but with differential results at specific sites. The highest improvements were attained through joint land-sea management, driven by strong effects of efforts to increase fish abundance and reduce sediment; however, management priorities at individual beaches varied.

  9. ReCount - A multi-experiment resource of analysis-ready RNA-seq gene count...

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Jun 14, 2021
    Cite
    (2021). ReCount - A multi-experiment resource of analysis-ready RNA-seq gene count datasets [Dataset]. http://identifiers.org/RRID:SCR_001774
    Explore at:
    Dataset updated
    Jun 14, 2021
    Description

    RNA-seq gene count datasets built using the raw data from 18 different studies. The raw sequencing data (.fastq files) were processed with Myrna to obtain tables of counts for each gene. For ease of statistical analysis, each count table was combined with sample phenotype data to form an R object of class ExpressionSet. The count tables, ExpressionSets, and phenotype tables are ready to use and freely available. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, ReCount makes finding and analyzing RNA-seq data considerably more straightforward.

  10. Harmonized global datasets of soil carbon and heterotrophic respiration from...

    • zenodo.org
    bin, nc, txt
    Updated Oct 7, 2025
    Cite
    Shoji Hashimoto; Akihiko Ito; Kazuya Nishina (2025). Harmonized global datasets of soil carbon and heterotrophic respiration from data-driven estimates, with derived turnover time and Q10 [Dataset]. http://doi.org/10.5281/zenodo.17282577
    Explore at:
    nc, txt, bin
    Dataset updated
    Oct 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shoji Hashimoto; Akihiko Ito; Kazuya Nishina
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.

    Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.

    Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.
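    The per-cell merging rule described above can be sketched in NumPy as follows. Three toy 2x2 maps stand in for the regridded datasets, with NaN marking cells a dataset does not cover; the >= 3 threshold is the soil C rule stated above (RH uses >= 4).

```python
import numpy as np

# Three toy regridded maps stacked along a "dataset" axis.
maps = np.array([
    [[10.0, np.nan], [4.0, 2.0]],
    [[12.0, np.nan], [5.0, np.nan]],
    [[11.0, 7.0],    [6.0, np.nan]],
])

# Number of available estimates per grid cell.
n = np.sum(~np.isnan(maps), axis=0)

# Mean/max/min per cell, masking cells with fewer than three estimates.
mean = np.where(n >= 3, np.nanmean(maps, axis=0), np.nan)
vmax = np.where(n >= 3, np.nanmax(maps, axis=0), np.nan)
vmin = np.where(n >= 3, np.nanmin(maps, axis=0), np.nan)
print(n)
print(mean)
```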

    Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:

    τ = CS / RH

    where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:

    τmax = CS⁺ / RH⁻
    τmin = CS⁻ / RH⁺

    where CS⁺ and CS⁻ are the maximum and minimum soil C values, and RH⁺ and RH⁻ are the maximum and minimum RH values, respectively.
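    A one-cell worked example of the turnover-time formulas, with illustrative values in consistent units (not taken from the dataset): the widest τ estimate pairs the largest stock with the smallest flux, and vice versa.

```python
# Illustrative values for a single grid cell, in consistent units
# (stock in kg C m^-2, respiration in kg C m^-2 yr^-1).
cs_mean, cs_max, cs_min = 10.0, 12.0, 8.0
rh_mean, rh_max, rh_min = 0.5, 0.6, 0.4

tau = cs_mean / rh_mean      # central estimate: 20 years
tau_max = cs_max / rh_min    # upper bound: max stock over min flux
tau_min = cs_min / rh_max    # lower bound: min stock over max flux
print(tau, tau_min, tau_max)
```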

    To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.

    All files are provided in NetCDF format. The SOC file includes the following variables:
    · longitude, latitude
    · soc: mean soil C stock (kg C m⁻²)
    · soc_median: median soil C (kg C m⁻²)
    · soc_n: number of estimates per grid cell
    · soc_max, soc_min: maximum and minimum soil C (kg C m⁻²)
    · soc_max_id, soc_min_id: study IDs corresponding to the maximum and minimum values
    · soc_range: range of soil C values
    · soc_sd: standard deviation of soil C (kg C m⁻²)
    · soc_cv: coefficient of variation (%)
    The RH file includes:
    · longitude, latitude
    · rh: mean RH (g C m⁻² yr⁻¹)
    · rh_median, rh_n, rh_max, rh_min: as above
    · rh_max_id, rh_min_id: study IDs for max/min
    · rh_range, rh_sd, rh_cv: analogous variables for RH
    The mean, maximum, and minimum values of soil C turnover time are provided as separate files. The Q10 files contain estimates derived from the mean values of soil C and RH, along with associated uncertainty values.

    The harmonized dataset files available in the repository are as follows:

    · harmonized-RH-hdg.nc: global soil heterotrophic respiration map

    · harmonized-SOC100-hdg.nc: global soil C map for 0–100 cm

    · harmonized-SOC30-hdg.nc: global soil C map for 0–30 cm

    · Q10.nc: global Q10 map

    · Turnover-time_max.nc: global soil C turnover time estimated using maximum soil C and minimum RH

    · Turnover-time_min.nc: global soil C turnover time estimated using minimum soil C and maximum RH

    · Turnover-time_mean.nc: global soil C turnover time estimated using mean soil C and RH

    · Turnover-time30_mean.nc: global soil C turnover time estimated using the soil C map for 0-30 cm

    Version history
    Version 1.1: Median values were added. Bug fix for SOC30 (n>2 was inactive in the former version)


    More details are provided in: Hashimoto, S., Ito, A. & Nishina, K. (in revision) Harmonized global soil carbon and respiration datasets with derived turnover time and temperature sensitivity. Scientific Data.

    Reference

    Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).

    Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

    | Dataset            | Repository/References (Dataset name)                   | Depth        | ID in NetCDF file*** |
    |:-------------------|:-------------------------------------------------------|:-------------|:---------------------|
    | Global soil C      | Global soil data task 2000 (IGBP-DIS)1                 | 0–100        | 3,-                  |
    |                    | Shangguan et al. 2014 (GSDE)2,3                        | 0–100, 0–30* | 1,1                  |
    |                    | Batjes 2016 (WISE30sec)4,5                             | 0–100, 0–30  | 6,7                  |
    |                    | Sanderman et al. 2017 (Soil-Carbon-Debt)6,7            | 0–100, 0–30  | 5,5                  |
    |                    | Soilgrids team and Hengl et al. 2017 (SoilGrids)8,9    | 0–30**       | -,6                  |
    |                    | Hengl and Wheeler 2018 (LandGIS)10                     | 0–100, 0–30  | 4,4                  |
    |                    | FAO 2022 (GSOC)11                                      | 0–30         | -,2                  |
    |                    | FAO 2023 (HWSD2)12                                     | 0–100, 0–30  | 2,3                  |
    | Circumpolar soil C | Hugelius et al. 2013 (NCSCD)13–15                      | 0–100, 0–30  | 7,8                  |
    | Global RH          | Hashimoto et al. 201516,17                             | -            | 1                    |
    |                    | Warner et al. 2019 (Bond-Lamberty equation based)18,19 | -            | 2                    |
    |                    | Warner et al. 2019 (Subke equation based)18,19         | -            | 3                    |
    |                    | Tang et al. 202020,21                                  | -            | 4                    |
    |                    | Lu et al. 202122,23                                    | -            | 5                    |
    |                    | Stell et al. 202124,25                                 | -            |                      |

  11. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Networkhttp://www.hptn.org/
    National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
    HIV Vaccine Trials Networkhttp://www.hvtn.org/
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005. For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz).

    Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
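    At its core, the UMI-based error removal described above reduces to grouping reads by their UMI tag and taking a per-position majority consensus within each family. A toy Python sketch of that idea (not the PORPIDpipeline code; real pipelines align reads first and apply family-size and chimera filters):

```python
from collections import Counter, defaultdict

def umi_consensus(tagged_reads):
    """Group (umi, read) pairs by UMI and take a per-position majority vote.

    Assumes equal-length reads within a family; a real pipeline aligns reads
    and also discards families whose size suggests a PCR/sequencing artifact.
    """
    families = defaultdict(list)
    for umi, read in tagged_reads:
        families[umi].append(read)
    return {
        umi: "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))
        for umi, reads in families.items()
    }

reads = [("AAGT", "ACGTAC"), ("AAGT", "ACGTAC"), ("AAGT", "ACTTAC"),
         ("CCTA", "GGGTAA")]
print(umi_consensus(reads))  # {'AAGT': 'ACGTAC', 'CCTA': 'GGGTAA'}
```

    The lone T at position 3 of the third AAGT read is outvoted by the other family members, which is exactly how point mutations introduced during PCR or sequencing are removed.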

  12. Data from: RAW data from Towards Holistic Environmental Policy Assessment:...

    • research.science.eus
    • data.europa.eu
    Updated 2024
    Cite
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber; Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber (2024). RAW data from Towards Holistic Environmental Policy Assessment: Multi-Criteria Frameworks and recommendations for modelers paper [Dataset]. https://research.science.eus/documentos/685699066364e456d3a65172
    Explore at:
    Dataset updated
    2024
    Authors
    Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber; Borges, Cruz E.; Ferrón, Leandro; Soimu, Oxana; Mugarra, Aitziber
    Description

    Name: Data used to rate the relevance of each dimension necessary for a Holistic Environmental Policy Assessment.

    Summary: This dataset contains answers from a panel of experts and the public to rate the relevance of each dimension on a scale of 0 (Not relevant at all) to 100 (Extremely relevant).

    License: CC-BY-SA

    Acknowledge: These data have been collected in the framework of the DECIPHER project. This project has received funding from the European Union’s Horizon Europe programme under grant agreement No. 101056898.

    Disclaimer: Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

    Collection Date: 2024-1 / 2024-04

    Publication Date: 22/04/2025

    DOI: 10.5281/zenodo.13909413

    Other repositories: -

    Author: University of Deusto

    Objective of collection: This data was originally collected to prioritise the dimensions to be further used for Environmental Policy Assessment and IAMs enlarged scope.

    Description:

    Data Files (CSV)

    decipher-public.csv : Public participants' general survey results in the framework of the Decipher project, including socio-demographic characteristics and overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.

    decipher-risk.csv : Contains individual survey responses regarding prioritisation of dimensions in risk situations. Includes demographic and opinion data from a targeted sample.

    decipher-experts.csv : Experts’ opinions collected on risk topics through surveys in the framework of Decipher Project, targeting professionals in relevant fields.

    decipher-modelers.csv: Answers given by the developers of models about the characteristics of the models and dimensions covered by them.

    prolific_export_risk.csv : Exported survey data from Prolific, focusing specifically on ratings in risk situations. Includes response times, demographic details, and survey metadata.

    prolific_export_public_{1,2}.csv : Public survey exports from Prolific, gathering prioritisation of dimensions necessary for environmental policy assessment.

    curated.csv : Final cleaned and harmonized dataset combining multiple survey sources. Designed for direct statistical analysis with standardized variable names.

    Scripts files (R)

    decipher-modelers.R: Script to assess the answers given by modelers about the characteristics of the models.

    joint.R: Script to clean and join the RAW answers from the different surveys to retrieve the overall perception of each dimension necessary for a Holistic Environmental Policy Assessment.
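    The cleaning-and-joining step performed by joint.R amounts to merging several survey exports into one record per respondent. An illustrative Python version (the project's own script is in R, and the column names below are hypothetical):

```python
import csv
import io

def join_surveys(*csv_texts, key="participant_id"):
    """Merge rows from several CSV sources into one record per key.

    Later sources add (or overwrite) columns for respondents already seen,
    mirroring a left-to-right join across survey exports.
    """
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

public = "participant_id,country\np1,ES\np2,FR\n"
risk = "participant_id,risk_score\np1,80\n"
rows = join_surveys(public, risk)
# p1 carries columns from both surveys; p2 only those from the public one
```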

    Report Files

    decipher-modelers.pdf: Diagram with the result of the

    full-Country.html : Full interactive report showing dimension prioritisation broken down by participant country.

    full-Gender.html : Visualization report displaying differences in dimension prioritisation by gender.

    full-Education.html : Detailed breakdown of dimension prioritisation results based on education level.

    full-Work.html : Report focusing on participant occupational categories and associated dimension prioritisation.

    full-Income.html : Analysis report showing how income level correlates with dimension prioritisation.

    full-PS.html : Report analyzing Political Sensitivity scores across all participants.

    full-type.html : Visualization report comparing participant dimensions prioritisation (public vs experts) in normal and risk situations.

    full-joint-Country.html : Joint analysis report integrating multiple dimensions of country-based dimension prioritisation in normal and risk situations. Combines demographic and response patterns.

    full-joint-Gender.html : Combined gender-based analysis across datasets, exploring intersections of demographic factors and dimensions prioritisation in normal and risk situations.

    full-joint-Education.html : Education-focused report merging various datasets to show consistent or divergent patterns of dimensions prioritisation in normal and risk awareness.

    full-joint-Work.html : Cross-dataset analysis of occupational groups and their dimensions prioritisation in normal and risk situation

    full-joint-Income.html : Income-stratified joint analysis, merging public and expert datasets to find common trends and significant differences during dimensions prioritisation in normal and risks situations.

    full-joint-PS.html : Comprehensive Political Sensitivity score report from merged datasets, highlighting general patterns and subgroup variations in normal and risk situations.

    5 star: ⭐⭐⭐

    Preprocessing steps: The data has been re-coded and cleaned using the scripts provided.

    Reuse: NA

    Update policy: No more updates are planned.

    Ethics and legal aspects: Names of the persons involved have been removed.

    Technical aspects:

    Other:

  13. Data from: Estimating Biofuel Contaminant Concentration from 4D ERT with...

    • catalog.data.gov
    • datasets.ai
    Updated Aug 14, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Estimating Biofuel Contaminant Concentration from 4D ERT with Mixing Models [Dataset]. https://catalog.data.gov/dataset/estimating-biofuel-contaminant-concentration-from-4d-ert-with-mixing-models
    Explore at:
    Dataset updated
    Aug 14, 2022
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    The data are ground penetrating radar, electrical resistivity, and interpretation data. This dataset is not publicly accessible because: Secondary data, not EPA data. It can be accessed through the following means: Contact first author, Dan Glaser. Format: Data generated and stored by USACE and Rutgers University - Newark. This dataset is associated with the following publication: Glaser, D., R. Henderson, D. Werkema, T. Johnson, and R. Versteeg. Estimating Biofuel Contaminant Concentration from 4D ERT with Mixing Models. JOURNAL OF CONTAMINANT HYDROLOGY. Elsevier Science Ltd, New York, NY, USA, 248: 104027, (2022).

  14. Additional file 4 of mtDNAcombine: tools to combine sequences from multiple...

    • springernature.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Eleanor F. Miller; Andrea Manica (2023). Additional file 4 of mtDNAcombine: tools to combine sequences from multiple studies [Dataset]. http://doi.org/10.6084/m9.figshare.14189969.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Eleanor F. Miller; Andrea Manica
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4. Code to create the plots in this paper, presented as an R markdown file.

  15. ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci): Ancillary...

    • catalogue.ceda.ac.uk
    Updated Sep 11, 2024
    + more versions
    Cite
    Wouter Dorigo; Wolfgang Preimesberger; S. Hahn; R. Van der Schalie; R. De Jeu; R. Kidd; N. Rodriguez-Fernandez; M. Hirschi; P. Stradiotti; T. Frederikse; A. Gruber; R. Madelon (2024). ESA Soil Moisture Climate Change Initiative (Soil_Moisture_cci): Ancillary data used for the ACTIVE, PASSIVE and COMBINED products, Version 08.1 [Dataset]. https://catalogue.ceda.ac.uk/uuid/010243ea38f3473a885d2ccd9cfb77ab
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Centre for Environmental Data Analysishttp://www.ceda.ac.uk/
    Authors
    Wouter Dorigo; Wolfgang Preimesberger; S. Hahn; R. Van der Schalie; R. De Jeu; R. Kidd; N. Rodriguez-Fernandez; M. Hirschi; P. Stradiotti; T. Frederikse; A. Gruber; R. Madelon
    License

    https://artefacts.ceda.ac.uk/licences/specific_licences/esacci_soilmoisture_terms_and_conditions.pdf

    Area covered
    Earth
    Variables measured
    latitude, longitude
    Description

    These ancillary datasets were used in the production of the ACTIVE, PASSIVE and COMBINED soil moisture data products, created as part of the European Space Agency's (ESA) Soil Moisture Climate Change Initiative (CCI) project. The set of ancillary datasets includes Average Vegetation Optical Depth data from AMSR-E, Soil Porosity, Topographic Complexity and Wetland fraction datasets, as well as a Land Mask. This version of the ancillary datasets was used in the production of the v08.1 Soil Moisture CCI data.

    The ACTIVE, PASSIVE and COMBINED soil moisture products which these data were used to develop are fusions of scatterometer (i.e. active remote sensing) and radiometer (i.e. passive remote sensing) soil moisture products, derived from the AMI-WS, ASCAT, SMMR, SSM/I, TMI, AMSR-E, WindSat, FY-3B, FY-3C, FY3D, AMSR2, SMOS, GPM and SMAP satellite instruments. To access these products or for further details on them please see their dataset records. Additional reference documents and information relating to them can also be found on the CCI Soil Moisture project website.

    Soil moisture CCI data should be cited using the following references:

    1. Gruber, A., Scanlon, T., van der Schalie, R., Wagner, W., and Dorigo, W. (2019). Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology, Earth Syst. Sci. Data, 11, 717–739, https://doi.org/10.5194/essd-11-717-2019

    2. Dorigo, W.A., Wagner, W., Albergel, C., Albrecht, F., Balsamo, G., Brocca, L., Chung, D., Ertl, M., Forkel, M., Gruber, A., Haas, E., Hamer, D. P. Hirschi, M., Ikonen, J., De Jeu, R. Kidd, R. Lahoz, W., Liu, Y.Y., Miralles, D., Lecomte, P. (2017). ESA CCI Soil Moisture for improved Earth system understanding: State-of-the art and future directions. In Remote Sensing of Environment, 2017, ISSN 0034-4257, https://doi.org/10.1016/j.rse.2017.07.001

    3. Preimesberger, W., Scanlon, T., Su, C. -H., Gruber, A. and Dorigo, W., "Homogenization of Structural Breaks in the Global ESA CCI Soil Moisture Multisatellite Climate Data Record," in IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 2845-2862, April 2021, doi: 10.1109/TGRS.2020.3012896.

  16. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Dec 16, 2023
    Cite
    Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Michael Kempf; Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data; the following subfolders are stored as zipped files:

    “code” stores the 9 code chunks described below to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk, which covers the extraction of MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the gtools package to load the files in natural order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
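    gtools is needed because a plain lexicographic sort would place NDVI_final_10.tif before NDVI_final_2.tif. The same numeric-aware ordering, sketched in Python (file names are examples):

```python
import re

def natural_sort(paths):
    """Sort file names by their embedded integer, not lexicographically."""
    def key(path):
        match = re.search(r"(\d+)", path)
        return int(match.group(1)) if match else -1
    return sorted(paths, key=key)

files = ["NDVI_final_10.tif", "NDVI_final_2.tif", "NDVI_final_1.tif"]
print(natural_sort(files))
# ['NDVI_final_1.tif', 'NDVI_final_2.tif', 'NDVI_final_10.tif']
```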


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.


    4_TREND_analysis_NDVI.R


    Now, we want to perform trend analysis on the derived data. The data we load is tricky, as it contains a 16-day return period across a year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values with a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This normalization was applied to the NDVI values as well as the GLDAS climate variables.
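    The z-score normalization applied to the NDVI and GLDAS series is the standard deviation-from-the-mean transform. A minimal Python sketch (the repository's own code does this in R on raster time series):

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a series: deviation from the mean in standard-deviation units."""
    mu, sd = mean(values), stdev(values)  # stdev = sample standard deviation
    return [(v - mu) / sd for v in values]

series = [2.0, 4.0, 6.0, 8.0]
print([round(z, 3) for z in z_scores(series)])
# [-1.162, -0.387, 0.387, 1.162]
```

    Because each value is expressed relative to its own series' mean and spread, NDVI and climate variables with different units become directly comparable.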


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.
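    Summing per-epoch built-up rasters cell-wise, as described above, can be sketched with plain nested lists standing in for rasters (illustrative Python; the actual workflow uses terra in R):

```python
def built_up_change(epoch_masks):
    """Sum per-epoch binary built-up masks cell-wise.

    Each mask is a 2-D list of 0/1; the summed value tells in how many epochs
    a cell was classified as built-up, giving a rough continuous change measure.
    """
    rows, cols = len(epoch_masks[0]), len(epoch_masks[0][0])
    total = [[0] * cols for _ in range(rows)]
    for mask in epoch_masks:
        for i in range(rows):
            for j in range(cols):
                total[i][j] += mask[i][j]
    return total

# Hypothetical 2x2 built-up masks for three epochs
m1975 = [[0, 0], [1, 0]]
m2000 = [[0, 1], [1, 0]]
m2022 = [[1, 1], [1, 0]]
print(built_up_change([m1975, m2000, m2022]))  # [[1, 2], [3, 0]]
```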


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


    7_YIELD_plot.R


    In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analyses are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
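    The seasonal subsetting with a year-spanning DJF period can be sketched as follows (illustrative Python; the repository's own workflow operates on raster stacks in R):

```python
def seasonal_sums(monthly):
    """Sum monthly values into MAM/JJA/SON/DJF seasonal totals.

    DJF takes December together with January/February of the *following*
    year. `monthly` maps (year, month) -> value; keys here are hypothetical.
    """
    seasons = {"MAM": (3, 4, 5), "JJA": (6, 7, 8), "SON": (9, 10, 11)}
    out = {}
    for y in sorted({year for year, _ in monthly}):
        for name, months in seasons.items():
            vals = [monthly[(y, m)] for m in months if (y, m) in monthly]
            if vals:
                out[(y, name)] = sum(vals)
        # DJF spans the year boundary: Dec of y plus Jan/Feb of y + 1
        djf = [monthly.get((y, 12)), monthly.get((y + 1, 1)), monthly.get((y + 1, 2))]
        djf = [v for v in djf if v is not None]
        if djf:
            out[(y, "DJF")] = sum(djf)
    return out

monthly = {(2021, m): 1.0 for m in range(1, 13)}
monthly[(2022, 1)] = monthly[(2022, 2)] = 2.0
totals = seasonal_sums(monthly)
# DJF 2021 = Dec 2021 + Jan/Feb 2022 = 1.0 + 2.0 + 2.0 = 5.0
```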

  17. Replication code and data for: Delineating Neighborhoods: An approach...

    • rdr.kuleuven.be
    bin, html, png +6
    Updated Apr 10, 2024
    + more versions
    Cite
    Anirudh Govind; Anirudh Govind; Ate Poorthuis; Ate Poorthuis; Ben Derudder; Ben Derudder (2024). Replication code and data for: Delineating Neighborhoods: An approach combining urban morphology with point and flow datasets [Dataset]. http://doi.org/10.48804/NBDJE3
    Explore at:
    bin, html, png +6Available download formats
zip(71461515), bin(147726), bin(141483), bin(38298), bin(144128), bin(147509), bin(138527), bin(38562), bin(49550), bin(147720), bin(55811), bin(147588), bin(141189), bin(51919), bin(52581), bin(52574), bin(138931), bin(136488), bin(52442), bin(145456), bin(55946), bin(38426), bin(221261), bin(53305), bin(54212), bin(53933), bin(137981), bin(215864), bin(38458), bin(15669), bin(147072), bin(52029), bin(52373), png(22050), bin(171981), bin(51092), bin(38338), bin(143048), bin(147589), bin(137229), bin(201515), bin(145844), bin(51239), bin(207075), bin(141516), bin(56100), bin(53479), bin(142206), bin(202428), bin(146503), text/markdown(2537), bin(52348), bin(210816), bin(138985), bin(141883), bin(215142), bin(145346), bin(212602), bin(53040), bin(210271), html(11382799), bin(54251), text/markdown(2927), bin(220628), bin(51572), bin(146160), bin(53290), bin(53341), bin(51182), bin(147124), bin(138076), bin(147646), bin(51443), bin(205), bin(141333), bin(50795), bin(145150), bin(52694), bin(144314), bin(56054), bin(140497), bin(54762), text/markdown(18667), bin(143217), bin(42805), bin(207324), bin(147725), bin(146383), bin(147372), bin(52869), bin(10753), bin(51409), bin(69349), bin(147652), bin(146964), bin(56011), bin(49926), text/markdown(1946), text/markdown(2094), txt(22443850), bin(139563), bin(38308), bin(51689), bin(55714), bin(51099), bin(1123), bin(142747), bin(49783), bin(50996), bin(147590), bin(137521), bin(53435), bin(55823), bin(55329), bin(52053), bin(55931), bin(54915), bin(51267), bin(136986), bin(53233), bin(38321), bin(214403), bin(55335), text/markdown(1639), bin(137289), bin(50861), bin(55993), bin(53177), bin(54335), bin(142023), bin(10506), bin(43412), bin(141453), bin(49361), bin(218967), bin(56045), bin(54839), bin(54588), bin(4303), bin(51536), bin(137530), bin(204934), bin(53679), bin(50397), bin(138237), bin(51752), bin(142127), bin(145031), bin(147556), bin(138813), bin(55989), bin(51168), bin(54978), bin(147638), bin(142467), 
bin(51302), bin(53767), bin(52008), bin(54703), bin(52854), bin(138637), bin(143450), bin(7668), bin(53728), bin(49377), bin(51202), bin(50060), bin(55187), bin(56027), bin(55267), bin(53147), bin(56009), bin(49558), bin(51838), text/markdown(17298), text/comma-separated-values(849), bin(208065), bin(42983), bin(54152), bin(53032), bin(55670), bin(207805), bin(207382), bin(139319), png(40776), bin(139593), bin(52905), bin(211534), bin(52752), bin(53985), bin(221257), bin(38333), bin(53618), bin(141010), bin(216948), bin(53261), bin(214535), bin(52879), bin(52046), bin(38377), type/x-r-syntax(12363), bin(52921), bin(138081), bin(53590), bin(53561), bin(52458), bin(52685), bin(52509), bin(144626), bin(50882), bin(49285), bin(52662), text/markdown(11222), bin(51639), bin(51929), bin(49551), bin(55047), bin(38362), bin(55378), bin(221225), txt(6846843), bin(146471), bin(140097), bin(49561)Available download formats
    Dataset updated
    Apr 10, 2024
    Dataset provided by
    KU Leuven RDR
    Authors
    Anirudh Govind; Anirudh Govind; Ate Poorthuis; Ate Poorthuis; Ben Derudder; Ben Derudder
    License

    https://rdr.kuleuven.be/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.48804/NBDJE3

    Description

    This repository contains the R code and aggregated data needed to replicate the analysis in our paper "Delineating Neighborhoods: An approach combining urban morphology with point and flow datasets". The enclosed renv.lock file provides details of the R packages used. Aside from these packages, an installation of the Infomap algorithm (freely available through standalone installations and Docker images) is also necessary but is not included in this repository. All code is organized in computational notebooks, arranged sequentially. Data required to execute these notebooks is stored in the data/ folder. For further details, please refer to the enclosed 'README' file and the original publication.

  18. IMDB Movies From 1920 to 2025

    • kaggle.com
    zip
    Updated Mar 27, 2025
    Cite
    Raed Addala (2025). IMDB Movies From 1920 to 2025 [Dataset]. https://www.kaggle.com/datasets/raedaddala/imdb-movies-from-1960-to-2023
    Explore at:
    zip(46688739 bytes)Available download formats
    Dataset updated
    Mar 27, 2025
    Authors
    Raed Addala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Over 60,000 Movies, 100+ Years of Data, and Rich Metadata!

    Links:

    For details about the scraping process, explore the complete code repository on GitHub.

    About the Dataset

    This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history. Each year’s data is divided into three CSV files for flexibility and ease of use:
    - imdb_movies_[year].csv: Basic movie details.
    - advanced_movies_details_[year].csv: Comprehensive metadata and financial details.
    - merged_movies_data_[year].csv: A unified dataset combining both files.

    File Descriptions

    1. imdb_movies_[year].csv

    Essential movie information, including:
    - Title: Movie title.
    - Description: Movie description.
    - meta_score: IMDb's Metascore.
    - Movie Link: IMDb URL for the movie.
    - Year: Year of release.
    - Duration: Runtime (in minutes).
    - MPA: Motion Picture Association rating (e.g., PG, R).
    - Rating: IMDb rating (scale of 1–10).
    - Votes: Total user votes on IMDb.

    2. advanced_movies_details_[year].csv

    Detailed movie metadata:
    - Link: IMDb URL (for linking with other data).
    - budget: Production budget (in USD).
    - grossWorldWide: Global box office revenue.
    - gross_US_Canada: North American box office earnings.
    - opening_weekend_Gross: Opening weekend revenue.
    - directors: List of directors.
    - writers: List of writers.
    - stars: Main cast members.
    - genres: Movie genres.
    - countries_origin: Countries of production.
    - filming_locations: Primary filming locations.
    - production_companies: Associated production companies.
    - Languages: Languages spoken in the movie.
    - Award_information: Information about awards, nominations and wins.
    - release_date: Official release date.

    3. merged_movies_data_[year].csv

    A unified dataset combining all columns from the previous two files:
    - Basic Details: Title, Year, Rating, Votes.
    - Advanced Features: budget, grossWorldWide, directors, genres, and awards.

    Data Structure

    Template Columns:
    - imdb_movies_[year].csv:
    Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link

    • advanced_movies_details_[year].csv:
      link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages

    • merged_movies_data_[year].csv:
      Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages
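As a minimal pandas sketch, the merged file can be re-created from the two per-year files by joining on the shared IMDb URL (the `Movie Link` and `link` columns listed above). The inline rows below are illustrative stand-ins, not real IMDb data:

```python
import pandas as pd

# Illustrative rows mimicking the two per-year schemas (not real IMDb data).
basic = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Year": [1999, 1999],
    "Rating": [7.8, 6.4],
    "Movie Link": ["https://www.imdb.com/title/tt0000001/",
                   "https://www.imdb.com/title/tt0000002/"],
})
advanced = pd.DataFrame({
    "link": ["https://www.imdb.com/title/tt0000001/",
             "https://www.imdb.com/title/tt0000002/"],
    "budget": [1_000_000, 5_000_000],
    "genres": ["Drama", "Comedy"],
})

# Join basic details to advanced metadata on the shared IMDb URL.
merged = basic.merge(advanced, left_on="Movie Link", right_on="link", how="left")
print(merged[["Title", "budget"]])
```

In practice you would load the real files with `pd.read_csv("imdb_movies_1999.csv")` and `pd.read_csv("advanced_movies_details_1999.csv")`; a left join keeps every movie from the basic file even if advanced metadata is missing.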

    Updates

    The dataset is updated annually in December to include the latest data.

    Applications

    This dataset is ideal for:
    - Trend Analysis: Explore changes in the movie industry across more than a century.
    - Predictive Modeling: Build models to forecast box office revenue, ratings, or awards.
    - Recommendation Systems: Use attributes like genres, cast, and ratings for personalized recommendations.
    - Comparative Analysis: Study differences across eras, genres, or regions.

    Dataset Features

    • Over 60,000 Movies: Detailed data from 1920 to 2025.
    • Rich Metadata: Financial, creative, and recognition-related attributes.
    • User-friendly: Modular files for tailored use or comprehensive merged files.
    • Consistency: Uniform structure enables seamless analysis.

    Notes

    • For issues, suggestions, or feature requests, please contact me by email or open an issue on GitHub. Your input is highly appreciated.
  19. NHANES 1988-2018

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    nguyenvy (2025). NHANES 1988-2018 [Dataset]. https://www.kaggle.com/datasets/nguyenvy/nhanes-19882018
    Explore at:
    zip(917955003 bytes)Available download formats
    Dataset updated
    Jul 31, 2025
    Authors
    nguyenvy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposures of the non-institutionalized US population. However, because NHANES data are plagued with multiple inconsistencies, they must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey: 1. demographics (281 variables); 2. dietary consumption (324 variables); 3. physiological functions (1,040 variables); 4. occupation (61 variables); 5. questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood); 6. medications (29 variables); 7. mortality information linked from the National Death Index (15 variables); 8. survey weights (857 variables); 9. environmental exposure biomarker measurements (598 variables); and 10. chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and data dictionaries comprise 23 .csv files and 1 Excel file.
    - The curated NHANES datasets involve 20 .csv-formatted files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
    - "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    - "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    - "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
    - "nhanes_inconsistencies_documentation.xlsx" is an Excel file containing the cleaning documentation, which records all inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.
    - "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data.
    - "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter codes: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.
    - "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
    - "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    - "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design.
    - "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
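The starter notebooks above do the module merge in R; as a rough pandas analogue, the cleaned module .csv files can be joined on a shared participant-identifier column. The column name `SEQN` (NHANES's participant id) and the toy rows below are assumptions for illustration; check "dictionary_nhanes.csv" for the actual key in the curated files:

```python
import pandas as pd

# Toy stand-ins for two cleaned NHANES modules; SEQN as the participant id
# is an assumption here -- verify against dictionary_nhanes.csv.
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "age": [34, 51, 28]})
mortality = pd.DataFrame({"SEQN": [1, 3], "deceased": [0, 1]})

# Left-join so participants without linked mortality records are kept.
merged = demographics.merge(mortality, on="SEQN", how="left")
print(merged)
```

A left join mirrors the usual exposome workflow: every participant in the demographics module survives the merge, with missing values where a module has no record for them.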

  20. Reddit: /r/AskScience

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/AskScience [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-secrets-of-reddit-s-askscience-com
    Explore at:
    zip(241607 bytes)Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Unlocking the Secrets of Reddit's AskScience Community

    Popular Science Questions and Answers

    By Reddit [source]

    About this dataset

    This dataset reveals the hidden corners of Reddit's AskScience subreddit, offering an intimate look at one of the most beloved science communities on the web. With post titles, scores, IDs, URLs, comment counts, creation times, post bodies, and timestamps all included, it provides key insights into what makes AskScience such an engaging community of passionate science enthusiasts. By combining these fields, we can better understand discussants' interests and preferences, and draw lessons for building more robust online communities whose members spark exciting conversations around their favorite scientific topics.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Research Ideas

    • Analyzing the kinds of questions, tags and topics that receive the most upvotes.
    • Comparing engagement levels in the AskScience subreddit over time.
    • Examining how formatting affects post engagement and popularity (e.g., bolding titles, using images, etc.)

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: askscience.csv

    | Column name | Description |
    |:------------|:------------|
    | title       | The title of a post. (String) |
    | score       | The number of upvotes a post has received. (Integer) |
    | url         | The URL of a post. (String) |
    | comms_num   | The number of comments a post has received. (Integer) |
    | created     | The date and time that a post was created. (DateTime) |
    | body        | The body text associated with a post. (String) |
    | timestamp   | The date and time when a post was last updated. (DateTime) |
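As a small pandas sketch of working with these columns, e.g. for the engagement analyses suggested above. The inline rows are made up; in practice you would load the real file with `pd.read_csv("askscience.csv")`:

```python
import pandas as pd

# Made-up rows following the askscience.csv schema described above.
posts = pd.DataFrame({
    "title": ["Why is the sky blue?", "How do vaccines work?"],
    "score": [120, 340],
    "comms_num": [45, 88],
    "created": pd.to_datetime(["2020-01-05", "2020-02-11"]),
})

# Research idea 1: which posts draw the most upvotes?
top = posts.sort_values("score", ascending=False)
print(top.iloc[0]["title"])
```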

    Acknowledgements

    If you use this dataset in your research, please credit Reddit.
