32 datasets found
  1. NSF/NCAR GV HIAPER 1 Minute Data Merge

    • data.ucar.edu
    ascii
    Updated Aug 1, 2025
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). NSF/NCAR GV HIAPER 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/R1RA-JHKZ-W913
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 18, 2012 - Jun 30, 2012
    Description

    This data set contains NSF/NCAR GV HIAPER 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 30 June 2012. These are updated merges from the NASA DC3 archive that were made available 13 June 2014. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg60-gV_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (the suffix "_thruYYYYMMdd" indicates the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
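
    Since the merge files are plain-text ICARTT (.ict), they can be read without special tooling. Below is a minimal R sketch, assuming a standard ICARTT layout in which the first header line gives the header length and the last header line lists the comma-separated variable names; the file name is illustrative only.

      # Read an ICARTT (.ict) merge file into a data frame.
      # Assumption: standard ICARTT layout -- first line "<n_header>, <format>",
      # variable names on the last header line. File name is illustrative.
      read_icartt <- function(path) {
        n_header <- as.integer(strsplit(readLines(path, n = 1), ",")[[1]][1])
        header   <- readLines(path, n = n_header)
        cols     <- trimws(strsplit(header[n_header], ",")[[1]])
        read.csv(path, skip = n_header, header = FALSE, col.names = cols)
      }

      gv <- read_icartt("dc3-mrg60-gV_merge_20120518_R5_thru20120630.ict")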

  2. Merge script

    • figshare.com
    txt
    Updated May 30, 2018
    Cite
    Jose Aguasvivas; Jon Andoni Duñabeitia (2018). Merge script [Dataset]. http://doi.org/10.6084/m9.figshare.5924797.v2
    Dataset provided by
    figshare
    Authors
    Jose Aguasvivas; Jon Andoni Duñabeitia
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Script in the R language that loads the databases and merges them into a single file. The script may be modified and extended as needed.
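
    A minimal sketch of such a merge, assuming two hypothetical .csv databases that share an "id" key column (the actual files and key are those described by the authors):

      # Load two databases and merge them into a single file on a shared key.
      # File and column names are hypothetical.
      a <- read.csv("database_a.csv")
      b <- read.csv("database_b.csv")
      merged <- merge(a, b, by = "id", all = TRUE)  # full outer join on the key
      write.csv(merged, "merged.csv", row.names = FALSE)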

  3. KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-aircraft-merge-data-files-9bba5
    Dataset provided by
    NASA, http://nasa.gov/
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete. The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea’s National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments. Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok. The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  4. Data from: A dataset to model Levantine landcover and land-use change...

    • data.niaid.nih.gov
    Updated Dec 16, 2023
    Cite
    Kempf, Michael (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10396147
    Dataset authored and provided by
    Kempf, Michael
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan through the influx of Syrian refugees and increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data; the following subfolders are stored as zipped files:

    “code” stores the nine code chunks described below under “Code structure”, used to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different formats. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (by variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second folder contains the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R

    This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., in Google Chrome. Packages: terra. Download MODIS data, after registration, from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially separate time series and merge them later. Note that the time series are temporally consistent.

    2_MERGE_MODIS_tiles.R

    In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in natural order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
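
    A minimal sketch of this merge step with terra and gtools, assuming three folders of per-date NDVI GeoTIFFs, one per tile (folder names are hypothetical):

      library(terra)
      library(gtools)

      # Natural sort so file 2 precedes file 10 (the gtools step described above).
      f1 <- mixedsort(list.files("h20v05", pattern = "\\.tif$", full.names = TRUE))
      f2 <- mixedsort(list.files("h21v05", pattern = "\\.tif$", full.names = TRUE))
      f3 <- mixedsort(list.files("h21v06", pattern = "\\.tif$", full.names = TRUE))

      dir.create("merged", showWarnings = FALSE)
      for (i in seq_along(f1)) {
        m <- merge(merge(rast(f1[i]), rast(f2[i])), rast(f3[i]))  # tiles 1+2, then +3
        writeRaster(m, sprintf("merged/NDVI_final_%d.tif", i), overwrite = TRUE)
      }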

    3_CROP_MODIS_merged_tiles.R

    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. This produces single cropped NDVI time series files from MODIS. The repository provides the already clipped and merged NDVI datasets.

    4_TREND_analysis_NDVI.R

    Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across a year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values at a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3. To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
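
    The z-score normalization mentioned above is a one-liner; a minimal sketch, assuming x is a numeric vector of annual growing-season sums (the values here are made up):

      # Deviation from the mean in units of standard deviations.
      zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

      x <- c(210, 195, 230, 250, 240)  # hypothetical annual NDVI sums
      round(zscore(x), 2)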

    5_BUILT_UP_change_raster.R

    Let us look at the landcover changes now. We are working with the terra package and get raster data from: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.

    6_POPULATION_numbers_plot.R

    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.

    7_YIELD_plot.R

    In this section, we use the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted with ggplot and combined using the patchwork package in R.

    8_GLDAS_read_extract_trend

    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the spatraster collection. Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area. From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For variables such as rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year). From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe thanks to the availability of the GLDAS variables.
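
    A minimal sketch of the variable extraction and masking described here, assuming monthly GLDAS NetCDF files in a local folder; the variable name shown is one of the GLDAS abbreviations and must be adjusted as the text explains:

      library(terra)

      files <- list.files("GLDAS", pattern = "\\.nc$", full.names = TRUE)
      print(names(sds(files[1])))               # list available variables, cf. print(nc)

      r <- rast(files, subds = "Rainf_f_tavg")  # choose one variable across all files
      aoi <- vect("MERGED_LEVANT.shp")          # mask from the repository
      r_levant <- mask(crop(r, aoi), aoi)       # crop, then mask to the study area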

    9_workflow_diagramme: this simple code can be used to plot a workflow diagram and is detached from the actual analysis.

    Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, and Funding acquisition: Michael Kempf

  5. NASA DC-8 1 Minute Data Merge

    • data.ucar.edu
    ascii
    Updated Aug 1, 2025
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). NASA DC-8 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/VM9C-1C16-H003
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 1, 2012 - Jun 30, 2012
    Description

    This dataset contains NASA DC-8 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. This dataset contains updated data provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg60-dc8_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (the suffix "_thruYYYYMMdd" indicates the last flight date included). This dataset is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and dataset comments. For more information on updates to this dataset, please see the readme file.

  6. NASA DC-8 10 Second Data Merge

    • data.ucar.edu
    ascii
    Updated Aug 1, 2025
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). NASA DC-8 10 Second Data Merge [Dataset]. http://doi.org/10.26023/CHJ0-RYQ4-GR10
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 18, 2012 - Jun 22, 2012
    Description

    This data set contains NASA DC-8 10 Second Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. These merges are an updated version provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg10-dc8_merge_YYYYMMdd_R*_thruYYYYMMdd.ict" (the suffix "_thruYYYYMMdd" indicates the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments. For the latest information on the updates to this dataset, please see the readme file.

  7. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. “dictionary_drug_codes.csv” contains the dictionary of descriptors for the drug codes. “nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. “w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data. “m - nhanes_1988_2018.R” shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order. “example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together. “example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. “example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design. “example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.
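
    A minimal sketch of the first starter step, assuming the .RData file has been downloaded and that two of the loaded module data frames (object names here are hypothetical; see the data dictionary for the real ones) share the NHANES respondent identifier SEQN; the authors' own walkthrough is "example_0 - merge_datasets_together.Rmd":

      load("w - nhanes_1988_2018.RData")  # loads the curated modules as R objects

      # Join two modules on the participant identifier (module names hypothetical).
      merged <- merge(demographics, chemicals, by = "SEQN", all.x = TRUE)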

  8. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.
    this new github repository contains five scripts: 1992 - 2010 download HRS microdata.R loop through every year and every file, download, then unzip everything in one big party import longitudinal RAND contributed files.R create a SQLite database (.db) on the local disk load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram) longitudinal RAND - analysis examples.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create two database-backed complex sample survey objects, using a taylor-series linearization design perform a mountain of analysis examples with wave weights from two different points in the panel import example HRS file.R load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html) parse through the IF block at the bottom of the sas importation script, blank out a number of variables save the file as an R data file (.rda) for fast loading later replicate 2002 regression.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create a database-backed complex sample survey object, using a taylor-series linearization design exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document. click here to view these five scripts for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage rand's hrs homepage the hrs wikipedia page a running list of publications using hrs notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
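
    A minimal sketch of the database-backed pattern these scripts use, assuming a hypothetical rand_hrs data frame already read into R; writing in chunks keeps RAM use bounded, as the import script above describes:

      library(DBI)
      library(RSQLite)

      con <- dbConnect(SQLite(), "hrs.db")
      chunk <- 10000
      for (start in seq(1, nrow(rand_hrs), by = chunk)) {
        rows <- rand_hrs[start:min(start + chunk - 1, nrow(rand_hrs)), ]
        dbWriteTable(con, "rand_hrs", rows, append = TRUE)  # append chunk by chunk
      }
      dbDisconnect(con)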

  9. Merger of BNV-D data (2008 to 2019) and enrichment

    • data.europa.eu
    zip (18530465 bytes)
    Updated Jan 16, 2025
    Cite
    Patrick VINCOURT (2025). Merger of BNV-D data (2008 to 2019) and enrichment [Dataset]. https://data.europa.eu/data/datasets/5f1c3eca9d149439e50c740f
    Dataset authored and provided by
    Patrick VINCOURT
    Description

    Merging (into an R table) data published on https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, and joining two other sources of information associated with marketing authorisations (MAs):

    — uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/

    — information on the “Biocontrol” status of the product, from document DGAL/SDQSPV/2020-784 published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole

    All the initial files (.csv transformed into .txt), the R code used to merge the data, and the different output files are collected in a zip. NB: 1) “YASCUB” stands for {year, AMM, Substance_active, Classification, Usage, Statut_“BioControl”}, substances not on the DGAL/SDQSPV list being coded NA. 2) The file of biocontrol products has been cleaned of the duplicates generated by marketing authorisations leading to several trade names. 3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region, Region, Dept, Code_Dept, Anne, Usage, Classification, Type_BioC, Quantite_substance}.
  10. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts: 2005-2012 asec - download all microdata.R download the fixed-width file containing household, family, and person records import by separating this file into three tables, then merge 'em together at the person-level download the fixed-width file containing the person-level replicate weights merge the rectangular person-level file with the replicate weights, then store it in a sql database create a new variable - one - in the data table 2012 asec - analysis examples.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights perform a boatload of analysis examples replicate census estimates - 2011.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights match the sas output shown in the png file below 2011 asec replicate weight sas output.png statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page the bureau of labor statistics' current population survey page the current population survey's wikipedia article notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011.
    when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
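
    A minimal sketch of the parse.SAScii approach described above, assuming a local copy of an NBER SAS input script and the matching fixed-width microdata file (both file names are illustrative):

      library(SAScii)

      layout <- parse.SAScii("cpsmar2012.sas")  # varname, width, char per field
      # Negative widths mark gaps to skip; read.fwf honours that convention.
      cps <- read.fwf("asec2012_pubuse.dat",
                      widths    = layout$width,
                      col.names = layout$varname[layout$width > 0])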

  11. Supporting data for "Northwestern European arthropod life histories and...

    • zenodo.org
    Updated Jan 14, 2025
    Cite
    Garben Logghe; Femke Batsleer; Dirk Maes; Tristan Permentier; Matty P. Berg; Dimitri Brosens; Stijn Cooleman; Pallieter De Smedt; Jonas Hagge; Jorg Lambrechts; Marc Pollet; Fons Verheyde; Dries Bonte (2025). Supporting data for "Northwestern European arthropod life histories and ecological traits" [Dataset]. http://doi.org/10.5281/zenodo.13379714
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Garben Logghe; Femke Batsleer; Dirk Maes; Tristan Permentier; Matty P. Berg; Dimitri Brosens; Stijn Cooleman; Pallieter De Smedt; Jonas Hagge; Jorg Lambrechts; Marc Pollet; Fons Verheyde; Dries Bonte
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Northwestern Europe
    Description

    These are the raw data and code that were used to create the dataset "Northwestern European arthropod life histories and ecological traits" on GBIF (https://www.gbif.org/dataset/a684ca81-66f3-4d38-863e-463f79220f03).

    For a more detailed description of this dataset and its purpose, see the manuscript by Logghe et al. (2024): https://doi.org/10.1101/2024.12.17.628846.

    The dataset published on GBIF resulted from merging the raw datasets found in "Raw datasets.7z" with each other and with the realised thermal niche data found in "Calculating thermal niches.7z".

    "Calculating thermal niches.7z" contains data files on realised thermal niches, calculated by overlaying species distribution data derived from GBIF with WorldClim climate data in QGIS. Every order has its own separate txt file, with estimations of mean, minimum and maximum temperature niche for every species. Additionally, the ZIP file contains the code that was used for downloading GBIF occurences in R and calculating the thermal niches in QGIS.

    "Raw datasets.7z" contains raw datasets for every order. Within each dataset, one row contains trait data derived for a single species from a single source. This means that the same species might occur multiple times within the dataset. A metadata file is provided to explain the different columns.

    "Merging raw datasets.Rmd" is the R script that was used to merge all raw datasets with each other and with the calculated thermal niches, resulting in the "Northwestern European arthropod life histories and ecological traits" dataset on GBIF.

  12. NSF/NCAR GV HIAPER 10 Second Data Merge

    • data.ucar.edu
    ascii
    Updated Aug 1, 2025
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2025). NSF/NCAR GV HIAPER 10 Second Data Merge [Dataset]. http://doi.org/10.26023/XJSD-81JE-6J0C
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 18, 2012 - Jun 30, 2012
    Description

    This data set contains NSF/NCAR GV HIAPER 10 Second Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 30 June 2012. These are updated merges available through the NASA DC3 archive as of 13 June 2014. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg10-gV_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (the suffix "_thruYYYYMMdd" indicates the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.

  13. Designing Types for R, Empirically (Dataset)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Goel, Aviral (2024). Designing Types for R, Empirically (Dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4091817
    Dataset provided by
    Krikava, Filip
    Vitek, Jan
    Turcotte, Alexi
    Goel, Aviral
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is intended to accompany the paper "Designing Types for R, Empirically" (OOPSLA '20). The data were obtained by running the Typetracer (aka propagatr) dynamic analysis tool on the test, example, and vignette code of a corpus of >400 extensively used R packages.

    Specifically, this dataset contains:

    function type traces for >400 R packages (raw-traces.tar.gz);

    trace data processed into a more readable/usable form (processed-traces.tar.gz), which was used in obtaining results in the paper;

    inferred type declarations for the >400 R packages using various strategies to merge the processed traces (see type-declarations-* directories), and finally;

    contract assertion data from running the reverse dependencies of these packages and checking function usage against the declared types (contract-assertion-reverse-dependencies.tar.gz).

    A preprint of the paper is also included, which summarizes our findings.

    Fair warning Re: data size: the raw traces, once uncompressed, take up nearly 600GB. The already processed traces are in the 10s of GB, which should be more manageable for a consumer-grade computer.

  14. Merging galaxies in COSMOS to z~1 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 20, 2023
    Cite
    (2023). Merging galaxies in COSMOS to z~1 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/f72a6849-c5ab-5760-9dd0-9e8f37e66d6f
    Description

    The role of major mergers in galaxy and black hole formation is not well-constrained. To help address this, we develop an automated method to identify late-stage galaxy mergers before coalescence of the galactic cores. The resulting sample of mergers is distinct from those obtained using pair-finding and morphological indicators. Our method relies on median-filtering of high-resolution images to distinguish two concentrated galaxy nuclei at small separations. This method does not rely on low surface brightness features to identify mergers, and is therefore reliable to high redshift. Using mock images, we derive statistical contamination and incompleteness corrections for the fraction of late-stage mergers. The mock images show that our method returns an uncontaminated (<10%) sample of mergers with projected separations between 2.2 and 8 kpc out to z~1. Applying our new method to a magnitude-limited (m_F814W) sample with masses above 10^10.6 and 0.25<z<=1.00, we find ~5% of systems are late-stage mergers. Correcting for incompleteness and contamination, the fractional merger rate increases strongly with redshift as R_merge ∝ (1+z)^(3.8±0.9), in agreement both with earlier studies and with dark matter halo merger rates. Separating the sample into star-forming and quiescent galaxies shows that the merger rate for star-forming galaxies increases strongly with redshift, (1+z)^(4.5±1.3), while the merger rate for quiescent galaxies is consistent with no evolution, (1+z)^(1.1±1.2). The merger rate also becomes steeper with decreasing stellar mass. Limiting our sample to galaxies with spectroscopic redshifts from zCOSMOS, we find that the star formation rates and X-ray selected active galactic nucleus (AGN) activity in likely late-stage mergers are higher by factors of ~2 relative to those of a control sample. Combining our sample with more widely separated pairs, we find that 8±5% of star formation and 20±8% of AGN activity are triggered by close encounters (<143 kpc) or mergers, providing additional evidence that major mergers are not the only channels for star formation and black hole growth.

  15. Scripts for Analysis

    • figshare.com
    txt
    Updated Jul 18, 2018
    Cite
    Sneddon Lab UCSF (2018). Scripts for Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.6783569.v2
    Dataset provided by
    figshare
    Authors
    Sneddon Lab UCSF
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Scripts used for analysis of V1 and V2 datasets.

    seurat_v1.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, PCA analysis, clustering, and tSNE visualization. Used for v1 datasets.

    merge_seurat.R - merge two or more seurat objects into one seurat object. Perform linear regression to remove batch effects from separate objects. Used for v1 datasets.

    subcluster_seurat_v1.R - subcluster clusters of interest from a Seurat object. Determine variable genes, perform regression and PCA. Used for v1 datasets.

    seurat_v2.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, and PCA analysis. Used for v2 datasets.

    clustering_markers_v2.R - clustering and tSNE visualization for v2 datasets.

    subcluster_seurat_v2.R - subcluster clusters of interest from a Seurat object. Determine variable genes, perform regression and PCA analysis. Used for v2 datasets.

    seurat_object_analysis_v1_and_v2.R - downstream analysis and plotting functions for seurat objects created by seurat_v1.R or seurat_v2.R.

    merge_clusters.R - merge clusters that do not meet a gene threshold. Used for both v1 and v2 datasets.

    prepare_for_monocle_v1.R - subcluster cells of interest and perform linear regression, but not scaling, in order to input normalized, regressed values into monocle with monocle_seurat_input_v1.R.

    monocle_seurat_input_v1.R - monocle script using seurat batch-corrected values as input for v1 merged timecourse datasets.

    monocle_lineage_trace.R - monocle script using nUMI as input for the v2 lineage-traced dataset.

    monocle_object_analysis.R - downstream analysis for the monocle object: BEAM and plotting.

    CCA_merging_v2.R - script for merging v2 endocrine datasets with canonical correlation analysis and determining the number of CCs to include in downstream analysis.

    CCA_alignment_v2.R - script for downstream alignment, clustering, tSNE visualization, and differential gene expression analysis.
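
    A minimal sketch of the merge performed by merge_seurat.R, assuming two hypothetical objects created by seurat_v1.R; Seurat's merge() combines them, and cell-id prefixes keep barcodes unique (batch-effect regression then follows as described above):

      library(Seurat)

      combined <- merge(x = obj_a, y = obj_b,
                        add.cell.ids = c("run1", "run2"),
                        project = "merged_v1")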

  16. Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 21, 2024
    Cite
    Katharina Zinke; Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Katharina Zinke; Katharina Zinke
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

    This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging, and analyses carried out. The documentation of the additional, explorative analysis is also included. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.

    The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication to follow soon).

    ## Data sources

    Folder 01_SourceData/

    - PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

    - ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

    - ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

    - Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

    ## Automatic classification

    Folder 02_AutomaticClassification/

    - (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

    - (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

    - PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

    - oddpub_results_wDOIs.csv (results file of the ODDPub classification)

    - PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

    ## Manual coding

    Folder 03_ManualCheck/

    - CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

    - ManualCheck_2023-06-08.csv (Manual coding results file)

    - PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

    ## Explorative analysis for the discoverability of open data

    Folder 04_FurtherAnalyses/

    - Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

    ## R-Script

    Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)

  17. NCES Academic Library Survey Dataset 1996 - 2020 -- alsMERGE_2020.csv

    • figshare.com
    txt
    Updated Jan 16, 2024
    Cite
    Starr Hoffman (2024). NCES Academic Library Survey Dataset 1996 - 2020 -- alsMERGE_2020.csv [Dataset]. http://doi.org/10.6084/m9.figshare.25007429.v1
    Dataset provided by
    Figshare, http://figshare.com/
    Authors
    Starr Hoffman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data from the National Center for Education Statistics' Academic Library Survey, which was gathered every two years from 1996 - 2014, and annually in IPEDS starting in 2014 (this dataset has continued to only merge data every two years, following the original schedule). This data was merged, transformed, and used for research by Starr Hoffman and Samantha Godbey.

    This data was merged using R; R scripts for this merge can be made available upon request. Some variables changed names or definitions during this time; a view of these variables over time is provided in the related Figshare Project. Carnegie Classification changed several times during this period; all Carnegie classifications were crosswalked to the 2000 classification version; that information is also provided in the related Figshare Project. This data was used for research published in several articles, conference papers, and posters starting in 2018 (some of this research used an older version of the dataset, which was deposited in the University of Nevada, Las Vegas's repository).

    Sources: All data sources were downloaded from the National Center for Education Statistics website https://nces.ed.gov/. Individual datasets and years accessed are listed below.

    [dataset] U.S. Department of Education, National Center for Education Statistics, Academic Libraries component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

    [dataset] U.S. Department of Education, National Center for Education Statistics, Academic Libraries Survey (ALS) Public Use Data File, Library Statistics Program, (2012, 2010, 2008, 2006, 2004, 2002, 2000, 1998, 1996), https://nces.ed.gov/surveys/libraries/aca_data.asp

    [dataset] U.S. Department of Education, National Center for Education Statistics, Institutional Characteristics component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

    [dataset] U.S. Department of Education, National Center for Education Statistics, Fall Enrollment component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014, 2012, 2010, 2008, 2006, 2004, 2002, 2000, 1998, 1996), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

    [dataset] U.S. Department of Education, National Center for Education Statistics, Human Resources component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014, 2012, 2010, 2008, 2006), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

    [dataset] U.S. Department of Education, National Center for Education Statistics, Employees Assigned by Position component, Integrated Postsecondary Education Data System (IPEDS), (2004, 2002), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

    [dataset] U.S. Department of Education, National Center for Education Statistics, Fall Staff component, Integrated Postsecondary Education Data System (IPEDS), (1999, 1997, 1995), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7
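
    A minimal sketch of a two-year merge in the spirit described, assuming hypothetical per-year .csv extracts keyed by the IPEDS institution identifier UNITID:

      als2018 <- read.csv("als_2018.csv")
      als2020 <- read.csv("als_2020.csv")

      als <- merge(als2018, als2020, by = "UNITID",
                   suffixes = c("_2018", "_2020"), all = TRUE)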

  18. Replication Data for: Trajectories of mental health problems in childhood...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Girard, Lisa-Christine; Okolikj, Martin (2023). Replication Data for: Trajectories of mental health problems in childhood and adult voting behaviour: Evidence from the 1970s British Cohort Study [Dataset]. http://doi.org/10.7910/DVN/S6UUBF
    Dataset provided by
    Harvard Dataverse
    Authors
    Girard, Lisa-Christine; Okolikj, Martin
    Description

    This file describes the replication material for: Trajectories of mental health problems in childhood and adult voting behaviour: Evidence from the 1970s British Cohort Study. Authors: Lisa-Christine Girard & Martin Okolikj. Accepted in Political Behavior. This dataverse holds the following 4 replication files:

    1. data_cleaning_traj.R - This file is designed to load, merge and clean the datasets for the estimation of trajectories, along with the rescaling of the age 10 Rutter scale. This file was prepared using R-4.1.1.

    2. traj_estimation.do - With the dataset merged from data_cleaning_traj.R, we run this file in STATA to create and estimate trajectories, to be included in the full dataset. This file was prepared using STATA 17.0.

    3. data_cleaning.R - This file is designed to load, merge and clean all datasets into one in preparation for the main analysis following the trajectory estimation. This file was prepared using R-4.1.1.

    4. POBE Analysis.do - The analysis file is designed to generate the results in the tables of the published paper, along with all supplementary materials. This file was prepared using STATA 17.0.

    The data can be accessed at the following address; it requires user registration under special licence conditions: http://discover.ukdataservice.ac.uk/series/?sn=200001. If you have any questions or spot any errors, please contact g.lisachristine@gmail.com or martin.okolic@gmail.com.

  19. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Sep 25, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (158340417603 bytes)
    Available download formats
    Dataset updated
    Sep 25, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code's author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
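
    Since the merge scripts elsewhere on this page are in R, here is a small R sketch of the id-to-path mapping implied by this layout; note that any zero-padding of folder names and the file extension carried by each id are assumptions, not documented above:

        # Map a KernelVersions id to its folder path under the layout above.
        # Top-level folder = millions bucket; sub-folder = thousands bucket.
        kernel_version_path <- function(id) {
          top <- id %/% 1e6           # 123456789 -> 123
          sub <- (id %/% 1e3) %% 1e3  # 123456789 -> 456
          file.path(top, sub, id)     # folder-name padding is an assumption
        }

        kernel_version_path(123456789)  # "123/456/123456789"

    Because file names match KernelVersions ids, this same id is the join key back to the Meta Kaggle KernelVersions table described above.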

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  20.

    WFDE5 over land merged with ERA5 over the ocean (W5E5) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Dec 20, 2019
    + more versions
    Cite
    (2019). WFDE5 over land merged with ERA5 over the ocean (W5E5) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e0eb4476-e544-5dc4-953b-0e89d5df7c4d
    Explore at:
    Dataset updated
    Dec 20, 2019
    Description

    The W5E5 dataset was compiled to support the bias adjustment of climate input data for the impact assessments carried out in phase 3b of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP3b). Version 1.0 of the W5E5 dataset covers the entire globe at 0.5° horizontal and daily temporal resolution from 1979 to 2016. Data sources of W5E5 are version 1.0 of the WATCH Forcing Data methodology applied to ERA5 data (WFDE5; Weedon et al., 2014; Cucchi et al., 2020), ERA5 reanalysis data (Hersbach et al., 2019), and precipitation data from version 2.3 of the Global Precipitation Climatology Project (GPCP; Adler et al., 2003).

    Variables (with short names and units in brackets) included in the W5E5 dataset are Near Surface Relative Humidity (hurs, %), Near Surface Specific Humidity (huss, kg kg-1), Precipitation (pr, kg m-2 s-1), Snowfall Flux (prsn, kg m-2 s-1), Surface Air Pressure (ps, Pa), Sea Level Pressure (psl, Pa), Surface Downwelling Longwave Radiation (rlds, W m-2), Surface Downwelling Shortwave Radiation (rsds, W m-2), Near Surface Wind Speed (sfcWind, m s-1), Near-Surface Air Temperature (tas, K), Daily Maximum Near Surface Air Temperature (tasmax, K), Daily Minimum Near Surface Air Temperature (tasmin, K), Surface Altitude (orog, m), and WFDE5-ERA5 Mask (mask, 1).

    W5E5 is a merged dataset. It combines WFDE5 data over land with ERA5 data over the ocean. The mask used for the merge is included in the dataset; it is equal to 1 over land and equal to 0 over the ocean.

    Over land, orog is the surface altitude used for elevation corrections in WFDE5. For all other variables already included in WFDE5 (huss, prsn, ps, rlds, rsds, sfcWind, tas), W5E5 data over land are equal to the daily mean values of the corresponding hourly WFDE5 data. W5E5 hurs over land is the daily mean of hourly hurs computed from hourly WFDE5 huss, ps, and tas using the equations of Buck (1981) as described in Weedon et al. (2010). W5E5 pr over land is the daily mean of the sum of hourly WFDE5 rainfall and snowfall. Note that W5E5 pr and prsn over land are based on WFDE5 rainfall and snowfall bias-adjusted using GPCC monthly precipitation totals. W5E5 psl over land is the daily mean of hourly psl computed from hourly WFDE5 orog, ps, and tas according to psl = ps * exp((g * orog) / (r * tas)), where g is gravity and r is the specific gas constant of dry air. Lastly, W5E5 tasmax and tasmin over land are the daily maximum and minimum, respectively, of hourly WFDE5 tas.

    Over the ocean, W5E5 data are based on temporally (from hourly to daily resolution) and spatially (from 0.25° to 0.5° horizontal resolution) aggregated ERA5 data. The spatial aggregation, using first-order conservative remapping, was always done after the temporal aggregation. For tasmax and tasmin, hourly tas values were aggregated to daily maximum and minimum values, respectively. For all other variables, hourly values were aggregated to daily mean values. Variables unavailable in ERA5 (huss, hurs, sfcWind, orog) were first derived from available variables at hourly temporal and 0.25° horizontal resolution and then aggregated like all other variables. huss and hurs were derived from Near Surface Dewpoint Temperature, ps, and tas using the equations of Buck (1981) as described in Buck (2010). sfcWind was derived from Eastward Near-Surface Wind (uas) and Northward Near-Surface Wind (vas) according to sfcWind = sqrt(uas * uas + vas * vas). orog is equal to surface geopotential divided by gravity.

    Lastly, pr and prsn were bias-adjusted such that monthly W5E5 precipitation totals match GPCP version 2.3 values over the ocean. The monthly rescaling factors used for this purpose were computed following the scale-selective rescaling procedure described by Balsamo et al. (2010).
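
    The two inline formulas quoted above translate directly into code; as a small R sketch (the constants g and r are the usual standard values, assumed here rather than taken from the dataset documentation):

        g <- 9.80665   # gravity, m s-2 (assumed standard value)
        r <- 287.058   # specific gas constant of dry air, J kg-1 K-1 (assumed)

        # Sea level pressure from surface pressure, surface altitude, and temperature
        psl <- function(ps, orog, tas) ps * exp((g * orog) / (r * tas))

        # Near-surface wind speed from eastward and northward wind components
        sfcWind <- function(uas, vas) sqrt(uas * uas + vas * vas)

        psl(95000, 500, 288)  # ~100805 Pa for these example inputs
        sfcWind(3, 4)         # 5 m s-1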
