This data set contains NSF/NCAR GV HIAPER 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 30 June 2012. These are updated merges from the NASA DC3 archive that were made available 13 June 2014. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided. This includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg60-gV_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (with the suffix "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
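For illustration, a minimal R sketch for reading one of these ICARTT (.ict) merge files is shown below. It assumes the standard ICARTT convention in which the first line states the number of header lines and the last header line lists the variable names; the example file name is hypothetical.

```r
# Minimal sketch: read one ICARTT (.ict) merge file into an R data frame.
# Assumes the standard ICARTT layout (first line = number of header lines,
# last header line = comma-separated variable names).
read_icartt <- function(path) {
  first_line <- readLines(path, n = 1)
  n_header   <- as.integer(strsplit(first_line, ",")[[1]][1])
  col_names  <- strsplit(readLines(path, n = n_header)[n_header], ",")[[1]]
  read.csv(path, skip = n_header, header = FALSE,
           col.names = trimws(col_names), strip.white = TRUE)
}

# Example call (hypothetical file name following the convention above):
# gv <- read_icartt("dc3-mrg60-gV_merge_20120518_R5.ict")
```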
This data set contains DLR Falcon 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 29 May 2012 through 14 June 2012. These merges were created using data in the NASA DC3 archive as of September 25, 2013. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided. This includes data from all the individual merged flights throughout the mission. The grand merge follows the naming convention "dc3-mrg06-falcon_merge_YYYYMMdd_R2_thruYYYYMMdd.ict" (with the suffix "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
Merging (into an R table) the data published on https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, and joining two other sources of information associated with the marketing authorisations (AMM):
— uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/
— information on the "Biocontrol" status of the product, from document DGAL/SDQSPV/2020-784 published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole
All the initial files (.csv transformed into .txt), the R code used to merge the data, and the different output files are collected in a zip archive.
NB:
1) "YASCUB" stands for {year, AMM, Substance_active, Classification, Usage, Statut_"BioControl"}; substances not on the DGAL/SDQSPV list are coded NA.
2) The biocontrol products file was cleaned of the duplicates generated by marketing authorisations that cover several trade names.
3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region, Region, Dept, Code_Dept, Anne, Usage, Classification, Type_BioC, Quantite_substance}.
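As an illustration of the workflow, a minimal R sketch of the merge logic is given below. The file and column names (BNVD_ventes.txt, usages_AMM.txt, biocontrole.txt, the AMM key) are hypothetical placeholders; the actual names are those used in the zip archive and its R code.

```r
# A minimal sketch of the merge described above, with hypothetical file and
# column names standing in for the actual ones in the zip archive.
library(dplyr)

ventes  <- read.table("BNVD_ventes.txt", header = TRUE, sep = "\t")  # pesticide sales by departement
usages  <- read.table("usages_AMM.txt",  header = TRUE, sep = "\t")  # uses per marketing authorisation (AMM)
biocont <- read.table("biocontrole.txt", header = TRUE, sep = "\t")  # DGAL/SDQSPV biocontrol list

# Keep one row per AMM in the biocontrol list (duplicates arise from several
# trade names per authorisation), then join; products absent from the
# DGAL/SDQSPV list end up with NA in the biocontrol columns.
bnvd_bioc <- ventes %>%
  left_join(usages, by = "AMM") %>%
  left_join(distinct(biocont, AMM, .keep_all = TRUE), by = "AMM")

write.table(bnvd_bioc, "BNVD_BioC_DY3.txt", sep = "\t", row.names = FALSE)
```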
https://creativecommons.org/publicdomain/zero/1.0/
By SocialGrep [source]
The "stonks" movement spawned by this is a very interesting one. It's rare to see an Internet meme have such an effect on the real-world economy - yet here we are.
This dataset contains a collection of posts and comments mentioning GME in their title and body text respectively. The data is procured using SocialGrep. The posts and the comments are labelled with their score.
With this new dataset, it will be interesting to see how this affected stock market prices in the aftermath.
The files contain Reddit posts and comments mentioning GME, along with their scores. They can be used to analyze how sentiment on GME affected its stock price in the aftermath.
- To study how social media affects stock prices
- To study how Reddit affects stock prices
- To study how the sentiment of a subreddit affects stock prices
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: six-months-of-gme-on-reddit-comments.csv

| Column name    | Description                                            |
|:---------------|:-------------------------------------------------------|
| type           | The type of post or comment. (String)                  |
| subreddit.name | The name of the subreddit. (String)                    |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
| created_utc    | The time the post or comment was created. (Timestamp)  |
| permalink      | The permalink of the post or comment. (String)         |
| body           | The body of the post or comment. (String)              |
| sentiment      | The sentiment of the post or comment. (String)         |
| score          | The score of the post or comment. (Integer)            |
File: six-months-of-gme-on-reddit-posts.csv

| Column name    | Description                                            |
|:---------------|:-------------------------------------------------------|
| type           | The type of post or comment. (String)                  |
| subreddit.name | The name of the subreddit. (String)                    |
| subreddit.nsfw | Whether the subreddit is NSFW. (Boolean)               |
| created_utc    | The time the post or comment was created. (Timestamp)  |
| permalink      | The permalink of the post or comment. (String)         |
| score          | The score of the post or comment. (Integer)            |
| domain         | The domain of the post or comment. (String)            |
| url            | The URL of the post or comment. (String)               |
| selftext       | The selftext of the post or comment. (String)          |
| title          | The title of the post or comment. (String)             |
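For illustration, a minimal R sketch of working with the two files is shown below; it assumes the files sit in the working directory under the names above, and that created_utc is a Unix epoch (adjust the conversion if it is a formatted timestamp).

```r
# A minimal sketch: daily comment volume, mean sentiment, and mean score.
library(dplyr)

comments <- read.csv("six-months-of-gme-on-reddit-comments.csv")
posts    <- read.csv("six-months-of-gme-on-reddit-posts.csv")

daily <- comments %>%
  # created_utc is assumed to be a Unix epoch; adjust if it is a date string
  mutate(date = as.Date(as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC"))) %>%
  group_by(date) %>%
  summarise(n_comments     = n(),
            mean_sentiment = mean(as.numeric(sentiment), na.rm = TRUE),
            mean_score     = mean(score, na.rm = TRUE))

head(daily)
```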
If you use this dataset in your research, please credit the original authors, SocialGrep.
KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements, including both in-situ and remote sensing instruments.

Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023
This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging, and analyses carried out. The documentation of the additional, explorative analysis is also included. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.
The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).
## Data sources
Folder 01_SourceData/
- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)
- ScopusSearch_ExportResults.csv (export of the Scopus search results, CSV)
- ScopusSearch_ExportResults.ris (export of the Scopus search results, RIS)
- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)
## Automatic classification
Folder 02_AutomaticClassification/
- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)
- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)
- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)
- oddpub_results_wDOIs.csv (results file of the ODDPub classification)
- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
## Manual coding
Folder 03_ManualCheck/
- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)
- ManualCheck_2023-06-08.csv (Manual coding results file)
- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)
## Explorative analysis for the discoverability of open data
Folder 04_FurtherAnalyses/
- Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German
## R-Script
Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
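For orientation, a minimal R sketch of the kind of merge performed in the script is shown below; the join column ("doi") is an assumption, and the actual column names are documented in the files themselves and in Analyses_MA_OpenDataMonitoring.R.

```r
# A minimal sketch (not the thesis script): merge the ODDPub classification
# results with the PLOS-OSI dataset for publications contained in both.
library(dplyr)

oddpub <- read.csv("02_AutomaticClassification/oddpub_results_wDOIs.csv")
plos   <- read.csv("01_SourceData/PLOS-Dataset_v2_Mar23.csv")

# "doi" is an assumed column name present in both files
plos_oddpub <- inner_join(oddpub, plos, by = "doi")
write.csv(plos_oddpub, "02_AutomaticClassification/PLOS_ODDPub.csv", row.names = FALSE)
```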
https://spdx.org/licenses/CC0-1.0.html
What is the relationship between environment and democracy? The framework of cultural evolution suggests that societal development is an adaptation to ecological threats. Pertinent theories assume that democracy emerges as societies adapt to ecological factors such as higher economic wealth, lower pathogen threats, less demanding climates, and fewer natural disasters. However, previous research confused within-country processes with between-country processes and erroneously interpreted between-country findings as if they generalize to within-country mechanisms. In this article, we analyze a time-series cross-sectional dataset to study the dynamic relationship between environment and democracy (1949-2016), accounting for previous misconceptions in levels of analysis. By separating within-country processes from between-country processes, we find that the relationship between environment and democracy not only differs by countries but also depends on the level of analysis. Economic wealth predicts increasing levels of democracy in between-country comparisons, but within-country comparisons show that democracy declines as countries become wealthier over time. This relationship is only prevalent among historically wealthy countries but not among historically poor countries, whose wealth also increased over time. By contrast, pathogen prevalence predicts lower levels of democracy in both between-country and within-country comparisons. Our longitudinal analyses identifying temporal precedence reveal that not only reductions in pathogen prevalence drive future democracy, but also democracy reduces future pathogen prevalence and increases future wealth. These nuanced results contrast with previous analyses using narrow, cross-sectional data. As a whole, our findings illuminate the dynamic process by which environment and democracy shape each other.
Methods: Our Time-Series Cross-Sectional data combine various online databases. Country names were first identified and matched using the R package "countrycode" (Arel-Bundock, Enevoldsen, & Yetman, 2018) before all datasets were merged. Occasionally, we modified unidentified country names to be consistent across datasets. We then transformed "wide" data into "long" data and merged them using R's Tidyverse framework (Wickham, 2014). Our analysis begins with the year 1949, because one of the key time-variant level-1 variables, pathogen prevalence, was only available from 1949 onward. See our Supplemental Material for all data, Stata syntax, R markdown for visualization, supplemental analyses, and detailed results (available at https://osf.io/drt8j/).
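For illustration, a minimal R sketch of the harmonization and merge steps described above is shown below; the file and column names are hypothetical placeholders for the actual databases.

```r
# A minimal sketch: harmonize country identifiers, reshape wide to long,
# and merge on country-year. File and column names are hypothetical.
library(dplyr)
library(tidyr)
library(countrycode)

democracy <- read.csv("democracy_wide.csv")        # one row per country, one column per year
pathogens <- read.csv("pathogen_prevalence.csv")   # long format: country, year, pathogen_prevalence

# Harmonize country identifiers before merging
democracy$iso3 <- countrycode(democracy$country, origin = "country.name", destination = "iso3c")
pathogens$iso3 <- countrycode(pathogens$country, origin = "country.name", destination = "iso3c")

# Reshape the year columns (X1949, X1950, ...) into long format, then merge
democracy_long <- democracy %>%
  pivot_longer(starts_with("X"), names_to = "year", values_to = "democracy",
               names_prefix = "X", names_transform = list(year = as.integer))

panel <- democracy_long %>%
  left_join(select(pathogens, iso3, year, pathogen_prevalence), by = c("iso3", "year")) %>%
  filter(year >= 1949)
```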
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd).
We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
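As a small orientation, a minimal R sketch of loading the R data record and joining two modules is shown below; the object names inside the .RData file and the participant identifier column (assumed here to be SEQN) should be verified against the data dictionary.

```r
# A minimal sketch, not the curation pipeline: load the curated R data record
# and join two modules on the participant identifier (assumed to be SEQN).
library(dplyr)

load("w - nhanes_1988_2018.RData")   # loads the curated modules as R objects
ls()                                 # inspect which objects were loaded

# Hypothetical example: join demographics with mortality on the participant ID
demo_mort <- left_join(demographics, mortality, by = "SEQN")
```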
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
These Kaggle datasets offer a comprehensive analysis of the US real estate market, leveraging data sourced from Redfin via an unofficial API. They contain weekly snapshots stored in CSV files, reflecting the dynamic nature of property listings, prices, and market trends across various states and cities, except for Wyoming, Montana, and North Dakota, and with specific data generation for Texas cities. Notably, the collection includes a prepared version, USA_clean_unique, which has undergone initial cleaning steps as outlined in the thesis. These datasets were part of my thesis; the other two countries were France and the UK.
These steps include:
- Removal of irrelevant features for statistical analysis.
- Renaming variables for consistency across international datasets.
- Adjustment of variable value ranges for a more refined analysis.
Unique aspects such as Redfin’s “hot” label algorithm, property search status, and detailed categorizations of property types (e.g., single-family residences, condominiums/co-ops, multi-family homes, townhouses) provide deep insights into the market. Additionally, external factors like interest rates, stock market volatility, unemployment rates, and crime rates have been integrated to enrich the dataset and offer a multifaceted view of the real estate market's drivers.
The USA_clean_unique dataset represents a key step before data normalization/trimming, containing variables both in their raw form and categorized based on predefined criteria, such as property size, year of construction, and number of bathrooms/bedrooms. This structured approach aims to capture the non-linear relationships between various features and property prices, enhancing the dataset's utility for predictive modeling and market analysis.
See columns from USA_clean_unique.csv and my Thesis (Table 2.8) for exact column descriptions.
Table 2.4 and Section 2.2.3, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my Github Repository for code inspiration.
Let me know if you want to see how I got from raw data to USA_clean_unique.csv. Multiple steps include cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming columns for consistency.
analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.
this new github repository contains five scripts:

- 1992 - 2010 download HRS microdata.R: loop through every year and every file, download, then unzip everything in one big party
- import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk, then load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
- longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program, create two database-backed complex sample survey objects using a taylor-series linearization design, and perform a mountain of analysis examples with wave weights from two different points in the panel
- import example HRS file.R: load a fixed-width file directly into ram using only the sas importation script with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html), parse through the IF block at the bottom of the sas importation script, blank out a number of variables, and save the file as an R data file (.rda) for fast loading later
- replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program, create a database-backed complex sample survey object using a taylor-series linearization design, and exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document.

click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs.

notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
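for flavor, here's a minimal sketch (not the repository's scripts) of what a database-backed, taylor-series linearized survey object looks like with the survey package; the table name and the psu / stratum / weight / analysis variable names are assumptions to check against the rand hrs codebook.

```r
# a minimal sketch: database-backed complex sample survey design with the
# survey package; variable and table names below are assumed placeholders.
library(survey)
library(RSQLite)

hrs_design <- svydesign(
  ids     = ~raehsamp,           # primary sampling unit (assumed name)
  strata  = ~raestrat,           # stratum (assumed name)
  weights = ~r10wtresp,          # one wave's respondent weight (assumed name)
  nest    = TRUE,
  dbtype  = "SQLite",
  dbname  = "rand_hrs.db",       # database built by the import script
  data    = "rand_contributed"   # table inside that database (assumed name)
)

# weighted mean of a hypothetical variable
svymean(~r10agey_e, hrs_design, na.rm = TRUE)
```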
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
"code" stores the 9 code chunks described below, used to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
"built_up" contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: "raw_data" contains the unprocessed datasets and "derived_data" stores the cropped built_up datasets at 5-year intervals, e.g., "Levant_built_up_1975.tif".
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data (after registration) from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication "NDVI". Because the study area is quite large, we have to load three spatially separate time series and merge them later. Note that the time series are temporally consistent.
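A minimal sketch of this extraction step might look as follows; folder names are placeholders, and the subdataset index should be checked with describe() on one file.

```r
# A minimal sketch: extract the NDVI subdataset from each MOD13Q1 .hdf file
# and save it as a single .tif (folder names are placeholders).
library(terra)

hdf_files <- list.files("MODIS_raw", pattern = "\\.hdf$", full.names = TRUE)
dir.create("MODIS_NDVI", showWarnings = FALSE)

for (f in hdf_files) {
  # NDVI is typically the first subdataset of MOD13Q1; verify with describe(f)
  ndvi <- rast(f, subds = 1)
  out  <- file.path("MODIS_NDVI", paste0(tools::file_path_sans_ext(basename(f)), "_NDVI.tif"))
  writeRaster(ndvi, out, overwrite = TRUE)
}
```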
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we first merge two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called "merged" and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
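A minimal sketch of this merging step, with placeholder folder names:

```r
# A minimal sketch: merge the three MODIS tiles date by date.
library(terra)
library(gtools)

f1 <- mixedsort(list.files("tile_h20v05", pattern = "_NDVI\\.tif$", full.names = TRUE))
f2 <- mixedsort(list.files("tile_h21v05", pattern = "_NDVI\\.tif$", full.names = TRUE))
f3 <- mixedsort(list.files("tile_h21v06", pattern = "_NDVI\\.tif$", full.names = TRUE))

setwd("your directory_MODIS/merged")   # output folder created beforehand

for (i in seq_along(f1)) {
  m12 <- merge(rast(f1[i]), rast(f2[i]))   # merge the first two tiles
  m   <- merge(m12, rast(f3[i]))           # then add the third
  writeRaster(m, paste0("NDVI_final_", i, ".tif"), overwrite = TRUE)
}
```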
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We use a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them "NDVI_merged_clip_*consecutivenumber*.tif". We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
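A minimal sketch of the cropping step, using the mask shapefile provided in the repository:

```r
# A minimal sketch: crop and mask the merged NDVI files to the study area.
library(terra)
library(gtools)

mask_vec     <- vect("MERGED_LEVANT.shp")
merged_files <- mixedsort(list.files("merged", pattern = "^NDVI_final_.*\\.tif$", full.names = TRUE))

for (i in seq_along(merged_files)) {
  r  <- rast(merged_files[i])
  rc <- mask(crop(r, mask_vec), mask_vec)   # crop to extent, then mask to the outline
  writeRaster(rc, paste0("NDVI_merged_clip_", i, ".tif"), overwrite = TRUE)
}
```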
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain a 16-day return period across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values at a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
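As a small illustration, the z-score normalization applied to the annual series (NDVI growing-season sums and GLDAS variables alike) amounts to:

```r
# A minimal sketch of the z-score normalization: deviation from the mean
# in units of standard deviations.
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Example with a hypothetical annual series of growing-season NDVI sums:
ndvi_sums <- c(102, 98, 110, 95, 120, 105)
z_score(ndvi_sums)
```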
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). One can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity data from the supplement in the repository "yield_productivity" (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted with ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format, and various variables can be extracted using the ["^a variable name"] command on the SpatRaster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the SpatRaster collection.
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95% confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe, thanks to the availability of the GLDAS variables.
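A minimal sketch of reading and aggregating one GLDAS variable is shown below; the variable name is only an example and must be matched to the names listed in the GLDAS documentation (or inspected via names() / print(nc) as described above).

```r
# A minimal sketch: stack one GLDAS variable across monthly files, crop to
# the study area, and compute annual sums.
library(terra)
library(gtools)

mask_vec <- vect("MERGED_LEVANT.shp")
nc_files <- mixedsort(list.files("GLDAS", pattern = "\\.nc4?$", full.names = TRUE))

# Variable name is an assumption; check names(sds(nc_files[1])) first
rain     <- rast(lapply(nc_files, function(f) rast(f, subds = "Rainf_tavg")))
rain_lev <- mask(crop(rain, mask_vec), mask_vec)

# Annual sums from monthly layers (divide by 12 instead for variables stored as means)
years    <- rep(1975:2022, each = 12)[seq_len(nlyr(rain_lev))]
rain_ann <- tapp(rain_lev, years, fun = sum)
```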
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
These Kaggle datasets provide real estate listings downloaded from the French real estate market, capturing data from a leading platform in France (Seloger), reminiscent of the approach taken for the US dataset from Redfin and the UK dataset from Zoopla. They encompass detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.
The cleaning process mirrored that of the US dataset, involving the removal of irrelevant features, normalizing variable names for consistency with the USA and UK datasets, and adjusting variable value ranges to remove extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.
For exact column descriptions, see columns for France_clean_unique.csv and my thesis.
Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my Github Repository for code inspiration.
Let me know if you want to see how I got from raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The scripts in this folder were used to combine all call statistic files per day into one file, resulting in nine files containing all call statistics per day. The script 'merging_dataset.R' was used to combine all days' worth of call statistics and create subsets of two frequency ranges (18-32 and 32-96). The script 'camera_data' was used to combine all camera and observation data.
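For illustration, a minimal R sketch of this kind of merging and subsetting is shown below; the folder, file pattern, and column names are hypothetical placeholders, and the actual logic lives in 'merging_dataset.R'.

```r
# A minimal sketch: combine daily call-statistics files and split by frequency range.
library(dplyr)

files    <- list.files("call_statistics", pattern = "\\.csv$", full.names = TRUE)
all_days <- bind_rows(lapply(files, read.csv))

# Subsets of the two frequency ranges described above ("frequency" is an assumed column)
calls_18_32 <- filter(all_days, frequency >= 18, frequency < 32)
calls_32_96 <- filter(all_days, frequency >= 32, frequency <= 96)
```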
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These are the streamwise velocity time series measured in the wakes of two sets of porous discs in a side-by-side setting, as used in the manuscript "Wake merging and turbulence transition downstream of side-by-side porous discs", which has been accepted by the Journal of Fluid Mechanics. Data were obtained by means of hot-wire anemometry in the Large Scale Wind Tunnel at the Norwegian University of Science and Technology in near-laminar inflow (background turbulence intensity of approximately 0.3%) at an inflow velocity of 10 m/s (diameter-based Reynolds number 125,000). Two types of porous discs with diameter D = 0.2 m, one with uniform blockage and one with radially changing blockage, were used. Three spacings, namely 1.5D, 2D and 3D, were investigated. Spanwise profiles were measured at 8D and 30D downstream for each case, and a streamwise profile along the centerline between the discs was additionally obtained. In addition, measurements downstream of both disc types in a single-disc setting are provided for comparison. The scope of these experiments was to study the merging mechanisms of the turbulence when the two wakes meet.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R Script used to assess if the merging of sequence data from different sequencing centers was successful, and to reorganize count table data columns and average technical replicates.
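For illustration, a minimal R sketch of the replicate-averaging step is shown below; it assumes a hypothetical count table and naming scheme in which technical replicates share a sample name followed by "_rep1", "_rep2", etc., whereas the actual script defines its own scheme.

```r
# A minimal sketch: average technical replicates in a count table
# (hypothetical file name and replicate naming scheme).
counts <- read.csv("count_table.csv", row.names = 1)

sample_ids <- sub("_rep[0-9]+$", "", colnames(counts))
averaged   <- sapply(unique(sample_ids),
                     function(s) rowMeans(counts[, sample_ids == s, drop = FALSE]))
```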
This data set contains NASA DC-8 1 Second Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. These merges are an updated version provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. No "grand merge" has been provided for the 1-second data on the DC-8 aircraft due to its prohibitive size (~1.5 GB). In most cases, downloading the individual merge files for each day and simply concatenating them should suffice. This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments. For more information on the updates to this dataset, please see the readme file.
This data set contains merged bottle data from the SBI cruise on the United States Coast Guard Cutter (USCGC) Nathaniel B. Palmer (NBP03-04a). During this cruise, rosette casts were conducted, and a bottle data file was generated from these water samples by the Scripps Service group. Additional groups were funded to measure supplementary parameters from the same water samples. This data set is the first version of the merge of the Scripps Service group bottle data file with the data gathered by these additional groups.
https://creativecommons.org/publicdomain/zero/1.0/
**Introduction** This case study is based on Cyclistic, a bike sharing company in Chicago. I will perform the tasks of a junior data analyst to answer business questions. I will do this by following a process that includes the following phases: ask, prepare, process, analyze, share, and act.
**Background** Cyclistic is a bike sharing company that operates 5,828 bikes across 692 docking stations. The company has been around since 2016 and separates itself from the competition by offering a variety of bike services, including assistive options. Lily Moreno is the director of the marketing team and will receive the insights from this analysis.
**Case study and business task** Lily Moreno's view of how to generate more income by marketing Cyclistic's services correctly is to convert casual riders (single-day passes and/or pay-per-ride customers) into annual members. Annual riders are more profitable than casual riders, according to the finance analysts. She would rather see a campaign converting casual riders into annual riders than campaigns targeting new customers. So her strategy as the manager of the marketing team is simply to maximize the number of annual riders by converting casual riders.
In order to make a data-driven decision, Moreno needs the following insights:
- A better understanding of how casual riders and annual riders differ
- Why a casual rider would become an annual one
- How digital media can affect the marketing tactics
Moreno has directed me to the first question - how do casual riders and annual riders differ?
**Stakeholders** Lily Moreno (manager of the marketing team), the Cyclistic marketing team, and the executive team.
**Data sources and organization** Data used in this report is made available and licensed by Motivate International Inc. Personal data is hidden to protect personal information. The data used covers the past 12 months (01/04/2021 – 31/03/2022) of the bike share dataset.
By merging all 12 months of bike share data provided, an extensive dataset of roughly 5,400,000 rows was returned and included in this analysis.
Data security and limitations: Personal information is secured and hidden to prevent unlawful use. Original files are backed up in folders and subfolders.
**Tools and documentation of the cleaning process** The tools used for data verification and data cleaning are Microsoft Excel and R. The original files made accessible by Motivate International Inc. are backed up in their original format and in separate files.
Microsoft Excel was used to look through the dataset and get an overview of the content. I performed simple checks of the data by filtering, sorting, formatting, and standardizing the data to make it easily mergeable. In Excel, I also changed data types to the right format, removed unnecessary data where it was incomplete or incorrect, created new columns by subtracting and reformatting existing columns, and deleted empty cells. These tasks are easily done in spreadsheets and provide an initial cleaning of the data.
R will be used to perform queries of bigger datasets such as this one. R will also be used to create visualizations to answer the question at hand.
**Limitations** Microsoft Excel has a limit of 1,048,576 rows, while the 12 months of data combined come to over 5,500,000 rows. When combining the 12 months of data into one table/sheet, Excel is no longer efficient, so I switched over to R.
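For illustration, a minimal R sketch of combining the 12 monthly files is shown below; the folder name is a placeholder and the member/casual column name is an assumption to be checked against the actual files.

```r
# A minimal sketch: read the 12 monthly trip files and combine them into one table.
library(dplyr)
library(readr)

files <- list.files("trip_data", pattern = "\\.csv$", full.names = TRUE)  # 12 monthly exports
rides <- bind_rows(lapply(files, read_csv))

nrow(rides)   # several million rows for 04/2021-03/2022

# First look at the business question: how do casual and member riders differ?
rides %>% count(member_casual)   # column name assumed
```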
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) state derived from satellite observations in the microwave domain.
The operational (ACTIVE, PASSIVE, COMBINED) ESA CCI SM products are available at https://catalogue.ceda.ac.uk/uuid/c256fcfeef24460ca6eb14bf0fe09572/
Understanding whether the soil surface is frozen or thawed is crucial for interpreting satellite-based soil moisture measurements and for many Earth system applications. The physical state of water in the soil strongly affects its dielectric properties, which in turn determine how satellites sense moisture content. Current ESA CCI Soil Moisture products exclude data when the surface is likely frozen, as reliable retrievals are not possible under such conditions. Yet, the freeze/thaw state itself carries valuable environmental information: it reflects the changing energy and water exchange between land and atmosphere, shapes seasonal hydrological cycles, and influences agriculture, ecosystems, and climate feedbacks across much of the Northern Hemisphere.
This dataset provides global estimates of the soil moisture freeze/thaw state for the period from 11-1978 to 12-2024 derived from PASSIVE (radiometer) satellite observations within the ESA CCI Soil Moisture framework. These radiometers, operating in the K-band frequency range, are sensitive to surface temperature, enabling the detection of frozen versus thawed conditions at daily temporal and ~25 km spatial sampling. Data from L-band missions (e.g., SMAP, SMOS) are not included, resulting in a total number of 12 satellites.
The classification algorithm, described in Van der Vliet et al. (2020), was originally developed to flag frozen conditions in soil moisture retrievals and has since evolved into a dedicated data product. It applies a decision-tree approach using multi-frequency satellite measurements to classify the surface state for each sensor. Individual classifications are then merged into a single spatiotemporal record using a conservative unanimity rule—if any contributing satellite detects a frozen surface, the merged product is classified as “frozen.”
While this approach ensures reliability, it may lead to some over-flagging, which could be refined in future versions. The current product achieves an estimated accuracy of 75% against in situ surface temperature observations and 92% compared to ERA5 reanalysis data.
You can use command-line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/m3g2x-a6958/files"
# Loop through years 1978 to 2024 and download & extract data
for year in {1978..2024}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1978-2024 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD) and month (MM) of that year in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name follows the convention:
ESACCI-SOILMOISTURE-L3S-FT-YYYYMMDD000000-fv09.2.nc
Each netCDF file contains 3 coordinate variables and a set of data variables; additional information for each variable is given in the netCDF attributes.
Changes in v9.2 (first released version):
These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files.
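For example, a minimal R sketch of opening one daily file with the terra package is shown below; the example date in the file name is arbitrary within the covered period, and the data variable names can be inspected via names() or the netCDF attributes.

```r
# A minimal sketch: read one daily freeze/thaw file and take a quick look.
library(terra)

ft <- rast("ESACCI-SOILMOISTURE-L3S-FT-19790101000000-fv09.2.nc")
names(ft)      # list the data variables contained in the file
plot(ft[[1]])  # quick look at the first variable on the 0.25° grid
```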
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data from the National Center for Education Statistics' Academic Library Survey, which was gathered every two years from 1996 - 2014, and annually in IPEDS starting in 2014 (this dataset has continued to only merge data every two years, following the original schedule). This data was merged, transformed, and used for research by Starr Hoffman and Samantha Godbey.This data was merged using R; R scripts for this merge can be made available upon request. Some variables changed names or definitions during this time; a view of these variables over time is provided in the related Figshare Project. Carnegie Classification changed several times during this period; all Carnegie classifications were crosswalked to the 2000 classification version; that information is also provided in the related Figshare Project. This data was used for research published in several articles, conference papers, and posters starting in 2018 (some of this research used an older version of the dataset which was deposited in the University of Nevada, Las Vegas's repository).SourcesAll data sources were downloaded from the National Center for Education Statistics website https://nces.ed.gov/. Individual datasets and years accessed are listed below.[dataset] U.S. Department of Education, National Center for Education Statistics, Academic Libraries component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7[dataset] U.S. Department of Education, National Center for Education Statistics, Academic Libraries Survey (ALS) Public Use Data File, Library Statistics Program, (2012, 2010, 2008, 2006, 2004, 2002, 2000, 1998, 1996), https://nces.ed.gov/surveys/libraries/aca_data.asp[dataset] U.S. Department of Education, National Center for Education Statistics, Institutional Characteristics component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7[dataset] U.S. Department of Education, National Center for Education Statistics, Fall Enrollment component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014, 2012, 2010, 2008, 2006, 2004, 2002, 2000, 1998, 1996), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7[dataset] U.S. Department of Education, National Center for Education Statistics, Human Resources component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014, 2012, 2010, 2008, 2006), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7[dataset] U.S. Department of Education, National Center for Education Statistics, Employees Assigned by Position component, Integrated Postsecondary Education Data System (IPEDS), (2004, 2002), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7[dataset] U.S. Department of Education, National Center for Education Statistics, Fall Staff component, Integrated Postsecondary Education Data System (IPEDS), (1999, 1997, 1995), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7