License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software environment and object-oriented programming language, with a development environment (IDE) called RStudio, for computing statistics and producing graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions specifying how the program should behave while handling the data; these can also be stored in R's simple object system.

For all intents and purposes, this book serves as both textbook and manual for R statistics, particularly in academic research, data analytics, and computer programming, intended to inform and guide the work of R users and statisticians. It provides information about the different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide to identifying and conducting the different parametric and non-parametric procedures. This includes a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and guidance on how to interpret their results. The book also covers the different data formats and sources, and how to test the reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description of, and step-by-step practical hands-on guide to, the different types of statistical analysis in R for research purposes, with examples: from importing and storing datasets in R as objects, coding and calling the methods or functions for manipulating those datasets or objects, factorization, and vectorization, to the reasoning about, interpretation, and storage of results for future use, and their graphical visualization and representation. Thus, the book is a congruence of statistics and computer programming for research.
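To give a flavour of the workflow the book teaches, here is a minimal hedged R sketch, not taken from the book itself; the file name survey_results.csv and its columns are hypothetical. It imports a dataset, stores it as an object, and defines a customized function that is then applied in vectorized fashion:

```r
# Import a dataset and store it as an object (hypothetical file name)
survey <- read.csv("survey_results.csv")

# A customized function: pair the mean of a numeric vector with its standard error
mean_se <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x), se = sd(x) / sqrt(length(x)))
}

# Vectorized application over every numeric column of the object
sapply(Filter(is.numeric, survey), mean_se)
```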
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
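As an indication of the kind of exercise the Iris module walks through (the case study's own code may differ), a minimal base-R sketch using the built-in iris dataset:

```r
# Summary statistics for each column of the built-in iris dataset
summary(iris)

# Correlation between sepal length and petal length
cor(iris$Sepal.Length, iris$Petal.Length)

# Histogram of sepal length
hist(iris$Sepal.Length, main = "Sepal length", xlab = "Length (cm)")

# Scatter plot of petal length against sepal length, coloured by species
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(iris$Species),
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
```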
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides monthly totals of US airline passengers from 1949 to 1960. It is the built-in R dataset AirPassengers. Analysts often employ various statistical techniques, such as decomposition, smoothing, and forecasting models, to analyze patterns, trends, and seasonal fluctuations within the data. Due to its historical nature and consistent temporal granularity, the AirPassengers dataset serves as a valuable resource for researchers, practitioners, and students in the fields of statistics, econometrics, and transportation planning.
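For example, a classical decomposition and a simple smoothing-based forecast of the series can be produced entirely in base R:

```r
data(AirPassengers)

# Classical multiplicative decomposition into trend, seasonal and remainder components
plot(decompose(AirPassengers, type = "multiplicative"))

# Holt-Winters exponential smoothing with a 24-month forecast
fit <- HoltWinters(AirPassengers, seasonal = "multiplicative")
plot(fit, predict(fit, n.ahead = 24))
```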
The CO2 data frame is a dataset built into R showing the results of an experiment on the cold tolerance of grass. Grass samples from two regions were grown in either a chilled or nonchilled environment, and their CO2 uptake rate was tested. The dataset has been downloaded as a .csv file.
The two regions chosen for this experiment are Quebec and Mississippi. For each region, three plants were grown in a chilled environment and three in a nonchilled environment. The ambient CO2 concentration levels are the same across all categories, so the plots are made with respect to the variation in average CO2 uptake.
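A short base-R sketch of that comparison, averaging uptake over the shared concentration levels for each region and treatment:

```r
data(CO2)

# Mean CO2 uptake for each Type x Treatment combination,
# averaged over the common ambient-concentration levels
aggregate(uptake ~ Type + Treatment, data = CO2, FUN = mean)

# Visualize the variation in average uptake across the categories
interaction.plot(CO2$Treatment, CO2$Type, CO2$uptake,
                 xlab = "Treatment", ylab = "Mean CO2 uptake",
                 trace.label = "Region")
```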
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By SocialGrep [source]
A subreddit dataset is a collection of posts and comments made on one of Reddit's boards. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.
In order to use this dataset, you will need a text editor or an application such as Microsoft Word or LibreOffice installed on your computer, as well as a web browser such as Google Chrome or Mozilla Firefox.
Once you have the necessary software installed, open the Reddit Dataset folder and double-click the the-reddit-dataset-dataset-posts.csv file to open it in your preferred editor.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
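Those calculations are easier to script than to do by hand in an editor. A hedged R sketch, assuming the CSV sits in the working directory and using the column names from the file descriptions below:

```r
posts <- read.csv("the-reddit-dataset-dataset-posts.csv")

# Average score across all posts
mean(posts$score, na.rm = TRUE)

# Average score by sentiment label, to probe the sentiment/score relationship
aggregate(score ~ sentiment, data = posts, FUN = mean)

# Posting volume over time (created_utc is a Unix timestamp)
posts$year <- format(as.POSIXct(posts$created_utc, origin = "1970-01-01"), "%Y")
barplot(table(posts$year), ylab = "Posts per year")
```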
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiment of posts and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication - No Copyright. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: the-reddit-dataset-dataset-comments.csv

| Column name    | Description                                         |
|:---------------|:----------------------------------------------------|
| type           | The type of post. (String)                          |
| subreddit.name | The name of the subreddit. (String)                 |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
| created_utc    | The time the post was created, in UTC. (Timestamp)  |
| permalink      | The permalink for the post. (String)                |
| body           | The body of the post. (String)                      |
| sentiment      | The sentiment of the post. (String)                 |
| score          | The score of the post. (Integer)                    |
File: the-reddit-dataset-dataset-posts.csv

| Column name    | Description                                         |
|:---------------|:----------------------------------------------------|
| type           | The type of post. (String)                          |
| subreddit.name | The name of the subreddit. (String)                 |
| subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
| created_utc    | The time the post was created, in UTC. (Timestamp)  |
| permalink      | The permalink for the post. (String)                |
| score          | The score of the post. (Integer)                    |
| domain         | The domain of the post. (String)                    |
| url            | The URL of the post. (String)                       |
| selftext       | The self-text of the post. (String)                 |
| title          | The title of the post. (String)                     |
If you use this dataset in your research, please credit SocialGrep.
Zooplankton biomass data collected from the North Atlantic Ocean in 1958-1959, received from NMFS.
Temperature, salinity and other measurements found in the XBT and CTD datasets taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 02/02/1996) in the North Pacific, Coastal N Pacific and other locations in 1999 (NODC Accession 0000857). Data were submitted by the Japan Agency for Marine-Earth Science and Technology (JAMSTEC).
The data presented in this data file are a product of a journal publication. The dataset contains PCB sorption concentrations on encapsulants; PCB concentrations in air and in wipe samples; model simulations of the PCB concentration gradient in the source and encapsulant layers, on exposed surfaces of encapsulants, and in room air at different times; and a ranking of the encapsulants' performance. This dataset is associated with the following publication: Liu, X., Z. Guo, K. Krebs, N. Roache, R. Stinson, J. Nardin, R. Pope, C. Mocka, and R. Logan. Laboratory evaluation of PCBs encapsulation method. Indoor and Built Environment. Sage Publications, Thousand Oaks, CA, USA, 25(6): 895-915, (2016).
License: GPL 2.0, https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
-----------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below.
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI).
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34GB unpacked. This dataset still doesn't include the PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**).
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published.
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2TB of disk space (see Step 2 detail levels)
- at least 16GB of RAM (64GB preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

  ```
  git clone https://gitlab.com/user2589/ghd.git
  git checkout 0.1.0
  ```

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/__init__.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

```python
# copy and paste into a Python console
from common import utils
survival_data = utils.survival_data('pypi', '2008', smoothing=6)
survival_data.to_csv('survival_data.csv')
```

Since full replication would take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan through the influx of Syrian refugees and increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored within it as zipped files:
“code” stores the 9 code chunks described below, used to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n = 510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and covers the extraction of MODIS data from the .hdf file format. The packages listed below must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data (after registration) from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially distinct time series and merge them later. Note that the time series are temporally consistent.
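A minimal sketch of this extraction step with terra, assuming GDAL can read the HDF4 granules directly; the folder name is hypothetical:

```r
library(terra)

# List the downloaded MOD13Q1 .hdf granules (folder name is hypothetical)
files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)

for (f in files) {
  r    <- rast(f)                          # all subdatasets in the granule
  ndvi <- r[[grep("NDVI", names(r))]]      # keep only the NDVI layer
  writeRaster(ndvi, sub("\\.hdf$", "_NDVI.tif", f), overwrite = TRUE)
}
```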
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We use the package gtools to load the files in natural order (1, 2, 3, 4, 5, 6, etc.). Of the three stacks, we first merge stack 1 with stack 2 and store the result, and then merge that stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
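A hedged sketch of the merging step; the tile folder names are hypothetical:

```r
library(terra)
library(gtools)

# Load each tile's NDVI files in natural numeric order (1, 2, 3, ..., not 1, 10, 11, ...)
stack1 <- rast(mixedsort(list.files("tiles_h20v05", full.names = TRUE)))
stack2 <- rast(mixedsort(list.files("tiles_h21v05", full.names = TRUE)))
stack3 <- rast(mixedsort(list.files("tiles_h21v06", full.names = TRUE)))

# Merge stack 1 with stack 2, then the result with stack 3
merged <- merge(merge(stack1, stack2), stack3)

# Save one file per date: NDVI_final_<consecutive number>.tif
setwd("your_directory_MODIS/merged")
writeRaster(merged, paste0("NDVI_final_", seq_len(nlyr(merged)), ".tif"))
```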
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We use a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
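A minimal sketch of the cropping step:

```r
library(terra)
library(gtools)

ndvi <- rast(mixedsort(list.files("merged", pattern = "^NDVI_final", full.names = TRUE)))
aoi  <- vect("MERGED_LEVANT.shp")

# Crop to the mask's bounding box, then set cells outside the polygons to NA
ndvi_clip <- mask(crop(ndvi, aoi), aoi)

writeRaster(ndvi_clip, paste0("NDVI_merged_clip_", seq_len(nlyr(ndvi_clip)), ".tif"))
```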
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day composites across each year for a period of 22 years. Growing season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the DJF period (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS, span 0.3).
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
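A hedged sketch of the per-pixel trend and z-score calculations, assuming a SpatRaster `mam` that already holds one MAM growing-season sum per year:

```r
library(terra)

years <- 2001:2022   # one layer of 'mam' per year (assumption)

# Per-pixel OLS slope and p-value of the growing-season sums against year
trend_fun <- function(v) {
  if (all(is.na(v))) return(c(slope = NA, p = NA))
  s <- summary(lm(v ~ years))
  c(slope = s$coefficients[2, 1], p = s$coefficients[2, 4])
}
trend <- app(mam, trend_fun)

# Keep only slopes significant at the 0.05 level
sig_slope <- mask(trend[["slope"]], trend[["p"]] < 0.05, maskvalues = 0)

# z-scores: deviation of each year's sum from the long-term mean
z <- (mam - mean(mam)) / stdev(mam)
```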
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We work with the terra package and get the raster data from https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023; 100 m resolution, global coverage). One can download the desired temporal coverage and, after cropping to the individual study area, reclassify it using the code. Here, I summed up the different rasters to characterize built-up change as continuous values between 1975 and 2022.
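A compact sketch of that summation, assuming the cropped epoch files in “derived_data” follow the naming shown above:

```r
library(terra)

# Cropped GHSL built-up epochs at 5-year intervals, e.g. Levant_built_up_1975.tif
files   <- list.files("derived_data", pattern = "^Levant_built_up_.*\\.tif$", full.names = TRUE)
builtup <- rast(files)

# Summing the epochs gives a continuous value per cell: the longer a cell has
# been built up, the more epochs contribute to its total
change <- sum(builtup)
plot(change)
```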
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country yield data from the “yield_productivity” supplement in the repository (e.g., "Jordan_yield.csv"). Each single-country yield dataset is plotted with ggplot, and the plots are combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and the various variables can be extracted from the SpatRaster collection using the [“^a variable name”] command. Each time you run the code, this variable name must be adjusted to the required variable (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the SpatRaster collection.
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For variables such as rainfall, which is measured as annual sums and not means, the chunk r.sum = r.sum/12 has to be removed or set to r.sum = r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values over the 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95% confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe, thanks to the availability of the GLDAS variables.
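A hedged sketch of this GLDAS workflow; the variable pattern "^Rainf" is an illustrative assumption, and the regex-style subsetting mirrors the command quoted above:

```r
library(terra)

# Read all GLDAS .nc files as a SpatRaster collection and pull one variable
gldas <- sds(list.files("GLDAS", pattern = "\\.nc$", full.names = TRUE))
rain  <- rast(gldas["^Rainf"])          # adjust the variable name per the GLDAS docs

# Crop and mask to the study area outline
aoi  <- vect("MERGED_LEVANT.shp")
rain <- mask(crop(rain, aoi), aoi)

# Area-mean monthly series; frequency = 12 as required for annual trends
r.ts  <- ts(global(rain, "mean", na.rm = TRUE)$mean, start = 1975, frequency = 12)
r.sum <- aggregate(r.ts, FUN = sum)     # annual sums
r.sum <- r.sum / 12                     # annual means; use r.sum/1 for sum variables such as rainfall
```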
Release of ships-of-opportunity XBT data compiled by the Maritime Environment Information Centre at the UKHO. The release represents five years (1995-2000) of archived SOOP XBT observations. The dataset contains 7214 observations from various vessels (see vessel summary); the data range from 1984 to 2000. Prior to 1995, the data were released annually to the oceanographic community through ICES and US NODC, World Data Centre A.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets... In a way, the Kaggle community is built around them. You can't analyze data without having it. Here, we aim to create a meta-corpus of datasets posted to Reddit. A dataset dataset, if you will.
The following dataset is the comprehensive corpus of all the posts and comments made on Reddit's /r/datasets board, from its inception all the way to the first of March, 2022.
The dataset was procured using SocialGrep.
To preserve users' anonymity and to prevent targeted harassment, the data does not include usernames.
We would like to thank Chris Liverani for generously providing the cover image for this dataset.
Datasets are nice - we like our data.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- trainingDataset.csv - data to train the decision tree (DT) model built in the research
- testDataset.csv - data to test the DT model
- rulesDataset - data to generate the association rules for the research
- DT_model_script.r - R script to build and test the DT model
- Rules_script.r - R script to build the association rules for the research
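The provided scripts are the authoritative versions; for orientation only, here is a hedged sketch of the same train/test workflow using rpart, where the target column name `class` is hypothetical:

```r
library(rpart)

train <- read.csv("trainingDataset.csv")
test  <- read.csv("testDataset.csv")

# Fit a classification tree; 'class' stands in for the actual target column
dt <- rpart(class ~ ., data = train, method = "class")

# Evaluate the model on the held-out test set
pred <- predict(dt, newdata = test, type = "class")
mean(pred == test$class)   # accuracy
```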
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AWC to 60cm is one of 18 soil attributes chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This AWC raster data represents a modelled dataset of AWC to 60cm (mm of water to 60cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities and promote detailed investigation for a range of sustainable regional development options, and was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment NT, as well as the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This AWC to 60cm dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. Created the AWC to 60cm digital soil mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset, generated from field observations and laboratory data coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Conducted quality assessment (QA) of this DSM attribute data by three methods. Method 1: a statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations, expressed as OOB and R-squared results, giving an estimate of the reliability of the model predictions. These results are supplied. Method 2: a statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute’s “reliability”. This used the 500 individual trees of the attribute’s RF model to generate 500 datasets of the attribute and estimate model reliability for each attribute. For continuous attributes, the reliability estimate is the coefficient of variation. This data is supplied. Method 3: independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas, a two-week validation field trip was conducted using a new validation site set, produced by a random sampling design based on conditioned Latin hypercube sampling of the attribute’s reliability data. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
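For orientation, a hedged sketch of the Random Forest modelling and reliability steps with ranger; the object and column names (sites, awc60, grid_covariates) are hypothetical:

```r
library(ranger)

# 'sites' holds analysed AWC site values joined to covariates at each location
rf <- ranger(awc60 ~ ., data = sites, num.trees = 500)
rf$r.squared   # OOB-based R-squared, as in QA method 1

# Predict every grid cell with each of the 500 trees, then express reliability
# as the coefficient of variation across the per-tree predictions (QA method 2)
pred        <- predict(rf, data = grid_covariates, predict.all = TRUE)$predictions
awc_map     <- rowMeans(pred)
reliability <- apply(pred, 1, sd) / awc_map
```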
myPhyloDB is an open-source software package aimed at providing a user-friendly web interface for accessing and analyzing all of your laboratory's microbial ecology data (currently supported project types: soil, air, water, microbial, and human-associated). The storage and handling capabilities of myPhyloDB archive users' raw sequencing files and allow for easy selection of any combination of projects/samples from all of your projects using the built-in SQL database. The data processing capabilities of myPhyloDB are also flexible enough to allow the upload, storage, and analysis of pre-processed data or raw (454 or Illumina) data files using the built-in versions of Mothur and R. myPhyloDB is designed to run as a local web server, which allows a single installation to be accessible to all of your laboratory members, regardless of their operating system or other hardware limitations. myPhyloDB includes an embedded copy of the popular Mothur program and uses a customizable batch file to perform sequence editing and processing. This allows myPhyloDB to leverage the flexibility of Mothur and allow for greater standardization of data processing and handling across all of your sequencing projects. myPhyloDB also includes an embedded copy of the R software environment for a variety of statistical analyses and graphics. Currently, myPhyloDB includes analyses for factor- or regression-based ANCOVA, principal coordinates analysis (PCoA), differential abundance analysis (DESeq), and sparse partial least-squares regression (sPLS). Resources in this dataset: Resource Title: Website Pointer to myPhyloDB. File Name: Web Page, url: https://myphylodb.azurecloudgov.us/myPhyloDB/home/ Provides information and links to download the latest version, release history, documentation, and tutorials, including the type of analysis you would like to perform (Univariate: ANCOVA/GLM; Multivariate: DiffAbund, PCoA, or sPLS).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the code and data files for "Biodiversity on public lands: how citizen science can help". Included in this are the following:
1. taxataxi_opt_multiple_reports.R, which is code for running multiple iNaturalist Bioblitz datasets with built-in re-setting mechanisms for the various reports run.
2. iNat_NPS_comparison_code.R which is code for running a single iNaturalist Bioblitz dataset. This is identical to (1) except for the re-setting mechanisms and some re-organization of parts of the code.
3. synonym_links & taxonomic_units which are used in conjunction with (1) & (2) for the synonym identification.
4. All_iNat.csv is all of the iNaturalist entries used in the initial study named above.
5. all_parks.csv contains all of the filtered data from the park units through (1) for the initial study named above.
6. invasives_for_all_table1.R is code used in the initial study to identify introduced species.
7. migration.R is code used in the initial study to identify previously unrecorded species that were recorded less than 50 times in BISON in the USA state that the park unit is located in.
8. table3_all_parks.csv is the data used in the migration.R code for identifying these species.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data were collected to characterize whole-house mechanical ventilation (WHMV) and indoor air quality (IAQ) in 55 homes in the Marine climate of Oregon and Cold-Dry climate of Colorado in the U.S. Sixteen homes were monitored for two weeks, with and without WHMV operating. Ventilation airflows; airtightness; time-resolved CO2, PM2.5 and radon; and time-integrated NO2, NOX and formaldehyde were measured. Participants provided information about IAQ-impacting activities, perceptions and ventilation use. All homes had operational cooktop ventilation and bathroom exhaust. Thirty homes had equipment that could meet the ASHRAE 62.2-2010 standard with continuous or controlled runtime and 34 had some WHMV operating as found. Thirty-five of 46 participants with WHMV reported they did not know how to operate it, and only half of the systems were properly labeled. Two-week homes had lower formaldehyde, radon, CO2, and NO (NOX-NO2) when operated with WHMV; and also had faster PM2.5 decays following indoor emission events. Overall IAQ satisfaction was similar in Oregon and Colorado, but more Colorado participants (19 vs. 3%) felt their IAQ could be improved and more reported dryness as a problem (58 vs. 14%). The collected data indicate that there are benefits of operating WHMV, even when continuous use may not be needed because outdoor pollutant concentrations are low and indoor sources do not present substantial challenges.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note this dataset is the most recent version of the Administrative Boundaries (AB). For previous versions of the AB please go to this url: https://data.gov.au/data/dataset/previous-versions-of-the-geoscape-administrative-boundaries

Geoscape Administrative Boundaries is Australia's most comprehensive national collection of boundaries, including government, statistical and electoral boundaries. It is built and maintained by Geoscape Australia using authoritative government data. Further information about contributors to Administrative Boundaries is available here.

This dataset comprises seven Geoscape products:

* Localities
* Local Government Areas (LGAs)
* Wards
* Australian Bureau of Statistics (ABS) Boundaries
* Electoral Boundaries
* State Boundaries and
* Town Points

Updated versions of Administrative Boundaries are published on a quarterly basis.

Users have the option to download datasets with feature coordinates referencing either GDA94 or GDA2020 datums.

Notable changes in the November 2025 release:

* There have been spatial changes (area) greater than 1 km2 to the local government areas ‘Derwent Valley Council’ and ‘Brighton Council’ in Tasmania.
* ‘Kairabah’ locality in ‘Logan City’ local government area has been retired in Queensland.
* There have been spatial changes (area) greater than 1 km2 to the localities ‘Neilrex’, ‘Daruka’ and ‘Merrygoen’ in New South Wales, ‘Yarrabilba’, ‘Petrie’ and ‘Kallangur’ in Queensland, and ‘Flinders Ranges’ and ‘Angorigina’ in South Australia.

Further information on Administrative Boundaries, including FAQs on the data, is available here or through Geoscape Australia's network of partners. They provide a range of commercial products based on Administrative Boundaries, including software solutions, consultancy and support.

Users must also note the following attribution requirements.

Preferred attribution for the Licensed Material:

Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).

Preferred attribution for Adapted Material:

Incorporates or developed using Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).

What to Expect When You Download Administrative Boundaries

Administrative Boundaries is a large dataset (around 1.5GB unpacked), made up of seven themes each containing multiple layers.

Users are advised to read the technical documentation, including the product change notices and the individual product descriptions, before downloading and using the product.

Please note this dataset is the most recent version of the Administrative Boundaries (AB). For previous versions of the AB please go to this url: https://data.gov.au/dataset/ds-dga-b4ad5702-ea2b-4f04-833c-d0229bfd689e/details?q=previous

License Information

The Australian Government has negotiated the release of Administrative Boundaries to the whole economy under an open CC BY 4.0 licence.

Users must only use the data in ways that are consistent with the Australian Privacy Principles issued under the Privacy Act 1988 (Cth).
License: Attribution 2.5 (CC BY 2.5), https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

The stream network was constructed and defined using the datasets shown in the Lineage. It was constructed using surface water nodes to define reaches, and the classification was assigned using the data from the stream network in the lineage, with the following classes:

1. surface water change due to hydrology
2. no change modelled at link node within PAE
3. modelled no change at link node
4. modelled change at link node
5. assumed change due to proximity to mine pit
6. assumed change due to hydrology

Further tie-breaks were decided based on stream order or stream segment length.

Bioregional Assessment Programme (2017) GAL Surface Water Reaches for Risk and Impact Analysis 20180803. Bioregional Assessment Derived Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/64c4d16f-bdfa-4fd6-bd72-c459503003bd.

* Derived From Onsite and offsite mine infrastructure for the Carmichael Coal Mine and Rail Project, Adani Mining Pty Ltd 2012
* Derived From Alpha Coal Project Environmental Impact Statement
* Derived From Geofabric Surface Cartography - V2.1
* Derived From QLD Exploration and Production Tenements (20140728)
* Derived From China Stone Coal Project initial advice statement
* Derived From Kevin's Corner Project Environmental Impact Statement
* Derived From Galilee surface water modelling nodes
* Derived From Geoscience Australia GEODATA TOPO series - 1:1 Million to 1:10 Million scale
* Derived From China First Galilee Coal Project Environmental Impact Assessment
* Derived From GEODATA TOPO 250K Series 3
* Derived From Seven coal mines included in Galilee surface water modelling
The simulated community datasets were built using the virtualspecies V1.5.1 R package (Leroy et al., 2016), which generates spatially-explicit presence/absence matrices from habitat suitability maps. We simulated these suitability maps using Gaussian fields neutral landscapes produced using the NLMR V1.0 R package (Sciaini et al., 2018). To allow for some level of overlap between species suitability maps, we divided the γ-diversity (i.e., the total number of simulated species) by an adjustable correlation value to create several species groups that share suitability maps. Using a full factorial design, we developed 81 presence/absence maps varying across four axes (see Supplemental Table 1 and Supplemental Figure 1): 1) landscape size, representing the number of sites in the simulated landscape; 2) γ-diversity; 3) the level of correlation among species suitability maps, with greater correlations resulting in fewer shared species groups among suitability maps; and 4) the habitat suitabil...
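A hedged sketch of one such simulation, with all parameter values illustrative rather than taken from the study:

```r
library(NLMR)
library(virtualspecies)

set.seed(1)

# Gaussian-field neutral landscape used as a habitat-suitability map
suit <- nlm_gaussianfield(ncol = 50, nrow = 50, autocorr_range = 10)

# Convert the suitability map into a probabilistic presence/absence map
pa <- convertToPA(suit, PA.method = "probability",
                  beta = 0.5, alpha = -0.05, plot = FALSE)
pa$pa.raster   # one species' presence/absence layer
```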