Facebook
TwitterTo make this a seamless process, I cleaned the data and delete many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both lyft and uber but it is still a cleaned version from the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R, here is the code for how you do this:
df<-read.csv('uber.csv')
df_black<-subset(uber_df, uber_df$name == 'Black')
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()
Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Facebook
TwitterThis module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data; has been used in undergraduate classrooms.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Data points present in this dataset were obtained following the subsequent steps: To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured in 400 μL TAP medium for 7 days in Deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker, set to 150 rpm, under constant illumination (50 μmol photons/m2s). Then 100 μL sample were transferred clear bottom 96-well plate (Corning Costar, Tewksbury, MA, USA) and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland). Fluorescence was measured at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning Deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to the clear bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, R Statistic version 3.3.3 was used to perform one-way ANOVA (with Tukey's test), and to test statistical hypotheses, the significance level was set at 0.05. Graphs were generated in RStudio v1.0.136. The codes are deposit herein.
Info
ANOVA_Turkey_Sub.R -> code for ANOVA analysis in R statistic 3.3.3
barplot_R.R -> code to generate bar plot in R statistic 3.3.3
boxplotv2.R -> code to generate boxplot in R statistic 3.3.3
pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.
who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.
who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.
Anova_Output_Summary_Guide.pdf -> Explain the ANOVA files content
ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.
Consider citing our work.
Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal. pone.0192433
Facebook
TwitterThis dataset is a clean CSV file with the most recent estimates of the population of the countries according to Wolrdometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/
The data has been generated by websraping the aforementioned link on the 16th August 2021. Below is the code used to make CSV data in Python 3.8:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.worldometers.info/world-population/population-by-country/"
r = requests.get(url)
soup = BeautifulSoup(r.content)
countries = soup.find_all("table")[0]
dataframe = pd.read_html(str(countries))[0]
dataframe.to_csv("countries_by_population_2021.csv", index=False)
The creation of this dataset would not be possible without a team of Worldometers, a data aggregation website.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
• aidsSystemData-DIB.xlsx contains the first combination of the USDA ERS data into a single tabular format. It also contains the formulas for the calculation of derived data. • aidsSystemData3.csv is the .xlsx file converted to .csv for import into R. • aidsvdataInbrief.R contains all code used to estimate and calculate elasticities. The following packages must be installed for the script: 'tidyverse', 'lubridate', 'micEconAids', and 'broom'. • Results.xlsx contains all results. Tabs are labeled as “results_” lags between price and quantity, and “h” or “m” to indicate Hicksian or Marshallian. • all.Rdata contains all results and intermediate objects.
Facebook
TwitterPhytoplankton cells span a large size range, from picoplankton (<2µm), nanoplankton (2 to 20µm), microplankton (20 to 200µm) to macroplankton (200 to <2000µm). Cell size interacts with multiple selective pressures, including cellular metabolic rate, light absorption, nutrient uptake, cell nutrient quotas, trophic interactions and diffusional exchanges with the environment. Beyond simple size, cells of different shapes differ in surface area to volume ratio. For example, more elongated cells, such as pennate diatoms, have a larger surface area to volume ratio compared to more rounded cells, such as centric diatoms, of equivalent biovolume, which can in turn influence diffusional exchanges between cells and their environment. We assembled metadata on diverse marine phytoplankters, in parallel with genomic or transcriptomic data annotations to identify genes encoding enzymes, to facilitate analyses of genomic patterns of encoded enzymes across diverse taxa, sizes, growth forms and or..., MetaData.csv contains information on the site and latitude of origin, cell size, genome size, presence of flagella and colony formation assembled for 146 diverse marine phytoplankters from citations listed in MetaData.csv. We used an automated pipeline implemented through Snakemake to pass gene sequences from the downloaded genomes and/or transcriptomes from the 146 phytoplankters, in .fasta format, to the eggNOG 5.0 database. We then used eggNOG-Mapper 2.0.6 and the DIAMOND algorithm to annotate potential orthologs in each analyzed genome or transcriptome, using the following parameters: seed_ortholog_evalue = 0.001, seed_ortholog_score = 60, tax_scope = "auto", go_evidence = "non-electronic", query_cover = 20 and subject_cover = 0.
The output of automatically annotated orthologs, from each genome or transcriptome, from the bioinformatic pipeline was compiled into combinedHits.csv. Definitions of annotations and variable names are listed in CombinedHitsDataDictionary.csv. Definitio..., In MetaData.csv, combinedHits.csv, MetaDataDictionary.csv and CombinedHitsDataDictionary.csv, missing values are left blank to facilitate direct import into a dataframe under 'R'. .csv files are accessible through many software platforms including LibreOffice, but combinedHits.csv is large, so importing directly into 'R' is advisable. Files and Directory Structures:
GitHub repository:Â Â https://github.com/FundyPhytoPhys/ROS_bioinfo/tree/master/ROSGenomicPatternsAcrossMarinePhytoplankton
contains the code used to generate:Omar NM, Fleury K, Beardsall B, Prášil O, Campbell DA, 2022, Reactive Oxygen Species Production and Scavenging; Genomic Patterns Across Marine Phytoplankton, in review
R/ROSDataImport.Rmd imports data from the Snakemake pipeline, to generate CombinedHits.csv.
R/ROSModels.Rmd imports CombinedHits.csv and MetaData.csv, and generates the Poisson and Quasi-Poisson models used in the manuscript.
R/ROSGenomicPatternsBySubstrate.Rmd generates the manuscript.
The metada...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:
“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
csv file for import into R as data framefor the analysis:Dependent Variable - dplooks - difference in proportion of looks to NP1 and NP2 pictures (averaged across 6720ms into display until the end of the display - 10sec after initial presentation)Independent VariableVBias - causal bias of verb (note that consequentiality bias is to the other NP)Sources of random effectsPart - participantsItem - itemsIV is numerical, other variables are factorsOther variables are included in the file.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data files and Python and R scripts are provided for Case Study 1 of the openENTRANCE project. The data covers 10 residential devices on the NUTS2 level for the EU27 + UK +TR + NO + CH from 2020-2050. The devices included are full battery electric vehicles (EV), storage heater (SH), water heater with storage capabilitites (WH), air conditiong (AC), heat circulation pump (CP), air-to-air heat pump (HP), refrigeration (includes refrigerators (RF) and freezers (FR)), dish washer (DW), washing machine (WM), and tumble drier (TD). The data for the study uses represenative hours to describe load expectations and constraints for each residential device - hourly granularity from 2020 to 2050 for a representative day for each month (i.e. 24 hours for an average day in each month). The aggregated final results are in Full_potential.V9.csv and acheivable_NUTS2_summary.csv. The file metaData.Full_Potential.csv is provided to guide users on the nomenclature in Full_potential.V9.csv and the disaggregated data sets.The disaggregated loads can be found in d_ACV8.csv, d_CPV6.csv, d_DWV6.csv, d_EVV7.csv, d_FRV5.csv, d_HPV4.csv, d_RFV5.csv, d_SHV7.csv, d_TDV6.csv, d_WHV7.csv, d_WMV6.csv while the disaggregated maximum capacities p_ACV8.csv, p_CPV6.csv, p_DWV6.csv, p_EVV7.csv, p_FRV5.csv, p_HPV4.csv, p_RFV5.csv, p_SHV7.csv, p_TDV6.csv, p_WHV7.csv, p_WMV6.csv. Full_potential.V9.csv shows the NUTS2 level unadjusted loads for the residential devices using representative hours from 2020-2050. The loads provided here have not been adjusted with the direct load participation rates (see paper for more details). More details on the dataset can be found in the metaData.Full_Potential.csv file. The acheivable_NUTS2_summary.csv shows the NUTS2 level acheivable direct load control potentials for the average hour in the respective year (years - 2020, 2022,2030,2040, 2050). These summaries have allready adjusted the disaggregated loads with direct load participation rates from participation_rates_country.csv. A detailed overview of the data files are provided below. Where possible, a brief description, input data, and script use to generate the data is provided. If questions arise, first refer to the publication. If something still needs clarification, send an email to ryano18@vt.edu. Description of data provided Achievable_NUTS2_summary.csv Description Average hourly achievable direct load potentials for each NUTS2 region and device for 2020, 2022, 2030,2040, 2050 Data input Full_potential.V9.csv participation_rates_country.csv P_inc_SH.csv P_inc_WH.csv P_inc_HP.csv P_inc_DW.csv P_inc_WM.csv P_inc_TD.csv Script NUTS2_acheivable.R COP_.1deg_11-21_V1.csv Description NUTS2 average coefficient of performance estimates from 2011-2021 daily temperature Data tg_ens_mean_0.1deg_reg_2011-2021_v24.0e.nc NUTS_RG_01M_2021_3857.shp nhhV2.csv Script COP_from_E-OBS.R Country dd projections.csv Description Assumptions for annual change in CDD and HDD Spinoni, J., Vogt, J. V., Barbosa, P., Dosio, A., McCormick, N., Bigano, A., & Füssel, H. M. (2018). Changes of heating and cooling degree‐days in Europe from 1981 to 2100. International Journal of Climatology, 38, e191-e208. Expectations for future HDD and CDD used the long-run averages and country level expected changes in the rcp45 scenario EV NUTS projectionsV5.csv Description NUTS2 level EV projections 2018-2050 Data input EV projectionsV5_ave.csv Country level EV projections NUTS 2 regional share of national vehicle fleet Eurostat - Vehicle Nuts.xlsx Script EVprojections_NUTS_V5.py EV_NVF_EV_path.xlsx Description Country level – EV share of new passenger vehicle fleet From: Mathieu, L., & Poliscanova, J. (2020). Mission (almost) accomplished. Carmakers’ Race to Meet the, 21. EV_parameters.xlsx Description Parameters used to calculate future loads from EVs Wunit_EV – represents annual kWh per EV evLIFE_150kkm number of years represents usable life if EV only lasted 150 thousand km. Hence, 150,000/average km traveled per year with respect to country (this variable is dropped and not used for estimation). Average age/#years assuming 150k life – represents Number of years Average between evLIFE_150kkm and average age of vehicle with respect to the country full_potentialV9.csv Description Final data that shows hourly demand (Maximum Reduction) and (Maximum Dispatch for each device, region, and year. This data has not been adjusted with participation_rates_country.csv Maximum dispatch is equal to max capacity – hourly demand with respect to the device, region, year, and hour. Script Full_potentialV9.py gils projection assumptions.xlsx Description Data from: Gils, H. C. (2015). Balancing of intermittent renewable power generation by demand response and thermal energy storage. A linear extrapolation was used to determine values for every year and country 2020-2050. AC – Air Conditioning, SH – Storage Heater, WH – Water heater with storage capability, CP – heat circulation pump, TD – Tumble Drier, WM – Washing
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Original data, R script (code) and code output for the paper published on Journal of Dairy Science. For best use, replicate analysis using R. Importing data using the .csv file may cause some variables (columns of the spreadsheet) to be imported with the wrong format. Any issues, do not hesitate in contact. Happy coding!
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
#Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive historical record of stock prices for Toyota Motor Corporation (TM) from March 1, 1950, to January 31, 2025. The data includes daily stock prices, adjusted close prices, and trading volumes, making it a valuable resource for financial analysis, time series forecasting, and machine learning applications in the stock market domain.
Date: The trading date for the stock data.
Open: The opening price of the stock on the given date.
High: The highest price of the stock during the trading day.
Low: The lowest price of the stock during the trading day.
Close: The closing price of the stock on the given date.
Adj Close: The adjusted closing price, which accounts for corporate actions such as dividends, stock splits, and rights offerings.
Volume: The number of shares traded during the day.
The data was scraped using the Yahoo Finance API, a reliable and widely used source for financial market data. The dataset has been cleaned and structured to ensure accuracy and ease of use.
Time Series Analysis: Analyze trends, seasonality, and patterns in Toyota's stock prices over time.
Machine Learning: Build predictive models to forecast future stock prices or trading volumes.
Financial Research: Conduct research on stock market behavior, volatility, and the impact of external factors on stock prices.
Portfolio Management: Use the data to backtest trading strategies or optimize investment portfolios.
Educational Purposes: Ideal for teaching financial modeling, data analysis, and machine learning in a real-world context.
Time Period: March 1, 1950, to January 31, 2025.
Frequency: Daily.
Number of Rows: Approximately 27,000+ rows (depending on the exact date range).
File Format: CSV (Comma-Separated Values).
File Name: TM_1950-03-01_2025-01-31.csv.
Columns Description: Date: The date of the stock data in YYYY-MM-DD format.
Open: The opening price of the stock on the given date.
High: The highest price reached during the trading day.
Low: The lowest price reached during the trading day.
Close: The closing price of the stock on the given date.
Adj Close: The adjusted closing price, which accounts for corporate actions.
Volume: The total number of shares traded during the day.
The dataset has been carefully cleaned to ensure there are no missing or erroneous values.
Adjusted Close prices are provided to account for corporate actions, ensuring consistency in historical data.
This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share, adapt, and use the data for any purpose, provided you give appropriate credit.
Data sourced from Yahoo Finance.
Special thanks to the Yahoo Finance API for providing access to historical stock data.
If you use this dataset in your work, please cite it as follows:
Toyota Motor Corporation (TM) Historical Stock Data (1950-2025). How to Use: Download the Dataset: The dataset is available in CSV format, which can be easily imported into Python, R, or any other data analysis tool.
Exploratory Data Analysis (EDA): Start by visualizing the data to understand trends, volatility, and key patterns.
Modeling: Use the data to build predictive models using machine learning or statistical techniques.
Backtesting: Test trading strategies using historical data to evaluate their performance.
Example Use Case: python ``` import pandas as pd import matplotlib.pyplot as plt
df = pd.read_csv('TM_1950-03-01_2025-01-31.csv')
plt.figure(figsize=(10, 6)) plt.plot(df['Date'], df['Close'], label='Closing Price') plt.title('Toyota Motor Corporation (TM) Stock Price Over Time') plt.xlabel('Date') plt.ylabel('Price (USD)') plt.legend() plt.show() ```
If you have any feedback, suggestions, or would like to contribute to improving this dataset, please feel free to reach out or submit a pull request on the Kaggle dataset page.
This dataset is a valuable resource for anyone interested in financial analysis, stock market research, or machine learning applications. Whether you're a data scientist, financial analyst, or student, this dataset provides a robust foundation for exploring and understanding Toyota's stock performance over the decades.
This Data is completely scrape by Muhammad Atif Latif. You can get more Datasets toCLICK HERE
Facebook
TwitterThis dataset is necessary to reproduce results in the Manuscript"Evaluation of acaricide treatments to experimentally reduce winter tick load on moose" by De Pierre, Déry, Asselin, Leighton, Côté & Tremblay, 2024.See read_me.csv for variable names and details, and metadata. Additional details available in the manuscript and in R script.
Facebook
Twitterhttp://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################
###Variables for downloaded files data.dir <- ' ' train.file <- paste0(data.dir, 'training.csv') test.file <- paste0(data.dir, 'test.csv') #################################
###Load csv -- creates a data.frame matrix where each column can have a different type. d.train <- read.csv(train.file, stringsAsFactors = F) d.test <- read.csv(test.file, stringsAsFactors = F)
###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.
###To look at samples of the data, uncomment this line:
###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe
im.train <- d.train$Image d.train$Image <- NULL #removes 'image' from the dataframe
im.test <- d.test$Image d.test$Image <- NULL #removes 'image' from the dataframe
################################# #The image is represented as a series of numbers, stored as a string #Convert these strings to integers by splitting them and converting the result to integer
#strsplit splits the string #unlist simplifies its output to a vector of strings #as.integer converts it to a vector of integers. as.integer(unlist(strsplit(im.train[1], " "))) as.integer(unlist(strsplit(im.test[1], " ")))
###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
###implement parallelization im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } #The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows). #%do% instructs R to do all evaluations in parallel. #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):
###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')
#each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
#im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).
#To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))
#Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")
#Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }
#there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")
#One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images colMeans(d.train, na.rm=T)
#To build a submission file we need to apply these computed coordinates to the test instances: p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T) colnames(p) <- names(d.train) predictions <- data.frame(ImageId = 1:nrow(d.test), p) head(predictions)
#The expected submission format has one one keypoint per row, but we can easily get that with the help of the reshape2 library:
library(...
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
A comprehensive, daily-updated dataset of US Dollar to Iranian Rial exchange rates (USD/IRR) with historical data from November 2011 to present. This dataset is ideal for financial analysis, economic research, forecasting, and machine learning projects.
The CSV file contains the following columns:
| Column | Description | Format | Example |
|---|---|---|---|
| Open Price | Opening price of the day | Integer | 1012100 |
| Low Price | Lowest price of the day | Integer | 1011700 |
| High Price | Highest price of the day | Integer | 1034100 |
| Close Price | Closing price of the day | Integer | 1029800 |
| Change Amount | Price change amount | String | 15400 |
| Change Percent | Price change percentage | String | 1.52% |
| Gregorian Date | Gregorian date | YYYY/MM/DD | 2025/09/06 |
| Persian Date | Persian/Shamsi date | YYYY/MM/DD | 1404/06/15 |
This live dataset, scraper source code and workflow is available on GitHub where you can explore, download, and use it directly.
"https://kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset/" target="_blank">
https://raw.githubusercontent.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset/main/assets/img/IntractiveChart.png">
Interactive charts and dataset overview are available at:
kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset
import pandas as pd
# Load dataset
df = pd.read_csv('data/Dollar_Rial_Price_Dataset.csv')
# Convert date column to datetime
df['Gregorian Date'] = pd.to_datetime(df['Gregorian Date'], format='%Y/%m/%d')
# Price columns are already integers
price_columns = ['Open Price', 'Low Price', 'High Price', 'Close Price']
print(df[price_columns].dtypes) # All should be int64
# pip install kagglehub[hf-datasets]
import kagglehub
df = kagglehub.load_dataset(
"kooroshkz/dollar-rial-toman-live-price-dataset",
adapter="huggingface",
file_path="Dollar_Rial_Price_Dataset.csv",
pandas_kwargs={"parse_dates": ["Gregorian Date"]}
)
print(df.head())
# Load dataset
data <- read.csv("data/Dollar_Rial_Price_Dataset.csv", stringsAsFactors = FALSE)
# Convert date column
data$Gregorian.Date <- as.Date(data$Gregorian.Date, format = "%Y/%m/%d")
# View structure
str(data)
This dataset is maintained using an automated web scraping system that:
If you find data inconsistencies or have suggestions for improvements, please open an issue in the GitHub repository.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this dataset in your research or projects, please cite:
Dollar-Rial-Toman Live Price Dataset
Author: Koorosh Komeili Zadeh
Source: https://github.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset
Data Source: TGJU.org (Tehran Gold & Jewelry Union)
Date Range: November 2011 - Present
USD to Rial dataset, Dollar to Toman dataset, Iran exchange rate CSV, USD/IRR daily price, foreign exchange Iran dataset, TGJU data, time series currency dataset
This ...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell RNA-seq data of the parasitic nematode Brugia malayi in the microfilariae development stage. Data includes untreated cellular transcriptional states and the transcriptional response to ivermectin (1 µM).
The unfiltered gene expression matrix is provided as both a Seurat and AnnData object. Filtered datasets are provided as .csv files for direct import into R.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List collyer_adams_Rcode.txt -- R code for running analysis collyer_adams_example_data.csv -- example data to input into R routine collyer_adams_example_xmat.csv -- coding for the design matrix used collyer_adams_ESA_supplement.zip -- all files at once
Description The collyer_adams_Rcode.txt file contains a procedure for performing the analysis described in Appendix A, using R. The procedure imports data and a design matrix (collyer_adams_example_data.csv and collyer_adams_example_xmat.csv are provided, and correspond to the example in Appendix A). The default number of permutations is 999, but can be changed. A matrix of random values (distances, contrasts, angles) and a results summary are created from the program. Users should be aware that importing different data sets will require altering some of the R code to accommodate their data (e.g., matrix dimensions would need to be changed).
Facebook
TwitterTo make this a seamless process, I cleaned the data and delete many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both lyft and uber but it is still a cleaned version from the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R, here is the code for how you do this:
df<-read.csv('uber.csv')
df_black<-subset(uber_df, uber_df$name == 'Black')
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()
Your data will be in front of the world's largest data science community. What questions do you want to see answered?