22 datasets found

Petre_Slide_CategoricalScatterplotFigShare.pptx
figshare.com
pptx
Updated Sep 19, 2016
Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
Explore at:
Available download formats: pptx
Unique identifier
https://doi.org/10.6084/m9.figshare.3840102.v1
Dataset updated
Sep 19, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
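
The script itself is only shown on the PowerPoint slide, so the following is a minimal sketch of the three steps, not the authors' exact script. It assumes the ggplot2 package is installed; the data values, the object names `example_data`, `p`, and `out_file`, and the use of `ggsave()` in place of the File -> Save as menu are illustrative assumptions. Only the column names and the plotting layers are taken from the text and Note 2.

```r
# Minimal sketch of the protocol (assumption: not the authors' exact script).
library(ggplot2)

# Step 1: a three-column data set. The values are made up for illustration;
# in practice the data would come from the .csv file, e.g. read.csv(file.choose()).
set.seed(1)
example_data <- data.frame(
  Replicate = rep(c("Jan2016", "Feb2016", "Mar2016"), each = 6),
  Condition = rep(rep(c("WT", "MutantA", "MutantB"), each = 2), times = 3),
  Value     = round(runif(18, min = 1, max = 100), 1)
)

# Step 2: boxplots with jittered dots superimposed; dot colors indicate replicates.
graph <- ggplot(example_data, aes(x = Condition, y = Value))
p <- graph +
  geom_boxplot(outlier.colour = "black", colour = "black") +
  geom_jitter(aes(col = Replicate)) +
  theme_bw()

# Step 3: write the graph to a .pdf file (here in a temporary directory).
out_file <- file.path(tempdir(), "categorical_scatterplot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Adding `+ scale_y_log10()` to `p` would give the log-scale variant described in Note 2.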

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

# 7 Display the graph in a separate window. Dot colors indicate replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
Dataset of book subjects that contain The economics of immigration :...
workwithdata.com
Updated Nov 7, 2024
Cite
Work With Data (2024). Dataset of book subjects that contain The economics of immigration : selected papers of Barry R. Chiswick [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=The+economics+of+immigration+:+selected+papers+of+Barry+R.+Chiswick&j=1&j0=books
Explore at:
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 1 row and is filtered to the book The economics of immigration : selected papers of Barry R. Chiswick. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Video game pricing analytics dataset
kaggle.com
Updated Sep 1, 2023
Cite
Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 1, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Shivi Deveshwar
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The review dataset covers 3 video games (Call of Duty: Black Ops 3, Persona 5 Royal, and Counter Strike: Global Offensive) and was obtained through a web scrape of SteamDB [https://steamdb.info/], a large repository of game-related data such as release dates, reviews, and prices. In the initial scrape, each game has two files: customer reviews (100 reviews) and price time series data.

To obtain data on the reviews of the selected video games, we performed web scraping using R. The customer reviews dataset contains the date each review was posted and the review text, while the price dataset contains the date of each price change and the price on that date. To clean and prepare the data, we first sectioned it in Excel. After scraping, our csv file held each review in one row together with its date. We split the data, separating date and review into their own columns. The price scrape already separated price and date, so after the split we only made sure that every file had similar column names.

We then used R to finish the cleaning. Each game has separate files for prices and reviews, so each price file is converted into a continuous time series by carrying the last available price forward to each date. The price dataset is then combined with its respective review dataset in R on the common date column using a left join. The resulting dataset for each game contains four columns: game name, date, reviews, and price. From there, we allow the user to select the game they would like to view.
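
The forward-fill and left join described above could be sketched in base R as follows. This is an illustration under assumptions, not the dataset author's code: the data values, the column names, the game name, and the choice to put the daily price series on the left side of the join are all hypothetical.

```r
# Hypothetical price-change records: one row per date on which the price changed.
price_changes <- data.frame(
  date  = as.Date(c("2023-01-01", "2023-01-04")),
  price = c(59.99, 29.99)
)

# Hypothetical reviews: one row per review, with its posting date.
reviews <- data.frame(
  date   = as.Date(c("2023-01-02", "2023-01-05")),
  review = c("Great game", "Good value on sale"),
  stringsAsFactors = FALSE
)

# Build a continuous daily series and carry the last known price forward.
# findInterval() returns, for each date, the index of the most recent price change.
all_dates <- seq(min(price_changes$date), as.Date("2023-01-06"), by = "day")
idx <- findInterval(as.numeric(all_dates), as.numeric(price_changes$date))
daily_prices <- data.frame(date = all_dates, price = price_changes$price[idx])

# Left join on the common date column: every date is kept, and dates
# without a review get NA in the review column.
combined <- merge(daily_prices, reviews, by = "date", all.x = TRUE)
combined$game <- "Example Game"
```

With the daily price series on the left, every date keeps its price even when no review was posted that day; swapping the two sides would instead keep only the dates that have reviews.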

Google Data Analytics Case Study Cyclistic

kaggle.com

zip

Updated Sep 27, 2022

+ more versions


Cite

Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions

Explore at:

Available download formats: zip (1299 bytes)

Dataset updated

Sep 27, 2022

Authors

Udayakumar19

Description

Introduction

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insights will help the marketing team build a strategy aimed at casual riders.

Prepare

Guiding Question:

Where is your data located?
  The data is located in Cyclistic's organizational data.

How is data organized?
  The datasets are in CSV format, one file per month, for financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC? 
  The data is credible and ROCCC because it was collected by the Cyclistic organization itself.

How are you addressing licensing, privacy, security, and accessibility?
  The company holds its own license over the dataset. The dataset does not contain any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are always hidden in the data. We have to interpret the data to find them.

Are there any problems with the data?
  Yes, the starting and ending station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and to gain experience in the language.

Have you ensured the data’s integrity?
 Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed; then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze? 
 Make sure the column names are consistent throughout all data sets by using the “bind_rows” function.

Make sure column data types are consistent throughout all the datasets by using “compare_df_cols” from the “janitor” package.
Combine all the datasets into a single data frame to keep the analysis consistent throughout.
Remove the columns start_lat, start_lng, end_lat, end_lng from the data frame because they are not required for analysis.
Create new columns day, date, month, year from the started_at column; this provides additional opportunities to aggregate the data.
Create the “ride_length” column from the started_at and ended_at columns to find the average duration of rides.
Remove the null rows from the dataset by using the “na.omit” function.
Have you documented your cleaning process so you can review and share those results? 
  Yes, the cleaning process is documented clearly.
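
The cleaning steps listed above could look roughly like this in base R. This is a sketch on a tiny made-up table, not the author's script: the rows and the column names `ride_id`, `start_station_name`, and `member_casual` are assumptions, while `started_at`, `ended_at`, `start_lat`, and `ride_length` come from the description.

```r
# Tiny hypothetical sample of trip data (values are made up).
trips <- data.frame(
  ride_id            = c("a1", "a2", "a3"),
  started_at         = c("2022-01-03 08:00:00", "2022-01-03 09:15:00", "2022-01-04 17:30:00"),
  ended_at           = c("2022-01-03 08:20:00", "2022-01-03 09:45:00", "2022-01-04 18:00:00"),
  start_station_name = c("Station A", NA, "Station C"),
  start_lat          = c(41.9, 41.8, 41.7),
  member_casual      = c("member", "casual", "member"),
  stringsAsFactors   = FALSE
)

# Parse the timestamps so that durations and date parts can be derived.
trips$started_at <- as.POSIXct(trips$started_at, tz = "UTC")
trips$ended_at   <- as.POSIXct(trips$ended_at, tz = "UTC")

# Drop a coordinate column that is not required for the analysis.
trips$start_lat <- NULL

# Derive date, year, month, and day from started_at for later aggregation.
trips$date  <- as.Date(trips$started_at, tz = "UTC")
trips$year  <- format(trips$date, "%Y")
trips$month <- format(trips$date, "%m")
trips$day   <- format(trips$date, "%d")

# Ride length in minutes, from started_at and ended_at.
trips$ride_length <- as.numeric(difftime(trips$ended_at, trips$started_at, units = "mins"))

# Remove duplicate rides, then rows containing any missing value.
trips <- trips[!duplicated(trips$ride_id), ]
trips_clean <- na.omit(trips)
```

Here the second ride is dropped by `na.omit()` because its start station name is missing, which mirrors the null station names noted in the Prepare section.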

Analyze Phase:

Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized in one single data frame by using the read csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual riders' ride duration is longer than annual members'.
  Casual riders use docked bikes more widely than annual members.
What trends or relationships did you find in the data?
  Annual members ride mainly for commuting purposes.
  Casual riders prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for members.

Guiding Questions:

Were you able to answer the question of how ...

Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...
search.datacite.org
doi.org
+1more
Updated 2018
Cite
Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
Explore at:
Unique identifier
https://doi.org/10.3886/e102263v5-10021
Dataset updated
2018
Dataset provided by
DataCite (https://www.datacite.org/)
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
Authors
Jacob Kaplan
Description
Version 5 release notes:
Removes support for SPSS and Excel data.
Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
Adds in agencies that report 0 months of the year.
Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
Removes data on runaways.
Version 4 release notes:
Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
Add data for 2016.
Order rows by year (descending) and ORI.
Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2016 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
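
As a rough illustration of the two recoding rules just described (not the author's actual cleaning code, which is in the linked GitHub repository), the sketch below applies them to a made-up vector of raw arrest values. The vector contents and the names `arrests_raw`, `arrests`, and `implausible` are hypothetical.

```r
# Hypothetical raw values for one arrest column, as read from the source files.
arrests_raw <- c("12", "None/not reported", "10000", "3", "99999")

# Rule 1: treat "None/not reported" as zero (the possibly incorrect assumption
# noted above), then convert the column to numeric.
arrests <- arrests_raw
arrests[arrests == "None/not reported"] <- "0"
arrests <- as.numeric(arrests)

# Rule 2: replace the listed implausible counts with NA.
implausible <- c(seq(10000, 100000, by = 10000), 99999, 99998)
arrests[arrests %in% implausible] <- NA

arrests
# 12  0 NA  3 NA
```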

To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
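
The same loss of leading zeros can happen when reading a CSV into R with default settings. The sketch below, on a made-up two-row file, shows one way to keep the codes intact by reading the column as character. The file contents and the column names `ori` and `fips_code` are hypothetical and are not the names used in the actual data.

```r
# Write a small hypothetical CSV with FIPS codes that have leading zeros.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ori,fips_code", "AL00100,01001", "PA05100,42101"), csv_path)

# Default read: the column is converted to integer and the leading zero is lost.
default_read <- read.csv(csv_path)

# Reading the column as character preserves the code exactly as written.
safe_read <- read.csv(csv_path, colClasses = c(fips_code = "character"))

default_read$fips_code  # 1001 42101
safe_read$fips_code     # "01001" "42101"
```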

I created 9 arrest categories myself. The categories are:
Total Male Juvenile
Total Female Juvenile
Total Male Adult
Total Female Adult
Total Male
Total Female
Total Juvenile
Total Adult
Total Arrests
All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.

As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
Alcohol Crimes: DUI, Drunkenness, Liquor
Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
Simple
This data set has every crime and only the arrest categories that I created (see above).
Water-column environmental variables and accompanying discrete CTD...
data.usgs.gov
catalog.data.gov
Updated Jul 13, 2022
+ more versions
Cite
Nancy Prouty; Miranda Baker (2022). Water-column environmental variables and accompanying discrete CTD measurements collected off California and Oregon during NOAA Ship Lasker R-19-05 (USGS field activity 2019-672-FA) from October to November 2019 (ver. 2.0, July 2022) [Dataset]. http://doi.org/10.5066/P9JKYWQU
Explore at:
Unique identifier
https://doi.org/10.5066/P9JKYWQU
Dataset updated
Jul 13, 2022
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Authors
Nancy Prouty; Miranda Baker
License
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Time period covered
Oct 9, 2019 - Nov 5, 2019
Area covered
California
Description
Various water column variables, including salinity, dissolved inorganic nutrients, pH, total alkalinity, dissolved inorganic carbon, and radio-carbon isotopes, were measured in samples collected using a Niskin-bottle rosette at selected depths from sites offshore of California and Oregon from October to November 2019 during NOAA Ship Lasker R-19-05 (USGS field activity 2019-672-FA). CTD (Conductivity Temperature Depth) data were also collected at each depth at which a Niskin-bottle sample was collected and are presented along with the water sample data. This data release supersedes version 1.0, published in August 2020 at https://doi.org/10.5066/P9ZS1JX8. Versioning details are documented in the accompanying VersionHistory_P9JKYWQU.txt file.
Kickastarter Campaigns
kaggle.com
zip
Updated Jan 25, 2024
Cite
Alessio Cantara (2024). Kickastarter Campaigns [Dataset]. https://www.kaggle.com/datasets/alessiocantara/kickastarter-project/discussion
Explore at:
Available download formats: zip (2233314 bytes)
Dataset updated
Jan 25, 2024
Authors
Alessio Cantara
Description
Welcome to my Kickstarter case study! In this project I’m trying to understand what the success factors for a Kickstarter campaign are, by analyzing a publicly available dataset from Web Robots. The analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.

ASK

Different questions will guide my analysis: 1. Is the campaign duration influencing the success of the project? 2. Is it the chosen funding budget? 3. Which category of campaign is the most likely to be successful?

PREPARE

I’m using the Kickstarter Datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month and all the data are divided into CSV files. Each table contains:
- backers_count : number of people that contributed to the campaign
- blurb : a captivating text description of the project
- category : the label categorizing the campaign (technology, art, etc)
- country
- created_at : day and time of campaign creation
- deadline : day and time of campaign max end
- goal : amount to be collected
- launched_at : date and time of campaign launch
- name : name of campaign
- pledged : amount of money collected
- state : success or failure of the campaign

Each monthly scrape produces a large number of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I downloaded zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023), and 8 CSVs (January 2024). Each month was placed in its own folder.

A first look at the spreadsheets makes clear that some cleaning and modification is needed: for example, dates and times are shown as Unix timestamps, there are multiple columns that are not helpful for the scope of my analysis, and currencies need to be made uniform (some are US$, some GB£, etc). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.

PROCESS

I decided to use R to clean and process the data. For each month I started by setting up a new working environment in its own folder. After loading the necessary libraries:

library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyr)

I wrote a general R script that searches for CSV files in the folder, opens each one as a separate variable, and collects them all in a single list of data frames:

# Find every CSV file in the current working folder
csv_files <- list.files(pattern = "\\.csv$")
data_frames <- list()
for (file in csv_files) {
  # Use the file name without its extension as the variable name
  variable_name <- sub("\\.csv$", "", file)
  assign(variable_name, read.csv(file))
  data_frames[[variable_name]] <- get(variable_name)
}

Next, I converted some columns to numeric values because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.

data_frames <- lapply(data_frames, function(df) {
  df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
  return(df)
})
data_frames <- lapply(data_frames, function(df) {
  df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
  return(df)
})
data_frames <- lapply(data_frames, function(df) {
  df$usd_pledged <- as.numeric(df$usd_pledged)
  return(df)
})

In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):

# One of the following, depending on the month folder being processed
all_nov_2023 = bind_rows(data_frames)
all_dec_2023 = bind_rows(data_frames)
all_jan_2024 = bind_rows(data_frames)

After merging I converted the UNIX code datestamp into a readable datetime for the columns “created”, “launched”, “deadline” and deleted all the columns that had these data set to 0. I also filtered the values into the “slug” columns to show only the category of the campaign, without unnecessary information for the scope of my analysis. The final table was then saved.

# The month in the object and file names was modified according to the month considered
filtered_dec_2023 <- all_dec_2023 %>%
  select(blurb, backers_count, category, country, created_at, launched_at,
         deadline, currency, usd_exchange_rate, goal, pledged, state) %>%
  filter(created_at != 0 & deadline != 0 & launched_at != 0) %>%
  mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>%
  mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>%
  mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>%
  mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>%
  select(-category, -deadline, -launched_at, -created_at) %>%
  relocate(created, launched, setted_deadline, .before = goal)

write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
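
To make the two transformations in that pipeline easier to follow, here is a small stand-alone sketch of each on made-up inputs. The sample `category` string and the timestamp are hypothetical, and the sketch adds `perl = TRUE` and `tz = "UTC"`, which the original code does not use, so that the lazy regex and the printed date behave the same on any machine.

```r
# Hypothetical value of the "category" column: a JSON-like string per campaign.
category <- '{"id":16,"name":"Gadgets","slug":"technology/gadgets","position":7}'

# Keep only the text captured after "slug":" and before the next double quote.
category_slug <- sub('.*?"slug":"(.*?)".*', "\\1", category, perl = TRUE)
category_slug
# "technology/gadgets"

# Unix timestamps count seconds since 1970-01-01 00:00:00 UTC.
created_at <- 1700000000
created <- as.POSIXct(created_at, origin = "1970-01-01", tz = "UTC")
format(created, "%Y-%m-%d %H:%M:%S")
# "2023-11-14 22:13:20"
```

Without an explicit `tz`, `as.POSIXct()` displays the result in the session's local time zone, so the same timestamp can print as a different clock time on different machines.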

The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
Data from: Candidate selective sweeps in U.S. wheat populations
data.niaid.nih.gov
agdatacommons.nal.usda.gov
+1more
zip
Updated Nov 6, 2024
Cite
Sajal Sthapit; Travis Ruff; Marcus Hooker; Deven See (2024). Candidate selective sweeps in U.S. wheat populations [Dataset]. http://doi.org/10.5061/dryad.ghx3ffbx0
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.ghx3ffbx0
Dataset updated
Nov 6, 2024
Dataset provided by
Washington State University
The Land Institute
USDA-ARS Wheat Health, Genetics, and Quality Research
Authors
Sajal Sthapit; Travis Ruff; Marcus Hooker; Deven See
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
United States
Description
Exploration of novel alleles from ex situ collection is still limited in modern plant breeding as these alleles exist in genetic backgrounds of landraces that are not adapted to modern production environments. The practice of backcross breeding results in the preservation of the adapted background of elite parents but leaves little room for novel alleles from landraces to be incorporated. The selection of adaptation-associated linkage blocks instead of the entire adapted background may allow breeders to incorporate more of the landrace’s genetic background and to observe and evaluate novel alleles. Important adaptation-associated linkage blocks would have been selected over multiple cycles of breeding and hence are likely to exhibit signatures of positive selection or selective sweeps. We conducted a genome-wide scan for candidate selective sweeps (CSS) using Fst, Rsb, and xpEHH in state, regional, spring, winter, and market class population pairs and report 446 CSS in 19 population pairs over time and 1033 CSS in 44 population pairs across geography and class. Further validation of these candidate selective sweeps in specific breeding programs may lead to the identification of sets of loci that can be selected to restore population-specific adaptation without multiple backcrossing.

Methods

Folder Structure

The dataset has the following folder structure

./ or the root folder has the scripts used for analysis in R Markdown files as well as the corresponding .html output from running these scripts.

./data/ has the raw data and the intermediate data saves from the analysis

./functions/ has one file "functions_for_selection_sweep_analysis.R" that has the custom functions written for the analysis in the manuscript.

./output/ has the analysis results and figures used in the manuscript

./output/mapchart/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2 standard deviations.

./output/mapchart_sd2.5/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2.5 standard deviations.

./rehh_files/ has two subfolders /genotype and /map that store the intermediate files generated by the R package 'rehh' to calculate Rsb and xpEHH.

Raw data files

The analysis in the manuscript uses the following raw data files. Data files not in this list are all intermediate files created by the analysis scripts.

./data/90k_SNP_type.txt

A tab-delimited file with 4 columns as described below:

Index: serial number of genetic markers/loci on the 90K wheat SNP chip.

Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.

SNP: Alleles present in the single nucleotide polymorphism (SNP) marker/loci.

SNPTYPE: Same information as in column SNP but in a format without square brackets and /

./data/KIM_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

A tab-delimited file with information on known informative markers (KIM) recorded in 8 columns described below.

Marker: Name of the marker to be used as the label in the linkage maps in Supplemental Figures.

Chromosome: Chromosome label for wheat.

Start1.0: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 1.0. This information was not used in the current study.

Start: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 2.1.

Prop: Proportion sequence match for the marker to the reference genome sequence version 2.1.

SNP_ID: Alternative name for the marker. This information was not used in the current study.

Gene: Name of the gene.

Function: Function of the gene.

./data/R-generated-genotype-for-analysis-imputed-AB-format.csv

Raw 90K wheat SNP chip data after quality filtering and imputation using LinkImpute as described in Sthapit et al. The dataset includes the 7 information columns described below, followed by 753 columns with genotype information in the AB format.

Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.

SNPid: Unique IWA and IWB SNP names of the genetic markers/loci on the 90K wheat SNP chip.

Chrom: Wheat chromosome labels.

Ord: Order of the marker. This information was not used for analysis.

cM: Centimorgan position of the marker. This information was not used for analysis.

Comment: Notes on manual classification of genotype calls in GenomeStudio.

Remaining columns have variety names and their corresponding genotype calls in AB format.

./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv

Same information as in ./data/R-generated-genotype-for-analysis-imputed-AB-format.csv but the genotype information in the last 753 columns are recorded in the nucleotide (ACGT) format.

./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

Contains physical base pair positions on the 'Chinese Spring' wheat reference sequence version 2.1 for the 90K SNP chip markers. The file has 5 columns without column headers. The column descriptions are given below.

First column has unique names of the genetic markers/loci on the 90K Wheat SNP chip.

Second column has wheat chromosome labels.

Third column has the starting base pair position of the marker on the reference sequence version 2.1.

Fourth column has the ending base pair position of the marker on the reference sequence version 2.1.

Fifth column has the mid-point of the third and fourth columns, which was used as the SNP position for the marker in this study.
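As a rough illustration of the five-column layout described above, the following Python sketch parses a headerless file and checks that the fifth column equals the mid-point of the third and fourth. The tab delimiter, the dictionary keys, and the example rows are assumptions made for this sketch; the original study processed the file in R.

```python
import csv
import io

def read_snp_positions(handle, delimiter="\t"):
    """Parse the five headerless columns into a list of dicts.

    The delimiter is an assumption; adjust it to match the real file.
    """
    records = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        marker, chrom, start, end, mid = row[:5]
        records.append({
            "marker": marker,
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "mid": float(mid),
        })
    return records

def midpoint(start, end):
    """Mid-point of the start and end base pair positions."""
    return (start + end) / 2

# Made-up rows that only mimic the described layout.
example = io.StringIO("marker_1\t1A\t100\t200\t150\nmarker_2\t1B\t11\t20\t15.5\n")
records = read_snp_positions(example)

# Markers whose stored mid-point disagrees with the recomputed one.
mismatches = [r["marker"] for r in records
              if midpoint(r["start"], r["end"]) != r["mid"]]
print(mismatches)
```

An empty list of mismatches would indicate that the fifth column is consistent with the third and fourth.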

./data/variety_details.txt

Contains information about the 753 wheat varieties used as the diversity panel for this study. The file contains 12 columns, which are described below:

GS.Sample.ID: Names of the samples/varieties as they were in the raw output from the Illumina SNP calling software Genome Studio.

Corrected.Sample.ID: Names of the samples/varieties after they were corrected for typos (for example, 'Eric' to 'Erik') and after removal of the prefix "varname" for varieties that only have numbers in their names ('varname2154' to '2154').

ACNO: Accession number of the varieties from the NPGS-GRIN database.

Habit: Growth habit (spring or winter) of the varieties.

Region: U.S. wheat growing regions: EAS, Eastern; GPL, Great Plains; NOR, Northern; PAC, Pacific; PNW, Pacific Northwest. A description of how states were assigned to these regions is in the methods section of the manuscript.

State: U.S. state the varieties are from.

Year: The year the variety was released in the U.S.

MC: Market class of the wheat variety: HRS, hard red spring; HRW, hard red winter; SRW, soft red winter; SWS, soft white spring; SWW, soft white winter.

HeadType: Designates if the spike or head of the wheat is club or common.

Sector: Whether the variety was from the public or private sector. Information in this column is incomplete and hence was not used for any analysis in the manuscript.

Decade: Decade the variety was released.

BP: Breeding period the variety was released.

Description of Scripts

Here we describe the scripts in order along with the input data files used and the output files these scripts produced.

./00_import_RefSeqv2.1_physical_positions.Rmd ./00_import_RefSeqv2.1_physical_positions.html (R Markdown output html)

The study uses genotype data generated from our previous study (https://doi.org/10.1002/tpg2.20196) that had marker physical positions based on wheat reference sequence version 1. This script updates the marker physical positions to the wheat reference sequence version 2.1 and saves the updated genotype files for subsequent analyses.

Input files:

./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt ./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv ./data/R-generated-genotype-for-analysis-imputed-AB-format.csv

Output files:

./data/genotype_AB_format_13995_loci_imputed.txt ./data/genotype_nucleotide_format_13995_loci_imputed.txt

01_define_populations.Rmd 01_define_populations.html (R Markdown output html)

This script assigns varieties to sub-populations as described in the methods section of the manuscript.

Input files:

./functions/functions_for_selection_sweep_analysis.R ./data/variety_details.txt

Output files:

./data/populations.rds ./output/first_last_varieties.csv

02_calculate_iHH_iES_inES.Rmd 02_calculate_iHH_iES_inES.html (R Markdown output html)

This script uses the 'rehh' package function 'scan_hh' called through the custom function 'scan_population' to calculate the integrated extended haplotype homozygosity (iHH), integrated site-specific extended haplotype homozygosity (iES), and integrated normalized site-specific extended haplotype homozygosity (inES) for all markers of all 21 chromosomes and all wheat sub-populations in the study. The intermediate files needed to run these calculations were written to the folders ./rehh_files/genotype and ./rehh_files/map. The output is saved as an RDS file to be used as input for subsequent scripts.

Input files:

./functions/functions_for_selection_sweep_analysis.R ./data/genotype_nucleotide_format_13995_loci_imputed.txt ./data/populations.rds

Output files:

./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds

03_calculate_allele_freq_Fst_Rsb_xpEHH.Rmd 03_calculate_allele_freq_Fst_Rsb_xpEHH.html (R Markdown output html)

This script calculates allele frequencies for all the sub-populations and Fst, Rsb, and xpEHH statistics for defined sub-population pairs.

Input files:

./functions/functions_for_selection_sweep_analysis.R ./data/genotype_nucleotide_format_13995_loci_imputed.txt ./data/genotype_AB_format_13995_loci_imputed.txt ./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds

Output files:

./output/allele_freq_Fst_Rsb_xpEHH.Rds
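
To make the allele-frequency step more concrete, here is a minimal Python sketch of how the frequency of allele A could be computed per sub-population from AB-format calls. It is not the authors' R code: the function names, the handling of missing calls, and the example varieties and populations are all assumptions made for illustration.

```python
from collections import Counter

VALID_CALLS = {"AA", "AB", "BA", "BB"}

def allele_a_frequency(calls):
    """Frequency of allele A among diploid AB-format calls.

    Calls outside VALID_CALLS (for example missing data) are ignored.
    Returns None when no valid call is left.
    """
    counts = Counter(c for c in calls if c in VALID_CALLS)
    n = sum(counts.values())
    if n == 0:
        return None
    a_alleles = 2 * counts["AA"] + counts["AB"] + counts["BA"]
    return a_alleles / (2 * n)

def frequency_by_population(calls_by_variety, populations):
    """Allele A frequency of one marker for each sub-population.

    calls_by_variety: hypothetical mapping of variety name -> call.
    populations: hypothetical mapping of population name -> variety names.
    """
    return {
        name: allele_a_frequency(
            [calls_by_variety[v] for v in members if v in calls_by_variety]
        )
        for name, members in populations.items()
    }

# Made-up calls for a single marker.
calls = {"var1": "AA", "var2": "AB", "var3": "BB", "var4": "AA"}
pops = {"spring": ["var1", "var2"], "winter": ["var3", "var4"]}
freqs = frequency_by_population(calls, pops)
print(freqs)
```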
Case study: Cyclistic bike-share analysis
kaggle.com
zip
Updated Mar 25, 2022
Cite
Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
Explore at:
zip(131490806 bytes)Available download formats
Dataset updated
Mar 25, 2022
Authors
Jorge4141
Description
Introduction

This is a case study called Capstone Project from the Google Data Analytics Certificate.

In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

Scenario

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.

Primary Stakeholders:

1: Cyclistic Executive Team

2: Lily Moreno, Director of Marketing and Manager

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Why would casual riders buy Cyclistic annual memberships?

How can Cyclistic use digital media to influence casual riders to become members?

Prepare

The last four quarters, covering April 01, 2019 - March 31, 2020, were selected for analysis. These are the datasets used:

Divvy_Trips_2019_Q2 Divvy_Trips_2019_Q3 Divvy_Trips_2019_Q4 Divvy_Trips_2020_Q1

The data is stored in CSV files. Each file contains one month of data, for a total of 12 .csv files.

Data appears to be reliable with no bias. It also appears to be original, current and cited.

I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

Limitations

Financial information is not available.

Process

Used R to analyze and clean data

After installing the R packages, data was collected, wrangled and combined into a single file.

Columns were renamed.

Looked for inconsistencies between the data frames and converted some columns to character type so that they stack correctly.

Combined all quarters into one big data frame.

Removed unnecessary columns

Analyze

Inspected new data table to ensure column names were correctly assigned.

Formatted columns to ensure proper data types were assigned (numeric, character, etc).

Consolidated the member_casual column.

Added day, month and year columns to aggregate data.

Added ride-length column to the entire dataframe for consistency.

Deleted rides with a negative trip duration and rides where bikes were taken out of circulation for quality control.

Replaced the word "member" with "Subscriber" and also replaced the word "casual" with "Customer".

Aggregated data, compared average rides between members and casual users.
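
The ride-length and aggregation steps above can be sketched as follows. This is a Python illustration rather than the R code used in the case study; the column names started_at and ended_at, the timestamp format, and the example rides are assumptions, while member_casual is the column named in the description.

```python
from collections import defaultdict
from datetime import datetime

def ride_length_minutes(started_at, ended_at, fmt="%Y-%m-%d %H:%M:%S"):
    """Ride length in minutes from two timestamp strings."""
    start = datetime.strptime(started_at, fmt)
    end = datetime.strptime(ended_at, fmt)
    return (end - start).total_seconds() / 60

def average_ride_length(rides):
    """Average ride length per rider type, skipping negative durations."""
    totals = defaultdict(lambda: [0.0, 0])  # rider type -> [minutes, rides]
    for ride in rides:
        minutes = ride_length_minutes(ride["started_at"], ride["ended_at"])
        if minutes < 0:
            continue  # drop rides whose end precedes their start
        bucket = totals[ride["member_casual"]]
        bucket[0] += minutes
        bucket[1] += 1
    return {kind: total / count for kind, (total, count) in totals.items()}

# Made-up rides; the last one has a negative duration and is dropped.
rides = [
    {"member_casual": "member", "started_at": "2019-04-01 08:00:00",
     "ended_at": "2019-04-01 08:10:00"},
    {"member_casual": "casual", "started_at": "2019-04-06 10:00:00",
     "ended_at": "2019-04-06 10:30:00"},
    {"member_casual": "casual", "started_at": "2019-04-06 12:00:00",
     "ended_at": "2019-04-06 11:00:00"},
]
averages = average_ride_length(rides)
print(averages)
```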

Share

After analysis, visuals were created as shown below with R.

Act

Conclusion:

Data appears to show that casual riders and members use bike share differently.

Casual riders' average ride length is more than twice that of members.

Members use bike share for commuting, casual riders use it for leisure and mostly on the weekends.

Unfortunately, there's no financial data available to determine which of the two (casual or member) is spending more money.

Recommendations

Offer casual riders a membership package with promotions and discounts.
OK, Computer, what are these books about? - data files
ssh.datastations.nl
csv, tsv, txt, zip
Updated Jul 9, 2020
Cite
R Snijder; R Snijder (2020). OK, Computer, what are these books about? - data files [Dataset]. http://doi.org/10.17026/DANS-2Z4-MRGM
Explore at:
txt(2227), tsv(11224592), csv(2586236965), zip(18922), txt(2798), txt(1677)Available download formats
Unique identifier
https://doi.org/10.17026/DANS-2Z4-MRGM
Dataset updated
Jul 9, 2020
Dataset provided by
DANS Data Station Social Sciences and Humanities
Authors
R Snijder; R Snijder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The core of this experiment is the use of the entity-fishing algorithm, as created and deployed by DARIAH. In the most simple terms: it scans texts for terms that can be linked to Wikipedia pages. Based on the algorithm, new keywords are added to the book descriptions, plus a list of relevant Wikipedia pages.

For this experiment, the full text of 4125 books and chapters – available in the OAPEN Library – is scanned, resulting in a data file of over 25 million entries. In other words, on average the algorithm found roughly 6,100 ‘hits’ for each publication. When only the most common terms per publication are selected, does this result in a useful description of its content?

The data file OK_Computer_results contains a list of open access books and chapters descriptions found in the OAPEN Library, combined with Wikipedia entries found using the entity-fishing algorithm, plus several actions to filter out only the terms which describe the publication best. Each book or chapter is available in the OAPEN Library (www.oapen.org), see the column HANDLE.

The data file nerd_oapen_response_database contains the complete data set. The other text files contain R code to manipulate the file nerd_oapen_response_database.

Description of nerd_oapen_response_database:

The data is divided into the following columns:

OAPEN_ID: Unique ID of the publication in the OAPEN Library

rawName: The entity as it appears in the text

nerd_score: Disambiguation confidence score

nerd_selection_score: Selection confidence score, indicates how certain the disambiguated entity is actually valid for the text mention

wikipediaExternalRef: ID of the Wikipedia page

wiki_URL: URL of the Wikipedia page

type: NER class of the entity

domains: Description of subject domain

Each book may contain more than one occurrence of the same entity. The nerd_score and the nerd_selection_score may vary. This allows researchers to count the number of occurrences and use this as an additional method to assess the contents of the book. The OAPEN_ID refers to the identifier of the title in the OAPEN Library.

For more information about the entity-fishing query processing service see https://nerd.readthedocs.io/en/latest/restAPI.html#response.

Date: 2020-06-03
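
Counting occurrences per publication, as suggested above, might look like the following Python sketch. It is not the R code shipped with the data files: the function name, the optional score threshold, and the example rows are assumptions, and only the column names come from the description.

```python
from collections import Counter, defaultdict

def top_entities(rows, n=3, min_selection_score=0.0):
    """Most frequent Wikipedia page IDs per publication.

    rows: iterable of dicts keyed by the column names described above.
    min_selection_score: hypothetical filter on nerd_selection_score.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if float(row["nerd_selection_score"]) < min_selection_score:
            continue
        counts[row["OAPEN_ID"]][row["wikipediaExternalRef"]] += 1
    return {pub: counter.most_common(n) for pub, counter in counts.items()}

# Made-up rows that only mimic the column layout.
rows = [
    {"OAPEN_ID": "1", "wikipediaExternalRef": "100", "nerd_selection_score": "0.9"},
    {"OAPEN_ID": "1", "wikipediaExternalRef": "100", "nerd_selection_score": "0.8"},
    {"OAPEN_ID": "1", "wikipediaExternalRef": "200", "nerd_selection_score": "0.7"},
    {"OAPEN_ID": "2", "wikipediaExternalRef": "300", "nerd_selection_score": "0.2"},
]
summary = top_entities(rows, n=2, min_selection_score=0.5)
print(summary)
```

With the real file, the rows could be streamed from csv.DictReader so that the 25 million entries never have to be held in memory at once.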
LScDC Word-Category RIG Matrix
figshare.le.ac.uk
pdf
Updated Apr 28, 2020
Cite
Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.12133431.v2
Dataset updated
Apr 28, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LScDC Word-Category RIG Matrix

April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)

Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started

This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive is presented with two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.

This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories.

LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC. The meaningfulness of words is therefore evaluated by the words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus.

Words as a Vector of Frequencies in WoS Categories

Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.

It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.

The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category.

Words as a Vector of Relative Information Gains Extracted for Categories

In this section, we introduce our approach to the representation of a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it gains for categories.

For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.

The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain and makes it possible to compare information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.

Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains, and the dimension of the vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: the sum and the maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.

RIGs for each word of the LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

Leicester Scientific Thesaurus (LScT)

The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, the meaningfulness of words is evaluated by the words’ average informativeness in the categories, and the list of these words is considered as a ‘thesaurus’ for science. The LScT with the value of the sum can be found as a CSV file in the published archive.

The published archive contains the following files:

1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.

2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of the LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.

3) LScT.csv: List of words of the LScT with sum (S) values.

4) Text_No_in_Cat.csv: The number of texts in categories.

5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.

6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the forming procedures.

7) README.pdf (same as 6 in PDF format)

References

[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2

[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3

[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/

[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html

[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.

[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
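
As a hedged illustration of the quantities described above, the Python sketch below computes a relative information gain for one (word, category) pair from text counts and then ranks words by the sum of their RIGs. It assumes the common normalisation RIG = (H(category) - H(category | word)) / H(category); the authoritative formulas are those in the README of the published archive, and the function names and example numbers here are invented for this sketch.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a Boolean variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def relative_information_gain(n_texts, n_cat, n_word, n_both):
    """RIG about category membership from observing the word.

    n_texts: number of texts in the corpus
    n_cat:   texts in the category
    n_word:  texts containing the word
    n_both:  texts in the category that contain the word
    """
    h_cat = entropy(n_cat / n_texts)
    if h_cat == 0.0:
        return 0.0  # the category carries no uncertainty to reduce
    p_word = n_word / n_texts
    n_absent = n_texts - n_word
    h_present = entropy(n_both / n_word) if n_word else 0.0
    h_absent = entropy((n_cat - n_both) / n_absent) if n_absent else 0.0
    conditional = p_word * h_present + (1 - p_word) * h_absent
    return (h_cat - conditional) / h_cat

def top_words_by_sum(rig_rows, n):
    """Rank words by the sum of their RIGs over categories (descending)."""
    ranked = sorted(rig_rows.items(), key=lambda item: sum(item[1]), reverse=True)
    return [word for word, _ in ranked[:n]]

# A word that is independent of the category gains nothing;
# a word that identifies the category exactly gains everything.
independent = relative_information_gain(100, 50, 50, 25)
perfect = relative_information_gain(100, 50, 50, 50)

# Made-up RIG rows for three words over two categories.
rig_rows = {"alpha": [0.10, 0.02], "beta": [0.30, 0.01], "gamma": [0.00, 0.05]}
thesaurus = top_words_by_sum(rig_rows, 2)
print(independent, perfect, thesaurus)
```

Selecting the LScT then corresponds to calling top_words_by_sum with n set to 5,000 on the full matrix.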
SDSS Galaxy Subset
zenodo.org
application/gzip
Updated Sep 5, 2022
Cite
Nuno Ramos Carvalho; Nuno Ramos Carvalho (2022). SDSS Galaxy Subset [Dataset]. http://doi.org/10.5281/zenodo.6696565
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6696565
Dataset updated
Sep 5, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nuno Ramos Carvalho; Nuno Ramos Carvalho
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of this survey: 60247 objects classified as galaxies. It includes a CSV file with a collection of information and a set of files for each object, namely JPG image files, FITS and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxy characterisation.

The dataset includes a CSV data file where each row is an object from the SDSS database, and with the following columns (note that some data may not be available for all objects):

objid: unique SDSS object identifier

mjd: MJD of observation

plate: plate identifier

tile: tile identifier

fiberid: fiber identifier

run: run number

rerun: rerun number

camcol: camera column

field: field number

ra: right ascension

dec: declination

class: spectroscopic class (only objects with GALAXY are included)

subclass: spectroscopic subclass

modelMag_u: better of DeV/Exp magnitude fit for band u

modelMag_g: better of DeV/Exp magnitude fit for band g

modelMag_r: better of DeV/Exp magnitude fit for band r

modelMag_i: better of DeV/Exp magnitude fit for band i

modelMag_z: better of DeV/Exp magnitude fit for band z

redshift: final redshift from SDSS data z

stellarmass: stellar mass extracted from the eBOSS Firefly catalog

w1mag: WISE W1 "standard" aperture magnitude

w2mag: WISE W2 "standard" aperture magnitude

w3mag: WISE W3 "standard" aperture magnitude

w4mag: WISE W4 "standard" aperture magnitude

gz2c_f: Galaxy Zoo 2 classification from Willett et al 2013

gz2c_s: simplified version of Galaxy Zoo 2 classification (labels set)

Besides the CSV file, a set of directories is included in the dataset. In each directory you'll find files named after the objid column from the CSV file, with the corresponding data. The following directory tree is available:

sdss-gs/
├── data.csv
├── fits
├── img
├── spectra
└── ssel

Each directory contains:

img: RGB images from the object in JPEG format, 150x150 pixels, generated using the SkyServer DR16 API

fits: FITS data subsets around the object across the u, g, r, i, z bands; cut is done using the ImageCutter library

spectra: full best fit spectra data from SDSS between 4000 and 9000 wavelengths

ssel: best fit spectra data from SDSS for specific selected intervals of wavelengths discussed by Sánchez Almeida 2010
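
The link between the CSV rows and the per-object files can be sketched in Python as below. This is only an illustration: it assumes that "named after the objid" means a file called objid followed by an extension, and the helper names and the demo files are made up.

```python
import csv
import tempfile
from pathlib import Path

SUBDIRS = ("img", "fits", "spectra", "ssel")

def read_catalog(root):
    """Read data.csv into a list of dicts, one per object."""
    with open(Path(root) / "data.csv", newline="") as handle:
        return list(csv.DictReader(handle))

def files_for_object(root, objid):
    """File names found for one objid in each data directory.

    Assumes files are named "<objid>.<extension>" (not confirmed by the source).
    """
    found = {}
    for sub in SUBDIRS:
        folder = Path(root) / sub
        if folder.is_dir():
            found[sub] = sorted(p.name for p in folder.glob(f"{objid}.*"))
        else:
            found[sub] = []
    return found

# Demo on a throwaway directory that mimics the described layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "sdss-gs"
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True)
    (root / "data.csv").write_text("objid,redshift\n123,0.05\n")
    (root / "img" / "123.jpg").write_bytes(b"")
    catalog = read_catalog(root)
    located = files_for_object(root, catalog[0]["objid"])
print(located)
```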

Changelog

v0.0.3 - Increase number of objects to ~80k.

v0.0.2 - Increase number of objects to ~60k.

v0.0.1 - Initial import.
Supplement 2. R code used for wolf analysis.
wiley.figshare.com
html
Updated May 30, 2023
Cite
Jason Matthiopoulos; Mark Hebblewhite; Geert Aarts; John Fieberg (2023). Supplement 2. R code used for wolf analysis. [Dataset]. http://doi.org/10.6084/m9.figshare.3550839.v1
Explore at:
htmlAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3550839.v1
Dataset updated
May 30, 2023
Dataset provided by
Wileyhttps://www.wiley.com/
Authors
Jason Matthiopoulos; Mark Hebblewhite; Geert Aarts; John Fieberg
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
File List

Wolf code.r – Source code to run wolf analysis

Description

This code is provided for illustration only; the wolf data are not offered online. The code operates on a data frame in which rows correspond to points in space. The data frame contains a column for use (1 for a telemetry observation, 0 for a control point selected from the wolf’s home range). It also contains columns for x and y coordinates of the point, environmental covariates at that location, wolf ID and wolf pack membership.

1. Data frame preparation

The data set is first thinned for computational expediency, the covariates are standardized to improve convergence, and the data frame is augmented with columns for wolf-pack-level covariate expectations (required by the GFR approach).

2. Leave-one-out validation

The code allows the removal of a single wolf from the data set. Two models (one with just random effects, the second with GFR interactions) are fit to the data and predictions are made for the missing wolf. The function gof() generates goodness-of-fit diagnostics.
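
The preparation and validation steps can be sketched in Python as follows. This is not the supplement's R code: the column names (wolf_id, pack, elevation), the use of the population standard deviation, and the choice to average over all rows of a pack are assumptions made for this illustration, and the model fitting itself is omitted.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize(values):
    """Centre values on 0 and scale them to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def add_pack_expectations(rows, covariate):
    """Add a column holding the pack-level mean of one covariate."""
    by_pack = defaultdict(list)
    for row in rows:
        by_pack[row["pack"]].append(row[covariate])
    pack_mean = {pack: mean(vals) for pack, vals in by_pack.items()}
    new_key = f"{covariate}_pack_mean"
    return [dict(row, **{new_key: pack_mean[row["pack"]]}) for row in rows]

def leave_one_out(rows, wolf_id):
    """Split rows into training data and the held-out wolf."""
    train = [r for r in rows if r["wolf_id"] != wolf_id]
    held_out = [r for r in rows if r["wolf_id"] == wolf_id]
    return train, held_out

# Made-up rows: one covariate, two packs, three wolves.
raw = [
    {"wolf_id": "w1", "pack": "A", "elevation": 1.0},
    {"wolf_id": "w1", "pack": "A", "elevation": 2.0},
    {"wolf_id": "w2", "pack": "B", "elevation": 3.0},
    {"wolf_id": "w3", "pack": "B", "elevation": 4.0},
]
z = standardize([r["elevation"] for r in raw])
rows = [dict(r, elevation=value) for r, value in zip(raw, z)]
rows = add_pack_expectations(rows, "elevation")
train, held_out = leave_one_out(rows, "w2")
print(len(train), len(held_out))
```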

Energy Expenditure of Human Physical Activity

kaggle.com

zip

Updated Oct 15, 2023

Cite

David Desquens (2023). Energy Expenditure of Human Physical Activity [Dataset]. https://www.kaggle.com/datasets/anonymousds/energy-expenditure-of-human-physical-activity

Explore at:

zip(5061744 bytes)Available download formats

Dataset updated

Oct 15, 2023

Authors

David Desquens

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

🗂️ Data source

This dataset is built from the data underlying two scientific articles (1)(2).

The individuals were selected via a paper advertisement and they had to meet the following criteria:
1. Be older than 60 years of age;
2. Have a BMI between 23 and 35 kg/m²;
3. Not be restricted in their movements by health conditions;
4. Bring their own bicycle.

The selected participants received €50 for their contribution to the study and agreed to the use of recorded data for scientific purposes, in an anonymised manner.
A video example of the data collection can be found on YouTube.

⚙️ Data processing

I have personally consolidated and merged the resulting csv output with an R script.
Of the 35 participants, only the 31 who used the calorimetry were kept, with their associated Energy Expenditure measurements.

💡 Inspiration

I am passionate about physical activity in general, and I was very curious to gather and explore data that quantify the Energy Expenditure of different indoor and outdoor activities of daily living with low (lying down, sitting), mid (standing, household activities) and high (walking and cycling) levels of intensity.

🔍 Data overview

The dataset encompasses ~40K records with the participants' attributes and the physical activity features.
Each observation represents a physical activity performed by one of the 31 individuals that used the calorimetry.
The first twelve columns (ID:cosmed) are attributes related to the participant, so they are consistent across observations.
The rest of the columns (EEm:predicted_activity_label) are features related to a single physical activity.

🔢 Columns

Name	Description
ID	participant's ID
trial_date	date and time when data collection started at ID level
gender	sex = male or female
age	in years
weight	in kg
height	in cm
bmi	Body mass index in kg/m²
gaAnkle	TRUE if data from GENEActiv on the ankle exist, FALSE otherwise
gaChest	TRUE if data from GENEActiv on the chest exist, FALSE otherwise
gaWrist	TRUE if data from GENEActiv on the wrist exist, FALSE otherwise
equivital	TRUE if data from Equivital exist, FALSE otherwise
cosmed	TRUE if data from COSMED exist, FALSE otherwise
EEm	Energy Expenditure per minute, in Kcal
COSMEDset_row	the original indexes of COSMED data (used for merging)
EEh	Energy Expenditure per hour, in Kcal
EEtot	Total Kcal spent (it is reset between indoor and outdoor measurements)
METS	Metabolic Equivalent per minute
Rf	Respiratory Frequency (breaths/min)
BR	Breath Rate
VT	Tidal Volume in litre
VE	Expiratory Minute Ventilation (litre/min)
VO2	Oxygen Uptake (ml/min)
VCO2	Carbon Dioxide production (ml/min)
O2exp	Volume of O2 expired (ml/min)
CO2exp	Volume of CO2 expired (ml/min)
FeO2	Averaged expiratory concentration of O2 (%)
FeCO2	Averaged expiratory concentration of CO2 (%)
FiO2	Fraction of inspired O2 (%)
FiCO2	Fraction of inspired CO2 (%)
VE.VO2	Ventilatory equivalent for O2
VE.VCO2	Ventilatory equivalent for CO2
R	Respiratory Quotient
Ti	Duration of Inspiration (seconds)
Te	Duration of Expiration (seconds)
Ttot	Duration of Total breathing cycle (seconds)
VO2.HR	Oxygen pulse (ml/beat)
HR	Heart Rate
Qt	Cardiac output (litre/min)
SV	Stroke volume (litre)
original_activity_labels	True activity label as noted from study protocol, NA if is unknown
predicted_activity_label	Predicted activity label by model from [1], NA if is unknown
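
As a small worked example of the participant attributes in the table, the Python sketch below recomputes the body mass index from weight and height and applies the age and BMI inclusion criteria stated in the data source section. The function names and the example participant are hypothetical, and the mobility and bicycle criteria are not represented.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m² from weight in kg and height in cm."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def meets_numeric_criteria(age, weight_kg, height_cm):
    """Check the two numeric inclusion criteria: age over 60, BMI 23 to 35."""
    return age > 60 and 23 <= bmi(weight_kg, height_cm) <= 35

# Made-up participant: 65 years old, 81 kg, 180 cm.
example_bmi = bmi(81, 180)
eligible = meets_numeric_criteria(65, 81, 180)
print(round(example_bmi, 1), eligible)
```

Comparing the recomputed value with the bmi column is a quick consistency check on the weight and height columns.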

🔀 Data usage

Exploratory Data Analysis: Which insights can we extract from the data?
Classification: Are you able to better classify the kind of activity versus the original model?
Prediction: Are you capable of improving the accuracy of the original model?
Inference: Which variables explain the Energy Expenditure?

🖲️ Study devices and their body location

https://media.springernature.com/full/springer-static/image/art%3A10.1007%2Fs11257-020-09268-2/MediaObjects/11257_2020_9268_Fig3_HTML.png?as=webp

Plotly Dashboard Healthcare

kaggle.com

zip

Updated Jan 4, 2022

Cite

A SURESH (2022). Plotly Dashboard Healthcare [Dataset]. https://www.kaggle.com/datasets/sureshmecad/plotly-dashboard-healthcare

Explore at:

zip(1741234 bytes)Available download formats

Dataset updated

Jan 4, 2022

Authors

A SURESH

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

Data Visualization

Content

a. Scatter plot

  i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between 2 variables (expression/copy_number/chronos) for any pair of genes.

  ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.

  iii. The visualization could be interactive: it would be great if the user can hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).

  iv. Here is a quick reference for you. The scatter plot is between chronos score for TTBK2 gene and expression for MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.

b. Boxplot/violin plot

  i. The user should be able to select a gene and a variable (expression / chronos / copy_number) and generate a boxplot to display its distribution across multiple categories as defined by a user-selected variable (a column from the metadata file).

  ii. Here is an example for your reference, where a violin plot for the CHRONOS score for gene CCL22 is plotted and grouped by ‘Lineage’.
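
The data preparation behind the two plot types described above could be sketched in Python as follows. The layout of the inputs (dictionaries keyed by sample ID) and every name in the example are assumptions, since the task description does not specify the file structure; the plotting call itself is left to a library such as Plotly.

```python
from collections import defaultdict

def scatter_points(x_values, y_values, metadata, color_by, keep=None):
    """Build scatter plot points for samples present in all three inputs.

    x_values, y_values: hypothetical mappings of sample ID -> value,
        for example chronos score of one gene and expression of another.
    metadata: hypothetical mapping of sample ID -> dict of metadata columns.
    color_by: metadata column used for coloring.
    keep: optional predicate on the metadata dict, used for filtering.
    """
    points = []
    for sample in sorted(set(x_values) & set(y_values) & set(metadata)):
        info = metadata[sample]
        if keep is not None and not keep(info):
            continue
        points.append({"sample": sample, "x": x_values[sample],
                       "y": y_values[sample], "color": info[color_by]})
    return points

def group_for_boxplot(values, metadata, group_by):
    """Group one variable's values by a metadata column."""
    groups = defaultdict(list)
    for sample in sorted(set(values) & set(metadata)):
        groups[metadata[sample][group_by]].append(values[sample])
    return dict(groups)

# Made-up values and metadata for three samples.
chronos = {"s1": -0.2, "s2": 0.1, "s3": -1.0}
expression = {"s1": 3.5, "s2": 4.0}
meta = {
    "s1": {"Sex": "Female", "Lineage": "lung"},
    "s2": {"Sex": "Male", "Lineage": "lung"},
    "s3": {"Sex": "Male", "Lineage": "skin"},
}
points = scatter_points(chronos, expression, meta, color_by="Sex",
                        keep=lambda info: info["Lineage"] == "lung")
groups = group_for_boxplot(chronos, meta, group_by="Lineage")
print(points, groups)
```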


Housing Price Prediction using DT and RF in R
kaggle.com
zip
Updated Aug 31, 2023
Cite
vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
Explore at:
Available download formats: zip (629100 bytes)
Dataset updated
Aug 31, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Objective: To predict the prices of houses in the City of Melbourne

Approach: Using Decision Tree and Random Forest

Data Cleaning:

Date column is shown as a character vector which is converted into a date vector using the library ‘lubridate’

We create a new column called ‘age’ for the age of the house, as it can be a factor in the pricing of the house. We extract the year from the column ‘Date’ and subtract the column ‘Year Built’ from it.

We remove 11566 records which have missing values

We drop columns which are not significant such as ‘X’, ‘suburb’, ‘address’, (we have kept zipcode as it serves the purpose in place of suburb and address), ‘type’, ‘method’, ‘SellerG’, ‘date’, ‘Car’, ‘year built’, ‘Council Area’, ‘Region Name’

We split the data into ‘train’ and ‘test’ in 80/20 ratio using the sample function
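
The cleaning steps above can be illustrated with a short sketch. The original work was done in R; this Python version is only an illustration under assumptions: the field names `Date`, `YearBuilt` and `Price`, the day/month/year date format, the toy records, and the helper names are hypothetical, and age is taken as sale year minus year built.

```python
import random
from datetime import datetime

def drop_incomplete(records):
    """Keep only records in which no field is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]

def add_age(record, date_format="%d/%m/%Y"):
    """Return a copy of the record with 'Age' = sale year - year built."""
    sale_year = datetime.strptime(record["Date"], date_format).year
    return {**record, "Age": sale_year - record["YearBuilt"]}

def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle reproducibly, then cut into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy records, not taken from the Melbourne data.
houses = [
    {"Date": "03/12/2016", "YearBuilt": 1900 + 10 * i, "Price": 500000 + 1000 * i}
    for i in range(10)
]
houses.append({"Date": "04/02/2017", "YearBuilt": None, "Price": 700000})

clean = [add_age(h) for h in drop_incomplete(houses)]
train, test = train_test_split(clean)
```

Dropping incomplete rows before computing the age avoids arithmetic on missing values, and the fixed seed makes the 80/20 split repeatable.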

Load the libraries ‘rpart’, ‘rpart.plot’, ‘rattle’ and ‘RColorBrewer’

Run the decision tree using the rpart function. ‘Price’ is the dependent variable.

The average price for 5464 houses is $1084349.

Where building area is less than 200.5, the average price for 4582 houses is $931445. Where building area is less than 200.5 and the age of the building is less than 67.5 years, the average price for 3385 houses is $799299.6.

The highest average price, $4801538, is for 13 houses where distance is lower than 5.35 and building area is greater than 280.5.

We use the caret package to tune the complexity parameter; the optimal value found is 0.01, with RMSE 445197.9.

We use the ‘Metrics’ library to compute the RMSE ($392107), the MAPE (0.297, i.e. an average absolute error of about 29.7% of the actual price) and the MAE ($272015.4).
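
For reference, the three error measures can be written out directly. This is a minimal Python sketch with made-up numbers, not the R code used in the analysis; it assumes MAPE is expressed as a proportion, so a value such as 0.297 would correspond to an average error of 29.7% of the actual price.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error, in the same unit as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error as a proportion (0.10 means 10%)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)

# Toy values chosen for easy arithmetic; they are not from the dataset.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

example_rmse = rmse(actual, predicted)   # sqrt((100 + 400 + 0) / 3)
example_mae = mae(actual, predicted)     # (10 + 20 + 0) / 3
example_mape = mape(actual, predicted)   # (0.10 + 0.10 + 0) / 3
```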

The most important variables are ‘postcode’, ‘longitude’ and ‘building area’.

test$Price indicates the actual price and test$predicted indicates the predicted price for six sample houses.

We use the default parameters of random forest on the train data.

The variable importance plot indicates that ‘Building Area’, ‘Age of the house’ and ‘Distance’ are the most important variables that affect the price of the house.

Based on the default parameters, RMSE is $250426.2, MAPE is 0.147 (an average absolute error of about 14.7% of the actual price) and MAE is $151657.7

The error levels off between 100 and 200 trees, and there is almost no further reduction thereafter. We can choose N tree = 200.

We tune the model and find that mtry = 3 has the lowest out-of-bag error

We use the caret package with 5-fold cross-validation

RMSE is $252216.10, MAPE is 0.146 (an average absolute error of about 14.6% of the actual price) and MAE is $151669.4
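
The idea of 5-fold cross-validation can be sketched in a few lines. This is a simplified Python illustration, not what the caret package does internally: it uses contiguous folds without shuffling, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds.

    Each row is used for validation exactly once; fold sizes differ by
    at most one when n_rows is not divisible by k.
    """
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(12, k=5))
```

A model would be fitted on each `train` part and scored on the matching `validation` part, and the five scores averaged.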

We can conclude that Random Forest gives us more accurate results than the Decision Tree.

In Random Forest, the default parameters (N tree = 500) give us a slightly lower RMSE and MAE than N tree = 200, with an almost identical MAPE. So we can proceed with the default parameters.
Enriched Tourism Dataset London (POIs)
figshare.com
data.mendeley.com
csv
Updated Oct 24, 2025
Cite
Ramon Hermoso; Sergio Ilarri; Raquel Trillo-Lado (2025). Enriched Tourism Dataset London (POIs) [Dataset]. http://doi.org/10.6084/m9.figshare.27628029.v1
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.27628029.v1
Dataset updated
Oct 24, 2025
Dataset provided by
figshare
Authors
Ramon Hermoso; Sergio Ilarri; Raquel Trillo-Lado
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
London
Description
Please cite the following publication:

R. Hermoso, S. Ilarri, R. Trillo-Lado, & C. Marzo (2017). Recommending Needles in a Haystack: The SURGE Approach. International Journal of Geographical Information Systems, 2025.

This dataset contains the London subset of the Tourpedia dataset, specifically focusing on points of interest (POIs) categorized as attractions (dataset available at http://tour-pedia.org/download/london-attraction.csv). The original dataset comprises 20,727 entries that encompass a variety of attractions across London, providing details on several attributes for each POI. These attributes include a unique identifier, POI name, category, location information (address), latitude, longitude, specific details, and user-generated reviews. The review fields contain textual feedback from users, aggregated from platforms such as Google Places, Foursquare, and Facebook, offering a qualitative insight into each location.

However, due to the initial dataset's high proportion of incomplete or inconsistently structured entries, a rigorous cleaning process was implemented. This process entailed the removal of erroneous and incomplete data points, ultimately refining the dataset to 2,341 entries that meet criteria for quality and structural coherence. These selected entries were subjected to further validation to ensure data integrity, enabling a more accurate representation of London's attractions.

- London.csv
It contains columns including a unique identifier, POI name, category, location information (address), latitude, longitude, specific details, and user-generated reviews. Those reviews have been previously retrieved and pre-processed from Google Places, Foursquare, and Facebook, and have different formats: all words, only nouns, nouns + verbs, nouns + adjectives, and nouns + verbs + adjectives.

- London_annotated.csv
It contains the ground truth relating to the previous dataset, with manual annotations made by humans on the categorisation of each of the POIs into 12 different pre-defined categories. It has the following columns:
* POI name
* POI's address
* One column for each of the above categories. 1 means that the POI belongs to the category while blank indicates that it does not.
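
The one-column-per-category layout described for London_annotated.csv can be read back into a list of categories per POI. This is a minimal Python sketch: the exact header spellings and the category names `Museum` and `Park` are assumptions for illustration, and the two rows are made-up examples rather than entries from the file.

```python
import csv
import io

def categories_per_poi(csv_text, id_columns=("POI name", "POI's address")):
    """Map each POI name to the categories whose column contains '1'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {}
    for row in reader:
        categories = [
            column
            for column, value in row.items()
            if column not in id_columns and (value or "").strip() == "1"
        ]
        result[row[id_columns[0]]] = categories
    return result

# Hypothetical sample in the described layout (only two categories shown).
sample = (
    "POI name,POI's address,Museum,Park\n"
    "Example Gallery,1 Example Street,1,\n"
    "Example Gardens,2 Example Road,,1\n"
)

annotations = categories_per_poi(sample)
```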
Reddit Mental Health Dataset (RMHD)
kaggle.com
zip
Updated Sep 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Khandakar Amed (2023). Reddit Mental Health Dataset (RMHD) [Dataset]. https://www.kaggle.com/datasets/entenam/reddit-mental-health-dataset/code
Explore at:
Available download formats: zip (647220389 bytes)
Dataset updated
Sep 1, 2023
Authors
Khandakar Amed
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Citation

Rani, S.; Ahmed, K.; Subramani, S. From Posts to Knowledge: Annotating a Pandemic-Era Reddit Dataset to Navigate Mental Health Narratives. Appl. Sci. 2024, 14, 1547. https://doi.org/10.3390/app14041547

RMHD Our dataset, meticulously curated from Reddit, encompasses a comprehensive collection of posts from five key subreddits focused on mental health: r/anxiety, r/depression, r/mentalhealth, r/suicidewatch, and r/lonely. These subreddits were chosen for their rich, focused discussions on mental health issues, making them invaluable for research in this area.

The dataset spans from January 2019 through August 2022 and is systematically structured into folders by year. Within each yearly folder, the data is further segmented into monthly batches. Each month's data is compiled into five separate CSV files, corresponding to the selected subreddits.
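
As a rough check of this layout, the expected number of CSV files can be enumerated. The sketch below is in Python and assumes that every month from January 2019 through August 2022 has exactly one file per subreddit; the actual folder and file names are not given in the description, so none are constructed here.

```python
def monthly_batches(start=(2019, 1), end=(2022, 8)):
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1

SUBREDDITS = ["anxiety", "depression", "mentalhealth", "suicidewatch", "lonely"]

# One (year, month, subreddit) triple per expected CSV file.
expected_files = [
    (year, month, subreddit)
    for year, month in monthly_batches()
    for subreddit in SUBREDDITS
]
```

Under that assumption the period covers 44 months, which gives 220 files in total.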

Structure of Part A: Raw Data

Each CSV file in our dataset includes the following columns, providing a detailed view of the Reddit posts along with essential metadata:

Author: The username of the Reddit post's author.

Created_utc: The UTC timestamp of when the post was created.

Score: The net score (upvotes minus downvotes) of the post.

Selftext: The main text content of the post.

Subreddit: The subreddit from which the post was sourced.

Title: The title of the Reddit post.

Timestamp: The local date and time when the post was created, converted from the UTC timestamp.

This structured approach allows researchers to conduct detailed, time-based analyses and to easily access data from specific subreddits.
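
The relationship between the Created_utc and Timestamp columns can be illustrated as follows. This Python sketch assumes that Created_utc holds Unix epoch seconds, which the description does not state explicitly; the example row is made up, and the helper name `utc_to_datetime` is hypothetical.

```python
from datetime import datetime, timezone

def utc_to_datetime(created_utc):
    """Convert epoch seconds (assumed format of Created_utc) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)

# Made-up row using the column names from the description.
row = {"Author": "example_user", "Created_utc": "1546300800", "Score": "3"}

created = utc_to_datetime(row["Created_utc"])

# Local time depends on the machine's time zone settings.
local_created = created.astimezone()
```

Working from the UTC value and converting only for display keeps time-based analyses independent of where the data was exported.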

Structure of Part B: Labelled Data

Part B of our dataset, which includes a subset of 800 manually annotated posts, is structured differently to provide focused insights into the mental health discussions. The columns in Part B are as follows:

Score: The net score (upvotes minus downvotes) of the post.

Selftext: The main text content of the post.

Subreddit: The subreddit from which the post was sourced.

Title: The title of the Reddit post.

Label: The assigned label indicating the identified root cause of mental health issues, based on our annotation process. The labels are: Drug and Alcohol, Early Life, Personality, Trauma and Stress.

This annotation process brings additional depth to the dataset, allowing researchers to explore the underlying factors contributing to mental health issues.

The dataset, with a zipped size of approximately 1.68GB, is publicly available and serves as a rich resource for researchers interested in exploring the root causes of mental health issues as represented in social media discussions, particularly within the diverse conversations found on Reddit.
120 years of Olympic history: athletes and results
kaggle.com
zip
Updated Jun 15, 2018
Cite
rgriffin (2018). 120 years of Olympic history: athletes and results [Dataset]. https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
Explore at:
Available download formats: zip (5690772 bytes)
Dataset updated
Jun 15, 2018
Authors
rgriffin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.

Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.

Content

The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

ID - Unique number for each athlete

Name - Athlete's name

Sex - M or F

Age - Integer

Height - In centimeters

Weight - In kilograms

Team - Team name

NOC - National Olympic Committee 3-letter code

Games - Year and season

Year - Integer

Season - Summer or Winter

City - Host city

Sport - Sport

Event - Event

Medal - Gold, Silver, Bronze, or NA
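
A row of athlete_events.csv can be turned into typed values along these lines. This is a minimal Python sketch: the column names come from the list above, but the two sample rows are invented, and treating 'NA' as the missing-value marker for Age, Height and Weight (as well as Medal) is an assumption.

```python
import csv
import io

def parse_number(text, cast):
    """Return None for 'NA' or empty text, otherwise the cast value."""
    return None if text in ("NA", "") else cast(text)

def parse_athlete_event(row):
    """Convert one CSV row (dict of strings) into typed fields."""
    return {
        **row,
        "ID": int(row["ID"]),
        "Age": parse_number(row["Age"], int),
        "Height": parse_number(row["Height"], float),
        "Weight": parse_number(row["Weight"], float),
        "Year": int(row["Year"]),
        "Medal": None if row["Medal"] == "NA" else row["Medal"],
    }

# Invented rows in the documented 15-column layout.
sample = (
    "ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal\n"
    "1,Example Athlete,F,24,170,60,Example Team,EXA,2016 Summer,2016,Summer,"
    "Rio de Janeiro,Athletics,Example Event,NA\n"
    "2,Another Athlete,M,NA,NA,NA,Example Team,EXA,1896 Summer,1896,Summer,"
    "Athina,Athletics,Example Event,Gold\n"
)

events = [parse_athlete_event(r) for r in csv.DictReader(io.StringIO(sample))]
medal_count = sum(1 for e in events if e["Medal"] is not None)
```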

Acknowledgements

The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.

Inspiration

This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.
Detailed NFL Play-by-Play Data 2009-2018
kaggle.com
zip
Updated Dec 22, 2018
Cite
Max Horowitz (2018). Detailed NFL Play-by-Play Data 2009-2018 [Dataset]. https://www.kaggle.com/datasets/maxhorowitz/nflplaybyplay2009to2016
Explore at:
Available download formats: zip (287411671 bytes)
Dataset updated
Dec 22, 2018
Authors
Max Horowitz
Description
Introduction

The lack of publicly available National Football League (NFL) data sources has been a major obstacle in the creation of modern, reproducible research in football analytics. While clean play-by-play data is available via open-source software packages in other sports (e.g. nhlscrapr for hockey; PitchF/x data in baseball; the Basketball Reference for basketball), the equivalent datasets are not freely available for researchers interested in the statistical analysis of the NFL. To solve this issue, a group of Carnegie Mellon University statistical researchers, including Maksim Horowitz, Ron Yurko, and Sam Ventura, built and released nflscrapR, an R package which uses an API maintained by the NFL to scrape, clean, parse, and output clean datasets at the individual play, player, game, and season levels. Using the data output by the package, the trio went on to develop reproducible methods for building expected point and win probability models for the NFL. The outputs of these models are included in this dataset and can be accessed using the nflscrapR package.

Content

The dataset made available on Kaggle contains all the regular season plays from the 2009-2016 NFL seasons. The dataset has 356,768 rows and 100 columns. Each play is broken down into great detail containing information on: game situation, players involved, results, and advanced metrics such as expected point and win probability values. Detailed information about the dataset can be found at the following web page, along with more NFL data: https://github.com/ryurko/nflscrapR-data.

Acknowledgements

This dataset was compiled by Ron Yurko, Sam Ventura, and myself. Special shout-out to Ron for improving our current expected points and win probability models and compiling this dataset. All three of us are proud founders of the Carnegie Mellon Sports Analytics Club.

Inspiration

This dataset is meant to both grow and bring together the sports analytics community by providing clean and easily accessible NFL data that has never been available on this scale for free.

Cite

Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1

Petre_Slide_CategoricalScatterplotFigShare.pptx

Explore at:

Available download formats: pptx

Unique identifier

https://doi.org/10.6084/m9.figshare.3840102.v1

Dataset updated

Sep 19, 2016

Dataset provided by

Figshare (http://figshare.com/)

Authors

Benj Petre; Aurore Coince; Sophien Kamoun

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre (1), Aurore Coince (2), Sophien Kamoun (1)

(1) The Sainsbury Laboratory, Norwich, UK; (2) Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
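
Before running the R script, the three-column layout from Step 1 can be sanity-checked. The sketch below is in Python rather than R and is not part of the published protocol: the function name `load_values`, the replicate labels, the condition labels and the numbers are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

EXPECTED_COLUMNS = ["Replicate", "Condition", "Value"]

def load_values(csv_text):
    """Check the header, then group (replicate, value) pairs by condition."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"expected columns {EXPECTED_COLUMNS}, got {reader.fieldnames}")
    by_condition = defaultdict(list)
    for row in reader:
        # float() raises ValueError if a cell is not a continuous value.
        by_condition[row["Condition"]].append((row["Replicate"], float(row["Value"])))
    return dict(by_condition)

# Made-up example in the layout described in Step 1.
sample = (
    "Replicate,Condition,Value\n"
    "Jan2016,WT,1.2\n"
    "Jan2016,MutantA,0.8\n"
    "Feb2016,WT,1.5\n"
)

values = load_values(sample)
```

A check like this catches a misspelled column header or a text cell in the ‘Value’ column before the file is imported into R.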

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

#7 Display the graph in a separate window. Dot colors indicate replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.

https://cran.r-project.org/

http://ggplot2.org/
