19 datasets found

Petre_Slide_CategoricalScatterplotFigShare.pptx
figshare.com
pptx
Updated Sep 19, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
Explore at:
pptxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3840102.v1
Dataset updated
Sep 19, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

7 Display the graph in a separate window. Dot colors indicate

replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
w
Dataset of book subjects that contain The economics of immigration :...
workwithdata.com
Updated Nov 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2024). Dataset of book subjects that contain The economics of immigration : selected papers of Barry R. Chiswick [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=The+economics+of+immigration+:+selected+papers+of+Barry+R.+Chiswick&j=1&j0=books
Explore at:
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 1 row and is filtered where the books is The economics of immigration : selected papers of Barry R. Chiswick. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
d
Water-column environmental variables and accompanying discrete CTD...
catalog.data.gov
data.usgs.gov
Updated Oct 22, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Water-column environmental variables and accompanying discrete CTD measurements collected off California and Oregon during NOAA Ship Lasker R-19-05 (USGS field activity 2019-672-FA) from October to November 2019 (ver. 2.0, July 2022) [Dataset]. https://catalog.data.gov/dataset/water-column-environmental-variables-and-accompanying-discrete-ctd-measurements-collected--c3a6b
Explore at:
Dataset updated
Oct 22, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
Various water column variables, including salinity, dissolved inorganic nutrients, pH, total alkalinity, dissolved inorganic carbon, radio-carbon isotopes were measured in samples collected using a Niskin-bottle rosette at selected depths from sites offshore of California and Oregon from October to November 2019 during NOAA Ship Lasker R-19-05 (USGS field activity 2019-672-FA). CTD (Conductivity Temperature Depth) data were also collected at each depth that a Niskin-bottle sample was collected and are presented along with the water sample data. This data release supersedes version 1.0, published in August 2020 at https://doi.org/10.5066/P9ZS1JX8. Versioning details are documented in the accompanying VersionHistory_P9JKYWQU.txt file.
S
Computational code of square cascade for separation of Ne stable isotopes
scidb.cn
Updated Feb 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
fatemeh (2023). Computational code of square cascade for separation of Ne stable isotopes [Dataset]. http://doi.org/10.57760/sciencedb.07250
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.07250
Dataset updated
Feb 7, 2023
Dataset provided by
Science Data Bank
Authors
fatemeh
License
https://api.github.com/licenses/unlicensehttps://api.github.com/licenses/unlicense
Description
One of the methods for the separation of stable isotopes is the thermal diffusion column. The advantages of this method include small-scale operations because of apparatus simplicity and a small inventory, especially in gas phase operations. These features attract attention to the tritium and noble gas separation system. In this research, the R cascade was used for designing and determining the number of columns. Moreover, the square cascade was adopted for the final design because of its flexibility. Calculations were performed as an example for the separation of 20Ne and 22Ne isotopes. Accordingly, all R cascades that have enriched Ne isotopes to more than 99% were investigated, and the number of columns was determined. Also, using the specified columns, the square cascade parameters were optimized. A calculation code entitled ''RSQ_CASCADE'' was developed in this regard. The unit separation factor of 3 was considered, and the number of stages was studied in the range of 10 to 20. The results showed that the column separation power, the relative total flow rate, and the required columns were linearly related to the number of stages. The separation power and relative total flow decreased with the increase of stage number while the number of columns increased. Therefore, the cascade of 85 columns was recommended to separate the Ne stable isotopes. These calculations resulted in the 17-stage square cascade, with five columns in each stage. By changing the stages cut, feed point, and cascade feed flow rate, the best parameters of square cascade were determined according to the cascade separation power and column separation power. As the column separation power had the maximum value in the cascade feed 50, it was selected for separating Ne isotopes.
d
Council; Council Files April 17, 1847, Case of Leander Thompson, GC3/series...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). Council; Council Files April 17, 1847, Case of Leander Thompson, GC3/series 378, Petition of Luther Rist [Dataset]. http://doi.org/10.7910/DVN/NCAOR
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/NCAOR
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
Description
Petition subject: Execution case Original: http://nrs.harvard.edu/urn-3:FHCL:12233039 Date of creation: (unknown) Petition location: Uxbridge Selected signatures:Luther RistSusan R. UsherHarriett N. Moury Total signatures: 175 Legal voter signatures (males not identified as non-legal): 64 Female signatures: 84 Unidentified signatures: 27 Female only signatures: No Identifications of signatories: inhabitants, [females] Prayer format was printed vs. manuscript: Manuscript Signatory column format: not column separated Additional non-petition or unrelated documents available at archive: additional documents available Additional archivist notes: Leander Thompson Location of the petition at the Massachusetts Archives of the Commonwealth: Governor Council Files, April 17, 1847, Case of Leander Thompson Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.
Data from: Assessing the impact of static and fluctuating ocean...
zenodo.org
bin
Updated Jan 24, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthew A Vaughan; Danielle L Dixson; Matthew A Vaughan; Danielle L Dixson (2021). Assessing the impact of static and fluctuating ocean acidification on the behavior of Amphiprion percula [Dataset]. http://doi.org/10.5281/zenodo.4459414
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4459414
Dataset updated
Jan 24, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Matthew A Vaughan; Danielle L Dixson; Matthew A Vaughan; Danielle L Dixson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Attached is the complete raw data from Vaughan and Dixson 2021 ‘Assessing the impact of static and fluctuating ocean acidification on the behavior of Amphiprion percula’.

Data collected from the behavioral lateralization trials has been inputted into the file ‘Vaughan_2020_Lateralization_Raw’. Column A indicate the CO₂ treatment group, where “SPD” = Static Present Day, “SFD” = Static Future Day, “FPD” = Fluctuating Present Day, and “FFD” = Fluctuating Future Day. Each individual fish used from each treatment group (n=30) is displayed in Column B. Column C shows the binary results, in order, of each fish’s turns in the T-maze, and was scored as 0 (right turn) or 1 (left turn) for a total of 10 turns. The total number of turns to the right and left are provided in Column D-E. The relative lateralization (L_R) of each fish was calculated {L_R = [(Turn to the right – Turn to the left)/(Turn to the right + Turn to the left)] ∗ 100} in Column F. Absolute lateralization (L_A) is provided in Column G.

Chemosensory response data has been inputted into the file ‘Vaughan_2020_Chemosensory_Raw’. Column A and B display treatment group and fish ID (n=20) as outlined above. The cue used in trial of either Tang (nonpredator) or Cod (predator) is provided in Column C, and the control in Column D. Numbers in these are used solely for the purpose of data analysis. The side of the cue in the flume is provided in Column E, and corresponds with the cue labelled in Column C. Buckets containing either the cue or control were placed above the flume and color coded as “BS” (blue side) and “RS” (red side), as the person scoring the trials was blinded. This also helped account for the switch (from one side of the flume to the other) that occurs halfway through each trial. Columns F-G represent results from the first 2min recording period, and Columns H-I represent results from the second 2min recording period. The total tallies from each fish are provided in Column J; the totals from each side are calculated in Columns K-L, and then sorted by either cue or control in Columns M-N. Proportions and percentages in cue and control are calculated and provided in Columns O-P and Q-R, respectively.

Carbonate chemistry data is compiled and provided in the ‘Vaughan_2020_Carbonate_Chemistry’. Measurements were taken each week (Column A) of each treatment group (as stated above, Column B). Column C reflects the time recordings were taken in the fluctuating treatments to hit the high, mid and low CO₂ points at “6:30”, “12:30” and “18:30”. Measurements of static treatment groups were taken at randomly selected times to get the reflection of the carbonate chemistry of these treatments, but for the purpose of clarity in this document they are listed as “Static”. Measurements were taken from a subset of tanks each that rotated each week (Column D). Our target pH_NBS values (i.e. what was programmed into the APEX System) are listed in Column E. Columns F-H displayed pH_NBS (taken with APEX probes), temperature °C (taken with a portable Mettler Toledo probe) and salinity (taken with a refractometer). Water samples were analyzed spectrophotometrically to provide pH_T and dissolved inorganic carbon, with values provided in Columns I-J. Using the program CO2SYS, total alkalinity and pCO₂ were calculated, with values provided in Columns K-L.
Sloan Digital Sky Survey DR16
kaggle.com
zip
Updated Dec 30, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mukharbek Organokov (2019). Sloan Digital Sky Survey DR16 [Dataset]. https://www.kaggle.com/muhakabartay/sloan-digital-sky-survey-dr16
Explore at:
zip(6728394 bytes)Available download formats
Dataset updated
Dec 30, 2019
Authors
Mukharbek Organokov
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Feedback: Mukharbek Organokov organokov.m@gmail.com

Context

Sloan Digital Sky Survey current DR16 Server Data release with Galaxies, Stars and Quasars.

License: Creative Commons Attribution license (CC-BY) More datailes here. Find more here.

Content

The table results from a query which joins two tables:
- "PhotoObj" which contains photometric data
- "SpecObj" which contains spectral data.

16 variables (double) and 1 additional variable (char) 'class'. A class object can be predicted from the other 16 variables.

Variables description:
objid = Object Identifier
ra = J2000 Right Ascension (r-band)
dec = J2000 Declination (r-band)
u = better of deV/Exp magnitude fit (u-band)
g = better of deV/Exp magnitude fit (g-band)
r = better of deV/Exp magnitude fit (r-band)
i = better of deV/Exp magnitude fit (i-band)
z = better of deV/Exp magnitude fit (z-band)
run = Run Number
rerun = Rerun Number
camcol = Camera column
field = Field number
specobjid = Object Identifier
class = object class (galaxy, star or quasar object)
redshift = Final Redshift
plate = plate number
mjd = MJD of observation
fiberid = fiberID

Comments

A four-color UVGR intermediate-band photometric system (Thuan-Gunn astronomic magnitude system) is discussed in [1]. The Sloan Digital Sky Survey (SDSS) photometric system, a new five-color (u′ g′ r′ i′ z′) wide-band CCD system is described in [2]

The variables 'run', 'rerun', 'camcol' and 'field' features which describe a field within an image taken by the SDSS. A field is basically a part of the entire image corresponding to 2048 by 1489 pixels. A field can be identified by: - run number, which identifies the specific scan, - the camera column, or "camcol," a number from 1 to 6, identifying the scanline within the run, and the field number. The field number typically starts at 11 (after an initial rampup time), and can be as large as 800 for particularly long runs. - An additional number, rerun, specifies how the image was processed.

The variable 'class' identifies an object to be either a galaxy (GALAXY), star (STAR) or quasar (QSO).
####References:
[1] Thuan & Gunn (1976, PASP, 88,543)
[2] Fukugita, M. et al, Astronomical J. v.111, p.1748

Data server

Data can be obtained using SkyServer SQL Search with the command below:
-- This query does a table JOIN between the imaging (PhotoObj) and spectra
-- (SpecObj) tables and includes the necessary columns in the SELECT to upload
-- the results to the SAS (Science Archive Server) for FITS file retrieval.
SELECT TOP 100000
p.objid,p.ra,p.dec,p.u,p.g,p.r,p.i,p.z,
p.run, p.rerun, p.camcol, p.field,
s.specobjid, s.class, s.z as redshift,
s.plate, s.mjd, s.fiberid
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.u BETWEEN 0 AND 19.6
AND g BETWEEN 0 AND 20

Learn how to. Some examples. Full SQL Tutorial.

Or perform a complicated, CPU-intensive query of SDSS catalog data using CasJobs, SQL-based interface to the CAS.

Acknowledgements

SDSS collaboration.

Inspiration

The Sloan Digital Sky Survey has created the most detailed three-dimensional maps of the Universe ever made, with deep multi-color images of one-third of the sky, and spectra for more than three million astronomical objects. It allows to learn and explore all phases and surveys - past, present, and future - of the SDSS.
Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...
search.datacite.org
doi.org
+1more
Updated 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
Explore at:
Unique identifier
https://doi.org/10.3886/e102263v5-10021
Dataset updated
2018
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
DataCitehttps://www.datacite.org/
Authors
Jacob Kaplan
Description
Version 5 release notes:
Removes support for SPSS and Excel data.Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
Adds in agencies that report 0 months of the year.Adds a column that indicates the number of months reported. This is generated summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime with have a value of NA for every arrest column for that crime.Removes data on runaways.
Version 4 release notes:
Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these column includes the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
Add data for 2016.Order rows by year (descending) and ORI.Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possible incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.

To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, If you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrests for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

I created 9 arrest categories myself. The categories are:
Total Male JuvenileTotal Female JuvenileTotal Male AdultTotal Female AdultTotal MaleTotal FemaleTotal JuvenileTotal AdultTotal ArrestsAll of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set need fewer columns, I include all offenses.

As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

Index Crimes
MurderRapeRobberyAggravated AssaultBurglaryTheftMotor Vehicle TheftArsonAlcohol CrimesDUIDrunkenness
LiquorDrug CrimesTotal DrugTotal Drug SalesTotal Drug PossessionCannabis PossessionCannabis SalesHeroin or Cocaine PossessionHeroin or Cocaine SalesOther Drug PossessionOther Drug SalesSynthetic Narcotic PossessionSynthetic Narcotic SalesGrey Collar and Property CrimesForgeryFraudStolen PropertyFinancial CrimesEmbezzlementTotal GamblingOther GamblingBookmakingNumbers LotterySex or Family CrimesOffenses Against the Family and Children
Other Sex Offenses
ProstitutionRapeViolent CrimesAggravated AssaultMurderNegligent ManslaughterRobberyWeapon Offenses
Other CrimesCurfewDisorderly ConductOther Non-trafficSuspicion
VandalismVagrancy
Simple
This data set has every crime and only the arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

Google Data Analytics Case Study Cyclistic

kaggle.com

zip

Updated Sep 27, 2022

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions

Explore at:

zip(1299 bytes)Available download formats

Dataset updated

Sep 27, 2022

Authors

Udayakumar19

Description

Introduction

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insight will help the marketing team to make a strategy for casual riders

Prepare

Guiding Question:

Where is your data located?
  Data located in Cyclistic organization data.

How is data organized?
  Dataset are in csv format for each month wise from Financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC? 
  It is good it is ROCCC because data collected in from Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company has their own license over the dataset. Dataset does not have any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights always hidden in the data. We have the interpret with data to find the insights.

Are there any problems with the data?
  Yes, starting station names, ending station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used R studio for the cleaning and transforming the data for analysis phase because of large dataset and to gather experience in the language.

Have you ensured the data’s integrity?
 Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First duplicates, null values are removed then added new columns for analysis.

How can you verify that your data is clean and ready to analyze? 
 Make sure the column names are consistent thorough out all data sets by using the “bind row” function.

Make sure column data types are consistent throughout all the dataset by using the “compare_df_col” from the “janitor” package.
Combine the all dataset into single data frame to make consistent throught the analysis.
Removed the column start_lat, start_lng, end_lat, end_lng from the dataframe because those columns not required for analysis.
Create new columns day, date, month, year, from the started_at column this will provide additional opportunities to aggregate the data
Create the “ride_length” column from the started_at and ended_at column to find the average duration of the ride by the riders.
Removed the null rows from the dataset by using the “na.omit function”
Have you documented your cleaning process so you can review and share those results? 
  Yes, the cleaning process is documented clearly.

Analyze Phase:

Guiding Questions:

How should you organize your data to perform analysis on it? The data has been organized in one single dataframe by using the read csv function in R Has your data been properly formatted? Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual member ride duration is higher than the annual members
  Causal member widely uses docked bike than the annual members
What trends or relationships did you find in the data?
  Annual members are used mainly for commute purpose
  Casual member are preferred the docked bikes
  Annual members are preferred the electric or classic bikes
How will these insights help answer your business questions?
  This insights helps to build a profile for members

Guiding Quesions:

Were you able to answer the question of how ...

d
House Unpassed Legislation 1842, Docket 1153, SC1/series 230, Petition of...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). House Unpassed Legislation 1842, Docket 1153, SC1/series 230, Petition of J.H. Brown [Dataset]. http://doi.org/10.7910/DVN/98KUO
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/98KUO
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
Description
Petition subject: Against railroad discrimination with focus on white passengers Original: http://nrs.harvard.edu/urn-3:FHCL:10956457 Date of creation: (unknown) Petition location: Sudbury Legislator, committee, or address that the petition was sent to: Francis R. Gourgas, Concord Selected signatures:J.H. BrownSally BrownLoring Eaton Total signatures: 76 Legal voter signatures (males not identified as non-legal): 31 Female signatures: 37 Unidentified signatures: 8 Female only signatures: No Identifications of signatories: inhabitants, [females] Prayer format was printed vs. manuscript: Printed Signatory column format: not column separated Additional non-petition or unrelated documents available at archive: no additional documents Additional archivist notes: 11057/4 written on back Location of the petition at the Massachusetts Archives of the Commonwealth: House Unpassed 1842, Docket 1153 Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.
n
Gold Level L2 ratio of the column abundance of thermospheric O relative to...
heliophysicsdata.gsfc.nasa.gov
application/x-cdf +2
Updated Nov 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Gold Level L2 ratio of the column abundance of thermospheric O relative to N2 [Dataset]. https://heliophysicsdata.gsfc.nasa.gov/WS/hdp/1/Spase?ResourceID=spase%3A%2F%2FNASA%2FNumericalData%2FGOLD%2FL2%2FON2%2FPT8S
Explore at:
application/x-cdf, csv, binAvailable download formats
Dataset updated
Nov 14, 2025
License
https://spdx.org/licenses/CC0-1.0https://spdx.org/licenses/CC0-1.0
Variables measured
Latitude, UTC time, Longitude, GOLD channel, Lookup Table, L1C File Name, Scan Stop Time, Mask Wavelength, Scan Start Time, N2 LBH brightness, and 16 more
Description
GOLD daytime disk scan (DAY) measurements are used to derive the ratio of the column abundance of thermospheric O relative to N2, conventionally referred to as O/N2 or ΣO/N2, but abbreviated to ON2 for the GOLD data product. ON2 is derived from dayside Level 1C data after binning pixels 2x2 for approximately 68 disk scan measurements performed per day by GOLD in nominal operation.

Algorithm heritage

The disk ON2 retrieval algorithm was originally developed by Computational Physics, Inc. (CPI) for use with GUVI and SSUSI radiance images (Strickland et al., 1995). The GOLD implementation of this algorithm takes advantage of GOLD's ability to transmit the full spectrum to maximize the signal-to-noise ratio and eliminate atomic emission lines that contaminate the N2 LBH bands (e.g., N I 149.3 nm). This algorithm has been extensively documented and applied over the past several decades (e.g., Evans et al. [1995]; Christensen et al. [2003]; Strickland et al. [2004]) and Correira et al. [2021] describe the implementation used for the GOLD data.

Algorithm theoretical basis

The geophysical parameter retrieved, O/N2, is the ratio of the vertical column density of O relative to N2, defined at a standard reference N2 depth of 1017 cm-2, which is chosen to minimize uncertainty in the derived O/N2. It is retrieved directly from the ratio of the O I 135.6 nm and N2 LBH band intensities measured by GOLD on the dayside disk (DAY measurement mode). The AURIC atmospheric radiance model (Strickland et al. [1999]) is used to derive this relationship as a function of solar zenith angle and to create the look-up table (LUT) used by the algorithm.

References

Christensen, A. B., et al. (2003), Initial observations with the Global Ultraviolet Imager (GUVI) in the NASA TIMED satellite mission, J. Geophys. Res., vol. 108, NO. A12, 1451, doi:10.1029/2003JA009918.

Correira, J., Evans, J. S., Lumpe, J. D., Krywonos, A., Daniell, R., Veibell, V., et al. (2021). Thermospheric composition and solar EUV flux from the Globalscale Observations of the Limb and Disk (GOLD) mission. Journal of Geophysical Research: Space Physics, 126, e2021JA029517. https://doi.org/10.1029/2021JA029517

Evans, J. S., D. J. Strickland and R. E. Huffman (1995), Satellite remote sensing of thermospheric O/N2 and solar EUV: 2. Data analysis, J. Geophys. Res., vol. 100, NO. A7, pages 12,227-12,233.

Strickland, D. J., R. R. Meier, R. L. Walterscheid, J. D. Craven, A. B. Christensen, L. J. Paxton, D. Morrison, and G. Crowley (2004), Quiet-time seasonal behavior of the thermosphere seen in the far ultraviolet dayglow, J. Geophys. Res., vol. 109, A01302, doi:10.1029/2003JA010220.

Strickland, D.J., J. Bishop, J.S. Evans, T. Majeed, P.M. Shen, R.J. Cox, R. Link, and R.E. Huffman (1999), Atmospheric Ultraviolet Radiance Integrated Code (AURIC): theory, software architecture, inputs and selected results, JQSRT, 62, 689-742.

Strickland, D. J., J. S. Evans, and L. J. Paxton (1995), Satellite remote sensing of thermospheric O/N2 and solar EUV: 1. Theory, J. Geophys. Res., 110, A7, pages 12,217-12,226.
Case study: Cyclistic bike-share analysis
kaggle.com
zip
Updated Mar 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
Explore at:
zip(131490806 bytes)Available download formats
Dataset updated
Mar 25, 2022
Authors
Jorge4141
Description
Introduction

This is a case study called Capstone Project from the Google Data Analytics Certificate.

In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

Scenario

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.

****Primary Stakeholders:****

1: Cyclistic Executive Team

2: Lily Moreno, Director of Marketing and Manager

ASK

How do annual members and casual riders use Cyclistic bikes differently?

Why would casual riders buy Cyclistic annual memberships?

How can Cyclistic use digital media to influence casual riders to become members?

# Prepare

The last four quarters were selected for analysis which cover April 01, 2019 - March 31, 2020. These are the datasets used:

Divvy_Trips_2019_Q2 Divvy_Trips_2019_Q3 Divvy_Trips_2019_Q4 Divvy_Trips_2020_Q1

The data is stored in CSV files. Each file contains one month data for a total of 12 .csv files.

Data appears to be reliable with no bias. It also appears to be original, current and cited.

I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

Limitations

Financial information is not available.

Process

Used R to analyze and clean data

After installing the R packages, data was collected, wrangled and combined into a single file.

Columns were renamed.

Looked for incongruencies in the dataframes and converted some columns to character type, so they can stack correctly.

Combined all quarters into one big data frame.

Removed unnecessary columns

Analyze

Inspected new data table to ensure column names were correctly assigned.

Formatted columns to ensure proper data types were assigned (numeric, character, etc).

Consolidated the member_casual column.

Added day, month and year columns to aggregate data.

Added ride-length column to the entire dataframe for consistency.

Deleted trip duration rides that showed as negative and bikes out of circulation for quality control.

Replaced the word "member" with "Subscriber" and also replaced the word "casual" with "Customer".

Aggregated data, compared average rides between members and casual users.

Share

After analysis, visuals were created as shown below with R.

Act

Conclusion:

Data appears to show that casual riders and members use bike share differently.

Casual riders' average ride length is more than twice of that of members.

Members use bike share for commuting, casual riders use it for leisure and mostly on the weekends.

Unfortunately, there's no financial data available to determine which of the two (casual or member) is spending more money.

Recommendations

Offer casual riders a membership package with promotions and discounts.
o
Uniform Crime Reporting Program Data: Offenses Known and Clearances by...
openicpsr.org
search.datacite.org
Updated Dec 17, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2017). Uniform Crime Reporting Program Data: Offenses Known and Clearances by Arrest, 1960-2016 [Dataset]. http://doi.org/10.3886/E100707V2
Explore at:
Unique identifier
https://doi.org/10.3886/E100707V2
Dataset updated
Dec 17, 2017
Dataset provided by
University of Pennsylvania. Department of Criminology
Authors
Jacob Kaplan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
1960 - 2016
Area covered
United States
Description
This is a collection of Offenses Known and Clearances By Arrest data from 1960 to 2016. The monthly zip files contain one data file per year(57 total, 1960-2016) as well as a codebook for each year. These files have been read into R using the ASCII and setup files from ICPSR (or from the FBI for 2016 data) using the package asciiSetupReader. The end of the zip folder's name says what data type (R, SPSS, SAS, Microsoft Excel CSV, feather, Stata) the data is in. Due to file size limits on open ICPSR, not all file types were included for all the data.

The files are lightly cleaned. What this means specifically is that column names and value labels are standardized. In the original data column names were different between years (e.g. the December burglaries cleared column is "DEC_TOT_CLR_BRGLRY_TOT" in 1975 and "DEC_TOT_CLR_BURG_TOTAL" in 1977). The data here have standardized columns so you can compare between years and combine years together. The same thing is done for values inside of columns. For example, the state column gave state names in some years, abbreviations in others. For the code uses to clean and read the data, please see my GitHub file here. https://github.com/jacobkap/crime_data/blob/master/R_code/offenses_known.R

The zip files labeled "yearly" contain yearly data rather than monthly. These also contain far fewer descriptive columns about the agencies in an attempt to decrease file size. Each zip folder contains two files: a data file in whatever format you choose and a codebook. The data file is aggregated yearly and has already combined every year 1960-2016. For the code I used to do this, see here https://github.com/jacobkap/crime_data/blob/master/R_code/yearly_offenses_known.R.

If you find any mistakes in the data or have any suggestions, please email me at jkkaplan6@gmail.com

As a description of what UCR Offenses Known and Clearances By Arrest data contains, the following is copied from ICPSR's 2015 page for the data.

The Uniform Crime Reporting Program Data: Offenses Known and Clearances By Arrest dataset is a compilation of offenses reported to law enforcement agencies in the United States. Due to the vast number of categories of crime committed in the United States, the FBI has limited the type of crimes included in this compilation to those crimes which people are most likely to report to police and those crimes which occur frequently enough to be analyzed across time. Crimes included are criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. Much information about these crimes is provided in this dataset. The number of times an offense has been reported, the number of reported offenses that have been cleared by arrests, and the number of cleared offenses which involved offenders under the age of 18 are the major items of information collected.
d
Council; Council Files September 22, 1843, Case of Isaac Leavitt, GC3/series...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). Council; Council Files September 22, 1843, Case of Isaac Leavitt, GC3/series 378, Petition of Charles W. Lillie [Dataset]. http://doi.org/10.7910/DVN/2RMA9
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/2RMA9
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
Time period covered
Sep 11, 1843
Description
Petition subject: Execution case Original: http://nrs.harvard.edu/urn-3:FHCL:12232985 Date of creation: 1843-09-11 Petition location: Roxbury Selected signatures:Charles W. LillieStephen R. DoggettCaroline Williams Total signatures: 13 Legal voter signatures (males not identified as non-legal): 9 Female signatures: 4 Female only signatures: No Identifications of signatories: inhabitants, [females] Prayer format was printed vs. manuscript: Manuscript Signatory column format: not column separated Additional non-petition or unrelated documents available at archive: additional documents available Additional archivist notes: Isaac Leavitt Location of the petition at the Massachusetts Archives of the Commonwealth: Governor Council Files, September 22, 1843, Case of Isaac Leavitt Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.
l
LScDC Word-Category RIG Matrix
figshare.le.ac.uk
pdf
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.12133431.v2
Dataset updated
Apr 28, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
Data from: Candidate selective sweeps in U.S. wheat populations
data.niaid.nih.gov
agdatacommons.nal.usda.gov
+1more
zip
Updated Nov 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sajal Sthapit; Travis Ruff; Marcus Hooker; Deven See (2024). Candidate selective sweeps in U.S. wheat populations [Dataset]. http://doi.org/10.5061/dryad.ghx3ffbx0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.ghx3ffbx0
Dataset updated
Nov 6, 2024
Dataset provided by
The Land Institute
USDA-ARS Wheat Health, Genetics, and Quality Research
Washington State University
Authors
Sajal Sthapit; Travis Ruff; Marcus Hooker; Deven See
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Area covered
United States
Description
Exploration of novel alleles from ex situ collection is still limited in modern plant breeding as these alleles exist in genetic backgrounds of landraces that are not adapted to modern production environments. The practice of backcross breeding results in the preservation of the adapted background of elite parents but leaves little room for novel alleles from landraces to be incorporated. The selection of adaptation-associated linkage blocks instead of the entire adapted background may allow breeders to incorporate more of the landrace’s genetic background and to observe and evaluate novel alleles. Important adaptation-associated linkage blocks would have been selected over multiple cycles of breeding and hence are likely to exhibit signatures of positive selection or selective sweeps. We conducted a genome-wide scan for candidate selective sweeps (CSS) using Fst, Rsb, and xpEHH in state, regional, spring, winter, and market class population pairs and report 446 CSS in 19 population pairs over time and 1033 CSS in 44 population pairs across geography and class. Further validation of these candidate selective sweeps in specific breeding programs may lead to the identification of sets of loci that can be selected to restore population-specific adaptation without multiple backcrossing. Methods Folder Structure

The dataset has the following folder structure

./ or the root folder has the scripts used for analysis in R Markdown files as well as the corresponding .html output from running these scripts.

./data/ has the raw data and the intermediate data saves from the analysis

./functions/ has one file "functions_for_selection_sweep_analysis.R" that has the custom functions written for the analysis in the manuscript.

./output/ has the analysis results and figures used in the manuscript

./output/mapchart/ has the MapChart input files for drawing linkage maps of canddiate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2 standard deviations

./output/mapchart_sd2.5/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2.5 standard deviations.

./rehh_files/ has two subfolders /genotype and /map that store the intermediate files generated by the R package 'rehh' to calcualte Rsb and xpEHH.

Raw data files

The analysis in the manuscript uses the following raw data files. Data files not in this list are all intermediate files created by the analysis scripts.

./data/90k_SNP_type.txt

A tab-delimmited file with 4 columns as described below:

Index: serial number of genetic markers/loci on the 90K wheat SNP chip.

Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.

SNP: Alleles present in the single nucleotide polymorphism (SNP) marker/loci.

SNPTYPE: Same information as in column SNP but in a format without square brackets and /

./data/KIM_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

A tab-delimmited filed with information on known informative markers (KIM) recorded in 8 columns described below.

Marker: Name of the marker to be used as the label in the linkage maps in Supplemental Figures.

Chromosome: Chromosome label for wheat.

Start1.0: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 1.0. This information was not used in the current study.

Start: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 2.1.

Prop: Proportion sequence match for the marker to the reference genome sequence version 2.1.

SNP_ID: Alternative name for the marker. This information was not used in the current study.

Gene: Name of the gene.

Function: Function of the gene.

./data/R-generated-genotype-for-analysis-imputed-AB-format.csv

Raw 90K wheat SNP chip data after quality filtering and imputation uisng LinkImpute as described in Sthapit et al. The dataset includes the 7 information column described below, followed by 753 columns with genotype information in the AB format.

Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.

SNPid: Unique IWA and IWB SNP names of the genetic markers/loci on the 90K wheat SNP chip.

Chrom: Wheat chromosome labels.

Ord: Order of the marker. This information was not used for analysis.

cM: Centimorgan position of the marker. This information was not used for analysis.

Comment: Notes on manual classification of genotype calls in GenomeStudio.

Remaining columns have variety names and their corresponding genotype calls in AB format.

./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv

Same information as in ./data/R-generated-genotype-for-analysis-imputed-AB-format.csv but the genotype information in the last 753 columns are recorded in the nucleotide (ACGT) format.

./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

Contains physical base pair positions on the 'Chinese Spring' wheat reference sequence version 2.1 for the 90K SNP chip markers. The file has 5 columns without column headers. The column descriptions are given below.

First column has unique names of the genetic markers/loci on the 90K Wheat SNP chip.

Second column has wheat chromosome labels.

Third column has the starting base pair position of the marker on the reference sequence version 2.1.

Fourth column has the ending base pair position of the marker on the reference sequence version 2.1.

Fifth column has the mid-point of the third and fourth column, which was used at the SNP position for the marker in this study.

./data/variety_details.txt

Contains information about the 753 wheat varieties used as the diversity panel for this study. The file contains 12 columns, which are described below:

GS.Sample.ID: Names of the samples/varieties as they were in the raw output from the Illumina SNP calling software Genome Studio.

Corrected.Sample.ID: Names of the samples/varieties after they were corrected for typos (for example, 'Eric' to 'Erik') and removal of the prefix "varname" for varieties for varieties that only have numbers in their names ('varname2154' to '2154').

ACNO: Accession number of the varieties from the NPGS-GRIN database.

Habit: Growth habit (spring or winter) of the varieties.

Region: U.S. wheat growing regions: EAS, Eastern; GPL, Great Plains; NOR, Northern; PAC, Pacific; PNW, Pacific Northwest. Description of how states were assigned to these regions are in the methods section of the manuscript.

State: U.S. state the varieties are from.

Year: The year the variety was released in the U.S.

MC: Market class of the wheat variety: HRS, hard red spring; HRW, hard red winter; SRW, soft red winter; SWS, soft white spring; SWW, soft white winter.

HeadType: Designates if the spike or head of the wheat is club or common.

Sector: Was the variety from the public or private sector. Information in this column is incomplete and hence was not used for any analysis in the manuscript.

Decade: Decade the variety was released.

BP: Breeding period the variety was released.

Description of Scripts

Here we describe the scripts in order along with the input data files used and the output files these scripts produced.

./00_import_RefSeqv2.1_physical_positions.Rmd ./00_import_RefSeqv2.1_physical_positions.html (R Markdown output html)

The study uses genotype data generated from our previous study (https://doi.org/10.1002/tpg2.20196) that had marker physical positions based on wheat reference sequence version 1. This script updates the marker physical positions to the wheat reference sequence version 2.1 and saves the updated genotype files for subsequent analyses.

Input files:

./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt ./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv ./data/R-generated-genotype-for-analysis-imputed-AB-format.csv

Output files:

./data/genotype_AB_format_13995_loci_imputed.txt ./data/genotype_nucleotide_format_13995_loci_imputed.txt

01_define_populations.Rmd 01_define_populations.html (R Markdown output html)

The script assigns what varieties go into what sub-populations as described in the methods section of the manuscript.

Input files:

./functions/functions_for_selection_sweep_analysis.R ./data/variety_details.txt

Output files:

./data/populations.rds./output/first_last_varieties.csv 02_calculate_iHH_iES_inES.Rmd 02_calculate_iHH_iES_inES.html (R Markdown output html)

This script uses the 'rehh' package function 'scan_hh' called through the custom function 'scan_population' to calculate the integrated extended haplotype homozygosity (iHH), integrated site-specific extended haplotype homozygosity (iES), and integrated normalized site-specific extended haplotype homozygosity (inES) for all markers of all 21 chromosomes and all wheat sub-populations in the study. The intermediate files needed to run these calculations were written to the folders ./rehh_files/genotype and ./rehh_files/map. The output is saved as an RDS file to be used as input for subsequent scripts.

Input files:

./functions/functions_for_selection_sweep_analysis.R ./data/genotype_nucleotide_format_13995_loci_imputed.txt ./data/populations.rds

Output files:

./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds 03_calculate_allele_freq_Fst_Rsb_xpEHH.Rmd 03_calculate_allele_freq_Fst_Rsb_xpEHH.html (R Markdown output html)

Script calculates allele frequencies for all the sub-populations and Fst, Rsb, and xpEHH statistics for defined sub-population pairs.

Input files:

./functions/functions_for_selection_sweep_analysis.R ./data/genotype_nucleotide_format_13995_loci_imputed.txt ./data/genotype_AB_format_13995_loci_imputed.txt ./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds

Output files:

./output/allele_freq_Fst_Rsb_xpEHH.Rds
Z
Data from: Lower complexity of motor primitives ensures robust control of...
data.niaid.nih.gov
Updated Jun 18, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Santuz, Alessandro; Ekizos, Antonis; Kunimasa, Yoko; Kijima, Kota; Ishikawa, Masaki; Arampatzis, Adamantios (2022). Lower complexity of motor primitives ensures robust control of high-speed human locomotion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3764760
Explore at:
Dataset updated
Jun 18, 2022
Dataset provided by
Osaka University of Health and Sport Sciences
Humboldt-Universität zu Berlin, Dalhousie University
Humboldt-Universität zu Berlin
Authors
Santuz, Alessandro; Ekizos, Antonis; Kunimasa, Yoko; Kijima, Kota; Ishikawa, Masaki; Arampatzis, Adamantios
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Walking and running are mechanically and energetically different locomotion modes. For selecting one or another, speed is a parameter of paramount importance. Yet, both are likely controlled by similar low-dimensional neuronal networks that reflect in patterned muscle activations called muscle synergies. Here, we investigated how humans synergistically activate muscles during locomotion at different submaximal and maximal speeds. We analysed the duration and complexity (or irregularity) over time of motor primitives, the temporal components of muscle synergies. We found that the challenge imposed by controlling high-speed locomotion forces the central nervous system to produce muscle activation patterns that are wider and less complex relative to the duration of the gait cycle. The motor modules, or time-independent coefficients, were redistributed as locomotion speed changed. These outcomes show that robust locomotion control at challenging speeds is achieved by modulating the relative contribution of muscle activations and producing less complex and wider control signals, whereas slow speeds allow for more irregular control.

In this supplementary data set we made available: a) the metadata with anonymized participant information, b) the raw EMG, c) the touchdown and lift-off timings of the recorded limb, d) the filtered and time-normalized EMG, e) the muscle synergies extracted via NMF and f) the code to process the data, including the scripts to calculate the Higuchi's fractal dimension (HFD) of motor primitives. In total, 180 trials from 30 participants are included in the supplementary data set.

The file “metadata.dat” is available in ASCII and RData format and contains:

Code: the participant’s code

Group: the experimental group in which the participant was involved (G1 = walking and submaximal running; G2 = submaximal and maximal running)

Sex: the participant’s sex (M or F)

Speeds: the type of locomotion (W for walking or R for running) and speed at which the recordings were conducted in 10*[m/s]

Age: the participant’s age in years

Height: the participant’s height in [cm]

Mass: the participant’s body mass in [kg]

PB: 100 m-personal best time (for G2).

The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames containing two columns that correspond to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus. Please note that the following trials include less than 30 gait cycles (the actual number shown between parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28(29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17). All the other trials consist of 30 gait cycles. Trials are named like “P20_R_20,” where the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicate the locomotion type (W=walking, R=running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P03_R_30”.

Old versions not compatible with the R package musclesyneRgies

The files containing the gait cycle breakdown are available in RData format, in the file named “CYCLE_TIMES.RData”. The files are structured as data frames with as many rows as the available number of gait cycles and two columns. The first column named “touchdown” contains the touchdown incremental times in seconds. The second column named “stance” contains the duration of each stance phase of the right foot in seconds. Each trial is saved as an element of a single R list. Trials are named like “CYCLE_TIMES_P20_R_20,” where the characters “CYCLE_TIMES” indicate that the trial contains the gait cycle breakdown times, the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicate the locomotion type (W=walking, R=running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). Please note that the following trials include less than 30 gait cycles (the actual number shown between parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28(29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17).

The files containing the raw, filtered and the normalized EMG data are available in RData format, in the files named “RAW_EMG.RData” and “FILT_EMG.RData”. The raw EMG files are structured as data frames with as many rows as the amount of recorded data points and 13 columns. The first column named “time” contains the incremental time in seconds. The remaining 12 columns contain the raw EMG data, named with muscle abbreviations that follow those reported above. Each trial is saved as an element of a single R list. Trials are named like “RAW_EMG_P03_R_30”, where the characters “RAW_EMG” indicate that the trial contains raw emg data, the characters “P03” indicate the participant number (in this example the 3rd), the character “R” indicate the locomotion type (see above), and the numbers “30” indicate the locomotion speed (see above). The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P03_R_30”.

The files containing the muscle synergies extracted from the filtered and normalized EMG data are available in RData format, in the files named “SYNS_H.RData” and “SYNS_W.RData”. The muscle synergies files are divided in motor primitives and motor modules and are presented as direct output of the factorisation and not in any functional order. Motor primitives are data frames with 6000 rows and a number of columns equal to the number of synergies (which might differ from trial to trial) plus one. The rows contain the time-dependent coefficients (motor primitives), one column for each synergy plus the time points (columns are named e.g. “time, Syn1, Syn2, Syn3”, where “Syn” is the abbreviation for “synergy”). Each gait cycle contains 200 data points, 100 for the stance and 100 for the swing phase which, multiplied by the 30 recorded cycles, result in 6000 data points distributed in as many rows. This output is transposed as compared to the one discussed in the methods section to improve user readability. Each set of motor primitives is saved as an element of a single R list. Trials are named like “SYNS_H_P12_W_07”, where the characters “SYNS_H” indicate that the trial contains motor primitive data, the characters “P12” indicate the participant number (in this example the 12th), the character “W” indicate the locomotion type (see above), and the numbers “07” indicate the speed (see above). Motor modules are data frames with 12 rows (number of recorded muscles) and a number of columns equal to the number of synergies (which might differ from trial to trial). The rows, named with muscle abbreviations that follow those reported above, contain the time-independent coefficients (motor modules), one for each synergy and for each muscle. Each set of motor modules relative to one synergy is saved as an element of a single R list. Trials are named like “SYNS_W_P22_R_20”, where the characters “SYNS_W” indicate that the trial contains motor module data, the characters “P22” indicate the participant number (in this example the 22nd), the character “W” indicates the locomotion type (see above), and the numbers “20” indicate the speed (see above). Given the nature of the NMF algorithm for the extraction of muscle synergies, the supplementary data set might show non-significant differences as compared to the one used for obtaining the results of this paper.

The files containing the HFD calculated from motor primitives are available in RData format, in the file named “HFD.RData”. HFD results are presented in a list of lists containing, for each trial, 1) the HFD, and 2) the interval time k used for the calculations. HFDs are presented as one number (mean HFD of the primitives for that trial), as are the interval times k. Trials are named like “HFD_P01_R_95”, where the characters “HFD” indicate that the trial contains HFD data, the characters “P01” indicate the participant number (in this example the 1st), the character “R” indicates the locomotion type (see above), and the numbers “95” indicate the speed (see above).

All the code used for the pre-processing of EMG data, the extraction of muscle synergies and the calculation of HFD is available in R format. Explanatory comments are profusely present throughout the script “muscle_synergies.R”.
Single column 1D radiative transfer simulations for a case study of...
zenodo.org
nc
Updated Feb 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carola Barrientos-Velasco; Carola Barrientos-Velasco (2023). Single column 1D radiative transfer simulations for a case study of low-level-stratus clouds in the central Arctic during PS106 [Dataset]. http://doi.org/10.5281/zenodo.7674007
Explore at:
ncAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7674007
Dataset updated
Feb 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Carola Barrientos-Velasco; Carola Barrientos-Velasco
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Arctic
Description
The collection of datasets published contain the input parameters and output simulations from a single column 1D radiative transfer simulations using the Rapid Radiative Transfer Model for General Circulation Model (GCM) applications (RRTMG).

The simulations are focused on a selected case study of low-level-stratus clouds during the PS106 research cruise conducted in 2017 in the Central Arctic. The simulations are based on remote sensing observations, which were synergistically used with the Cloudnet algorithm to derive macro and microphysical properties of clouds. The atmospheric profiles of temperature, pressure, and ozone are from ERA5 (European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis) and values of surface albedo from CERES (Clouds and the Earth's Radiant Energy System) SYN1deg Ed. 4.1.
d
Senate Unpassed Legislation 1839, Docket 10525, SC1/series 231, Petition of...
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). Senate Unpassed Legislation 1839, Docket 10525, SC1/series 231, Petition of Samuel E. Sewall [Dataset]. http://doi.org/10.7910/DVN/E8QSD
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/E8QSD
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
Time period covered
Dec 19, 1838
Description
Petition subject: To abolish slavery in Washington D.C. and against the admission of Florida and slave states Original: http://nrs.harvard.edu/urn-3:FHCL:11857942 Date of creation: 1838-12-19 Petition location: Boston Legislator, committee, or address that the petition was sent to: George Bradburn, Nantucket Selected signatures:Samuel E. SewallCharles R. PettengillJames C. OdiornePerez GillJoseph NoyesOrin CarpenterJohn T. HiltonRobert MorrisCatherine BarbadoesChloe A. LeeWilliam C. NellEunice R. DavisPeter GrayHenry WeedenJoel W. Lewis Total signatures: 162 Legal voter signatures (males not identified as non-legal): 133 Female signatures: 12 Unidentified signatures: 17 Female only signatures: No Identifications of signatories: citizens, [females], [males of color], [females of color] Prayer format was printed vs. manuscript: Printed Signatory column format: not column separated Additional non-petition or unrelated documents available at archive: no additional documents Additional archivist notes: Appears that only the bottom two sections include females Location of the petition at the Massachusetts Archives of the Commonwealth: Senate Unpassed 1839, Docket 10525 Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1

Petre_Slide_CategoricalScatterplotFigShare.pptx

Explore at:

pptxAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.3840102.v1

Dataset updated

Sep 19, 2016

Dataset provided by

Figsharehttp://figshare.com/

Authors

Benj Petre; Aurore Coince; Sophien Kamoun

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

7 Display the graph in a separate window. Dot colors indicate

replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/

Clear search

Close search

Google apps

Main menu

Petre_Slide_CategoricalScatterplotFigShare.pptx

7 Display the graph in a separate window. Dot colors indicate

Dataset of book subjects that contain The economics of immigration :...

Water-column environmental variables and accompanying discrete CTD...

Computational code of square cascade for separation of Ne stable isotopes

Council; Council Files April 17, 1847, Case of Leander Thompson, GC3/series...

Data from: Assessing the impact of static and fluctuating ocean...

Sloan Digital Sky Survey DR16

Context

Content

Comments

Data server

Acknowledgements

Inspiration

Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

Google Data Analytics Case Study Cyclistic

Introduction

Scenario

Ask

Guiding Question:

Prepare

Guiding Question:

Process

Guiding Question:

Analyze Phase:

Guiding Questions:

Share

Guiding Quesions:

House Unpassed Legislation 1842, Docket 1153, SC1/series 230, Petition of...

Gold Level L2 ratio of the column abundance of thermospheric O relative to...

Case study: Cyclistic bike-share analysis

Introduction

Scenario

****Primary Stakeholders:****

ASK

Limitations

Process

Analyze

Share

Act

Recommendations

Uniform Crime Reporting Program Data: Offenses Known and Clearances by...

Council; Council Files September 22, 1843, Case of Isaac Leavitt, GC3/series...

LScDC Word-Category RIG Matrix

Data from: Candidate selective sweeps in U.S. wheat populations

Data from: Lower complexity of motor primitives ensures robust control of...

Single column 1D radiative transfer simulations for a case study of...

Senate Unpassed Legislation 1839, Docket 10525, SC1/series 231, Petition of...

Petre_Slide_CategoricalScatterplotFigShare.pptx

7 Display the graph in a separate window. Dot colors indicate

Primary Stakeholders: