8 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates; in the example, each replicate is identified by the month and year in which it was performed. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R. An invented example layout is shown after Step 3.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers. A minimal sketch of such a script is given at the end of the Notes section.

    • Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
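
    For illustration only, a .csv file following the layout described in Step 1 (with invented replicate labels and values) might look like this:

    Replicate,Condition,Value
    Jan2016,WT,5.2
    Jan2016,mutantA,3.1
    Jan2016,mutantB,2.8
    Feb2016,WT,4.9
    Feb2016,mutantA,3.4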

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
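
    For context, below is a minimal sketch of a complete script in the spirit of the protocol. This is not the original slide script: the read step and the plot construction are assumptions based on the description above, and only the final command corresponds to the documented command line #7 (without the log scale).

    # Minimal sketch (not the original slide script)
    library(ggplot2)

    # Choose the input .csv file from Step 1 in a dialog box
    data <- read.csv(file.choose(), header = TRUE)

    # Build the base plot: Condition on the x-axis, Value on the y-axis
    graph <- ggplot(data, aes(x = Condition, y = Value))

    # Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) + theme_bw()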

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Data from: Annotated database of Slovenian adjectives

    • zenodo.org
    Updated Apr 8, 2025
    Cite
    Petra Mišmaš; Petra Mišmaš; Marko Simonović; Marko Simonović; Stefan Milosavljević; Stefan Milosavljević (2025). Annotated database of Slovenian adjectives [Dataset]. http://doi.org/10.5281/zenodo.15174244
    Explore at:
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Petra Mišmaš; Petra Mišmaš; Marko Simonović; Marko Simonović; Stefan Milosavljević; Stefan Milosavljević
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database presents the morphological annotation of Slovenian adjectives. It includes the 6,000 most frequent adjectives in Slovenian, extracted from the Gigafida 2.0 corpus (deduplicated) using the CQL query [tag="P.*"] on a random sample of 10,000,000 lines in the NoSketch Engine in March 2024.

    Among the adjectives on the list, there are some homophonous items, and given that the corpus is not annotated for meaning, homophonous adjectives were counted as a single item. For example, premočen can mean either ‘soaked’ (from the verb premočiti ‘to drench’) or ‘too strong’ (from močen ‘strong’). The annotator decided which meaning they perceived as more salient and annotated the item as that specific adjective.

    Proper names were not annotated as morphologically complex. For example, the possessive Gregorinov ‘Gregorin’s’ (Gregorin is a last name) is only marked for the possessive suffix -ov, even though the last name itself is probably decomposable into Gregor+in.

    Column-by-column overview

    We start by listing the columns in the database and showing what property is annotated in each of them.

    Column A: ID

    Adjectives are annotated with consecutive numbers, and this column contains a unique number assigned to each adjective.

    Column B: Adjective

    This column lists the citation form (lemma) of each adjective.

    Column C: Frequency

    This column provides the frequency of each individual adjective's lemma.

    Column D: Included

    This column distinguishes items we consider actual adjectives in the relevant sense from all other items. Words marked with 1 are included in the annotation, while items marked with 0 are excluded. The reasons for exclusion are:

    • the item not being an adjective,

    • the item being misspelled, and

    • the item being a proper name or a part of a proper name.

    Columns E–K: Suffix 1 to Suffix 7

    These columns list the specific suffixes contained in each adjective. Suffix 1 is the one closest to the root, followed by Suffix 2, and so on.

    The aim was to pursue maximal decomposition. Therefore, for instance, the possessive adjectival pronoun svoj ‘own’ was decomposed into s-v-oj, based on its relation to s-eb-e ‘oneself’ as well as to m-oj ‘my’ and t-v-oj ‘your’.

    See Appendix for the specific decisions regarding the annotation.

    Column L: Ending

    If the adjective has a phonologically overt inflectional ending (e.g., slovensk-i ‘Slovenian’), this ending is listed in this column.

    Column M: Prefixes

    If the adjective has a prefix, the prefix is listed in this column.

    Prefixes in loanwords are annotated in this column if the version without the prefix (or with some other prefix) also exists in Slovenian. E.g., iracionalen ‘irrational’ is annotated as having the prefix i- because racionalen ‘rational’ also exists. On the other hand, dis- in diskonten ‘discount’ is not given in this column, because *konten does not exist in Slovenian.

    If the adjective has several prefixes, these are listed in the column and are separated by a plus sign. The rightmost prefix is the one closest to the root/base.

    If the prefix is marked with an asterisk, this prefix modifies an existing adjective. In other cases, the prefix is a part of a non-adjectival base that got adjectivised. Compare:

    • predolg ‘too long’: prefix pre* (dolg ‘long’ is an adjective)

    • preminul ‘dead’: prefix marked as pre (the adjective is derived from the verb preminiti ‘to die’)

    • zavezniški ‘ally’: prefix marked as za (the adjective is derived from the noun zaveznik ‘ally’).

    Items that could be taken to be prefixes, but for which an unprefixed version of the base (or a version with a different prefix) is not attested, are given in brackets. For instance, zanikrn ‘sloppy’ has (za) in this column, since the annotator has the intuition that za is a prefix in this word, but *nikrn is not attested.

    If it was unclear whether the item in question was a single prefix or could be further decomposed, a potential decomposition is provided. One such example is izpodbijan ‘contentious’ where prefixes iz- and pod- also exist (as do prepositions iz, pod and izpod), which is why prefixes iz+pod were annotated.

    Column N: Non-derived adjective

    Adjectives that are taken to be non-derived (i.e., in cases where we have no arguments to assume they are morphologically complex) get a 1 in this column (if not, they are assigned a 0). For instance, bled ‘pale’ has a 1 in this column, whereas mandlj-ev ‘made out of almonds’ has a 0.

    Column O: Zero

    Adjectives that contain a base from a different category or a compound base, but do not include an overt adjectivising morpheme, are assigned a 1 in this column (if not, they are assigned a 0). An example is drag-o-cen-Ø lit. expensive-o-price ‘invaluable’.

    Column P: Compound base

    Adjectives that have a compound base get a 1 in this column (if not, they are assigned a 0).

    If an adjective is annotated with a 1, the right component of the compound is decomposed for suffixes only. For instance, drug-o-uvrščen ‘runner-up’ (literally second-o-classified) has the prefix u- in the right part, but this is not annotated separately.

    Loan adjectives are marked as having a compound base if the components of the base are used in other contexts in Slovenian. E.g., the base of radiološki ‘radiological’ is radiolog, which contains radio, used as an independent word meaning ‘radio’ and -log, also attested in, e.g., psiholog ‘psychologist’, arheolog ‘archeologist’.

    Finally, if an item is marked as a compound, it is not also marked as participial, even if it contains a deverbal participle. A case in point is drug-o-uvrščen ‘runner-up’, which contains the passive participle of the verb uvrstiti ‘classify’.

    Column R: PTCP

    If the adjective is a passive or active participle, it is assigned a 1 in this column. If not, it is assigned a 0.
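
    As an illustration only (the file name below is hypothetical, and the exact column headers after import are assumptions), the database could be loaded and filtered in R along these lines:

    # Sketch only: hypothetical .csv export of the database; column names are assumptions
    adj <- read.csv("slovenian_adjectives.csv", header = TRUE)

    # Keep only items annotated as actual adjectives (column D, Included == 1)
    adj_included <- subset(adj, Included == 1)

    # Tabulate non-derived adjectives (column N) and participles (column R)
    table(adj_included$Non.derived)
    table(adj_included$PTCP)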

    Appendix: Specific decisions for the annotation of suffixes in columns E–K

    The general criterion for annotating an element as a suffix was its occurrence in multiple adjectives and/or in combination with other suffixes. Crucially, this means that we also attempted to decompose elements sometimes considered a single suffix. For example, -kast in siv-kast ‘gray-ish’ was annotated as siv+k+ast, since both -k and -ast are independently attested suffixes (kič-ast ‘kitsch-y’, ljub-(e)k ‘cute’).

    Especially in the domain of borrowed words, in some cases, it was impossible to reconstruct the underlying representation of suffixes that only appear before palatalising suffixes. For instance, in sarkastičen ‘sarcastic’, the sequence -ič- can, in principle, be underlyingly -ik-, -ic-, or -ič-, as all these underlying representations could lead to the surface allomorph -ič-. In such cases, we opted for analogy with comparable words whose intermediate bases do surface as independent words. In this case, an analogy can be made with words like logističen ‘logistic’, with the base logistika ‘logistics’. As a consequence, sarkastičen was annotated as sarkast+ik+n.

    Some nominal bases display so-called stem extensions, which occur throughout the paradigm of the noun (e.g., vrem-e ‘weather’ has the genitive singular vrem-en-a, the dative singular vrem-en-u, etc.). Stem extensions like en were not annotated as derivational suffixes, so that, e.g., vrem-en-sk-i is annotated as having only the suffix sk.

    Similarly, many nouns ending in -r in the nominative singular get an extra -j in other forms in the paradigm. Because -j is present in the declension of the noun, it was not annotated as a suffix. E.g. krompir ‘potato’ has the genitive singular krompir-j-a. The related adjective krompir-j-ev ‘related to potato’ is

  3. Data from: Data and code from: Stem borer herbivory dependent on...

    • s.cnmilf.com
    • datasets.ai
    • +2 more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-and-code-from-stem-borer-herbivory-dependent-on-interactions-of-sugarcane-variety-ass-1e076
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843

    Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.

    Notebook files

    • 01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.

    • 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.

    HTML files

    These files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.

    • 01_boring_analysis.html
    • 02_trait_covariate_analysis.html

    CSV data files

    These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.

    BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage
    • Columns A-C are the year, date, and location. All location values are the same.
    • Column D identifies which experiment the data point was collected from.
    • Column E, Stubble, indicates the crop year (plant cane or first stubble).
    • Column F indicates the variety.
    • Column G indicates the plot (integer ID).
    • Column H indicates the stalk within each plot (integer ID).
    • Column I, # Internodes, indicates how many internodes were on the stalk.
    • Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
    • Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
    • Column AO contains notes.

    variety_lookup.csv: summary information for the 16 varieties analyzed in this study
    • Column A is the variety name.
    • Column B is the total number of stalks assessed for SCB damage for that variety across all years.
    • Column C is the number of years that variety is present in the data.
    • Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
    • Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
    • Column F is the literature reference for the SCB resistance value.

    Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study
    • Column A is the variety name.
    • Column B is the SCB resistance designation as an integer.
    • Column C is the categorical SCB resistance designation (see above).
    • Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
    • Columns J-O are the same continuous traits from year 2 (first stubble).
    • Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.

    ZIP file of intermediate R objects

    To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.

    intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output, including figures and tables, without having to refit the computationally intensive statistical models.
    • binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
    • binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
    • marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
    • marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
    • marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content

    Resources in this dataset:
    • Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
    • Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
    • Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
    • Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
    • Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
    • Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
    • Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
    • Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
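
    A minimal sketch of loading the distributed files in R, assuming the documented paths (CSVs under project/data/ and the unzipped intermediate objects under project/):

    # Raw data files (paths as documented above)
    bored  <- read.csv("project/data/BoredInternodes_26April2022_no format.csv")
    lookup <- read.csv("project/data/variety_lookup.csv")
    traits <- read.csv("project/data/Select_variety_traits_12Dec2022.csv")

    # Intermediate objects from intermediate_R_objects.zip, unzipped to project/
    main_fit  <- readRDS("project/binom_fit_intxns_updated_only5yrs.rds")
    trait_fit <- readRDS("project/binom_fit_reduced.rds")
    load("project/marginal_trends.RData")  # marginal trends by year and previous damage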

  4. Video game pricing analytics dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivi Deveshwar
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The review dataset for three video games (Call of Duty: Black Ops 3, Persona 5 Royal, and Counter-Strike: Global Offensive) was obtained through a web scrape of SteamDB [https://steamdb.info/], a large repository for game-related data such as release dates, reviews, prices, and more. In the initial scrape, each individual game has two files: customer reviews (count: 100 reviews) and price time series data.

    To obtain data on the reviews of the selected video games, we performed web scraping using R software. The customer reviews dataset contains the date that the review was posted and the review text, while the price dataset contains the date that the price was changed and the price on that date. To clean and prepare the data, we first sectioned the data in Excel. After scraping, our csv file fits each review in one row with the date, so we split the data, separating date and review into separate columns. The price scrape already separated price and date, so after the splitting we simply made sure that every file had consistent column names.

    Afterwards, we use R to finish the cleaning. Each game has a separate file for prices and reviews, so each set of prices is converted into a continuous time series by extending the previously available price to each date. Then the price dataset is combined with its respective review dataset in R on the common date column using a left join. The resulting dataset for each game contains four columns: game name, date, reviews, and price. From there, we allow the user to select the game they would like to view.
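
    A minimal sketch of that cleaning and join, assuming hypothetical file names and column names (date, review, price) for one game:

    library(dplyr)
    library(tidyr)

    # Hypothetical cleaned inputs for one game (file and column names are assumptions)
    reviews <- read.csv("game_reviews.csv")   # columns: date, review
    prices  <- read.csv("game_prices.csv")    # columns: date, price
    reviews$date <- as.Date(reviews$date)
    prices$date  <- as.Date(prices$date)

    # Carry the last available price forward so every date has a price
    prices_daily <- data.frame(date = seq(min(prices$date), max(reviews$date), by = "day")) %>%
      left_join(prices, by = "date") %>%
      fill(price)

    # Left join reviews onto the daily price series by date; add the game name column
    combined <- reviews %>%
      left_join(prices_daily, by = "date") %>%
      mutate(game = "Persona 5 Royal")  # example game name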

  5. LLM selection results by FS procedure in terms of R-values, L-values, and...

    • plos.figshare.com
    xls
    Updated Jun 7, 2023
    Cite
    Shang-Ming Zhou; Ronan A. Lyons; Sinead Brophy; Mike B. Gravenor (2023). LLM selection results by FS procedure in terms of R-values, L-values, and ω-values (Numeric values in the 2nd column represent rule IDs). [Dataset]. http://doi.org/10.1371/journal.pone.0051468.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Shang-Ming Zhou; Ronan A. Lyons; Sinead Brophy; Mike B. Gravenor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LLM selection results by FS procedure in terms of R-values, L-values, and ω-values (Numeric values in the 2nd column represent rule IDs).

  6. ukbtools: An R package to manage and query UK Biobank data

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Ken B. Hanscombe; Jonathan R. I. Coleman; Matthew Traylor; Cathryn M. Lewis (2023). ukbtools: An R package to manage and query UK Biobank data [Dataset]. http://doi.org/10.1371/journal.pone.0214311
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ken B. Hanscombe; Jonathan R. I. Coleman; Matthew Traylor; Cathryn M. Lewis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The UK Biobank (UKB) is a resource that includes detailed health-related data on about 500,000 individuals and is available to the research community. However, several obstacles limit immediate analysis of the data: data files vary in format, may be very large, and have numerical codes for column names.

    Results: ukbtools removes all the upfront data wrangling required to get a single dataset for statistical analysis. All associated data files are merged into a single dataset with descriptive column names. The package also provides tools to assist in quality control by exploring the primary demographics of subsets of participants; to query disease diagnoses for one or more individuals and estimate disease frequency relative to a reference variable; and to retrieve genetic metadata.

    Conclusion: Having a dataset with meaningful variable names, a set of UKB-specific exploratory data analysis tools, disease query functions, and a set of helper functions to explore and write genetic metadata to file will rapidly enable UKB users to undertake their research.
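
    A minimal sketch of the basic workflow, under the assumption that a UKB fileset (the prefix "ukb12345" below is hypothetical) has already been downloaded and decoded; ukb_df() is, as best recalled from the package documentation, its entry point for building a single data frame with descriptive column names, and should be checked against the package's own help pages:

    # Sketch only: 'ukb12345' is a hypothetical fileset prefix; path is a placeholder
    library(ukbtools)

    my_ukb_data <- ukb_df("ukb12345", path = "/path/to/ukb/fileset")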

  7. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • openicpsr.org
    Updated May 18, 2018
    + more versions
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2022 [Dataset]. http://doi.org/10.3886/E103500V10
    Explore at:
    Dataset updated
    May 18, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1991 - 2021
    Area covered
    United States
    Description

    !!!WARNING~~~ This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporters, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html), as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com.

    Version 10 release notes: Adds 2022 data.
    Version 9 release notes: Adds 2021 data.
    Version 8 release notes: Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last UCR hate crime data they release. Changes .rda file to .rds.
    Version 7 release notes: Changes release notes description, does not change data.
    Version 6 release notes: Adds 2018 data.
    Version 5 release notes: Adds data in the following formats: SPSS, SAS, and Excel. Changes project name to avoid confusing this data with the ones done by NACJD. Adds data for 1991. Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, causing there to be two columns and zero values for years with the wrong label. All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R.
    Version 4 release notes: Adds data for 2017. Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year); this is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time: different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian'), which I made consistent. Made the 'population' column, which is the total population in that agency.
    Version 3 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes: Fixes bug where Philadelphia Police Department had incorrect FIPS county code.

    The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open.

    Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating whether the victim of each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.).

    The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in a Stata format), making all character values lower case, and reordering columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
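
    As a small illustration of that structure (the file and column names below are assumptions, not the exact names in the download), the unique ID described above could be rebuilt from its component columns:

    # Sketch only: file and column names are assumptions
    hate <- readRDS("hate_crimes_1991_2022.rds")

    # unique_id is described as year + agency ORI9 + incident number combined
    rebuilt_id <- paste(hate$year, hate$ori9, hate$incident_number, sep = "_")
    head(rebuilt_id)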

  8. Summary deployment data for MOCNESS 1m2 and 10m2 tows from R/V Kilo Moana...

    • datacart.bco-dmo.org
    • bco-dmo.org
    csv
    Updated Jan 27, 2016
    Cite
    Jeffrey C. Drazen; Hilary G. Close; Cecelia Hannides; Brian N. Popp; Kanesa Seraphin (2016). Summary deployment data for MOCNESS 1m2 and 10m2 tows from R/V Kilo Moana KM1407, KM1418, KM1506 in the Central North Pacific, Station ALOHA from 2014-2015 (SuspendSinkPart project) [Dataset]. https://datacart.bco-dmo.org/dataset/636602
    Explore at:
    Available download formats: csv (83.61 KB)
    Dataset updated
    Jan 27, 2016
    Dataset provided by
    Biological and Chemical Data Management Office
    Authors
    Jeffrey C. Drazen; Hilary G. Close; Cecelia Hannides; Brian N. Popp; Kanesa Seraphin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 19, 2014 - May 11, 2015
    Area covered
    Variables measured
    lat, lon, net, sal, tow, temp, year, event, fluor, sigma, and 14 more
    Measurement technique
    MOCNESS10, MOCNESS1
    Description

    Summary data from 1m2 and 10m2 MOCNESS tows conducted off Hawaii. For the start and end of each net deployment: pressure, fluorometry, conductivity, salinity, temperature, potential temperature, and potential density. These data represent 33 tows from three cruises.

    DMO notes:
    Changed MOCNESS time column to yrday_local and used it to get hour/min.
    The first tow in each cruise has incorrect MOCNESS time when the tow crossed midnight. The hour and minute calculations are correct but for some reason the MOCNESS incremented a day when a net was opened. This is only true for the first tow in each cruise.
    Added year column to help with time conversions.
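
    For illustration, the hour/minute extraction mentioned above can be reproduced from a fractional local year-day as follows (a sketch assuming yrday_local is a decimal day of year):

    # Example: decimal day 50.75 corresponds to day 50 at 18:00 local time
    yrday_local <- 50.75
    frac   <- yrday_local - floor(yrday_local)   # fraction of the day elapsed
    hour   <- floor(frac * 24)
    minute <- round((frac * 24 - hour) * 60)
    c(hour = hour, minute = minute)              # 18 and 0 for this example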
