16 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre (1), Aurore Coince (2), Sophien Kamoun (1)

    (1) The Sainsbury Laboratory, Norwich, UK; (2) Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year in which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
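
The three steps above can be sketched in R roughly as follows. This is a minimal illustration and not the authors' script, which is only shown in the PowerPoint slide. The example values, the file names `example_data.csv` and `categorical_scatterplot.pdf`, and the fixed file path (used here instead of the dialog box so the sketch runs non-interactively) are assumptions. The column names and the plotting layers follow the protocol text and Note 2. The ggplot2 package must be installed.

```r
library(ggplot2)

# Step 1 (assumed example data): three columns named as in the protocol.
example <- data.frame(
  Replicate = rep(c("Jan2016", "Feb2016", "Mar2016"), each = 6),
  Condition = rep(c("WT", "MutantA", "MutantB"), times = 6),
  Value     = c(10.2, 5.1, 7.9, 11.0, 4.8, 8.3,
                 9.7, 5.6, 7.2, 10.5, 6.0, 8.8,
                10.9, 4.9, 7.5,  9.9, 5.3, 8.1)
)
csv_path <- file.path(tempdir(), "example_data.csv")  # hypothetical file name
write.csv(example, csv_path, row.names = FALSE)

# Step 2: import the .csv file and build the categorical scatterplot.
# The protocol selects the file in a dialog box; a fixed path is used here.
dat <- read.csv(csv_path)
graph <- ggplot(dat, aes(x = Condition, y = Value))
plot_obj <- graph +
  geom_boxplot(outlier.colour = "black", colour = "black") +  # boxplots, black outliers
  geom_jitter(aes(col = Replicate), width = 0.2) +            # dots colored by replicate
  theme_bw()

# Step 3: save the graph as a .pdf file (instead of File -> Save as).
pdf_path <- file.path(tempdir(), "categorical_scatterplot.pdf")  # hypothetical file name
ggsave(pdf_path, plot = plot_obj, width = 6, height = 4)
```

Saving with `ggsave()` rather than through the window menu is a design choice of this sketch; it makes the output size reproducible.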

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer, enter ‘ggplot2’ in the Package Search field, and click ‘Get List’. Select ‘ggplot2’ in the Package column and click ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    #7 Display the graph in a separate window. Dot colors indicate replicates

    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and it has also been used in undergraduate classrooms.
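
The workflow the module describes (import, manipulate, format, plot) can be sketched in base R as below. This is an illustrative sketch, not code from the module: the column names `date` and `tempMean`, the values, and the file names are all assumptions.

```r
# Hypothetical daily air temperature records written to a temporary CSV.
temps <- data.frame(
  date     = c("2015-01-01", "2015-01-02", "2015-02-01", "2015-02-02"),
  tempMean = c(-3.5, -1.5, 2.0, 4.0)
)
csv_path <- file.path(tempdir(), "temperature_example.csv")
write.csv(temps, csv_path, row.names = FALSE)

# Import: dates arrive as character strings.
temp_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Format: convert the character column to the Date class.
temp_data$date <- as.Date(temp_data$date, format = "%Y-%m-%d")

# Manipulate: average the daily values within each month.
temp_data$month <- format(temp_data$date, "%Y-%m")
monthly <- aggregate(tempMean ~ month, data = temp_data, FUN = mean)

# Plot: write the time series to a PDF so the sketch runs non-interactively.
plot_path <- file.path(tempdir(), "temperature_example.pdf")
pdf(plot_path)
plot(temp_data$date, temp_data$tempMean, type = "b",
     xlab = "Date", ylab = "Mean air temperature")
dev.off()
```

Converting to `Date` before plotting matters because R would otherwise treat the dates as unordered text.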

  3. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service, https://www.ars.usda.gov/
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created.

    These files are included:

    • Drying2022.csv: drying rate data for the 2022 experimental run

    • Weather2022.csv: weather data for the 2022 experimental run

    • Drying2023.csv: drying rate data for the 2023 experimental run

    • Weather2023.csv: weather data for the 2023 experimental run

    • disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code

    • disinfestant_drying_analysis.html: rendered output of notebook

    • MS_figures.R: additional R code to create figures formatted for journal requirements

    • fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster

    • fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023

    • data_dictionary.xlsx: descriptions of each column in the CSV data files

  4. titanic5 Dataset

    • paperswithcode.com
    Cite
    titanic5 Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/titanic5-dataset
    Description

    titanic5 Dataset. Created by David Beltran del Rio, March 2016.

    Notes: This is the final (for now) version of my update to the Titanic data. I think it’s finally ready for publishing if you’d like. What I did was to strip all the passenger and crew data from the Encyclopedia Titanica (ET) web pages (excluding channel crossing passengers), create a unique ID for each passenger and crew member (Name_ID), then (painstakingly and hopefully 100% correctly) match to your earlier titanic3 dataset, in order to compare the two and to get your sibsp and parch variables. Since the ET is updated occasionally, the work put into the ID and matching can be reused and refined later. I did eventually hear back from the ET people; they are willing to make the underlying database available in the future, but I have not yet taken them up on it.

    The two datasets line up nicely. Most of the differences in the newer titanic5 dataset are in the age variable, as I had mentioned before: the new set has fewer missing ages, 51 missing (vs 263) out of 1309.

    I am in the process of refining my analysis of the data as well, based on your comments below and your Regression Modeling Strategies example.

    titanic3_wID data can be matched to titanic5 using the Name_ID variable. Tab titanic5 Metadata has the variable descriptions and allowable values for Class and Class/Dept.
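
The matching on Name_ID described above can be sketched in base R as follows. The two miniature tables and their values are invented for illustration, and the `age` column in the titanic3 stand-in is an assumption; only the names Name_ID, sibsp, parch, and Age_F come from the description.

```r
# Hypothetical miniature stand-ins for the two tabs (values are made up).
titanic3_wID <- data.frame(
  Name_ID = c(101, 102, 103),
  sibsp   = c(0, 1, 0),
  parch   = c(0, 0, 2),
  age     = c(29, NA, NA)   # assumed column name for the older age variable
)
titanic5 <- data.frame(
  Name_ID = c(101, 102, 103),
  Age_F   = c(29, 35, NA)   # "final" age in the newer dataset
)

# Match the two datasets on the shared identifier, keeping every titanic5 row.
merged <- merge(titanic5, titanic3_wID, by = "Name_ID", all.x = TRUE)

# Compare how many ages are missing in each source.
missing_counts <- c(
  titanic3 = sum(is.na(merged$age)),
  titanic5 = sum(is.na(merged$Age_F))
)
print(missing_counts)
```

Using `all.x = TRUE` keeps titanic5 records that have no counterpart in titanic3, which is one reasonable choice when the newer table is the reference.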

    A note about the ages: instead of using the add 0.5 trick to indicate estimated birth day / date, I have a flag that indicates how the “final” age (Age_F) was arrived at. It’s the Age_F_Code variable; the allowable values are in the Titanic5_metadata tab in the attached Excel file. The reason for this is that I already had some fractional ages for infants where I had age in months instead of years, and I wanted to avoid confusion for 6-month-old infants, although I don’t think there are any in the data! Also, I was thinking of making fractional ages or age in days for all passengers for whom I have DoB, but I have not yet done so.

    Here’s what the tabs are:

    • Titanic5_all - all (mostly cleaned) Titanic passenger and crew records

    • Titanic5_work - working dataset, crew removed, unnecessary variables removed - this is the one I import into SAS / R to work on

    • Titanic5_metadata - Variable descriptions and allowable values

    • titanic3_wID - Original Titanic3 dataset with Name_ID added for merging to Titanic5

    I have a csv, R dataset, and SAS dataset, but the variable names are an older version, so I won’t send those along for now to avoid confusion.

    If it helps, send my contact info along to your student in case any questions arise. Gmail address probably best, on weekends for sure: davebdr@gmail.com

    The tabs in titanic5.xls are

    • Titanic5_all

    • Titanic5_passenger (the one to be used for analysis)

    • Titanic5_metadata (used during analysis file creation)

    • Titanic3_wID

  5. case study 1 bike share

    • kaggle.com
    Updated Oct 8, 2022
    Cite
    mohamed osama (2022). case study 1 bike share [Dataset]. https://www.kaggle.com/ososmm/case-study-1-bike-share/discussion
    Dataset updated
    Oct 8, 2022
    Dataset provided by
    Kaggle, http://kaggle.com/
    Authors
    mohamed osama
    Description

    Cyclistic: Google Data Analytics Capstone Project

    Cyclistic - Google Data Analytics Certification Capstone Project

    Moirangthem Arup Singh

    How Does a Bike-Share Navigate Speedy Success?

    Background: This project is for the Google Data Analytics Certification capstone project. I am wearing the hat of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    This project will be completed by using the 6 Data Analytics stages:

    • Ask: Identify the business task and determine the key stakeholders.

    • Prepare: Collect the data, identify how it’s organized, determine the credibility of the data.

    • Process: Select the tool for data cleaning, check for errors and document the cleaning process.

    • Analyze: Organize and format the data, aggregate the data so that it’s useful, perform calculations and identify trends and relationships.

    • Share: Use design thinking principles and a data-driven storytelling approach, present the findings with effective visualization. Ensure the analysis has answered the business task.

    • Act: Share the final conclusion and the recommendations.

    Ask:

    Business Task: Recommend marketing strategies aimed at converting casual riders into annual members by better understanding how annual members and casual riders use Cyclistic bikes differently.

    Stakeholders:

    • Lily Moreno: The director of marketing and my manager.

    • Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program.

    • Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.

    Prepare: For this project, I will use Cyclistic’s public historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under the license. I downloaded the ZIP files containing the csv files from the above link, but while uploading the files to Kaggle (as I am using a Kaggle notebook), I got a warning that the dataset is already available in Kaggle. So I will be using the cyclistic-bike-share dataset from Kaggle. The dataset has 13 csv files from April 2020 to April 2021. For the purpose of my analysis I will use the csv files from April 2020 to March 2021. The source csv files are in Kaggle, so I can rely on their integrity. I am using Microsoft Excel to get a glimpse of the data. There is one csv file for each month, with details of each ride: ride id, rideable type, start and end time, start and end station, and the latitude and longitude of the start and end stations.

    Process: I will use R in Kaggle to import the dataset and check how it’s organized, whether all the columns have appropriate data types, whether there are outliers, and whether any of the data have sampling bias. I will be using the R libraries below.

    # Load the tidyverse, lubridate, ggplot2 and plotrix libraries

    library(tidyverse)
    library(lubridate)
    library(ggplot2)
    library(plotrix)

    ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
    ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
    ✔ tibble  3.1.4     ✔ dplyr   1.0.7
    ✔ tidyr   1.1.3     ✔ stringr 1.4.0
    ✔ readr   2.0.1     ✔ forcats 0.5.1

    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()

    Attaching package: ‘lubridate’

    The following objects are masked from ‘package:base’:

        date, intersect, setdiff, union

    # Set the working directory

    setwd("/kaggle/input/cyclistic-bike-share")

    # Import the csv files

    r_202004 <- read.csv("202004-divvy-tripdata.csv")
    r_202005 <- read.csv("20...
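
As an alternative to one `read.csv()` call per month, the monthly files could be read in a loop and stacked into a single data frame. The sketch below is not part of the original notebook: it creates two tiny stand-in files in a temporary folder, and the second file name and the column names `ride_id` and `rideable_type` are assumptions for illustration.

```r
# Create two small stand-in files (contents are made up for illustration).
input_dir <- file.path(tempdir(), "tripdata_example")
dir.create(input_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), rideable_type = "docked_bike"),
          file.path(input_dir, "202004-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", rideable_type = "docked_bike"),
          file.path(input_dir, "202005-divvy-tripdata.csv"),  # assumed file name
          row.names = FALSE)

# List every monthly file that matches the naming pattern.
csv_files <- list.files(input_dir, pattern = "-divvy-tripdata\\.csv$",
                        full.names = TRUE)

# Read each file, then stack the monthly tables into one data frame.
monthly <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
all_trips <- do.call(rbind, monthly)

print(nrow(all_trips))
```

`rbind()` requires every file to have the same columns, so it is worth checking the column names of each monthly table before stacking the real data.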

  6. mcu-characters

    • kaggle.com
    Updated Oct 24, 2019
    Cite
    Muhammad Apriandito A.S. (2019). mcu-characters [Dataset]. https://www.kaggle.com/apriandito/mcu-characters/discussion
    Dataset updated
    Oct 24, 2019
    Dataset provided by
    Kaggle, http://kaggle.com/
    Authors
    Muhammad Apriandito A.S.
    Description

    MCU Characters

    This dataset comes from the Kamis Data program of R Indonesia; I converted it to a .csv file. Source: Kamis Data R-Indonesia

    Content

    Dataset about characters in the Marvel Cinematic Universe.

  7. Data and Code for "Does Organic Farming Jeopardize Food Security of Farm...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 30, 2024
    Cite
    Aïhounton, Ghislain Boris Dossou (2024). Data and Code for "Does Organic Farming Jeopardize Food Security of Farm Households in Benin?" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10899544
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Henningsen, Arne
    Aïhounton, Ghislain Boris Dossou
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Benin
    Description

    This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "Does Organic Farming Jeopardize Food Security of Farm Households in Benin?" authored by Ghislain B.D. Aïhounton and Arne Henningsen and published in the journal Food Policy (Volume 124, April 2024, 102622, DOI: 10.1016/j.foodpol.2024.102622).

    We conducted the empirical analysis with the "R" statistical software (version 4.3.3) using the add-on packages "AER" (version 1.2.12), "DescTools" (version 0.99.54), "lmtest" (version 0.9.40), "moments" (version 0.14.1), "sandwich" (version 3.1.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4) that are all available at CRAN.

    This replication package contains the following files:

    • README: This file.

    • R/dataBenin.csv: A CSV file that contains the (unprepared) data set. The variables in this file are described in file R/Variables.csv. This CSV file is imported by R script PrepareDataFoodNutrition.R.

    • R/Variables.csv: A CSV file that describes the variables in the (unprepared) data set (file R/dataBenin.csv).

    • R/PrepareData.R: An R script that imports the (unprepared) data set (file R/dataBenin.csv), calculates additional variables and adds these variables to the data set, removes observations that should not be used in the empirical analysis, and saves the prepared data set as CSV file (R/dataFoodNutrition.csv).

    • R/dataPrepared.csv: A CSV file that contains the (prepared) data set used in the empirical analysis. This CSV file is created by the R script R/PrepareDataFoodNutrition.R. It is imported by the R scripts R/DescriptiveTab.R, FoodNutritionImpact.R, and GridSearchFoodSecurity.R.

    • R/DescriptiveTab.R: An R script that imports the prepared data set (file R/dataFoodNutrition.R) and creates Table 1 of the paper ("Descriptive statistics", file paper/tables/DescriptiveStat.tex) as LaTeX file.

    • R/Estimations.R: An R script that imports the prepared data set (file R/dataFoodNutrition.R), conducts all the analyses presented in the paper, creates Tables 2 and 3 of the paper ("OLS and IV regression results of the conditional associations between organic farming and outcomes" and "OLS and IV regression results of the conditional associations between organic farming and mediating outcomes", LaTeX files paper/tables/estMainReg.tex and paper/tables/estMedReg.tex), creates Figures 1 and 2 of the paper ("Estimated conditional associations of organic farming with outcomes" and "Estimated conditional associations of organic farming with mediating outcomes", 12 PDF files paper/figures/*.pdf), and creates 45 Tables that are included in the Supplementary Information: 36 tables with detailed regression results (LaTeX files paper/tables/tabels/est*.tex), one table with results of the first-stage probit regression (LaTeX file paper/tables/tabels/estProbit.tex), 6 tables with detailed regression results of estimations for testing the exogeneity of the instrument as suggested by Di Falco et al. (2011) (LaTeX files paper/tables/tabels/estOLS*Falco.tex), and 2 tables with coefficient bounds obtained as suggested by Oster (2019) (LaTeX files paper/tables/tabels/Oster*.tex).

    • R/GridSearch.R: An R script that re-runs our regression analyses with different units of measurement of IHS-transformed variables, calculates various indicators that can be used to assess the appropriateness of different units of measurement as suggested by Aihounton and Henningsen (2021), and creates 28 Tables that are included in the Supplementary Information (LaTeX files paper/tables/tabels/grid*.tex).

    • R/functions/calcOsterBounds.R: An R script that defines the R function calcOsterBounds() that calculates coefficient bounds using the method suggested by Oster (2019). This function is used by the R script R/FoodNutritionImpact.R.

    • R/functions/calcSemiElaOrg.R: An R script that defines the R function calcSemiElaOrg() that calculates the semi-elasticity of various log-transformed or IHS-transformed variables with respect to the dummy variable for organic farming. This function is used by the R scripts R/FoodNutritionImpact.R and R/GridSearchFoodSecurity.R.

    • R/functions/createFormula.R: An R script that defines the R function createFormula() that creates the regression formulas for the various empirical analyses that are presented in the paper. This function is used by the R scripts R/FoodNutritionImpact.R and R/GridSearchFoodSecurity.R.

    • R/functions/functionsTables.R: An R script that defines various R functions that are used to create tables in LaTeX format. These functions are used by the R scripts R/FoodNutritionImpact.R and R/GridSearchFoodSecurity.R.

    • R/functions/predR2.R: An R script that defines the R function predR2() that calculates the predictive R-squared value. This R script has been obtained from the replication package of the article: Aïhounton, G. B. D. and Henningsen, A. (2021). Units of measurement and the inverse hyperbolic sine transformation. The Econometrics Journal, 24(2):334–351. https://doi.org/10.1093/ectj/utaa032 The function consists of a slightly modified version of the code that is available at: https://tomhopper.me/2014/05/16/can-we-do-better-than-r-squared/ This function is used by the R script R/GridSearchFoodSecurity.R.

    • paper/figures/*.pdf: 12 PDF files that are the (sub)figures in Figures 1 and 2 of the paper ("Estimated conditional associations of organic farming with outcomes" and "Estimated conditional associations of organic farming with mediating outcomes"). These 12 files are created by the R script R/FoodNutritionImpact.R.

    • paper/tables/DescriptiveStat.tex: A LaTeX file that creates Table 1 of the paper ("Descriptive statistics"). This file is created by the R script R/DescriptiveTab.R.

    • paper/tables/estMainReg.tex: A LaTeX file that creates Table 2 of the paper ("OLS and IV regression results of the conditional associations between organic farming and outcomes"). This file is created by the R script R/FoodNutritionImpact.R.

    • paper/tables/estMedReg.tex: A LaTeX file that creates Table 3 of the paper ("OLS and IV regression results of the conditional associations between organic farming and mediating outcomes"). This file is created by the R script R/FoodNutritionImpact.R.

    • paper/tables/tabels/est*.tex: 36 LaTeX files that create 36 tables that are included in the Supplementary Information and present detailed regression results. These 36 files are created by the R script R/FoodNutritionImpact.R.

    • paper/tables/tabels/estProbit.tex: A LaTeX file that creates a table that is included in the Supplementary Information and presents the results of the first-stage probit regression. This file is created by the R script R/FoodNutritionImpact.R.

    • paper/tables/tabels/estOLS*Falco.tex: 6 LaTeX files that create 6 tables that are included in the Supplementary Information and present detailed regression results for testing the exogeneity of the instrument as suggested by Di Falco et al. (2011). These 6 files are created by the R script R/FoodNutritionImpact.R.

    • paper/tables/tabels/Oster*.tex: 2 LaTeX files that create 2 tables that are included in the Supplementary Information and present coefficient bounds obtained as suggested by Oster (2019). These 2 files are created by the R script R/FoodNutritionImpact.R.

    • paper/tables/tabels/grid*.tex: 28 LaTeX files that create 28 tables that are included in the Supplementary Information and present various indicators for assessing the appropriateness of different units of measurement of IHS-transformed variables as suggested by Aihounton and Henningsen (2021). These 28 files are created by the R script R/GridSearchFoodSecurity.R.

  8. Data from: United States wildlife and wildlife product imports from...

    • agdatacommons.nal.usda.gov
    bin
    Updated May 6, 2025
    Cite
    Evan A. Eskew; Allison M. White; Naom Ross; Kristine M. Smith; Katherine F. Smith; Jon Paul Rodríguez; Carlos Zambrana-Torrelio; William B. Karesh; Peter Daszak (2025). Data from: United States wildlife and wildlife product imports from 2000–2014 [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Data_from_United_States_wildlife_and_wildlife_product_imports_from_2000_2014/24853503
    Available download formats: bin
    Dataset updated
    May 6, 2025
    Dataset provided by
    Scientific Data
    Authors
    Evan A. Eskew; Allison M. White; Naom Ross; Kristine M. Smith; Katherine F. Smith; Jon Paul Rodríguez; Carlos Zambrana-Torrelio; William B. Karesh; Peter Daszak
    License

    U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The global wildlife trade network is a massive system that has been shown to threaten biodiversity, introduce non-native species and pathogens, and cause chronic animal welfare concerns. Despite its scale and impact, comprehensive characterization of the global wildlife trade is hampered by data that are limited in their temporal or taxonomic scope and detail. To help fill this gap, we present data on 15 years of the importation of wildlife and their derived products into the United States (2000–2014), originally collected by the United States Fish and Wildlife Service. We curated and cleaned the data and added taxonomic information to improve data usability. These data include >2 million wildlife or wildlife product shipments, representing >60 biological classes and >3.2 billion live organisms. Further, the majority of species in the dataset are not currently reported on by CITES parties. These data will be broadly useful to both scientists and policymakers seeking to better understand the volume, sources, biological composition, and potential risks of the global wildlife trade.

    Resources in this dataset:

    Resource Title: United States LEMIS wildlife trade data curated by EcoHealth Alliance (Version 1.1.0) - Zenodo.

    File Name: Web Page, url: https://doi.org/10.5281/zenodo.3565869

    Over 5.5 million USFWS LEMIS wildlife or wildlife product records spanning 15 years and 28 data fields. These records were derived from >2 million unique shipments processed by USFWS during the time period and represent >3.2 billion live organisms. We provide the final cleaned data as a single comma-separated value file. Original raw data as provided by the USFWS are also available. Although relatively large (~1 gigabyte), the cleaned data file can be imported into a software environment of choice for data analysis. Alternatively, the associated R package provides access to a release of the same cleaned dataset but with a data download and manipulation framework that is designed to work well with this large dataset. Both the Zenodo data repository and the R package contain a metadata file describing each of the data fields as well as a lookup table to retrieve full values for the abbreviated codes used throughout the dataset.

    Contents:

    • lemis_2000_2014_cleaned.csv: This file represents the compiled, cleaned LEMIS data from 2000-2014. This data is identical to the version 1.1.0 dataset available through the lemis R package.

    • lemis_codes.csv: Full values for all coded values used in the LEMIS data. Identical to the output from the lemis R package function "lemis_codes()".

    • lemis_metadata.csv: Data fields and field descriptions for all variables in the LEMIS data. Identical to the output from the lemis R package function "lemis_metadata()".

    • raw_data.zip: This archive contains all of the raw LEMIS data files that are processed and cleaned with the code contained in the 'data-raw' subdirectory of the lemis R package repository.

    Resource Software Recommended: R package, url: https://github.com/ecohealthalliance/lemis
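
The lookup table mentioned above can be used to expand abbreviated codes into full values. The base R sketch below shows the general idea on invented data: the column names (`shipment_id`, `description`, `code`, `value`) and the code values are hypothetical and are not taken from lemis_codes.csv, whose actual layout should be checked in the metadata file.

```r
# Hypothetical shipment records that store an abbreviated code.
shipments <- data.frame(
  shipment_id = 1:3,
  description = c("LIV", "SKI", "LIV"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table mapping each code to its full value.
codes <- data.frame(
  code  = c("LIV", "SKI"),
  value = c("Live specimens", "Skins"),
  stringsAsFactors = FALSE
)

# match() finds the lookup row for each shipment code; indexing the
# lookup values with those positions keeps the original row order.
shipments$description_full <- codes$value[match(shipments$description, codes$code)]

print(shipments)
```

Using `match()` instead of `merge()` avoids reordering rows, which can matter when the decoded column is added back to a very large table.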

  9. Data and Code for "Climate impacts and adaptation in US dairy systems...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 22, 2021
    Cite
    Maria Gisbert-Queral; Arne Henningsen; Bo Markussen; Meredith T. Niles; Ermias Kebreab; Angela J. Rigden; Nathaniel D. Mueller (2021). Data and Code for "Climate impacts and adaptation in US dairy systems 1981-2018" [Dataset]. http://doi.org/10.5281/zenodo.4818011
    Available download formats: zip
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Maria Gisbert-Queral; Maria Gisbert-Queral; Arne Henningsen; Arne Henningsen; Bo Markussen; Bo Markussen; Meredith T. Niles; Ermias Kebreab; Ermias Kebreab; Angela J. Rigden; Angela J. Rigden; Nathaniel D. Mueller; Nathaniel D. Mueller; Meredith T. Niles
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This data and code archive provides all the files that are necessary to replicate the empirical analyses that are presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018" authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours. Running some parts in parallel can speed up the computations but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.

    This data and code archive contains the following files and folders:

    * README
    Description: text file with this description

    * flowchart.pdf
    Description: a PDF file with a flow chart that illustrates how R scripts transform the raw data files to files that contain generated data sets and intermediate results and, finally, to the tables and figures that are presented in the paper.

    * runAll.sh
    Description: a (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell such as most computers with MacOS, GNU/Linux, or Unix operating systems)

    * Folder "DataRaw"
    Description: folder for raw data files
    This folder contains the following files:

    - DataRaw/COWS.xlsx
    Description: MS-Excel file with the number of cows per county
    Source: USDA NASS Quickstats
    Observations: All available counties and years from 2002 to 2012

    - DataRaw/milk_state.xlsx
    Description: MS-Excel file with average monthly milk yields per cow
    Source: USDA NASS Quickstats
    Observations: All available states from 1981 to 2018

    - DataRaw/TMAX.csv
    Description: CSV file with daily maximum temperatures
    Source: PRISM Climate Group (spatially averaged)
    Observations: All counties from 1981 to 2018

    - DataRaw/VPD.csv
    Description: CSV file with daily maximum vapor pressure deficits
    Source: PRISM Climate Group (spatially averaged)
    Observations: All counties from 1981 to 2018

    - DataRaw/countynamesandID.csv
    Description: CSV file with county names, state FIPS codes, and county FIPS codes
    Source: US Census Bureau
    Observations: All counties

    - DataRaw/statecentroids.csv
    Description: CSV file with latitudes and longitudes of state centroids
    Source: Generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" function
    Observations: All states

    * Folder "DataGenerated"
    Description: folder for data sets that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).

    * Folder "Results"
    Description: folder for intermediate results that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).

    * Folder "Figures"
    Description: folder for the figures that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate our analysis can more easily compare the figures that they get with the figures that are presented in our paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.

    * Folder "Tables"
    Description: folder for the tables that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate our analysis can more easily compare the tables that they get with the tables that are presented in our paper.

    * Folder "logFiles"
    Description: the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate our analysis can more easily compare the R output that they get with the R output that we got.

    * PrepareCowsData.R
    Description: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses

    * PrepareWeatherData.R
    Description: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses

    * PrepareMilkData.R
    Description: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses

    * CalcFrequenciesTHI_Temp.R
    Description: R script that calculates the frequencies of days with the different THI bins and the different temperature bins in each month for each state

    * CalcAvgTHI.R
    Description: R script that calculates the average THI in each state

    * PreparePanelTHI.R
    Description: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins

    * PreparePanelTemp.R
    Description: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins

    * PreparePanelFinal.R
    Description: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in our statistical analyses

    * EstimateTrendsTHI.R
    Description: R script that estimates the trends of the frequencies of the different THI bins within our sampling period for each state in our data set

    * EstimateModels.R
    Description: R script that estimates all model specifications that are used for generating results that are presented in the paper or for comparing or testing different model specifications

    * CalcCoefStateYear.R
    Description: R script that calculates the effects of each THI bin on the milk yield for all combinations of states and years based on our 'final' model specification

    * SearchWeightMonths.R
    Description: R script that estimates our 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term

    * TestModelSpec.R
    Description: R script that applies Wald tests and Likelihood-Ratio tests to compare different model specifications and creates Table S10

    * CreateFigure1a.R
    Description: R script that creates subfigure a of Figure 1

    * CreateFigure1b.R
    Description: R script that creates subfigure b of Figure 1

    * CreateFigure2a.R
    Description: R script that creates subfigure a of Figure 2

    * CreateFigure2b.R
    Description: R script that creates subfigure b of Figure 2

    * CreateFigure2c.R
    Description: R script that creates subfigure c of Figure 2

    * CreateFigure3.R
    Description: R script that creates the subfigures of Figure 3

    * CreateFigure4.R
    Description: R script that creates the subfigures of Figure 4

    * CreateFigure5_TableS6.R
    Description: R script that creates the subfigures of Figure 5 and Table S6

    * CreateFigureS1.R
    Description: R script that creates Figure S1

    * CreateFigureS2.R
    Description: R script that creates Figure S2

    * CreateTableS2_S3_S7.R
    Description: R script that creates Tables S2, S3, and S7

    * CreateTableS4_S5.R
    Description: R script that creates Tables S4 and S5

    * CreateTableS8.R
    Description: R script that creates Table S8

    * CreateTableS9.R
    Description: R script that creates Table S9
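The archive description says that runAll.sh runs the R scripts sequentially and writes the output of each script into the "logFiles" folder. The following is a minimal Python sketch of that pattern, not the authors' shell script: the function name `run_in_order`, the `<script>.log` naming, and the stop-at-first-failure behaviour are assumptions made for illustration, and the correct script order must be taken from runAll.sh itself.

```python
import subprocess
from pathlib import Path


def run_in_order(scripts, log_dir, runner=("Rscript",)):
    """Run each script one at a time and log its combined output.

    Hypothetical helper: stops at the first script that exits with a
    non-zero status and returns the scripts that completed successfully.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    completed = []
    for script in scripts:
        # Assumed naming scheme: one log file per script.
        log_path = log_dir / (Path(script).name + ".log")
        with open(log_path, "w") as log:
            result = subprocess.run(
                [*runner, str(script)],
                stdout=log,
                stderr=subprocess.STDOUT,
            )
        if result.returncode != 0:
            break
        completed.append(script)
    return completed


# Example call (requires R to be installed; order shown is illustrative only):
# run_in_order(["PrepareCowsData.R", "PrepareWeatherData.R"], "logFiles")
```

Running the scripts one at a time, rather than in parallel, matches the description's warning that memory-demanding computations may terminate when executed simultaneously.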

  10. journal-entries-emotion-detection-vad

    • huggingface.co
    Updated Jul 8, 2025
    Maya Markus-Malone (2025). journal-entries-emotion-detection-vad [Dataset]. https://huggingface.co/datasets/mmarkusmalone/journal-entries-emotion-detection-vad
    Explore at:
    Dataset updated
    Jul 8, 2025
    Authors
    Maya Markus-Malone
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Reddit Diary of a Redditor VAD Dataset

    Dataset Creation Process

    Scraping Reddit Posts

    Posts were scraped from the r/diaryofaredditor subreddit using the Reddit API. The script used for scraping is shown below:

    import requests
    import csv
    import time

    access_token = ""
    headers = {
      "Authorization": f"bearer {access_token}",
      "User-Agent": "ChangeMeClient/0.1"
    }

    url = "https://oauth.reddit.com/r/diaryofaredditor/new"
    params = {"limit": 100}
    after = None

    csv_path =…

    See the full description on the dataset page: https://huggingface.co/datasets/mmarkusmalone/journal-entries-emotion-detection-vad.

  11. Replication Data for: Lameness during the dry period: epidemiology and...

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Daros, Ruan R.; Eriksson, Hanna K.; Weary, Daniel M.; von Keyserlingk, Marina A. G. (2023). Replication Data for: Lameness during the dry period: epidemiology and associated factors [Dataset]. http://doi.org/10.5683/SP2/YTZMKX
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Daros, Ruan R.; Eriksson, Hanna K.; Weary, Daniel M.; von Keyserlingk, Marina A. G.
    Description

    Original data, R script (code), and code output for the paper published in the Journal of Dairy Science. For best use, replicate the analysis using R. Importing the data from the .csv file may cause some variables (columns of the spreadsheet) to be imported with the wrong format. If you encounter any issues, do not hesitate to get in contact. Happy coding!

  12. Students Test Data

    • kaggle.com
    Updated Sep 12, 2023
    ATHARV BHARASKAR (2023). Students Test Data [Dataset]. https://www.kaggle.com/datasets/atharvbharaskar/students-test-data
    Explore at:
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ATHARV BHARASKAR
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.

    Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.

    Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.

    Data Format: The dataset is available in a structured format, typically as a CSV file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details.

    Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.

    Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.

    Data Format: The exam data is typically provided in a structured format, commonly as a CSV (Comma-Separated Values) file. Each row in the dataset represents a unique student's examination performance, and each column contains specific attributes and scores related to the examination. The CSV format allows for easy import and analysis using various data analysis tools and programming languages like Python, R, or spreadsheet software like Microsoft Excel.

    Here's a column-wise description of the dataset:

    Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)

    UNIVERSITY: The university where the student is enrolled.

    PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).

    Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.

    Semester: The semester or academic term in which the student took the exam.

    Domain: Indicates whether the exam was divided into two parts: general management and domain-specific.

    GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.

    Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.

    TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
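The column description implies a simple consistency rule: each part score lies between 0 and 50, and the total is the sum of the two parts. The sketch below checks that rule with Python's standard csv module. It assumes the CSV header uses exactly the column names listed above, and the sample rows are invented for illustration; they are not taken from the dataset.

```python
import csv
import io

# Column names as given in the description (assumed to match the CSV header).
GM = "GENERAL MANAGEMENT SCORE (OUT of 50)"
DS = "Domain-Specific Score (Out of 50)"
TOTAL = "TOTAL SCORE (OUT of 100)"


def inconsistent_rows(csv_text):
    """Return 1-based data-row numbers that violate the scoring rule."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for n, row in enumerate(reader, start=1):
        gm, ds, total = float(row[GM]), float(row[DS]), float(row[TOTAL])
        in_range = 0 <= gm <= 50 and 0 <= ds <= 50
        if not in_range or abs(gm + ds - total) > 1e-9:
            bad.append(n)
    return bad


# Invented sample rows, for illustration only.
sample = (
    f"{GM},{DS},{TOTAL}\n"
    "40,35,75\n"  # consistent
    "50,45,90\n"  # total does not equal the sum of the parts
    "55,20,75\n"  # general management score above 50
)
print(inconsistent_rows(sample))
```

A check like this is a cheap first step before any statistical analysis, since it flags rows where the documented relationship between the columns does not hold.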

  13. Dataset for Repeated double cross validation applied to the PCA-LDA...

    • data.niaid.nih.gov
    Updated Dec 2, 2020
    Bonifacio, Alois (2020). Dataset for Repeated double cross validation applied to the PCA-LDA classification of SERS spectra: a case study with serum samples from hepatocellular carcinoma patients [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4277796
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    Sergo, Valter
    Bonifacio, Alois
    Mitri, Elisa
    Crocè, Lory Saveria
    Di Silvestre, Alessia
    Pascut, Devis
    Gurian, Elisa
    Giuffrè, Mauro
    Tiribelli, Claudio
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all the spectra used in the paper "Repeated double cross validation applied to the PCA-LDA classification of SERS spectra: a case study with serum samples from hepatocellular carcinoma patients", plus the R code to import the TXT (ASCII) files into a dataset, preprocess data, set-up and cross validate the PCA-LDA model and generate the figures shown in the paper.

    Data are available in 2 different formats:

    • 1 compressed archive ("dataset.zip") containing all the 144 TXT files (1 file = 1 spectrum)

    • 1 single CSV file (“dataset.csv”) with all 144 spectra in the form of a table. The data are structured as follows: each row is 1 spectrum, preceded by the metadata fields "acquisition_date", "substrate_batch", "class", and "sample_code".

    The code for R is available as a single file "Rcode.R".
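The table layout described above (four metadata fields followed by one spectrum per row) can be read as sketched below. This is an illustrative Python sketch, not the authors' Rcode.R: it assumes the file has a header row whose remaining cells label the spectral axis, and the sample values and class labels are invented.

```python
import csv
import io

# Metadata fields named in the dataset description.
META = ["acquisition_date", "substrate_batch", "class", "sample_code"]


def read_spectra(csv_text):
    """Split each row into a metadata dict and a list of intensities.

    Assumes a header row; the header cells after the metadata columns
    are returned as the spectral axis labels.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    axis = header[len(META):]
    spectra = []
    for row in reader:
        meta = dict(zip(META, row[:len(META)]))
        intensities = [float(v) for v in row[len(META):]]
        spectra.append((meta, intensities))
    return axis, spectra


# Invented two-spectrum sample, for illustration only.
sample = (
    "acquisition_date,substrate_batch,class,sample_code,400,401\n"
    "2019-01-01,B1,HCC,S01,0.5,0.7\n"
    "2019-01-02,B1,CTR,S02,0.1,0.2\n"
)
axis, spectra = read_spectra(sample)
```

Keeping the metadata separate from the intensities makes it straightforward to group spectra by "class" before fitting a classifier.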

  14. MovieLens ratings

    • kaggle.com
    zip
    Updated Apr 17, 2024
    Khoa Hoàngg (2024). MovieLens ratings [Dataset]. https://www.kaggle.com/datasets/khoahongg/movielens-ratings/code
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 17, 2024
    Authors
    Khoa Hoàngg
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Each folder contains the following files:

    - train.csv: A CSV file that contains the training data.
    - test.csv: A CSV file that contains the testing data.
    - movie_to_index.pkl: A Python pickle file that contains a dictionary. The dictionary maps a movie_id to its corresponding index in the similarity matrix.
    - user_to_index.pkl: A Python pickle file that contains a dictionary. The dictionary maps a user_id to an index.
    - rating_matrix.npy: A npy file that contains the rating matrix in the training data: \( \text{rating\_matrix}[u, i] = r_{\text{user\_at\_index\_u},\ \text{movie\_at\_index\_i}} \)
    - similarity_matrix.npy: A npy file that contains a precomputed similarity matrix between movies in the training data: \( \text{similarity\_matrix}[i, j] = \text{purecosine}(R_{\text{movie\_at\_index\_i}}, R_{\text{movie\_at\_index\_j}}) \)
    - qtus.pkl: A Python pickle file that contains a dictionary.
      + Keys: Pairs of user_index, movie_index (u, t).
      + Values: Indexes of movies rated by u, sorted by similarity in descending order. For neighborhood_size = k: \( \text{qtus}[(u,t)][:k] = Q_t(u) \)

    Loading the Similarity Matrix

    import numpy as np
    
    # Load the similarity matrix
    similarity_matrix = np.load('path_to_your_folder/similarity_matrix.npy')
    

    Loading the Dictionaries

    import pickle
    
    # Load the movie_to_index dictionary
    with open('path_to_your_folder/movie_to_index.pkl', 'rb') as f:
      movie_to_index = pickle.load(f)
    
    # Load the user_to_index dictionary
    with open('path_to_your_folder/user_to_index.pkl', 'rb') as f:
      user_to_index = pickle.load(f)
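Once loaded, these structures can be combined to estimate a rating from the k nearest rated neighbours. The sketch below uses a similarity-weighted average, a common item-based formula that the dataset description does not confirm as the author's method. The helper names and the tiny in-memory stand-ins are hypothetical; `[i][j]` indexing is used so the same functions work on NumPy arrays and nested lists.

```python
def neighbors(qtus, u, t, k):
    """Return Q_t(u): the k movies rated by user index u most similar to movie index t."""
    return list(qtus[(u, t)][:k])


def predict_rating(rating_matrix, similarity_matrix, qtus, u, t, k):
    """Similarity-weighted average of u's ratings over the k neighbours of t.

    Assumed formula for illustration; returns None when the weights sum to zero.
    """
    nbrs = neighbors(qtus, u, t, k)
    num = sum(similarity_matrix[t][j] * rating_matrix[u][j] for j in nbrs)
    den = sum(abs(similarity_matrix[t][j]) for j in nbrs)
    return num / den if den else None


# Tiny hypothetical stand-ins for rating_matrix.npy, similarity_matrix.npy, qtus.pkl.
toy_ratings = [[5.0, 3.0, 0.0]]          # one user, three movies
toy_similarity = [
    [1.0, 0.5, 0.8],
    [0.5, 1.0, 0.2],
    [0.8, 0.2, 1.0],
]
toy_qtus = {(0, 2): [0, 1]}              # movies rated by user 0, most similar to movie 2 first

estimate = predict_rating(toy_ratings, toy_similarity, toy_qtus, u=0, t=2, k=2)
```

Because qtus already stores neighbours sorted by similarity, changing the neighbourhood size only changes the slice length and requires no re-sorting.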
    
  15. Data files for: Huston, D.C. et al. 2021. Stable isotope signatures of an...

    • zenodo.org
    bin, csv
    Updated Sep 20, 2021
    Daniel Colgan Huston (2021). Data files for: Huston, D.C. et al. 2021. Stable isotope signatures of an acanthocephalan and trematode from the herbivorous marine fish Kyphosus bigibbus (Perciformes: Kyphosidae). Journal of Parasitology. 107: 726–730 [Dataset]. http://doi.org/10.5281/zenodo.4886698
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Sep 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Colgan Huston
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data files for the paper: Huston, D.C. et al. 2021. Stable isotope signatures of an acanthocephalan and trematode from the herbivorous marine fish Kyphosus bigibbus (Perciformes: Kyphosidae). Journal of Parasitology. 107(5): 726–730

    Includes raw data, .csv files for importing the data into R, the R script file, and the Excel spreadsheet file used to create Figure 1.

  16. Replication Package for ML-EUP Conversational Agent Study

    • zenodo.org
    pdf
    Updated Jul 12, 2024
    Anonymous (2024). Replication Package for ML-EUP Conversational Agent Study [Dataset]. http://doi.org/10.5281/zenodo.7780223
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package Files

    • 1. Forms.zip: contains the forms used to collect data for the experiment

    • 2. Experiments.zip: contains the participants’ and sandboxers’ experimental task workflow with Newton.

    • 3. Responses.zip: contains the responses collected from participants during the experiments.

    • 4. Analysis.zip: contains the data analysis scripts and results of the experiments.

    • 5. newton.zip: contains the tool we used for the WoZ experiment.

    • TutorialStudy.pdf: script used in the experiment with and without Newton to be consistent with all participants.

    • Woz_Script.pdf: script wizard used to maintain consistent Newton responses among the participants.

    1. Forms.zip

    The forms zip contains the following files:

    • Demographics.pdf: a PDF form used to collect demographic information from participants before the experiments

    • Post-Task Control (without the tool).pdf: a PDF form used to collect data from participants about challenges and interactions when performing the task without Newton

    • Post-Task Newton (with the tool).pdf: a PDF form used to collect data from participants after the task with Newton.

    • Post-Study Questionnaire.pdf: a PDF form used to collect data from the participant after the experiment.

    2. Experiments.zip

    The experiments zip contains two types of folders:

    • exp[participant’s number]-c[number of dataset used for control task]e[number of dataset used for experimental task]. Example: exp1-c2e1 (experiment participant 1 - control used dataset 2, experimental used dataset 1)

    • sandboxing[sandboxer’s number]. Example: sandboxing1 (experiment with sandboxer 1)
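The folder naming scheme above can be decoded mechanically. The following Python sketch is one possible parser, not part of the replication package: the function name `parse_folder` and the returned keys are hypothetical. It also accepts the reversed order and trailing suffix seen in the P6 folder name (exp6-e2c1-serverdied) mentioned in Note 1.

```python
import re

# exp<participant>-c<control dataset>e<experimental dataset>, or the reversed
# e<...>c<...> order, optionally followed by "-<suffix>".
EXPERIMENT = re.compile(
    r"^exp(?P<participant>\d+)-"
    r"(?:c(?P<c1>\d+)e(?P<e1>\d+)|e(?P<e2>\d+)c(?P<c2>\d+))"
    r"(?:-(?P<suffix>.+))?$"
)
SANDBOX = re.compile(r"^sandboxing(?P<sandboxer>\d+)$")


def parse_folder(name):
    """Decode an experiment or sandboxing folder name; return None if unrecognized."""
    m = EXPERIMENT.match(name)
    if m:
        return {
            "kind": "experiment",
            "participant": int(m.group("participant")),
            "control_dataset": int(m.group("c1") or m.group("c2")),
            "experimental_dataset": int(m.group("e1") or m.group("e2")),
            "suffix": m.group("suffix"),
        }
    m = SANDBOX.match(name)
    if m:
        return {"kind": "sandboxing", "sandboxer": int(m.group("sandboxer"))}
    return None


print(parse_folder("exp1-c2e1"))
print(parse_folder("sandboxing1"))
```

Returning None for unrecognized names lets a caller skip unrelated files when walking the unzipped archive.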

    Every experiment subfolder contains:

    • warmup.json: a JSON file with the results of Newton-Participant interactions in the chat for the warmup task.

    • warmup.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the warmup task.

    • sample1.csv: Death Event dataset.

    • sample2.csv: Heart Disease dataset.

    • tool.ipynb: a Jupyter notebook file with the participant’s results from the code provided by Newton in the experimental task.

    • python.ipynb: a Jupyter notebook file with the participant’s results from the code they tried during the control task.

    • results.json: a JSON file with the results of Newton-Participant interactions in the chat for the task with Newton.

    To load an experiment chat log into Newton, add the following code to the notebook:

    import anachat
    import json
    
    with open("results.json", "r") as f:
      anachat.comm.COMM.history = json.load(f)
    

    Then, click on the notebook name inside Newton chat

    Note 1: the subfolder for P6 is exp6-e2c1-serverdied because the experiment server died before we were able to save the logs. We reconstructed them using the notebook newton_remake.ipynb based on the video recording.

    Note 2: The sandboxing occurred during the development of Newton. We did not collect all the files, and the format of JSON files is different than the one supported by the attached version of Newton.

    3. Responses.zip

    The responses zip contains the following files:

    • demographics.csv: a CSV file containing the responses collected from participants using the demographics form

    • task_newton.csv: a CSV file containing the responses collected from participants using the post-task newton form.

    • task_control.csv: a CSV file containing the responses collected from participants using the post-task control form.

    • post_study.csv: a CSV file containing the responses collected from participants using the post-study questionnaire form.

    4. Analysis.zip

    The analysis zip contains the following files:

    • 1.Challenge.ipynb: a Jupyter notebook file where the perceptions of challenges figure was created.

    • 2.Interactions.py: a Python file where the participants’ JSON files were created.

    • 3.Interactions.Graph.ipynb: a Jupyter notebook file where the participant’s interaction figure was created.

    • 4.Interactions.Count.ipynb: a Jupyter notebook file that counts participants’ interaction with each figure.

    • config_interactions.py: this file contains the definitions of interaction colors and grouping

    • interactions.json: a JSON file with the interactions during the Newton task of each participant based on the categorization.

    • requirements.txt: dependencies required to run the code to generate the graphs and json analysis.

    To run the analyses, install the dependencies on Python 3.10 with the following command, then execute the scripts and notebooks in order:

    pip install -r requirements.txt

    5. newton.zip

    The newton zip contains the source code of the Jupyter Lab extension we used in the experiments. Read the README.md file inside it for instructions on how to install and run it.

Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1

Petre_Slide_CategoricalScatterplotFigShare.pptx

Explore at:
Available download formats: pptx
Dataset updated
Sep 19, 2016
Dataset provided by
figshare
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Resize the window as desired and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
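Before running the R script, it can help to confirm that the .csv file from step 1 has the expected three columns. The sketch below is a Python pre-check, separate from the authors' R protocol: the function name `values_by_condition` is hypothetical and the sample rows are invented. It groups the ‘Value’ column by ‘Condition’ and reports each group's median, the centre line a boxplot would draw.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Column names from step 1 of the protocol.
REQUIRED = ["Replicate", "Condition", "Value"]


def values_by_condition(csv_text):
    """Check the three-column layout and group numeric values by condition."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing column(s): {missing}")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Condition"]].append(float(row["Value"]))
    return dict(groups)


# Invented sample data, for illustration only.
sample = """Replicate,Condition,Value
Jan2016,WT,1.0
Jan2016,MutantA,2.5
Feb2016,WT,3.0
Feb2016,MutantA,3.5
"""
medians = {cond: median(vals) for cond, vals in values_by_condition(sample).items()}
print(medians)
```

A non-numeric entry in ‘Value’ raises a ValueError at the float conversion, which surfaces formatting problems before they reach the plot.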

Notes

• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

#7 Display the graph in a separate window. Dot colors indicate replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
