17 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
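
    The script itself is provided in the Powerpoint slide and is not reproduced here; as an illustration only, a minimal sketch consistent with the description above (three columns Replicate/Condition/Value; input .csv chosen via a dialog box) could look like this:

    # Sketch only; the authoritative script is in the Powerpoint slide
    library(ggplot2)   # see Note 1

    # Import the .csv file from step 1 via a dialog box
    data <- read.csv(file.choose())

    # Categorical scatterplot with superimposed boxplots
    graph <- ggplot(data, aes(x = Condition, y = Value))
    graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) + theme_bw()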

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Plotly Dashboard Healthcare

    • kaggle.com
    zip
    Updated Jan 4, 2022
    Cite
    A SURESH (2022). Plotly Dashboard Healthcare [Dataset]. https://www.kaggle.com/datasets/sureshmecad/plotly-dashboard-healthcare
    Explore at:
    zip (1741234 bytes)
    Dataset updated
    Jan 4, 2022
    Authors
    A SURESH
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data Visualization

    Content

    a. Scatter plot

      i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between two variables (expression/copy_number/chronos) for any pair of genes.

      ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.

      iii. The visualization could be interactive - it would be great if the user could hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).

      iv. Here is a quick reference for you: the scatter plot is between the chronos score for the TTBK2 gene and the expression for the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.

    b. Boxplot/violin plot

      i. The user should be able to select a gene and a variable (expression/chronos/copy_number) and generate a boxplot to display its distribution across multiple categories as defined by a user-selected variable (a column from the metadata file).

      ii. Here is an example for your reference, where a violin plot of the CHRONOS score for gene CCL22 is plotted and grouped by ‘Lineage’.
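
    A minimal sketch of such an interactive scatter plot, using the R flavour of plotly; the file and column names below are hypothetical and assume the gene-level data has already been joined with “metadata.csv”:

    library(plotly)

    df <- read.csv("merged_gene_data.csv")   # hypothetical pre-joined file

    plot_ly(
      df,
      x = ~TTBK2_chronos,      # chronos score for TTBK2 (hypothetical column)
      y = ~MORC2_expression,   # expression for MORC2 (hypothetical column)
      color = ~Sex,            # coloring from metadata.csv
      text = ~cell_line,       # shown on hover (hypothetical column)
      type = "scatter",
      mode = "markers"
    )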
    

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  3. MurdochBushCourtLoC: Wifi-Based Localisation Datasets for No-GPS Open Areas...

    • data.mendeley.com
    Updated Feb 21, 2020
    + more versions
    Cite
    Mohamed A. nassar (2020). MurdochBushCourtLoC: Wifi-Based Localisation Datasets for No-GPS Open Areas Using Smart Bins [Dataset]. http://doi.org/10.17632/rdhfvhyg5p.2
    Explore at:
    Dataset updated
    Feb 21, 2020
    Authors
    Mohamed A. nassar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The first WiFi-based localisation/positioning datasets for no-GPS open areas using smart bins. There are two directories:

    datasets: contains two main types of datasets.

    1. Fingerprint dataset: fingerprint.csv contains fingerprints generated by four users with their mobile devices.

    2. APs datasets: APs.csv is a huge dataset containing auto-generated RSS readings reported by APs; APs_users_date_time_label.csv contains labelled AP data, and APs_four_users_label.csv contains the labels for the four users only.

    scripts: contains all Jupyter notebooks used to create the datasets and to provide statistical analyses (normalisation, t-test, etc.) and visualisations (histograms, box plots, etc.).

  4. Pneumonia Imbalance Chest X-Ray Dataset

    • kaggle.com
    zip
    Updated Dec 17, 2023
    Cite
    Ashvath S.P (2023). Pneumonia Imbalance Chest X-Ray Dataset [Dataset]. https://www.kaggle.com/datasets/ashvath07/pneumonia-imbalance-chest-x-ray-dataset
    Explore at:
    zip (985981313 bytes)
    Dataset updated
    Dec 17, 2023
    Authors
    Ashvath S.P
    Description


    We have made a new CXR dataset for pneumonia detection by amalgamating the original CXR dataset with two other CXR datasets, incorporating two new classes, Tuberculosis (TB) and Bacterial Pneumonia (BP). Moreover, we deliberately maintained a substantial class imbalance across the classes. In the training set, we chose 1946 images for BP, 2531 for Covid, 4209 for LO, 7134 for Normal, 490 for TB, and 941 for VP. This deliberate decision makes the dataset more challenging than its previous version.

    We labeled this new dataset the 'Pneumonia Imbalance CXR Dataset'. The sources from which we assembled the two new classes, BP and TB, are [12] and [13] respectively. To clarify: we have not introduced any additional images for the Normal, LO, Covid, and Viral Pneumonia (VP) classes of the original CXR dataset; these four classes remain unchanged from the previous version. Rather, we isolated Bacterial Pneumonia (BP) images from the first source [12] and introduced them as a new class directly into the new dataset. Likewise, TB images were included as a separate class from the second source [13]. Compared to the existing CXR dataset, this new dataset is more skewed and challenging, making it more representative of real-world hospital scenarios. This can be further seen in Fig. 1b: the number of images per class is more diverse in this CXR dataset, making the problem more challenging for conventional neural networks.

    Conventional models, including CNNs and Vision Transformers, have been shown not to perform well on this imbalanced pneumonia dataset. Fig. 1a shows box plots of the correlation coefficients [23] for the various classes. Here, we took a random image from each class, computed its correlation coefficient with respect to all other images in that class, and then plotted these box plots. From Fig. 1a it is evident that the boxes are wider for the Covid, Normal, and TB classes, indicating a higher intra-class variance.
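
    As a sketch of this procedure (assuming one folder of same-size grayscale images per class; the paths and folder names are hypothetical):

    library(png)   # readPNG()

    files    <- list.files("train/Normal", pattern = "\\.png$", full.names = TRUE)
    ref_file <- sample(files, 1)                 # one random reference image
    ref      <- as.vector(readPNG(ref_file))

    # Correlation of the reference image with every other image in the class
    cors <- sapply(setdiff(files, ref_file),
                   function(f) cor(ref, as.vector(readPNG(f))))

    boxplot(cors, main = "Intra-class correlation (Normal class)")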

    [Image: intra-class correlation table (Table 1)]

    [Image: inter-class correlation table (Table 2)]

    The increased intra-class variance in these classes makes the Pneumonia Imbalance dataset more challenging. The mean and standard deviation of intra-class and inter-class correlation are presented in Table 1 and Table 2, respectively. From this, it can be concluded that this new Pneumonia Imbalance CXR dataset is more diverse and challenging than any other existing Covid-19 dataset.

    This "Pneumonia Imbalance Dataset" is associated with the ref [3]. If you're employing our dataset for experimentation or paper publications, kindly cite our paper [3].

    References:
    12. Kermany DS, et al. (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172(5):1122-1131.
    13. Rahman T, et al. (2020) Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 8:191586-191601.

  5. Boxplots of future (2056-95) overall drought-event characteristics derived...

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 8, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Boxplots of future (2056-95) overall drought-event characteristics derived from climate models downscaled by the MACA method assuming historical-standard stomatal resistance [Dataset]. https://catalog.data.gov/dataset/boxplots-of-future-2056-95-overall-drought-event-characteristics-derived-from-climate-mode-cfac9
    Explore at:
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The South Florida Water Management District (SFWMD) and the U.S. Geological Survey (USGS) have evaluated projections of future droughts for south Florida based on climate model output from the Multivariate Adaptive Constructed Analogs (MACA) downscaled climate dataset from the Coupled Model Intercomparison Project Phase 5 (CMIP5). The MACA dataset includes both Representative Concentration Pathways 4.5 and 8.5 (RCP4.5 and RCP8.5). A Portable Document Format (PDF) file is provided which presents boxplots of future overall drought-event characteristics based on 6-mo. and 12-mo. averaged balance anomaly timeseries derived from climate models downscaled by the MACA method assuming the historical-standard stomatal resistance (rs). Overall cumulative drought-event characteristics during the future period 2056-95 are provided as boxplots for four regions: (1) the entire South Florida Water Management District (SFWMD), (2) the Lower West Coast (LWC) water supply region, (3) the Lower East Coast (LEC) water supply region, and (4) the Okeechobee plus (OKEE+) water supply meta-region consisting of Lake Okeechobee (OKEE), the Lower Kissimmee (LKISS), Upper Kissimmee (UKISS), and Upper East Coast (UEC) water supply regions in the SFWMD.

  6. Phishing URL Content Dataset

    • kaggle.com
    zip
    Updated Nov 25, 2024
    Cite
    Aaditey Pillai (2024). Phishing URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/aaditeypillai/phishing-website-content-dataset
    Explore at:
    zip (62701 bytes)
    Dataset updated
    Nov 25, 2024
    Authors
    Aaditey Pillai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Phishing URL Content Dataset

    Executive Summary

    Motivation:
    Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.

    Applications:
    - Building robust phishing detection systems.
    - Enhancing security measures in email filtering and web browsing.
    - Training cybersecurity practitioners in identifying malicious URLs.

    The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.

    Description of Data

    This dataset comprises two types of URLs:
    1. Phishing URLs: Malicious URLs designed to deceive users.
    2. Benign URLs: Legitimate URLs posing no harm to users.

    Key Features:
    - URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
    - Content-based features: Link density, iframe presence, external/internal links, and metadata.
    - Certificate-based features: SSL/TLS details like validity period and organization.
    - WHOIS data: Registration details like creation and expiration dates.

    Statistics:
    - Total Samples: 800 (400 phishing, 400 benign).
    - Features: 22 including URL, domain, link density, and SSL attributes.

    Power Analysis

    To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.

    Exploratory Data Analysis (EDA)

    Insights from EDA:
    - Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
    - Bar Plots: Class distribution and protocol usage trends.
    - Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
    - Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.

    EDA visualizations are provided in the repository.
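
    As a rough illustration of the box-plot comparison above, a short R sketch; the file and column names here are hypothetical:

    library(ggplot2)

    urls <- read.csv("phishing_url_dataset.csv")   # hypothetical file name

    # Compare SSL certificate validity across phishing vs. benign URLs
    ggplot(urls, aes(x = label, y = ssl_validity_days)) +   # hypothetical columns
      geom_boxplot() +
      labs(x = "Class", y = "SSL validity (days)") +
      theme_bw()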

    Link to Publicly Available Data and Code

    The repository contains the Python code used to extract features, conduct EDA, and build the dataset.

    Ethics Statement

    Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
    1. Protects User Privacy: No personally identifiable information is included.
    2. Promotes Ethical Use: Intended solely for academic and research purposes.
    3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.

    Risks:
    - Misuse of the dataset for creating more deceptive phishing attacks.
    - Over-reliance on outdated features as phishing tactics evolve.

    Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.

    Open Source License

    This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.

  7. Sales data for bulls

    • kaggle.com
    zip
    Updated Apr 6, 2023
    Cite
    Yerkin Mudebayev (2023). Sales data for bulls [Dataset]. https://www.kaggle.com/datasets/yerkinmudebayev/sales-data-for-bulls
    Explore at:
    zip (413774 bytes)
    Dataset updated
    Apr 6, 2023
    Authors
    Yerkin Mudebayev
    Description

    Preliminary investigation

    (a) Carry out a shortened initial investigation (steps 1, 2 and 3) based on the matrix scatter plot and box plot. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions.
    (b) Explain why using the correlation matrix for the factor analysis is indicated.
    (c) Display the sample correlation matrix R. Does the matrix R suggest the number of factors to use?
    (d) Perform a preliminary simplified principal component analysis using R.
      i. List the eigenvalues and describe the percent contributions to the variance.
      ii. Determine the number of principal components to retain and justify your answer by considering at least three methods. Note and comment if there is any disagreement between the methods.
    (e) Include your code.
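
    A sketch of how parts (c) and (d) might be carried out in R, assuming the sales data has been read into a numeric data frame (the file name is hypothetical):

    # (c) sample correlation matrix
    bulls <- read.csv("bulls_sales.csv")   # hypothetical file name
    R <- cor(bulls)
    round(R, 2)

    # (d) principal component analysis; scaling the variables makes this
    # equivalent to working from the correlation matrix
    pca <- prcomp(bulls, scale. = TRUE)
    eig <- pca$sdev^2        # i. eigenvalues
    eig / sum(eig)           # percent contributions to the variance

    # ii. three common retention methods: Kaiser criterion (eigenvalues > 1),
    # scree plot elbow, and a cumulative-variance threshold (e.g., 80%)
    sum(eig > 1)
    screeplot(pca, type = "lines")
    cumsum(eig) / sum(eig)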

  8. Titanic: A Voyage into the Past

    • kaggle.com
    zip
    Updated Nov 7, 2023
    Cite
    Asher Mehfooz (2023). Titanic: A Voyage into the Past [Dataset]. https://www.kaggle.com/datasets/ashirzaki/titanic
    Explore at:
    zip (22564 bytes)
    Dataset updated
    Nov 7, 2023
    Authors
    Asher Mehfooz
    Description

    **Dataset Overview** The Titanic dataset is a widely used benchmark dataset for machine learning and data science tasks. It contains information about passengers who boarded the RMS Titanic in 1912, including their age, sex, social class, and whether they survived the sinking of the ship. The dataset is divided into two main parts:

    Train.csv: This file contains information about 891 passengers who were used to train machine learning models. It includes the following features:

    PassengerId: A unique identifier for each passenger
    Survived: Whether the passenger survived (1) or not (0)
    Pclass: The passenger's social class (1 = Upper, 2 = Middle, 3 = Lower)
    Name: The passenger's name
    Sex: The passenger's sex (Male or Female)
    Age: The passenger's age
    SibSp: The number of siblings or spouses aboard the ship
    Parch: The number of parents or children aboard the ship
    Ticket: The passenger's ticket number
    Fare: The passenger's fare
    Cabin: The passenger's cabin number
    Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

    Test.csv: This file contains information about 418 passengers who were not used to train machine learning models. It includes the same features as train.csv, but does not include the Survived label. The goal of machine learning models is to predict whether or not each passenger in the test.csv file survived.

    **Data Preparation** Before using the Titanic dataset for machine learning tasks, it is important to perform some data preparation steps. These steps may include:

    Handling missing values: Some of the features in the dataset have missing values. These values can be imputed or removed, depending on the specific task.
    Encoding categorical variables: Some of the features in the dataset are categorical variables, such as Pclass, Sex, and Embarked. These variables need to be encoded numerically before they can be used by machine learning algorithms.
    Scaling numerical variables: Some of the features in the dataset are numerical variables, such as Age and Fare. These variables may need to be scaled to ensure that they are on the same scale.

    **Data Visualization**

    Data visualization can be a useful tool for exploring the Titanic dataset and gaining insights into the data. Some common data visualization techniques that can be used with the Titanic dataset include:

    Histograms: Histograms can be used to visualize the distribution of numerical variables, such as Age and Fare.
    Scatter plots: Scatter plots can be used to visualize the relationship between two numerical variables.
    Box plots: Box plots can be used to visualize the distribution of a numerical variable across different categories, such as Pclass and Sex.

    **Machine Learning Tasks**

    The Titanic dataset can be used for a variety of machine learning tasks, including:

    Classification: The most common task is to use the train.csv file to train a machine learning model to predict whether or not each passenger in the test.csv file survived.
    Regression: The dataset can also be used to train a machine learning model to predict the fare of a passenger based on their other features.
    Anomaly detection: The dataset can also be used to identify anomalies, such as passengers who are outliers in terms of their age, social class, or other features.
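
    As a quick illustration of the box-plot idea above, a minimal R sketch using the columns described for train.csv:

    library(ggplot2)

    train <- read.csv("train.csv")

    # Distribution of Age across passenger classes, split by sex
    ggplot(train, aes(x = factor(Pclass), y = Age)) +
      geom_boxplot() +
      facet_wrap(~Sex) +
      labs(x = "Passenger class", y = "Age") +
      theme_bw()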

  9. Dataset for: Evidence of sensory error threshold in triggering locomotor...

    • figshare.com
    txt
    Updated Jan 14, 2025
    Cite
    Emily Herrick (2025). Dataset for: Evidence of sensory error threshold in triggering locomotor adaptations in humans [Dataset]. http://doi.org/10.6084/m9.figshare.25343671.v4
    Explore at:
    txt
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Emily Herrick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Task: 23 healthy adults (25.0 ± 6.2 y.o., 12 males, 11 females) with no known neurological disease or persistent musculoskeletal injury were divided into two groups: (i) Group 1 (N=11) performed the walking task with only a kinematic constraint, and (ii) Group 2 (N=12) performed the walking task with both a kinematic constraint and asymmetric limb loading. The participants walked on an instrumented treadmill with ground reaction force sensors (Bertec, Columbus, OH), with each session consisting of three trial types: control, adaptation, and washout. The control trials documented the baseline gait pattern during five minutes of walking at 1 m/s. During the asymmetric trials, participants walked with the device on the right leg constraining the stride length of the left limb. Participants were instructed not to touch the device with the left leg while walking for ten minutes. Specifically for Group 2, participants were explicitly instructed to unload their constrained leg, and verbal feedback was given during the task when the unloading was not evident from the real-time vertical component of the ground reaction force. The washout period allowed the observation of aftereffects after the removal of the constraints.

    Dataset: The dataset consists of two analyses: (i) comparing the cumulative sum of ground reaction forces (GRFs) for each leg during the stance phase between Groups 1 and 2, and (ii) comparing the asymmetry index between Groups 1 and 2. For the first analysis, the data can be found in nTable_CS.csv (see below). A supporting figure displaying example GRFs for each group can be made using the data in nTable_ex_GRF.csv (see below). For the second analysis, the data can be found in nTable_AI.csv (see below). To calculate the asymmetry index (AI), use the following equation: AI = (nDS_Left_Leading_s - nDS_Left_Trailing_s) / (nDS_Left_Leading_s + nDS_Left_Trailing_s). All data are in .csv files, so you can easily import them into your software of choice. Scripts are included for our analysis done in MATLAB (see below).

    nTable_CS.csv columns:
    - idGroup -- group identifier (either 1 or 2)
    - sSession -- session identifier (S + #)
    - sCondition -- condition identifier (either control, asymL (adaptation), or washout)
    - sLeg -- leg identifier (either L (left) or R (right))
    - nCumulSum -- cumulative sum of GRFs during the stance phase, normalized to the participant's weight and the number of steps in each condition

    nTable_ex_GRF.csv columns:
    - idGroup -- group identifier (either 1 or 2)
    - sSignal -- signal name (either LFz (GRFs for the left leg) or RFz (GRFs for the right leg))
    - nData_avg -- average values across the period, normalized to the participant's weight
    - nData_sd -- standard deviation across the period

    nTable_AI.csv columns:
    - idGroup -- group identifier (either 1 or 2)
    - sSession -- session identifier (S + #)
    - sCondition -- condition identifier (either Control, Adapt (i.e., adaptation), or Post (i.e., washout))
    - nDS_Left_Leading_s -- left leading double-stance durations in seconds (i.e., the phase of the gait cycle where both legs are in contact with the ground and the left leg is in front)
    - nDS_Left_Trailing_s -- left trailing double-stance durations in seconds (i.e., the phase of the gait cycle where both legs are in contact with the ground and the left leg is in the back)

    MATLAB scripts:
    - main_v9_figshare.m -- the main script to run the entire analysis
    - plotExampleGRF.m -- plots an example GRF profile for a given group
    - plotLoading_v5.m -- plots the box plots and performs the stats for the loading analysis
    - testHypothesis_v6.m -- plots the box plots and performs the stats for the aftereffects analysis
    - getAvgBehavior_v5.m -- plots the average behavior regarding asymmetry for a given group
    - plotAvgSignalwSD.m -- plots an average signal with standard deviation lines above and below it
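
    Although the included analysis scripts are MATLAB, the asymmetry index is easy to reproduce elsewhere; a minimal R sketch using the columns above:

    ai <- read.csv("nTable_AI.csv")

    # Asymmetry index from the double-stance durations (equation above)
    ai$AI <- (ai$nDS_Left_Leading_s - ai$nDS_Left_Trailing_s) /
             (ai$nDS_Left_Leading_s + ai$nDS_Left_Trailing_s)

    # Compare the AI distribution across conditions
    boxplot(AI ~ sCondition, data = ai, ylab = "Asymmetry index")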

  10. Feature Engineering Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2023
    Cite
    Harikant Shukla (2023). Feature Engineering Dataset [Dataset]. https://www.kaggle.com/datasets/harikantshukla/feature-engineering-dataset/discussion
    Explore at:
    zip (95245 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Harikant Shukla
    Description

    While searching for the dream house, the buyer looks at various factors, not just at the height of the basement ceiling or the proximity to an east-west railroad.

    Using the dataset, find the factors that influence price negotiations while buying a house.

    There are 79 explanatory variables describing every aspect of residential homes in Ames, Iowa.

    Task to be Performed:

    1) Download “PEP1.csv” using the link given in the Feature Engineering project problem statement.
    2) For a detailed description of the dataset, you can download and refer to data_description.txt using the link given in the Feature Engineering project problem statement.

    Tasks to Perform

    1) Import the necessary libraries:
    1.1 Pandas is a Python library for data manipulation and analysis.
    1.2 NumPy is a package that contains a multidimensional array object and several derivative ones.
    1.3 Matplotlib is a Python visualization package for 2D array plots.
    1.4 Seaborn is built on top of Matplotlib; it is used for exploratory data analysis and data visualization.

    2) Read the dataset:
    2.1 Understand the dataset.
    2.2 Print the names of the columns.
    2.3 Print the shape of the dataframe.
    2.4 Check for null values.
    2.5 Print the unique values.
    2.6 Select the numerical and categorical variables.

    3) Descriptive stats and EDA:
    3.1 EDA of numerical variables.
    3.2 Missing value treatment.
    3.3 Identify the skewness and distribution.
    3.4 Identify significant variables using a correlation matrix.
    3.5 Pair plot for distribution and density.

    Project Outcome

    • The aim of the project is to help understand working with the dataset and performing analysis.
    • This project assesses the data and prepares a fresh dataset for training and prediction.
    • A box plot is created to identify the variables with outliers.

  11. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Explore at:
    xlsx
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
    A. a sequential student ID
    B. an ID that defines a random group label and the notation
    C. the used notation: user stories or use cases
    D. the case they were assigned to: IFA, Sim, or Hos
    E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
    F. a categorical representation of the grade as L/M/H, where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
    G. the total number of classes in the student's conceptual model
    H. the total number of relationships in the student's conceptual model
    I. the total number of classes in the expert's conceptual model
    J. the total number of relationships in the expert's conceptual model
    K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
    P. the researchers' judgement of how well the student explained the derivation process: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present

    Tagging scheme:
    - Aligned (AL): a concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
    - Wrongly represented (WR): a class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., 'user' instead of 'urban planner');
    - System-oriented (SO): a class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
    - Omitted (OM): a class in CM-Expert that does not appear in any way in CM-Stud;
    - Missing (MI): a class in CM-Stud that does not appear in any way in CM-Expert.

    All the calculations and information provided in the following sheets originate from that raw data.

    Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

    Sheet 3 (Size-Ratio): The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes; however, we also provide the size ratio for the number of relationships between student and expert model.

    Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model; it is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model; it is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
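
    As an illustration, both ratios can be computed directly from the per-subject counts; a minimal R sketch (the .csv export and its column names are hypothetical):

    raw <- read.csv("uc_vs_us_raw.csv")   # hypothetical export of the Raw-Data sheet

    # Correctness = AL / (AL + OM + SO + WR); Completeness = (AL + WR) / (AL + WR + OM)
    raw$correctness  <- raw$AL / (raw$AL + raw$OM + raw$SO + raw$WR)
    raw$completeness <- (raw$AL + raw$WR) / (raw$AL + raw$WR + raw$OM)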

    For sheet 4, as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by the exam grades, converted to the categorical values High, Medium, and Low.

  12. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Nov 15, 2024
    + more versions
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14171251
    Explore at:
    zip
    Dataset updated
    Nov 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 11/15/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been revised throughout the peer review process.

    # Data information

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    # Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.
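
    A minimal sketch of how these scripts chain together, based on the description above (the working-directory path is hypothetical):

    # 01_start.R (sketch)
    setwd("~/projects/storage_transit")   # hypothetical path
    library(tidyverse)                    # other packages are called via ::
    source("02_functions.R")              # load the custom functions
    source("03_import_data.R")            # build annual_turnover_2 for analysis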

    "04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  13. Predict Term Deposit

    • kaggle.com
    zip
    Updated Nov 29, 2021
    Cite
    Aslan Ahmedov (2021). Predict Term Deposit [Dataset]. https://www.kaggle.com/aslanahmedov/predict-term-deposit
    Explore at:
    zip (588608 bytes)
    Dataset updated
    Nov 29, 2021
    Authors
    Aslan Ahmedov
    Description

    Predict Term Deposit

    Introduction

    The bank has multiple banking products that it sells to customers, such as savings accounts, credit cards, and investments. It wants to know which customers will purchase its credit cards. For this, it has various kinds of information, such as the demographic details of the customers and their banking behavior. Once it can predict the chances that a customer will purchase a product, it can use those predictions to target its offers.

    In this part I will demonstrate how to build a model to predict which clients will subscribe to a term deposit, using machine learning. In the first part we deal with the description and visualization of the analysed data, and in the second we move on to data classification models.

    Strategy

    - Desired target
    - Data understanding
    - Preprocessing data
    - Machine learning model
    - Prediction
    - Comparing results

    Desired Target

    Predict if a client will subscribe (yes/no) to a term deposit — this is defined as a classification problem.

    Data

    The dataset (Assignment-2_data.csv) used in this assignment contains bank customers’ data. File name: Assignment-2_Data. File format: .csv. Number of rows: 45212. Number of attributes: 17 non-empty conditional attributes and one decision attribute.


    Exploratory Data Analysis (EDA)

    Data pre-processing is a key step in machine learning, as the useful information that can be derived from a data set directly affects the model quality; it is therefore extremely important to perform at least the necessary preprocessing of our data before feeding it into our model.

    In this assignment, we are going to utilize python to develop a predictive machine learning model. First, we will import some important and necessary libraries.

    Below we can see that there are various numerical and categorical columns. The most important column here is y, the output variable (desired target): this tells us whether the client subscribed to a term deposit (binary: ‘yes’, ‘no’).


    We must check whether our dataset has any missing values, and whether it has any duplicated values.


    We can see that 'age' has 9 missing values and 'balance' is missing 3 values as well. Given that our dataset has around 45k rows, I will remove these rows from the dataset. Pics 1 and 2 show before and after.


    From the above analysis we can see that only 5289 people out of 45200 have subscribed, which is roughly 12%. Our dataset is highly unbalanced; we need to keep this in mind.


    Our list of categorical variables.


    Our list of numerical variables.


    "Age" Q-Q Plots and Box Plot.

    In the above box plot we can see some points at very young ages, as well as impossible ages. So,


    Now we don’t have issues with this feature, so we can use it.


    "Duration" Q-Q Plots and Box Plot


    This attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet the duration is not known before a call is performed; also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes...

  14. DataScience for Work - Human Resources

    • kaggle.com
    zip
    Updated Apr 28, 2024
    Cite
    Beytullah Soylev (2024). DataScience for Work - Human Resources [Dataset]. https://www.kaggle.com/datasets/soylevbeytullah/ds4work-human-resources
    Explore at:
    zip (51278 bytes)
    Dataset updated
    Apr 28, 2024
    Authors
    Beytullah Soylev
    Description

    Case Study: Improving Human Resources with Data Science

    Objective: Utilize data science to predict employee turnover and enhance the Human Resources department.

    Key Learnings:

    Leveraging Data Science for HR Transformation: Understand how data science can reduce employee turnover and revolutionize HR.

    Logistic Regression and Random Forest Classifiers: Grasp the theory behind these classifiers and implement them using scikit-learn.

    Sigmoid Functions and Pandas DataFrames: Extract probability values using sigmoid functions and manipulate datasets with Pandas.

    Python Functions and Pandas Dataframe Applications: Develop and apply Python functions to Pandas dataframes.

    Exploratory Data Analysis with Matplotlib and Seaborn: Perform EDA using Matplotlib and Seaborn, generating KDE plots, box plots, and count plots.

    Categorical Variable Transformation and Data Set Division: Convert categorical variables into dummy variables and divide datasets into training and testing sets using scikit-learn.

    Artificial Neural Networks for Classification: Understand the theory and application of artificial neural networks in classification tasks.

    Classification Model Evaluation and Result Interpretation: Evaluate classification models using confusion matrices and classification reports, distinguishing between precision, recall, and F1 scores.

    Embark on this data-driven journey to transform Human Resources!
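
    As a quick illustration of the sigmoid step above (shown in R for brevity, though the case study itself works in Python/scikit-learn):

    # Sigmoid: maps a real-valued score (log-odds) to a probability in (0, 1)
    sigmoid <- function(z) 1 / (1 + exp(-z))

    sigmoid(0)     # 0.5 -- the decision boundary
    sigmoid(2.2)   # ~0.90 -- e.g., a high predicted probability of turnover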

  15. Guyadiv Savane Roche Virginie: plots of forest inventory since 2006.

    • data.europa.eu
    • aquacoope.org
    • +3 more
    Updated Jan 1, 2007
    Cite
    (2007). Guyadiv Savane Roche Virginie: plots of forest inventory since 2006. [Dataset]. https://data.europa.eu/data/datasets/f98cbfbc-bcc5-4892-ad92-6628cf8bb706
    Explore at:
    Dataset updated
    Jan 1, 2007
    Description

    Guyadiv is a network of permanent forest plots installed in French Guiana. The site of Savane Roche Virginie is composed of three 1-ha plots. Three complete inventories were made in 2006, 2007 and 2008: 1664 trees with dbh >= 10 cm were registered. 475, 507 and 657 species were identified in the three plots, respectively. 97.9%, 99.2% and 98.4% of the inventoried trees were identified to the species level. We only have the point coordinates and not the precise demarcation of the sample plots. In order to calculate the bounding box for these plots, we have expanded the point location 300 meters in each direction.

  16. Guyadiv Eperon Barré: plots of forest inventory since 2009.

    • data.europa.eu
    • data.geocatalogue.fr
    • +2 more
    Updated Dec 17, 2021
    Cite
    (2021). Guyadiv Eperon Barré: plots of forest inventory since 2009. [Dataset]. https://data.europa.eu/data/datasets/0f1266c1-f360-4a24-a49d-b5105614d017?locale=it
    Explore at:
    Dataset updated
    Dec 17, 2021
    Description

    Guyadiv is a network of permanent forest plots installed in French Guiana. The site of Eperon Barré is composed of two plots: one 1-ha plot and one 100x200 m plot. A complete inventory was made in 2009 and 952 trees with dbh >= 10 cm were registered. 535 and 405 species were identified in the two plots. 99.1% and 98.3% of the registered trees were identified to the species level. We only have the point coordinates and not the precise demarcation of the sample plots. In order to calculate the bounding box for these plots, we have expanded the point location 500 meters in each direction.

  17. Walmart Data Set

    • kaggle.com
    zip
    Updated Jan 4, 2023
    Cite
    Matthew Garrett Carter (2023). Walmart Data Set [Dataset]. https://www.kaggle.com/datasets/matthewgarrettcarter/walmart-data-set
    Explore at:
    zip (272320 bytes)
    Dataset updated
    Jan 4, 2023
    Authors
    Matthew Garrett Carter
    Description

    Introduction

    The purpose of this project was to gain practice with, and demonstrate, R data-analysis skills. The data set was located on Kaggle and shows sales information from the years 2010 to 2012. The weekly sales have two categories, holiday and non-holiday, represented by 1 and 0 in that column, respectively.

    The main question for this exercise was: were there any factors that affected weekly sales for the stores? Those factors included temperature, fuel prices, and unemployment rates.

    The following packages required for this project:

    install.packages("tidyverse")
    install.packages("dplyr")
    install.packages("tsibble")
    

    The following libraries required:

    library("tidyverse")
    library(readr)
    library(dplyr)
    library(ggplot2)
    library(readr)
    library(lubridate)
    library(tsibble)
    

    Downloading data set into RStudio:

    Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
    

    Data Inspection

    Inspected the column names and overall structure of the data to verify consistency:

    
    colnames(Walmart)            # column names
    dim(Walmart)                 # dimensions
    str(Walmart)                 # structure
    head(Walmart)                # first few rows
    which(is.na(Walmart$Date))   # rows with missing dates
    sum(is.na(Walmart))          # total count of NA values
    

    There is NA data in the set.

    Turning Store and Holiday_Flag into factors:

    Walmart$Store<-as.factor(Walmart$Store)
    Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
    

    Splitting the date into year and year-week:

    Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
    Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
    
    

    Filtered the Holiday_Flag column to include only holiday weeks:

    Walmart_Holiday<-
     filter(Walmart, Holiday_Flag==1)
    

    Filtered the Holiday_Flag column to include only non-holiday weeks:

    Walmart_Non_Holiday<-
     filter(Walmart, Holiday_Flag==0)
    

    Let's review all 45 stores' weekly sales and compare them, using the Walmart dataset:

    ggplot(Walmart, aes(x=Weekly_Sales, y=Store)) + geom_boxplot() +
     labs(title = 'Weekly Sales Across 45 Stores', x='Weekly sales', y='Store') +
     theme_bw()
    

    Results

    From the box plot, it appears that Store 14 had the maximum sales while Store 33 had the minimum.

    Let's verify the results via slice_max and slice_min:

    Walmart %>% slice_max(Weekly_Sales)
    
    Walmart %>% slice_min(Weekly_Sales) 
    

    It looks like the information was correct. Let's check the mean of the Weekly_Sales column:

    mean(Walmart$Weekly_Sales)
    

    The mean of the Weekly_Sales column for the Walmart dataset was 1046965.

    Let's check the min and max of weekly sales, but only for holiday weeks:

    ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store)) + geom_boxplot() +
     labs(title = 'Holiday Sales Across 45 Stores', x='Weekly sales', y='Store') +
     theme_bw()
    

    Result

    Based on the box plot, Store 4 had the highest weekly sales during a holiday week, while stores 33 and 5 had some of the lowest holiday sales. Let's reverify with slice_max and slice_min:

    Walmart_Holiday %>% slice_max(Weekly_Sales)
    
    Walmart_Holiday %>% slice_min(Weekly_Sales)
    

    The results match what is shown on the box plot. Let's find the mean:

    mean(Walmart_Holiday$Weekly_Sales)
    

    The mean was 1122888.

    Let's check the min and max of weekly sales, but only for non-holiday weeks:

    ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store)) + geom_boxplot() +
     labs(title = 'Non-Holiday Sales Across 45 Stores', x='Weekly sales', y='Store') +
     theme_bw()
    

    This matches the results from the full Walmart dataset, which included both holiday and non-holiday weeks: Store 14 had the maximum sales and Store 33 the minimum. Let's verify the results and find the mean:

    Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
    
    Walmart_Non_Holiday %>% slice_min(Weekly_Sales)  
    
    mean(Walmart_Non_Holiday$Weekly_Sales)
    

    The results matched, and the mean weekly sales figure was 1041256.

    Which Year had the most sales?

    ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
    

    According to the plot, 2010 had the most sales. Let's use a box plot to see more.

    ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012', 
                                         x='Year', y='Weekly Sales')
    

    2010 saw higher sales numbers and a higher median.

    Is there any difference between sales during non-holiday weeks and holiday weeks?

    Let's start with holiday weekly sales:

    ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
    