21 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Resize the window as desired and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
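    The three-column layout from Step 1 can be checked, and the per-condition centre lines of the superimposed boxplots previewed, before running the R script. Below is a minimal Python sketch (stdlib only; the example values are hypothetical, in the format the protocol describes).

```python
import csv
import io
import statistics

def condition_summary(csv_text):
    """Parse the three-column file from Step 1 (Replicate, Condition, Value)
    and return the per-condition median, i.e. the centre line of each boxplot."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    by_condition = {}
    for row in rows:
        by_condition.setdefault(row["Condition"], []).append(float(row["Value"]))
    return {cond: statistics.median(vals) for cond, vals in by_condition.items()}

# Hypothetical example in the format of the protocol:
example = """Replicate,Condition,Value
2016-01,WT,10.0
2016-01,mutant A,7.5
2016-02,WT,12.0
2016-02,mutant A,8.5
"""
print(condition_summary(example))  # {'WT': 11.0, 'mutant A': 8.0}
```

    The R script itself then does the plotting; this check only confirms the input file parses as the protocol expects.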

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Superstore Sales Analysis

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis/versions/1
    Explore at:
    zip (3009057 bytes)
    Dataset updated
    Oct 21, 2023
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately.

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.
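    The calculations in steps 3-5 above reduce to a few formulas. Here is a rough Python sketch of those formulas (the unit-cost breakdown, function names, and example numbers are assumptions for illustration, not values from the Superstore workbook):

```python
def cogs(purchase_price, shipping_cost, extra_expenses, quantity):
    """Cost of Goods Sold for a line item: assumed per-unit cost breakdown times units sold."""
    return (purchase_price + shipping_cost + extra_expenses) * quantity

def discount_value(list_price, discount_rate, quantity):
    """Revenue given away through the discount."""
    return list_price * discount_rate * quantity

def profit_margin(revenue, total_cogs):
    """Profit margin as a fraction of revenue."""
    return (revenue - total_cogs) / revenue

# One hypothetical order line: 10 units listed at 25.00 with a 20% discount.
qty, list_price, rate = 10, 25.0, 0.20
revenue = list_price * (1 - rate) * qty             # 200.0
line_cogs = cogs(12.0, 1.5, 0.5, qty)               # 140.0
print(discount_value(list_price, rate, qty))        # 50.0
print(round(profit_margin(revenue, line_cogs), 2))  # 0.3
```

    In the actual project these formulas live in Excel cells and Power Query steps; the sketch only makes the arithmetic explicit.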

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.

  3. HelpSteer: AI Alignment Dataset

    • kaggle.com
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). HelpSteer: AI Alignment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/helpsteer-ai-alignment-dataset
    Explore at:
    zip (16614333 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HelpSteer: AI Alignment Dataset

    Real-World Helpfulness Annotated for AI Alignment

    By Huggingface Hub [source]

    About this dataset

    HelpSteer is an open-source dataset designed to support AI alignment through fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and response along with five human-annotated attributes scored between 0 and 4, with higher scores indicating better quality. Combining machine learning and natural language processing methods with expert annotation, HelpSteer aims to provide a standardized measure of alignment between human and machine interactions. With responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer helps organizations foster reliable AI models that deliver more accurate results and an improved user experience.


    How to use the dataset

    How to Use HelpSteer: An Open-Source AI Alignment Dataset

    HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.

    Step 1 - Choosing the Data File

    HelpSteer contains two data files, one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above, or by getting them from the Google Drive repository attached here: [link]. The samples in each file consist of 7 columns describing a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity, all with values between 0 and 4, where higher means better in the respective category.

    Step 2 - Exploratory Data Analysis (EDA)

    Once you have your file loaded into your workspace or favorite software environment (e.g. libraries like Pandas/NumPy, or even Microsoft Excel), it's time to explore it further by running some basic EDA commands that summarize each feature's distribution within the data set, and to note potential trends or points of interest throughout it. For example: which traits polarize responses the most? Are there outliers that might signal something interesting? Plotting these results often provides great insight into patterns across the dataset, which can be used later during the modeling phase, also known as “Feature Engineering”.

    Step 3 - Data Preprocessing

    Your interpretation of the raw data during EDA should produce hypotheses about which features matter most when estimating the attribute scores of unknown responses. Before any modelling effort, preprocess the data accordingly: clean up missing entries and handle outliers. If unsure about the allowed value ranges of specific attributes, refer back to the description section of the Kaggle page; knowing the correct numerical domains makes the modelling workload lighter when building predictive models. Do not rush this stage, otherwise poor-quality data may lead to disappointing accuracy after model deployment.
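    The EDA and domain checks described in Steps 2 and 3 can be sketched as follows (stdlib only; the inline sample rows are hypothetical, but the seven columns and the 0-4 attribute range come from the dataset description):

```python
import csv
import io
import statistics

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

def eda_summary(csv_text):
    """Mean of each annotated attribute, plus any rows whose scores
    fall outside the documented 0-4 domain."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    means = {a: statistics.mean(float(r[a]) for r in rows) for a in ATTRIBUTES}
    bad = [r for r in rows if any(not 0 <= float(r[a]) <= 4 for a in ATTRIBUTES)]
    return means, bad

# Hypothetical sample with the dataset's seven columns:
sample = """prompt,response,helpfulness,correctness,coherence,complexity,verbosity
How do I sort a list?,Use sorted().,4,4,4,1,1
Explain TCP.,TCP is a protocol.,2,3,4,1,0
"""
means, bad = eda_summary(sample)
print(means["helpfulness"], len(bad))  # 3.0 0
```

    In practice you would run this over train.csv or validation.csv and plot the distributions rather than print them.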

    Research Ideas

    • Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
    • Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
    • Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.


  4. Association analysis of high-low outlier road intersection crashes within...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975741.v1
    Explore at:
    xlsx
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,05 to 0,0278, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 0,99 MB
    Number of Files: The dataset contains a total of 10212 association rules
    Date Created: 23rd May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of the "high-low" outlier road intersection crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-low" outlier road intersection crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2021 (2020 data omitted)
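    The support metric used to filter these association rules is the fraction of crashes that share a given attribute combination. A minimal Python sketch of that computation (stdlib only; the crash records below are invented for illustration, not taken from the dataset):

```python
from collections import Counter
from itertools import combinations

def frequent_combinations(records, min_support):
    """Count attribute combinations across records and keep those whose support
    (share of all records containing the combination) meets the threshold."""
    counts = Counter()
    for record in records:
        for size in range(1, len(record) + 1):
            for combo in combinations(sorted(record), size):
                counts[combo] += 1
    n = len(records)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

# Hypothetical crash records, each a set of crash attributes:
crashes = [
    {"night", "wet road", "unsignalled"},
    {"night", "dry road", "unsignalled"},
    {"day", "wet road", "unsignalled"},
    {"night", "wet road", "unsignalled"},
]
freq = frequent_combinations(crashes, min_support=0.5)
print(freq[("night", "unsignalled")])  # 0.75
```

    A support threshold of 0,05 as in the dataset corresponds to min_support=0.05, i.e. combinations present in at least 5% of the outlier crashes.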

  5. FoodLAND - Survey and experimental dataset from marketing tests in Tanzania,...

    • zenodo.org
    Updated Feb 26, 2025
    Cite
    Jesper Clement; Jesper Clement (2025). FoodLAND - Survey and experimental dataset from marketing tests in Tanzania, Uganda, and Kenya [Dataset]. http://doi.org/10.5281/zenodo.14929611
    Explore at:
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jesper Clement; Jesper Clement
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tanzania, Kenya, Uganda
    Description

    The dataset compiles information collected under T5.8 (Marketing tests and strategies) of the project FoodLAND. It contains biometric (emotional response + visual attention) and behavioral data from 600+ urban consumers in Tanzania, Uganda, and Kenya evaluating new locally produced food products. The Excel file contains the raw data, cleaned for outliers, and the metadata.

  6. Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series...

    • osti.gov
    • dataone.org
    • +1more
    Updated Dec 31, 2020
    Cite
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (2020). Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series Data for Billy Barr, East River, Colorado USA [Dataset]. http://doi.org/10.15485/1823516
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    Area covered
    East River, Colorado, United States
    Description

    A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1—Preliminary raw data sets exploration, including time formatting and combining datasets of different lengths and different time intervals; Phase 2—QA of the datasets, including detecting and flagging of duplicates, outliers, and extreme values; and Phase 3—the development of time series of a desired frequency, imputation of missing values, visualization and a final statistical summary. The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

    The files in this data package include one Excel file, converted to CSV format (Billy_Barr_raw_qaqc.csv), that contains the raw meteorological data, i.e., the input data used for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the QA/QC'd and flagged meteorological data, i.e., the output data from the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is a script written in R that implements the QA/QC and flagging process. The purpose of the CSV data files included in this package is to provide the input and output files for the R script.
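    Phase 2 of the framework (flagging duplicates, outliers, and extreme values) can be illustrated with a small Python sketch. The z-score rule, the threshold, and the sample series below are assumptions for illustration, not taken from the R script:

```python
import statistics

def flag_series(timestamps, values, z_threshold=3.0):
    """Return per-observation QA flags: 'duplicate' for repeated timestamps,
    'outlier' for values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    seen = set()
    flags = []
    for t, v in zip(timestamps, values):
        flag = "ok"
        if t in seen:
            flag = "duplicate"
        elif stdev > 0 and abs(v - mean) / stdev > z_threshold:
            flag = "outlier"
        seen.add(t)
        flags.append(flag)
    return flags

# Hypothetical hourly temperature readings with one duplicate and one spike:
times = ["00:00", "01:00", "01:00", "02:00", "03:00"]
temps = [1.2, 1.4, 1.4, 1.3, 25.0]
print(flag_series(times, temps, z_threshold=1.5))
# ['ok', 'ok', 'duplicate', 'ok', 'outlier']
```

    The actual R script flags the data rather than dropping it, so downstream users can decide how to treat each flagged value.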

  7. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
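    The core computation behind ESM_5/ESM_6 (the Mahalanobis distance of each firm from the multivariate centre of the ratio data) can be sketched with NumPy. The toy matrix below stands in for the firm-by-ratio Z-score table; it is an assumption for illustration, not data from the study:

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the column means,
    scaled by the inverse of the sample covariance matrix."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # D_i = sqrt( (x_i - mu)^T S^-1 (x_i - mu) )
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Toy stand-in for the Z-score matrix (4 firms, 2 ratios):
X = np.array([[1.0, 0.0],
              [-1.0, 0.0],
              [0.0, 1.0],
              [0.0, -1.0]])
print(mahalanobis_distances(X))  # all four points are equally far from the centre
```

    Firms with unusually large distances are the multivariate outliers the manuscript flags; the clustering scripts (ESM_9 to ESM_12) then work on the firm-level mean Z-scores.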

  8. Data for: Non-target and suspected-target screening for potentially...

    • dataverse.harvard.edu
    • data.mendeley.com
    Updated Sep 17, 2019
    Cite
    Janis Rusko (2019). Data for: Non-target and suspected-target screening for potentially hazardous chemicals in food contact materials: investigation of paper straws [Dataset]. http://doi.org/10.7910/DVN/MNY13S
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Janis Rusko
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Briefly, the data contains:

    • Average and individual spectra of the compounds identified in the study
    • The compiled suspect candidate database, used for non-target screening
    • Excel workbook for the analysis of retention time outliers
    • The R script implemented for the mutagenicity and carcinogenicity analysis via the battery of (Q)SAR tools
    • The output of the R script as an Excel workbook

  9. zomato order data

    • kaggle.com
    Updated Jul 14, 2025
    Cite
    NayakGanesh007 (2025). zomato order data [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/zomato-order-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    NayakGanesh007
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Zomato Food Orders – Data Analysis Project 📌 Description: This dataset contains food order data from Zomato, one of India’s leading food delivery platforms. It includes information on customer orders, order status, restaurants, delivery times, and more. The goal of this project is to explore and analyze key insights around customer behavior, delivery patterns, restaurant performance, and order trends.

    šŸ” Project Objectives: šŸ“Š Perform Exploratory Data Analysis (EDA)

    📦 Analyze most frequently ordered cuisines and items

    ā±ļø Understand average delivery times and delays

    🧾 Identify top restaurants and order volumes

    📈 Uncover order trends by time (hour/day/week)

    💬 Visualize data using Matplotlib & Seaborn

    🧹 Clean and preprocess data (missing values, outliers, etc.)

    šŸ“ Dataset Features (Example Columns): Column Name Description Order ID - Unique ID for each order Customer ID - Unique customer identifier Restaurant - Name of the restaurant Cuisine - Type of cuisine ordered Order Time - Timestamp when the order was placed Delivery Time - Timestamp when the order was delivered Order Status - Status of the order (Delivered, Cancelled) Payment Method - Mode of payment (Cash, Card, UPI, etc.) Order Amount - Total price of the order

    🛠 Tools & Libraries Used:

    Python

    Pandas, NumPy for data manipulation

    Matplotlib, Seaborn for visualization

    Excel (for raw dataset preview and checks)

    ✅ Outcomes:

    Customer ordering trends by cuisine and location

    Time-of-day and day-of-week analysis for peak delivery times

    Delivery efficiency evaluation

    Business recommendations for improving customer experience

  10. Association analysis of high-low outlier unsignalled road intersection...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier unsignalled road intersection crashes within the CoCT in 2017, 2018 and 2019 [Dataset]. http://doi.org/10.25375/uct.25982002.v1
    Explore at:
    xlsx
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on unsignalled road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-low" outlier unsignalled road intersection crashes for the years 2017, 2018 and 2019. The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,223, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 57,4 KB
    Number of Files: The dataset contains a total of 1050 association rules
    Date Created: 24th May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the unsignalled road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier unsignalled road intersection crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2019

  11. Cardiovascular diseases dataset

    • kaggle.com
    zip
    Updated Mar 14, 2020
    Cite
    David (2020). Cardiovascular diseases dataset [Dataset]. https://www.kaggle.com/aiaiaidavid/cardio-data-dv13032020
    Explore at:
    zip (458315 bytes)
    Dataset updated
    Mar 14, 2020
    Authors
    David
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description of the data set

    This data set is a cleaned up copy of cardio_train.csv which can be found at:

    https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

    The original data set has been analyzed with Excel, correcting negative values, and removing outliers.

    A number of features in the dataset are used to predict the presence or absence of a cardiovascular disease.

    Below is a description of the features:

    AGE: integer (years of age)
    HEIGHT: integer (cm) 
    WEIGHT: integer (kg)
    GENDER: categorical (1: female, 2: male)
    AP_HIGH: systolic blood pressure, integer
    AP_LOW: diastolic blood pressure, integer 
    CHOLESTEROL: categorical (1: normal, 2: above normal, 3: well above normal)
    GLUCOSE: categorical (1: normal, 2: above normal, 3: well above normal)
    SMOKE: categorical (0: no, 1: yes)
    ALCOHOL: categorical (0: no, 1: yes)
    PHYSICAL_ACTIVITY: categorical (0: no, 1: yes)
    

    And the target variable:

    CARDIO_DISEASE: categorical (0: no, 1: yes)
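    The categorical codes listed above make the dataset easy to validate after loading. A minimal Python sketch of such a check (the sample record is hypothetical; the allowed code sets come from the feature description):

```python
# Allowed codes per the feature description above.
DOMAINS = {
    "GENDER": {1, 2},
    "CHOLESTEROL": {1, 2, 3},
    "GLUCOSE": {1, 2, 3},
    "SMOKE": {0, 1},
    "ALCOHOL": {0, 1},
    "PHYSICAL_ACTIVITY": {0, 1},
    "CARDIO_DISEASE": {0, 1},
}

def invalid_fields(row):
    """Names of categorical fields whose value is outside its documented domain."""
    return [f for f, allowed in DOMAINS.items() if row.get(f) not in allowed]

# Hypothetical record:
row = {"AGE": 54, "HEIGHT": 170, "WEIGHT": 80, "GENDER": 2, "AP_HIGH": 120,
       "AP_LOW": 80, "CHOLESTEROL": 1, "GLUCOSE": 3, "SMOKE": 0, "ALCOHOL": 0,
       "PHYSICAL_ACTIVITY": 1, "CARDIO_DISEASE": 0}
print(invalid_fields(row))                   # []
print(invalid_fields({**row, "GENDER": 3}))  # ['GENDER']
```

    Since the negative values and outliers were already corrected in Excel, a check like this mainly guards against errors introduced when re-importing or transforming the file.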
    
  12. Bestseller Book Data

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    oyebusola (2024). Bestseller Book Data [Dataset]. https://www.kaggle.com/datasets/oyecrafts/bestseller-book-data
    Explore at:
    zip (420400 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    oyebusola
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides valuable insights into the ratings distribution of bestselling books across different categories. With a meticulous categorization of bestsellers based on their user ratings, this dataset offers a comprehensive overview of the popularity and reception of top-selling books. Whether you're interested in exploring highly-rated bestsellers, very highly-rated bestsellers, or moderately rated bestsellers, this dataset empowers you to analyze trends and patterns in the literary world. Leveraging this dataset opens up opportunities for market research, trend analysis, and strategic decision-making for publishers, authors, and book enthusiasts alike.

    What questions were asked

    • What is the distribution of bestseller ratings among the top-selling books?
    • How many books fall into each category of bestseller ratings (e.g., very highly rated, highly rated, moderately rated)?
    • Which genres tend to have the highest-rated bestsellers?
    • Are there any trends or patterns in the ratings of bestsellers over time?
    • What are the characteristics of highly-rated bestsellers compared to moderately-rated ones?
    • How do the prices of bestsellers correlate with their ratings?
    • Can we identify any outliers or anomalies in the dataset that may require further investigation?
    • Are there any authors who consistently produce highly-rated bestsellers?
    • How does the number of reviews correlate with the user ratings of bestsellers?
    • What insights can be gained from comparing the ratings breakdowns across different years or time periods?

    What were the tasks completed?

    1. Data Cleaning and Manipulation in Excel: Conducted data cleaning and manipulation tasks such as removing duplicates, handling missing values, and formatting data for analysis in Excel.

    2. Data Collection from Kaggle: Gathered the initial dataset containing information about bestselling books from Kaggle, a popular platform for datasets.

    3. Visualization in Tableau: Created interactive visualizations of the dataset using Tableau, a powerful data visualization tool, to explore and analyze bestseller ratings breakdowns.

    4. Reporting on Google Docs: Generated reports and summaries of the findings using Google Docs, a collaborative document editing platform, to communicate insights effectively.
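The Excel cleaning step above (removing duplicates, handling missing values) has a direct pandas equivalent; a minimal sketch with illustrative rows and column names:

```python
import pandas as pd

# Illustrative raw rows with one duplicate and one missing value; the
# real column names in the Kaggle file may differ.
raw = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B"],
    "User Rating": [4.8, 4.8, None],
})

clean = (
    raw.drop_duplicates()               # remove duplicate rows
       .dropna(subset=["User Rating"])  # drop rows missing a rating
       .reset_index(drop=True)
)
print(clean)
```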

  13. Data from: Small molecule inhibitor of tau self-association in a mouse model of tauopathy: A preventive study in P301L tau JNPL3 mice

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 7, 2023
    Cite
    Eliot Davidowitz; Patricia Lopez; Heidy Jimenez; Leslie Adrien; Peter Davies; James Moe (2023). Small molecule inhibitor of tau self-association in a mouse model of tauopathy: A preventive study in P301L tau JNPL3 mice [Dataset]. http://doi.org/10.5061/dryad.v9s4mw71q
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Dryad
    Authors
    Eliot Davidowitz; Patricia Lopez; Heidy Jimenez; Leslie Adrien; Peter Davies; James Moe
    Time period covered
    May 22, 2023
    Description

    The blinded study was independently performed by Peter Davies, Ph.D. and the datasets were provided to Oligomerix which unblinded the study groups.

  14. Data of the study: "Extending the limits of force endurance: Stimulation of the motor or the frontal cortex?"

    • data.mendeley.com
    Updated Jul 12, 2017
    Cite
    RƩmi Radel (2017). Data of the study: "Extending the limits of force endurance: Stimulation of the motor or the frontal cortex?" [Dataset]. http://doi.org/10.17632/dy89c6hg5b.1
    Explore at:
    Dataset updated
    Jul 12, 2017
    Authors
    RƩmi Radel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datafile is an excel file containing all the data of the study: "Extending the limits of force endurance: Stimulation of the motor or the frontal cortex?". These data were collected from January to March 31, 2017 at the UniversitƩ de Nice Sophia Antipolis by RƩmi Radel, Gauthier Denis and Gavin Tempest. The NIRS data were preprocessed using the Homer software. Each variable is described in the corresponding manuscript. The link to the manuscript will be added upon acceptance of the paper. Outliers have not been removed from the data in this version of the dataset.

  15. Bikeability Cycle Training: A Route to Increasing Young People’s Subjective Wellbeing? A Retrospective Cohort Study

    • data.mendeley.com
    Updated Jun 25, 2025
    Cite
    Dan Bishop (2025). Bikeability Cycle Training: A Route to Increasing Young People’s Subjective Wellbeing? A Retrospective Cohort Study [Dataset]. http://doi.org/10.17632/tp6msdmwm9.2
    Explore at:
    Dataset updated
    Jun 25, 2025
    Authors
    Dan Bishop
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files comprise anonymised raw data downloaded from the JISC survey platform, along with the associated code file, an Excel spreadsheet file that highlights multivariate outliers that were removed after initial screening for spurious/uncorroborated survey responses, and the final dataset (i.e., minus deleted cases) in SPSS .sav file format.

  16. LCK Spring 2024 Players Statistics

    • kaggle.com
    zip
    Updated Dec 1, 2024
    Cite
    Lukas Rozado (2024). LCK Spring 2024 Players Statistics [Dataset]. https://www.kaggle.com/datasets/lukasrozado/lck-spring-2024-players-statistics/code
    Explore at:
    zip(156203 bytes)Available download formats
    Dataset updated
    Dec 1, 2024
    Authors
    Lukas Rozado
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides an in-depth look at the League of Legends Champions Korea (LCK) Spring 2024 season. It includes detailed metrics for players, champions, and matches, meticulously cleaned and organized for easy analysis and modeling.

    Data Collection

    The data was collected using a combination of manual efforts and automated web scraping tools. Specifically:

    • Source: Data was gathered from Gol.gg, a well-known platform for League of Legends statistics.
    • Automation: Web scraping was performed using Python libraries like BeautifulSoup and Selenium to extract information on players, matches, and champions efficiently.
    • Focus: The scripts were designed to capture relevant performance metrics for each player and champion used during the Spring 2024 split.

    Data Cleaning and Processing

    The raw data obtained from web scraping required significant preprocessing to ensure its usability. The following steps were taken:

    Handling Raw Data:

    • Extracted key performance indicators like KDA, Win Rate, Games Played, and Match Durations from the source.
    • Normalized inconsistent formats for metrics such as win rates (e.g., removing %) and durations (e.g., converting MM:SS to total seconds).
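The two normalizations described above can be sketched as small Python helpers; the exact raw formats ("63%", "MM:SS") are assumed from the description:

```python
def pct_to_float(win_rate: str) -> float:
    """Convert a win-rate string like '63%' to a fraction."""
    return float(win_rate.rstrip("%")) / 100

def duration_to_seconds(duration: str) -> int:
    """Convert an 'MM:SS' match duration to total seconds."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)

print(pct_to_float("63%"))           # 0.63
print(duration_to_seconds("32:45"))  # 1965
```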

    Data Cleaning:

    • Removed duplicate rows and ensured no missing values.
    • Fixed inconsistencies in player and champion names to maintain uniformity.
    • Checked for outliers in numerical metrics (e.g., unrealistically high KDA values).

    Data Organization:

    Created three separate tables for better data management:

    • Player Statistics: General player performance metrics like KDA, win rates, and average kills.
    • Champion Statistics: Data on games played, win rates, and KDA for each champion.
    • Match List: Details of each match, including players, champions, and results.

    Added sequential Player IDs to connect the three datasets, facilitating relational analysis.

    Date Formatting: Converted all date fields to the DD/MM/YYYY format for consistency. Removed irrelevant time data to focus solely on match dates.

    Tools and Libraries Used

    The following tools were used throughout the project:

    • Python: Pandas and NumPy for data manipulation; BeautifulSoup and Selenium for web scraping; Matplotlib, Seaborn, and Plotly for visualization.
    • Excel: Consolidated final datasets into a structured Excel file with multiple sheets.
    • Data Validation: Used Python scripts to check for missing data, validate numerical columns, and ensure data consistency.
    • Kaggle Integration: Cleaned datasets and a comprehensive README file were prepared for direct upload to Kaggle.

    Applications

    This dataset is ready for use in:

    • Exploratory Data Analysis (EDA): Visualize player and champion performance trends across matches.
    • Machine Learning: Develop models to predict match outcomes based on player and champion statistics.
    • Sports Analytics: Gain insights into champion picks, win rates, and individual player strategies.

    Acknowledgments

    This dataset was made possible by the extensive statistics available on Gol.gg and the use of Python-based web scraping and data cleaning methodologies. It is shared under the CC BY 4.0 License to encourage reuse and collaboration.

  17. Additional file 12 of Patterns of extreme outlier gene expression suggest an...

    • springernature.figshare.com
    xlsx
    Updated Sep 10, 2025
    Cite
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz (2025). Additional file 12 of Patterns of extreme outlier gene expression suggest an edge of chaos effect in transcriptomic networks [Dataset]. http://doi.org/10.6084/m9.figshare.30091431.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 12: Table S12. Data for outlier genes occurring in modules in a semi-graphic depiction (Excel file with three tabs). Table S12A: Depiction of mouse outlier modules based on shared OO in at least three individuals for gene pairs and larger groups of genes. Table S12B: Depiction of human outlier modules based on shared OO in at least three individuals for gene pairs and larger groups of genes. Table S12C: Depiction of Drosophila outlier modules based on shared OO in at least three individuals for gene pairs and larger groups of genes.

  18. Additional file 8 of Patterns of extreme outlier gene expression suggest an...

    • springernature.figshare.com
    xlsx
    Updated Sep 10, 2025
    Cite
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz (2025). Additional file 8 of Patterns of extreme outlier gene expression suggest an edge of chaos effect in transcriptomic networks [Dataset]. http://doi.org/10.6084/m9.figshare.30091419.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 8: Table S8. Gene lists with transcriptome data (CPM) for Drosophila data (Excel file with twelve tabs). For Drosophila melanogaster (Dmel) there are two parts (head and body), for Drosophila simulans (Dsim) there are four populations, as indicated in the tabs. In each case, "all" includes data for all genes above the minimal expression cutoff value, and "OO" is the corresponding sublist of all genes with at least one over-outlier expression.

  19. COVID19-SelectedAfricanCountries

    • kaggle.com
    zip
    Updated Jun 30, 2022
    Cite
    Ojobo Agbo (2022). COVID19-SelectedAfricanCountries [Dataset]. https://www.kaggle.com/datasets/ojoboagbo/covid19selectedafricancountries
    Explore at:
    zip(323895 bytes)Available download formats
    Dataset updated
    Jun 30, 2022
    Authors
    Ojobo Agbo
    Description

    Data Set

    This dataset contains COVID-19 data for selected African countries, as sourced from one of the world's top repositories on COVID-19 (https://www.worldometers.info/coronavirus/#countries).

    The raw data contains COVID-19 cases, deaths, recoveries, population, etc., grouped into continents and countries.

    Motivation

    Over the last 3 years, the whole world has been ravaged by the COVID-19 pandemic. Over this period, some nations came to a halt, and economic activity reduced drastically in many cities. This was accompanied by hundreds of thousands of deaths across the world.

    Considering a continent as populous as Africa, we have had our own fair share of the effects of the COVID19 pandemic.

    This analysis project was motivated by my desire to examine and compare COVID-19 prevalence in some African countries between June 15th and June 27th, and to draw insights from this analysis.

    Data Cleaning

    Upon collection of this data from the data source, the data was cleaned using MS Excel to check for missing values, outliers, misspellings, duplicate records, etc.

    This cleaned data was further transformed using Power Query.

    Analysis

    I carried out this analysis in a bid to answer some pressing questions:

    1. Which were the 10 best-performing countries (based on the fewest COVID-19 cases)?
    2. Which were the 10 worst-performing countries (based on the most COVID-19 cases)?
    3. Carry out descriptive analysis for each of 1 and 2 above.
    4. Compare the expository analysis between 1 and 2 above.
    5. Create visualizations for 3 and 4 above.
    6. Perform a forecast of cases for each of the 10 best- and worst-performing countries.
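The first two questions reduce to ranking countries by case totals; a minimal pandas sketch, with made-up figures for illustration (the real data comes from worldometers):

```python
import pandas as pd

# Made-up case totals for illustration only; not real COVID-19 figures.
covid = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D", "Country E"],
    "TotalCases": [260000, 340000, 170000, 515000, 7600],
})

best = covid.nsmallest(2, "TotalCases")   # fewest cases; use 10 on the full data
worst = covid.nlargest(2, "TotalCases")   # most cases
print(best["Country"].tolist())
print(worst["Country"].tolist())
```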

    Visualization

    The analysis was done by visualization and creating insights using Microsoft PowerBI Desktop.

  20. Additional file 11 of Patterns of extreme outlier gene expression suggest an...

    • springernature.figshare.com
    xlsx
    Updated Sep 10, 2025
    Cite
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz (2025). Additional file 11 of Patterns of extreme outlier gene expression suggest an edge of chaos effect in transcriptomic networks [Dataset]. http://doi.org/10.6084/m9.figshare.30091428.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 11: Table S11. Pedigree and data for the mouse family analysis (Excel file with five tabs). Table S11A: pedigree scheme for the five families. Table S11B: data and analysis for brain. Table S11C: data and analysis for kidney. Table S11D: data and analysis for liver. Table S11E: subset of data and analysis for genes that follow Mendelian segregation ratios

Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1

Petre_Slide_CategoricalScatterplotFigShare.pptx

Explore at:
pptxAvailable download formats
Dataset updated
Sep 19, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ā€˜as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ā€˜allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ā€˜Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ā€˜Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ā€˜Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ā€˜File Format’, select .csv). This .csv file is the input file to import into R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ā€˜ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer -> enter ā€˜ggplot2’ in the Package Search field and click ā€˜Get List’. Select ā€˜ggplot2’ in the Package column and click ā€˜Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

7 Display the graph in a separate window. Dot colors indicate replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
