21 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Resize the window as desired and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
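    The three-column layout from Step 1 can be checked, and the per-condition centre lines of the superimposed boxplots previewed, before running the R script. Below is a minimal Python sketch (stdlib only; the example values are hypothetical, in the format the protocol describes).

```python
import csv
import io
import statistics

def condition_summary(csv_text):
    """Parse the three-column file from Step 1 (Replicate, Condition, Value)
    and return the per-condition median, i.e. the centre line of each boxplot."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    by_condition = {}
    for row in rows:
        by_condition.setdefault(row["Condition"], []).append(float(row["Value"]))
    return {cond: statistics.median(vals) for cond, vals in by_condition.items()}

# Hypothetical example in the format of the protocol:
example = """Replicate,Condition,Value
2016-01,WT,10.0
2016-01,mutant A,7.5
2016-02,WT,12.0
2016-02,mutant A,8.5
"""
print(condition_summary(example))  # {'WT': 11.0, 'mutant A': 8.0}
```

    The R script itself then does the plotting; this check only confirms the input file parses as the protocol expects.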

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Superstore Sales Analysis

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis/versions/1
    Explore at:
    zip (3009057 bytes)
    Dataset updated
    Oct 21, 2023
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately.

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.
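    The calculations in steps 3-5 above reduce to a few formulas. Here is a rough Python sketch of those formulas (the unit-cost breakdown, function names, and example numbers are assumptions for illustration, not values from the Superstore workbook):

```python
def cogs(purchase_price, shipping_cost, extra_expenses, quantity):
    """Cost of Goods Sold for a line item: assumed per-unit cost breakdown times units sold."""
    return (purchase_price + shipping_cost + extra_expenses) * quantity

def discount_value(list_price, discount_rate, quantity):
    """Revenue given away through the discount."""
    return list_price * discount_rate * quantity

def profit_margin(revenue, total_cogs):
    """Profit margin as a fraction of revenue."""
    return (revenue - total_cogs) / revenue

# One hypothetical order line: 10 units listed at 25.00 with a 20% discount.
qty, list_price, rate = 10, 25.0, 0.20
revenue = list_price * (1 - rate) * qty             # 200.0
line_cogs = cogs(12.0, 1.5, 0.5, qty)               # 140.0
print(discount_value(list_price, rate, qty))        # 50.0
print(round(profit_margin(revenue, line_cogs), 2))  # 0.3
```

    In the actual project these formulas live in Excel cells and Power Query steps; the sketch only makes the arithmetic explicit.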

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.

  3. HelpSteer: AI Alignment Dataset

    • kaggle.com
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). HelpSteer: AI Alignment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/helpsteer-ai-alignment-dataset
    Explore at:
    zip (16614333 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HelpSteer: AI Alignment Dataset

    Real-World Helpfulness Annotated for AI Alignment

    By Huggingface Hub [source]

    About this dataset

    HelpSteer is an open-source dataset designed to support AI alignment through fair, team-oriented annotation. The dataset provides 37,120 samples, each containing a prompt and response along with five human-annotated attributes scored between 0 and 4, with higher scores indicating better quality. Combining machine learning and natural language processing methods with expert annotation, HelpSteer aims to provide a standardized measure of alignment between human and machine interactions. With responses rated for correctness, coherence, complexity, helpfulness and verbosity, HelpSteer helps organizations foster reliable AI models that deliver more accurate results and an improved user experience.


    How to use the dataset

    How to Use HelpSteer: An Open-Source AI Alignment Dataset

    HelpSteer is an open-source dataset designed to help researchers create models with AI Alignment. The dataset consists of 37,120 different samples each containing a prompt, a response and five human-annotated attributes used to measure these responses. This guide will give you a step-by-step introduction on how to leverage HelpSteer for your own projects.

    Step 1 - Choosing the Data File

    HelpSteer contains two data files, one for training and one for validation. To start exploring the dataset, first select the file you would like to use by downloading both train.csv and validation.csv from the Kaggle page linked above, or by getting them from the Google Drive repository attached here: [link]. The samples in each file consist of 7 columns describing a single response: prompt (given), response (submitted), helpfulness, correctness, coherence, complexity and verbosity, all with values between 0 and 4, where higher means better in the respective category.

    Step 2 - Exploratory Data Analysis (EDA)

    Once you have your file loaded into your workspace or favorite software environment (e.g. libraries like Pandas/NumPy, or even Microsoft Excel), it's time to explore it further by running some basic EDA commands that summarize each feature's distribution within the data set, and to note potential trends or points of interest throughout it. For example: which traits polarize responses the most? Are there outliers that might signal something interesting? Plotting these results often provides great insight into patterns across the dataset, which can be used later during the modeling phase, also known as “Feature Engineering”.

    Step 3 - Data Preprocessing

    Your interpretation of the raw data during EDA should produce hypotheses about which features matter most when estimating the attribute scores of unknown responses. Before any modelling effort, preprocess the data accordingly: clean up missing entries and handle outliers. If unsure about the allowed value ranges of specific attributes, refer back to the description section of the Kaggle page; knowing the correct numerical domains makes the modelling workload lighter when building predictive models. Do not rush this stage, otherwise poor-quality data may lead to disappointing accuracy after model deployment.
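    The EDA and domain checks described in Steps 2 and 3 can be sketched as follows (stdlib only; the inline sample rows are hypothetical, but the seven columns and the 0-4 attribute range come from the dataset description):

```python
import csv
import io
import statistics

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

def eda_summary(csv_text):
    """Mean of each annotated attribute, plus any rows whose scores
    fall outside the documented 0-4 domain."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    means = {a: statistics.mean(float(r[a]) for r in rows) for a in ATTRIBUTES}
    bad = [r for r in rows if any(not 0 <= float(r[a]) <= 4 for a in ATTRIBUTES)]
    return means, bad

# Hypothetical sample with the dataset's seven columns:
sample = """prompt,response,helpfulness,correctness,coherence,complexity,verbosity
How do I sort a list?,Use sorted().,4,4,4,1,1
Explain TCP.,TCP is a protocol.,2,3,4,1,0
"""
means, bad = eda_summary(sample)
print(means["helpfulness"], len(bad))  # 3.0 0
```

    In practice you would run this over train.csv or validation.csv and plot the distributions rather than print them.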

    Research Ideas

    • Designating and measuring conversational AI engagement goals: Researchers can utilize the HelpSteer dataset to design evaluation metrics for AI engagement systems.
    • Identifying conversational trends: By analyzing the annotations and data in HelpSteer, organizations can gain insights into what makes conversations more helpful, cohesive, complex or consistent across datasets or audiences.
    • Training Virtual Assistants: Train artificial intelligence algorithms on this dataset to develop virtual assistants that respond effectively to customer queries with helpful answers.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.


  4. Association analysis of high-low outlier road intersection crashes within...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975741.v1
    Explore at:
    xlsx
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,05 to 0,0278, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 0,99 MB
    Number of Files: The dataset contains a total of 10212 association rules
    Date Created: 23rd May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of the "high-low" outlier road intersection crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-low" outlier road intersection crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2021 (2020 data omitted)
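    The support metric used to filter these association rules is the fraction of crashes that share a given attribute combination. A minimal Python sketch of that computation (stdlib only; the crash records below are invented for illustration, not taken from the dataset):

```python
from collections import Counter
from itertools import combinations

def frequent_combinations(records, min_support):
    """Count attribute combinations across records and keep those whose support
    (share of all records containing the combination) meets the threshold."""
    counts = Counter()
    for record in records:
        for size in range(1, len(record) + 1):
            for combo in combinations(sorted(record), size):
                counts[combo] += 1
    n = len(records)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

# Hypothetical crash records, each a set of crash attributes:
crashes = [
    {"night", "wet road", "unsignalled"},
    {"night", "dry road", "unsignalled"},
    {"day", "wet road", "unsignalled"},
    {"night", "wet road", "unsignalled"},
]
freq = frequent_combinations(crashes, min_support=0.5)
print(freq[("night", "unsignalled")])  # 0.75
```

    A support threshold of 0,05 as in the dataset corresponds to min_support=0.05, i.e. combinations present in at least 5% of the outlier crashes.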

  5. FoodLAND - Survey and experimental dataset from marketing tests in Tanzania,...

    • zenodo.org
    Updated Feb 26, 2025
    Cite
    Jesper Clement; Jesper Clement (2025). FoodLAND - Survey and experimental dataset from marketing tests in Tanzania, Uganda, and Kenya [Dataset]. http://doi.org/10.5281/zenodo.14929611
    Explore at:
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jesper Clement; Jesper Clement
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tanzania, Kenya, Uganda
    Description

    The dataset compiles information collected under T5.8 (Marketing tests and strategies) of the project FoodLAND. It contains biometric (emotional response + visual attention) and behavioral data from 600+ urban consumers in Tanzania, Uganda, and Kenya evaluating new locally produced food products. The Excel file contains the raw data, cleaned for outliers, and the metadata.

  6. Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series...

    • osti.gov
    • dataone.org
    • +1more
    Updated Dec 31, 2020
    Cite
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (2020). Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series Data for Billy Barr, East River, Colorado USA [Dataset]. http://doi.org/10.15485/1823516
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    Area covered
    East River, Colorado, United States
    Description

    A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1—Preliminary raw data sets exploration, including time formatting and combining datasets of different lengths and different time intervals; Phase 2—QA of the datasets, including detecting and flagging of duplicates, outliers, and extreme values; and Phase 3—the development of time series of a desired frequency, imputation of missing values, visualization and a final statistical summary. The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

    The files in this data package include one Excel file, converted to CSV format (Billy_Barr_raw_qaqc.csv), that contains the raw meteorological data, i.e., the input data used for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the QA/QC'd and flagged meteorological data, i.e., the output data from the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is a script written in R that implements the QA/QC and flagging process. The purpose of the CSV data files included in this package is to provide the input and output files for the R script.
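    Phase 2 of the framework (flagging duplicates, outliers, and extreme values) can be illustrated with a small Python sketch. The z-score rule, the threshold, and the sample series below are assumptions for illustration, not taken from the R script:

```python
import statistics

def flag_series(timestamps, values, z_threshold=3.0):
    """Return per-observation QA flags: 'duplicate' for repeated timestamps,
    'outlier' for values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    seen = set()
    flags = []
    for t, v in zip(timestamps, values):
        flag = "ok"
        if t in seen:
            flag = "duplicate"
        elif stdev > 0 and abs(v - mean) / stdev > z_threshold:
            flag = "outlier"
        seen.add(t)
        flags.append(flag)
    return flags

# Hypothetical hourly temperature readings with one duplicate and one spike:
times = ["00:00", "01:00", "01:00", "02:00", "03:00"]
temps = [1.2, 1.4, 1.4, 1.3, 25.0]
print(flag_series(times, temps, z_threshold=1.5))
# ['ok', 'ok', 'duplicate', 'ok', 'outlier']
```

    The actual R script flags the data rather than dropping it, so downstream users can decide how to treat each flagged value.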

  7. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
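    The core computation behind ESM_5/ESM_6 (the Mahalanobis distance of each firm from the multivariate centre of the ratio data) can be sketched with NumPy. The toy matrix below stands in for the firm-by-ratio Z-score table; it is an assumption for illustration, not data from the study:

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the column means,
    scaled by the inverse of the sample covariance matrix."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # D_i = sqrt( (x_i - mu)^T S^-1 (x_i - mu) )
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Toy stand-in for the Z-score matrix (4 firms, 2 ratios):
X = np.array([[1.0, 0.0],
              [-1.0, 0.0],
              [0.0, 1.0],
              [0.0, -1.0]])
print(mahalanobis_distances(X))  # all four points are equally far from the centre
```

    Firms with unusually large distances are the multivariate outliers the manuscript flags; the clustering scripts (ESM_9 to ESM_12) then work on the firm-level mean Z-scores.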

  8. Data for: Non-target and suspected-target screening for potentially...

    • dataverse.harvard.edu
    • data.mendeley.com
    Updated Sep 17, 2019
    Cite
    Janis Rusko (2019). Data for: Non-target and suspected-target screening for potentially hazardous chemicals in food contact materials: investigation of paper straws [Dataset]. http://doi.org/10.7910/DVN/MNY13S
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Janis Rusko
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Briefly, the data contains:

    • Average and individual spectra of the compounds identified in the study
    • The compiled suspect candidate database, used for non-target screening
    • Excel workbook for the analysis of retention time outliers
    • The R script implemented for the mutagenicity and carcinogenicity analysis via the battery of (Q)SAR tools
    • The output of the R script as an Excel workbook

  9. zomato order data

    • kaggle.com
    Updated Jul 14, 2025
    Cite
    NayakGanesh007 (2025). zomato order data [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/zomato-order-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    NayakGanesh007
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Zomato Food Orders – Data Analysis Project 📌 Description: This dataset contains food order data from Zomato, one of India’s leading food delivery platforms. It includes information on customer orders, order status, restaurants, delivery times, and more. The goal of this project is to explore and analyze key insights around customer behavior, delivery patterns, restaurant performance, and order trends.

    šŸ” Project Objectives: šŸ“Š Perform Exploratory Data Analysis (EDA)

    📦 Analyze most frequently ordered cuisines and items

    ā±ļø Understand average delivery times and delays

    🧾 Identify top restaurants and order volumes

    📈 Uncover order trends by time (hour/day/week)

    💬 Visualize data using Matplotlib & Seaborn

    🧹 Clean and preprocess data (missing values, outliers, etc.)

    šŸ“ Dataset Features (Example Columns): Column Name Description Order ID - Unique ID for each order Customer ID - Unique customer identifier Restaurant - Name of the restaurant Cuisine - Type of cuisine ordered Order Time - Timestamp when the order was placed Delivery Time - Timestamp when the order was delivered Order Status - Status of the order (Delivered, Cancelled) Payment Method - Mode of payment (Cash, Card, UPI, etc.) Order Amount - Total price of the order

    🛠 Tools & Libraries Used:

    Python

    Pandas, NumPy for data manipulation

    Matplotlib, Seaborn for visualization

    Excel (for raw dataset preview and checks)

    ✅ Outcomes:

    Customer ordering trends by cuisine and location

    Time-of-day and day-of-week analysis for peak delivery times

    Delivery efficiency evaluation

    Business recommendations for improving customer experience

  10. Association analysis of high-low outlier unsignalled road intersection...

    • zivahub.uct.ac.za
    xlsx
    Updated Jun 7, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier unsignalled road intersection crashes within the CoCT in 2017, 2018 and 2019 [Dataset]. http://doi.org/10.25375/uct.25982002.v1
    Explore at:
    xlsx
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset provides comprehensive information on unsignalled road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 10% of the total "high-low" outlier unsignalled road intersection crashes for the years 2017, 2018 and 2019. The dataset is meticulously organised according to support metric values, ranging from 0,10 to 0,223, with entries presented in descending order.

    Data Specifics
    Data Type: Geospatial-temporal categorical data
    File Format: Excel document (.xlsx)
    Size: 57,4 KB
    Number of Files: The dataset contains a total of 1050 association rules
    Date Created: 24th May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Python
    Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the unsignalled road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,10 support metric value. Consequently, commonly occurring crash attributes among at least 10% of the "high-low" outlier unsignalled road intersection crashes were extracted for inclusion in this dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2019

  11. Cardiovascular diseases dataset

    • kaggle.com
    zip
    Updated Mar 14, 2020
    Cite
    David (2020). Cardiovascular diseases dataset [Dataset]. https://www.kaggle.com/aiaiaidavid/cardio-data-dv13032020
    Explore at:
    zip (458315 bytes)
    Dataset updated
    Mar 14, 2020
    Authors
    David
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description of the data set

    This data set is a cleaned up copy of cardio_train.csv which can be found at:

    https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

    The original data set has been analyzed with Excel, correcting negative values, and removing outliers.

    A number of features in the dataset are used to predict the presence or absence of a cardiovascular disease.

    Below is a description of the features:

    AGE: integer (years of age)
    HEIGHT: integer (cm) 
    WEIGHT: integer (kg)
    GENDER: categorical (1: female, 2: male)
    AP_HIGH: systolic blood pressure, integer
    AP_LOW: diastolic blood pressure, integer 
    CHOLESTEROL: categorical (1: normal, 2: above normal, 3: well above normal)
    GLUCOSE: categorical (1: normal, 2: above normal, 3: well above normal)
    SMOKE: categorical (0: no, 1: yes)
    ALCOHOL: categorical (0: no, 1: yes)
    PHYSICAL_ACTIVITY: categorical (0: no, 1: yes)
    

    And the target variable:

    CARDIO_DISEASE: categorical (0: no, 1: yes)
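    The categorical codes listed above make the dataset easy to validate after loading. A minimal Python sketch of such a check (the sample record is hypothetical; the allowed code sets come from the feature description):

```python
# Allowed codes per the feature description above.
DOMAINS = {
    "GENDER": {1, 2},
    "CHOLESTEROL": {1, 2, 3},
    "GLUCOSE": {1, 2, 3},
    "SMOKE": {0, 1},
    "ALCOHOL": {0, 1},
    "PHYSICAL_ACTIVITY": {0, 1},
    "CARDIO_DISEASE": {0, 1},
}

def invalid_fields(row):
    """Names of categorical fields whose value is outside its documented domain."""
    return [f for f, allowed in DOMAINS.items() if row.get(f) not in allowed]

# Hypothetical record:
row = {"AGE": 54, "HEIGHT": 170, "WEIGHT": 80, "GENDER": 2, "AP_HIGH": 120,
       "AP_LOW": 80, "CHOLESTEROL": 1, "GLUCOSE": 3, "SMOKE": 0, "ALCOHOL": 0,
       "PHYSICAL_ACTIVITY": 1, "CARDIO_DISEASE": 0}
print(invalid_fields(row))                   # []
print(invalid_fields({**row, "GENDER": 3}))  # ['GENDER']
```

    Since the negative values and outliers were already corrected in Excel, a check like this mainly guards against errors introduced when re-importing or transforming the file.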
    
  12. Bestseller Book Data

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    oyebusola (2024). Bestseller Book Data [Dataset]. https://www.kaggle.com/datasets/oyecrafts/bestseller-book-data
    Explore at:
    zip (420400 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    oyebusola
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides valuable insights into the ratings distribution of bestselling books across different categories. With a meticulous categorization of bestsellers based on their user ratings, this dataset offers a comprehensive overview of the popularity and reception of top-selling books. Whether you're interested in exploring highly-rated bestsellers, very highly-rated bestsellers, or moderately rated bestsellers, this dataset empowers you to analyze trends and patterns in the literary world. Leveraging this dataset opens up opportunities for market research, trend analysis, and strategic decision-making for publishers, authors, and book enthusiasts alike.

    What questions were asked

    • What is the distribution of bestseller ratings among the top-selling books?
    • How many books fall into each category of bestseller ratings (e.g., very highly rated, highly rated, moderately rated)?
    • Which genres tend to have the highest-rated bestsellers?
    • Are there any trends or patterns in the ratings of bestsellers over time?
    • What are the characteristics of highly-rated bestsellers compared to moderately-rated ones?
    • How do the prices of bestsellers correlate with their ratings?
    • Can we identify any outliers or anomalies in the dataset that may require further investigation?
    • Are there any authors who consistently produce highly-rated bestsellers?
    • How does the number of reviews correlate with the user ratings of bestsellers?
    • What insights can be gained from comparing the ratings breakdowns across different years or time periods?

    What were the tasks completed?

    1. Data Cleaning and Manipulation in Excel: Conducted data cleaning and manipulation tasks such as removing duplicates, handling missing values, and formatting data for analysis in Excel.

    2. Data Collection from Kaggle: Gathered the initial dataset containing information about bestselling books from Kaggle, a popular platform for datasets.

    3. Visualization in Tableau: Created interactive visualizations of the dataset using Tableau, a powerful data visualization tool, to explore and analyze bestseller ratings breakdowns.

    4. Reporting on Google Docs: Generated reports and summaries of the findings using Google Docs, a collaborative document editing platform, to communicate insights effectively.
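The Excel cleaning step above (removing duplicates, handling missing values) has a direct pandas equivalent; a minimal sketch with illustrative rows and column names:

```python
import pandas as pd

# Illustrative raw rows with one duplicate and one missing value; the
# real column names in the Kaggle file may differ.
raw = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B"],
    "User Rating": [4.8, 4.8, None],
})

clean = (
    raw.drop_duplicates()               # remove duplicate rows
       .dropna(subset=["User Rating"])  # drop rows missing a rating
       .reset_index(drop=True)
)
print(clean)
```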

  13. Data from: Small molecule inhibitor of tau self-association in a mouse model of tauopathy: A preventive study in P301L tau JNPL3 mice

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 7, 2023
    Cite
    Eliot Davidowitz; Patricia Lopez; Heidy Jimenez; Leslie Adrien; Peter Davies; James Moe (2023). Small molecule inhibitor of tau self-association in a mouse model of tauopathy: A preventive study in P301L tau JNPL3 mice [Dataset]. http://doi.org/10.5061/dryad.v9s4mw71q
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Dryad
    Authors
    Eliot Davidowitz; Patricia Lopez; Heidy Jimenez; Leslie Adrien; Peter Davies; James Moe
    Time period covered
    May 22, 2023
    Description

    The blinded study was independently performed by Peter Davies, Ph.D. and the datasets were provided to Oligomerix which unblinded the study groups.

  14. Data of the study: "Extending the limits of force endurance: Stimulation of the motor or the frontal cortex?"

    • data.mendeley.com
    Updated Jul 12, 2017
    Cite
    RƩmi Radel (2017). Data of the study: "Extending the limits of force endurance: Stimulation of the motor or the frontal cortex?" [Dataset]. http://doi.org/10.17632/dy89c6hg5b.1
    Explore at:
    Dataset updated
    Jul 12, 2017
    Authors
    RƩmi Radel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datafile is an excel file containing all the data of the study: "Extending the limits of force endurance: Stimulation of the motor or the frontal cortex?". These data were collected from January to March 31, 2017 at the UniversitƩ de Nice Sophia Antipolis by RƩmi Radel, Gauthier Denis and Gavin Tempest. The NIRS data were preprocessed using the Homer software. Each variable is described in the corresponding manuscript. The link to the manuscript will be added upon acceptance of the paper. Outliers have not been removed from the data in this version of the dataset.

  15. Bikeability Cycle Training: A Route to Increasing Young People’s Subjective Wellbeing? A Retrospective Cohort Study

    • data.mendeley.com
    Updated Jun 25, 2025
    Cite
    Dan Bishop (2025). Bikeability Cycle Training: A Route to Increasing Young People’s Subjective Wellbeing? A Retrospective Cohort Study [Dataset]. http://doi.org/10.17632/tp6msdmwm9.2
    Explore at:
    Dataset updated
    Jun 25, 2025
    Authors
    Dan Bishop
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files comprise anonymised raw data downloaded from the JISC survey platform, along with the associated code file, an Excel spreadsheet file that highlights multivariate outliers that were removed after initial screening for spurious/uncorroborated survey responses, and the final dataset (i.e., minus deleted cases) in SPSS .sav file format.

  16. LCK Spring 2024 Players Statistics

    • kaggle.com
    zip
    Updated Dec 1, 2024
    Cite
    Lukas Rozado (2024). LCK Spring 2024 Players Statistics [Dataset]. https://www.kaggle.com/datasets/lukasrozado/lck-spring-2024-players-statistics/code
    Explore at:
    zip(156203 bytes)Available download formats
    Dataset updated
    Dec 1, 2024
    Authors
    Lukas Rozado
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides an in-depth look at the League of Legends Champions Korea (LCK) Spring 2024 season. It includes detailed metrics for players, champions, and matches, meticulously cleaned and organized for easy analysis and modeling.

    Data Collection

    The data was collected using a combination of manual efforts and automated web scraping tools. Specifically:

    • Source: Data was gathered from Gol.gg, a well-known platform for League of Legends statistics.
    • Automation: Web scraping was performed using Python libraries like BeautifulSoup and Selenium to extract information on players, matches, and champions efficiently.
    • Focus: The scripts were designed to capture relevant performance metrics for each player and champion used during the Spring 2024 split.

    Data Cleaning and Processing

    The raw data obtained from web scraping required significant preprocessing to ensure its usability. The following steps were taken:

    Handling Raw Data:

    • Extracted key performance indicators like KDA, Win Rate, Games Played, and Match Durations from the source.
    • Normalized inconsistent formats for metrics such as win rates (e.g., removing %) and durations (e.g., converting MM:SS to total seconds).
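The two normalizations described above can be sketched as small Python helpers; the exact raw formats ("63%", "MM:SS") are assumed from the description:

```python
def pct_to_float(win_rate: str) -> float:
    """Convert a win-rate string like '63%' to a fraction."""
    return float(win_rate.rstrip("%")) / 100

def duration_to_seconds(duration: str) -> int:
    """Convert an 'MM:SS' match duration to total seconds."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)

print(pct_to_float("63%"))           # 0.63
print(duration_to_seconds("32:45"))  # 1965
```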

    Data Cleaning:

    • Removed duplicate rows and ensured no missing values.
    • Fixed inconsistencies in player and champion names to maintain uniformity.
    • Checked for outliers in numerical metrics (e.g., unrealistically high KDA values).

    Data Organization:

    Created three separate tables for better data management:

    • Player Statistics: General player performance metrics like KDA, win rates, and average kills.
    • Champion Statistics: Data on games played, win rates, and KDA for each champion.
    • Match List: Details of each match, including players, champions, and results.

    Added sequential Player IDs to connect the three datasets, facilitating relational analysis.

    Date Formatting: Converted all date fields to the DD/MM/YYYY format for consistency. Removed irrelevant time data to focus solely on match dates.

    Tools and Libraries Used

    The following tools were used throughout the project:

    • Python: Pandas and NumPy for data manipulation; BeautifulSoup and Selenium for web scraping; Matplotlib, Seaborn, and Plotly for visualization.
    • Excel: Consolidated final datasets into a structured Excel file with multiple sheets.
    • Data Validation: Used Python scripts to check for missing data, validate numerical columns, and ensure data consistency.
    • Kaggle Integration: Cleaned datasets and a comprehensive README file were prepared for direct upload to Kaggle.

    Applications

    This dataset is ready for use in:

    • Exploratory Data Analysis (EDA): Visualize player and champion performance trends across matches.
    • Machine Learning: Develop models to predict match outcomes based on player and champion statistics.
    • Sports Analytics: Gain insights into champion picks, win rates, and individual player strategies.

    Acknowledgments

    This dataset was made possible by the extensive statistics available on Gol.gg and the use of Python-based web scraping and data cleaning methodologies. It is shared under the CC BY 4.0 License to encourage reuse and collaboration.

  17. Additional file 12 of Patterns of extreme outlier gene expression suggest an...

    • springernature.figshare.com
    xlsx
    Updated Sep 10, 2025
    Cite
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz (2025). Additional file 12 of Patterns of extreme outlier gene expression suggest an edge of chaos effect in transcriptomic networks [Dataset]. http://doi.org/10.6084/m9.figshare.30091431.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 12: Table S12. Data for outlier genes occurring in modules in a semi-graphic depiction (Excel file with three tabs). Table S12A: Depiction of mouse outlier modules based on shared OO in at least three individuals for gene pairs and larger groups of genes. Table S12B: Depiction of human outlier modules based on shared OO in at least three individuals for gene pairs and larger groups of genes. Table S12C: Depiction of Drosophila outlier modules based on shared OO in at least three individuals for gene pairs and larger groups of genes.

  18. Additional file 8 of Patterns of extreme outlier gene expression suggest an...

    • springernature.figshare.com
    xlsx
    Updated Sep 10, 2025
    Cite
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz (2025). Additional file 8 of Patterns of extreme outlier gene expression suggest an edge of chaos effect in transcriptomic networks [Dataset]. http://doi.org/10.6084/m9.figshare.30091419.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 8: Table S8. Gene lists with transcriptome data (CPM) for Drosophila data (Excel file with twelve tabs). For Drosophila melanogaster (Dmel) there are two parts (head and body), for Drosophila simulans (Dsim) there are four populations, as indicated in the tabs. In each case, "all" includes data for all genes above the minimal expression cutoff value, and "OO" is the corresponding sublist of all genes with at least one over-outlier expression.

  19. COVID19-SelectedAfricanCountries

    • kaggle.com
    zip
    Updated Jun 30, 2022
    Cite
    Ojobo Agbo (2022). COVID19-SelectedAfricanCountries [Dataset]. https://www.kaggle.com/datasets/ojoboagbo/covid19selectedafricancountries
    Explore at:
    zip(323895 bytes)Available download formats
    Dataset updated
    Jun 30, 2022
    Authors
    Ojobo Agbo
    Description

    Data Set

    This dataset contains COVID-19 data for selected African countries, as sourced from one of the world's top repositories on COVID-19 (https://www.worldometers.info/coronavirus/#countries).

    The raw data contains COVID-19 cases, deaths, recoveries, population, etc., grouped into continents and countries.

    Motivation

    Over the last 3 years, the whole world has been ravaged by the COVID-19 pandemic. Over this period, some nations came to a halt, and economic activity reduced drastically in many cities. This was accompanied by hundreds of thousands of deaths across the world.

    Considering a continent as populous as Africa, we have had our own fair share of the effects of the COVID19 pandemic.

    This analysis project was motivated by my desire to examine and compare COVID-19 prevalence in some African countries between June 15th and June 27th, and to draw insights from this analysis.

    Data Cleaning

    Upon collection of this data from the data source, the data was cleaned using MS Excel to check for missing values, outliers, misspellings, duplicate records, etc.

    This cleaned data was further transformed using Power Query.

    Analysis

    I carried out this analysis in a bid to answer some pressing questions:

    1. Which were the 10 best-performing countries (based on the fewest COVID-19 cases)?
    2. Which were the 10 worst-performing countries (based on the most COVID-19 cases)?
    3. Carry out descriptive analysis for each of 1 and 2 above.
    4. Compare the expository analysis between 1 and 2 above.
    5. Create visualizations for 3 and 4 above.
    6. Perform a forecast of cases for each of the 10 best- and worst-performing countries.
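The first two questions reduce to ranking countries by case totals; a minimal pandas sketch, with made-up figures for illustration (the real data comes from worldometers):

```python
import pandas as pd

# Made-up case totals for illustration only; not real COVID-19 figures.
covid = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D", "Country E"],
    "TotalCases": [260000, 340000, 170000, 515000, 7600],
})

best = covid.nsmallest(2, "TotalCases")   # fewest cases; use 10 on the full data
worst = covid.nlargest(2, "TotalCases")   # most cases
print(best["Country"].tolist())
print(worst["Country"].tolist())
```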

    Visualization

    The analysis was done by visualization and creating insights using Microsoft PowerBI Desktop.

  20. Additional file 11 of Patterns of extreme outlier gene expression suggest an...

    • springernature.figshare.com
    xlsx
    Updated Sep 10, 2025
    Cite
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz (2025). Additional file 11 of Patterns of extreme outlier gene expression suggest an edge of chaos effect in transcriptomic networks [Dataset]. http://doi.org/10.6084/m9.figshare.30091428.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Chen Xie; Sven Künzel; Wenyu Zhang; Cassandra A. Hathaway; Shelley S. Tworoger; Diethard Tautz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 11: Table S11. Pedigree and data for the mouse family analysis (Excel file with five tabs). Table S11A: pedigree scheme for the five families. Table S11B: data and analysis for brain. Table S11C: data and analysis for kidney. Table S11D: data and analysis for liver. Table S11E: subset of data and analysis for genes that follow Mendelian segregation ratios

Cite
Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1

Petre_Slide_CategoricalScatterplotFigShare.pptx

Explore at:
pptxAvailable download formats
Dataset updated
Sep 19, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Benj Petre; Aurore Coince; Sophien Kamoun
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Categorical scatterplots with R for biologists: a step-by-step guide

Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

Weissgerber and colleagues (2015) recently stated that ā€˜as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ā€˜allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

Protocol

• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ā€˜Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ā€˜Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ā€˜Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ā€˜File Format’, select .csv). This .csv file is the input file to import into R.

• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

Notes

• Note 1: install the ggplot2 package. The R script requires the package ā€˜ggplot2’ to be installed. To install it, go to Packages & Data -> Package Installer -> enter ā€˜ggplot2’ in the Package Search field and click ā€˜Get List’. Select ā€˜ggplot2’ in the Package column and click ā€˜Install Selected’. Install all dependencies as well.

• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

7 Display the graph in a separate window. Dot colors indicate replicates

graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

References

Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

https://cran.r-project.org/

http://ggplot2.org/
