Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: - This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.
This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
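As a quick orientation, here is a minimal sketch showing how the same built-in datasets can be listed and loaded through Seaborn itself (Seaborn fetches them from its online data repository; the CSV files bundled in this collection can equally be read with pandas):

```python
import seaborn as sns

# List the names of all built-in Seaborn datasets.
print(sns.get_dataset_names())

# Load one of them as a pandas DataFrame and take a first look.
iris = sns.load_dataset("iris")
print(iris.head())
print(iris["species"].value_counts())
```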
This is a list of Median Annual Household Incomes for the Overburdened Communities page on the EGLE website.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.
The dataset includes 1,000 records across the following features:
| Column Name | Description |
|---|---|
| Date | The date of the sale (01-01-2023 onward). |
| Product_Category | Category of the product (e.g., Electronics, Sports, Other). |
| Price | Price of the product (numerical). |
| Discount | Discount applied to the product (numerical). |
| Customer_Segment | Buyer segment (e.g., Regular, Occasional, Other). |
| Marketing_Spend | Marketing budget allocated for sales (numerical). |
| Units_Sold | Number of units sold per transaction (numerical). |
Date: - Range: 01-01-2023 to 12-31-2023. - Contains 1,000 unique values without missing data.
Product_Category: - Categories: Electronics (21%), Sports (21%), Other (58%). - Most common category: Electronics (21%).
Price: - Range: From 244 to 999. - Mean: 505, Standard Deviation: 290. - Most common price range: 14.59 - 113.07.
Discount: - Range: From 0.01% to 49.92%. - Mean: 24.9%, Standard Deviation: 14.4%. - Most common discount range: 0.01 - 5.00%.
Customer_Segment: - Segments: Regular (35%), Occasional (34%), Other (31%). - Most common segment: Regular.
Marketing_Spend: - Range: From 2.41k to 10k. - Mean: 4.91k, Standard Deviation: 2.84k.
Units_Sold: - Range: From 5 to 57. - Mean: 29.6, Standard Deviation: 7.26. - Most common range: 24 - 34 units sold.
The dataset is suitable for creating the following visualizations:
1. Price Distribution: Histogram to show the spread of prices.
2. Discount Distribution: Histogram to analyze promotional offers.
3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
4. Customer Segment Distribution: Bar plot of customer segments.
5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
8. Correlation Heatmap: Identify relationships between features.
9. Pairplot: Visualize pairwise feature interactions.
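A minimal sketch of a few of these plots, using the same ecommerce_sales.csv file name as the modeling example further below:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('ecommerce_sales.csv')

# 1. Price distribution
df['Price'].plot.hist(bins=30, title='Price Distribution')
plt.show()

# 6. Discount vs Units Sold
df.plot.scatter(x='Discount', y='Units_Sold', title='Discount vs Units Sold')
plt.show()

# 8. Correlation heatmap of the numerical features
numeric_cols = ['Price', 'Discount', 'Marketing_Spend', 'Units_Sold']
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.show()
```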
The dataset is synthetically generated to mimic realistic e-commerce sales trends. Generation involved three steps: feature engineering, data simulation, and validation.
Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
Here’s an example of building a predictive model using Linear Regression:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')
# Feature selection
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data created include isotopes, half-lives, reactions, and threshold reaction energies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This database studies performance inconsistency among biomass higher heating value (HHV) models based on ultimate analysis. The research null hypothesis is that the rank of a biomass HHV model is consistent. Fifteen biomass models are trained and tested on four datasets; within each dataset, rank invariability across these 15 models indicates performance consistency.
The database includes the datasets and source code used to analyze the performance consistency of the biomass HHV models. The datasets are stored as tables in an Excel workbook. The source code implements the biomass HHV machine learning models using MATLAB object-oriented programming (OOP). These models comprise eight regression models, four supervised learning models, and three neural networks.
The Excel workbook "BiomassDataSetUltimate.xlsx" collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV records from 20 pieces of literature. The column names indicate the elements of the ultimate analysis on a % dry basis, and the HHV column gives the higher heating value in MJ/kg. The next worksheet, "Full Residuals," stores the residuals from model testing based on 20-fold cross-validation; the article (Kijkarncharoensin & Innet, 2021) verifies performance consistency through these residuals. The remaining worksheets contain the literature datasets used to train and test model performance.
The file "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folders in this archive reflect the class structure of the machine learning models. These classes extend MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models based on ultimate analysis. The script loads the datasets from the Excel workbook and fits the biomass models through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing each model's hyperparameters; the first run takes a few hours to train the machine learning models through this trial-and-error process. The trained models can be saved as a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by a section break, performs the residual analysis to inspect performance consistency. It also produces a 3D scatter plot of the biomass data and box plots of the prediction residuals. The interpretation of these results is given in the authors' article.
Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
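The rank-consistency idea can be illustrated with a small sketch. Note that this is written in Python for illustration only and is not part of the MATLAB code shipped in "SourceCodeUltimate.rar"; the error values are made up:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical RMSE of each of 15 models on each of 4 datasets (made-up values).
rng = np.random.default_rng(0)
rmse = rng.uniform(0.5, 2.0, size=(15, 4))

# Rank the models within each dataset (rank 1 = lowest RMSE = best model).
ranks = rmse.argsort(axis=0).argsort(axis=0) + 1

# Performance consistency: how similar are the model rankings across datasets?
for i in range(ranks.shape[1]):
    for j in range(i + 1, ranks.shape[1]):
        rho, _ = spearmanr(ranks[:, i], ranks[:, j])
        print(f"Spearman rank correlation, dataset {i + 1} vs {j + 1}: {rho:.2f}")
```

Under the null hypothesis of consistency, the rankings would be nearly identical across datasets and these correlations would be close to one.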
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Data Visualization
a. Scatter plot
i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between two variables (expression/copy_number/chronos) for any pair of genes.
ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.
iii. The visualization could be interactive - it would be great if the user can hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).
iv. Here is a quick reference for you: a scatter plot of the chronos score for the TTBK2 gene against the expression of the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.
b. Boxplot/violin plot
i. The user should be able to select a gene and a variable (expression / chronos / copy_number) and generate a boxplot to display its distribution across multiple categories as defined by a user-selected variable (a column from the metadata file).
ii. Here is an example for your reference, where a violin plot of the CHRONOS score for the gene CCL22 is plotted and grouped by ‘Lineage’ (see the sketch below).
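A rough Plotly Express sketch of the two requested plot types is given below. It assumes a wide-format table in which each selected gene/measure combination is a column and the metadata columns have already been merged in; the file and column names used here (merged_data.csv, TTBK2_chronos, MORC2_expression, CCL22_chronos, Sex, Lineage) are placeholders, not names taken from the assignment files:

```python
import pandas as pd
import plotly.express as px

# Hypothetical wide-format table: one row per cell line, selected gene/measure
# combinations as columns, metadata columns merged in.
df = pd.read_csv("merged_data.csv")  # placeholder file name

# a. Interactive scatter plot: chronos score of TTBK2 vs expression of MORC2,
#    colored by the Sex column from the metadata file.
fig = px.scatter(
    df,
    x="TTBK2_chronos",            # placeholder column name
    y="MORC2_expression",         # placeholder column name
    color="Sex",                  # metadata column used for coloring
    hover_data=list(df.columns),  # show all fields on hover
)
fig.show()

# b. Violin plot: distribution of the CHRONOS score of CCL22 grouped by Lineage.
fig = px.violin(df, x="Lineage", y="CCL22_chronos", box=True, points="all")
fig.show()
```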
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
results_DMDtoolkit. This folder includes all the results produced by DMDtoolkit, such as “prediction results of DMD patients.xlsx”, which contains the original prediction results for DMD patients from TREAT-NMD, Flanigan’s, and GHCPAPF, and “case7-1 (combination of multiple mutations).pdf”, a vector illustration of a combination of multiple mutations. “supplement1.docx” gives examples of basic statistics, including calculation of summary statistics, correlation coefficients, regression coefficients, and t tests. “supplement2.docx” gives examples of basic graphs, including a pedigree, histogram, scatter plot with trend line, stem-and-leaf plot, and cluster dendrogram. (RAR 5435 kb)
Figure 6 data. Includes: Panel A, data from multiple biological replicates in BMK-DKO cells expressing either mCerulean3-BCL-XL, mCerulean3-BCL-XL-ActA, or mCerulean3-BCL-XL-Cb5. The '.pzfx' file can be opened in GraphPad Prism. Alternatively, see "Fig6A_GetAverageForHeatmaps" for the raw data used to generate the heatmaps. The same data were also displayed in a scatter plot (included as well), but this was not included in the paper. SFigure 6 data: images and example data from a single replicate that were used to generate Figure S6. This figure was made to demonstrate the 3-channel colocalization method. Median or mode colocalization values shown in Figures 6, 7, and SFig 8A were determined as shown in SFig 6.
Additional file 1. Example of chromoMap interactive plot constructed using various features of chromoMap including polyploidy (used as multi-track), feature-associated data visualization (scatter and bar plots), chromosome heatmaps, data filters (color-coded scatter and bars). Differential gene expression in a cohort of patients positive for COVID19 and healthy individuals (NCBI Gene Expression Omnibus id: GSE162835) [12]. Each set of five tracks labeled with the same chromosome ID (e.g. 1-22, X & Y) contains the following information: From top to bottom: (1) number of differentially expressed genes (DEGs) (FDR < 0.05) (bars over the chromosome depictions) per genomic window (green boxes within the chromosome). Windows containing ≥ 5 DEGs are shown in yellow. (2) DEGs (FDR < 0.05) between healthy individuals and patients positive for COVID19 visualized as a scatterplot above the chromosome depiction (genes with logFC ≥ 2 or logFC ≤ −2 are highlighted in orange). Dots above the grey dashed line represent upregulated genes in COVID19 positive patients. Heatmap within chromosome depictions indicates the average LogFC value per window. (3–4) Normalized expression of differentially expressed genes (scatterplot) and of each genomic window containing DEG (green scale heatmap) in (3) patients with severe/critical outcomes and (4) asymptomatic/mild outcome patients. (5) logFC of DEGs between healthy individuals and patients positive for COVID19 visualized as scatter plot color-coded based on the metabolic pathway each DEG belongs to.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
UltraFeedback is a large-scale, fine-grained, and diverse preference dataset built to train reward and critic models for aligning language models. With thousands of prompts drawn from distinct sources such as UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, and more, UltraFeedback contains 256k samples suitable for a wide array of AI-driven projects. Correct and incorrect answers for each prompt are included in the same data file.
The first step is to understand the content of the dataset, including the source, models, correct answers, and incorrect answers. Knowing which language models (LMs) were used to generate the completions helps you interpret the data in this dataset.
Once you are familiar with the column titles and their meanings, it's time to begin exploring. To maximize your insight into this data set, use a variety of visualization techniques such as scatter plots or bar charts to view sample distributions across different LMs or answer types. Analyzing trends between incorrect and correct answers through data manipulation techniques such as merging sets can also provide valuable insights into preferences across different prompts and sources.
Finally, you may want to try running LR or other machine learning models on this dataset to create simple models for predicting preferences for real-world inputs related to specific tasks.
The possibilities for further exploration of this dataset are endless - now let’s get started!
- Training sentence completion models on the dataset to generate responses with high accuracy and diversity.
- Creating natural language understanding (NLU) tasks such as question-answering and sentiment analysis using the aligned dataset as training/testing sets.
- Developing strongly supervised learning algorithms that are able to use techniques like reward optimization with potential translation applications in developing machine translation systems from scratch or upstream text-generation tasks like summarization, dialog generation, etc
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| source | The source of the data. (String) |
| instruction | The instruction given to the language models. (String) |
| models | The language models used to generate the completions. (String) |
| correct_answers | The correct answers to the instruction. (String) |
| incorrect_answers | The incorrect answers to the instruction. (String) |
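A small pandas sketch of the exploratory steps described above, using the column names from the table (the specific plots and summaries are only examples of possible views):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Basic orientation: size, columns, and which language models appear.
print(df.shape)
print(df.columns.tolist())
print(df["models"].value_counts().head(10))

# Distribution of samples across sources, as a simple bar chart.
df["source"].value_counts().plot.bar(title="Samples per source")
plt.tight_layout()
plt.show()

# Compare lengths of correct vs incorrect answers as a rough first signal.
df["correct_len"] = df["correct_answers"].astype(str).str.len()
df["incorrect_len"] = df["incorrect_answers"].astype(str).str.len()
print(df[["correct_len", "incorrect_len"]].describe())
```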
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT. Several mathematical models have been developed for applications in the hydraulics of irrigation systems, and several performance indicators for these models are used and suggested in the literature. The objective of this work was to investigate the performance of statistical indicators for the evaluation of models in irrigation hydraulics. Three case studies representing typical irrigation hydraulics modeling were used to assess the indicators. The following indicators were analyzed: a) difference-based: mean absolute error, mean square error, root mean square error, scaled root mean square error, and percent mean absolute error; b) efficiency-based: Nash-Sutcliffe and Legates-McCabe; c) correlation coefficient (r); d) coefficient of determination (R2); e) index of agreement (d); f) Camargo and Sentelhas index (c); and g) graphical methods: regression error characteristic curve based on relative absolute error and 1:1 scatter plot. For the evaluated cases, which are physical phenomena, the difference-based indicators give similar measures and it is appropriate to report one or more of them. The assessment of models must also be supported by graphical analysis, which shows the real pattern of errors in the model evaluation process. Efficiency-based indicators, r, R2, c, and d are not recommended and should be avoided in the modeling of irrigation hydraulics.
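For readers who want to reproduce the difference-based indicators and the 1:1 scatter plot on their own model outputs, here is a minimal sketch; the observed and simulated values are made up, and the Nash-Sutcliffe efficiency follows its standard definition rather than code supplied with the article:

```python
import numpy as np
import matplotlib.pyplot as plt

def mean_absolute_error(obs, sim):
    return np.mean(np.abs(sim - obs))

def root_mean_square_error(obs, sim):
    return np.sqrt(np.mean((sim - obs) ** 2))

def nash_sutcliffe(obs, sim):
    # Standard Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations.
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# Made-up observed vs simulated values for illustration.
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sim = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

print(mean_absolute_error(obs, sim), root_mean_square_error(obs, sim), nash_sutcliffe(obs, sim))

# 1:1 scatter plot: points on the dashed line are perfect predictions.
plt.scatter(obs, sim)
lims = [min(obs.min(), sim.min()), max(obs.max(), sim.max())]
plt.plot(lims, lims, 'k--')
plt.xlabel('Observed')
plt.ylabel('Simulated')
plt.show()
```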
This figure shows illustrative examples of scatter plots for parameter values derived from fits to simulated noisy data (sampled from the distributions of protein data measurements; see Figure 3 for means and standard deviations of spatial expression profiles). Parameter values for Kr are shown in green (left column, A, D, G), for kni in red (centre column, B, E, H), and for gt in blue (right column, C, F, I). The parameters are the production rate, decay rate, diffusion rate, and production delay (see equation 1). Black triangles indicate the original parameter estimate obtained with unperturbed data. The dashed ellipse around parameter values for Kr (in D) indicates parameters selected for further analysis. The arrow in E indicates a striped interference pattern in the distribution of kni parameter values. See text for details.
Jupyter notebook files containing the Python script used for analyzing the interacting effects of water chemistry features on zinc anode passivation. Includes code to evaluate the Master Dataset with histograms, correlation matrices, scatter plots, Dunn's tests, and logistic regression models.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The Iris dataset is a classic dataset in the field of machine learning and statistics. It's often used for demonstrating various data analysis, machine learning, and statistical techniques. Here are some key details about it:
Background - Origin: The dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper titled "The use of multiple measurements in taxonomic problems." - Purpose: Fisher developed the dataset as an example of linear discriminant analysis.
Data Composition - Data Points: The dataset consists of 150 samples from three species of Iris flowers: Iris Setosa, Iris Versicolour, and Iris Virginica. - Features: There are four features measured in centimeters for each sample: 1. Sepal Length 2. Sepal Width 3. Petal Length 4. Petal Width - Classes: The dataset contains three classes, corresponding to the three species of Iris. Each class has 50 samples.
Usage - Classification: The Iris dataset is widely used for classification tasks, especially to illustrate the principles of supervised machine learning algorithms. - Testing Algorithms: It's often used to test out algorithms for linear regression, classification, and clustering due to its simplicity and small size. - Educational Purpose: Because of its clarity and simplicity, it's frequently used in teaching data science and machine learning.
Characteristics - Simple and Clean: The dataset is straightforward, with minimal preprocessing required, making it ideal for beginners. - Well-Behaved Classes: The species are relatively well separated, though there's some overlap between Versicolor and Virginica. - Multivariate Data: It involves understanding the relationship between multiple variables (the four features).
Applications - Benchmarking: The Iris dataset serves as a benchmark for evaluating the performance of different algorithms. - Visualization: It's great for practicing data visualization, especially for exploring techniques like scatter plots, box plots, and pair plots to understand feature relationships.
Despite its simplicity, the Iris dataset remains one of the most famous datasets in the world of data science and machine learning. It serves as an excellent starting point for anyone new to the field and remains a baseline for testing algorithms and teaching concepts.
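A short sketch of the kind of first look described above, loading Iris from scikit-learn (one common source of the dataset) and drawing the classic scatter and pair plots:

```python
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 150-sample Iris dataset into a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Petal length vs petal width, colored by species.
sns.scatterplot(data=df, x="petal length (cm)", y="petal width (cm)", hue="species")
plt.show()

# Pair plot of all four features to see pairwise relationships at once.
sns.pairplot(df.drop(columns="target"), hue="species")
plt.show()
```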
The data set comprises more than 7000 time series of ocean currents from moored instruments. The records contain horizontal current speed and direction and often concurrent temperature data. They may also contain vertical velocities, pressure and conductivity data. The majority of data originate from the continental shelf seas around the British Isles (for example, the North Sea, Irish Sea, Celtic Sea) and the North Atlantic. Measurements are also available for the South Atlantic, Indian, Arctic and Southern Oceans and the Mediterranean Sea. Data collection commenced in 1967 and is currently ongoing. Sampling intervals normally vary between 5 and 60 minutes. Current meter deployments are typically 2-8 weeks duration in shelf areas but up to 6-12 months in the open ocean. About 25 per cent of the data come from water depths of greater than 200m. The data are processed and stored by the British Oceanographic Data Centre (BODC) and a computerised inventory is available online. Data are quality controlled prior to loading to the databank. Data cycles are visually inspected by means of a sophisticated screening software package. Data from current meters on the same mooring or adjacent moorings can be overplotted and the data can also be displayed as time series or scatter plots. Series header information accompanying the data is checked and documentation compiled detailing data collection and processing methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file includes examples of interactive 2D and 3D scatter plots of cells from Nagano et al. (ZIP)
The data here were originally posted to facilitate timely and transparent peer review. The final public data release with formal metadata is now available at the following location:
Nauman, T.W., and Duniway, M.C., 2020, Predictive soil property maps with prediction uncertainty at 30 meter resolution for the Colorado River Basin above Lake Mead: U.S. Geological Survey data release, https://doi.org/10.5066/P9SK0DO2.
Associated publication:
Nauman, T. W., and Duniway, M. C., 2020, A hybrid approach for predictive soil property mapping using conventional soil survey data: Soil Science Society of America Journal, v. 84, no. 4, p. 1170-1194. https://doi.org/10.1002/saj2.20080.
UPDATE: We found a rendering error in many areas of the 5 cm map. We have recreated the map and included it in this version of the repository.
Repository includes maps of organic matter content (% wt) as defined by United States soil survey program.
These data are preliminary or provisional and are subject to revision. They are being provided to meet the need for timely best science. The data have not received final approval by the U.S. Geological Survey (USGS) and are provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the data.
These data should be used in combination with a soil depth or depth-to-restriction-layer map (both layers will be released soon as part of this project) to mask out areas mapped at depths deeper than the soil actually extends. This is a limitation of the current data that we hope to address in future updates.
The creation and interpretation of these data are documented in the following article. Please note that this article has not yet been peer reviewed; this citation will be updated as the review process proceeds.
Nauman, T. W., Duniway, M. C., In Preparation. Predictive reconstruction of soil survey property maps for field scale adaptive land management. Soil Science Society of America Journal.
File Name Details:
ACCURACY!! Please see the manuscript and GitHub repository (https://github.com/naumi421/SoilReconProps) for full details on accuracy. Cross-validation (CV) accuracy plots for the overall sample are provided in this repository (_CV_plots.tif). These plots compare CV predictions with observed values relative to a 1:1 line; values plotted near the 1:1 line are more accurate. Note that values are plotted as hex-bin density scatter plots because of the large number of observations (most are >3000). Predictions are also evaluated against soil organic carbon (SOC) data from the U.S. soil survey laboratory database. The SOC measurements were converted to OM values using the common 1.724 conversion factor, and the converted OM values are compared to predicted OM values in an accuracy plot (OM_SOC_plots.tif).
Elements are separated by underscore (_) in the following sequence:
property_r_depth_cm_geometry_model_additional_elements.extension
Example: om_r_0_cm_2D_QRF_bt.tif
Indicates soil organic matter content (om) at 0 cm depth using a 2D model (a separate model for each depth) employing a quantile regression forest (QRF). This file is the raster prediction map for this model. There may be additional GIS files associated with this file (e.g., pyramids) that have the same file name but different extensions. The _bt suffix indicates that the map has been back-transformed from the ln or sqrt transformation used in modeling.
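As an illustration of this naming scheme, here is a small, hypothetical parser that splits a file name into the elements listed above (the element labels come from the sequence shown, not from metadata shipped with the repository):

```python
import os

def parse_soil_map_name(filename):
    """Split a raster name such as 'om_r_0_cm_2D_QRF_bt.tif' into its elements."""
    stem, _ext = os.path.splitext(filename)
    parts = stem.split("_")
    return {
        "property": parts[0],   # e.g. 'om' = organic matter content
        "r": parts[1],          # the 'r' element of the naming sequence
        "depth_cm": parts[2],   # prediction depth in cm
        "geometry": parts[4],   # e.g. '2D' = separate model for each depth
        "model": parts[5],      # e.g. 'QRF' = quantile regression forest
        "extras": parts[6:],    # e.g. ['bt'] = back-transformed, or 95PI suffixes
    }

print(parse_soil_map_name("om_r_0_cm_2D_QRF_bt.tif"))
```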
The following elements may also exist on the end of filenames indicating other spatial files that characterize a given model's uncertainty (see below).
_95PI_h: Indicates the layer is the upper 95% prediction interval value.
_95PI_l: Indicates the layer is the lower 95% prediction interval value.
_95PI_relwidth: Indicates the layer is the 95% relative prediction interval (RPI). The RPI is a standardization of the prediction interval width that indicates whether the model is constraining uncertainty relative to the original sample. RPI values less than one indicate that the model reduces uncertainty relative to the original sample, and values less than 0.5 indicate low uncertainty in predictions. See the paper listed above and also Nauman and Duniway (In Revision) for more details on the RPI.
References
Nauman, T. W., and Duniway, M. C., In Revision, Relative prediction intervals reveal larger uncertainty in 3D approaches to predictive digital soil mapping of soil properties with legacy data: Geoderma
Hi Folks,
Let's understand the importance of Data Visualization.
Below, we have four different data sets, each consisting of paired x and y values.
![The four data sets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2F4f6c696e3ad5e2c887b01a0bdd14b355%2Fdata_set.png?generation=1685190700223447&alt=media)
Next, let's calculate some descriptive statistics such as the mean, standard deviation, and correlation for each data set.
![Descriptive statistics for the four data sets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2F14765ba12bdc18b8ff67cb6a9f2d7c7a%2Fstatistics.png?generation=1685192394142325&alt=media)
Examining these statistics shows that the four data sets have nearly identical simple descriptive statistics.
However, when we plot the datasets as scatter plots, we can see that the four datasets look very different.
![Scatter plots of the four data sets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2Fdbccf9dc638d3de28930b9f660e5f5a4%2Fgarph.png?generation=1685191588780934&alt=media)
Data 1 has a clear linear relationship; Data 2 has a curved, non-linear relationship; Data 3 has a tight linear relationship with one outlier; and Data 4 has nearly constant x values, with one extreme point that single-handedly produces the apparent linear fit.
Such datasets are known as Anscombe's quartet.
Anscombe's quartet is a classic example of the importance of data visualization.
Anscombe's quartet is a set of four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphically represented. Each dataset consists of eleven (x,y) points.
![Anscombe's quartet](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12425689%2F2b964d437afe17db949c57988b5fba05%2Fanscombes_quartet.png?generation=1685192626504792&alt=media)
Anscombe's quartet illustrates the importance of plotting data before we analyze it. Descriptive statistics can be misleading, and they can't tell us everything we need to know about a dataset. Plotting the data on charts can help us to understand the shape of the distribution and to identify any outliers.
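Anscombe's quartet ships with Seaborn, so the whole demonstration above can be reproduced in a few lines (a minimal sketch):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the four (x, y) data sets; columns are 'dataset', 'x', 'y'.
df = sns.load_dataset("anscombe")

# Nearly identical descriptive statistics for all four data sets...
print(df.groupby("dataset").agg(["mean", "std"]))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

# ...but very different shapes once plotted.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```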