License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data visualization is important for statistical analysis, as it helps convey information efficiently and sheds light on the hidden patterns behind data in a visual context. It is particularly helpful to display circular data in a two-dimensional space to accommodate its nonlinear support space and reveal the underlying circular structure, which is otherwise not obvious in one dimension. In this article, we first formally categorize circular plots into two types, either height- or area-proportional, and then describe a new general methodology that can be used to produce circular plots, particularly in the area-proportional manner, which in our opinion is the more appropriate choice. Formulas are given that are fairly simple yet effective for producing various circular plots, such as smooth density curves, histograms, rose diagrams, dot plots, and plots for multiclass data. Supplemental materials for this article are available online.
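The area-proportional choice the abstract advocates boils down to a square-root radius scaling: a rose-diagram sector of angular width w and radius r has area ½r²w, so making area (rather than height) track frequency means r must grow with the square root of the count. A short sketch of that idea (illustrative only, not the article's own formulas):

```python
import numpy as np

def rose_radii(counts, sector_width):
    """Sector radii for an area-proportional rose diagram.

    Sector area = 0.5 * r**2 * sector_width, so for area
    proportional to count, r must scale with sqrt(count).
    A height-proportional plot would instead use r ~ count,
    visually exaggerating large counts.
    """
    counts = np.asarray(counts, dtype=float)
    return np.sqrt(2.0 * counts / sector_width)

# 8 sectors of width 2*pi/8: doubling a count multiplies the
# radius by sqrt(2), keeping each sector *area* equal to its count.
counts = np.array([4, 8, 16, 8, 4, 2, 1, 2])
width = 2 * np.pi / 8
radii = rose_radii(counts, width)
```

With this scaling, `0.5 * radii**2 * width` reproduces `counts` exactly, which is the defining property of an area-proportional plot.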
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Abstract: The dataset contains the data underlying the plots and histograms in the article "Collective photon emission patterns from two atoms in free space". The data show the photon statistics of a trapped two-ion crystal observed in the far field. Details of the measurement process and experimental setup can be found in the journal publication. The data for each individual plot are presented as a table on a separate sheet of an .xlsx spreadsheet.
License: MIT License, https://opensource.org/licenses/MIT
LLM Distribution Evaluation Dataset
This dataset contains 50,000 synthetic graphs with questions and answers about statistical distributions, designed to evaluate large language models' ability to analyze data visualizations.
Dataset Description
Dataset Summary
This dataset contains diverse statistical visualizations (bar charts, line plots, scatter plots, histograms, area charts, and step plots) with associated questions about:
Normality testing Distribution… See the full description on the dataset page: https://huggingface.co/datasets/robvanvolt/llm-distribution.
The Excel spreadsheet (with comma-separated value (CSV) files of the same names) contains the three tables and the raw data used to plot the histograms of Fig. 8, Fig. 11(a), and Fig. 11(b); here, the range is from 0 to 100 μm. Each tab corresponds to a figure number in the paper. The Feret diameter was obtained using image-analysis software (Fiji 2.3.0; Schindelin et al., 2012). The histogram bin width was calculated using Scott's formula (Scott, 1979).
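Scott's (1979) normal-reference rule sets the bin width from the sample standard deviation and sample size as h = 3.49 · s · n^(−1/3). A short sketch with illustrative values (not the paper's data):

```python
import numpy as np

def scott_bin_width(data):
    """Histogram bin width by Scott's (1979) normal-reference rule:
    h = 3.49 * s * n**(-1/3), where s is the sample standard
    deviation and n the sample size."""
    data = np.asarray(data, dtype=float)
    return 3.49 * data.std(ddof=1) * len(data) ** (-1.0 / 3.0)

# Made-up Feret diameters in micrometres on a 0-100 um range
rng = np.random.default_rng(1)
diameters = rng.normal(50.0, 15.0, size=500)

h = scott_bin_width(diameters)
n_bins = int(np.ceil((diameters.max() - diameters.min()) / h))
```

NumPy ships essentially the same estimator as `np.histogram_bin_edges(data, bins="scott")`.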
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.
Unlike most public datasets, this one includes a diverse mix of column types:
📅 Date columns (for time series and trend plots)
🔢 Numerical columns (for histograms, boxplots, scatter plots)
🏷️ Categorical columns (for bar charts, group analysis)
Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.
Feel free to:
Create EDA notebooks
Practice plotting techniques
Experiment with filtering, grouping, and aggregations
🛠️ No missing values, no data cleaning needed: just download and start exploring!
Hope you find this helpful. Looking forward to hearing from you all.
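A minimal sketch of how such a mixed-type practice table can be generated (all column names and values below are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365

df = pd.DataFrame({
    # date column, for time series and trend plots
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
    # numerical columns, for histograms / boxplots / scatter plots
    "sales": np.round(rng.normal(1000, 150, n), 2),
    "visits": rng.poisson(300, n),
    # categorical column, for bar charts and group analysis
    "region": rng.choice(["North", "South", "East", "West"], n),
})

# Ready for Seaborn/Matplotlib, e.g. sns.histplot(df, x="sales")
# or df.plot(x="date", y="sales") for a trend line.
```

Because the table is generated, it has no missing values and needs no cleaning, matching the spirit of the dataset described above.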
Histogram of the average alignment accuracy over 10 runs for each viral genome shown in Table 1 and each aligner. Reads crossing splice-junction regions are shown in pink; reads not crossing splice-junction regions are shown in blue.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Figure S1. Illustration of indicator mineral map datasets.
Figure S2. Illustration of fault map datasets.
Figure S3. Fault system at CGF.
Figure S4. Fault system at BGF and DPGF.
Figure S5. Illustration of LST datasets.
Figure S6. Histograms and CDF plots of Two-Class Mineral Maps versus Fault Distance Maps.
Figure S7. Histograms and CDF plots of Two-Class Mineral Maps versus Fault Density Maps.
Figure S8. Histograms and CDF plots of Two-Class Temperature Maps versus fault datasets. The top two rows correspond to Fault Distance Maps, while the bottom two rows correspond to Fault Density Maps.
Figure S9. Histograms and CDF plots of Two-Class Mineral Maps versus the Multiclass Temperature Map.
Figure S10. The multiple comparisons of the ANOVA. The plots show the mean estimates (circles) and 95% confidence intervals (bars) for each group of SGP. Red symbols highlight groups with significant differences from the control group (blue). Grey symbols indicate groups with insignificant differences, whose confidence intervals overlap with those of the control group.
Excel spreadsheet containing, on separate sheets, the underlying numerical data used to generate the plots or histograms for Figs 1A–1D, 1E (left panel), 1E (right panel), 2A–2F, 3B, 3C, 5A, 5C, 6C, 6D, 7B, 7C, S1, S2A, S2B, and S13A–S13D.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This file contains the supplementary material for the publication
Protein secondary-structure description with a coarse-grained model by Gerald R. Kneller and K. Hinsen http://dx.doi.org/10.1107/S1399004715007191 Acta Cryst. (2015). D71, 1411-1422
Datasets in this file
1) ScrewFit and ScrewFrame parameters for ideal secondary-structure elements
Scripts: /code/import_ideal_structures /code/analyze_ideal_structures
1.1) The PDB files generated with Chimera
/data/ideal_structures/3-10.pdb /data/ideal_structures/alpha.pdb /data/ideal_structures/beta-antiparallel.pdb /data/ideal_structures/beta-parallel.pdb /data/ideal_structures/pi.pdb
1.2) The corresponding MOSAIC datasets
/data/ideal_structures/3-10 /data/ideal_structures/alpha /data/ideal_structures/beta-antiparallel /data/ideal_structures/beta-parallel /data/ideal_structures/pi
1.3) The ScrewFit parameters
/data/ideal_structures/screwfit/3-10 /data/ideal_structures/screwfit/alpha /data/ideal_structures/screwfit/beta-antiparallel /data/ideal_structures/screwfit/beta-parallel /data/ideal_structures/screwfit/pi
1.4) The ScrewFrame parameters
/data/ideal_structures/screwframe/3-10 /data/ideal_structures/screwframe/alpha /data/ideal_structures/screwframe/beta-antiparallel /data/ideal_structures/screwframe/beta-parallel /data/ideal_structures/screwframe/pi
2) Statistics for ScrewFit and ScrewFrame parameters computed for the ASTRAL SCOPe subset with less than 40% sequence identity.
Scripts: /code/astral_analysis /code/fit_rho_distributions /code/plot_histograms
2.1) The ASTRAL database (link to published ActivePaper)
/data/astral_2.04
2.2) The histograms for the ScrewFit and ScrewFrame parameters for the all-alpha and all-beta subsets
/data/histograms/astral_alpha/screwfit /data/histograms/astral_alpha/screwframe
/data/histograms/astral_beta/screwfit /data/histograms/astral_beta/screwframe
2.3) The Gaussians fitted to the peaks in the distributions for rho
/data/fitted_rho_distributions/screwfit /data/fitted_rho_distributions/screwframe
2.4) Plots
/documentation/delta.pdf /documentation/delta_q.pdf /documentation/delta_r.pdf /documentation/p.pdf /documentation/rho-detail.pdf /documentation/rho.pdf /documentation/sigma.pdf /documentation/tau.pdf
3) Comparison of secondary-structure identification between ScrewFrame and DSSP.
Scripts: /code/compare_secondary_structure_assignments /code/plot_histograms
3.1) The histograms of the lengths of secondary-structure elements
/data/histograms/secondary_structure/length-alpha-dssp /data/histograms/secondary_structure/length-alpha-screwframe /data/histograms/secondary_structure/length-beta-dssp /data/histograms/secondary_structure/length-beta-screwframe
3.2) The 2D histograms of the number of residues inside identified secondary-structure elements
/data/histograms/secondary_structure/n-alpha /data/histograms/secondary_structure/n-beta
3.3) The distribution of rho inside alpha helices
/data/histograms/secondary_structure/rho-alpha-dssp
3.4) Plots
/documentation/lengths-alpha.pdf /documentation/lengths-beta.pdf /documentation/n-alpha.pdf /documentation/n-beta.pdf /documentation/rho-alpha-dssp.pdf
4) Illustration for myoglobin and VDAC-1
Scripts: /code/import_myoglobin_vdac /code/analyze_myoglobin /code/analyze_vdac /code/perturbation_analysis
4.1) Imported structures in MOSAIC format: PDB code 1A6G for myoglobin PDB code 2K4T for VDAC-1
/data/myoglobin /data/VDAC-1
4.2) Plots showing rho and delta
/documentation/rho-myoglobin.pdf /documentation/delta-myoglobin.pdf
4.3) Tube models for visualization with Chimera
/documentation/myoglobin-tube.bld /documentation/VDAC-1-tube.bld
4.4) Sensitivity to perturbations in the coordinates
/documentation/rho-perturbed-myoglobin.pdf /documentation/delta-perturbed-VDAC-1.pdf /documentation/myoglobin-perturbation.pdf /documentation/VDAC-1-perturbation.pdf
5) Analysis of CA-only structures in the PDB
Scripts: /code/ca_analysis /code/import_calpha_structures /code/plot_histograms
5.1) Imported CA-only structures in MOSAIC format
/data/pdb_ca_only_structures
5.2) Histograms for ScrewFrame parameters
/data/histograms/ca_only_structures
5.3) Plots
/documentation/delta_ca.pdf /documentation/delta_q_ca.pdf /documentation/delta_r_ca.pdf /documentation/p_ca.pdf /documentation/rho_ca.pdf /documentation/sigma_ca.pdf /documentation/tau_ca.pdf
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Project Documentation: Cucumber Disease Detection
Introduction: As part of the "Cucumber Disease Detection" project, a machine learning model for the automatic detection of diseases in cucumber plants is being developed. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of images of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Images were gathered from agricultural areas using cameras and smartphones.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visualizations such as scatter plots and histograms, and inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development
The CNN architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.
Model Evaluation
Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
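All of the listed metrics derive from the confusion matrix. A small NumPy sketch of that relationship (not the project's actual evaluation code, which is not reproduced here):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the binary
    confusion matrix (positive class = 1, e.g. 'diseased')."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1,
            "confusion_matrix": np.array([[tn, fp], [fn, tp]])}

# Toy labels: 2 TP, 2 TN, 1 FP, 1 FN
m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```

In practice, computing these on both the training and test splits, as described above, exposes overfitting when the training scores are far higher than the test scores.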
Results and Discussion
Key project findings include model performance and disease-detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project together with the methods used to solve them.
Conclusion
The project's key learnings are recapped, its importance for early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, Matplotlib
Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
dataset repo: https://github.com/dorianprill/dataset-bicycle-geometry
The data set contains more than 6400 observations of the following 30 variables:
columns = [
'URL',
'Brand',
'Model',
'Year',
'Category',
'Motorized',
'Frame Size',
'Frame Config',
'Wheel Size',
'Reach',
'Stack',
'STR',
'Front Center',
'Head Tube Angle',
'Seat Tube Angle Effective',
'Seat Tube Angle Real',
'Top Tube Length',
'Top Tube Length Horizontal',
'Head Tube Length',
'Seat Tube Length',
'Standover Height',
'Chainstay Length',
'Wheelbase',
'Bottom Bracket Offset',
'Bottom Bracket Height',
'Fork Installation Height',
'Fork Offset',
'Fork Trail',
'Suspension Travel (rear)',
'Suspension Travel (front)',
]
Multiple variants may be recorded for each model. Variants depend mostly on Frame Size, Frame Config, Wheel Size, Suspension Travel (rear), Suspension Travel (front).
Most of the columns are self-explanatory if you are into bikes. There may be many nulls in the numeric columns, since different manufacturers may use slightly different sets of values, and some values are normally only stated for a certain category of bikes.
Some of these values can be computed with simple geometry.
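For example, STR is simply Stack divided by Reach, and fork trail follows from the wheel radius, head tube angle, and fork offset. A sketch with made-up example geometry (the wheel radius here includes the tyre and is an assumption, not a value from the dataset):

```python
import math

def stack_to_reach(stack_mm, reach_mm):
    """STR is simply Stack divided by Reach."""
    return stack_mm / reach_mm

def fork_trail(wheel_radius_mm, head_angle_deg, fork_offset_mm):
    """Ground trail from wheel radius R, head tube angle HA
    (measured from horizontal) and fork offset (rake):
        trail = (R*cos(HA) - offset) / sin(HA)
    """
    ha = math.radians(head_angle_deg)
    return (wheel_radius_mm * math.cos(ha) - fork_offset_mm) / math.sin(ha)

# Hypothetical modern 29er trail bike: 368 mm wheel radius
# (with tyre), 65 deg head angle, 44 mm offset -> trail of ~123 mm.
str_ratio = stack_to_reach(630, 480)
trail = fork_trail(368, 65.0, 44)
```

The same formula with road-bike numbers (e.g. 339 mm radius, 73°, 45 mm offset) lands near the familiar ~57 mm of trail, which is a useful sanity check against the Fork Trail column.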
The URL column contains the URL of the page from which the data was extracted. The last number in the URL is the database ID of the bike.
The API currently has no category for electric bikes. It has a variable has_motor, but it is always false; this is probably a bug in the API, or the value is not recorded in the database (yet).
I have included it in the data set as Motorized for future-proofing but you can safely drop it for now.
License: MIT License, https://opensource.org/licenses/MIT
Motivation:
Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.
Applications:
- Building robust phishing detection systems.
- Enhancing security measures in email filtering and web browsing.
- Training cybersecurity practitioners in identifying malicious URLs.
The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.
This dataset comprises two types of URLs:
1. Phishing URLs: Malicious URLs designed to deceive users.
2. Benign URLs: Legitimate URLs posing no harm to users.
Key Features:
- URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
- Content-based features: Link density, iframe presence, external/internal links, and metadata.
- Certificate-based features: SSL/TLS details like validity period and organization.
- WHOIS data: Registration details like creation and expiration dates.
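Assuming nothing about the repository's actual extraction code, here is a stdlib-only sketch of how a few of the URL-based features above might be derived (the function and key names are hypothetical):

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """A few URL-based features of the kind listed above.
    Names are illustrative, not the dataset's actual schema."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "domain": host,
        "uses_https": parsed.scheme == "https",
        # crude dotted-quad check; a production system would also
        # validate each octet's range
        "is_ip_based": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "url_length": len(url),
        "num_subdomains": max(host.count(".") - 1, 0),
    }

f = url_features("http://192.168.0.1/login/verify-account")
```

IP-based hosts and plain HTTP, as in the example above, are classic phishing signals; content, certificate, and WHOIS features would be extracted by fetching the page and its metadata, which this sketch deliberately avoids.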
Statistics:
- Total Samples: 800 (400 phishing, 400 benign).
- Features: 22 including URL, domain, link density, and SSL attributes.
To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.
Insights from EDA:
- Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
- Bar Plots: Class distribution and protocol usage trends.
- Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
- Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.
EDA visualizations are provided in the repository.
The repository contains the Python code used to extract features, conduct EDA, and build the dataset.
Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
1. Protects User Privacy: No personally identifiable information is included.
2. Promotes Ethical Use: Intended solely for academic and research purposes.
3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.
Risks:
- Misuse of the dataset for creating more deceptive phishing attacks.
- Over-reliance on outdated features as phishing tactics evolve.
Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.
This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes.
Figures containing a histogram of the frequency of effect sizes on AG and BG herbivores, and a funnel plot of effect size against sample size indicating the absence of publication bias.
Exploratory Data Analysis (EDA) on the Online Shoppers Purchasing Intention Dataset
Author: Shira Bash
Project Overview
This project performs Exploratory Data Analysis (EDA) on the Online Shoppers Purchasing Intention dataset. The goal is to understand which behavioral patterns influence the likelihood that a website visitor completes a purchase (Revenue = True). The analysis includes:
Data exploration & validation
Visualizations (histograms, scatter plots, box… See the full description on the dataset page: https://huggingface.co/datasets/shiraBASH/online-shoppers-eda.
License: GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0-standalone.html
Data for 2D Lagrangian particle tracking and evaluation of their hydrodynamic characteristics

## Abstract
This dataset contains Python code for the fluid-mechanic evaluation of Lagrangian particles with the CSRT ("Channel and Spatial Reliability Tracking") algorithm in the "OpenCV" library, written by Ryan Rautenbach in the framework of his Master's thesis.

## Workflow for Lagrangian particle tracking and evaluation via OpenCV
In the following, a brief introduction and guide based on the folders in the repository is laid out. More code-specific instructions can be found in the respective code files.

working_env_RMR.yml --> Contains the entire environment, including software versions (used here with the Spyder IDE and Conda), with which the datasets were evaluated.

01 --> Tracking always begins with the 01_milti[...] folder, in which the Python code with the OpenCV algorithm is located. For the tracking to work, certain directories are required: one in which the raw images are stored (separate from anything else) and one in which the results are saved (not the same directory as the raw data). After tracking is completed for all experiments and the results directories are adequately labelled and stored, any of the other code files can be used for the respective analyses. The order of the folders beyond 01 has no relevance to the order of evaluation, but following it can ease the understanding of the evaluated data.

02 --> Evaluation of the number of circulations and the respective circulation times in the experimental vat. (The code can be extended to calculate the circulation time based on the various planes that are artificially set.)

03 --> Code for calculating the number of contacts with the vat floor. The code requires some visual evaluation of the LP trajectories, as the plane/barrier for the contact evaluation has to be set manually.

04 --> Contains two codes that combine individual results into larger, more processable arrays within Python.

05 --> Contains the code to plot the trajectories of single Lagrangian-particle experiments based on the positional results and the velocity at each position, highlighting the trajectory over the experiment.

06 --> Codes to create 1D histograms of the probability density and velocity distributions over cumulative experiments.

07 --> Codes for plotting the 2D probability density distribution (2D histograms) of Lagrangian particles over cumulative experiments. The code provides the values for the 2D grid; plotting is conducted in Origin Lab or similar graphing tools. Graphing can also be done in Python, for which the seaborn (matplotlib) library is suggested.

08 --> Contains the code for the dimensionless evaluation of the results based on the respective Stokes-number approaches and weighted averages. 2D histograms are also vital to this evaluation; plotting is again conducted in Origin Lab, as the code only calculates the values.

09 --> Contains no Python code; instead it holds the Origin Lab files for graphing, plotting, and evaluating the results calculated via Python. The tables, histograms, and heat maps therein can be used as templates if necessary.
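The 2D probability density grids of steps 06-08 can be reproduced in outline with NumPy alone. The positions below are synthetic stand-ins for the tracker's output, not data from the repository:

```python
import numpy as np

# Synthetic particle positions standing in for the tracked
# Lagrangian-particle coordinates produced in step 01.
rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 5000)
y = rng.normal(0.0, 0.5, 5000)

# density=True normalises the counts so the histogram integrates
# to 1 over the grid, i.e. a 2D probability density distribution.
H, xedges, yedges = np.histogram2d(x, y, bins=(40, 40), density=True)

# H holds the grid values that would be exported for plotting in
# Origin Lab, or rendered directly with seaborn.heatmap(H).
```

The integral of `H` over the bin areas is 1 by construction, which is the property that makes densities from experiments of different lengths comparable.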
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This study evaluated how interspecific variation in diameter distributions relates to growth-diameter and mortality-diameter curves and to population growth rates, using 25 y of demographic data from the 50-ha Barro Colorado Island plot. More specifically, this document presents the truncated Weibull fits to the diameter distributions, the corresponding truncated Weibull parameters, the parameters of the growth-diameter and mortality-diameter curves, and the population growth rates for the Barro Colorado Island species included in the above-mentioned study.
CITATION FOR SUPPORTING DATA: Lima, R.A.F., Muller-Landau, H.C., Prado, P.I. & Condit, R. 2016. How do size distributions relate to concurrently measured demographic rates? Evidence from over 150 tree species in Panama: Supporting data. http://dx.doi.org/10.5479/10088/28131
CITATION FOR ORIGINAL ARTICLE: Lima, R.A.F., Muller-Landau, H.C., Prado, P.I. & Condit, R. 2016. How do size distributions relate to concurrently measured demographic rates? Evidence from over 150 tree species in Panama. Journal of Tropical Ecology. doi: 10.1017/S0266467416000146.
FILES INCLUDED WITH SUPPORTING DATA:
Table S1. Parameters of the truncated Weibull fits to size distributions (beta and alpha), parameters of the growth-dbh and mortality-dbh curves, and population growth rates (lambda) for the 174 Barro Colorado Island species included in this study.
Figure S1. Diameter distributions of the species evaluated in this study and their truncated Weibull fits for the 2010 census of the Barro Colorado Island 50-ha plot. Because the truncated Weibull was fitted directly to the data and not to the histograms shown below, there may be some disparities between them. Each histogram shows density on the y-axis and diameter in mm on the x-axis. Species' full names are given in Table S1.
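Not the authors' code, but a sketch of the underlying idea: a left-truncated Weibull can be fitted by maximizing the likelihood of the Weibull density renormalized by the survival probability at the truncation point (here, the minimum measured diameter):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
a = 10.0  # truncation point: minimum measured diameter (mm)

# Synthetic "diameters": draw from a full Weibull, keep stems >= a
full = stats.weibull_min.rvs(c=1.2, scale=80.0, size=20000,
                             random_state=rng)
dbh = full[full >= a]

def nll(params):
    """Negative log-likelihood of a left-truncated Weibull:
    density f(x) / S(a) for x >= a, with S the survival function."""
    c, scale = params
    if c <= 0 or scale <= 0:
        return np.inf
    logpdf = stats.weibull_min.logpdf(dbh, c, scale=scale)
    log_sf_a = stats.weibull_min.logsf(a, c, scale=scale)
    return -(logpdf - log_sf_a).sum()

res = optimize.minimize(nll, x0=[1.0, 50.0], method="Nelder-Mead")
c_hat, scale_hat = res.x  # should roughly recover c=1.2, scale=80
```

Fitting directly to the data in this way, rather than to binned histograms, is exactly why the document warns of disparities between the fitted curves and the histograms shown in Figure S1.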
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Repository composition:
*** Dataset ***
Anonymous data pertaining to each participant is stored in a single folder, named S##.
Each of these folders contains the following:
1) Raw IMU data from the G-Walk sensor (3-axis acceleration, 3-axis gyroscope, 3-axis magnetometer) recorded during the Timed-Up and Go tests in .txt files. Specifically, the 3 experimental conditions are "Unbraced", "Conventional" and "3DPrinted". Each condition was recorded three times (01, 02, 03).
2) A .xlsx file named "TUG_Metrics" with the values of the TUG metrics for each condition.
3) A .xlsx file named "Segmentation_Times" with the start and end timepoints of the TUG phases for each condition.
*** Boxplot ***
This is a folder containing boxplots in .png files for each TUG metric, comparing the three experimental conditions.
*** Histogram ***
This is a folder containing histograms in .png files for each TUG metric, comparing the three experimental conditions.
*** QQ Plot ***
This is a folder containing qq-plots in .png files for each TUG metric, comparing the three experimental conditions.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Activity Title: "Chart the Story: Infographic Challenge" (This activity is prepared for students to practice data visualization)
Description: Each group chooses a theme (e.g., sports stats, movies, weather) and creates a multi-plot visual infographic using:
• Histograms
• Proper legends, color schemes
• Subplots for comparison
• Informative annotations and custom styles
Outcome: Poster or presentation gallery walk where peers rate clarity and visual storytelling.
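A possible starting point for the challenge, assuming Matplotlib (the theme and all numbers are invented):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; students can omit this
import matplotlib.pyplot as plt
import numpy as np

# Made-up sports stats: goals per match over a 380-match season
rng = np.random.default_rng(3)
goals_home = rng.poisson(1.5, 380)
goals_away = rng.poisson(1.1, 380)

# Subplots for comparison, with legends, colors and annotations
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, data, title, color in [
    (axes[0], goals_home, "Home goals", "tab:blue"),
    (axes[1], goals_away, "Away goals", "tab:orange"),
]:
    ax.hist(data, bins=range(0, 9), color=color,
            edgecolor="black", label=title)
    ax.set_title(title)
    ax.set_xlabel("Goals per match")
    ax.legend()
    ax.annotate(f"mean = {data.mean():.2f}", xy=(0.6, 0.9),
                xycoords="axes fraction")
axes[0].set_ylabel("Matches")
fig.suptitle("Chart the Story: goals per match (synthetic season)")
fig.tight_layout()
fig.savefig("infographic.png", dpi=150)
```

Groups would swap in their own theme's data and extend the grid of subplots; the shared y-axis and the annotated means are small touches that tend to score well on clarity in the gallery walk.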
Jupyter notebook files containing the Python scripts used for analyzing the interacting effects of water-chemistry features on zinc anode passivation. Includes code to evaluate the Master Dataset with histograms, correlation matrices, scatter plots, Dunn's tests, and logistic regression models.
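Not the notebooks' actual code, but a sketch of a correlation matrix computed on stand-in water-chemistry features (the column names and the relationship between them are invented for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in features; the real Master Dataset is not reproduced here.
rng = np.random.default_rng(11)
n = 200
ph = rng.normal(7.5, 0.4, n)
chloride = 20 + 15 * (8.0 - ph) + rng.normal(0, 3, n)  # tied to pH
sulfate = rng.normal(40, 10, n)                        # independent

df = pd.DataFrame({"pH": ph,
                   "chloride_mg_L": chloride,
                   "sulfate_mg_L": sulfate})

# Spearman rank correlation: robust to monotone nonlinearity,
# in keeping with the nonparametric (Dunn's test) analyses above.
corr = df.corr(method="spearman")
```

A heatmap of `corr` (e.g. via seaborn) then makes interacting features visible at a glance, which is typically the first step before fitting the logistic regression models mentioned above.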