This is transaction data from an electronics store chain in the US. The data comprises 12 CSV files, one for each month of 2019.
The naming convention is as follows: Sales_[MONTH_NAME]_2019
Each file contains roughly 9,000 to 26,000 rows and 6 columns. The columns are as follows:
Order ID, Product, Quantity Ordered, Price Each, Order Date, Purchase Address
There are around 186,851 data points when all 12 monthly files are combined. Some rows may contain null values.
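A minimal pandas sketch of combining the monthly files (the file pattern follows the naming convention above; everything else is illustrative, not the dataset author's code):

```python
import glob
import pandas as pd

# Combine the 12 monthly files into one frame
files = glob.glob("Sales_*_2019.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

df = df.dropna(how="all")  # drop rows that are entirely null
# Coerce numeric columns; malformed values become NaN
df["Quantity Ordered"] = pd.to_numeric(df["Quantity Ordered"], errors="coerce")
df["Price Each"] = pd.to_numeric(df["Price Each"], errors="coerce")
print(len(df))  # roughly 186,851 rows expected
```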
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
About Dataset
Description
The dataset was downloaded from Kaggle.
Context
Google Play Store Android app data (2.3 million+ apps).
Backup repo: https://github.com/gauthamp10/Google-Playstore-Dataset
Content
I collected the data with a Python script (Scrapy) running on a cloud VM instance.
The data was collected in June 2025.
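For illustration, a minimal Scrapy spider of the kind described (the entry URL, selectors, and fields are hypothetical; the actual collection script is not published here):

```python
import scrapy

class PlayStoreSpider(scrapy.Spider):
    name = "playstore"
    start_urls = ["https://play.google.com/store/apps"]  # hypothetical entry point

    def parse(self, response):
        # Follow links to app detail pages (selector is hypothetical)
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/store/apps/details"):
                yield response.follow(href, callback=self.parse_app)

    def parse_app(self, response):
        # Yield one record per app (fields are hypothetical)
        yield {"app_name": response.css("h1 ::text").get(), "url": response.url}
```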
Also check out:
Apple AppStore Apps dataset: https://www.kaggle.com/gauthamp10/apple-appstore-apps
Android App Permission dataset: https://www.kaggle.com/gauthamp10/app-permissions-android
Acknowledgements
I couldn't have built this dataset without the help of GitHub Education, and I switched to facundoolano/google-play-scraper for practical reasons.
Inspiration
Took inspiration from: https://www.kaggle.com/lava18/google-play-store-apps to build a big database for students and researchers.
[UPDATED EVERY WEEK]
Have you ever wondered how popular the Python libraries you use regularly on Kaggle (such as pandas and NumPy) are?
This dataset lists the top 100 Python projects (or libraries) PER DAY, ranked by the number of GitHub stars, starting from 18 December 2018, almost 5 years back!
Attributes
date: Date on which the record was collected
rank: 1-100, rank based on number of GitHub stars, sorted in decreasing order
item: The tracked language (always "Python" for this dataset)
repo_name: Name of the GitHub repository of the Python project (library)
stars: Number of stars of the GitHub repo
forks: Number of forks of the GitHub repo
language: The language the repository is written in
repo_url: The link to the GitHub repository
username: Creator of the GitHub repository
issues: Number of active issues raised in the GitHub repository
last_commit: The time of the most recent commit
description: Description of the Python project (library)
Reference: https://github.com/EvanLi/Github-Ranking
EDA: https://www.kaggle.com/code/yeoyunsianggeremie/eda-of-popular-python-libraries-used-in-kaggle
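A minimal pandas sketch for exploring the attributes above (the CSV file name is hypothetical; the columns are the ones listed):

```python
import pandas as pd

df = pd.read_csv("top-100-python-projects.csv", parse_dates=["date"])

# Star history of a single library, e.g. pandas
history = df[df["repo_name"] == "pandas"].sort_values("date")
print(history[["date", "rank", "stars", "forks"]].tail())

# Top 10 on the most recent day
latest = df[df["date"] == df["date"].max()].nsmallest(10, "rank")
print(latest[["rank", "repo_name", "stars"]])
```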
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Opencores
In the process of continual pre-training, we utilized the publicly available VGen dataset. VGen aggregates Verilog repositories from GitHub, systematically filters out duplicates and excessively large files, and retains only those files containing module and endmodule statements. We also incorporated the CodeSearchNet dataset, which contains approximately 40 MB of function code and its documentation.… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/vgen_cpp.
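A minimal sketch of the kind of filtering VGen is described as performing (the size cutoff and directory layout are assumptions, not VGen's actual pipeline):

```python
from pathlib import Path

MAX_BYTES = 1_000_000  # hypothetical "excessively large" cutoff

def keep_file(path: Path) -> bool:
    # Drop oversized files; keep only files with module/endmodule statements
    if path.stat().st_size > MAX_BYTES:
        return False
    text = path.read_text(errors="ignore")
    return "module" in text and "endmodule" in text

verilog_files = [p for p in Path("repos").rglob("*.v") if keep_file(p)]
print(len(verilog_files), "files kept")
```

Deduplication, also mentioned above, could be added by hashing file contents and keeping one file per hash.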
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This is my first EDA. I took the data off Kaggle, took a sample of all accidents since 1919, and did an EDA on them using Python, pandas, NumPy, and Matplotlib.
As an aspiring data analyst/scientist I am not so familiar with Git or Kaggle yet, so please forgive any GitHub errors.
This dataset, named Zara Sales for EDA, was created by combining several public fashion datasets from GitHub and Kaggle. It focuses on Zara products and includes information such as product name, description, price, category, and sales volume. Additional columns such as season and url were also added. For example, the season column is determined from the product name (e.g., "jacket" → Winter/Autumn), and the url column is built by concatenating the base Zara site link with the product title. The original dataset had around 7 thousand rows, so oversampling was applied to increase the data volume and balance the categories for better analysis.
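A minimal sketch of the season/url derivation described above (column names, keyword mapping, and URL format are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"product_name": ["Leather Jacket", "Linen Shorts"]})

SEASON_KEYWORDS = {"jacket": "Winter/Autumn", "coat": "Winter", "shorts": "Summer"}

def infer_season(name: str) -> str:
    lowered = name.lower()
    for keyword, season in SEASON_KEYWORDS.items():
        if keyword in lowered:
            return season
    return "All-season"  # hypothetical default

df["season"] = df["product_name"].apply(infer_season)
# url: base Zara link plus a slug built from the product title (format assumed)
df["url"] = "https://www.zara.com/" + df["product_name"].str.lower().str.replace(" ", "-")
print(df)
```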
GNU LGPL v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
This dataset contains data about bestseller books from large retailers such as Amazon, eBay, and Barnes & Noble. The goal of this dataset is to use the data to get insights into which books would be more profitable. We have 6 files: 3 of them are cleaned, and the others contain data collected directly via web scraping. You can see more details on GitHub.
Entity-Deduction Arena (EDA)
This dataset complements the paper Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games, presented at the ACL 2024 main conference. The main repo can be found at https://github.com/apple/ml-entity-deduction-arena
Motivation
There is a demand for assessing the capability of LLMs to ask clarifying questions in order to effectively resolve ambiguities when confronted with vague queries. This capability demands a sophisticated… See the full description on the dataset page: https://huggingface.co/datasets/yizheapple/entity-deduction-arena.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Watch Market Analysis Graph Neural Network Dataset
Links:
GitHub link to the code that generated this dataset: watch-market-gnn-code
Link to the interactive EDA hosted on a website: Watch Market Analysis Report
Contents: Summary, Dataset Description, Technical Details, Exploratory Data Analysis, Ethics and Limitations, Usage
Detailed Table of Contents
Summary: Key Statistics, Primary Use Cases
Dataset Description: Data Structure, Features, Network Properties… See the full description on the dataset page: https://huggingface.co/datasets/TMVishnu/watch-market-gnn.
As described in the README.md file, the GitHub repository PRTR_transfers contains Python scripts written to run a data-centric and chemical-centric framework for tracking EoL chemical flow transfers, identifying potential EoL exposure scenarios, and performing Chemical Flow Analysis (CFA). The Extract, Transform, and Load (ETL) pipeline leverages publicly accessible Pollutant Release and Transfer Register (PRTR) systems belonging to Organization for Economic Cooperation and Development (OECD) member countries. The Life Cycle Inventory (LCI) data obtained by the ETL is stored in a Structured Query Language (SQL) database called PRTR_transfers that can be connected to Machine Learning Operations (MLOps) in production environments, making the framework scalable for real-world applications. The data ingestion pipeline can supply data at an annual rate, ensuring labeled data can be ingested into data-driven models if retraining is needed, especially to address problems like data and concept drift that could drastically affect the performance of data-driven models. The README also describes the Python libraries required for running the code, how to use it, the output files produced after running the Python script, and how to obtain all manuscript figures (file Manuscript Figures-EDA.ipynb) and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Tracking end-of-life stage of chemicals: A scalable data-centric and chemical-centric approach. Resources, Conservation and Recycling. Elsevier Science BV, Amsterdam, NETHERLANDS, 196: 107031, (2023).
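A minimal sketch of querying the PRTR_transfers SQL database with pandas (the connection string and table name are assumptions; see the repository's README for the actual schema):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust driver, credentials, and host
engine = create_engine("mysql+pymysql://user:password@localhost/PRTR_transfers")
transfers = pd.read_sql("SELECT * FROM transfer_record LIMIT 10", engine)
print(transfers.head())
```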
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BIMCV Prostate Dataset is a comprehensive and diverse dataset that includes a total of 9,341 prostate MRI sessions, distributed among 8,441 subjects, collected from 16 healthcare centers in the Valencian Community, Spain. This dataset is structured according to the MIDS (Medical Imaging Data Structure) standard, ensuring consistent and accessible organization for researchers, facilitating data use and analysis.
The first version of the dataset focuses on sessions that contain all three imaging modalities (T2W, DWI, and ADC), resulting in a total of 1,730 complete sessions and 4,663 training samples, of which 2,594 are csPCa positive and 2,069 are csPCa negative. This information can be found in the table available on GitHub.
The dataset includes MRI images in three modalities: T2-weighted images (T2W), diffusion-weighted images (DWI), and apparent diffusion coefficient (ADC) maps. In total, the dataset includes 32,662 T2W images (62.97%), 8,036 DWI images (15.49%), and 11,167 ADC maps (21.53%), including both the original maps and those calculated from the available DWI images. This additional calculation process was carried out to ensure the dataset's integrity and consistency, allowing for comprehensive analysis in the field of prostate oncology.
The exploratory data analysis (EDA) performed on this dataset has provided insights into the characteristics and distribution of the images, ensuring the dataset's representativeness and diversity. For example, it was found that Health Center 5 contributed the highest proportion of sessions (15.6%), followed by Health Center 7 (12.3%) and Health Center 17 (10.5%). This level of diversity in data sources ensures that the dataset encompasses a wide range of imaging acquisition practices and patient demographics, improving the generalization of artificial intelligence models developed with this data.
Additionally, the analysis of the distribution by MRI equipment manufacturer revealed that most images were acquired with General Electric equipment (66.7%), followed by Philips (25.1%) and Siemens (8.13%). Similarly, most sessions were conducted with 1.5 Tesla machines (63%), followed by 3.0 Tesla machines (36.5%), reflecting standard clinical practices in the region.
Regarding the distribution of labels within the dataset, of the total cases, 4,871 (approximately 52%) are labeled as csPCa positive, while 3,514 cases (approximately 37%) are labeled as csPCa negative.
To access the dataset, please fill out the following survey: https://forms.office.com/e/frV3A5dT6r
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
- If you want to know about the web scraping process, read the blog: Medium Article
- If you want to see the step-by-step process of data cleaning and EDA, check out my GitHub repo: GitHub Repo
Version 1: This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.
Version 2: Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making basic analysis easier. (A short cleaning sketch follows below.)
Version 3: Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.
Version 4: This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.
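A minimal sketch of the basic cleaning steps described for Version 2 (file, column names, and formats are hypothetical):

```python
import pandas as pd

df = pd.read_csv("smartphones_raw.csv")  # hypothetical file name

# Remove duplicates
df = df.drop_duplicates()

# Standardize a numeric field scraped as text, e.g. "₹79,999" -> 79999.0
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# Handle missing values: drop rows missing essentials, fill the rest
df = df.dropna(subset=["model", "price"])
df["rating"] = df["rating"].fillna(df["rating"].median())
```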
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
Paper · GitHub
Introduction
MMCircuitEval is a multimodal benchmark specifically designed to comprehensively assess MLLM performance across diverse EDA tasks. MMCircuitEval comprises 3,614 meticulously curated question-answer (QA) pairs covering:
- Both digital and analog circuits
- Critical EDA stages, ranging from general knowledge and specifications to front-end and back-end design
- A… See the full description on the dataset page: https://huggingface.co/datasets/charlie314159/MMCircuitEval.
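A minimal loading sketch, assuming the standard Hugging Face datasets API (split and field names may differ; inspect the printed object):

```python
from datasets import load_dataset

ds = load_dataset("charlie314159/MMCircuitEval")
print(ds)  # inspect available splits and features
```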
Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.
Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns perfectly with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. Here are the key objectives:
1. Data Collection and Cleaning:
• Gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies).
• Clean and preprocess the data to ensure accuracy and consistency.
2. Descriptive Statistics:
• Summarize key statistics: total cases, recoveries, deaths, and testing rates.
• Visualize temporal trends using line charts, bar plots, and heat maps (see the sketch after this list).
3. Geospatial Analysis:
• Map COVID-19 cases across countries, regions, or cities.
• Identify hotspots and variations in infection rates.
4. Demographic Insights:
• Explore how age, gender, and pre-existing conditions impact vulnerability.
• Investigate disparities in infection rates among different populations.
5. Healthcare System Impact:
• Analyze hospitalization rates, ICU occupancy, and healthcare resource allocation.
• Assess the strain on medical facilities.
6. Economic and Social Effects:
• Investigate the relationship between lockdown measures, economic indicators, and infection rates.
• Explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
7. Predictive Modeling (Optional):
• If data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
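A minimal sketch of objective 2, descriptive statistics and temporal trends (file and column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("covid19_global.csv", parse_dates=["date"])

# Key totals
print(df[["cases", "deaths", "recoveries"]].sum())

# Temporal trend: daily cases worldwide
daily = df.groupby("date")["cases"].sum()
daily.plot(title="Daily COVID-19 cases (worldwide)")
plt.xlabel("Date")
plt.ylabel("Cases")
plt.show()
```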
Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.
Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.
License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.
Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.
This dataset was created by web scraping the TVs page of smartprix.com. The goal of creating this dataset is to develop a TV price predictor using machine learning techniques. This is an untidy and messy dataset, as many columns have values that are not split in a correct manner. We need to first clean the dataset for further analysis and predictions.
If you want to know the source and the process behind the dataset, you can visit my GitHub profile: https://github.com/JUSTPERKY/Data-Gathering-From-Websites
'This is my first time creating a dataset through web scraping.'
Note: I updated the dataset again with more columns, as some values have been shifted into the new columns, and we need to rearrange the dataset in order to clean it.
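A minimal sketch of the kind of column splitting this dataset needs (the combined column and its format are hypothetical examples of the mis-split values described above):

```python
import pandas as pd

df = pd.read_csv("smartprix_tvs.csv")  # hypothetical file name

# Split a combined field like "55 inch, 4K Ultra HD" into two columns
parts = df["size_resolution"].str.split(",", n=1, expand=True)
df["screen_size"] = parts[0].str.strip()
df["resolution"] = parts[1].str.strip()
```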
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Motivation:
Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.
Applications:
- Building robust phishing detection systems.
- Enhancing security measures in email filtering and web browsing.
- Training cybersecurity practitioners in identifying malicious URLs.
The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.
This dataset comprises two types of URLs:
1. Phishing URLs: Malicious URLs designed to deceive users.
2. Benign URLs: Legitimate URLs posing no harm to users.
Key Features:
- URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links (see the sketch after this list).
- Content-based features: Link density, iframe presence, external/internal links, and metadata.
- Certificate-based features: SSL/TLS details like validity period and organization.
- WHOIS data: Registration details like creation and expiration dates.
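A minimal sketch of extracting URL-based features like those listed above (the exact feature set and names are illustrative, not the dataset's build code):

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "domain": parsed.netloc,
        # IP-based links such as http://192.0.2.10/login
        "is_ip_based": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}(:\d+)?", parsed.netloc)),
        "num_subdomains": max(parsed.netloc.count(".") - 1, 0),
    }

print(url_features("http://192.0.2.10/login"))
```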
Statistics:
- Total Samples: 800 (400 phishing, 400 benign).
- Features: 22 including URL, domain, link density, and SSL attributes.
To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.
Insights from EDA:
- Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
- Bar Plots: Class distribution and protocol usage trends.
- Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
- Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.
EDA visualizations are provided in the repository.
The repository contains the Python code used to extract features, conduct EDA, and build the dataset.
Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
1. Protects User Privacy: No personally identifiable information is included.
2. Promotes Ethical Use: Intended solely for academic and research purposes.
3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.
Risks:
- Misuse of the dataset for creating more deceptive phishing attacks.
- Over-reliance on outdated features as phishing tactics evolve.
Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.
This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.
NSRF (ΕΣΠΑ) 2007-2013 - Operations per Management Body - Managing Authority for Transport Infrastructure, Environment and Sustainable Development (formerly EDA Transport)
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The problem involves scraping two websites (Life Quality and Crime Rate) to collect life quality and crime rate data, merging them by country name, and conducting EDA in Tableau to find insights. For details, see: https://github.com/NifulIslam/Life-Quality-and-Crime-Rate-Scrapping-and-EDA
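A minimal sketch of the merge step described above (file and column names are hypothetical):

```python
import pandas as pd

quality = pd.read_csv("life_quality.csv")  # columns assumed: country, quality_index, ...
crime = pd.read_csv("crime_rate.csv")      # columns assumed: country, crime_index, ...

# Normalize country names before merging to avoid case/whitespace mismatches
for df in (quality, crime):
    df["country"] = df["country"].str.strip().str.title()

merged = quality.merge(crime, on="country", how="inner")
merged.to_csv("life_quality_crime.csv", index=False)
```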
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This repo contains the news data in tabular format. Please read about how the data was prepared from the different sources (a scraping sketch follows below):
* Nepali News (Gorkhapatra) Scraping Using BeautifulSoup and Python
* Nepali News (ekantipur) Scraping Using BeautifulSoup and Python
* Nepali News (Onlinekhabar Post) Scraping Using BeautifulSoup and Python
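A minimal sketch of the BeautifulSoup-based scraping approach referenced above (the URL and selectors are placeholders; see the linked posts for the real process):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/nepali-news")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

articles = []
for item in soup.select("article"):  # hypothetical selector
    title = item.select_one("h2")
    body = item.select_one("p")
    if title and body:
        articles.append({"title": title.get_text(strip=True),
                         "text": body.get_text(strip=True)})
print(len(articles), "articles scraped")
```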
I have also done some work on this dataset, which can be found in the blogs below:
* EDA on Nepali News
* Nepali News Classification with Naive Bayes
* Nepali News Classification with Logistic Regression
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This data is publicly available on GitHub here. It can be utilized for EDA, Statistical Analysis, and Visualizations.
The dataset ifood_df.csv consists of 2,206 customers of XYZ company with data on:
- Customer profiles
- Product preferences
- Campaign successes/failures
- Channel performance
I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.