88 datasets found
  1. Electronics Store Sales Dataset for EDA

    • kaggle.com
    zip
    Updated Feb 13, 2021
    Cite
    Sinjoy Saha (2021). Electronics Store Sales Dataset for EDA [Dataset]. https://www.kaggle.com/sinjoysaha/sales-analysis-dataset
    Available download formats: zip (2505035 bytes)
    Authors
    Sinjoy Saha
    Description

    Content

    This is transaction data from an electronics store chain in the US. The data consists of 12 CSV files, one for each month of 2019. The naming convention is Sales_[MONTH_NAME]_2019. Each file contains roughly 9,000 to 26,000 rows and 6 columns: Order ID, Product, Quantity Ordered, Price Each, Order Date, and Purchase Address. Combined, the 12 monthly files hold around 186,851 data points. Some rows may contain null values.
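    A minimal sketch of how the twelve monthly files might be combined for analysis. It assumes the files sit in one folder and carry a .csv extension on top of the Sales_[MONTH_NAME]_2019 naming convention; the function name load_sales is hypothetical and not part of the dataset.

```python
import csv
from pathlib import Path


def load_sales(folder):
    """Read every monthly sales file in `folder` into one list of row dicts."""
    rows = []
    # Assumption: files are named Sales_[MONTH_NAME]_2019 with a .csv extension.
    for path in sorted(Path(folder).glob("Sales_*_2019.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Skip rows in which every field is empty; partially filled rows are kept.
                if any(isinstance(v, str) and v.strip() for v in row.values()):
                    rows.append(row)
    return rows
```

    Only rows that are empty in every column are dropped here; rows with some missing values are kept so that they can be inspected during the EDA itself.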

    Inspiration

    Keith Galli

    Acknowledgements

  2. Complete Google Playstore EDA 2025

    • kaggle.com
    zip
    Updated Jul 29, 2025
    Cite
    Muhammad Shayan (2025). Complete Google Playstore EDA 2025 [Dataset]. https://www.kaggle.com/datasets/muhammadshayan5839/complete-google-playstore-eda-2025
    Available download formats: zip (20150127 bytes)
    Authors
    Muhammad Shayan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    - About Dataset

  3. Most Popular Python Projects on GitHub (2018-)

    • kaggle.com
    zip
    Updated Feb 3, 2024
    Cite
    bogoconic1 (2024). Most Popular Python Projects on GitHub (2018-) [Dataset]. https://www.kaggle.com/yeoyunsianggeremie/most-popular-python-projects-on-github-2018-2023
    Available download formats: zip (14095145 bytes)
    Authors
    bogoconic1
    Description

    [UPDATED EVERY WEEK]

    Have you wondered how popular the Python libraries you use regularly on Kaggle (such as pandas and numpy) are?

    This dataset lists the top 100 Python projects (or libraries) PER DAY, ranked by number of GitHub stars, starting from 18 December 2018, almost 5 years back!

    Attributes

    date: Date on which the record was collected

    rank: 1-100, rank based on number of GitHub stars, sorted in decreasing order

    item: Python

    repo_name: Name of the GitHub repository of the Python project (library)

    stars: Number of stars of the GitHub repo

    forks: Number of forks of the GitHub repo

    language: The language the repository is written in

    repo_url: The link to the GitHub repository

    username: Creator of the GitHub repository

    issues: Number of active issues raised in the GitHub repository

    last_commit: The time of the most recent commit

    description: Description of the Python project (library)

    Reference https://github.com/EvanLi/Github-Ranking

    EDA: https://www.kaggle.com/code/yeoyunsianggeremie/eda-of-popular-python-libraries-used-in-kaggle

  4. vgen_cpp

    • huggingface.co
    Cite
    nwang227, vgen_cpp [Dataset]. https://huggingface.co/datasets/LLM-EDA/vgen_cpp
    Authors
    nwang227
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Opencores

    In the process of continual pre-training, we utilized the publicly available VGen dataset. VGen aggregates Verilog repositories from GitHub, systematically filters out duplicates and excessively large files, and retains only those files containing module and endmodule statements. We also incorporated the CodeSearchNet dataset, which contains approximately 40 MB of function code and its documentation.… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/vgen_cpp.

  5. Aviation EDA - on plane accidents

    • kaggle.com
    zip
    Updated Nov 27, 2024
    Cite
    victor munyaradzi (2024). Aviation EDA - on plane accidents [Dataset]. https://www.kaggle.com/datasets/victormunyaradzi/aviation-eda-on-plane-accidents
    Available download formats: zip (628563 bytes)
    Authors
    victor munyaradzi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my first EDA. I took the data from Kaggle, took a sample of all accidents since 1919, and ran an EDA on them using Matplotlib, Python, pandas, and NumPy.

    I am not very familiar with Git or Kaggle as an aspiring data analyst/scientist, so please forgive any GitHub errors.

  6. Zara Sales for EDA

    • hmsditera.com
    csv
    Updated Dec 1, 2025
    Cite
    EGGI SATRIA (2025). Zara Sales for EDA [Dataset]. https://www.hmsditera.com/datasets/9c5f10d7-0fec-4fee-8039-bb9b4e75e357
    Available download formats: csv (6.24 MB)
    Dataset provided by
    Akademik Dan Keprofesian
    Authors
    EGGI SATRIA
    Description

    This dataset, named Zara Sales for EDA, was created by combining several public fashion datasets from GitHub and Kaggle. It focuses on Zara products and includes information such as product name, description, price, category, and sales volume. Additional columns such as season and url were also added. For example, the season column is derived from the product name (e.g. “jacket” → Winter/Autumn), and the url column is built by joining the Zara site's base link with the product title. The original dataset had about 7 thousand rows, so oversampling was applied to increase the amount of data and balance the categories for better analysis.
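    The two derived columns could be produced along the following lines. This is a sketch only: the "jacket" rule is the single mapping the description gives, while the other keywords, the base URL, the slug format, and both function names are hypothetical stand-ins rather than the author's actual rules.

```python
from urllib.parse import quote

# Only the "jacket" rule is stated in the description; the rest are hypothetical.
SEASON_KEYWORDS = {
    "jacket": "Winter/Autumn",
    "coat": "Winter/Autumn",  # hypothetical
    "shorts": "Summer",       # hypothetical
}

# Assumed base link; the description does not give the exact value.
BASE_URL = "https://www.zara.com/"


def infer_season(product_name, default="Unknown"):
    """Return the season of the first keyword found in the product name."""
    name = product_name.lower()
    for keyword, season in SEASON_KEYWORDS.items():
        if keyword in name:
            return season
    return default


def build_url(product_name):
    """Join the base link with a lower-cased, hyphen-separated product title."""
    slug = "-".join(product_name.lower().split())
    return BASE_URL + quote(slug)
```

    Keyword matching of this kind is coarse, so a season column built this way is best treated as an approximate label.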

  7. Bestsellers books (Amazon, eBay, and Barnes&Noble)

    • kaggle.com
    Updated May 15, 2022
    Cite
    Diógenes Silva (2022). Bestsellers books (Amazon, eBay, and Barnes&Noble) [Dataset]. https://www.kaggle.com/datasets/digenessilva/bestsellers-books-amazon-ebay-and-barnesnoble/suggestions
    Dataset provided by
    Kaggle
    Authors
    Diógenes Silva
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    This dataset contains data about bestseller books from big companies such as Amazon, eBay, and Barnes&Noble. The goal of this dataset is to use this data to get insights into what books would be more profitable. We have 6 files, 3 of them are cleaned and the others are data directly collected using web scraping. You can see more details on github.

  8. entity-deduction-arena

    • huggingface.co
    Updated Jan 27, 2024
    Cite
    Yizhe Zhang (2024). entity-deduction-arena [Dataset]. https://huggingface.co/datasets/yizheapple/entity-deduction-arena
    Authors
    Yizhe Zhang
    Description

    Entity-Deduction Arena (EDA)

    This dataset complements the paper Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games, presented at the ACL 2024 main conference. The main repo can be found at https://github.com/apple/ml-entity-deduction-arena

    Motivation

    There is a demand to assess the capability of LLMs to ask clarifying questions in order to effectively resolve ambiguities when confronted with vague queries. This capability demands a sophisticated… See the full description on the dataset page: https://huggingface.co/datasets/yizheapple/entity-deduction-arena.

  9. watch-market-gnn

    • huggingface.co
    Updated Nov 26, 2024
    Cite
    Mukundan (2024). watch-market-gnn [Dataset]. https://huggingface.co/datasets/TMVishnu/watch-market-gnn
    Authors
    Mukundan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Watch Market Analysis Graph Neural Network Dataset

    Link:

    GitHub link to the code that generated this dataset: watch-market-gnn-code. Link to the interactive EDA hosted on a website: Watch Market Analysis Report.

    Summary, Dataset Description, Technical Details, Exploratory Data Analysis, Ethics and Limitations, Usage

    Detailed Table of Contents

    Summary: Key Statistics, Primary Use Cases

    Dataset Description: Data Structure, Features, Network Properties… See the full description on the dataset page: https://huggingface.co/datasets/TMVishnu/watch-market-gnn.

  10. Datasets for manuscript "Tracking end-of-life stage of chemicals: a scalable...

    • catalog.data.gov
    • s.cnmilf.com
    Updated May 30, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Datasets for manuscript "Tracking end-of-life stage of chemicals: a scalable data-centric and chemical-centric approach" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-tracking-end-of-life-stage-of-chemicals-a-scalable-data-centric-an
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    As described in the README.md file, the GitHub repository PRTR_transfers contains Python scripts that run a data-centric and chemical-centric framework for tracking end-of-life (EoL) chemical flow transfers, identifying potential EoL exposure scenarios, and performing Chemical Flow Analysis (CFA). The Extract, Transform, and Load (ETL) pipeline leverages publicly accessible Pollutant Release and Transfer Register (PRTR) systems belonging to Organization for Economic Cooperation and Development (OECD) member countries. The Life Cycle Inventory (LCI) data obtained by the ETL is stored in a Structured Query Language (SQL) database called PRTR_transfers that could be connected to Machine Learning Operations (MLOps) in production environments, making the framework scalable for real-world applications. The data ingestion pipeline can supply data at an annual rate, so that labeled data can be ingested into data-driven models if retraining is needed, especially to address problems such as data and concept drift that could drastically affect model performance. The README also describes the Python libraries required to run the code, how to use it, the output files obtained after running the Python script, and how to obtain all manuscript figures (file Manuscript Figures-EDA.ipynb) and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Tracking end-of-life stage of chemicals: A scalable data-centric and chemical-centric approach. Resources, Conservation and Recycling. Elsevier Science BV, Amsterdam, NETHERLANDS, 196: 107031, (2023).

  11. BIMCV-Prostate-Dataset V1

    • zenodo.org
    Updated Aug 23, 2024
    Cite
    Jesus Alejandro Alzate-Grisales; Maria de la Iglesia Vaya (2024). BIMCV-Prostate-Dataset V1 [Dataset]. http://doi.org/10.5281/zenodo.13254318
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jesus Alejandro Alzate-Grisales; Maria de la Iglesia Vaya
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The BIMCV Prostate Dataset is a comprehensive and diverse dataset that includes a total of 9,341 prostate MRI sessions, distributed among 8,441 subjects, collected from 16 healthcare centers in the Valencian Community, Spain. This dataset is structured according to the MIDS (Medical Imaging Data Structure) standard, ensuring consistent and accessible organization for researchers, facilitating data use and analysis.

    The first version of the dataset focuses on sessions that contain the three mentioned imaging modalities (T2W, DWI, and ADC), resulting in a total of 1,730 complete sessions, with a total of 4,663 samples for training, of which 2,594 are csPCa positive and 2,069 are csPCa negative. This information can be found in the table available on GitHub.

    The dataset includes MRI images in three modalities: T2-weighted images (T2W), diffusion-weighted images (DWI), and apparent diffusion coefficient (ADC) maps. In total, the dataset includes 32,662 T2W images (62.97%), 8,036 DWI images (15.49%), and 11,167 ADC maps (21.53%), including both the original maps and those calculated from the available DWI images. This additional calculation process was carried out to ensure the dataset's integrity and consistency, allowing for comprehensive analysis in the field of prostate oncology.
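    The quoted percentages can be checked with a few lines of arithmetic. The sketch below assumes each percentage is that modality's share of the sum of the three image counts, which the description implies but does not state outright.

```python
# Image counts per modality, as quoted in the description.
counts = {"T2W": 32_662, "DWI": 8_036, "ADC": 11_167}

# Assumption: the percentages are shares of the combined image count.
total = sum(counts.values())

# Share of each modality in percent, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)   # 51865
print(shares)
```

    Under that assumption the computed shares agree with the quoted figures to within rounding in the last digit.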

    The exploratory data analysis (EDA) performed on this dataset has provided insights into the characteristics and distribution of the images, ensuring the dataset's representativeness and diversity. For example, it was found that Health Center 5 contributed the highest proportion of sessions (15.6%), followed by Health Center 7 (12.3%) and Health Center 17 (10.5%). This level of diversity in data sources ensures that the dataset encompasses a wide range of imaging acquisition practices and patient demographics, improving the generalization of artificial intelligence models developed with this data.

    Additionally, the analysis of the distribution by MRI equipment manufacturer revealed that most images were acquired with General Electric equipment (66.7%), followed by Philips (25.1%) and Siemens (8.13%). Similarly, most sessions were conducted with 1.5 Tesla machines (63%), followed by 3.0 Tesla machines (36.5%), reflecting standard clinical practices in the region.

    Regarding the distribution of labels within the dataset, of the total cases, 4,871 (approximately 52%) are labeled as csPCa positive, while 3,514 cases (approximately 37%) are labeled as csPCa negative.

    To access the dataset, please fill out the following survey: https://forms.office.com/e/frV3A5dT6r




  12. Smartphones Dataset (August 2024)

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Dilkush Singh (2024). Smartphones Dataset (August 2024) [Dataset]. https://www.kaggle.com/datasets/dilkushsingh/smartphones-dataset-upto-july24
    Available download formats: zip (605033 bytes)
    Authors
    Dilkush Singh
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Smartphones Dataset (August 2024)

    This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
    - To learn about the web scraping process, read the blog post: Medium Article
    - To see the step-by-step data cleaning and EDA process, check out my GitHub repo: GitHub Repo

    Dataset Versions:

    Version 1: Raw Data (smartphones.csv or smartphones_uncleaned.csv - same files)

    This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.

    Version 2: Basic Cleaning (smartphones_cleaned_v1.csv)

    Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making it easier for basic analysis.

    Version 3: Intermediate Cleaning (smartphones_cleaned_v2.csv)

    Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.

    Version 4: Fully Cleaned and Processed Data (smartphones_cleaned_v3.csv)

    This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.

  13. MMCircuitEval

    • huggingface.co
    Cite
    Charlie Zhao, MMCircuitEval [Dataset]. https://huggingface.co/datasets/charlie314159/MMCircuitEval
    Authors
    Charlie Zhao
    Description

    MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs

    Paper GitHub

    Introduction

    MMCircuitEval is a multimodal benchmark specifically designed to assess MLLM performance comprehensively across diverse EDA tasks. MMCircuitEval comprises 3614 meticulously curated question-answer (QA) pairs covering:

    - Both digital and analog circuits
    - Critical EDA stages, ranging from general knowledge and specifications to front-end and back-end design
    - A… See the full description on the dataset page: https://huggingface.co/datasets/charlie314159/MMCircuitEval.

  14. Exploratory Data Analysis (EDA) for COVIND-19

    • kaggle.com
    zip
    Updated Apr 8, 2024
    Cite
    Badea-Matei Iuliana (2024). Exploratory Data Analysis (EDA) for COVIND-19 [Dataset]. https://www.kaggle.com/datasets/mateiiuliana/exploratory-data-analysis-eda-for-covind-19
    Available download formats: zip (26972 bytes)
    Authors
    Badea-Matei Iuliana
    Description

    Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.

    Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns perfectly with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. Here are the key objectives:
    1. Data Collection and Cleaning:
       • Gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies).
       • Clean and preprocess the data to ensure accuracy and consistency.
    2. Descriptive Statistics:
       • Summarize key statistics: total cases, recoveries, deaths, and testing rates.
       • Visualize temporal trends using line charts, bar plots, and heat maps.
    3. Geospatial Analysis:
       • Map COVID-19 cases across countries, regions, or cities.
       • Identify hotspots and variations in infection rates.
    4. Demographic Insights:
       • Explore how age, gender, and pre-existing conditions impact vulnerability.
       • Investigate disparities in infection rates among different populations.
    5. Healthcare System Impact:
       • Analyze hospitalization rates, ICU occupancy, and healthcare resource allocation.
       • Assess the strain on medical facilities.
    6. Economic and Social Effects:
       • Investigate the relationship between lockdown measures, economic indicators, and infection rates.
       • Explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
    7. Predictive Modeling (Optional):
       • If data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.

    Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.

    Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.

    License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.

    Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.

  15. Smart TV and their specifications from smartprix

    • kaggle.com
    zip
    Updated Feb 17, 2023
    Cite
    J PAWAN KUMAR (2023). Smart TV and their specifications from smartprix [Dataset]. https://www.kaggle.com/datasets/justperky/smart-tv-and-their-specifications-from-smartprix/code
    Available download formats: zip (55701 bytes)
    Authors
    J PAWAN KUMAR
    Description

    This dataset was created by web scraping the TV pages of smartprix.com. The goal is to develop a TV price predictor using machine learning techniques. The dataset is untidy and messy: many columns hold values that are not split correctly, so it must be cleaned before further analysis and prediction.

    If you want to know the source and the process behind the dataset, you can visit my GitHub profile: https://github.com/JUSTPERKY/Data-Gathering-From-Websites

    'This is my first time creating a dataset through web scraping'

    Note: I updated the dataset again with more columns, as some values have been shifted into the new columns, and the dataset needs to be rearranged in order to clean it.

  16. Phishing URL Content Dataset

    • kaggle.com
    zip
    Updated Nov 25, 2024
    Cite
    Aaditey Pillai (2024). Phishing URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/aaditeypillai/phishing-website-content-dataset
    Available download formats: zip (62701 bytes)
    Authors
    Aaditey Pillai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Phishing URL Content Dataset

    Executive Summary

    Motivation:
    Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.

    Applications:
    - Building robust phishing detection systems.
    - Enhancing security measures in email filtering and web browsing.
    - Training cybersecurity practitioners in identifying malicious URLs.

    The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.

    Description of Data

    This dataset comprises two types of URLs:
    1. Phishing URLs: Malicious URLs designed to deceive users.
    2. Benign URLs: Legitimate URLs posing no harm to users.

    Key Features:
    - URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
    - Content-based features: Link density, iframe presence, external/internal links, and metadata.
    - Certificate-based features: SSL/TLS details like validity period and organization.
    - WHOIS data: Registration details like creation and expiration dates.

    Statistics:
    - Total Samples: 800 (400 phishing, 400 benign).
    - Features: 22 including URL, domain, link density, and SSL attributes.

    Power Analysis

    To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.

    Exploratory Data Analysis (EDA)

    Insights from EDA:
    - Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
    - Bar Plots: Class distribution and protocol usage trends.
    - Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
    - Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.

    EDA visualizations are provided in the repository.

    Link to Publicly Available Data and Code

    The repository contains the Python code used to extract features, conduct EDA, and build the dataset.

    Ethics Statement

    Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
    1. Protects User Privacy: No personally identifiable information is included.
    2. Promotes Ethical Use: Intended solely for academic and research purposes.
    3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.

    Risks:
    - Misuse of the dataset for creating more deceptive phishing attacks.
    - Over-reliance on outdated features as phishing tactics evolve.

    Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.

    Open Source License

    This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.

  17. ΕΣΠΑ 2007 - 2013 - Πράξεις ανά Φορέα Διαχείρισης - ΕΥΔΕΠ ΥΜΕΠΕΡΑ - π. ΕΔΑ...

    • staging.data.gov.gr
    • catalog.data.gov.gr
    Updated Oct 19, 2025
    Cite
    data.gov.gr (2025). ΕΣΠΑ 2007 - 2013 - Πράξεις ανά Φορέα Διαχείρισης - ΕΥΔΕΠ ΥΜΕΠΕΡΑ - π. ΕΔΑ ΜΕΤΑΦΟΡΩΝ [Dataset]. https://staging.data.gov.gr/dataset/espa-2007-2013-pra3eis-ana-forea-diaxeirishs-eydep-ymepera-p-eda-metaforwn
    Explore at:
    Dataset updated
    Oct 19, 2025
    Dataset provided by
    Data.gov (https://data.gov/)
    Description

    ΕΣΠΑ (NSRF) 2007 - 2013 - Operations by Managing Authority - ΕΥΔΕΠ Transport Infrastructure, Environment and Sustainable Development (formerly ΕΔΑ Transport)

  18. Life Quality and Crime Rate

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Niful Islam (2023). Life Quality and Crime Rate [Dataset]. https://www.kaggle.com/datasets/naifislam/life-quality-and-crime-rate/discussion
    Explore at:
    Available download formats: zip (3424 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Niful Islam
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    The problem involves scraping two websites (Life Quality and Crime Rate) to collect life quality and crime rate data, merging the two by country name, and conducting EDA in Tableau to find insights. For details, see: https://github.com/NifulIslam/Life-Quality-and-Crime-Rate-Scrapping-and-EDA
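
    The merge-by-country step can be sketched as follows. This is a minimal illustration, not the author's code: the field names (`country`, `quality_index`, `crime_index`), the helper names, and all values are hypothetical placeholders, and the name normalization is an assumption about how mismatched spellings might be handled.

```python
def normalize(name):
    """Lower-case a country name and collapse whitespace so keys match."""
    return " ".join(name.strip().lower().split())


def merge_by_country(life_quality, crime_rate):
    """Inner-join two lists of row dicts on the normalized country name."""
    crime_by_country = {normalize(row["country"]): row for row in crime_rate}
    merged = []
    for row in life_quality:
        match = crime_by_country.get(normalize(row["country"]))
        if match is not None:  # keep only countries present in both sources
            merged.append({**row, "crime_index": match["crime_index"]})
    return merged


# Hypothetical placeholder rows, for illustration only.
life_quality = [
    {"country": "Country A", "quality_index": 150.0},
    {"country": "Country B", "quality_index": 120.0},
    {"country": "Country C", "quality_index": 90.0},
]
crime_rate = [
    {"country": " country a ", "crime_index": 30.0},
    {"country": "COUNTRY B", "crime_index": 45.0},
]

for row in merge_by_country(life_quality, crime_rate):
    print(row)
```

    Countries that appear in only one source are dropped here; keeping them with a missing value instead would be an equally reasonable choice, depending on the analysis.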

  19. Nepali News Data

    • kaggle.com
    zip
    Updated May 27, 2022
    Cite
    Durga Pokharel (2022). Nepali News Data [Dataset]. https://www.kaggle.com/datasets/durgapokharel/nepalinewsdata
    Explore at:
    Available download formats: zip (11891865 bytes)
    Dataset updated
    May 27, 2022
    Authors
    Durga Pokharel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Nepal
    Description

    This repo contains the news data in tabular format. Please read about how the data was prepared from different sources:
    * Nepali News (Gorkhapatra) Scrapping Using BeautifulSoup and Python
    * Nepali News (ekantipur) Scrapping Using BeautifulSoup and Python
    * Nepali News (Onlinekhabar Post) Scrapping Using BeautifulSoup and Python

    I have also done some work on this dataset, which can be found in the blogs below:
    * EDA on Nepali News
    * Nepali News Classification with Naive Bayes
    * Nepali News Classification with Logistic Regression

  20. Marketing Analytics

    • kaggle.com
    zip
    Updated Mar 6, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jack Daoud (2022). Marketing Analytics [Dataset]. https://www.kaggle.com/datasets/jackdaoud/marketing-data/discussion
    Explore at:
    Available download formats: zip (658411 bytes)
    Dataset updated
    Mar 6, 2022
    Authors
    Jack Daoud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This data is publicly available on GitHub here. It can be utilized for EDA, Statistical Analysis, and Visualizations.

    Content

    The data set ifood_df.csv consists of 2206 customers of XYZ company with data on:
    - Customer profiles
    - Product preferences
    - Campaign successes/failures
    - Channel performance

    Acknowledgement

    I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.
