This is transaction data from an electronics store chain in the US. The data comprises 12 CSV files, one for each month of 2019.
The naming convention is as follows: Sales_[MONTH_NAME]_2019
Each file contains roughly 9,000 to 26,000 rows and 6 columns. The columns are as follows:
Order ID, Product, Quantity Ordered, Price Each, Order Date, Purchase Address
There are around 186,851 data points when all 12 monthly files are combined. Some rows may contain null values.
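A minimal pandas sketch of combining the monthly files (the file pattern follows the naming convention above; everything else is illustrative, not the dataset author's code):

```python
import glob
import pandas as pd

# Combine the 12 monthly files into one frame
files = glob.glob("Sales_*_2019.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

df = df.dropna(how="all")  # drop rows that are entirely null
# Coerce numeric columns; malformed values become NaN
df["Quantity Ordered"] = pd.to_numeric(df["Quantity Ordered"], errors="coerce")
df["Price Each"] = pd.to_numeric(df["Price Each"], errors="coerce")
print(len(df))  # roughly 186,851 rows expected
```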
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
About Dataset
Description
The dataset was downloaded from Kaggle.
Context
Google Play Store Android app data (2.3 million+ apps).
Backup repo: https://github.com/gauthamp10/Google-Playstore-Dataset
Content
I collected the data with a Python script (Scrapy) running on a cloud VM instance.
The data was collected in June 2025.
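For illustration, a minimal Scrapy spider of the kind described (the entry URL, selectors, and fields are hypothetical; the actual collection script is not published here):

```python
import scrapy

class PlayStoreSpider(scrapy.Spider):
    name = "playstore"
    start_urls = ["https://play.google.com/store/apps"]  # hypothetical entry point

    def parse(self, response):
        # Follow links to app detail pages (selector is hypothetical)
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/store/apps/details"):
                yield response.follow(href, callback=self.parse_app)

    def parse_app(self, response):
        # Yield one record per app (fields are hypothetical)
        yield {"app_name": response.css("h1 ::text").get(), "url": response.url}
```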
Also check out:
Apple AppStore Apps dataset: https://www.kaggle.com/gauthamp10/apple-appstore-apps
Android App Permission dataset: https://www.kaggle.com/gauthamp10/app-permissions-android
Acknowledgements
I couldn't have built this dataset without the help of GitHub Education, and I switched to facundoolano/google-play-scraper for practical reasons.
Inspiration
Took inspiration from: https://www.kaggle.com/lava18/google-play-store-apps to build a big database for students and researchers.
[UPDATED EVERY WEEK]
Have you ever wondered how popular the Python libraries you use regularly on Kaggle (such as pandas and NumPy) are?
This dataset lists the top 100 Python projects (or libraries) PER DAY, ranked by the number of GitHub stars, starting from 18 December 2018, almost 5 years back!
Attributes
date: Date on which the record was collected
rank: 1-100, rank based on number of GitHub stars, sorted in decreasing order
item: The tracked language (always "Python" for this dataset)
repo_name: Name of the GitHub repository of the Python project (library)
stars: Number of stars of the GitHub repo
forks: Number of forks of the GitHub repo
language: The language the repository is written in
repo_url: The link to the GitHub repository
username: Creator of the GitHub repository
issues: Number of active issues raised in the GitHub repository
last_commit: The time of the most recent commit
description: Description of the Python project (library)
Reference: https://github.com/EvanLi/Github-Ranking
EDA: https://www.kaggle.com/code/yeoyunsianggeremie/eda-of-popular-python-libraries-used-in-kaggle
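A minimal pandas sketch for exploring the attributes above (the CSV file name is hypothetical; the columns are the ones listed):

```python
import pandas as pd

df = pd.read_csv("top-100-python-projects.csv", parse_dates=["date"])

# Star history of a single library, e.g. pandas
history = df[df["repo_name"] == "pandas"].sort_values("date")
print(history[["date", "rank", "stars", "forks"]].tail())

# Top 10 on the most recent day
latest = df[df["date"] == df["date"].max()].nsmallest(10, "rank")
print(latest[["rank", "repo_name", "stars"]])
```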
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Opencores
In the process of continual pre-training, we utilized the publicly available VGen dataset. VGen aggregates Verilog repositories from GitHub, systematically filters out duplicates and excessively large files, and retains only those files containing module and endmodule statements. We also incorporated the CodeSearchNet dataset, which contains approximately 40 MB of function code and its documentation.… See the full description on the dataset page: https://huggingface.co/datasets/LLM-EDA/vgen_cpp.
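A minimal sketch of the kind of filtering VGen is described as performing (the size cutoff and directory layout are assumptions, not VGen's actual pipeline):

```python
from pathlib import Path

MAX_BYTES = 1_000_000  # hypothetical "excessively large" cutoff

def keep_file(path: Path) -> bool:
    # Drop oversized files; keep only files with module/endmodule statements
    if path.stat().st_size > MAX_BYTES:
        return False
    text = path.read_text(errors="ignore")
    return "module" in text and "endmodule" in text

verilog_files = [p for p in Path("repos").rglob("*.v") if keep_file(p)]
print(len(verilog_files), "files kept")
```

Deduplication, also mentioned above, could be added by hashing file contents and keeping one file per hash.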
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This is my first EDA. I took the data off Kaggle, took a sample of all accidents since 1919, and did an EDA on them using Python, pandas, NumPy, and Matplotlib.
As an aspiring data analyst/scientist I am not so familiar with Git or Kaggle yet, so please forgive any GitHub errors.
This dataset, named Zara Sales for EDA, was created by combining several public fashion datasets from GitHub and Kaggle. It focuses on Zara products and includes information such as product name, description, price, category, and sales volume. Additional columns such as season and url were also added. For example, the season column is determined from the product name (e.g., "jacket" → Winter/Autumn), and the url column is built by concatenating the base Zara site link with the product title. The original dataset had around 7 thousand rows, so oversampling was applied to increase the data volume and balance the categories for better analysis.
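A minimal sketch of the season/url derivation described above (column names, keyword mapping, and URL format are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"product_name": ["Leather Jacket", "Linen Shorts"]})

SEASON_KEYWORDS = {"jacket": "Winter/Autumn", "coat": "Winter", "shorts": "Summer"}

def infer_season(name: str) -> str:
    lowered = name.lower()
    for keyword, season in SEASON_KEYWORDS.items():
        if keyword in lowered:
            return season
    return "All-season"  # hypothetical default

df["season"] = df["product_name"].apply(infer_season)
# url: base Zara link plus a slug built from the product title (format assumed)
df["url"] = "https://www.zara.com/" + df["product_name"].str.lower().str.replace(" ", "-")
print(df)
```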
GNU LGPL v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
This dataset contains data about bestseller books from large retailers such as Amazon, eBay, and Barnes & Noble. The goal of this dataset is to use the data to get insights into which books would be more profitable. We have 6 files: 3 of them are cleaned, and the others contain data collected directly via web scraping. You can see more details on GitHub.
Entity-Deduction Arena (EDA)
This dataset complements the paper Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games, presented at the ACL 2024 main conference. The main repo can be found at https://github.com/apple/ml-entity-deduction-arena
Motivation
There is a demand for assessing the capability of LLMs to ask clarifying questions in order to effectively resolve ambiguities when confronted with vague queries. This capability demands a sophisticated… See the full description on the dataset page: https://huggingface.co/datasets/yizheapple/entity-deduction-arena.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Watch Market Analysis Graph Neural Network Dataset
Links:
GitHub link to the code that generated this dataset: watch-market-gnn-code
Link to the interactive EDA hosted on a website: Watch Market Analysis Report
Contents: Summary, Dataset Description, Technical Details, Exploratory Data Analysis, Ethics and Limitations, Usage
Detailed Table of Contents
Summary: Key Statistics, Primary Use Cases
Dataset Description: Data Structure, Features, Network Properties… See the full description on the dataset page: https://huggingface.co/datasets/TMVishnu/watch-market-gnn.
As described in the README.md file, the GitHub repository PRTR_transfers contains Python scripts written to run a data-centric and chemical-centric framework for tracking EoL chemical flow transfers, identifying potential EoL exposure scenarios, and performing Chemical Flow Analysis (CFA). The Extract, Transform, and Load (ETL) pipeline leverages publicly accessible Pollutant Release and Transfer Register (PRTR) systems belonging to Organization for Economic Cooperation and Development (OECD) member countries. The Life Cycle Inventory (LCI) data obtained by the ETL is stored in a Structured Query Language (SQL) database called PRTR_transfers that can be connected to Machine Learning Operations (MLOps) in production environments, making the framework scalable for real-world applications. The data ingestion pipeline can supply data at an annual rate, ensuring labeled data can be ingested into data-driven models if retraining is needed, especially to address problems like data and concept drift that could drastically affect the performance of data-driven models. The README also describes the Python libraries required for running the code, how to use it, the output files produced after running the Python script, and how to obtain all manuscript figures (file Manuscript Figures-EDA.ipynb) and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Tracking end-of-life stage of chemicals: A scalable data-centric and chemical-centric approach. Resources, Conservation and Recycling. Elsevier Science BV, Amsterdam, NETHERLANDS, 196: 107031, (2023).
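A minimal sketch of querying the PRTR_transfers SQL database with pandas (the connection string and table name are assumptions; see the repository's README for the actual schema):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust driver, credentials, and host
engine = create_engine("mysql+pymysql://user:password@localhost/PRTR_transfers")
transfers = pd.read_sql("SELECT * FROM transfer_record LIMIT 10", engine)
print(transfers.head())
```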
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BIMCV Prostate Dataset is a comprehensive and diverse dataset that includes a total of 9,341 prostate MRI sessions, distributed among 8,441 subjects, collected from 16 healthcare centers in the Valencian Community, Spain. This dataset is structured according to the MIDS (Medical Imaging Data Structure) standard, ensuring consistent and accessible organization for researchers, facilitating data use and analysis.
The first version of the dataset focuses on sessions that contain all three imaging modalities (T2W, DWI, and ADC), resulting in a total of 1,730 complete sessions and 4,663 training samples, of which 2,594 are csPCa positive and 2,069 are csPCa negative. This information can be found in the table available on GitHub.
The dataset includes MRI images in three modalities: T2-weighted images (T2W), diffusion-weighted images (DWI), and apparent diffusion coefficient (ADC) maps. In total, the dataset includes 32,662 T2W images (62.97%), 8,036 DWI images (15.49%), and 11,167 ADC maps (21.53%), including both the original maps and those calculated from the available DWI images. This additional calculation process was carried out to ensure the dataset's integrity and consistency, allowing for comprehensive analysis in the field of prostate oncology.
The exploratory data analysis (EDA) performed on this dataset has provided insights into the characteristics and distribution of the images, ensuring the dataset's representativeness and diversity. For example, it was found that Health Center 5 contributed the highest proportion of sessions (15.6%), followed by Health Center 7 (12.3%) and Health Center 17 (10.5%). This level of diversity in data sources ensures that the dataset encompasses a wide range of imaging acquisition practices and patient demographics, improving the generalization of artificial intelligence models developed with this data.
Additionally, the analysis of the distribution by MRI equipment manufacturer revealed that most images were acquired with General Electric equipment (66.7%), followed by Philips (25.1%) and Siemens (8.13%). Similarly, most sessions were conducted with 1.5 Tesla machines (63%), followed by 3.0 Tesla machines (36.5%), reflecting standard clinical practices in the region.
Regarding the distribution of labels within the dataset, of the total cases, 4,871 (approximately 52%) are labeled as csPCa positive, while 3,514 cases (approximately 37%) are labeled as csPCa negative.
To access the dataset, please fill out the following survey: https://forms.office.com/e/frV3A5dT6r
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
- If you want to know about the web scraping process, read the blog: Medium Article
- If you want to see the step-by-step process of data cleaning and EDA, check out my GitHub repo: GitHub Repo
Version 1: This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.
Version 2: Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making basic analysis easier. (A short cleaning sketch follows below.)
Version 3: Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.
Version 4: This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.
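A minimal sketch of the basic cleaning steps described for Version 2 (file, column names, and formats are hypothetical):

```python
import pandas as pd

df = pd.read_csv("smartphones_raw.csv")  # hypothetical file name

# Remove duplicates
df = df.drop_duplicates()

# Standardize a numeric field scraped as text, e.g. "₹79,999" -> 79999.0
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# Handle missing values: drop rows missing essentials, fill the rest
df = df.dropna(subset=["model", "price"])
df["rating"] = df["rating"].fillna(df["rating"].median())
```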
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
Paper · GitHub
Introduction
MMCircuitEval is a multimodal benchmark specifically designed to comprehensively assess MLLM performance across diverse EDA tasks. MMCircuitEval comprises 3,614 meticulously curated question-answer (QA) pairs covering:
- Both digital and analog circuits
- Critical EDA stages, ranging from general knowledge and specifications to front-end and back-end design
- A… See the full description on the dataset page: https://huggingface.co/datasets/charlie314159/MMCircuitEval.
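A minimal loading sketch, assuming the standard Hugging Face datasets API (split and field names may differ; inspect the printed object):

```python
from datasets import load_dataset

ds = load_dataset("charlie314159/MMCircuitEval")
print(ds)  # inspect available splits and features
```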
Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.
Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns perfectly with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. Here are the key objectives:
1. Data Collection and Cleaning:
• Gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies).
• Clean and preprocess the data to ensure accuracy and consistency.
2. Descriptive Statistics:
• Summarize key statistics: total cases, recoveries, deaths, and testing rates.
• Visualize temporal trends using line charts, bar plots, and heat maps (see the sketch after this list).
3. Geospatial Analysis:
• Map COVID-19 cases across countries, regions, or cities.
• Identify hotspots and variations in infection rates.
4. Demographic Insights:
• Explore how age, gender, and pre-existing conditions impact vulnerability.
• Investigate disparities in infection rates among different populations.
5. Healthcare System Impact:
• Analyze hospitalization rates, ICU occupancy, and healthcare resource allocation.
• Assess the strain on medical facilities.
6. Economic and Social Effects:
• Investigate the relationship between lockdown measures, economic indicators, and infection rates.
• Explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
7. Predictive Modeling (Optional):
• If data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
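A minimal sketch of objective 2, descriptive statistics and temporal trends (file and column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("covid19_global.csv", parse_dates=["date"])

# Key totals
print(df[["cases", "deaths", "recoveries"]].sum())

# Temporal trend: daily cases worldwide
daily = df.groupby("date")["cases"].sum()
daily.plot(title="Daily COVID-19 cases (worldwide)")
plt.xlabel("Date")
plt.ylabel("Cases")
plt.show()
```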
Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.
Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.
License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.
Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.
This dataset was created by web scraping the TVs page of smartprix.com. The goal of creating this dataset is to develop a TV price predictor using machine learning techniques. This is an untidy and messy dataset, as many columns have values that are not split in a correct manner. We need to first clean the dataset for further analysis and predictions.
If you want to know the source and the process behind the dataset, you can visit my GitHub profile: https://github.com/JUSTPERKY/Data-Gathering-From-Websites
'This is my first time creating a dataset through web scraping.'
Note: I updated the dataset again with more columns, as some values have been shifted into the new columns, and we need to rearrange the dataset in order to clean it.
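A minimal sketch of the kind of column splitting this dataset needs (the combined column and its format are hypothetical examples of the mis-split values described above):

```python
import pandas as pd

df = pd.read_csv("smartprix_tvs.csv")  # hypothetical file name

# Split a combined field like "55 inch, 4K Ultra HD" into two columns
parts = df["size_resolution"].str.split(",", n=1, expand=True)
df["screen_size"] = parts[0].str.strip()
df["resolution"] = parts[1].str.strip()
```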
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Motivation:
Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.
Applications:
- Building robust phishing detection systems.
- Enhancing security measures in email filtering and web browsing.
- Training cybersecurity practitioners in identifying malicious URLs.
The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.
This dataset comprises two types of URLs:
1. Phishing URLs: Malicious URLs designed to deceive users.
2. Benign URLs: Legitimate URLs posing no harm to users.
Key Features:
- URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links (see the sketch after this list).
- Content-based features: Link density, iframe presence, external/internal links, and metadata.
- Certificate-based features: SSL/TLS details like validity period and organization.
- WHOIS data: Registration details like creation and expiration dates.
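A minimal sketch of extracting URL-based features like those listed above (the exact feature set and names are illustrative, not the dataset's build code):

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "domain": parsed.netloc,
        # IP-based links such as http://192.0.2.10/login
        "is_ip_based": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}(:\d+)?", parsed.netloc)),
        "num_subdomains": max(parsed.netloc.count(".") - 1, 0),
    }

print(url_features("http://192.0.2.10/login"))
```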
Statistics:
- Total Samples: 800 (400 phishing, 400 benign).
- Features: 22 including URL, domain, link density, and SSL attributes.
To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.
Insights from EDA:
- Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
- Bar Plots: Class distribution and protocol usage trends.
- Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
- Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.
EDA visualizations are provided in the repository.
The repository contains the Python code used to extract features, conduct EDA, and build the dataset.
Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
1. Protects User Privacy: No personally identifiable information is included.
2. Promotes Ethical Use: Intended solely for academic and research purposes.
3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.
Risks:
- Misuse of the dataset for creating more deceptive phishing attacks.
- Over-reliance on outdated features as phishing tactics evolve.
Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.
This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.
NSRF (ΕΣΠΑ) 2007-2013 - Operations per Management Body - Managing Authority for Transport Infrastructure, Environment and Sustainable Development (formerly EDA Transport)
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The problem involves scraping two websites (Life Quality and Crime Rate) to collect life quality and crime rate data, merging them by country name, and conducting EDA in Tableau to find insights. For details, see: https://github.com/NifulIslam/Life-Quality-and-Crime-Rate-Scrapping-and-EDA
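A minimal sketch of the merge step described above (file and column names are hypothetical):

```python
import pandas as pd

quality = pd.read_csv("life_quality.csv")  # columns assumed: country, quality_index, ...
crime = pd.read_csv("crime_rate.csv")      # columns assumed: country, crime_index, ...

# Normalize country names before merging to avoid case/whitespace mismatches
for df in (quality, crime):
    df["country"] = df["country"].str.strip().str.title()

merged = quality.merge(crime, on="country", how="inner")
merged.to_csv("life_quality_crime.csv", index=False)
```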
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This repo contains the news data in tabular format. Please read about how the data was prepared from the different sources (a scraping sketch follows below):
* Nepali News (Gorkhapatra) Scraping Using BeautifulSoup and Python
* Nepali News (ekantipur) Scraping Using BeautifulSoup and Python
* Nepali News (Onlinekhabar Post) Scraping Using BeautifulSoup and Python
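A minimal sketch of the BeautifulSoup-based scraping approach referenced above (the URL and selectors are placeholders; see the linked posts for the real process):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/nepali-news")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

articles = []
for item in soup.select("article"):  # hypothetical selector
    title = item.select_one("h2")
    body = item.select_one("p")
    if title and body:
        articles.append({"title": title.get_text(strip=True),
                         "text": body.get_text(strip=True)})
print(len(articles), "articles scraped")
```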
I have also done some work on this dataset, which can be found in the blogs below:
* EDA on Nepali News
* Nepali News Classification with Naive Bayes
* Nepali News Classification with Logistic Regression
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This data is publicly available on GitHub here. It can be utilized for EDA, Statistical Analysis, and Visualizations.
The dataset ifood_df.csv consists of 2,206 customers of XYZ company with data on:
- Customer profiles
- Product preferences
- Campaign successes/failures
- Channel performance
I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.