16 datasets found
  1. Bank Data Analysis

    • kaggle.com
    zip
    Updated Feb 23, 2022
    Cite
    Steve Gallegos (2022). Bank Data Analysis [Dataset]. https://www.kaggle.com/stevegallegos/bank-marketing-data-set
    Explore at:
    zip (376757 bytes)
    Dataset updated
    Feb 23, 2022
    Authors
    Steve Gallegos
    Description

    Data Set Information

    The bank.csv dataset describes phone calls between customers and the customer-care staff of a Portuguese banking institution. Each record captures whether the customer took up the product being offered, a bank term deposit, so most fields carry 'yes' or 'no' values.

    Goal

    The main goal is to predict if clients will subscribe to a term deposit or not.

    Attribute Information

    Input Variables

    Bank Client Data:
    1 - age (numeric)
    2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
    3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
    4 - education (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
    5 - default: has credit in default? (categorical: no, yes, unknown)
    6 - housing: has housing loan? (categorical: no, yes, unknown)
    7 - loan: has personal loan? (categorical: no, yes, unknown)

    Related with the Last Contact of the Current Campaign:
    8 - contact: contact communication type (categorical: cellular, telephone)
    9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
    10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
    11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

    Other Attributes:
    12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
    14 - previous: number of contacts performed before this campaign and for this client (numeric)
    15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)

    Social and Economic Context Attributes:
    16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
    17 - cons.price.idx: consumer price index - monthly indicator (numeric)
    18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
    19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
    20 - nr.employed: number of employees - quarterly indicator (numeric)

    Output Variable (Desired Target):
    21 - y (deposit): has the client subscribed a term deposit? (binary: yes, no). Note: the column title was changed from 'y' to 'deposit'.
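
    The note on duration matters in practice: a realistic model must drop it before training. A minimal pandas sketch of that preprocessing, assuming the file is named bank.csv (some copies are semicolon-delimited) and the target column has been renamed to deposit as described above:

        import pandas as pd

        df = pd.read_csv("bank.csv")  # use sep=";" if the file is semicolon-delimited

        # 'duration' is only known after the call ends, so exclude it
        # whenever the goal is a deployable model rather than a benchmark.
        X = df.drop(columns=["duration", "deposit"])
        y = df["deposit"].map({"yes": 1, "no": 0})

        # One-hot encode the categorical inputs before fitting any model.
        X = pd.get_dummies(X, drop_first=True)
        print(X.shape, y.mean())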

    Source

    [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

  2. UK_Flight_Data Statistics_2018

    • kaggle.com
    zip
    Updated Jan 29, 2019
    Cite
    Ferhat Culfaz (2019). UK_Flight_Data Statistics_2018 [Dataset]. https://www.kaggle.com/ferhat00/uk-flight-stats-2018
    Explore at:
    zip (103089 bytes)
    Dataset updated
    Jan 29, 2019
    Authors
    Ferhat Culfaz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United Kingdom
    Description

    An analysis of the flight punctuality statistics using pandas and seaborn. Source data from: https://www.caa.co.uk/Data-and-analysis/UK-aviation-market/Flight-reliability/Datasets/Punctuality-data/Punctuality-statistics-2018/

    Open the csv into a pandas dataframe and analyse using Seaborn.
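
    A minimal sketch of that workflow; the file name and column names below are illustrative assumptions, not the actual CAA schema:

        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt

        df = pd.read_csv("punctuality_statistics_2018.csv")
        print(df.head())

        # Hypothetical columns: delay distribution per reporting airport.
        sns.boxplot(x="reporting_airport", y="average_delay_mins", data=df)
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()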

  3. Downsampled data from FlowRepository: FR-FCM-Z3WR

    • figshare.com
    csv
    Updated Dec 2, 2024
    Cite
    Daniel Tyrrell (2024). Downsampled data from FlowRepository: FR-FCM-Z3WR [Dataset]. http://doi.org/10.6084/m9.figshare.27940719.v1
    Explore at:
    csv
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Tyrrell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spectral flow cytometry provides greater insight into cellular heterogeneity through the simultaneous measurement of up to 50 markers. However, analyzing such high-dimensional (HD) data is complex with traditional manual gating strategies. To address this gap, we developed CAFE as an open-source Python-based web application with a graphical user interface. Built with Streamlit, CAFE incorporates libraries such as Scanpy for single-cell analysis, Pandas and PyArrow for efficient data handling, and Matplotlib, Seaborn, and Plotly for creating customizable figures. Its robust toolset includes density-based down-sampling, dimensionality reduction, batch correction, Leiden-based clustering, and cluster merging and annotation. Using CAFE, we demonstrated analysis of a human PBMC dataset of 350,000 cells, identifying 16 distinct cell clusters. CAFE can generate publication-ready figures in real time via interactive slider controls and dropdown menus, eliminating the need for coding expertise and making HD data analysis accessible to all. CAFE is licensed under MIT and is freely available at https://github.com/mhbsiam/cafe.
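
    For orientation, a sketch of the kind of Scanpy pipeline CAFE automates behind its GUI (down-sampling and batch correction omitted); the input file name and parameters are assumptions, and CAFE itself is operated through Streamlit rather than code:

        import scanpy as sc

        adata = sc.read_csv("pbmc_spectral_flow.csv")  # cells x markers matrix (assumed layout)
        sc.pp.scale(adata)                             # standardize marker intensities
        sc.pp.neighbors(adata, n_neighbors=15)         # k-NN graph over the cells
        sc.tl.leiden(adata, resolution=1.0)            # Leiden-based clustering
        sc.tl.umap(adata)                              # 2-D embedding for plotting
        sc.pl.umap(adata, color="leiden")              # visualize the clusters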

  4. Supplementary material: Burial Analysis on the Middle Bronze Age in the Carpathian Basin (dataset and scripts)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 4, 2024
    Cite
    Laabs, Julian (2024). Supplementary material: Burial Analysis on the Middle Bronze Age in the Carpathian Basin (dataset and scripts) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7355008
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    Kiel University
    Authors
    Laabs, Julian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Pannonian Basin
    Description

    This is the supplementary material of the paper "Wealth Consumption, Sociopolitical Organization, and Change: A Perspective from Burial Analysis on the Middle Bronze Age in the Carpathian Basin" (accessible via DOI: https://doi.org/10.1515/opar-2022-0281). Please consult the publication for an in-depth description of the data, its context, and the method applied to the data, as well as for references to primary sources. The data tables comprise the burial data of the Hungarian Middle Bronze Age cemeteries of Dunaújváros-Duna-dűlő, Dömsöd, Adony, Lovasberény, Csanytelek-Palé, Kelebia, Hernádkak, Gelej, Pusztaszikszó and Streda nad Bodrogom. The script "supplementary_material_2_wealth_index_calculation.py" provides the calculation of a wealth index, based on grave goods, for the provided data. The script "supplementary_material_3_population_estimation.py" models the living population of Dunaújváros-Duna-dűlő. Both can be run by double-clicking. Requirements to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/); all included in Anaconda (Python distribution, https://www.anaconda.com/).

  5. Data from: Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins

    • darus.uni-stuttgart.de
    Updated Nov 27, 2024
    Cite
    Tim Panzer (2024). Data related to Panzer: A Machine Learning Based Approach to Analyze Supersecondary Structures of Proteins [Dataset]. http://doi.org/10.18419/DARUS-4576
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    DaRUS
    Authors
    Tim Panzer
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576

    Time period covered
    Nov 1, 1976 - Feb 29, 2024
    Dataset funded by
    DFG
    Description

    This entry contains the data used to implement the bachelor thesis. It investigates how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.

    The Jupyter Notebook 1_data_retrival.ipynb contains the download process of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files can also be found in graph_files.zip. These graphs represent the relationships of supersecondary structures in the proteins and form the data basis for further analyses. The graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and can be found in three .h5 files; they are added to the database subsequently. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are then carried out with the help of UMAP.

    For the installation of all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required, which can be installed with: pip install h5py==3.12.1 seaborn==0.13.2 umap-learn==0.5.7

  6. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:
    - CID (Customer ID): A unique identifier for each customer.
    - TID (Transaction ID): A unique identifier for each transaction.
    - Gender: The gender of the customer, categorized as Male or Female.
    - Age Group: Age group of the customer, divided into several ranges.
    - Purchase Date: The timestamp of when the transaction took place.
    - Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    - Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    - Discount Name: Name of the discount applied (e.g., FESTIVE50).
    - Discount Amount (INR): The amount of discount availed by the customer.
    - Gross Amount: The total amount before applying any discount.
    - Net Amount: The final amount after applying the discount.
    - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    - Location: The city where the purchase took place.

    Use Cases:
    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data (see the sketch after this list).
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
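
    A minimal EDA sketch for use case 1, using the column names listed above; the CSV file name is an assumption and exact header spellings may differ:

        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt

        df = pd.read_csv("ecommerce_transactions.csv")

        # Summary statistics for the monetary columns.
        print(df[["Gross Amount", "Net Amount", "Discount Amount (INR)"]].describe())

        # Average net spend by product category.
        print(df.groupby("Product Category")["Net Amount"].mean().sort_values())

        # Did customers who availed a discount spend more overall?
        sns.boxplot(x="Discount Availed", y="Gross Amount", data=df)
        plt.show()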

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning

  7. CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes: Tracing the Genomic Divergence From SARS-CoV (2003) to SARS-CoV-2 (2019)

    • figshare.com
    txt
    Updated Apr 5, 2025
    Cite
    Tahir Bhatti (2025). CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes: Tracing the Genomic Divergence From SARS-CoV (2003) to SARS-CoV-2 (2019) [Dataset]. http://doi.org/10.6084/m9.figshare.28736501.v1
    Explore at:
    txt
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tahir Bhatti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective

    The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

    Methods

    1. Data Collection. Source: the dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences. Format: data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.

    2. Preprocessing. Data cleaning: missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library; missing values were replaced with column means, and infinite values were capped at a large finite value (1e9). Reshaping: the data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.

    3. Distance Calculation. Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and the other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.

    4. Identification of Closest and Distant Relatives. The virus with the smallest total distance was identified as the closest relative; the virus with the largest total distance was identified as the most distant relative.

    5. Heatmap Generation. Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization. They were annotated with numerical values for clarity, a color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios, and titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.

    Results

    Closest relative: the closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance; heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions. Most distant relative: the most distant relative was identified based on the largest Euclidean distance; heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

    Tools and Libraries

    Programming language: Python 3.13. Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmap generation), matplotlib (additional visualization enhancements). File formats: input, CSV files containing CpG counts and O/E ratios; output, PNG images of heatmaps.

    Files Included

    A CSV file containing the raw data of CpG counts and O/E ratios for all viruses; heatmap images for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives; and the full Python script used for data processing, distance calculation, and heatmap generation.

    Usage Notes

    Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

    Acknowledgments

    Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

    License

    This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
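
    A condensed sketch of the pipeline described above; the column layout and file name are illustrative assumptions, not the authors' actual script:

        import numpy as np
        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        from scipy.spatial.distance import euclidean

        df = pd.read_csv("cpg_data.csv")  # assumed columns: virus, region, cpg_count, oe_ratio

        # Cleaning as described: cap infinities, fill NaN with column means.
        df = df.replace([np.inf, -np.inf], 1e9)
        df = df.fillna(df.mean(numeric_only=True))

        # Reshape to a virus x region matrix of CpG counts (the pivot step).
        counts = df.pivot(index="virus", columns="region", values="cpg_count")

        # Euclidean distance of every virus to the Wuhan-Hu-1 row.
        ref = counts.loc["Wuhan-Hu-1"]
        dists = counts.apply(lambda row: euclidean(row, ref), axis=1).sort_values()
        # index[0] is Wuhan-Hu-1 itself (distance 0), so skip it.
        print("closest:", dists.index[1], "| most distant:", dists.index[-1])

        # Annotated heatmap with the coolwarm gradient, as in the methods.
        sns.heatmap(counts, annot=True, cmap="coolwarm")
        plt.title("CpG counts per intergenic region")
        plt.show()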

  8. Supplementary information: Subsistence and Population development from the Middle Neolithic B (2800-2350 BCE) to the Late Neolithic (2350-1700 BCE) in Southern Scandinavia

    • nde-dev.biothings.io
    Updated Jul 19, 2024
    Cite
    Bunbury, Magdalena M.E. (2024). Supplementary information: Subsistence and Population development from the Middle Neolithic B (2800-2350 BCE) to the Late Neolithic (2350-1700 BCE) in Southern Scandinavia [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8089547
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Mortensen, Morten Fischer
    Laabs, Julian
    Johannsen, Jens Winther
    Bunbury, Magdalena M.E.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the supplementary information of the paper “Subsistence and Population development from the Middle Neolithic B (2800-2350 BCE) to the Late Neolithic (2350-1700 BCE) in Southern Scandinavia” (DOI: tba). Please consult the publication for an in-depth description of the data, its context, and the method applied to the data, as well as for references to primary sources. Requirements to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/), all included in Anaconda (Python distribution, https://www.anaconda.com/); R (https://cran.r-project.org/) with the packages here (https://cran.r-project.org/web/packages/here/index.html), rcarbon (https://cran.r-project.org/web/packages/rcarbon/index.html), tidyverse, vegan, ggplot2, reshape2, and RcppRoll.

  9. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Explore at:
    zip (13105509 bytes)
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation
    - Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    - Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics
    - Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    - Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy
    - Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    - Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals whether experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation
    - Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)
    - Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations
    - Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    - Salary Discrepancies: Highlight any salary discrepancies between departments or genders that may require further investigation.
    - Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:
    - Pandas: for data manipulation, grouping, and descriptive analysis.
    - NumPy: for numerical operations such as percentiles and correlations.
    - Matplotlib/Seaborn: for data visualization to highlight key patterns and trends.
    - Scikit-learn (optional): for building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
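
    A consolidated sketch of Steps 1-4, reusing the column names from the examples above; salaries.csv is an assumed file name:

        import numpy as np
        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt

        df = pd.read_csv("salaries.csv")
        df = df.dropna(subset=["salary"]).drop_duplicates()   # Step 1: cleaning

        print(df["salary"].describe())                        # Step 2: descriptive stats
        sns.boxplot(x="department", y="salary", data=df)      # outliers per department
        plt.show()

        print(np.percentile(df["salary"], [25, 50, 75]))      # Step 3: quartiles
        print(np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])

        print(df.groupby("department")["salary"].mean())      # Step 4: aggregation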

  10. Heart_Attack_Raw_Dataset

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    La Min Ko Ko (2025). Heart_Attack_Raw_Dataset [Dataset]. https://www.kaggle.com/datasets/laminkoko/heart-attack-raw-dataset
    Explore at:
    zip (10276 bytes)
    Dataset updated
    Nov 23, 2025
    Authors
    La Min Ko Ko
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    About Dataset

    This dataset is a pre-processed version of the popular Heart Attack Analysis & Prediction dataset.

    While the original dataset uses label encoding (numerical values) for categorical variables, this version maps those integers to their descriptive string labels. This makes the dataset ideal for:
    • Data Visualization: Creating clear legends and axis labels in Tableau, PowerBI, Matplotlib, or Seaborn without needing manual mapping.
    • Exploratory Data Analysis (EDA): Quickly understanding the distribution of categories (e.g., "Typical Angina" vs. "Asymptomatic") at a glance.

    Attribute Information & Mappings

    The following categorical columns have been decoded for readability:

    • Sex:
      • Female (was 0)
      • Male (was 1)
    • CP (Chest Pain Type):
      • Typical Angina
      • Atypical Angina
      • Non-anginal Pain
      • Asymptomatic
    • FBS (Fasting Blood Sugar > 120 mg/dl):
      • True
      • False
    • RestECG (Resting Electrocardiographic Results):
      • Normal
      • ST-T Abnormality
      • LV Hypertrophy
    • Exng (Exercise Induced Angina):
      • Yes
      • No
    • Slp (Slope):
      • Upsloping
      • Flat
      • Downsloping
    • Thall (Thalassemia):
      • Null
      • Fixed Defect
      • Normal
      • Reversible Defect
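
    For readers starting from the original label-encoded file, a hypothetical sketch of the decoding this dataset already applies; the file name, lowercase column names, and the cp code ordering follow the common Kaggle version and may differ:

        import pandas as pd

        raw = pd.read_csv("heart.csv")  # assumed: original label-encoded version

        # Map integer codes to the descriptive labels listed above.
        raw["sex"] = raw["sex"].map({0: "Female", 1: "Male"})
        raw["cp"] = raw["cp"].map({0: "Typical Angina", 1: "Atypical Angina",
                                   2: "Non-anginal Pain", 3: "Asymptomatic"})
        raw["exng"] = raw["exng"].map({0: "No", 1: "Yes"})
        print(raw[["sex", "cp", "exng"]].head())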

    Numerical Features (Unchanged)

    • Age: Age of the patient
    • Trtbps: Resting blood pressure (in mm Hg)
    • Chol: Cholesterol in mg/dl fetched via BMI sensor
    • Thalachh: Maximum heart rate achieved
    • Oldpeak: Previous peak
    • Caa: Number of major vessels (0-3)

    Target

    • Output: 0 = Less chance of heart attack, 1 = More chance of heart attack

    Acknowledgements

    This data is derived from the original dataset uploaded by Juled Zaganjori. Original Source: UCI Machine Learning Repository (Cleveland, Hungary, Switzerland, Long Beach V databases).

  11. Ultimate Statistical Tests Flowchart

    • kaggle.com
    zip
    Updated Apr 1, 2025
    Cite
    Pruthvinath Jeripity Venkata (2025). Ultimate Statistical Tests Flowchart [Dataset]. https://www.kaggle.com/datasets/pruthvinathjv/ultimate-statistical-tests-flowchart
    Explore at:
    zip (89096 bytes)
    Dataset updated
    Apr 1, 2025
    Authors
    Pruthvinath Jeripity Venkata
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context: This flowchart helps data scientists and researchers choose the right statistical test based on data characteristics like normality and variance. It simplifies test selection and improves decision-making.

    Sources: Inspired by common statistical guidelines and resources such as "Practical Statistics for Data Scientists" and widely used online platforms like Khan Academy and Coursera.

    Inspiration: Created to address the challenges of selecting appropriate statistical tests, this flowchart offers a clear, easy-to-follow decision path for users at all levels.

  12. Supermarket Inventory Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2024
    Cite
    Shafii Rajabu (2024). Supermarket Inventory Dataset [Dataset]. https://www.kaggle.com/datasets/shafiirajabu/supermarket-inventory-dataset/versions/1
    Explore at:
    zip (1303084 bytes)
    Dataset updated
    Nov 15, 2024
    Authors
    Shafii Rajabu
    Description

    Dataset Overview This fictional dataset, generated by ChatGPT, is designed for those interested in learning and practicing data visualization, dashboard creation, and data analysis. It contains 10,000 rows of data reflecting the inventory and sales patterns of a typical supermarket, spanning a timeframe from January 1, 2024, to June 30, 2024.

    The dataset aims to mimic real-world inventory dynamics and includes product details, stock levels, sales data, supplier performance, and restocking schedules. It's perfect for creating interactive dashboards in tools like Excel, Tableau, or Power BI or for practicing data cleaning and exploratory data analysis (EDA).

    Key Features Comprehensive Columns:

    Date: Record date.
    ProductID: Unique identifier for products.
    ProductName: Product names across diverse supermarket categories.
    Category: Categories like Dairy, Meat, Produce, etc.
    Supplier: Fictional supplier names for products.
    UnitPrice: Realistic product pricing.
    StockQuantity: Current stock levels.
    StockValue: Total value of inventory for each product.
    ReorderLevel: Threshold for triggering a reorder.
    ReorderQuantity: Recommended reorder quantity.
    UnitsSold: Number of units sold.
    SalesValue: Total sales value for each product.
    LastSoldDate: Last date of sale.
    LastRestockDate: Date of the last restock.
    NextRestockDate: Scheduled date for the next restock.
    DeliveryTimeDays: Delivery lead time from suppliers.
    DeliveryStatus: Status of the latest delivery (e.g., On Time, Delayed).

    Realistic Data Generation:

    Products include 50 common supermarket items across 9 categories (Dairy, Bakery, Beverages, Meat, Produce, Frozen, Snacks, Cleaning Supplies, Health & Beauty). Reflects seasonal trends and realistic stock replenishment behaviors. Randomized yet logical patterns for pricing, sales, and stock levels.

    Versatile Use Cases:

    Ideal for data visualization projects. Suitable for inventory management simulation. Can be used to practice time-series analysis.
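
    A minimal sketch for the time-series use case, using the columns listed above; the CSV file name is an assumption:

        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("supermarket_inventory.csv", parse_dates=["Date"])

        # Weekly units sold per category, one line per category.
        weekly = (df.set_index("Date")
                    .groupby("Category")["UnitsSold"]
                    .resample("W").sum()
                    .unstack(level=0))
        weekly.plot(figsize=(10, 5), title="Weekly units sold by category")
        plt.show()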

    Why Use This Dataset? This dataset is a learning resource, crafted to provide aspiring data enthusiasts and professionals with a sandbox to hone their skills in:

    Building dashboards in Tableau, Power BI, or Excel. Analyzing inventory trends and forecasting demand. Visualizing data insights using tools like Matplotlib, Seaborn, or Plotly.

    Disclaimer This dataset is entirely fictional and was generated by ChatGPT, a large language model created by OpenAI. While the data reflects patterns of a real supermarket, it is not based on any actual business or proprietary data.

    Shoutout to ChatGPT for generating this comprehensive dataset and making it available to the Kaggle community! 🎉

    Acknowledgments If you find this dataset helpful, feel free to share your visualizations and insights! Let’s make learning data visualization engaging and fun.

  13. Image Classification by CNN

    • kaggle.com
    zip
    Updated Mar 4, 2024
    Cite
    Harsh Jaglan (2024). Image Classification by CNN [Dataset]. https://www.kaggle.com/datasets/harshjaglan01/image-classification-by-cnn/code
    Explore at:
    zip (311627190 bytes)
    Dataset updated
    Mar 4, 2024
    Authors
    Harsh Jaglan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Automated Flower Identification Using Convolutional Neural Networks

    This project aims to develop a model for identifying five different flower species (rose, tulip, sunflower, dandelion, daisy) using Convolutional Neural Networks (CNNs).

    Description

    The dataset consists of 5,000 images (1,000 images per class) collected from various online sources. The model achieved an accuracy of 98.58% on the test set.

    Usage

    This project requires Python 3.x and the following libraries:

    TensorFlow: for building neural networks.
    numpy: for numerical computing and array operations.
    pandas: for data manipulation and analysis.
    matplotlib: for creating visualizations such as line plots, bar plots, and histograms.
    seaborn: for advanced data visualization and statistically informed graphics.
    scikit-learn: for machine learning algorithms and model training.

    To run the project:

    1. Clone this repository.
    2. Install the required libraries.
    3. Run the Jupyter Notebook: jupyter notebook flower_classification.ipynb

    Additional Information

    Link to code: https://github.com/Harshjaglan01/flower-classification-cnn
    License: MIT License
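
    For readers who want a starting point before opening the notebook, an illustrative Keras sketch of a five-class flower classifier; the directory layout, image size, and architecture are assumptions, not the author's actual notebook:

        import tensorflow as tf

        # Assumes flowers/ contains one subdirectory per class (rose, tulip, ...).
        train = tf.keras.utils.image_dataset_from_directory(
            "flowers/", image_size=(128, 128), batch_size=32)

        model = tf.keras.Sequential([
            tf.keras.layers.Rescaling(1.0 / 255),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(5, activation="softmax"),  # 5 flower classes
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(train, epochs=5)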

  14. Amazon Skincare Products

    • kaggle.com
    zip
    Updated Apr 25, 2023
    Cite
    NAMAN TRISOLIYA (2023). Amazon Skincare Products [Dataset]. https://www.kaggle.com/datasets/namantrisoliya/amazon-skincare-products/discussion
    Explore at:
    zip (116307 bytes)
    Dataset updated
    Apr 25, 2023
    Authors
    NAMAN TRISOLIYA
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The Amazon skincare products dataset is a large collection of data that includes information about various skincare products available on Amazon. It's perfect for beginners who want to gain hands-on experience in visualizing, preprocessing, and cleaning data. The dataset offers opportunities to practice data cleaning and visualization techniques using popular libraries like Matplotlib and Seaborn in Python. Overall, it's a valuable resource for beginners to learn essential data skills in a relevant and interesting context.

  15. Phone Price Predict 2020-2024

    • kaggle.com
    zip
    Updated Dec 10, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jerowai (2024). Phone Price Predict 2020-2024 [Dataset]. https://www.kaggle.com/datasets/jerowai/phone-price-predict-2020-2024
    Explore at:
    zip (1002 bytes)
    Dataset updated
    Dec 10, 2024
    Authors
    Jerowai
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.

    What’s Inside?

    Phone Name & Release Year: Quickly reference the time frame and model.
    Camera Specs: Understand the rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
    Processor & GPU: Gain insights into the performance capabilities by checking the processor and graphics chip.
    Memory & Storage: Review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
    Display & Battery: Compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
    Operating System: Note the Android version at release.
    Price (USD): Examine relative pricing trends over the years.

    How to Use This Dataset

    Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities. Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.

    Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users. Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]

    Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you could still practice predictive modeling. For example, create a simple regression model to predict price based on features like RAM and display size. Example: train a regression model (e.g., LinearRegression in scikit-learn) to see if increasing RAM or battery capacity correlates with higher prices.

    Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved. See if there’s a noticeable shift towards larger displays, bigger batteries, or higher camera megapixels over the years.
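
    A sketch combining the filtering and regression examples above; the file name and the 'Phone Name' and 'Display Size (in)' columns are assumptions, and since the data is illustrative, treat any fit as practice only:

        import pandas as pd
        from sklearn.linear_model import LinearRegression

        df = pd.read_csv("samsung_phones_2020_2024.csv")  # assumed file name

        # Filtering example: phones suitable for power users.
        power = df[(df["RAM (GB)"] >= 8) & (df["Battery Capacity (mAh)"] >= 5000)]
        print(power[["Phone Name", "Price (USD)"]])

        # Simple regression: does price track RAM, battery, and screen size?
        X = df[["RAM (GB)", "Battery Capacity (mAh)", "Display Size (in)"]]
        y = df["Price (USD)"]
        model = LinearRegression().fit(X, y)
        print(dict(zip(X.columns, model.coef_)))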

    Recommended Tools & Libraries

    Python & Pandas: For data cleaning, manipulation, and initial analysis.
    Matplotlib & Seaborn: For creating visualizations to understand trends and distributions.
    scikit-learn: For modeling and basic predictive tasks, if you choose to use these example values as a training ground.
    Jupyter Notebooks or Kaggle Kernels: For interactive analysis and iterative exploration.

    Disclaimer

    This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It’s intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.

  16. Financial Complaints Overview

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Ashwin Panbude (2025). Financial Complaints Overview [Dataset]. https://www.kaggle.com/datasets/ashwinpanbude18/financial-complaints
    Explore at:
    zip (1687037 bytes)
    Dataset updated
    Nov 3, 2025
    Authors
    Ashwin Panbude
    Description

    This dataset contains real-world financial consumer complaints collected from various sources such as banks, credit card companies, and financial institutions. Each record captures customer sentiment, issue category, product type, company response, and resolution status, enabling deep exploration of customer experience and service quality within the financial domain.

    Key Skills Demonstrated

    🐍 Python (Pandas, NumPy, Matplotlib, Seaborn)

    🧩 Data Cleaning & Preprocessing

    📊 Exploratory Data Analysis (EDA)

    💬 Text Analytics & Sentiment Analysis

    🤖 Machine Learning for Complaint Categorization

    📈 Interactive Visualization (Power BI / Tableau)

    🏷️ Business Insight Generation & Storytelling

    📚 Tags

    #DataAnalytics #Finance #CustomerExperience #SentimentAnalysis #Python #MachineLearning #BusinessIntelligence #KaggleProject
