The bank.csv dataset describes phone calls between customers and the customer care staff of a Portuguese banking institution. Each record indicates whether the customer took up a product such as a bank term deposit, so the target data mostly takes 'yes' or 'no' values.
The main goal is to predict if clients will subscribe to a term deposit or not.
Bank Client Data:
1 - age (numeric)
2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
4 - education (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
5 - default: has credit in default? (categorical: no, yes, unknown)
6 - housing: has housing loan? (categorical: no, yes, unknown)
7 - loan: has personal loan? (categorical: no, yes, unknown)

Related with the Last Contact of the Current Campaign:
8 - contact: contact communication type (categorical: cellular, telephone)
9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other Attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)

Social and Economic Context Attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output Variable (Desired Target):
21 - y (deposit): has the client subscribed to a term deposit? (binary: yes, no). The column title was changed from 'y' to 'deposit'.
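As a hedged illustration of how this table might be loaded and prepared with pandas (the file name, the semicolon separator, and the assumption that the target column is already called 'deposit' follow the description above and may need adjusting):

```python
import pandas as pd

# Minimal sketch; many copies of bank.csv are ';'-separated.
df = pd.read_csv("bank.csv", sep=";")

# Drop 'duration' when building a realistic predictive model, since it is
# only known after the call has taken place (see the note above).
X = df.drop(columns=["deposit", "duration"])
y = df["deposit"].map({"yes": 1, "no": 0})

print(X.shape)
print(y.value_counts())
```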
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
An analysis of the flight punctuality statistics using pandas and seaborn. Source data from: https://www.caa.co.uk/Data-and-analysis/UK-aviation-market/Flight-reliability/Datasets/Punctuality-data/Punctuality-statistics-2018/
Open the csv into a pandas dataframe and analyse using Seaborn.
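A minimal, hedged sketch of that workflow; the file name and column names (such as 'reporting_airport' and 'average_delay_mins') are assumptions, since the CAA file layout is not described here:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Adjust the file name and columns to match the actual CAA punctuality CSV.
df = pd.read_csv("punctuality_statistics_2018.csv")

# Example: distribution of average delay per reporting airport.
sns.boxplot(data=df, x="reporting_airport", y="average_delay_mins")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```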
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Spectral flow cytometry provides greater insight into cellular heterogeneity through the simultaneous measurement of up to 50 markers. However, analyzing such high-dimensional (HD) data is difficult with traditional manual gating strategies. To address this gap, we developed CAFE as an open-source Python-based web application with a graphical user interface. Built with Streamlit, CAFE incorporates libraries such as Scanpy for single-cell analysis, Pandas and PyArrow for efficient data handling, and Matplotlib, Seaborn, and Plotly for creating customizable figures. Its robust toolset includes density-based down-sampling, dimensionality reduction, batch correction, Leiden-based clustering, and cluster merging and annotation. Using CAFE, we demonstrate the analysis of a human PBMC dataset of 350,000 cells, identifying 16 distinct cell clusters. CAFE can generate publication-ready figures in real time via interactive slider controls and dropdown menus, eliminating the need for coding expertise and making HD data analysis accessible to all. CAFE is licensed under MIT and is freely available at https://github.com/mhbsiam/cafe.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This is the supplementary material of the paper "Wealth Consumption, Sociopolitical Organization, and Change: A Perspective from Burial Analysis on the Middle Bronze Age in the Carpathian Basin" (accessible via DOI: https://doi.org/10.1515/opar-2022-0281). Please consult the publication for an in-depth description of the data and its context, for the method applied to the data, and for references to primary sources. The data tables comprise the burial data of the Hungarian Middle Bronze Age cemeteries of Dunaújváros-Duna-dűlő, Dömsöd, Adony, Lovasberény, Csanytelek-Palé, Kelebia, Hernádkak, Gelej, Pusztaszikszó and Streda nad Bodrogom. The script "supplementary_material_2_wealth_index_calculation.py" calculates a wealth index, based on grave goods, for the provided data. The script "supplementary_material_3_population_estimation.py" models the living population of Dunaújváros-Duna-dűlő. Both can be run by double-clicking. Requirements to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/); all included in Anaconda (Python distribution, https://www.anaconda.com/).
License: custom dataset license (https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4576)
This entry contains the data used to implement the bachelor thesis, which investigated how embeddings can be used to analyze supersecondary structures. Abstract of the thesis: This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into two-dimensional space to analyze how the embeddings behave there.

The Jupyter Notebook 1_data_retrival.ipynb documents the download of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de). The downloaded .gml files are also provided in graph_files.zip. They form graphs that represent the relationships of supersecondary structures in the proteins and are the data basis for further analyses. These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip. The embeddings were calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D) and are provided in three .h5 files; they are added to the database in this notebook. The whole process in this notebook serves to build up the database, which can then be searched using Cypher queries. In the Jupyter Notebook 3_data_science.ipynb, different visualizations and analyses are carried out with the help of UMAP.

For the installation of all dependencies, it is recommended to create a Conda environment and install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed). The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed using poetry and the .toml file. In addition, seaborn, h5py and umap-learn are required. These can be installed with: pip install h5py==3.12.1 seaborn==0.13.2 umap-learn==0.5.7
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
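As a quick, hedged illustration of use cases 1 and 3, a minimal sketch follows; the CSV file name and the exact column spellings are assumptions based on the field list above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Adjust the file name and column names to the actual CSV headers.
df = pd.read_csv("customer_transactions.csv", parse_dates=["Purchase Date"])

# Summary statistics for the monetary columns.
print(df[["Gross Amount", "Discount Amount (INR)", "Net Amount"]].describe())

# Average net amount per product category.
print(df.groupby("Product Category")["Net Amount"].mean().sort_values())

# Spend by age group, split by discount usage.
sns.barplot(data=df, x="Age Group", y="Net Amount", hue="Discount Availed")
plt.tight_layout()
plt.show()
```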
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Objective
The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

Methods
1. Data Collection. Source: the dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences. Format: data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
2. Preprocessing. Data cleaning: missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library; missing values were replaced with column means, and infinite values were capped at a large finite value (1e9). Reshaping: the data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.
3. Distance Calculation. Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and the other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
4. Identification of Closest and Distant Relatives. The virus with the smallest total distance was identified as the closest relative; the virus with the largest total distance was identified as the most distant relative.
5. Heatmap Generation. Tools: heatmaps were generated using Python's seaborn library (sns.heatmap) with matplotlib for visualization. Parameters: heatmaps were annotated with numerical values for clarity, a color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios, and titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.

Results
Closest relative: identified by the smallest Euclidean distance; heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions. Most distant relative: identified by the largest Euclidean distance; heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

Tools and Libraries
Programming language: Python 3.13. Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling of missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), matplotlib (additional visualization enhancements). File formats: input, CSV files containing CpG counts and O/E ratios; output, PNG images of heatmaps.

Files Included
CSV file: raw data of CpG counts and O/E ratios for all viruses. Heatmap images: heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives. Python script: full code used for data processing, distance calculation, and heatmap generation.

Usage Notes
Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

Acknowledgments
Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

License
This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
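A hedged sketch of the distance and heatmap steps described in the Methods above; the CSV file name and column names are assumptions, and the actual script bundled with the dataset may differ:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean

# Assumed long-format CSV with columns: region, virus, cpg_count, oe_ratio.
df = pd.read_csv("cpg_data.csv")
df = df.replace([np.inf, -np.inf], 1e9)
df = df.fillna(df.mean(numeric_only=True))

# Reshape into region-by-virus matrices, as in the preprocessing step.
cpg = df.pivot(index="region", columns="virus", values="cpg_count")
oe = df.pivot(index="region", columns="virus", values="oe_ratio")

# Total Euclidean distance of each virus from Wuhan-Hu-1.
ref = "Wuhan-Hu-1"
others = [v for v in cpg.columns if v != ref]
total = {v: euclidean(cpg[ref], cpg[v]) + euclidean(oe[ref], oe[v]) for v in others}
closest = min(total, key=total.get)
farthest = max(total, key=total.get)
print("closest:", closest, "| most distant:", farthest)

# Heatmap comparing CpG counts for the reference and its two extremes.
sns.heatmap(cpg[[ref, closest, farthest]], annot=True, cmap="coolwarm")
plt.title("CpG counts: Wuhan-Hu-1 vs closest and most distant relatives")
plt.tight_layout()
plt.show()
```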
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This is the supplementary information of the paper "Subsistence and Population development from the Middle Neolithic B (2800-2350 BCE) to the Late Neolithic (2350-1700 BCE) in Southern Scandinavia" (DOI: tba). Please consult the publication for an in-depth description of the data and its context, for the method applied to the data, and for references to primary sources. Requirements to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/); all included in Anaconda (Python distribution, https://www.anaconda.com/). R (https://cran.r-project.org/) with the packages here (https://cran.r-project.org/web/packages/here/index.html), rcarbon (https://cran.r-project.org/web/packages/rcarbon/index.html), tidyverse, vegan, ggplot2, reshape2, and RcppRoll.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:
Case Study: Employee Salary Analysis
In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.
Step 1: Data Collection and Preparation
Data sources: the dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.). Data cleaning: we use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

Step 2: Data Exploration and Descriptive Statistics
Exploratory Data Analysis (EDA): using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries. Data visualization: leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

Step 3: Analysis Using NumPy
Calculating salary ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles. Correlation analysis: identify the relationship between variables such as experience and salary by computing correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals whether experience is a significant factor in salary determination.

Step 4: Grouping and Aggregation
Salary by department and position: using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

Step 5: Salary Forecasting (Optional)
Predictive analysis: using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

Step 6: Insights and Recommendations
Outlier identification: detect any employees earning significantly more or less than the average, which could signal inequities or high performers. Salary discrepancies: highlight any salary discrepancies between departments or genders that may require further investigation. Compensation planning: based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

Tools Used
Pandas: for data manipulation, grouping, and descriptive analysis. NumPy: for numerical operations such as percentiles and correlations. Matplotlib/Seaborn: for data visualization to highlight key patterns and trends. Scikit-learn (optional): for building predictive models if salary forecasting is included in the analysis.

This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
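The snippets above can be stitched into one short, hedged sketch; column names such as 'salary', 'department', and 'years_of_experience' follow the inline examples and the file name is an assumption:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed columns: department, years_of_experience, salary.
df = pd.read_csv("employee_salaries.csv")

# Step 1: basic cleaning.
df = df.dropna(subset=["salary"]).drop_duplicates()

# Step 2: descriptive statistics and a department-level box plot.
print(df["salary"].describe())
sns.boxplot(x="department", y="salary", data=df)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Step 3: quartiles and the experience/salary correlation.
q1, median, q3 = np.percentile(df["salary"], [25, 50, 75])
corr = np.corrcoef(df["years_of_experience"], df["salary"])[0, 1]
print(f"quartiles: {q1:.0f} / {median:.0f} / {q3:.0f}, corr(experience, salary) = {corr:.2f}")

# Step 4: average salary per department.
print(df.groupby("department")["salary"].mean().sort_values(ascending=False))
```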
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
This dataset is a pre-processed version of the popular Heart Attack Analysis & Prediction dataset.
While the original dataset uses label encoding (numerical values) for categorical variables, this version maps those integers to their descriptive string labels. This makes the dataset ideal for:
* Data Visualization: Creating clear legends and axis labels in Tableau, PowerBI, Matplotlib, or Seaborn without needing manual mapping.
* Exploratory Data Analysis (EDA): Quickly understanding the distribution of categories (e.g., "Typical Angina" vs. "Asymptomatic") at a glance.
The following categorical columns have been decoded for readability:
* Sex: Female (was 0), Male (was 1)
* Chest pain type: Typical Angina, Atypical Angina, Non-anginal Pain, Asymptomatic
* Fasting blood sugar: True, False
* Resting ECG: Normal, ST-T Abnormality, LV Hypertrophy
* Exercise-induced angina: Yes, No
* ST slope: Upsloping, Flat, Downsloping
* Thalassemia: Null, Fixed Defect, Normal, Reversible Defect

This data is derived from the original dataset uploaded by Juled Zaganjori. Original Source: UCI Machine Learning Repository (Cleveland, Hungary, Switzerland, Long Beach V databases).
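A hedged sketch of how the original integer-coded version could be decoded in this way; the column names and code orders below follow common conventions for the original Heart Attack Analysis & Prediction dataset and are assumptions, so adjust them if your copy differs:

```python
import pandas as pd

# Assumed original label-encoded file and column names.
df = pd.read_csv("heart.csv")

mappings = {
    "sex": {0: "Female", 1: "Male"},
    "cp": {0: "Typical Angina", 1: "Atypical Angina",
           2: "Non-anginal Pain", 3: "Asymptomatic"},
    "fbs": {0: "False", 1: "True"},
    "restecg": {0: "Normal", 1: "ST-T Abnormality", 2: "LV Hypertrophy"},
    "exng": {0: "No", 1: "Yes"},
    "slp": {0: "Upsloping", 1: "Flat", 2: "Downsloping"},
    "thall": {0: "Null", 1: "Fixed Defect", 2: "Normal", 3: "Reversible Defect"},
}

# Column-wise replacement of integer codes with descriptive labels.
decoded = df.replace(mappings)
print(decoded.head())
```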
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
Context: This flowchart helps data scientists and researchers choose the right statistical test based on data characteristics like normality and variance. It simplifies test selection and improves decision-making.
Sources: Inspired by common statistical guidelines and resources such as "Practical Statistics for Data Scientists" and widely used online platforms like Khan Academy and Coursera.
Inspiration: Created to address the challenges of selecting appropriate statistical tests, this flowchart offers a clear, easy-to-follow decision path for users at all levels.
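As a hedged illustration of one branch of such a decision path (an illustration of the general idea, not a reproduction of the flowchart itself): check normality with Shapiro-Wilk and equal variances with Levene's test, then choose between Student's t-test, Welch's t-test, and the Mann-Whitney U test.

```python
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick and run a two-sample test based on normality and variance checks."""
    normal = (stats.shapiro(a).pvalue > alpha) and (stats.shapiro(b).pvalue > alpha)
    if not normal:
        # Non-parametric alternative for non-normal data.
        return "Mann-Whitney U", stats.mannwhitneyu(a, b)
    equal_var = stats.levene(a, b).pvalue > alpha
    if equal_var:
        return "Student's t-test", stats.ttest_ind(a, b)
    # Unequal variances: Welch's t-test.
    return "Welch's t-test", stats.ttest_ind(a, b, equal_var=False)

# Example usage with two small samples.
group_a = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8]
group_b = [5.9, 6.1, 5.7, 6.3, 6.0, 5.8]
name, result = compare_two_groups(group_a, group_b)
print(name, result)
```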
Dataset Overview
This fictional dataset, generated by ChatGPT, is designed for those interested in learning and practicing data visualization, dashboard creation, and data analysis. It contains 10,000 rows of data reflecting the inventory and sales patterns of a typical supermarket, spanning a timeframe from January 1, 2024, to June 30, 2024.
The dataset aims to mimic real-world inventory dynamics and includes product details, stock levels, sales data, supplier performance, and restocking schedules. It's perfect for creating interactive dashboards in tools like Excel, Tableau, or Power BI or for practicing data cleaning and exploratory data analysis (EDA).
Key Features

Comprehensive Columns:
Date: Record date.
ProductID: Unique identifier for products.
ProductName: Product names across diverse supermarket categories.
Category: Categories like Dairy, Meat, Produce, etc.
Supplier: Fictional supplier names for products.
UnitPrice: Realistic product pricing.
StockQuantity: Current stock levels.
StockValue: Total value of inventory for each product.
ReorderLevel: Threshold for triggering a reorder.
ReorderQuantity: Recommended reorder quantity.
UnitsSold: Number of units sold.
SalesValue: Total sales value for each product.
LastSoldDate: Last date of sale.
LastRestockDate: Date of the last restock.
NextRestockDate: Scheduled date for the next restock.
DeliveryTimeDays: Delivery lead time from suppliers.
DeliveryStatus: Status of the latest delivery (e.g., On Time, Delayed).
Realistic Data Generation:
Products include 50 common supermarket items across 9 categories (Dairy, Bakery, Beverages, Meat, Produce, Frozen, Snacks, Cleaning Supplies, Health & Beauty). Reflects seasonal trends and realistic stock replenishment behaviors. Randomized yet logical patterns for pricing, sales, and stock levels.
Versatile Use Cases:
Ideal for data visualization projects. Suitable for inventory management simulation. Can be used to practice time-series analysis.
Why Use This Dataset?
This dataset is a learning resource, crafted to provide aspiring data enthusiasts and professionals with a sandbox to hone their skills in:
Building dashboards in Tableau, Power BI, or Excel. Analyzing inventory trends and forecasting demand. Visualizing data insights using tools like Matplotlib, Seaborn, or Plotly.
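As a hedged starting point for the pandas route, a minimal sketch follows; the CSV file name is an assumption and the column names follow the list above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Adjust the file name to match the downloaded CSV.
df = pd.read_csv("supermarket_inventory.csv", parse_dates=["Date"])

# Flag products at or below their reorder threshold.
low_stock = df[df["StockQuantity"] <= df["ReorderLevel"]]
print(low_stock[["ProductName", "StockQuantity", "ReorderLevel"]].head())

# Daily sales value per category, a natural starting point for time-series practice.
daily_sales = df.groupby(["Date", "Category"])["SalesValue"].sum().reset_index()
sns.lineplot(data=daily_sales, x="Date", y="SalesValue", hue="Category")
plt.tight_layout()
plt.show()
```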
Disclaimer This dataset is entirely fictional and was generated by ChatGPT, a large language model created by OpenAI. While the data reflects patterns of a real supermarket, it is not based on any actual business or proprietary data.
Shoutout to ChatGPT for generating this comprehensive dataset and making it available to the Kaggle community! 🎉
Acknowledgments If you find this dataset helpful, feel free to share your visualizations and insights! Let’s make learning data visualization engaging and fun.
License: MIT License (https://opensource.org/licenses/MIT)
This project aims to develop a model for identifying five different flower species (rose, tulip, sunflower, dandelion, daisy) using Convolutional Neural Networks (CNNs).
The dataset consists of 5,000 images (1,000 images per class) collected from various online sources. The model achieved an accuracy of 98.58% on the test set.

Usage
TensorFlow: for building neural networks.
numpy: for numerical computing and array operations.
pandas: for data manipulation and analysis.
matplotlib: for creating visualizations such as line plots, bar plots, and histograms.
seaborn: for advanced data visualization and statistically informed graphics.
scikit-learn: for machine learning algorithms and model training.

To run the project:
Install the required libraries, then run the Jupyter Notebook: jupyter notebook flower_classification.ipynb

Additional Information
Link to code: https://github.com/Harshjaglan01/flower-classification-cnn
License: MIT License
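For readers who want a starting point before opening the notebook, here is a minimal, hedged Keras sketch of a five-class flower CNN; the directory layout, image size, and architecture are assumptions, and the actual notebook in the linked repository may differ:

```python
import tensorflow as tf

# Assumed layout: flowers/<class_name>/*.jpg with five class folders.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "flowers", validation_split=0.2, subset="training", seed=42,
    image_size=(128, 128), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "flowers", validation_split=0.2, subset="validation", seed=42,
    image_size=(128, 128), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),  # rose, tulip, sunflower, dandelion, daisy
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```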
License: Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
The Amazon skincare products dataset is a large collection of data that includes information about various skincare products available on Amazon. It's perfect for beginners who want to gain hands-on experience in visualizing, preprocessing, and cleaning data. The dataset offers opportunities to practice data cleaning and visualization techniques using popular libraries like Matplotlib and Seaborn in Python. Overall, it's a valuable resource for beginners to learn essential data skills in a relevant and interesting context.
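A hedged starter for that kind of practice; the file name and column names such as 'price' and 'rating' are purely hypothetical, since the dataset's schema is not described here:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file and column names; replace with the real CSV headers.
df = pd.read_csv("amazon_skincare_products.csv")

# Typical cleaning steps to practice: drop duplicates and handle missing values.
df = df.drop_duplicates()
df = df.dropna(subset=["price", "rating"])

# Visualize how ratings relate to price.
sns.scatterplot(data=df, x="price", y="rating", alpha=0.4)
plt.title("Rating vs. price (hypothetical columns)")
plt.tight_layout()
plt.show()
```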
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset Overview
This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.
What’s Inside?
Phone Name & Release Year: Quickly reference the time frame and model.
Camera Specs: Understand the rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
Processor & GPU: Gain insights into the performance capabilities by checking the processor and graphics chip.
Memory & Storage: Review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
Display & Battery: Compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
Operating System: Note the Android version at release.
Price (USD): Examine relative pricing trends over the years.

How to Use This Dataset
Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities. Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.

Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users. Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]

Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you could still practice predictive modeling. For example, create a simple regression model to predict price based on features like RAM and display size. Example: Train a regression model (e.g., LinearRegression in scikit-learn) to see if increasing RAM or battery capacity correlates with higher prices.

Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved. See if there’s a noticeable shift towards larger displays, bigger batteries, or higher camera megapixels over the years.
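The inline examples above can be stitched into a short, hedged sketch; column names such as 'RAM (GB)' and 'Battery Capacity (mAh)' follow those examples, while the file name and the columns 'Phone Name' and 'Display Size (inches)' are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Adjust the file name and column names to the actual CSV headers.
df = pd.read_csv("samsung_phones.csv")

# Average price per release year.
df.groupby("Release Year")["Price (USD)"].mean().plot(kind="bar")
plt.tight_layout()
plt.show()

# Phones suited to power users: at least 8 GB RAM and a 5000 mAh battery.
power_users = df[(df["RAM (GB)"] >= 8) & (df["Battery Capacity (mAh)"] >= 5000)]
print(power_users[["Phone Name", "Price (USD)"]])

# Simple regression: do RAM and display size track price in this toy data?
X = df[["RAM (GB)", "Display Size (inches)"]]
y = df["Price (USD)"]
model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))
```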
Recommended Tools & Libraries
Python & Pandas: For data cleaning, manipulation, and initial analysis.
Matplotlib & Seaborn: For creating visualizations to understand trends and distributions.
scikit-learn: For modeling and basic predictive tasks, if you choose to use these example values as a training ground.
Jupyter Notebooks or Kaggle Kernels: For interactive analysis and iterative exploration.

Disclaimer
This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It’s intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.
This dataset contains real-world financial consumer complaints collected from various sources such as banks, credit card companies, and financial institutions. Each record captures customer sentiment, issue category, product type, company response, and resolution status, enabling deep exploration of customer experience and service quality within the financial domain.
Key Skills Demonstrated
🐍 Python (Pandas, NumPy, Matplotlib, Seaborn)
🧩 Data Cleaning & Preprocessing
📊 Exploratory Data Analysis (EDA)
💬 Text Analytics & Sentiment Analysis
🤖 Machine Learning for Complaint Categorization
📈 Interactive Visualization (Power BI / Tableau)
🏷️ Business Insight Generation & Storytelling
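As a hedged illustration of the machine-learning item above, a minimal text-classification baseline follows; the CSV file name and column names such as 'complaint_text' and 'product' are assumptions, so adjust them to the actual dataset headers:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("financial_complaints.csv").dropna(subset=["complaint_text", "product"])

X_train, X_test, y_train, y_test = train_test_split(
    df["complaint_text"], df["product"], test_size=0.2, random_state=42)

# TF-IDF features feeding a linear classifier: a common baseline for
# categorizing free-text complaints by product type.
clf = make_pipeline(TfidfVectorizer(stop_words="english", max_features=20000),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```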