License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Context: This dataset holds a list of approximately 200+ books on data science and related topics. The list was compiled from Amazon, a popular website that provides book ratings and the other details given below.
There are 6 columns:
Book_name: the book title
Publisher: name of the publisher or writer
Buyers: number of customers who purchased the book
Cover_type: type of cover used to protect the book
stars: rating out of 5 stars
Price
Inspiration: I’d like to call on my fellow Kagglers to use machine learning and data science to help me explore these ideas:
• What is the best-selling book?
• Find any hidden patterns if you can
• EDA of the dataset
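As a starting point, here is a minimal EDA sketch for these questions in pandas, assuming the file is named data_science_books.csv and the columns carry the names listed above (both the file name and the exact column names are assumptions):

```python
import pandas as pd

# Load the book list (file name assumed).
books = pd.read_csv("data_science_books.csv")

# Best-selling books: sort by number of buyers.
top_sellers = books.sort_values("Buyers", ascending=False)
print(top_sellers[["Book_name", "Buyers", "stars", "Price"]].head(10))

# A quick look for patterns: average rating and price per publisher.
print(books.groupby("Publisher")[["stars", "Price"]].mean().sort_values("stars", ascending=False))
```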
A dataset for beginners to start the data science process. The data consists of simple clinical records for problem definition and solving. A range of data science tasks, such as classification, clustering, EDA, and statistical analysis, can be performed with this dataset.
Columns present in the dataset:
Age: Numerical (age of the patient)
Sex: Binary (gender of the patient)
BP: Nominal (blood pressure of the patient: Low, Normal, High)
Cholesterol: Nominal (cholesterol of the patient: Normal, High)
Na: Numerical (sodium level of the patient)
K: Numerical (potassium level of the patient)
Drug: Nominal (type of drug prescribed by the doctor: A, B, C, X, Y)
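As one example of the classification task mentioned above, here is a minimal scikit-learn sketch predicting the prescribed drug, assuming the file is named drug_data.csv and the columns carry the names listed above (the file name and exact column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the clinical data (file name assumed).
df = pd.read_csv("drug_data.csv")

# One-hot encode the categorical features; numerical ones pass through unchanged.
X = pd.get_dummies(df[["Age", "Sex", "BP", "Cholesterol", "Na", "K"]])
y = df["Drug"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```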
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides a comprehensive analysis of human cognitive performance based on various lifestyle factors, including sleep duration, stress levels, diet type, screen time, exercise frequency, and caffeine intake. Additionally, it includes a cognitive score computed using a weighted formula and an AI-predicted score, making it suitable for machine learning and AI-based predictive modeling.
The dataset contains 80,000 samples with diverse demographic attributes, making it an excellent resource for data science, AI, and human behavior analysis.
This dataset is suitable for various machine learning and AI applications, such as predicting the cognitive score from the lifestyle features.
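A minimal predictive-modeling sketch along those lines, assuming the file is named human_cognitive_performance.csv and using hypothetical column names for the lifestyle features and the computed score (all names below are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Load the data (file and column names assumed).
df = pd.read_csv("human_cognitive_performance.csv")

features = ["sleep_duration", "stress_level", "screen_time",
            "exercise_frequency", "caffeine_intake"]
X, y = df[features], df["cognitive_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```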
This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.
Business Understanding: Loan providing companies find it hard to give loans to people due to their insufficient or non-existent credit history. Because of that, some consumers take advantage of this by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.
When the company receives a loan application, it has to decide on loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:
If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company. If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company. The data given below contains information about the loan application at the time of applying for the loan. It contains two types of scenarios:
The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
All other cases: the payment was made on time.
When a client applies for a loan, there are four types of decisions that could be taken by the client/company:
Approved: The company has approved the loan application.
Cancelled: The client cancelled the application sometime during approval, either because the client changed her/his mind about the loan or, in some cases, because a higher-risk client received worse pricing which he did not want.
Refused: The company rejected the loan (because the client does not meet their requirements, etc.).
Unused Offer: The loan was cancelled by the client but at different stages of the process.
In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency to default.
Business Objectives: It aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.
In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.
To develop your understanding of the domain, you are advised to independently research risk analytics a little; understanding the types of variables and their significance should be enough.
Data Understanding: Download the Dataset using the link given under dataset section on the right.
application_data.csv contains all the information of the client at the time of application.
The data is about whether a client has payment difficulties.
previous_application.csv contains information about the client’s previous loan data. It contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer.
columns_description.csv is a data dictionary which describes the meaning of the variables.
You are required to provide a detailed report for the data described below, answering the questions that follow:
Present the overall approach of the analysis, briefly mentioning the problem statement and the analysis approach.
Identify the missing data and use an appropriate method to deal with it (remove columns or replace values with an appropriate value). Hint: note that in EDA it is not strictly necessary to replace missing values, but if you have to replace a missing value, clearly state what your approach should be.
Identify whether there are outliers in the dataset, and mention why you think they are outliers. Again, remember that for this exercise it is not necessary to remove any data points.
Identify whether there is data imbalance in the data and find the ratio of data imbalance (a sketch follows this list). Hint: since there are a lot of columns, you can run your analysis in loops over the appropriate columns and find the insights.
Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
Find the top 10 c...
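For the missing-value and imbalance checks, a minimal pandas sketch, assuming application_data.csv has a binary TARGET column flagging payment difficulties (the column name is an assumption; check columns_description.csv):

```python
import pandas as pd

app = pd.read_csv("application_data.csv")

# Share of missing values per column, sorted descending.
missing = app.isna().mean().sort_values(ascending=False)
print(missing.head(20))

# Class imbalance ratio (TARGET column name assumed; 1 = payment difficulties).
counts = app["TARGET"].value_counts()
print(counts)
print("Imbalance ratio (majority : minority) = %.1f : 1" % (counts.max() / counts.min()))
```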
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset History
A retail company, “ABC Private Limited”, wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The dataset also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (product id and product category), and the total purchase amount from last month.
Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.
Tasks to perform
The Purchase column is the target variable; perform univariate and bivariate analysis with respect to Purchase.
"Masked" in the column description means the column has already been converted from a categorical value to a numerical one.
The points below are just meant to get you started with the dataset; it is not mandatory to follow the same sequence.
DATA PREPROCESSING
Check the basic statistics of the dataset
Check for missing values in the data
Check for unique values in data
Perform EDA
Purchase Distribution
Check for outliers
Analysis by gender, marital status, occupation, occupation vs. purchase, purchase by city, purchase by age group, etc.
Drop unnecessary fields
Convert categorical data into integers using the map function (e.g. the 'Gender' column); see the sketch after this list
Missing value treatment
Rename columns
Fill NaN values
Map range variables into integers (e.g. the 'Age' column)
Data Visualisation
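A minimal sketch for the mapping steps above, assuming the file is named train.csv, that Gender takes the values 'M'/'F', and that Age holds range labels such as '0-17' and '55+' (the file name and exact category labels are assumptions):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # file name assumed

# Convert 'Gender' from categorical to integer with map().
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})

# Map 'Age' range labels to ordered integers (labels assumed).
age_map = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
           "46-50": 4, "51-55": 5, "55+": 6}
df["Age"] = df["Age"].map(age_map)

# Simple missing value treatment: fill remaining NaN values with 0.
df = df.fillna(0)
print(df[["Gender", "Age"]].head())
```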
All the Best!!
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
A Time Series Forecasting Project from a Data Science Internship
This project showcases a real-world approach to forecasting monthly retail sales using Python and SARIMAX, completed as part of a 25-day Data Science Internship.
To forecast the next 6 months of retail sales based on 4 years of historical monthly data, with the goal of improving inventory management, marketing strategy, and financial planning.
Synthetic sales dataset (Jan 2020 – Dec 2023)
| Column Name | Description |
|---|---|
| Date | First day of each month (format: YYYY-MM-DD), representing the sales period |
| SalesAmount | Total monthly sales amount (includes trend, seasonality, and promotion effects) |
| Promotion | Binary flag: 1 = promotional campaign active, 0 = no promotion |
| HolidayMonth | Binary flag: 1 = holiday month (e.g., December), 0 = non-holiday month |
The project covers time series EDA (decomposition, ACF/PACF, stationarity tests), feature engineering for time series (lags, rolling means, exogenous variables), training and tuning SARIMAX models, making 6-month forecasts with confidence intervals, and translating insights into business recommendations.
Model Used: SARIMAX(1,1,1)(0,1,1,12)
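A minimal sketch of fitting this SARIMAX specification with statsmodels and producing the 6-month forecast with confidence intervals, assuming the series is loaded from a file named retail_sales.csv with the columns from the table above (the file name and the assumed future exogenous values are illustrative):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Load the monthly series (file name assumed).
df = pd.read_csv("retail_sales.csv", parse_dates=["Date"], index_col="Date").asfreq("MS")

exog = df[["Promotion", "HolidayMonth"]]
model = SARIMAX(df["SalesAmount"], exog=exog,
                order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
res = model.fit(disp=False)

# 6-month forecast; future exogenous values are assumed (no promotions, no holiday months).
future_index = pd.date_range("2024-01-01", periods=6, freq="MS")
future_exog = pd.DataFrame({"Promotion": 0, "HolidayMonth": 0}, index=future_index)
forecast = res.get_forecast(steps=6, exog=future_exog)
print(forecast.predicted_mean)
print(forecast.conf_int())
```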
MAE: 814.37; RMSE: 1012.38. Promotion events were shown to significantly increase sales.
| Month | Forecast |
|---|---|
| Jan 2024 | 15,387 |
| Feb 2024 | 18,653 |
| Mar 2024 | 14,954 |
| Apr 2024 | 13,468 |
| May 2024 | 11,059 |
| Jun 2024 | 10,514 |
The notebook includes full EDA and modeling steps with markdown explanations, forecasting charts and insights, and a clean, beginner-friendly code structure.
It is aimed at data science learners looking for an internship project idea, anyone learning time series forecasting, and retail businesses wanting a forecasting template.
📌 Want to explore the full code and report? 🔗 Also available on GitHub: https://github.com/muhammad-zamin/retail-sales-forecasting
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
🔥 This dataset provides a comprehensive snapshot of Russia's job market, derived from over 560,000 job listings. The diverse set of attributes, which include job titles, salaries, job types, descriptions, key skills required, and more, offers an extensive overview of the employment landscape in Russia.
Source:
The dataset is sourced from hh.ru, a prominent online employment portal in Russia. Leveraging their API, the data was meticulously gathered and compiled to create this rich repository of job market insights.
Inspiration:
The primary motivation behind creating and sharing this dataset was to build a job recommendation model utilizing graph-based models. With a significant portion of the data in Russian, it poses a fascinating challenge in data preprocessing and feature engineering. Some potential new features could be extracted from existing ones, such as min/max experience, min/max salary, job type split, and others.
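For instance, a minimal sketch of deriving min/max salary features from a free-text salary field, assuming a column named salary containing strings such as "от 50000 до 90000 руб." (the file name, column name, and format are assumptions about the raw data):

```python
import re
import pandas as pd

def parse_salary(text):
    # Pull all integers out of a free-text salary field (format assumed).
    if not isinstance(text, str):
        return pd.Series([None, None], index=["salary_min", "salary_max"])
    nums = [int(n.replace(" ", "")) for n in re.findall(r"\d[\d ]*", text)]
    if not nums:
        return pd.Series([None, None], index=["salary_min", "salary_max"])
    return pd.Series([min(nums), max(nums)], index=["salary_min", "salary_max"])

jobs = pd.read_csv("hh_vacancies.csv")  # file name assumed
jobs[["salary_min", "salary_max"]] = jobs["salary"].apply(parse_salary)
print(jobs[["salary", "salary_min", "salary_max"]].head())
```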
Moreover, the complexity and richness of the dataset make it a suitable and intriguing field for the data science community to explore and analyze. It's not just about the translation of the data, but also about understanding the trends, identifying patterns, and even predicting future trajectories in Russia's job market. The dataset could lead to an array of innovative applications, models, and analyses.
In sharing this dataset, the hope is to inspire the Kaggle community to bring their diverse skills to bear in exploring this unique data, unveiling new insights, and building transformative models. The results can then be used to advance the field and build better job recommendation systems for diverse and multilingual job markets.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This data contains salaries of University of Vermont (UVM) faculty from 2009 to 2021. We present two datasets. The second dataset is richer because it contains information on faculty departments/colleges; however, it contains fewer rows due to how we chose to join this data.
1. salaries_without_dept.csv contains all of the data we extracted from the PDFs. The four columns are: Year, Faculty Name, Primary Job Title, and Base Pay. There are 47,479 rows.
2. salaries_final.csv contains the same columns as [1], but also joins with data about the faculty's "Department" and "College" (for a total of six columns). There are only 14,470 rows in this dataset because we removed rows for which we could not identify the Department/College of the faculty.
All data is publicly available on the University of Vermont website. I downloaded all PDFs from https://www.uvm.edu/oir/faculty-and-staff. Then I used a Python package (Camelot) to parse the tabular PDFs and used regex matching to ensure data was correctly parsed. I performed some initial cleaning (removed dollar signs from monetary values, etc.). At this stage, I saved the data to salaries_without_dept.csv.
I also wanted to know what department and college each faculty belonged to. I used http://catalogue.uvm.edu/undergraduate/faculty/fulltime (plus Python's lxml package to parse the HTML) to determine "Department" and then manually built an encoding to map "Department" to "College". Note that this link provides faculty information for 2020, thus after joining we end up only with faculty that are still employed as of 2020 (this should be taken into consideration). Secondly, this link does not include UVM administration (and possibly some other personnel) so they are not present in this dataset. Thirdly, there were several different ways names were reported (sometimes even the same person has their name reported differently in different years). We tried joining first on LastName+FirstName and then on LastName+FirstInitial but did not bother using middle name. To handle ambiguity, we removed duplicates (e.g. we removed Martin, Jacob and Martin, Jacob William as they were not distinguishable by our criteria). The joined data is available in salaries_final.csv.
Note: perhaps "College" was not the best naming, since faculty of UVM Libraries and other miscellaneous fields are included.
The column definitions are self-explanatory, but the "College" abbreviation meanings are unclear to a non-UVM-affiliate. We've included data_dictionary.csv to explain what each "College" abbreviation means. You can use this dictionary to filter out miscellaneous "colleges" (e.g. UVM Libraries) and only include colleges within the undergraduate program (e.g. filter out College of Medicine).
Despite there only being a few (six) columns, I think this is quite a rich dataset which could also be paired with other UVM data or combined with data from other universities. This dataset is mainly for data analytics and exploratory data analysis (EDA), but perhaps it could also be used for forecasting (however, there are only 12 time values, so you would probably want to make use of "College" or "Primary Job Title"). Interesting EDA questions could be:
1. "Are the faculty in arts & humanities departments being paid less?" This news article -- UVM to eliminate 23 programs in the College of Arts and Sciences -- suggests so. Give a quantitative answer.
2. "Are lecturers declining in quantity and pay?" This news article -- ‘I’m going to miss this:’ Three cut lecturers reflect on time at UVM -- suggests so. Give a quantitative answer.
3. "How does the College of Medicine compare to the undergraduate colleges in terms of number of faculty and pay?" See data_dictionay.csv for which colleges are in the undergraduate program.
4. "How long does it take for a faculty member to become a full professor?" Yes, this is also answerable from the data because Primary Job Title updates when a faculty member is promoted.
I do not plan to maintain this dataset. If I get the chance, I may update it with future year salaries.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Code from https://github.com/jasonwei20/eda_nlp was run on the training dataset for the Jigsaw Unintended Bias in Toxicity Classification competition to create an augmented training dataset. The number of augmentations was set to 16 and the alpha value was set to 0.05.
train_augmented1605.zip - augmented training dataset for Jigsaw Unintended Bias in Toxicity Classification competition.
Code provided by: https://github.com/jasonwei20/eda_nlp
Code for the paper: Easy data augmentation techniques for boosting performance on text classification tasks. https://arxiv.org/abs/1901.11196
Special thanks to ErvTong / @papasmurfff for sharing the eda_nlp repo with me. https://www.kaggle.com/papasmurfff
https://mlwhiz.com/blog/2019/02/19/siver_medal_kaggle_learnings/
The above article talks about how the 1st place competitors in the Quora Insincere Questions competition stated:
"We do not pad sequences to the same length based on the whole data, but just on a batch level. That means we conduct padding and truncation on the data generator level for each batch separately, so that length of the sentences in a batch can vary in size. Additionally, we further improved this by not truncating based on the length of the longest sequence in the batch but based on the 95% percentile of lengths within the sequence. This improved runtime heavily and kept accuracy quite robust on single model level, and improved it by being able to average more models."
This got @papasmurfff and me thinking about text augmentation, and from there @papasmurfff found the eda_nlp repo.
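The quoted idea can be reimplemented as a small batch-level padding helper; this is an illustrative sketch of the technique, not the competitors' actual code:

```python
import numpy as np

def pad_batch(sequences, pad_value=0, percentile=95):
    # Pad/truncate a batch of token-id lists to the 95th percentile of lengths
    # within this batch, instead of a global maximum length.
    lengths = [len(seq) for seq in sequences]
    max_len = max(int(np.percentile(lengths, percentile)), 1)
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_len]
        batch[i, :len(trimmed)] = trimmed
    return batch

# Each batch is padded only as far as it needs to be.
print(pad_batch([[5, 3, 9], [1, 2], [7, 7, 7, 7, 7, 7, 7, 7]]).shape)
```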
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.
This toy dataset features 150000 rows and 6 columns.
Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.
Number: A simple index number for each row
City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
Gender: Gender of a person (Male or Female)
Age: The age of a person (Ranging from 25 to 65 years)
Income: Annual income of a person (Ranging from -674 to 177175)
Illness: Is the person Ill? (Yes or No)
Stock photo by Mika Baumeister on Unsplash.
This is an analysis of the data on Spotify tracks from 1921-2020 with Jupyter Notebook and Python Data Science tools.
The Spotify dataset (titled data.csv) consists of 160,000+ tracks sorted by name, from 1921-2020 found in Spotify as of June 2020. Collected by Kaggle user and Turkish Data Scientist Yamaç Eren Ay, the data was retrieved and tabulated from the Spotify Web API. Each row in the dataset corresponds to a track, with variables such as the title, artist, and year located in their respective columns. Aside from the fundamental variables, musical elements of each track, such as the tempo, danceability, and key, were likewise extracted; the algorithm for these values were generated by Spotify based on a range of technical parameters.
Spotify Data.ipynb is the main notebook where the data is imported for EDA and FII. data.csv is the dataset downloaded from Kaggle. spotify_eda.html is the HTML file for the comprehensive EDA done using the Pandas Profiling module. Credits to gabminamedez for the original dataset.
A data science approach to predict and understand the applicant’s profile to minimize the risk of future loan defaults.
The dataset contains information about credit applicants. Banks, globally, use this kind of dataset and type of informative data to create models to help in deciding on who to accept/refuse for a loan. After all the exploratory data analysis, cleansing and dealing with all the anomalies we might (will) find along the way, the patterns of a good/bad applicant will be exposed to be learned by machine learning models.
Machine Learning issue and objectives: We’re dealing with a supervised binary classification problem. The goal is to train the best machine learning model to maximize predictive capability by deeply understanding past customers’ profiles, minimizing the risk of future loan defaults.
Performance Metric: The metric used for model evaluation is ROC AUC, given that we’re dealing with highly unbalanced data.
Project structure: The project is divided into three parts: EDA (exploratory data analysis), data wrangling (cleansing and feature selection), and machine learning (predictive modelling). A minimal modelling sketch follows the feature list below.
The dataset: you can download the dataset here.
Feature description:
id: Unique ID of the loan application.
grade: LC assigned loan grade.
annual_inc: The self-reported annual income provided by the borrower during registration.
short_emp: 1 when employed for 1 year or less.
emp_length_num: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
home_ownership: Type of home ownership.
dti (Debt-To-Income Ratio): A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
purpose: A category provided by the borrower for the loan request.
term: The number of payments on the loan. Values are in months and can be either 36 or 60.
last_delinq_none: 1 when the borrower had at least one event of delinquency.
last_major_derog_none: 1 when the borrower had at least 90 days of a bad rating.
revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_rec_late_fee: Late fees received to date.
od_ratio: Overdraft ratio.
bad_loan: 1 when a loan was not paid.
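A minimal modelling sketch using the features above and ROC AUC as the metric, assuming the file is named loan_data.csv (the file name is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("loan_data.csv")  # file name assumed

# One-hot encode categorical features; bad_loan is the target.
X = pd.get_dummies(df.drop(columns=["id", "bad_loan"]))
y = df["bad_loan"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```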
Note 😃😃😃😃 This data is for practising data analysis 🤝🎉
Please appreciate the effort with an upvote 👍 😃😃
Thank You ❤️❤️❤️
The Ecommerce Customer Dataset contains customer-related information from an online retail platform. It is often used for data analysis, customer segmentation, predictive modeling, and business intelligence tasks.
The dataset provides key details about customers, their demographics, and their purchasing behavior, which can help businesses understand their audience better and optimize decision-making.
📂 Features
Typical columns in an ecommerce dataset may include (depending on the version of the dataset you have):
CustomerID – Unique identifier for each customer
Gender – Male/Female/Other
Age – Customer’s age
Annual Income – Customer’s yearly income
Spending Score – A score assigned based on customer spending behavior
Purchase History – Past transactions or order details
Product Categories – Types of products bought
Date of Purchase – Timestamp of transactions
🎯 Use Cases
This dataset is widely used for:
Customer Segmentation, e.g., using K-Means clustering or RFM analysis (see the sketch after this list)
Recommendation Systems (suggesting products to customers)
Churn Prediction (identifying customers likely to stop buying)
Sales Forecasting (predicting future purchases or revenue trends)
Marketing Analytics (targeting specific customer groups with campaigns)
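A minimal K-Means segmentation sketch on income and spending score, assuming the file is named ecommerce_customers.csv and that the columns are named exactly 'Annual Income' and 'Spending Score' (the file name and column names are assumptions; as noted above, they depend on the dataset version you have):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("ecommerce_customers.csv")  # file name assumed

# Scale the two behavioural features and cluster customers into 5 segments.
X = StandardScaler().fit_transform(df[["Annual Income", "Spending Score"]])
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

df["Segment"] = kmeans.labels_
print(df.groupby("Segment")[["Annual Income", "Spending Score"]].mean())
```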
🛠️ Suitable For
Data Science and Machine Learning projects
Beginner to intermediate learners exploring EDA, clustering, regression, or classification
Business analysts focusing on customer insights and ecommerce growth
The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.
Data Analysis Tasks:
1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.
2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.
3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.
4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.
5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
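For the correlation and volatility tasks, a minimal pandas sketch, assuming the file is named stock_data.csv with Date, Company, and Close columns (the file name and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("stock_data.csv", parse_dates=["Date"])  # file name assumed

# One closing-price column per company.
close = df.pivot(index="Date", columns="Company", values="Close")

# Correlation matrix of closing prices across companies.
print(close.corr())

# Volatility: 30-day rolling standard deviation of daily returns.
volatility = close.pct_change().rolling(30).std()
print(volatility.tail())
```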
Machine Learning Tasks:
1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).
3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.
4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.
5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.
The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.
It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.
This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.
By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.
Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.
In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
When I first came across the competition a few weeks ago, I knew I just had to participate. I quickly started going through the data provided, and man oh man, was it large. Most of us haven't dealt with this type of data before, at least not if you are not yet working, or are a student like me.
I went over the EDA provided by Paul_Mooney and noticed a way of slicing the dataframe and fetching values from it. I wanted a simpler solution that would be easily understood by many.
I reviewed all previous datasets from 2018 to 2021 and found that there are common questions, plus a few added over the years. We will call these questions the "Look-up Questions".
I manually made an Excel sheet, aka the "Look-up Table", listing these questions row by row for all 5 years. Most importantly, I started adding their question tag (Q1, Q3, Q26_A, Q33_B, etc.) for every year.
Now what we have is:
A. A Look-up Table
B. Unique questions listed row by row
C. For every question, its column name for every year
Screenshot: https://imgur.com/fddPb94.jpg
Note: a blank space / empty field means that particular question was not asked in that specific year.
Screenshot: https://imgur.com/3BQLZUS.jpg
Screenshot: https://imgur.com/aQrumcx.jpg
A. Quick referencing: spend more time analyzing and less on fiddling
B. With a few custom functions (added below), a single line of code will get you any sort of data, filtered and categorized based on ANY other column
C. Works with previous years as well as future Kaggle survey analytics (given that the question format doesn't change; it hasn't changed for the past 5 years)
Here's a demo notebook: https://www.kaggle.com/code/pranav941/kaggle-analytics-helper-functions-2017-2022
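The author's helper functions are in the notebook linked above. Purely as a hypothetical illustration of how such a look-up table can be used (the file name, sheet layout, and column names below are assumptions, not the author's actual code):

```python
import pandas as pd

# Look-up table: one row per unique question, one column per survey year holding
# that year's question tag (e.g. Q1, Q26_A); a blank cell means "not asked".
lookup = pd.read_excel("lookup_table.xlsx", index_col="Question")  # layout assumed

def get_answers(question, year, surveys):
    # Fetch the responses for a given question text and year, if it was asked.
    tag = lookup.loc[question, str(year)]
    if pd.isna(tag):
        return None  # this question was not asked in that year
    return surveys[year][tag]  # surveys: dict of {year: per-year survey DataFrame}
```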
The loan providing companies find it hard to give loans to people due to their insufficient or non-existent credit history. Because of that, some consumers take advantage of this by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.
When the company receives a loan application, the company has to decide on loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision: if the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company; if the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.
The data given below contains information about the loan application at the time of applying for the loan. It contains two types of scenarios: the client with payment difficulties (he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample), and all other cases (the payment was made on time). When a client applies for a loan, there are four types of decisions that could be taken by the client/company:
Approved: The company has approved the loan application.
Cancelled: The client cancelled the application sometime during approval, either because the client changed her/his mind about the loan or, in some cases, because a higher-risk client received worse pricing which he did not want.
Refused: The company rejected the loan (because the client does not meet their requirements, etc.).
Unused offer: The loan was cancelled by the client but at different stages of the process.
The case study aims to identify patterns which indicate if a client has difficulty paying their instalments, which may be used for taking actions such as denying the loan, reducing the amount of the loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study. In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e., the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.
1. application_data.csv contains all the information about the client at the time of application. The data is about whether a client has payment difficulties.
2. previous_application.csv contains information about the client’s previous loan data: whether the previous application had been Approved, Cancelled, Refused or Unused offer.
3. columns_description.csv is a data dictionary which describes the meaning of the variables.
The solution is made in 2 different ipynb files. The first file contains a detailed analysis (EDA) of application_data to identify the important features that help us identify defaulters. The second file works on data where we inner-join the records (application_data, previous_application) on the same SK_ID_CURR.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset Title: Exploring Mars: A Comprehensive Dataset of Rover Photos and Metadata
Description
This dataset provides an extensive collection of Mars rover images paired with in-depth metadata. Sourced from various Mars missions, this dataset is a treasure trove for anyone interested in space exploration, planetary science, or computer vision.
Components:
Dataset Origin
The dataset was compiled from various Mars missions conducted over the years. Special care has been taken to include a diverse set of images to enable a wide range of analyses and applications.
Objective
As a learner delving into the field of Computer Vision, my objectives for this project are multi-fold:
Research Questions
Tools and Technologies
I plan to utilize Python for this project, particularly libraries like OpenCV for image processing, Pandas for data manipulation, and Matplotlib/Seaborn for data visualization. For machine learning tasks, I will likely use scikit-learn or TensorFlow.
Learning and Development
This project serves as both a learning exercise and a stepping stone toward more complex computer vision projects. I aim to document my learning journey, challenges, and milestones in a series of Kaggle notebooks.
Collaboration and Feedback
I warmly invite the Kaggle community to offer suggestions, critiques, or even collaborate on this venture. Your insights could be invaluable in enhancing the depth and breadth of this project.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
amazon_cell_phones_updated.csv — this new version includes updated features and additional data fields, while preserving the core structure and main features of the original dataset.
amazon_cell_phones_original_old.csv — the current dataset description refers to this version.
📌 Note: The description below is still based on the original dataset (amazon_cell_phones_original_old.csv). For the latest structure and feature details, please refer directly to amazon_cell_phones_updated.csv.
Various details about cell phones listed on Amazon USA, such as product name, price, rating, number of ratings, and technical specifications like RAM, storage, screen size, and more.
This dataset contains detailed information about cellphones listed on Amazon, scraped using Selenium and BeautifulSoup. It includes product details such as the name, price, ratings, specifications (RAM, storage, screen size, etc.), and additional metadata like the number of ratings and discount percentage. The dataset was designed to provide insights into cellphone features, pricing trends, and customer feedback on one of the world's largest e-commerce platforms.
The data was scraped from Amazon's cellphone category pages over multiple pages (up to 250 pages). Given Amazon's structure, the dataset includes a wide variety of cellphone brands and models, including older and newer releases.
The dataset has been cleaned to remove duplicates and standardize data entries. Missing values were handled where possible, and units of measurement (e.g., RAM, storage) have been converted for consistency.
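A minimal sketch of that kind of unit standardization, assuming columns named RAM and Storage holding strings such as "8 GB" or "512 MB" (the column names and raw formats are assumptions):

```python
import re
import pandas as pd

phones = pd.read_csv("amazon_cell_phones_updated.csv")

def to_gb(value):
    # Convert strings like "8 GB", "512 MB", or "1 TB" to a numeric value in GB.
    if not isinstance(value, str):
        return None
    match = re.search(r"([\d.]+)\s*(GB|MB|TB)", value, flags=re.IGNORECASE)
    if not match:
        return None
    amount, unit = float(match.group(1)), match.group(2).upper()
    return amount * {"MB": 1 / 1024, "GB": 1, "TB": 1024}[unit]

for col in ["RAM", "Storage"]:  # column names assumed
    phones[col + "_gb"] = phones[col].apply(to_gb)

phones = phones.drop_duplicates()
print(phones.head())
```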
Important Note: The "Top 250 French Movies" dataset comprises information on the highest-rated French movies according to user ratings on various platforms. This dataset contains 250 unique French movies that have garnered critical acclaim and popularity among viewers. Each movie is associated with essential details, including its rank, title, release year, duration, genre, IMDb rating, image source link, and a brief description.
This dataset is intended for learning, research, and analysis purposes. The movie ratings and details provided in the dataset are based on publicly available information at the time of scraping. As IMDb ratings and movie information may change over time, it is essential to verify and update the data for the latest information.
By using this dataset, you acknowledge that the accuracy and completeness of the information cannot be guaranteed, and you assume responsibility for any analysis or decision-making based on the data. Additionally, please adhere to IMDb's terms of use and copyright policies when using the data for any public dissemination or commercial purposes.
Data Analysis Tasks:
1. Exploratory Data Analysis (EDA): Explore the distribution of movies by genres, release years, and IMDb ratings. Visualize the top-rated French movies and their IMDb ratings using bar charts or histograms.
2. Year-wise Trends: Observe trends in French movie production over the years using line charts or area plots. Analyze if there's any correlation between release year and IMDb ratings.
3. Word Cloud Analysis: Create word clouds from movie descriptions to visualize the most common words and themes among the top-rated French movies. This can provide insights into popular topics and genres.
4. Network Analysis: Build a network graph connecting French movies that share common actors or directors. Analyze the interconnectedness of movies based on their production teams.
Machine Learning Tasks:
1. Movie Recommendation System: Implement a content-based recommendation system that suggests French movies based on similarities in genre, release year, and IMDb ratings. Use techniques like cosine similarity or Jaccard similarity to measure movie similarities.
2. Movie Genre Classification: Build a multi-class classification model to predict the genre of a French movie based on its description (see the sketch after this list). Utilize Natural Language Processing (NLP) techniques like text preprocessing, TF-IDF, or word embeddings. Use classifiers like Logistic Regression, Naive Bayes, or Support Vector Machines.
3. Movie Sentiment Analysis: Perform sentiment analysis on movie descriptions to determine the overall sentiment (positive, negative, neutral) of each movie. Use sentiment lexicons or pre-trained sentiment analysis models.
4. Movie Rating Prediction: Develop a regression model to predict the IMDb rating of a French movie based on features like genre, release year, and description sentiment. Employ regression algorithms like Linear Regression, Decision Trees, or Random Forests.
5. Movie Clustering: Apply unsupervised clustering algorithms to group French movies with similar attributes. Use features like genre, IMDb rating, and release year to identify movie clusters. Experiment with algorithms like K-means clustering or hierarchical clustering.
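For task 2, a minimal TF-IDF plus Logistic Regression sketch, assuming the file is named top_250_french_movies.csv with 'description' and 'genre' columns (the file name and column names are assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

movies = pd.read_csv("top_250_french_movies.csv")  # file name assumed

X_train, X_test, y_train, y_test = train_test_split(
    movies["description"], movies["genre"], test_size=0.2, random_state=42)

# TF-IDF features over the descriptions, then a multi-class logistic regression.
model = make_pipeline(TfidfVectorizer(max_features=5000),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```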
Important Note: Ensure that the data is appropriately preprocessed and encoded for machine learning tasks. Handle any missing values, perform feature engineering, and split the dataset into training and testing sets. Evaluate the performance of each machine learning model using appropriate metrics such as accuracy, precision, recall, or Mean Squared Error (MSE) depending on the task.
It is crucial to remember that the performance of machine learning models may vary based on the dataset's size and quality. Interpret the results carefully and consider using cross-validation techniques to assess model generalization.
Lastly, please adhere to IMDb's terms of use and any applicable data usage policies while conducting data analysis and implementing machine learning models with this dataset.