Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 30 Year Mortgage Rate in the United States decreased to 6.77 percent on June 26 from 6.81 percent in the previous week. This dataset includes a chart with historical data for the United States 30 Year Mortgage Rate.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fixed 30-year mortgage rates in the United States averaged 6.88 percent in the week ending June 20 of 2025. This dataset provides the latest reported value for - United States MBA 30-Yr Mortgage Rate - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
DESCRIPTION
Create a model that predicts whether or not a loan will default, using the historical data.
Problem Statement:
For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using the historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features, which makes the problem more challenging.
Domain: Finance
Analysis to be done: Perform data preprocessing and build a deep learning prediction model.
Content:
Dataset columns and definition:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
Steps to perform:
Perform exploratory data analysis and then apply feature engineering. Follow up with a deep learning model to predict whether or not the loan will default, using the historical data.
Tasks:
Transform categorical values into numerical values (discrete)
Exploratory data analysis of different factors of the dataset.
Additional Feature Engineering
You will check the correlation between features and will drop those features which have a strong correlation
This will help reduce the number of features and will leave you with the most relevant features
After applying EDA and feature engineering, you are now ready to build the predictive models
In this part, you will create a deep learning model using Keras with Tensorflow backend
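The correlation-based feature pruning described in the tasks above could be sketched as follows. This is a minimal illustration in plain Python, not the project's prescribed method: the 0.9 threshold, the helper names `pearson` and `drop_correlated`, and the sample values are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(vx * vy)


def drop_correlated(columns, threshold=0.9):
    """Return the column names to keep.

    For every pair whose absolute correlation exceeds the threshold,
    the later column is dropped and the earlier one is kept.
    """
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(b)
    return [name for name in columns if name not in dropped]


# Made-up values, only to exercise the function.
sample = {
    "int.rate": [0.10, 0.12, 0.15, 0.20],
    "fico": [750, 720, 690, 640],
    "dti": [10, 25, 5, 18],
}
kept = drop_correlated(sample)
```

Which column of a correlated pair to keep is a modeling choice; here the first one encountered is kept purely for simplicity.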
Data Description
1 id : To uniquely identify every loan in the dataset.
2 member_id : To identify the borrower who has applied for the loan.
3 loan_amnt : The listed amount of the loan applied for by the borrower.
4 funded_amnt : The amount that was sanctioned by the LC.
5 term : The number of payments on the loan. Values are in months and can be either 36 or 60.
6 int_rate : Interest Rate on the loan
7 installment : The monthly payment owed by the borrower if the loan originates.
8 grade : LC assigned loan grade which depends on the borrower’s credit score.
9 sub_grade : LC assigned loan subgrade
10 emp_title : The job title supplied by the Borrower when applying for the loan.
11 emp_length : Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
12 home_ownership : The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
13 annual_inc : The self-reported annual income provided by the borrower during registration.
14 verification_status : Indicates if income was verified by LC, not verified, or if the income source was verified
15 issue_d : The month in which the loan was funded
16 loan_status : Current status of the loan
17 purpose : A category provided in the form of a code to indicate the purpose for the loan.
18 title : Explaining the ‘purpose’ of the loan.
19 dti : The debt to income ratio is the ratio of how much the borrower owes every month to the borrower’s income every month.
20 delinq_2yrs : The number of delinquencies (late installment payments) by the borrower in the past 2 years.
21 earliest_cr_line : The month-year the borrower's earliest reported credit line was opened
22 inq_last_6mths : Inquiries for loans made by the borrower over the past 6 months.
23 mths_since_last_delinq : Months that have passed since the borrower last missed the timely payment of an installment.
24 open_acc : The number of open credit lines in the borrower’s credit file.
25 pub_rec : Number of derogatory public records
26 revol_bal : Total credit revolving balance
27 revol_util : Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
28 total_acc : The total number of credit lines currently in the borrower's credit file
29 initial_list_status : The initial listing status of the loan. Possible values are W (whole), F (fractional)
30 out_prncp : Remaining outstanding principal for total amount funded
31 total_pymnt : Payments received to date for the total amount funded.
32 total_rec_prncp : Principal received to date.
33 total_rec_int : Interest received to date.
34 total_rec_late_fee : Late fees received to date.
35 recoveries : Total recovery procedures initiated against the borrower.
36 collection_recovery_fee : The fees collected during the recovery procedures.
37 last_pymnt_d : The last month when payment was received.
38 last_pymnt_amnt : The last payment amount received.
39 next_pymnt_d : Next scheduled payment date.
40 last_credit_pull_d : The most recent month LC pulled credit for this loan
41 collections_12_mths_ex_med : Number of collections in 12 months excluding medical collections
42 mths_since_last_major_derog : Months since most recent 90-day delinquency or worse rating
43 application_type : Indicates whether the loan is an individual application or a joint application with two co-borrowers
44 annual_inc_joint : The combined self-reported annual income provided by the co-borrowers during registration
45 dti_joint : A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income
46 acc_now_delinq : The number of accounts on which the borrower is now delinquent
47 tot_coll_amt : Total collection amounts ever owed by the borrower
48 tot_cur_bal : Total current balance of all accounts owned by the borrower
49 total_rev_hi_lim : Total high credit/credit limit
Lending Club offers peer-to-peer (P2P) loans through a technological platform for various personal finance purposes and is today one of the companies that dominate the US P2P lending market. The original dataset is publicly available on Kaggle and corresponds to all the loans issued by Lending Club between 2007 and 2018. The present version of the dataset is for constructing a granting model, that is, a model designed to make decisions on whether to grant a loan based on information available at the time of the loan application. Consequently, our dataset only has a selection of variables from the original one, namely the variables known at the moment the loan request is made. Furthermore, the target variable of a granting model represents the final status of the loan, which is either "default" or "fully paid". Thus, we filtered out from the original dataset all the loans in transitory states. Our dataset comprises 1,347,681 records or obligations (approximately 60% of the original) and it was also cleaned for completeness and consistency (less than 1% of our dataset was filtered out).
TARGET VARIABLE
The dataset includes a target variable based on the final resolution of the credit: the default category corresponds to the event charged off and the non-default category to the event fully paid. It does not consider other values in the loan status variable since this variable represents the state of the loan at the end of the considered time window. Thus, there are no loans in transitory states. The original dataset includes the target variable “loan status”, which contains several categories ('Fully Paid', 'Current', 'Charged Off', 'In Grace Period', 'Late (31-120 days)', 'Late (16-30 days)', 'Default'). However, in our dataset, we just consider loans that are either “Fully Paid” or “Default” and transform this variable into a binary variable called “Default”, with a 0 for fully paid loans and a 1 for defaulted loans.
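The construction of the binary target described above can be sketched as below. This is an illustration only: it assumes the default event is the status 'Charged Off' (as the first sentence of the paragraph states), that records are plain dictionaries with a `loan_status` key, and the helper name `build_target` is hypothetical.

```python
# Assumption: 'Charged Off' is the default event, 'Fully Paid' the non-default event.
STATUS_TO_DEFAULT = {"Fully Paid": 0, "Charged Off": 1}


def build_target(loans):
    """Keep only loans in a final state and attach a binary 'Default' label."""
    labelled = []
    for loan in loans:
        status = loan["loan_status"]
        if status in STATUS_TO_DEFAULT:  # transitory states are filtered out
            labelled.append({**loan, "Default": STATUS_TO_DEFAULT[status]})
    return labelled


# Made-up records, only to exercise the function.
sample = [
    {"id": 1, "loan_status": "Fully Paid"},
    {"id": 2, "loan_status": "Current"},
    {"id": 3, "loan_status": "Charged Off"},
]
labelled_sample = build_target(sample)
```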
EXPLANATORY VARIABLES
The explanatory variables that we use correspond only to the information available at the time of the application. Variables such as the interest rate, grade, or subgrade are generated by the company as a result of a credit risk assessment process, so they were filtered out from the dataset as they must not be considered in risk models to predict the default in granting of credit.
FULL LIST OF VARIABLES
Loan identification variables:
id: Loan id (unique identifier).
issue_d: Month and year in which the loan was approved.
Quantitative variables:
revenue: Borrower's self-declared annual income during registration.
dti_n: Indebtedness ratio for obligations excluding mortgage. Monthly information. This ratio has been calculated considering the indebtedness of the whole group of applicants. It is estimated as the ratio calculated using the co-borrowers’ total payments on the total debt obligations divided by the co-borrowers’ combined monthly income.
loan_amnt: Amount of credit requested by the borrower.
fico_n: Defined between 300 and 850, reported by Fair Isaac Corporation as a risk measure based on historical credit information reported at the time of application. This value has been calculated as the average of the variables “fico_range_low” and “fico_range_high” in the original dataset.
experience_c: Binary variable that indicates whether the borrower is new to the entity. This variable is constructed from the credit date of the previous obligation in LC and the credit date of the current obligation; if the difference between dates is positive, it is not considered as a new experience with LC.
Categorical variables:
emp_length: Categorical variable with the employment length of the borrower (includes the no information category)
purpose: Credit purpose category for the loan request.
home_ownership_n: Homeownership status provided by the borrower in the registration process. Categories defined by LC: “mortgage”, “rent”, “own”, “other”, “any”, “none”. We merged the categories “other”, “any” and “none” as “other”.
addr_state: Borrower's residence state from the USA.
zip_code: Zip code of the borrower's residence.
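Two of the derived variables listed above, fico_n and home_ownership_n, are simple enough to sketch. The function names mirror the variable names for readability and are not taken from the dataset's own code; lowercasing the raw category before merging is an assumption.

```python
def fico_n(fico_range_low, fico_range_high):
    """Average of the two FICO range bounds from the original dataset."""
    return (fico_range_low + fico_range_high) / 2


# Categories that the description says were merged into "other".
MERGED_INTO_OTHER = {"other", "any", "none"}


def home_ownership_n(raw):
    """Normalise a raw home-ownership category and merge the rare ones."""
    value = raw.strip().lower()  # assumption: raw values may differ in case
    return "other" if value in MERGED_INTO_OTHER else value
```

For example, `fico_n(700, 704)` gives 702.0, and both "ANY" and "NONE" map to "other".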
Textual variables
title: Title of the credit request description provided by the borrower.
desc: Description of the credit request provided by the borrower.
We cleaned the textual variables. First, we removed all those descriptions that contained the default description provided by Lending Club on its web form (“Tell your story. What is your loan for?”). Moreover, we removed the prefix “Borrower added on DD/MM/YYYY >” from the descriptions to avoid any temporal background on them. Finally, as these descriptions came from a web form, we substituted all the HTML entities by their character (e.g. “&amp;” was substituted by “&”, “&lt;” was substituted by “<”, etc.).
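The three cleaning steps above might look roughly like this in Python. It is a sketch under assumptions: the exact date format in the prefix is not confirmed, so the regular expression accepts one or two digits for day and month and two to four for the year; returning an empty string for a removed description is also an assumption.

```python
import html
import re

DEFAULT_PROMPT = "Tell your story. What is your loan for?"
# Assumed shape of the prefix; the real data may format the date differently.
PREFIX = re.compile(r"Borrower added on \d{1,2}/\d{1,2}/\d{2,4} >\s*")


def clean_description(desc):
    """Apply the three cleaning steps to one loan description."""
    if DEFAULT_PROMPT in desc:
        return ""  # description was never filled in by the borrower
    desc = PREFIX.sub("", desc)  # drop every dated prefix
    return html.unescape(desc).strip()  # e.g. "&amp;" becomes "&"
```

`html.unescape` is part of the standard library and handles named and numeric character references, so no hand-written entity table is needed.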
RELATED WORKS
This dataset has been used in the following academic articles:
Sanz-Guerrero, M., Arroyo, J. (2024). Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. arXiv preprint arXiv:2401.16458. https://doi.org/10.48550/arXiv.2401.16458
Ariza-Garzón, M.J., Arroyo, J., Caparrini, A., Segovia-Vargas, M.J. (2020). Explainability of a machine learning granting scoring model in peer-to-peer lending. IEEE Access 8, 64873 - 64890. https://doi.org/10.1109/ACCESS.2020.2984412
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Mortgage and Loans Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Mortgage and Loans] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-mortgage-loans-llm-chatbot-training-dataset.
The CoreLogic Loan-Level Market Analytics (LLMA) for primary mortgages dataset contains detailed loan data, including origination, events, performance, forbearance and inferred modification data.
CoreLogic sources the Loan-Level Market Analytics data directly from loan servicers. CoreLogic cleans and augments the contributed records with modeled data. The Data Dictionary indicates which fields are contributed and which are inferred.
The Loan-Level Market Analytics data is aimed at providing lenders, servicers, investors, and advisory firms with the insights they need to make trustworthy assessments and accurate decisions. Stanford Libraries has purchased the Loan-Level Market Analytics data for researchers interested in housing, economics, finance and other topics related to prime and subprime first lien data.
CoreLogic provided the data to Stanford Libraries as pipe-delimited text files, which we have uploaded to Data Farm (Redivis) for preview, extraction and analysis.
For more information about how the data was prepared for Redivis, please see CoreLogic 2024 GitLab.
Per the End User License Agreement, the LLMA Data cannot be commingled (i.e. merged, mixed or combined) with Tax and Deed Data that Stanford University has licensed from CoreLogic, or other data which includes the same or similar data elements or that can otherwise be used to identify individual persons or loan servicers.
The 2015 major release of CoreLogic Loan-Level Market Analytics (for primary mortgages) was intended to enhance the CoreLogic servicing consortium through data quality improvements and integrated analytics. See CL_LLMA_ReleaseNotes.pdf for more information about these changes.
For more information about included variables, please see CL_LLMA_Data_Dictionary.pdf.
For more information about how the database was set up, please see LLMA_Download_Guide.pdf.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Zillow Housing Aspirations Report’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/zillow-housing-aspirations-reporte on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Additional Data Products
Product: Zillow Housing Aspirations Report
Date: April 2017
Definitions
Home Types and Housing Stock
- All Homes: Zillow defines all homes as single-family, condominium and co-operative homes with a county record. Unless specified, all series cover this segment of the housing stock.
- Condo/Co-op: Condominium and co-operative homes.
- Multifamily 5+ units: Units in buildings with 5 or more housing units that are not condominiums or co-ops.
- Duplex/Triplex: Housing units in buildings with 2 or 3 housing units.
Additional Data Products
- Zillow Home Value Forecast (ZHVF): The ZHVF is the one-year forecast of the ZHVI. Our forecast methodology is described in a methodology post.
- Zillow creates our negative equity data using our own data in conjunction with data received through our partnership with TransUnion, a leading credit bureau. We match estimated home values against actual outstanding home-related debt amounts provided by TransUnion. To read more about how we calculate our negative equity metrics, please see here.
- Cash Buyers: The share of homes in a given area purchased without financing/in cash. To read about how we calculate our cash buyer data, please see our research brief.
- Mortgage Affordability, Rental Affordability, Price-to-Income Ratio, Historical ZHVI, Historical ZHVI and Household Income are calculated as a part of Zillow’s quarterly Affordability Indices. To calculate mortgage affordability, we first calculate the mortgage payment for the median-valued home in a metropolitan area by using the metro-level Zillow Home Value Index for a given quarter and the 30-year fixed mortgage interest rate during that time period, provided by the Freddie Mac Primary Mortgage Market Survey (based on a 20 percent down payment). Then, we consider what portion of the monthly median household income (U.S. Census) goes toward this monthly mortgage payment. Median household income is available with a lag. For quarters where median income is not available from the U.S. Census Bureau, we calculate future quarters of median household income by estimating it using the Bureau of Labor Statistics’ Employment Cost Index. The affordability forecast is calculated similarly to the current affordability index but uses the one year Zillow Home Value Forecast instead of the current Zillow Home Value Index and a specified interest rate in lieu of PMMS. It also assumes a 20 percent down payment. We calculate rent affordability similarly to mortgage affordability; however we use the Zillow Rent Index, which tracks the monthly median rent in particular geographical regions, to capture rental prices. Rents are chained back in time by using U.S. Census Bureau American Community Survey data from 2006 to the start of the Zillow Rent Index, and Decennial Census for all other years.
- The mortgage rate series is the average mortgage rate quoted on Zillow Mortgages for a 30-year, fixed-rate mortgage in 15-minute increments during business hours, 6:00 AM to 5:00 PM Pacific. It does not include quotes for jumbo loans, FHA loans, VA loans, loans with mortgage insurance or quotes to consumers with credit scores below 720. Federal holidays are excluded. The jumbo mortgage rate series is the average jumbo mortgage rate quoted on Zillow Mortgages for a 30-year, fixed-rate, jumbo mortgage in one-hour increments during business hours, 6:00 AM to 5:00 PM Pacific Time. It does not include quotes to consumers with credit scores below 720. Traditional federal holidays and hours with insufficient sample sizes are excluded.
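The mortgage affordability calculation described in the list above can be illustrated with a small sketch. The description does not give Zillow's exact payment formula, so the standard fixed-rate amortization formula is used here as an assumption; the function names and the example figures are likewise hypothetical.

```python
def monthly_payment(principal, annual_rate, years=30):
    """Level monthly payment on a fixed-rate loan (standard amortization)."""
    n = years * 12
    r = annual_rate / 12
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def mortgage_affordability(home_value, annual_rate, annual_income, down=0.20):
    """Share of monthly income spent on the mortgage payment.

    Assumes a 20 percent down payment, as stated in the description.
    """
    payment = monthly_payment(home_value * (1 - down), annual_rate)
    return payment / (annual_income / 12)


# Hypothetical inputs: median home value, 30-year fixed rate, median household income.
share = mortgage_affordability(300_000, 0.06, 72_000)
```

With these made-up inputs the payment on the 240,000 financed is roughly 1,439 per month, or about 24 percent of monthly income.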
About Zillow Data (and Terms of Use Information)
- Zillow is in the process of transitioning some data sources with the goal of producing published data that is more comprehensive, reliable, accurate and timely. As this new data is incorporated, the publication of select metrics may be delayed or temporarily suspended. We look forward to resuming our usual publication schedule for all of our established datasets as soon as possible, and we apologize for any inconvenience. Thank you for your patience and understanding.
- All data accessed and downloaded from this page is free for public use by consumers, media, analysts, academics etc., consistent with our published Terms of Use. Proper and clear attribution of all data to Zillow is required.
- For other data requests or inquiries for Zillow Real Estate Research, contact us here.
- All files are time series unless noted otherwise.
- To download all Zillow metrics for specific levels of geography, click here.
- To download a crosswalk between Zillow regions and federally defined regions for counties and metro areas, click here.
- Unless otherwise noted, all series cover single-family residences, condominiums and co-op homes only.
Source: https://www.zillow.com/research/data/
This dataset was created by Zillow Data and contains around 200 samples.
If you use this dataset in your research, please credit Zillow Data
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark interest rate in the United States was last recorded at 4.50 percent. This dataset provides the latest reported value for - United States Fed Funds Rate - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides information about people applying for loans, including details on their personal background, finances, and loan specifics. It's meant to help us better understand how different personal factors impact whether a loan gets approved. The data includes things like the applicant's age, income, home ownership status, job history, and credit score, along with loan details such as the loan amount, interest rate, and purpose. It also shows whether the loan was approved or denied.
Features in the dataset:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Iran IR: Lending Interest Rate data was reported at 18.000 % pa in 2016. This records an increase from the previous number of 14.210 % pa for 2015. Iran IR: Lending Interest Rate data is updated yearly, averaging 12.000 % pa from Dec 2004 (Median) to 2016, with 13 observations. The data reached an all-time high of 18.000 % pa in 2016 and a record low of 11.000 % pa in 2013. Iran IR: Lending Interest Rate data remains in active status in CEIC and is reported by World Bank. The data is categorized under Global Database’s Iran – Table IR.World Bank.WDI: Interest Rates. Lending rate is the bank rate that usually meets the short- and medium-term financing needs of the private sector. This rate is normally differentiated according to creditworthiness of borrowers and objectives of financing. The terms and conditions attached to these rates differ by country, however, limiting their comparability. Source: International Monetary Fund, International Financial Statistics and data files.
The FHA Office of Housing last conducted a series of mortgage loan sales under the Single Family Loan Sale (SFLS) Initiative in 2016. The current sales structure consisted of whole loan, competitive auctions, offering for purchase defaulted single family mortgages provided by FHA-approved loan servicers. The loans sold contained specified representations and warranties and may be sold with post-sale restrictions and/or reporting requirements. FHA sold loans in large national pools, as well as loan pools in designated geographical areas that are aimed at a neighborhood stabilization outcome (“NSO pools”).
This dataset contains loan application data aimed at detecting fraud. Each application has a loan status that indicates the outcome and is categorized into two groups: normal loans (value 0) and fraudulent loans (value 1).
Normal loans include statuses like Paid Off Loan, Charged Off Paid Off, and Settlement Paid Off. Fraudulent loans include Rejected, Internal Collection, and Charged Off. Other statuses are excluded from the classification process.
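The status grouping above can be written as a small lookup. This is only a sketch: the status strings are copied from the description, while the function name `fraud_label` and the use of `None` for excluded statuses are assumptions.

```python
NORMAL_STATUSES = {"Paid Off Loan", "Charged Off Paid Off", "Settlement Paid Off"}
FRAUD_STATUSES = {"Rejected", "Internal Collection", "Charged Off"}


def fraud_label(loan_status):
    """Return 0 for normal loans, 1 for fraudulent loans, None if excluded."""
    if loan_status in NORMAL_STATUSES:
        return 0
    if loan_status in FRAUD_STATUSES:
        return 1
    return None  # other statuses are left out of the classification
```

Exact string matching matters here, since "Charged Off" (fraudulent) and "Charged Off Paid Off" (normal) fall into different groups.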
The dataset includes training data for building predictive models and evaluation data for testing. A detailed dictionary is provided to explain the columns for clarity.
In the data.zip file, you will find 3 folders:
The submission.csv file should be filled with predictions based on this data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘SBA Loans Case Data Set’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/larsen0966/sba-loans-case-data-set on 13 February 2022.
--- Dataset description provided by original source is as follows ---
If you like the data set and download it, an upvote would be appreciated.
The Small Business Administration (SBA) was founded in 1953 to assist small businesses in obtaining loans. Small businesses have been the primary source of employment in the United States. Helping small businesses helps with job creation, which reduces unemployment. Small business growth also promotes economic growth. One of the ways the SBA helps small businesses is by guaranteeing bank loans. This guarantee reduces the risk to banks and encourages them to lend to small businesses. If the loan defaults, the SBA covers the amount guaranteed, and the bank suffers a loss for the remaining balance.
There have been several small business success stories like FedEx and Apple. However, the rate of default is very high. Many economists believe the banking market works better without the assistance of the SBA. Supporters claim that the social benefits and job creation outweigh any financial costs to the government in defaulted loans.
The original data set is from the U.S. SBA loan database, which includes historical data from 1987 through 2014 (899,164 observations) with 27 variables. The data set includes information on whether the loan was paid off in full or if the SBA had to charge off any amount and how much that amount was. The data set used is a subset of the original set. It contains loans in the Real Estate and Rental and Leasing industry in California. This file has 2,102 observations and 35 variables. The column Default is an integer of 1 or zero, and I had to change this column to a factor.
For more information on this data set go to https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
If you like the data set and download it, an upvote would be appreciated.
The Small Business Administration (SBA) was founded in 1953 to assist small businesses in obtaining loans. Small businesses have been the primary source of employment in the United States. Helping small businesses helps with job creation, which reduces unemployment. Small business growth also promotes economic growth. One of the ways the SBA helps small businesses is by guaranteeing bank loans. This guarantee reduces the risk to banks and encourages them to lend to small businesses. If the loan defaults, the SBA covers the amount guaranteed, and the bank suffers a loss for the remaining balance.
There have been several small business success stories like FedEx and Apple. However, the rate of default is very high. Many economists believe the banking market works better without the assistance of the SBA. Supporters claim that the social benefits and job creation outweigh any financial costs to the government in defaulted loans.
The original data set is from the U.S. SBA loan database, which includes historical data from 1987 through 2014 (899,164 observations) with 27 variables. The data set includes information on whether the loan was paid off in full or if the SBA had to charge off any amount and how much that amount was. The data set used is a subset of the original set. It contains loans in the Real Estate and Rental and Leasing industry in California. This file has 2,102 observations and 35 variables. The column Default is an integer of 1 or zero, and I had to change this column to a factor.
For more information on this data set go to https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342
Dataset of UK mortgage products with 1-year fixed terms, including initial rates, APRC, fees, and LTV percentages.
Weekly updated dataset of Santander mortgage offerings, including interest rates, APRC, fees, and LTV for each product.
DESCRIPTION
A banking institution requires actionable insights into mortgage-backed securities, geographic business investment, and real estate analysis. The mortgage bank would like to identify potential monthly mortgage expenses for each region based on monthly family income and rental of the real estate. A statistical model needs to be created to predict the potential demand, in dollar amount of loans, for each region in the USA. Also, a dashboard needs to be created that refreshes periodically after data is retrieved from the agencies. The dashboard must demonstrate relationships and trends for the key metrics as follows: number of loans, average rental income, monthly mortgage and owner’s cost, family income vs mortgage cost comparison across different regions. The dashboard is not limited to the metrics described here.
Dataset Description
Variable: Description
Second mortgage: Households with a second mortgage statistics
Home equity: Households with a home equity loan statistics
Debt: Households with any type of debt statistics
Mortgage Costs: Statistics regarding mortgage payments, home equity loans, utilities, and property taxes
Home Owner Costs: Sum of utilities, and property taxes statistics
Gross Rent: Contract rent plus the estimated average monthly cost of utility features
High school Graduation: High school graduation statistics
Population Demographics: Population demographics statistics
Age Demographics: Age demographic statistics
Household Income: Total income of people residing in the household
Family Income: Total income of people related to the householder
Project Task: Week 1
Data Import and Preparation:
Import data.
Figure out the primary key and look for the requirement of indexing.
Gauge the fill rate of the variables and devise plans for missing value treatment. Please explain explicitly the reason for the treatment chosen for each variable.
Exploratory Data Analysis (EDA):
Perform debt analysis. You may take the following steps:
Explore the top 2,500 locations where the percentage of households with a second mortgage is the highest and percent ownership is above 10 percent. Visualize using geo-map. You may keep the upper limit for the percent of households with a second mortgage to 50 percent
Use the following bad debt equation:
Bad Debt = P (Second Mortgage ∪ Home Equity Loan)
Bad Debt = second_mortgage + home_equity - home_equity_second_mortgage
Create pie charts to show overall debt and bad debt
Create Box and whisker plot and analyze the distribution for 2nd mortgage, home equity, good debt, and bad debt for different cities
Create a collated income distribution chart for family income, household income, and remaining income
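The bad debt equation from the steps above translates directly into code. The sketch below assumes each input is the share of households in a location (a value between 0 and 1). The task text does not define "good debt", so `good_debt` as total debt minus bad debt is a hypothetical definition added for illustration.

```python
def bad_debt(second_mortgage, home_equity, home_equity_second_mortgage):
    """Share of households with a second mortgage or a home equity loan.

    The overlap (households with both) is subtracted so it is counted once.
    """
    return second_mortgage + home_equity - home_equity_second_mortgage


def good_debt(debt, bad):
    """Assumed definition: any debt that is not bad debt."""
    return debt - bad


# Made-up shares for one location.
bad = bad_debt(0.05, 0.10, 0.03)
good = good_debt(0.60, bad)
```

The two resulting shares, together with the share of households without debt, are the kind of values that could feed the requested pie charts.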
Perform EDA and come out with insights into population density and age. You may have to derive new fields (make sure to weight averages for accurate measurements):
Use pop and ALand variables to create a new field called population density
Use male_age_median, female_age_median, male_pop, and female_pop to create a new field called median age
Visualize the findings using appropriate chart type
Create bins for population into a new variable by selecting an appropriate class interval so that the number of categories doesn’t exceed 5 for ease of analysis.
Analyze the married, separated, and divorced population for these population brackets
Visualize using appropriate chart type
Please detail your observations for rent as a percentage of income at an overall level, and for different states.
Perform correlation analysis for all the relevant variables by creating a heatmap. Describe your findings.
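The derived fields mentioned in the EDA tasks above (population density, a population-weighted median age, and population bins) could be sketched as follows. The field names pop, ALand, male_age_median, female_age_median, male_pop and female_pop come from the task text; the bin edges and labels are hypothetical and would need to be chosen from the actual data.

```python
from bisect import bisect_right


def population_density(pop, aland):
    """People per unit of land area; 0.0 when the land area is zero."""
    return pop / aland if aland else 0.0


def median_age(male_age_median, female_age_median, male_pop, female_pop):
    """Population-weighted average of the male and female median ages."""
    total = male_pop + female_pop
    if total == 0:
        return None
    return (male_age_median * male_pop + female_age_median * female_pop) / total


# Hypothetical class intervals: four edges give five categories.
BIN_EDGES = (1_000, 2_500, 5_000, 10_000)
BIN_LABELS = ("very low", "low", "medium", "high", "very high")


def population_bin(pop):
    """Map a population count to one of at most five brackets."""
    return BIN_LABELS[bisect_right(BIN_EDGES, pop)]
```

Weighting by male_pop and female_pop follows the hint in the task to weight averages; note that the result approximates the overall median age and is not a true median.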
Project Task: Week 2
Data Pre-processing:
The economic multivariate data has a significant number of measured variables. The goal is to find where the measured variables depend on a number of smaller unobserved common factors or latent variables.
Each variable is assumed to be dependent upon a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as “specific variance” because it is specific to one variable. Obtain the common factors and then plot the loadings. Use factor analysis to find latent variables in our dataset and gain insight into the linear relationships in the data.
Following are the list of latent variables:
Highschool graduation rates
Median population age
Second mortgage statistics
Percent own
Bad debt expense
Data Modeling:
Build a linear regression model to predict the total monthly expenditure for home mortgage loans.
Please refer to deplotment_RE.xlsx. Column hc_mortgage_mean is the predicted variable. This is the mean monthly mortgage and owner costs of the specified geographical location.
Note: Exclude loans from prediction model which have NaN (Not a Numb...
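As a rough illustration of the modeling task above, the sketch below fits an ordinary least squares line with a single predictor after dropping rows that contain NaN. It is deliberately simplified: the project presumably uses several predictors and a library model, and the predictor values and the name `fit_line` are assumptions. Only hc_mortgage_mean as the target comes from the task text.

```python
from math import isnan


def fit_line(xs, ys):
    """Least-squares slope and intercept, ignoring rows with NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not (isnan(x) or isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values: a hypothetical predictor and the target hc_mortgage_mean.
predictor = [1.0, 2.0, 3.0, float("nan")]
hc_mortgage_mean = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(predictor, hc_mortgage_mean)
```

The last row is skipped because its predictor is NaN, so the fit uses only the first three points.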
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The data given below contains information about each loan application at the time of applying for the loan. It covers two types of scenarios:
The client with payment difficulties: he/she was more than X days late on at least one of the first Y instalments of the loan in our sample.
All other cases: all other cases, in which the payments were made on time.
When a client applies for a loan, there are four types of decisions that could be taken by the client or the company:
Approved: The company has approved the loan application.
Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or, in some cases, the client was assessed as higher risk and received worse pricing that he/she did not want.
Refused: The company rejected the loan (because the client does not meet its requirements, etc.).
Unused offer: The loan was cancelled by the client, but at different stages of the process.
Weekly updated dataset of mortgage rates and offerings from TSB including details such as term length, initial interest rate, APRC, fees, and LTV.