Credit card default risk is the chance that a company or individual will be unable to repay borrowed money on time.
You are given relevant information about the customers of a company.
You are required to build a machine learning model that can predict whether a customer will default on their credit card.
The dataset folder contains the following files:
The columns provided in the dataset are as follows:
| Column name | Description |
| --- | --- |
| customer_id | Unique identification of a customer |
| name | Name of a customer |
| age | Age of a customer (in years) |
| gender | Gender of a customer (F means Female, M means Male) |
| owns_car | Whether a customer owns a car (Y means Yes, N means No) |
| owns_house | Whether a customer owns a house (Y means Yes, N means No) |
| no_of_children | Number of children of a customer |
| net_yearly_income | Net yearly income of a customer (in USD) |
| no_of_days_employed | Number of days the customer has been employed |
| occupation_type | Occupation type of a customer (IT staff, Managers, Accountants, Cooking staff, etc.) |
| total_family_members | Number of family members of a customer |
| migrant_worker | Whether a customer is a migrant worker (1 means Yes, 0 means No) |
| yearly_debt_payments | Yearly debt payments of a customer (in USD) |
| credit_limit | Credit limit of a customer (in USD) |
| credit_limit_used(%) | Percentage of the credit limit used by a customer |
| credit_score | Credit score of a customer |
| prev_defaults | Number of previous defaults |
| default_in_last_6months | Whether a customer has defaulted in the last 6 months (1 means Yes, 0 means No) |
| credit_card_default | Whether there will be a credit card default (1 means Yes, 0 means No) |
`score = 100 * metrics.f1_score(actual, predicted, average="macro")`
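For reference, here is a minimal, self-contained sketch of computing this macro F1 score locally with scikit-learn; the label arrays are illustrative placeholders, not real data:

```python
from sklearn import metrics

# Placeholder labels for illustration; substitute your own validation split.
actual = [0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 0, 1, 1]

# Macro averaging computes F1 per class and averages with equal weight,
# which matters here because defaults are typically the minority class.
score = 100 * metrics.f1_score(actual, predicted, average="macro")
print(f"score = {score:.2f}")
```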
Note: Ensure that your submission file contains the following:
AV HackLive - Guided Community Hackathon!
Data Science competitions can be daunting for someone who has never participated in one. Some of them have hundreds of competitors with top-notch industry knowledge and a splendid track record in such hackathons.
As a result, many beginners are apprehensive about getting started with these hackathons.
The top 3 questions that are commonly asked:
- Is it even worth it if I have a minimal chance of winning?
- How do I start?
- How can I improve my rank in the future?

Let's answer the first question before we go further.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset organized by the Open-Earth-Monitor (OEMC) project within the context of Hackathon 2023.
The dataset contains monthly mean FAPAR values aggregated by ground station. FAPAR represents the fraction of the incoming photosynthetically active radiation that is absorbed by vegetation, and is given in the range 0-1. It is a measure of vegetation health and ecosystem functioning, and a key parameter in light use efficiency models of primary productivity.
For each monthly FAPAR value, a set of covariates / features was extracted from 32 raster spatial layers, including satellite images (spectral bands and indices), temperature images (land surface temperature), climate images (precipitation), and a digital terrain model (slope and elevation). The features are organized in columns; unique data points in time are identified by the `sample_id` column, and data points belonging to the same location are identified by `station_number`.
Column names:
- `sample_id`: unique identifier of the data point
- `station`: ground station number
- `fapar`: monthly mean FAPAR
- `month`: month of measurement
- `modis_{..}`: NDVI, EVI, and reflectance bands 1 (red), 2 (near-infrared), 3 (blue), and 7 (mid-infrared), based on MOD13Q1
- `modis_lst_day_p{..}`: daytime land surface temperature, 5th, 50th, and 95th percentiles, based on MOD11A2
- `modis_lst_night_p{..}`: nighttime land surface temperature, 5th, 50th, and 95th percentiles, based on MOD11A2
- `wv_yearly_p{..}`: water vapour aggregated yearly by the 25th, 50th, and 75th percentiles, derived from MCD19A2
- `wv_monthly_lt_p{..}`: water vapour aggregated long-term monthly by the 25th, 50th, and 75th percentiles, based on MCD19A2
- `wv_monthly_lt_sd`: long-term monthly standard deviation of water vapour, based on MCD19A2
- `wv_monthly_ts_raw`: monthly water vapour time series, based on MCD19A2
- `wv_monthly_ts_smooth`: monthly water vapour time series smoothed using the Whittaker method, based on MCD19A2
- `accum_pr_monthly`: monthly accumulated precipitation, based on the CHELSA time series
- `dtm_{..}`: several DTM derivatives (elevation, slope, aspect (sine, cosine), curvature (up- and downslope), openness (negative, positive), compound topographic index (cti), valley bottom flatness (vbf)), based on MERIT DEM

Files:
- Train: data point (`sample_id` - index column), ground station (`station`), reference month (`month`), measured FAPAR (`fapar`), and the 32 features / covariates
- Test: data point (`sample_id` - index column), ground station (`station`), reference month (`month`), and the 32 features / covariates
- Sample submission: data point (`sample_id` - index column) and measured FAPAR (`fapar`)
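As a starting point, a minimal baseline might look like the sketch below; the file names `train.csv`, `test.csv`, and `submission.csv` are assumptions and may differ from the files actually provided in the dataset folder:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed file names; adjust to match the files actually provided.
train = pd.read_csv("train.csv", index_col="sample_id")
test = pd.read_csv("test.csv", index_col="sample_id")

# Everything except the target and the location/time identifiers is a feature.
features = [c for c in train.columns if c not in ("fapar", "station", "month")]

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(train[features], train["fapar"])

# Write predictions keyed by sample_id, mirroring the sample submission layout.
submission = pd.DataFrame({"fapar": model.predict(test[features])}, index=test.index)
submission.to_csv("submission.csv")
```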
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset organized by the Open-Earth-Monitor (OEMC) project within the context of Hackathon 2023.
The dataset (both train and test) was produced by stratified sampling of the ground-truth data provided by the LUCAS Survey, funded by the European Commission. The target land cover considers level-3 classes from the harmonized legend, resulting in 72 classes distributed over 5 years (2006, 2009, 2012, 2015, 2018):
All samples were overlaid with 416 raster spatial layers, including satellite images (spectral bands and indices), temperature images (land surface temperature), climate images (precipitation, air temperature), accessibility and distance maps (highways, water bodies, burned areas), a digital terrain model (slope and elevation), and other existing maps (population count and snow cover). The resulting values were organized in columns, one for each spatial layer, which combined represent the feature space available for ML modeling.
Column names:
The columns are formed by six metadata fields separated by `_`; see the parsing sketch below.
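As an illustration only (the column names below are hypothetical, and the exact semantics of each field are defined by the dataset documentation), the six fields can be recovered by splitting on the underscore; the first two fields, F1 and F2, determine the thematic groups listed further down:

```python
from collections import defaultdict

# Hypothetical column names, used only to demonstrate the parsing;
# real names follow the same six-field, "_"-separated convention.
columns = [
    "blue_landsat.glad.ard_p50_30m_s_20180101.20181231",
    "evi_mod13q1_p50_250m_s_20180101.20181231",
]

# Group columns by their first two metadata fields (F1 and F2).
groups = defaultdict(list)
for col in columns:
    f1, f2 = col.split("_")[:2]
    groups[(f1, f2)].append(col)

for key, cols in groups.items():
    print(key, cols)
```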
Column description:
All the columns can be aggregated into six thematic groups according to F1 and F2:
- **Satellite images:**
  - `blue_landsat.glad.ard_{..}`: Quarterly time series of the Landsat blue band (Witjes et al., 2023)
  - `blue_mod13q1_{..}`: Monthly time series of the MOD13Q1 blue band (EarthData)
  - `evi_mod13q1.stl.trend.ols.alpha_{..}`: Alpha coefficient / intercept (derived by OLS) over the deseasonalized monthly time series of the MOD13Q1 Enhanced Vegetation Index (EVI) (EarthData)
  - `evi_mod13q1.stl.trend.ols.beta_{..}`: Beta coefficient / trend (derived by OLS) over the deseasonalized monthly time series of the MOD13Q1 EVI (EarthData)
  - `evi_mod13q1.stl.trend_{..}`: Deseasonalized monthly time series (trend component of STL) of the MOD13Q1 EVI (EarthData)
  - `evi_mod13q1_{..}`: Monthly time series of the MOD13Q1 EVI (EarthData)
  - `green_landsat.glad.ard_{..}`: Quarterly time series of the Landsat green band (Witjes et al., 2023)
  - `mir_mod13q1_{..}`: Monthly time series of the MOD13Q1 mid-infrared band (EarthData)
  - `ndvi_mod13q1_{..}`: Monthly time series of the MOD13Q1 Normalized Difference Vegetation Index (NDVI) (EarthData)
  - `nir_landsat.glad.ard_{..}`: Quarterly time series of the Landsat near-infrared band (Witjes et al., 2023)
  - `nir_mod13q1_{..}`: Monthly time series of the MOD13Q1 near-infrared band (EarthData)
  - `red_landsat.glad.ard_{..}`: Quarterly time series of the Landsat red band (Witjes et al., 2023)
  - `red_mod13q1_{..}`: Monthly time series of the MOD13Q1 red band (EarthData)
  - `swir1_landsat.glad.ard_{..}`: Quarterly time series of the Landsat short-wave infrared-1 band (Witjes et al., 2023)
  - `swir2_landsat.glad.ard_{..}`: Quarterly time series of the Landsat short-wave infrared-2 band (Witjes et al., 2023)
- **Temperature images:**
  - `lst_mod11a2.daytime_{..}`: Monthly time series of MOD11A2 daytime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.daytime.{month}_{..}`: Long-term monthly aggregation (2000-2022) of MOD11A2 daytime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.daytime.trend_{..}`: Deseasonalized monthly time series (trend component of [STL](https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.STL.html#statsmodels.tsa.seasonal.STL)) of MOD11A2 daytime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.daytime.trend.ols.alpha_{..}`: Alpha coefficient / intercept (derived by [OLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html)) over the deseasonalized monthly time series of MOD11A2 daytime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.daytime.trend.ols.beta_{..}`: Beta coefficient / trend (derived by [OLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html)) over the deseasonalized monthly time series of MOD11A2 daytime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.nighttime_{..}`: Monthly time series of MOD11A2 nighttime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.nighttime.{month}_{..}`: Long-term monthly aggregation (2000-2022) of MOD11A2 nighttime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.nighttime.trend_{..}`: Deseasonalized monthly time series (trend component of [STL](https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.STL.html#statsmodels.tsa.seasonal.STL)) of MOD11A2 nighttime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.nighttime.trend.ols.alpha_{..}`: Alpha coefficient / intercept (derived by [OLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html)) over the deseasonalized monthly time series of MOD11A2 nighttime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `lst_mod11a2.nighttime.trend.ols.beta_{..}`: Beta coefficient / trend (derived by [OLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html)) over the deseasonalized monthly time series of MOD11A2 nighttime land surface temperature ([EarthData](https://lpdaac.usgs.gov/products/mod11a2v006/))
  - `thermal_landsat.glad.ard_{..}`: Quarterly time series of the Landsat thermal band ([Witjes et al., 2023](https://doi.org/10.7717/peerj.15478))
- **Climate layers:**
  - `accum.precipitation_chelsa.annual_{..}`: Precipitation accumulated over the entire year according to the CHELSA time series, in `mm` of water ([Karger et al., 2017](https://doi.org/10.1038/sdata.2017.122))
  - `accum.precipitation_chelsa.annual.3years.dif_{..}`: 3-year difference of the yearly accumulated precipitation according to the CHELSA time series, in `mm` of water ([Karger et al., 2017](https://doi.org/10.1038/sdata.2017.122))
  - `accum.precipitation_chelsa.annual.log.csum_{..}`: Cumulative sum, in logarithmic space, of the yearly accumulated precipitation according to the CHELSA time series ([Karger et al., 2017](https://doi.org/10.1038/sdata.2017.122))
  - `accum.precipitation_chelsa.montlhy_{..}`: Precipitation accumulated for each month according to the CHELSA time series, in `mm` of water ([Karger et al., 2017](https://doi.org/10.1038/sdata.2017.122))
  - `bioclim.var_chelsa.{variable_code}_{..}`: Bioclimatic variables derived from the monthly mean, maximum, and minimum temperature and mean precipitation values; for `variable_code` descriptions see [chelsa-climate.org](https://chelsa-climate.org/bioclim/) ([Karger et al., 2017](https://doi.org/10.1038/sdata.2017.122))
Overview
Welcome to the House Price Prediction Challenge, where you will test your regression skills by designing an algorithm that accurately predicts house prices in India. Accurately predicting house prices can be a daunting task: buyers are concerned with more than just the size (square feet) of the house, and various other factors play a key role in deciding the price of a house or property. It can be extremely difficult to figure out the right set of attributes that explain buyer behavior. This dataset was collected from various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict the prices as accurately as possible.
This competition also gives you plenty of room for feature engineering and for mastering advanced techniques such as Random Forests, deep neural networks, and various other ensembling methods.
Train.csv - 29451 rows x 12 columns
Test.csv - 68720 rows x 11 columns
Sample Submission - acceptable submission format (.csv/.xlsx file with 68720 rows)
POSTED_BY - Category marking who has listed the property
UNDER_CONSTRUCTION - Under construction or not
RERA - RERA approved or not
BHK_NO - Number of rooms
BHK_OR_RK - Type of property
SQUARE_FT - Total area of the house in square feet
READY_TO_MOVE - Category marking ready to move or not
RESALE - Category marking resale or not
ADDRESS - Address of the property
LONGITUDE - Longitude of the property
LATITUDE - Latitude of the property
What is the metric in this competition? How is the leaderboard calculated? Submissions will be evaluated using the RMSLE (Root Mean Squared Logarithmic Error) metric; one can use `np.sqrt(mean_squared_log_error(actual, predicted))`. This hackathon supports private and public leaderboards: the public leaderboard is evaluated on 30% of the test data, while the private leaderboard, made available at the end of the hackathon, is evaluated on 100% of the test data. A runnable version of the metric is sketched below.
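A minimal, self-contained sketch of the RMSLE computation with NumPy and scikit-learn; the prices here are illustrative placeholders:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Placeholder prices for illustration; substitute real targets and predictions.
actual = np.array([120.0, 85.5, 310.0, 45.0])
predicted = np.array([110.0, 90.0, 295.0, 50.0])

# RMSLE works on log(1 + x), so it penalizes relative rather than absolute
# error, treating cheap and expensive houses comparably.
rmsle = np.sqrt(mean_squared_log_error(actual, predicted))
print(f"RMSLE: {rmsle:.4f}")
```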
This data was shared by MachineHack; you can participate in the hackathon and submit your own entries. Link to the MachineHack hackathon: https://www.machinehack.com/hackathons/house_price_prediction_beat_the_benchmark/overview
https://creativecommons.org/publicdomain/zero/1.0/
This is a realistic and structured pizza sales dataset covering the time span from **2024 to 2025**. Whether you're a beginner in data science, a student working on a machine learning project, or an experienced analyst looking to test out time series forecasting and dashboard building, this dataset is for you.
📁 What’s Inside? The dataset contains rich details from a pizza business including:
✅ Order Dates & Times
✅ Pizza Names & Categories (Veg, Non-Veg, Classic, Gourmet, etc.)
✅ Sizes (Small, Medium, Large, XL)
✅ Prices
✅ Order Quantities
✅ Customer Preferences & Trends
It is neatly organized in Excel format and easy to use with tools like Python (pandas), Power BI, Excel, or Tableau; a loading sketch is shown below.
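For example, a minimal pandas sketch; the file name and column names here are assumptions, so check the actual workbook and adjust accordingly:

```python
import pandas as pd

# Assumed file and column names; verify against the actual Excel workbook.
sales = pd.read_excel("pizza_sales.xlsx", parse_dates=["order_date"])

# Example analysis: total revenue per month for a quick sales trend.
sales["revenue"] = sales["price"] * sales["quantity"]
monthly = sales.groupby(sales["order_date"].dt.to_period("M"))["revenue"].sum()
print(monthly)
```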
💡 **Why Use This Dataset?** This dataset is ideal for:
📈 Sales Analysis & Reporting
🧠 Machine Learning Models (demand forecasting, recommendations)
📅 Time Series Forecasting
📊 Data Visualization Projects
🍽️ Customer Behavior Analysis
🛒 Market Basket Analysis
📦 Inventory Management Simulations
🧠 **Perfect For:**
Data Science Beginners & Learners
BI Developers & Dashboard Designers
MBA Students (Marketing, Retail, Operations)
Hackathons & Case Study Competitions
pizza, sales data, excel dataset, retail analysis, data visualization, business intelligence, forecasting, time series, customer insights, machine learning, pandas, beginner friendly
https://creativecommons.org/publicdomain/zero/1.0/
American Express and Analytics Vidhya present “AmExpert 2021 – Machine Learning Hackathon”, an amazing opportunity to showcase your analytical abilities and talent!
Get a taste of the kind of challenges we face here at American Express on a day-to-day basis.
https://datahack.analyticsvidhya.com/contest/amexpert-2021-machine-learning-hackathon/
XYZ Bank is a mid-sized private bank offering a variety of banking products, such as savings accounts, current accounts, investment products, credit products, and home loans.
The bank wants to predict the next set of products for a set of customers in order to optimize its marketing and communication campaigns.
The data available in this problem contains the following information:
* User demographic details: gender, age, vintage, customer category, etc.
* Current product holdings
* Product holding in the next 6 months (only for the train dataset)
Here, our task is to predict the next set of products (up to 3 products) for a set of customers (test data) based on their demographics and current product holdings.
Train:
Customer_ID - Unique ID for the customer
Gender - Gender of the customer
Age - Age of the customer (in years)
Vintage - Vintage of the customer (in months)
Is_Active - Activity index (0: less frequent customer, 1: more frequent customer)
City_Category - Encoded category of the customer's city
Customer_Category - Encoded category of the customer
Product_Holding_B1 - Current product holding (encoded)
Product_Holding_B2 - Product holding in the next six months (encoded) - target column
Test:
Customer_ID - Unique ID for the customer
Gender - Gender of the customer
Age - Age of the customer (in years)
Vintage - Vintage of the customer (in months)
Is_Active - Activity index (0: less frequent customer, 1: more frequent customer)
City_Category - Encoded category of the customer's city
Customer_Category - Encoded category of the customer
Product_Holding_B1 - Current product holding (encoded)
The evaluation metric is Mean Average Precision (MAP) at K (K = 3). MAP is a well-known metric used to evaluate ranked retrieval results. For example, say that for a given customer we recommended 3 products and only the 1st and 3rd products are correct. The result would look like: 1, 0, 1.
In this case, the precision at 1 is 1 * 1/1 = 1, the precision at 2 is 0 * 1/2 = 0, and the precision at 3 is 1 * 2/3 = 0.67. The average precision is therefore (1 + 0 + 0.67) / 3 = 0.556.
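A minimal sketch of this computation in Python; it divides by K (3) to match the worked example above, and the product labels are hypothetical:

```python
def average_precision_at_k(actual, predicted, k=3):
    """Average precision at k for a single customer.

    actual: set of products the customer actually acquired.
    predicted: ranked list of up to k recommended products.
    """
    score, hits = 0.0, 0
    for i, product in enumerate(predicted[:k]):
        if product in actual:
            hits += 1
            score += hits / (i + 1)  # precision at this cut-off
    # Divide by k, as in the worked example: (1 + 0 + 0.67) / 3.
    return score / k

# Worked example: 1st and 3rd recommendations correct -> about 0.556.
print(average_precision_at_k({"P1", "P3"}, ["P1", "P2", "P3"]))

# MAP@3 is simply the mean of this value over all customers.
```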
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset compiles the official problem statements from Smart India Hackathon (SIH) 2024 along with the corresponding winning teams and their accepted solutions. By merging these two perspectives — what was asked, and what was delivered — this dataset provides a holistic view of India’s largest innovation challenge for students.
You’ll find:
✅ Problem titles and detailed descriptions ✅ Categories (Hardware/Software) ✅ Technology domains (e.g., MedTech, Sustainability, etc.) ✅ Winning team details — names, institutes, city/state ✅ Organizing departments/ministries and nodal centers
Whether you are a student preparing for future hackathons, a mentor guiding innovation challenges, or a policymaker interested in problem-solving trends, this dataset gives you a clear, data-backed lens on how real-world challenges are solved by India’s top student innovators.
💡 Use Cases Analyze which technology domains attracted the most winning solutions
Map regional innovation patterns (which states/institutes are winning most often)
Visualize which organizations posed the most impactful challenges
Build EDA or dashboards to track hackathon outcomes
Use as a reference to prepare for SIH 2025 or similar competitions
📄 License This dataset is released under CC BY 4.0. You’re free to use, modify, and share it with attribution.