Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).
The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.
The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.
The dataset has a tabular structure and was initially stored in CSV format. It contains:
Rows: 7,043 customer records
Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).
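As a rough illustration of how the Yes/No target can be prepared for a classifier, the sketch below maps the Churn column to 1/0 using only the Python standard library. The three sample rows and the customerID column name are hypothetical; only the Churn column and its Yes/No values are taken from the description above.

```python
import csv
import io

# Hypothetical three-row sample; only the "Churn" column name and its
# Yes/No values come from the dataset description.
SAMPLE_CSV = """customerID,tenure,Churn
A-001,1,Yes
A-002,34,No
A-003,2,Yes
"""


def encode_churn(rows):
    """Return a list of 1/0 labels from rows whose 'Churn' field is Yes/No."""
    mapping = {"yes": 1, "no": 0}
    labels = []
    for row in rows:
        value = row["Churn"].strip().lower()
        if value not in mapping:
            # Fail loudly rather than silently mislabel a record.
            raise ValueError(f"unexpected Churn value: {row['Churn']!r}")
        labels.append(mapping[value])
    return labels


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labels = encode_churn(rows)
churn_rate = sum(labels) / len(labels)
print(labels, round(churn_rate, 3))
```

Checking the share of positive labels in this way is a quick sanity test before training, since a strongly imbalanced target usually calls for stratified splitting.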
Naming Convention:
The table in the database is named telco_customer_churn_data.
Software Requirements:
To open and work with the dataset, any standard database client or programming language that supports MariaDB connections can be used (e.g., Python).
For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
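A minimal sketch of reading the table over a SQLAlchemy connection might look like the following. The host, port, credentials, database name, and the choice of the pymysql driver are all assumptions made for illustration; only the table name comes from the description. The actual DBRepo connection details are not given here.

```python
from urllib.parse import quote_plus

TABLE_NAME = "telco_customer_churn_data"  # table name given in the description


def build_mariadb_url(user, password, host, port, database, driver="pymysql"):
    """Build a SQLAlchemy URL for MariaDB, percent-encoding the credentials.

    The driver is an assumption; any MariaDB-compatible DBAPI driver
    supported by SQLAlchemy could be substituted.
    """
    return (
        f"mysql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def load_churn_table(url, table=TABLE_NAME):
    """Read the whole table into a pandas DataFrame."""
    # Imported lazily so the URL helper works without these packages installed.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(url)
    with engine.connect() as conn:
        return pd.read_sql_table(table, conn)


# Placeholder values only; replace with real connection details.
url = build_mariadb_url("reader", "p@ss word", "db.example.org", 3306, "churn_db")
```

Calling load_churn_table(url) against a reachable server would then return the records as a DataFrame ready for preprocessing.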
Additional Resources:
Source code for data loading, preprocessing, model training, and evaluation is available at the associated GitHub repository: https://github.com/nazerum/fair-ml-customer-churn
When reusing the dataset, users should be aware:
Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).
Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.
Business Problem Overview
Suppose that Reliance Jio Infocomm Limited has approached us with a problem. Customers in the telecom industry actively switch from one operator to another. Because the market is highly competitive, the industry experiences an average annual churn rate of 18-27%. Since acquiring a new customer costs 7-12 times more than retaining an existing one, customer retention matters more than customer acquisition. Our client, Jio, therefore wants to retain its highly profitable customers and wishes to predict which of them are at high risk of churning.
A postpaid customer usually informs the operator before moving to a competitor’s platform, so our client is more concerned about its prepaid customers, who usually churn or switch to a different operator without notice. This results in lost business because Jio cannot offer a promotional scheme in time to prevent the churn.
According to Jio, there are two kinds of churn: revenue-based and usage-based. Revenue-based churn covers customers who have not used any revenue-generating facilities, such as mobile data, outgoing calls, caller tunes, or SMS, over a given period of time. To identify such customers, Jio usually uses an aggregate metric such as ‘customers who have generated less than ₹ 7 per month in total revenue’. The disadvantage of this metric is that many Jio customers who use the service only for incoming calls are also counted as churned, since they generate no direct revenue. In such cases, the revenue is generated by relatives who also use the Jio network to call them. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.
The other type of churn, according to our client, is usage-based: customers who do not use any of the services, i.e., no calls (either incoming or outgoing), no internet usage, no SMS, etc. The problem with this segment is that by the time one realizes a customer is not using any of the services, it may be too late to take corrective measures, since the customer might already have switched to another operator. Reliance Jio Infocomm Limited has asked us to help predict which customers will churn under the usage-based definition.
Another aspect to bear in mind is that, according to Jio, 80% of its revenue is generated by the top 20% of its customers. Jio calls this group high-value customers. If we can help reduce churn among high-value customers, we can reduce significant revenue leakage. For this purpose, Jio wants us to define high-value customers using a certain metric, in the context of usage-based churn, and to predict churn only for high-value customers in the prepaid segment.
Understanding the Data-set
The data-set contains customer-level information for a span of four consecutive months: June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. The business objective is to predict churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, it helps to understand typical customer behavior during churn.
Understanding Customer Behavior During Churn
Customers usually do not decide to switch to a competitor instantly, but rather over a period of time (this applies especially to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:
1) The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.
2) The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually behaves differently than in the ‘good’ months. It is crucial to identify high-churn-risk customers in this phase, since corrective actions can still be taken at this point (such as matching the competitor’s offer or improving the service quality).
3) The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Note that at the time of prediction (i.e. the action months), this data is not available to you. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.
In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, and the fourth month is the ‘churn’ phase.
Data Dictionary
The data-set is available in a csv file named as “Company Data.csv” and the da...
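The tagging step described under "Understanding Customer Behavior During Churn" (label each customer from the churn-phase month, then discard that month's data) can be sketched roughly as follows. The record layout and the single aggregated usage value per month are hypothetical, because the actual column names are defined in the data dictionary, which is not reproduced here. Treating zero total usage in month 9 as churn is one possible reading of the usage-based definition.

```python
# Hypothetical per-customer records: one aggregated usage value per month.
# Real column names come from the data dictionary, which is not shown here.
customers = [
    {"id": "c1", "usage": {6: 120.0, 7: 95.0, 8: 10.0, 9: 0.0}},
    {"id": "c2", "usage": {6: 80.0, 7: 82.0, 8: 79.0, 9: 75.0}},
]

FEATURE_MONTHS = (6, 7, 8)  # 'good' phase (6, 7) plus 'action' phase (8)
CHURN_MONTH = 9             # 'churn' phase, used only for labelling


def tag_and_strip(record):
    """Label churn from month 9 usage, then keep only months 6-8 as features."""
    # Assumption: zero total usage in the churn month means the customer churned.
    churned = 1 if record["usage"][CHURN_MONTH] == 0 else 0
    features = {month: record["usage"][month] for month in FEATURE_MONTHS}
    return {"id": record["id"], "features": features, "churn": churned}


tagged = [tag_and_strip(customer) for customer in customers]
for row in tagged:
    print(row["id"], row["churn"], sorted(row["features"]))
```

Dropping the month 9 values after labelling keeps information from the churn phase out of the features, which would otherwise leak the target into the model.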