Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing customers' characteristics and giving early warning of customer churn with machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save considerable operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Two classic ensemble learning models, Random Forest (RF) and AdaBoost, were introduced, and an AdaBoost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM)), were used to analyze the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score and other indicators, and that the RF-AdaBoost dual-ensemble model performs best. The recall rates of the BPNN, RF, AdaBoost and RF-AdaBoost dual-ensemble models on positive samples are 79%, 90%, 89% and 93% respectively, the precision rates are 97%, 99%, 98% and 99%, and the F1 scores are 87%, 95%, 94% and 96%. The three indicators of the RF-AdaBoost dual-ensemble model are 10%, 1% and 6% higher than the reference. The prediction results provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
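As an illustration only (not the authors' code), a minimal sketch of the RF-AdaBoost dual-ensemble idea with scikit-learn: an AdaBoost classifier whose base learner is a Random Forest, applied after feature standardization. The file name, label column and parameters are hypothetical.

```python
# Illustrative sketch only: AdaBoost with a Random Forest base learner
# after feature standardization. File name and "churn" column are assumed.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("telecom_churn.csv")            # hypothetical file name
X, y = df.drop(columns=["churn"]), df["churn"]   # "churn" label column assumed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(
        # scikit-learn >= 1.2 uses `estimator`; older versions use `base_estimator`
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        n_estimators=10,
        random_state=0,
    ),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```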
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
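A minimal sketch for loading one of the ".csv" files with pandas: the files are tab-separated with no header row, so the column names from the table above are supplied manually. The file name is just an example.

```python
import pandas as pd

columns = ["isogramy", "length", "word", "source_pos", "count", "vol_count",
           "count_per_million", "vol_count_as_percent",
           "is_palindrome", "is_tautonym"]

# Tab-separated, no header row; names follow the column table above.
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)
print(df.head())
```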
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format
The SQLite database, on the other hand, combines the data from all four of the plain text files and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data-cleaning step. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database
The output data from the above step can easily be collated into a SQLite3 database, which allows the data to be queried directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
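For querying the finished database directly from Python (rather than R), a minimal sketch using the standard sqlite3 module. The actual table names are defined by create-database.sql, so they are listed first; the commented SELECT uses a hypothetical table name.

```python
import sqlite3

con = sqlite3.connect("isograms.db")
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("Tables in the database:", tables)

# Example query against one of the listed tables (replace 'some_table'):
# con.execute("SELECT word, length, count FROM some_table "
#             "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10")
con.close()
```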
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cotton is one of the important economic crops in China and one of the most important textile raw materials in the world. During the cotton planting process, various cotton diseases and pests have a significant impact on cotton yield. Constructing a knowledge graph of cotton diseases and pests, which digitizes their names, symptoms and prevention methods and stores them in graph form, is of great significance for precise and rapid prevention and control. This dataset was constructed for building such a knowledge graph. Based on books and websites related to cotton disease and pest control, unstructured data was collected through OCR and Python web crawling. After cleaning and merging the data, records for 30 common cotton diseases and 49 cotton pests were obtained. The dataset can be used to construct a knowledge graph of cotton diseases and pests, providing data support for the informatization and intelligent development of China's cotton planting industry.
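As a purely hypothetical illustration of what "storing names, symptoms and prevention methods in graph form" can mean, one disease record could be expressed as knowledge-graph triples; the entity and relation names below are illustrative, not taken from the dataset.

```python
# Hypothetical (entity, relation, value) triples for one disease record.
triples = [
    ("cotton Fusarium wilt", "is_a", "cotton disease"),
    ("cotton Fusarium wilt", "symptom", "yellowing and wilting of leaves"),
    ("cotton Fusarium wilt", "prevention_method", "plant resistant varieties and rotate crops"),
]

# Grouping triples by entity gives the per-disease view that a graph store
# (e.g. Neo4j or an RDF triple store) would be populated from.
graph = {}
for subject, relation, value in triples:
    graph.setdefault(subject, {}).setdefault(relation, []).append(value)
print(graph)
```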
Environmental surveillance, emergency response, and smart city planning all require the use of geospatial data, which includes satellite imagery, cartographic records, and real-time GPS coordinates. The high sensitivity and value of location-specific information make it unsafe to store and transmit it through conventional, centralized means, which can result in privacy breaches, unauthorized manipulations, and potential misuse. To address these challenges, this paper aims to design and implement a secure, blockchain-based framework that blends AES (Advanced Encryption Standard) encryption with RSA (Rivest–Shamir–Adleman) key management. The aim is to guarantee strong data confidentiality by using symmetric encryption, and to use public-key cryptography for granular access control and secure key distribution. The proposed system uses Ethereum smart contracts to connect encrypted data references to a decentralized ledger, ensuring tamper resistance and auditability. In the proposed system, a Python-based FastAPI backend is responsible for data ingestion, cleaning, encryption, and blockchain interaction, while a React frontend lets users upload datasets, generate encryption keys, and retrieve access permissions. Modular microservices and well-defined APIs can seamlessly integrate various components, such as data processing scripts and on-chain contract logic, during development. The system's scalability is demonstrated by evaluating its performance against various dataset sizes, using metrics such as encryption overhead, blockchain transaction costs, and smart contract execution times. The practical usability of the system in actual scenarios is demonstrated through user acceptance testing, which is crucial for adoption in resource-limited environments. The results show that the proposed crypto-enhanced blockchain framework can significantly enhance geospatial data security while still maintaining operational efficiency. Integration with zero-knowledge proofs may be explored in future work to enhance privacy, mitigate energy costs through alternative consensus algorithms, and enhance resilience in multi-network ecosystems through cross-chain interoperability.
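A minimal sketch, under assumptions and not the paper's implementation, of how the described AES/RSA hybrid could look with the Python cryptography package: the geospatial payload is encrypted with AES-GCM, the AES key is wrapped with RSA-OAEP, and only a reference to the ciphertext would then be anchored on-chain.

```python
# Hybrid AES + RSA sketch; requires the "cryptography" package.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA key pair of an authorized data consumer (managed by the key-distribution
# component in the real system; generated in place here for illustration).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

payload = b'{"lat": 52.52, "lon": 13.405}'      # illustrative geospatial record

# 1. Encrypt the data itself with a fresh symmetric key (AES-256-GCM).
aes_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(aes_key).encrypt(nonce, payload, None)

# 2. Wrap the AES key with the consumer's RSA public key (OAEP padding).
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(aes_key, oaep)

# 3. Decryption reverses both steps with the RSA private key.
recovered_key = private_key.decrypt(wrapped_key, oaep)
assert AESGCM(recovered_key).decrypt(nonce, ciphertext, None) == payload
```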
Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Analyses for this dataset could include time series, clustering, classification and more.
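As one possible starting point, a minimal pandas sketch that loads the UCI workbook and builds a daily revenue time series; the column names (InvoiceDate, Quantity, UnitPrice) follow the UCI file but treat them as assumptions if your copy differs.

```python
import pandas as pd

retail = pd.read_excel("Online Retail.xlsx")
retail["InvoiceDate"] = pd.to_datetime(retail["InvoiceDate"])
retail["Revenue"] = retail["Quantity"] * retail["UnitPrice"]

# Aggregate revenue per day for a simple time-series view.
daily_revenue = (retail.set_index("InvoiceDate")["Revenue"]
                       .resample("D").sum())
print(daily_revenue.head())
```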
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Using a self-built crawler tool (implemented in Python), the store reviews and comment information for different kinds of meals in Guangzhou were collected from Dianping.com's gourmet category. After deduplicating, denoising and cleaning the data, some of the original fields were preprocessed to generate a set of derived variables for subsequent research. Specific operations can be seen in the data documentation. The dataset contains a total of 3124 restaurants, classified by catering type: 722 Cantonese cuisine, 572 porridge and noodles, 566 Sichuan cuisine, 595 Japanese cuisine, and 669 western food. Data collection time is November 2017, and the data format is CSV.
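A minimal sketch (file and column names are hypothetical) for checking the per-cuisine counts reported above once the CSV is loaded with pandas.

```python
import pandas as pd

restaurants = pd.read_csv("guangzhou_restaurants.csv")   # hypothetical file name
print(restaurants["cuisine_type"].value_counts())        # hypothetical column name
```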
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is an academic network traffic classification dataset, designed for the recognition of DNS exfiltration.
All the credit goes to the original authors: Dr. Samaneh Mahdavifar, Amgad Hanafy Salem BSc., Princy Victor MSc., Dr. Miguel Garzon, Dr. Amir H. Razavi, Natasha Hellberg and Dr. Arash Habibi Lashkari. Please cite their original paper.
V1: Base dataset in CSV format as downloaded from here
V2: Aggregation per broad category (BenignExtra, HeavyAttack, LightAttack)
V3: Opinionated cleaning -> parquet; Minimal cleaning -> parquet
V4: Reorganize to save storage, only keep original CSVs in V1/V2
Important: There are two versions of the cleaning process. The minimalistic version stays as true to the data-as-provided as possible. The opinionated cleaning addresses several concerns I have with this dataset; the dataset is much smaller after the opinionated cleaning, mostly because of duplicate records at the end of the preprocessing. The concerns are:
- missing values
- features where the documentation does not match the content of the actual dataset
- extremely skewed features which would certainly bias training
- the inclusion of features without any variance
- strange encoded values like python set() as string rather than {}
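A minimal sketch of the kind of "opinionated" cleaning listed above (not the exact pipeline): drop zero-variance features, rows with missing values and duplicate records, then write parquet. File names are illustrative; writing parquet requires pyarrow or fastparquet.

```python
import pandas as pd

df = pd.read_csv("dns_exfiltration.csv")                 # hypothetical input file

constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)                      # features without any variance
df = df.dropna()                                         # rows with missing values
df = df.drop_duplicates()                                # duplicate records

df.to_parquet("dns_exfiltration_clean.parquet", index=False)
```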
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This data set was scraped using Python from http://books.toscrape.com/, which is a fictional book store. It contains 1000 books with different categories, star ratings and prices. This data set can be used by anyone who wants to practice data cleaning and simple data manipulation.
The code I used to scrape this data can be found on my GitHub: https://github.com/Sbonelondhlazi/dummybooks
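A minimal scraping sketch (not the author's script) using requests and BeautifulSoup; the CSS selectors are assumptions based on the markup of http://books.toscrape.com/ and only cover the first page.

```python
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

books = []
for item in soup.select("article.product_pod"):
    books.append({
        "title": item.h3.a["title"],
        "price": item.select_one("p.price_color").get_text(strip=True),
        "rating": item.select_one("p.star-rating")["class"][-1],  # e.g. "Three"
    })

print(books[:3])
```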
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains 1 Excel workbook (.xlsx) with 2 sheets.
This data can be used to practice EDA and some data cleaning tasks. It can also be used for data visualization with the Python Matplotlib and Seaborn libraries.
I also used this dataset for a Power BI project and created a dashboard on it. I used Python inside Power Query to clean and convert some encoded and Unicode characters in the App URL, Name, and Description columns.
Total Columns: 16
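The Unicode/encoded-character cleanup mentioned above can also be done directly in pandas; a minimal sketch (not the Power Query code), with the file name, sheet and column names assumed from the description.

```python
import html
import unicodedata
from urllib.parse import unquote

import pandas as pd

def clean_text(value):
    if pd.isna(value):
        return value
    value = html.unescape(unquote(str(value)))            # decode %xx and &amp;-style escapes
    return unicodedata.normalize("NFKC", value).strip()   # normalize Unicode forms

apps = pd.read_excel("apps.xlsx", sheet_name=0)           # hypothetical file name
for col in ["App URL", "Name", "Description"]:
    if col in apps.columns:
        apps[col] = apps[col].map(clean_text)
```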
This dataset was created by Martin Kanju
Released under Other (specified in description)
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This dataset was crawled from the Holidify website. You can explore what makes a great place to visit, along with its rating and the best time to visit, and you can also use the place descriptions to practice your text preprocessing skills.
The dataset consists of 6 CSV files:
- bhutan.csv
- india.csv
- indonesia.csv
- singapore.csv
- thailand.csv
- vietnam.csv
Each CSV file has 4 columns:
- Place Name
- Rating
- About Place
- Best Time To Visit
Note: singapore.csv does not have the "Best Time To Visit" column.
I scraped the data using Python with the BeautifulSoup and Requests libraries, did some text processing/cleaning, and stored the data in CSV format.
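A minimal sketch (not the original scraping code) that loads the six CSV files into one DataFrame with an added "Country" column; since singapore.csv lacks "Best Time To Visit", pandas.concat simply fills that column with NaN for those rows.

```python
import pandas as pd

files = ["bhutan.csv", "india.csv", "indonesia.csv",
         "singapore.csv", "thailand.csv", "vietnam.csv"]

frames = []
for path in files:
    df = pd.read_csv(path)
    df["Country"] = path.replace(".csv", "").title()   # derive country from file name
    frames.append(df)

places = pd.concat(frames, ignore_index=True, sort=False)
print(places[["Country", "Place Name", "Rating"]].head())
```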
Portfolio_Adidas_Dataset
A set of real-world data analysis tasks completed using the Python Pandas and Matplotlib libraries.
Background Information: In this portfolio, we use Python Pandas & Python Matplotlib to analyze and answer business questions about 5 products' worth of sales data. The data contains hundreds of thousands of footwear store purchases broken down by product type, cost, region, state, city, and so on.
We start by cleaning our data. Tasks during this section include:
Once we have cleaned up our data a bit, we move to the data exploration section. In this section we explore 5 high-level business questions related to our data:
To answer these questions we walk through many different openpyxl, pandas, and matplotlib methods. They include:
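The specific tasks, questions and methods are not listed in this card, so the following is only an illustrative sketch of the clean-then-explore workflow described above; the file name and the "Region" and "Total Sales" columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("adidas_sales.csv")                   # hypothetical file name
sales = sales.dropna(how="all").drop_duplicates()         # basic cleaning

# One example of a high-level business question: total sales by region.
by_region = sales.groupby("Region")["Total Sales"].sum().sort_values()
by_region.plot(kind="barh", title="Total sales by region")
plt.tight_layout()
plt.show()
```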
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This data is part of a dataset published on GitHub in this link
File Name: ifood_df.csv
The data covers the results of 5 marketing campaigns run by a food company and how each customer interacted with those campaigns, in addition to demographic data about the customers such as income, age, education level, marital status, number of children and teenagers, and other customer-related data.
I downloaded it to explore, clean, and transform it with Microsoft Excel, then visualize and analyze it with Python. First of all, in the Exploration Phase: understand the columns and the relationships between them, and define the important questions that will lead to recommendations about the marketing campaigns.
In the Cleaning Phase: delete the columns "Z_CostContact" and "Z_Revenue" because they contain a fixed number and are not important for my questions. Delete the column "Response" because it is not used in my analysis and I could not find out what it stands for. Then check for missing data; I found that the data is complete. After that, check for duplicates and find that every row is unique. Also check for accuracy; the data has correct and logical values. The only thing you should know is that the data is not current: it covers more than 2000 customers from 2020.
Overall, in the cleaning process: the data is accurate, complete, consistent, relevant, valid, and unique, but needs some transformation.
In the Transformation Phase: add a column for "Index" to create a unique identifier for each customer. Aggregate all marital-status columns into one column. Aggregate all education-level columns into one column. Rearrange some columns, such as the campaigns and totals.
Index: unique identifier for each customer.
Income: the customer's yearly income.
Kidhome: number of small children in the customer's household.
Teenhome: number of teenagers in the customer's household.
Recency: number of days since the last purchase.
MntWines: amount of wine purchased in the last 2 years.
MntFruits: amount of fruit purchased in the last 2 years.
MntMeatProducts: amount of meat purchased in the last 2 years.
MntFishProducts: amount of fish purchased in the last 2 years.
MntSweetProducts: amount of sweets purchased in the last 2 years.
MntRegularProds: amount of regular products purchased in the last 2 years.
MntGoldProds: amount of special products purchased in the last 2 years.
MntTotal: total amount of everything purchased in the last 2 years.
NumDealsPurchases: number of purchases made with a discount.
NumWebPurchases: number of purchases made through the company's website.
NumCatalogPurchases: number of purchases made using a catalog.
NumStorePurchases: number of purchases made directly in the store.
NumWebVisitsMonth: number of visits to the company's website in the last month.
AcceptedCmp1: 1 if the customer accepted the offer in the first campaign, 0 otherwise.
AcceptedCmp2: 1 if the customer accepted the offer in the second campaign, 0 otherwise.
AcceptedCmp3: 1 if the customer accepted the offer in the third campaign, 0 otherwise.
AcceptedCmp4: 1 if the customer accepted the offer in the fourth campaign, 0 otherwise.
AcceptedCmp5: 1 if the customer accepted the offer in the fifth campaign, 0 otherwise.
AcceptedCmpOverall: total number of marketing campaigns the customer accepted.
Complain: whether the customer complained in the last 2 years or not.
Age: the customer's age.
Customer_Days: days since registration.
marital_status: the customer's marital status.
education: the customer's level of education.
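A minimal pandas sketch (not the Excel steps) of the transformation described above: add an Index column and collapse one-hot marital/education dummy columns into the single marital_status and education columns listed above. The "marital_" / "education_" prefixes are an assumption about how the dummy columns are named in ifood_df.csv.

```python
import pandas as pd

df = pd.read_csv("ifood_df.csv")
df.insert(0, "Index", range(len(df)))                      # unique identifier

# Collapse assumed one-hot columns (e.g. "marital_Single") into one column each.
for prefix, target in [("marital_", "marital_status"), ("education_", "education")]:
    dummies = [c for c in df.columns if c.startswith(prefix)]
    if dummies:
        df[target] = df[dummies].idxmax(axis=1).str[len(prefix):]
        df = df.drop(columns=dummies)
```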
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
A simple yet challenging project: predict the housing price based on factors like house area, number of bedrooms, furnishing, nearness to a main road, etc. The dataset is small, yet its complexity arises from the fact that it has strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
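A minimal sketch for quantifying the multicollinearity mentioned above with variance inflation factors (VIF); the file name "Housing.csv" and the target column "price" are assumptions about this dataset.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

housing = pd.read_csv("Housing.csv")                       # assumed file name
X = pd.get_dummies(housing.drop(columns=["price"]), drop_first=True).astype(float)
X = sm.add_constant(X)                                     # constant term for a fair VIF

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns, name="VIF").drop("const")
print(vif.sort_values(ascending=False))                    # large values flag collinear features
```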