Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing customers' characteristics and giving early warning of customer churn with machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save considerable operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Two classic ensemble learning models, Random Forest (RF) and AdaBoost, were introduced, and an AdaBoost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM)), were used to analyze the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score and other indicators, and that the RF-AdaBoost dual-ensemble model performs best. The recall rates of the BPNN, RF, AdaBoost and RF-AdaBoost dual-ensemble models on positive samples are 79%, 90%, 89% and 93% respectively, the precision rates are 97%, 99%, 98% and 99%, and the F1 scores are 87%, 95%, 94% and 96%. The three indicators of the RF-AdaBoost dual-ensemble model are 10%, 1% and 6% higher than the reference. The prediction results provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
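As an illustration only (not the authors' code), a minimal sketch of the RF-AdaBoost dual-ensemble idea with scikit-learn: an AdaBoost classifier whose base learner is a Random Forest, applied after feature standardization. The file name, label column and parameters are hypothetical.

```python
# Illustrative sketch only: AdaBoost with a Random Forest base learner
# after feature standardization. File name and "churn" column are assumed.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("telecom_churn.csv")            # hypothetical file name
X, y = df.drop(columns=["churn"]), df["churn"]   # "churn" label column assumed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(
        # scikit-learn >= 1.2 uses `estimator`; older versions use `base_estimator`
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        n_estimators=10,
        random_state=0,
    ),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```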
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
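A minimal sketch for loading one of the ".csv" files with pandas: the files are tab-separated with no header row, so the column names from the table above are supplied manually. The file name is just an example.

```python
import pandas as pd

columns = ["isogramy", "length", "word", "source_pos", "count", "vol_count",
           "count_per_million", "vol_count_as_percent",
           "is_palindrome", "is_tautonym"]

# Tab-separated, no header row; names follow the column table above.
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)
print(df.head())
```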
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format
The SQLite database, on the other hand, combines the data from all four of the plain text files and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data-cleaning step. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database
The output data from the above step can easily be collated into a SQLite3 database, which allows the data to be queried directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
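For querying the finished database directly from Python (rather than R), a minimal sketch using the standard sqlite3 module. The actual table names are defined by create-database.sql, so they are listed first; the commented SELECT uses a hypothetical table name.

```python
import sqlite3

con = sqlite3.connect("isograms.db")
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("Tables in the database:", tables)

# Example query against one of the listed tables (replace 'some_table'):
# con.execute("SELECT word, length, count FROM some_table "
#             "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10")
con.close()
```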
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cotton is one of the important economic crops in China and one of the most important textile raw materials in the world. During the cotton planting process, various cotton diseases and pests have a significant impact on cotton yield. Constructing a knowledge graph of cotton diseases and pests, which digitizes their names, symptoms and prevention methods and stores them in graph form, is of great significance for precise and rapid prevention and control. This dataset was constructed for building such a knowledge graph. Based on books and websites related to cotton disease and pest control, unstructured data was collected through OCR and Python web crawling. After cleaning and merging the data, records for 30 common cotton diseases and 49 cotton pests were obtained. The dataset can be used to construct a knowledge graph of cotton diseases and pests, providing data support for the informatization and intelligent development of China's cotton planting industry.
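As a purely hypothetical illustration of what "storing names, symptoms and prevention methods in graph form" can mean, one disease record could be expressed as knowledge-graph triples; the entity and relation names below are illustrative, not taken from the dataset.

```python
# Hypothetical (entity, relation, value) triples for one disease record.
triples = [
    ("cotton Fusarium wilt", "is_a", "cotton disease"),
    ("cotton Fusarium wilt", "symptom", "yellowing and wilting of leaves"),
    ("cotton Fusarium wilt", "prevention_method", "plant resistant varieties and rotate crops"),
]

# Grouping triples by entity gives the per-disease view that a graph store
# (e.g. Neo4j or an RDF triple store) would be populated from.
graph = {}
for subject, relation, value in triples:
    graph.setdefault(subject, {}).setdefault(relation, []).append(value)
print(graph)
```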
Environmental surveillance, emergency response, and smart city planning all require the use of geospatial data, which includes satellite imagery, cartographic records, and real-time GPS coordinates. The high sensitivity and value of location-specific information make it unsafe to store and transmit it through conventional, centralized means, which can result in privacy breaches, unauthorized manipulations, and potential misuse. To address these challenges, this paper aims to design and implement a secure, blockchain-based framework that blends AES (Advanced Encryption Standard) encryption with RSA (Rivest–Shamir–Adleman) key management. The aim is to guarantee strong data confidentiality by using symmetric encryption, and to use public-key cryptography for granular access control and secure key distribution. The proposed system uses Ethereum smart contracts to connect encrypted data references to a decentralized ledger, ensuring tamper resistance and auditability. In the proposed system, a Python-based FastAPI backend is responsible for data ingestion, cleaning, encryption, and blockchain interaction, while a React frontend lets users upload datasets, generate encryption keys, and retrieve access permissions. Modular microservices and well-defined APIs can seamlessly integrate various components, such as data processing scripts and on-chain contract logic, during development. The system's scalability is demonstrated by evaluating its performance against various dataset sizes, using metrics such as encryption overhead, blockchain transaction costs, and smart contract execution times. The practical usability of the system in actual scenarios is demonstrated through user acceptance testing, which is crucial for adoption in resource-limited environments. The results show that the proposed crypto-enhanced blockchain framework can significantly enhance geospatial data security while still maintaining operational efficiency. Integration with zero-knowledge proofs may be explored in future work to enhance privacy, mitigate energy costs through alternative consensus algorithms, and enhance resilience in multi-network ecosystems through cross-chain interoperability.
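A minimal sketch, under assumptions and not the paper's implementation, of how the described AES/RSA hybrid could look with the Python cryptography package: the geospatial payload is encrypted with AES-GCM, the AES key is wrapped with RSA-OAEP, and only a reference to the ciphertext would then be anchored on-chain.

```python
# Hybrid AES + RSA sketch; requires the "cryptography" package.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA key pair of an authorized data consumer (managed by the key-distribution
# component in the real system; generated in place here for illustration).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

payload = b'{"lat": 52.52, "lon": 13.405}'      # illustrative geospatial record

# 1. Encrypt the data itself with a fresh symmetric key (AES-256-GCM).
aes_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(aes_key).encrypt(nonce, payload, None)

# 2. Wrap the AES key with the consumer's RSA public key (OAEP padding).
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(aes_key, oaep)

# 3. Decryption reverses both steps with the RSA private key.
recovered_key = private_key.decrypt(wrapped_key, oaep)
assert AESGCM(recovered_key).decrypt(nonce, ciphertext, None) == payload
```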
Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Analyses for this dataset could include time series, clustering, classification and more.
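As one possible starting point, a minimal pandas sketch that loads the UCI workbook and builds a daily revenue time series; the column names (InvoiceDate, Quantity, UnitPrice) follow the UCI file but treat them as assumptions if your copy differs.

```python
import pandas as pd

retail = pd.read_excel("Online Retail.xlsx")
retail["InvoiceDate"] = pd.to_datetime(retail["InvoiceDate"])
retail["Revenue"] = retail["Quantity"] * retail["UnitPrice"]

# Aggregate revenue per day for a simple time-series view.
daily_revenue = (retail.set_index("InvoiceDate")["Revenue"]
                       .resample("D").sum())
print(daily_revenue.head())
```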
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Using a self-built crawler tool (implemented in Python), the store reviews and comment information for different kinds of meals in Guangzhou were collected from Dianping.com's gourmet category. After deduplicating, denoising and cleaning the data, some of the original fields were preprocessed to generate a set of derived variables for subsequent research. Specific operations can be seen in the data documentation. The dataset contains a total of 3124 restaurants, classified by catering type: 722 Cantonese cuisine, 572 porridge and noodles, 566 Sichuan cuisine, 595 Japanese cuisine, and 669 western food. Data collection time is November 2017, and the data format is CSV.
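A minimal sketch (file and column names are hypothetical) for checking the per-cuisine counts reported above once the CSV is loaded with pandas.

```python
import pandas as pd

restaurants = pd.read_csv("guangzhou_restaurants.csv")   # hypothetical file name
print(restaurants["cuisine_type"].value_counts())        # hypothetical column name
```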
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is an academic network traffic classification dataset, designed for the recognition of DNS exfiltration.
All the credit goes to the original authors: Dr. Samaneh Mahdavifar, Amgad Hanafy Salem BSc., Princy Victor MSc., Dr. Miguel Garzon, Dr. Amir H. Razavi, Natasha Hellberg and Dr. Arash Habibi Lashkari. Please cite their original paper.
V1: Base dataset in CSV format as downloaded from here
V2: Aggregation per broad category (BenignExtra, HeavyAttack, LightAttack)
V3: Opinionated cleaning -> parquet; Minimal cleaning -> parquet
V4: Reorganize to save storage, only keep original CSVs in V1/V2
Important: There are two versions of the cleaning process. The minimalistic version stays as true to the data-as-provided as possible. The opinionated cleaning addresses several concerns I have with this dataset; the dataset is much smaller after the opinionated cleaning, mostly because of duplicate records at the end of the preprocessing. The concerns are:
- missing values
- features where the documentation does not match the content of the actual dataset
- extremely skewed features which would certainly bias training
- the inclusion of features without any variance
- strange encoded values like python set() as string rather than {}
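A minimal sketch of the kind of "opinionated" cleaning listed above (not the exact pipeline): drop zero-variance features, rows with missing values and duplicate records, then write parquet. File names are illustrative; writing parquet requires pyarrow or fastparquet.

```python
import pandas as pd

df = pd.read_csv("dns_exfiltration.csv")                 # hypothetical input file

constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)                      # features without any variance
df = df.dropna()                                         # rows with missing values
df = df.drop_duplicates()                                # duplicate records

df.to_parquet("dns_exfiltration_clean.parquet", index=False)
```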
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This data set was scraped using Python from http://books.toscrape.com/, which is a fictional book store. It contains 1000 books with different categories, star ratings and prices. This data set can be used by anyone who wants to practice data cleaning and simple data manipulation.
The code I used to scrape this data can be found on my GitHub: https://github.com/Sbonelondhlazi/dummybooks
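A minimal scraping sketch (not the author's script) using requests and BeautifulSoup; the CSS selectors are assumptions based on the markup of http://books.toscrape.com/ and only cover the first page.

```python
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

books = []
for item in soup.select("article.product_pod"):
    books.append({
        "title": item.h3.a["title"],
        "price": item.select_one("p.price_color").get_text(strip=True),
        "rating": item.select_one("p.star-rating")["class"][-1],  # e.g. "Three"
    })

print(books[:3])
```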
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains 1 Excel workbook (.xlsx) with 2 sheets.
This data can be used to practice EDA and some data cleaning tasks. It can also be used for data visualization with the Python Matplotlib and Seaborn libraries.
I also used this dataset for a Power BI project and created a dashboard on it. I used Python inside Power Query to clean and convert some encoded and Unicode characters in the App URL, Name, and Description columns.
Total Columns: 16
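The Unicode/encoded-character cleanup mentioned above can also be done directly in pandas; a minimal sketch (not the Power Query code), with the file name, sheet and column names assumed from the description.

```python
import html
import unicodedata
from urllib.parse import unquote

import pandas as pd

def clean_text(value):
    if pd.isna(value):
        return value
    value = html.unescape(unquote(str(value)))            # decode %xx and &amp;-style escapes
    return unicodedata.normalize("NFKC", value).strip()   # normalize Unicode forms

apps = pd.read_excel("apps.xlsx", sheet_name=0)           # hypothetical file name
for col in ["App URL", "Name", "Description"]:
    if col in apps.columns:
        apps[col] = apps[col].map(clean_text)
```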
This dataset was created by Martin Kanju
Released under Other (specified in description)
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This dataset was crawled from the Holidify website. You can explore what makes a great place to visit, along with its rating and the best time to visit, and you can also use the place descriptions to practice your text preprocessing skills.
The dataset consists of 6 CSV files:
- bhutan.csv
- india.csv
- indonesia.csv
- singapore.csv
- thailand.csv
- vietnam.csv
Each CSV file has 4 columns:
- Place Name
- Rating
- About Place
- Best Time To Visit
Note: singapore.csv does not have the "Best Time To Visit" column.
I scraped the data using Python with the BeautifulSoup and Requests libraries, did some text processing/cleaning, and stored the data in CSV format.
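A minimal sketch (not the original scraping code) that loads the six CSV files into one DataFrame with an added "Country" column; since singapore.csv lacks "Best Time To Visit", pandas.concat simply fills that column with NaN for those rows.

```python
import pandas as pd

files = ["bhutan.csv", "india.csv", "indonesia.csv",
         "singapore.csv", "thailand.csv", "vietnam.csv"]

frames = []
for path in files:
    df = pd.read_csv(path)
    df["Country"] = path.replace(".csv", "").title()   # derive country from file name
    frames.append(df)

places = pd.concat(frames, ignore_index=True, sort=False)
print(places[["Country", "Place Name", "Rating"]].head())
```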
Portfolio_Adidas_Dataset
A set of real-world data analysis tasks completed using the Python Pandas and Matplotlib libraries.
Background Information: In this portfolio, we use Python Pandas & Python Matplotlib to analyze and answer business questions about 5 products' worth of sales data. The data contains hundreds of thousands of footwear store purchases broken down by product type, cost, region, state, city, and so on.
We start by cleaning our data. Tasks during this section include:
Once we have cleaned up our data a bit, we move to the data exploration section. In this section we explore 5 high-level business questions related to our data:
To answer these questions we walk through many different openpyxl, pandas, and matplotlib methods. They include:
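The specific tasks, questions and methods are not listed in this card, so the following is only an illustrative sketch of the clean-then-explore workflow described above; the file name and the "Region" and "Total Sales" columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("adidas_sales.csv")                   # hypothetical file name
sales = sales.dropna(how="all").drop_duplicates()         # basic cleaning

# One example of a high-level business question: total sales by region.
by_region = sales.groupby("Region")["Total Sales"].sum().sort_values()
by_region.plot(kind="barh", title="Total sales by region")
plt.tight_layout()
plt.show()
```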
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
This data is part of a dataset published on GitHub in this link
File Name: ifood_df.csv
The data covers the results of 5 marketing campaigns run by a food company and how each customer interacted with those campaigns, in addition to demographic data about the customers such as income, age, education level, marital status, number of children and teenagers, and other customer-related data.
I downloaded it to explore, clean, and transform it with Microsoft Excel, then visualize and analyze it with Python. First of all, in the Exploration Phase: understand the columns and the relationships between them, and define the important questions that will lead to recommendations about the marketing campaigns.
In the Cleaning Phase: delete the columns "Z_CostContact" and "Z_Revenue" because they contain a fixed number and are not important for my questions. Delete the column "Response" because it is not used in my analysis and I could not find out what it stands for. Then check for missing data; I found that the data is complete. After that, check for duplicates and find that every row is unique. Also check for accuracy; the data has correct and logical values. The only thing you should know is that the data is not current: it covers more than 2000 customers from 2020.
Overall, in the cleaning process: the data is accurate, complete, consistent, relevant, valid, and unique, but needs some transformation.
In the Transformation Phase: add a column for "Index" to create a unique identifier for each customer. Aggregate all marital-status columns into one column. Aggregate all education-level columns into one column. Rearrange some columns, such as the campaigns and totals.
Index: unique identifier for each customer.
Income: the customer's yearly income.
Kidhome: number of small children in the customer's household.
Teenhome: number of teenagers in the customer's household.
Recency: number of days since the last purchase.
MntWines: amount of wine purchased in the last 2 years.
MntFruits: amount of fruit purchased in the last 2 years.
MntMeatProducts: amount of meat purchased in the last 2 years.
MntFishProducts: amount of fish purchased in the last 2 years.
MntSweetProducts: amount of sweets purchased in the last 2 years.
MntRegularProds: amount of regular products purchased in the last 2 years.
MntGoldProds: amount of special products purchased in the last 2 years.
MntTotal: total amount of everything purchased in the last 2 years.
NumDealsPurchases: number of purchases made with a discount.
NumWebPurchases: number of purchases made through the company's website.
NumCatalogPurchases: number of purchases made using a catalog.
NumStorePurchases: number of purchases made directly in the store.
NumWebVisitsMonth: number of visits to the company's website in the last month.
AcceptedCmp1: 1 if the customer accepted the offer in the first campaign, 0 otherwise.
AcceptedCmp2: 1 if the customer accepted the offer in the second campaign, 0 otherwise.
AcceptedCmp3: 1 if the customer accepted the offer in the third campaign, 0 otherwise.
AcceptedCmp4: 1 if the customer accepted the offer in the fourth campaign, 0 otherwise.
AcceptedCmp5: 1 if the customer accepted the offer in the fifth campaign, 0 otherwise.
AcceptedCmpOverall: total number of marketing campaigns the customer accepted.
Complain: whether the customer complained in the last 2 years or not.
Age: the customer's age.
Customer_Days: days since registration.
marital_status: the customer's marital status.
education: the customer's level of education.
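A minimal pandas sketch (not the Excel steps) of the transformation described above: add an Index column and collapse one-hot marital/education dummy columns into the single marital_status and education columns listed above. The "marital_" / "education_" prefixes are an assumption about how the dummy columns are named in ifood_df.csv.

```python
import pandas as pd

df = pd.read_csv("ifood_df.csv")
df.insert(0, "Index", range(len(df)))                      # unique identifier

# Collapse assumed one-hot columns (e.g. "marital_Single") into one column each.
for prefix, target in [("marital_", "marital_status"), ("education_", "education")]:
    dummies = [c for c in df.columns if c.startswith(prefix)]
    if dummies:
        df[target] = df[dummies].idxmax(axis=1).str[len(prefix):]
        df = df.drop(columns=dummies)
```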
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
A simple yet challenging project: predict the housing price based on factors like house area, number of bedrooms, furnishing, nearness to a main road, etc. The dataset is small, yet its complexity arises from the fact that it has strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
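A minimal sketch for quantifying the multicollinearity mentioned above with variance inflation factors (VIF); the file name "Housing.csv" and the target column "price" are assumptions about this dataset.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

housing = pd.read_csv("Housing.csv")                       # assumed file name
X = pd.get_dummies(housing.drop(columns=["price"]), drop_first=True).astype(float)
X = sm.add_constant(X)                                     # constant term for a fair VIF

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns, name="VIF").drop("const")
print(vif.sort_values(ascending=False))                    # large values flag collinear features
```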