15 datasets found
  1. S1 Data -

    • plos.figshare.com
    zip
    Updated Oct 11, 2023
    Cite
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
    Available download formats: zip
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing customers’ characteristics and providing early warning of customer churn with machine learning algorithms can help enterprises deliver targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a dataset of 900,000 telecom customers’ personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)), were applied to the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score and other indicators, with the RF-Adaboost dual-ensemble model performing best. The recall rates of the BPNN, RF, Adaboost and RF-Adaboost dual-ensemble models on positive samples are 79%, 90%, 89% and 93% respectively; the precision rates are 97%, 99%, 98% and 99%; and the F1 scores are 87%, 95%, 94% and 96%. The RF-Adaboost dual-ensemble model’s three indicators are 10%, 1%, and 6% higher than the reference. The churn predictions provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
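
    As an illustration of the dual-ensemble idea, here is a minimal scikit-learn (>= 1.2) sketch of AdaBoost with a Random Forest base learner on synthetic imbalanced data; it is not the authors' exact setup, parameters, or dataset.

      # A sketch, not the authors' configuration.
      from sklearn.datasets import make_classification
      from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
      from sklearn.metrics import classification_report
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

      model = AdaBoostClassifier(
          estimator=RandomForestClassifier(n_estimators=50, random_state=0),
          n_estimators=10,
          random_state=0,
      )
      model.fit(X_tr, y_tr)
      # Recall, precision and F1 per class, the indicators cited in the abstract.
      print(classification_report(y_te, model.predict(X_te)))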

  2. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    Label                 Data type  Description
    isogramy              int        The order of isogramy, e.g. "2" is a second-order isogram
    length                int        The length of the word in letters
    word                  text       The actual word/isogram in ASCII
    source_pos            text       The Part of Speech tag from the original corpus
    count                 int        Token count (total number of occurrences)
    vol_count             int        Volume count (number of different sources which contain the word)
    count_per_million     int        Token count per million words
    vol_count_as_percent  int        Volume count as percentage of the total number of volumes
    is_palindrome         bool       Whether the word is a palindrome (1) or not (0)
    is_tautonym           bool       Whether the word is a tautonym (1) or not (0)
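
    For convenience, a minimal pandas sketch for loading one of the ".csv" files, using the column labels above (tab-separated, no header row; the file name follows the naming used in section 2.4 below):

      import pandas as pd

      cols = ["isogramy", "length", "word", "source_pos", "count", "vol_count",
              "count_per_million", "vol_count_as_percent", "is_palindrome",
              "is_tautonym"]
      # Tab-separated, no header row, one data point per line.
      df = pd.read_csv("ngrams-isograms.csv", sep="\t", names=cols)
      print(df[df["is_palindrome"] == 1].head())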

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label               Data type  Description
    !total_1grams       int        The total number of words in the corpus
    !total_volumes      int        The total number of volumes (individual sources) in the corpus
    !total_isograms     int        The total number of isograms found in the corpus (before compacting)
    !total_palindromes  int        How many of the isograms found are palindromes
    !total_tautonyms    int        How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset. The intersected dataset is by far the least noisy, but is missing some real isograms, too.

    The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data-cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db". See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
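
    A hedged sketch of querying the resulting database from Python: the database file name comes from step 2.4 above, while the table name "ngrams" is an assumption (the real table names are defined in "create-database.sql").

      import sqlite3

      con = sqlite3.connect("isograms.db")
      # "ngrams" is a hypothetical table name; columns follow the layout in
      # section 1.1 above.
      query = ("SELECT word, length, count FROM ngrams "
               "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10")
      for row in con.execute(query):
          print(row)
      con.close()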

  3. Construction of a Dataset for Knowledge Atlas of Cotton Diseases and Pests

    • scidb.cn
    Updated Sep 26, 2023
    + more versions
    Cite
    Li Dongya; Wang Zhenlu; Dai Shuo; Chen Zhen; Bai Tao; Sun Wei (2023). Construction of a Dataset for Knowledge Atlas of Cotton Diseases and Pests [Dataset]. http://doi.org/10.57760/sciencedb.11412
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Li Dongya; Wang Zhenlu; Dai Shuo; Chen Zhen; Bai Tao; Sun Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cotton is one of the most important economic crops in China and one of the most important textile raw materials in the world. During the cotton planting process, various diseases and pests have a significant impact on cotton yield. Constructing a knowledge graph of cotton diseases and pests, which digitizes their names, symptoms, and prevention methods and stores them in graph form, is of great significance for precise and rapid prevention and control. A dataset was constructed for such a knowledge graph. Starting from books and websites related to cotton disease and pest control, unstructured data was collected through OCR and Python web crawling. After cleaning and merging the data, records for 30 common cotton diseases and 49 cotton pests were finally obtained. This dataset can be used to construct a knowledge graph of cotton diseases and pests, providing data support for the development of informatization and intelligence in China's cotton planting industry.

  4. An approach that utilizes blockchain to effectively and securely preserve...

    • transfer.hft-stuttgart.de
    Updated Sep 15, 2025
    Cite
    (2025). An approach that utilizes blockchain to effectively and securely preserve data privacy for location data from IoT in smart cities - Dataset - Data Catalog [Dataset]. https://transfer.hft-stuttgart.de/katalog/dataset/group-datensicherheit
    Dataset updated
    Sep 15, 2025
    Description

    Environmental surveillance, emergency response, and smart city planning all require the use of geospatial data, which includes satellite imagery, cartographic records, and real-time GPS coordinates. The high sensitivity and value of location-specific information make it unsafe to store and transmit it through conventional, centralized means, which can result in privacy breaches, unauthorized manipulations, and potential misuse. This paper aims to design and implement a secure, blockchain-based framework that blends AES (Advanced Encryption Standard) and RSA (Rivest–Shamir–Adleman) key management, which addresses these challenges. The aim is to guarantee strong data confidentiality by using symmetric encryption, and to use public-key cryptography for granular access control and secure key distribution. The proposed system uses Ethereum smart contracts to connect encrypted data references to a decentralized ledger, ensuring tamper resistance and auditability. In the proposed system, a Python-based FastAPI backend is responsible for data ingestion, cleaning, encryption, and blockchain interaction, while a React frontend can upload datasets, generate encryption keys, and retrieve access permissions. Modular microservices and well-defined APIs can seamlessly integrate various components, such as data processing scripts and on-chain contract logic, during development. The system's scalability is demonstrated by evaluating its performance against various dataset sizes, which involves metrics such as encryption overhead, blockchain transaction costs, and smart contract execution times. The practical usability of the system in actual scenarios is demonstrated through user acceptance testing, which is crucial for adoption in resource-limited environments. The results show the proposed crypto-enhanced blockchain framework can significantly enhance geospatial data security while still maintaining operational efficiency. Integration with zero-knowledge proofs may be explored in future work to enhance privacy, mitigate energy costs through alternative consensus algorithms, and enhance resilience in multi-network ecosystems through cross-chain interoperability.
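
    For concreteness, a minimal Python sketch of the AES + RSA hybrid pattern the paper describes (symmetric encryption of the payload, public-key wrapping of the key), using the "cryptography" package; the payload, key sizes and parameter choices are illustrative assumptions, and the Ethereum/FastAPI layers are omitted.

      import os
      from cryptography.hazmat.primitives import hashes
      from cryptography.hazmat.primitives.asymmetric import padding, rsa
      from cryptography.hazmat.primitives.ciphers.aead import AESGCM

      # Recipient's RSA key pair (2048-bit is an illustrative choice).
      priv = rsa.generate_private_key(public_exponent=65537, key_size=2048)
      oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                          algorithm=hashes.SHA256(), label=None)

      # Symmetric encryption of a made-up location record with AES-256-GCM.
      key = AESGCM.generate_key(bit_length=256)
      nonce = os.urandom(12)
      record = b'{"lat": 48.78, "lon": 9.18, "ts": "2025-09-15T12:00:00Z"}'
      ciphertext = AESGCM(key).encrypt(nonce, record, None)

      # Wrap the AES key with the recipient's public key for distribution;
      # on-chain, one would store only a reference/hash of the ciphertext.
      wrapped_key = priv.public_key().encrypt(key, oaep)

      # Recipient side: unwrap the key, then decrypt the payload.
      unwrapped = priv.decrypt(wrapped_key, oaep)
      assert AESGCM(unwrapped).decrypt(nonce, ciphertext, None) == record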

  5. E-Commerce Data

    • kaggle.com
    zip
    Updated Aug 17, 2017
    Cite
    Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
    Available download formats: zip (7548686 bytes)
    Dataset updated
    Aug 17, 2017
    Authors
    Carrie
    Description

    Context

    Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made this dataset, containing actual transactions from 2010 and 2011, publicly available. The dataset is maintained on their site, where it can be found under the title "Online Retail".

    Content

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

    Acknowledgements

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Image from stocksnap.io.

    Inspiration

    Analyses for this dataset could include time series, clustering, classification and more.
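
    For example, a small pandas sketch of the time-series idea: monthly revenue from the transactions. Column names follow the UCI "Online Retail" schema; the file name and Latin-1 encoding are assumptions about this Kaggle copy.

      import pandas as pd

      df = pd.read_csv("data.csv", encoding="ISO-8859-1")  # assumed file name/encoding
      df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
      df["Revenue"] = df["Quantity"] * df["UnitPrice"]
      # Monthly revenue series: a natural first look for time-series analysis.
      monthly = df.groupby(df["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()
      print(monthly)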

  6. Confusion matrix.

    • plos.figshare.com
    xls
    Updated Oct 11, 2023
    + more versions
    Cite
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t005
    Available download formats: xls
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing customers’ characteristics and providing early warning of customer churn with machine learning algorithms can help enterprises deliver targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a dataset of 900,000 telecom customers’ personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)), were applied to the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score and other indicators, with the RF-Adaboost dual-ensemble model performing best. The recall rates of the BPNN, RF, Adaboost and RF-Adaboost dual-ensemble models on positive samples are 79%, 90%, 89% and 93% respectively; the precision rates are 97%, 99%, 98% and 99%; and the F1 scores are 87%, 95%, 94% and 96%. The RF-Adaboost dual-ensemble model’s three indicators are 10%, 1%, and 6% higher than the reference. The churn predictions provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.

  7. Online word-of-mouth restaurant data on Dianping.com

    • opendata.pku.edu.cn
    docx, xls
    Updated Jun 7, 2018
    Cite
    Peking University Open Research Data Platform (2018). Online word-of-mouth restaurant data on Dianping.com [Dataset]. http://doi.org/10.18170/DVN/EB6KJ1
    Available download formats: xls(1030968), docx(23890)
    Dataset updated
    Jun 7, 2018
    Dataset provided by
    Peking University Open Research Data Platform
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Using a self-developed crawler tool (implemented in Python), store reviews and comment information for different kinds of meals in Guangzhou were collected from Dianping.com’s gourmet category. After deduplicating, denoising and cleaning the data, some of the original fields were preprocessed to generate a set of derived variables for subsequent research; the specific operations are described in the data documents. The dataset contains a total of 3,124 restaurant records, classified by catering type: 722 Cantonese cuisine, 572 porridge and noodles, 566 Sichuan cuisine, 595 Japanese cuisine, and 669 western food. The data was collected in November 2017, and the data format is CSV.

  8. CIC-Bell-DNS-EXF2021

    • kaggle.com
    zip
    Updated Aug 12, 2022
    Cite
    StrGenIx | Laurens D'hooge (2022). CIC-Bell-DNS-EXF2021 [Dataset]. https://www.kaggle.com/datasets/dhoogla/cicbelldnsexf2021/code
    Available download formats: zip (18458211 bytes)
    Dataset updated
    Aug 12, 2022
    Authors
    StrGenIx | Laurens D'hooge
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is an academic network traffic classification dataset, designed for the recognition of DNS exfiltration.

    All the credit goes to the original authors: Dr. Samaneh Mahdavifar, Amgad Hanafy Salem BSc., Princy Victor MSc., Dr. Miguel Garzon, Dr. Amir H. Razavi, Natasha Hellberg and Dr. Arash Habibi Lashkari. Please cite their original paper.

    V1: Base dataset in CSV format as downloaded from here
    V2: Aggregation per broad category (BenignExtra, HeavyAttack, LightAttack)
    V3: Opinionated cleaning -> parquet; minimal cleaning -> parquet
    V4: Reorganized to save storage, only keeping the original CSVs in V1/V2

    Important: there are 2 versions of the cleaning process. The minimalistic version stays as true to the data-as-provided as possible. The opinionated cleaning addresses several concerns I have with this dataset:
    • missing values
    • features where the documentation does not match the content of the actual dataset
    • extremely skewed features which would certainly bias training
    • the inclusion of features without any variance
    • strange encoded values, like Python set() as string rather than {}
    The size of the dataset is much smaller after the opinionated cleaning process, mostly because of duplicate records dropped at the end of the preprocessing.
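
    In the same spirit, a hedged pandas sketch of such a cleaning pass (not the uploader's actual script; the file names are assumptions):

      import pandas as pd

      df = pd.read_csv("cic-bell-dns-exf-2021.csv")  # assumed file name

      df = df.dropna(axis=1, how="all")             # drop fully missing features
      df = df.loc[:, df.nunique(dropna=False) > 1]  # drop features without variance
      before = len(df)
      df = df.drop_duplicates()                     # the main source of shrinkage
      print(f"dropped {before - len(df)} duplicate records")
      df.to_parquet("cic-bell-dns-exf-2021.parquet")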

  9. bookstore dataset

    • kaggle.com
    zip
    Updated Aug 16, 2022
    Cite
    Sbonelo Ndhlazi (2022). bookstore dataset [Dataset]. https://www.kaggle.com/sbonelondhlazi/bookstore-dataset
    Available download formats: zip (26376 bytes)
    Dataset updated
    Aug 16, 2022
    Authors
    Sbonelo Ndhlazi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data set was scraped using Python from http://books.toscrape.com/, which is a fictional book store. It contains 1000 books with different categories, star ratings and prices. This data set can be used by anyone who wants to practice data cleaning and simple data manipulation.

    The code I used to scrape this data can be found on my GitHub: https://github.com/Sbonelondhlazi/dummybooks
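
    For readers who want a starting point, a minimal sketch of this kind of scraper (not the original code from the repository above); the selectors reflect the public structure of http://books.toscrape.com/.

      import requests
      from bs4 import BeautifulSoup

      books, url = [], "http://books.toscrape.com/catalogue/page-1.html"
      while url:
          soup = BeautifulSoup(requests.get(url).text, "html.parser")
          for pod in soup.select("article.product_pod"):
              books.append({
                  "title": pod.h3.a["title"],
                  "price": pod.select_one("p.price_color").text,
                  # The rating is encoded as a CSS class, e.g. "star-rating Three".
                  "rating": pod.select_one("p.star-rating")["class"][1],
              })
          nxt = soup.select_one("li.next a")
          url = "http://books.toscrape.com/catalogue/" + nxt["href"] if nxt else None

      print(len(books))  # 1000 books across 50 pages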

  10. App Store Mobile Games 2008 - 2019

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Mayank Singh (2024). App Store Mobile Games 2008 - 2019 [Dataset]. https://www.kaggle.com/datasets/mayanksinghr/app-store-mobile-games-2008-2019/versions/1
    Available download formats: zip (10506423 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Mayank Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains one Excel workbook (.xlsx) with two sheets.

    • Sheet 1 - App Store Games contains the mobile games launched on the App Store from 2008 to 2019.
    • Sheet 2 - Data Dictionary explains the columns in the data.

    This data can be used to practice EDA and some data-cleaning tasks, as well as data visualization using the Python Matplotlib and Seaborn libraries.

    I also used this dataset for a Power BI project and created a dashboard on it, using Python inside Power Query to clean and convert some encoded and Unicode characters in the App URL, Name, and Description columns.
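
    As a rough illustration, a pandas/unicodedata sketch of that kind of character cleanup (the author did it inside Power Query; the file name here is an assumption, while the sheet and column names come from the description):

      import unicodedata
      import pandas as pd

      df = pd.read_excel("app_store_games.xlsx", sheet_name="App Store Games")

      def clean_text(value) -> str:
          # Normalize composed/decomposed Unicode forms, then drop
          # non-printable characters.
          text = unicodedata.normalize("NFKC", str(value))
          return "".join(ch for ch in text if ch.isprintable())

      for col in ["App URL", "Name", "Description"]:
          df[col] = df[col].map(clean_text)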

    Total Columns: 16

    • App URL
    • App ID
    • Name
    • Subtitle
    • Icon URL
    • Average User Rating
    • User Rating Count
    • Price per App (USD)
    • Description
    • Developer
    • Age Rating
    • Languages
    • Size in Bytes
    • Primary Genre
    • Genres
    • Release Date
  11. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated May 20, 2024
    Cite
    Martin Kanju (2024). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/martinkanju/dirty-dataset-to-practice-data-cleaning
    Available download formats: zip (1235 bytes)
    Dataset updated
    May 20, 2024
    Authors
    Martin Kanju
    Description

    Dataset

    This dataset was created by Martin Kanju

    Released under Other (specified in description)


  12. Holidify : Rating / Places Data

    • kaggle.com
    zip
    Updated Dec 14, 2020
    Cite
    Himanshu Tripathi (2020). Holidify : Rating / Places Data [Dataset]. https://www.kaggle.com/himanshutripathi/places-to-explore
    Available download formats: zip (81934 bytes)
    Dataset updated
    Dec 14, 2020
    Authors
    Himanshu Tripathi
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    Context

    With this dataset of 6 CSV files crawled from the Holidify website, you can explore what makes a place great to visit, along with its rating and the best time to go; you can also use the place descriptions to practice your text preprocessing skills.

    Content

    The dataset consists of 6 CSV files:
    • bhutan.csv
    • india.csv
    • indonesia.csv
    • singapore.csv
    • thailand.csv
    • vietnam.csv

    Each CSV file has 4 columns:
    • Place Name
    • Rating
    • About Place
    • Best Time To Visit

    Note: singapore.csv doesn't have the "Best Time To Visit" column.

    Acknowledgements

    I scraped the data using Python with the BeautifulSoup and Requests libraries, did some text processing/cleaning, and stored the data in CSV format.

  13. Adidas_Sales_Analysis

    • kaggle.com
    zip
    Updated Mar 11, 2023
    Cite
    Archis Rudra (2023). Adidas_Sales_Analysis [Dataset]. https://www.kaggle.com/datasets/archisrudra/adidas-sales-analysis/versions/1
    Available download formats: zip (1863030 bytes)
    Dataset updated
    Mar 11, 2023
    Authors
    Archis Rudra
    Description

    Portfolio_Adidas_Dataset: a set of real-world dataset tasks completed using the Python Pandas and Matplotlib libraries.

    Background information: In this portfolio, we use Python Pandas & Matplotlib to analyze and answer business questions about five products' worth of sales data. The data contains hundreds of thousands of footwear store purchases broken down by product type, cost, region, state, city, and so on.

    We start by cleaning our data. Tasks during this section include:

    1. Drop NaN values from DataFrame
    2. Removing column based on a condition
    3. Changing the column name
    4. Removing rows based on a condition
    5. Reindexing rows based on a condition
    6. Adding Month and Year column (to_datetime)
    7. Conversion of data types from string to integer (to_numeric)

    Once we have cleaned up our data a bit, we move to the data exploration section. In this section we explore 5 high-level business questions related to our data:

    1. What was the highest number of sales in which year?
    2. What product sold the most? Why do you think it sold the most?
    3. What was the average price for each product? And the overall average price of all products?
    4. What was the best retailer for sales? How much did that retailer earn?
    5. What method is most efficient for sales?

    To answer these questions we walk through many different openpyxl, pandas, and matplotlib methods. They include:

    1. Using groupby to perform aggregate analysis
    2. Plotting bar charts, lines graphs, and pie charts to visualize our results
    3. Labeling our graphs
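
    A minimal pandas sketch of a few of the steps above, under assumed file and column names ("adidas_sales.csv", "Invoice Date" and "Total Sales" are guesses, not necessarily the portfolio's actual names):

      import pandas as pd

      df = pd.read_csv("adidas_sales.csv")

      df = df.dropna()                                         # 1. drop NaN rows
      df["Invoice Date"] = pd.to_datetime(df["Invoice Date"])  # 6. parse dates
      df["Month"] = df["Invoice Date"].dt.month
      df["Year"] = df["Invoice Date"].dt.year
      df["Total Sales"] = pd.to_numeric(df["Total Sales"])     # 7. string -> number

      # Aggregate analysis with groupby: which year had the highest sales?
      print(df.groupby("Year")["Total Sales"].sum().sort_values(ascending=False))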
  14. iFood Marketing Campaigns Analysis

    • kaggle.com
    zip
    Updated Aug 17, 2023
    Cite
    Ahmad Fayez (2023). iFood Marketing Campaigns Analysis [Dataset]. https://www.kaggle.com/datasets/fayez7/ifood-marketing-campaigns
    Available download formats: zip (295368 bytes)
    Dataset updated
    Aug 17, 2023
    Authors
    Ahmad Fayez
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data is part of a dataset published on GitHub at this link.

    File Name: ifood_df.csv

    The data covers the results of 5 marketing campaigns run by a food company, and how each customer interacted with those campaigns, in addition to demographic data about the customers, such as income, age, education level, marital status, number of children and teenagers, and other data related to each customer.

    I downloaded it to explore, clean, and transform it in Microsoft Excel, then visualize and analyze it with Python. First of all, in the Exploration phase: understand each column and its relationships with the others, and define the important questions that lead to recommendations about the marketing campaigns.

    In the Cleaning phase: delete the columns "Z_CostContact" and "Z_Revenue" because they contain a fixed number and are not important for my questions. Delete the column "Response" because it is not used in my analysis and is meaningless, given that I could not find out what it stands for. Then check for missing data: I found that the data is complete. After that, check for duplicates: every row is unique. Also check for accuracy: the data has correct and logical values. The only thing you should know is that the data is not current; it covers more than 2000 customers from 2020.

    Overall, in the cleaning process the data proved accurate, complete, consistent, relevant, valid, and unique, but it needed some transformation.

    In the Transformation phase: add an "Index" column as a unique identifier for each customer. Aggregate all marital statuses in one column. Aggregate all education levels in one column. Rearrange some columns, like the campaigns and totals.
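
    The same cleaning and transformation steps, sketched in pandas for reference (the author worked in Excel; the file name comes from above, and the column names from the field list below):

      import pandas as pd

      df = pd.read_csv("ifood_df.csv")

      # Drop the constant/unused columns named above.
      df = df.drop(columns=["Z_CostContact", "Z_Revenue", "Response"],
                   errors="ignore")

      print(df.isna().sum().sum())   # expect 0: the data is complete
      print(df.duplicated().sum())   # expect 0: every row is unique
      df.insert(0, "Index", range(len(df)))  # unique identifier per customer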

    Fields Description:

    Index: unique identifier for each customer.
    Income: the yearly income for each customer.
    Kidhome: number of small children in the customer’s household.
    Teenhome: number of teenagers in the customer’s household.
    Recency: number of days since the last purchase.
    MntWines: amount of wine purchased in the last 2 years.
    MntFruits: amount of fruit purchased in the last 2 years.
    MntMeatProducts: amount of meat purchased in the last 2 years.
    MntFishProducts: amount of fish purchased in the last 2 years.
    MntSweetProducts: amount of sweets purchased in the last 2 years.
    MntRegularProds: amount of regular products purchased in the last 2 years.
    MntGoldProds: amount of special products purchased in the last 2 years.
    MntTotal: total amount of everything purchased in the last 2 years.
    NumDealsPurchases: number of purchases made with a discount.
    NumWebPurchases: number of purchases made through the company’s website.
    NumCatalogPurchases: number of purchases made using a catalog.
    NumStorePurchases: number of purchases made directly in the store.
    NumWebVisitsMonth: number of visits to the company’s website in the last month.
    AcceptedCmp1: 1 if the customer accepted the offer in the first campaign, 0 otherwise.
    AcceptedCmp2: 1 if the customer accepted the offer in the second campaign, 0 otherwise.
    AcceptedCmp3: 1 if the customer accepted the offer in the third campaign, 0 otherwise.
    AcceptedCmp4: 1 if the customer accepted the offer in the fourth campaign, 0 otherwise.
    AcceptedCmp5: 1 if the customer accepted the offer in the fifth campaign, 0 otherwise.
    AcceptedCmpOverall: total number of marketing campaigns the customer accepted.
    Complain: whether the customer complained in the last 2 years or not.
    Age: the customer’s age.
    Customer_Days: days since registration.
    marital_status: the customer’s marital status.
    education: the customer’s level of education.

  15. Housing Prices Dataset

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    Available download formats: zip (4740 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    M Yasser H
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    A simple yet challenging project: predict the housing price based on factors like house area, bedrooms, furnishing, nearness to the main road, etc. The dataset is small, yet its complexity arises from the fact that it has strong multicollinearity. Can you overcome these obstacles & build a decent predictive model?

    Acknowledgement:

    Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

    Objective:

    • Understand the dataset & clean it up (if required).
    • Build regression models to predict the price with respect to single & multiple features.
    • Evaluate the models & compare their respective scores, like R2, RMSE, etc.
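
    A minimal sketch of that objective, assuming a "Housing.csv" file with a numeric "price" target and yes/no categorical features (the file and column names are assumptions about this Kaggle copy):

      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_squared_error, r2_score
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("Housing.csv")
      # One-hot encode categorical features (e.g. yes/no columns).
      X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)
      y = df["price"]

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                random_state=42)
      model = LinearRegression().fit(X_tr, y_tr)

      pred = model.predict(X_te)
      print("R2:  ", r2_score(y_te, pred))
      print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)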