21 datasets found
  1. Learn pandas

    • kaggle.com
    Updated Apr 25, 2021
    Cite
    npscul (2021). Learn pandas [Dataset]. https://www.kaggle.com/npscul/learn-pandas/code
    Available in the Croissant format (a format for machine-learning datasets; see mlcommons.org/croissant).
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    npscul
    Description

    Dataset

    This dataset was created by npscul


  2. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
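As a taste of what these chapters cover, here is a minimal, self-contained pandas sketch (illustrative data; not part of the dataset):

```python
import pandas as pd

# Creating a DataFrame (Chapter 7)
df = pd.DataFrame({"name": ["Ana", "Ben", "Cal"], "score": [88, 92, 79]})

# Boolean indexing (Chapter 4)
passed = df[df["score"] >= 80]

# Label-based and position-based access (Chapter 40: .loc / .iloc)
first_name = df.loc[0, "name"]   # value at row label 0, column "name"
first_row = df.iloc[0]           # first row by position

# A simple aggregate (Chapter 38 territory)
mean_score = df["score"].mean()
```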
  3. Data from: car-sales

    • kaggle.com
    Updated Jun 30, 2020
    Cite
    Makar Baderko (2020). car-sales [Dataset]. https://www.kaggle.com/makarbaderko/carsales
    Available download formats: zip (18,661 bytes)
    Authors
    Makar Baderko
    Description

    Dataset

    This dataset was created by Makar Baderko

    Released under Data files © Original Authors


  4. EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Description

    ABSTRACT In this project, we propose a new comprehensive realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including, Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as, ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, ...etc. The IoT data are generated from various IoT devices (more than 10 types) such as Low-cost digital sensors for sensing temperature and humidity, Ultrasonic sensor, Water level detection sensor, pH Sensor Meter, Soil Moisture sensor, Heart Rate Sensor, Flame Sensor, ...etc.). However, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats, including, DoS/DDoS attacks, Information gathering, Man in the middle attacks, Injection attacks, and Malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, network traffic, and propose new 61 features with high correlations from 1176 found features. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowed after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his rights under copyright.

    The details of the Edge-IIoT dataset were published in the following paper. For academic/public use of these datasets, the authors ask that you cite the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:

    from google.colab import files
    !pip install -q kaggle
    files.upload()
    !mkdir ~/.kaggle
    !cp kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"
    !unzip DNN-EdgeIIoT-dataset.csv.zip
    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the dataset's CSV file into a pandas DataFrame:

    import pandas as pd
    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3: Exploring some of the DataFrame's contents:

    df.head(5)
    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (columns, duplicated rows, NaN, Null):

    from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                    "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp",
                    "http.request.uri.query", "tcp.options", "tcp.payload", "tcp.srcport",
                    "tcp.dstport", "udp.port", "mqtt.msg"]

    df.drop(drop_columns, axis=1, inplace=True)
    df.dropna(axis=0, how='any', inplace=True)
    df.drop_duplicates(subset=None, keep="first", inplace=True)
    df = shuffle(df)
    df.isna().sum()
    print(df['Attack_type'].value_counts())

    Step 5: Categorical data encoding (dummy encoding):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn import preprocessing

    def encode_text_dummy(df, name):
        dummies = pd.get_dummies(df[name])
        for x in dummies.columns:
            dummy_name = f"{name}-{x}"
            df[dummy_name] = dummies[x]
        df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df, 'http.request.method')
    encode_text_dummy(df, 'http.referer')
    encode_text_dummy(df, "http.request.version")
    encode_text_dummy(df, "dns.qry.name.len")
    encode_text_dummy(df, "mqtt.conack.flags")
    encode_text_dummy(df, "mqtt.protoname")
    encode_text_dummy(df, "mqtt.topic")

    Step 6: Creation of the preprocessed dataset:

    df.to_csv('preprocessed_DNN.csv', encoding='utf-8')

    For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023

  5. PANDA fanconic model weights

    • kaggle.com
    Updated Jul 22, 2020
    Cite
    Claudio Fanconi (2020). PANDA fanconic model weights [Dataset]. https://www.kaggle.com/fanconic/panda-tiles-20x112x112/metadata
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Claudio Fanconi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    16x112x112 tiles of images in PNG format from the PANDA prostate detection challenge.

    Content

    The dataset contains 16 tiles of size 112x112 for every image of the original competition dataset. The tiles kept are the ones containing the most significant pixel information.

    Acknowledgements

    The data in this dataset was created with the following kernel: https://www.kaggle.com/fanconic/panda-20x112x112-tiles-for-efficientnetb0

    Many thanks to @iafoss for the original kernel: https://www.kaggle.com/iafoss/panda-16x128x128-tiles You da real MVP!

  6. Social Power NBA

    • kaggle.com
    Updated Aug 1, 2017
    Cite
    Noah Gift (2017). Social Power NBA [Dataset]. https://www.kaggle.com/datasets/noahgift/social-power-nba/suggestions
    Dataset provided by
    Kaggle
    Authors
    Noah Gift
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.

    Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".

    Slides from a March 2018 Strata talk about this dataset are available:

    https://www.slideshare.net/noahgift/social-power-andinfluenceinthenba-89807740?qid=3f9f835a-f3d7-4174-8a8c-c97f9c82e614&v=&b=&from_search=1

    Further reading on this dataset is in Chapter 6 of the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning, and in lesson 9 of Essential Machine Learning and AI with Python and Jupyter Notebook.


    Acknowledgement

    Data sources include ESPN, Basketball-Reference, Twitter, Five-ThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.

    Inspiration

    • Do NBA fans know more about who the best players are, or do owners?
    • What is the true worth of the social media presence of athletes in the NBA?
  7. Age and Sex Prediction by Artificial Intelligence

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    EMİRHAN BULUT (2025). Age and Sex Prediction by Artificial Intelligence [Dataset]. https://www.kaggle.com/datasets/emirhanai/age-and-sex-prediction-by-artificial-intelligence
    Dataset provided by
    Kaggle
    Authors
    EMİRHAN BULUT
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Age and Sex Prediction from Image - Convolutional Neural Network with Artificial Intelligence

    I developed artificial intelligence software that predicts your age and gender, with a 93% accuracy rate. I'm 21 years old and it predicted my age correctly! I adjusted the algorithm and prepared the code. The system works with neural networks in a deep learning setup; I used convolutional layers from convolutional neural networks. I am pleased to present this software for humanity. Doctoral students can use it in their theses, and companies can use this software. Upload your photo and it will guess your age and gender!

    Kind regards,

    Emirhan BULUT

    Head of AI & AI Inventor

    The coding language used:

    Python 3.9.8

    Libraries Used:

    TensorFlow

    Keras

    OpenCV

    MatPlotlib

    NumPy

    Pandas

    Scikit-learn - (SKLEARN)

    https://raw.githubusercontent.com/emirhanai/Age-and-Sex-Prediction-from-Image---Convolutional-Neural-Network-with-Artificial-Intelligence/main/Age%20and%20Sex%20Prediction%20from%20Image%20-%20Convolutional%20Neural%20Network%20with%20Artificial%20Intelligence.png

    Developer Information:

    Name-Surname: Emirhan BULUT

    Contact (Email) : emirhan@isap.solutions

    LinkedIn : https://www.linkedin.com/in/artificialintelligencebulut/

    Kaggle: https://www.kaggle.com/emirhanai

    Official Website: https://www.emirhanbulut.com.tr

  8. Cryptocurrency Prediction Artificial Intelligence

    • kaggle.com
    Updated Jul 7, 2025
    Cite
    EMİRHAN BULUT (2025). Cryptocurrency Prediction Artificial Intelligence [Dataset]. https://www.kaggle.com/datasets/emirhanai/cryptocurrency-prediction-artificial-intelligence
    Dataset provided by
    Kaggle
    Authors
    EMİRHAN BULUT
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    Cryptocurrency-Prediction-with-Artificial-Intelligence

    First version: Cryptocurrency Prediction with Artificial Intelligence (deep learning via LSTM neural networks), by Emirhan BULUT. I developed cryptocurrency prediction software using deep learning with LSTM neural networks. I predicted the fall on December 28, 2021 in the XRP/USDT pair with 98.5% accuracy. The completed software achieved an MAE score of 0.009179626158151918, an MSE score of 0.0002120391943355104, and 98.35% accuracy.

    The XRP/USDT pair forecast for December 28, 2021 was correctly forecasted based on data from Binance.

    Software codes and information are shared with you as open source code free of charge on GitHub and My Personal Web Address.

    Happy learning!

    Emirhan BULUT

    Senior Artificial Intelligence Engineer & Inventor

    The coding language used:

    Python 3.9.8

    Libraries Used:

    Tensorflow - Keras

    NumPy

    Matplotlib

    Pandas

    Scikit-learn - (SKLEARN)

    https://raw.githubusercontent.com/emirhanai/Cryptocurrency-Prediction-with-Artificial-Intelligence/main/XRP-1%20-%20PREDICTION.png

    Developer Information:

    Name-Surname: Emirhan BULUT

    Contact (Email) : emirhan@isap.solutions

    LinkedIn : https://www.linkedin.com/in/artificialintelligencebulut/

    Kaggle: https://www.kaggle.com/emirhanai

    Official Website: https://www.emirhanbulut.com.tr

  9. Enhanced Pizza Sales Data (2024–2025)

    • kaggle.com
    Updated May 12, 2025
    Cite
    akshay gaikwad (2025). Enhanced Pizza Sales Data (2024–2025) [Dataset]. https://www.kaggle.com/datasets/akshaygaikwad448/pizza-delivery-data-with-enhanced-features
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    akshay gaikwad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a realistic and structured pizza sales dataset covering the time span from **2024 to 2025**. Whether you're a beginner in data science, a student working on a machine learning project, or an experienced analyst looking to test out time series forecasting and dashboard building, this dataset is for you.

    📁 What’s Inside? The dataset contains rich details from a pizza business including:

    ✅ Order Dates & Times ✅ Pizza Names & Categories (Veg, Non-Veg, Classic, Gourmet, etc.) ✅ Sizes (Small, Medium, Large, XL) ✅ Prices ✅ Order Quantities ✅ Customer Preferences & Trends

    It is neatly organized in Excel format and easy to use with tools like Python (Pandas), Power BI, Excel, or Tableau.
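For example, a first pass with pandas might aggregate revenue per day (the column names below are assumptions for illustration; check the actual headers in the Excel file, which you would load with pd.read_excel):

```python
import pandas as pd

# Hypothetical rows mimicking the described fields: order dates,
# pizza names, sizes, prices, and quantities.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-05", "2024-01-06"]),
    "pizza_name": ["Margherita", "Pepperoni", "Margherita"],
    "size": ["M", "L", "S"],
    "price": [8.5, 11.0, 6.5],
    "quantity": [2, 1, 3],
})

# Revenue per order line, then daily totals for time-series work
orders["revenue"] = orders["price"] * orders["quantity"]
daily_revenue = orders.groupby(orders["order_date"].dt.date)["revenue"].sum()
```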

    💡 **Why Use This Dataset?** This dataset is ideal for:

    📈 Sales Analysis & Reporting 🧠 Machine Learning Models (demand forecasting, recommendations) 📅 Time Series Forecasting 📊 Data Visualization Projects 🍽️ Customer Behavior Analysis 🛒 Market Basket Analysis 📦 Inventory Management Simulations

    🧠 Perfect For: Data Science Beginners & Learners, BI Developers & Dashboard Designers, MBA Students (Marketing, Retail, Operations), Hackathons & Case Study Competitions

    Tags: pizza, sales data, excel dataset, retail analysis, data visualization, business intelligence, forecasting, time series, customer insights, machine learning, pandas, beginner friendly

  10. Bank Data Analysis

    • kaggle.com
    Updated Mar 19, 2022
    Cite
    Steve Gallegos (2022). Bank Data Analysis [Dataset]. https://www.kaggle.com/stevegallegos/bank-marketing-data-set/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Steve Gallegos
    Description

    Data Set Information

    The bank.csv dataset describes phone calls between customers and customer-care staff working for a Portuguese banking institution. It records whether each customer took up a product such as a bank term deposit; the target is a binary 'yes' or 'no'.

    Goal

    The main goal is to predict if clients will subscribe to a term deposit or not.

    Attribute Information

    -Input Variables -

    Bank Client Data:
    1 - age: (numeric)
    2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
    3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
    4 - education: (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
    5 - default: has credit in default? (categorical: no, yes, unknown)
    6 - housing: has housing loan? (categorical: no, yes, unknown)
    7 - loan: has personal loan? (categorical: no, yes, unknown)

    Related with the Last Contact of the Current Campaign:
    8 - contact: contact communication type (categorical: cellular, telephone)
    9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
    10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
    11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

    Other Attributes:
    12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
    14 - previous: number of contacts performed before this campaign and for this client (numeric)
    15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)

    Social and Economic Context Attributes:
    16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
    17 - cons.price.idx: consumer price index - monthly indicator (numeric)
    18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
    19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
    20 - nr.employed: number of employees - quarterly indicator (numeric)

    Output Variable (Desired Target):
    21 - y (deposit): has the client subscribed a term deposit? (binary: yes, no). Note: the column title was changed from '***y***' to '***deposit***'.
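A minimal baseline sketch for that prediction goal, using a tiny synthetic stand-in for bank.csv (values made up for illustration; per the attribute notes, 'duration' is deliberately left out of the features):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for bank.csv (values made up for illustration)
df = pd.DataFrame({
    "age": [30, 45, 52, 29, 61, 38],
    "job": ["admin.", "technician", "retired", "student", "retired", "services"],
    "campaign": [1, 2, 1, 3, 1, 2],
    "deposit": ["no", "yes", "yes", "no", "yes", "no"],
})

# One-hot encode the categorical column; numeric columns pass through as-is.
X = pd.get_dummies(df[["age", "job", "campaign"]])
y = (df["deposit"] == "yes").astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = model.score(X, y)
```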

    Source

    [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

  11. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    **patients.csv**

    Contains patient demographic and registration details.

    patient_id -> Unique ID for each patient
    first_name -> Patient's first name
    last_name -> Patient's last name
    gender -> Gender (M/F)
    date_of_birth -> Date of birth
    contact_number -> Phone number
    address -> Address of the patient
    registration_date -> Date of first registration at the hospital
    insurance_provider -> Insurance company name
    insurance_number -> Policy number
    email -> Email address

    **doctors.csv**

    Details about the doctors working in the hospital.

    doctor_id -> Unique ID for each doctor
    first_name -> Doctor's first name
    last_name -> Doctor's last name
    specialization -> Medical field of expertise
    phone_number -> Contact number
    years_experience -> Total years of experience
    hospital_branch -> Branch of hospital where the doctor is based
    email -> Official email address

    **appointments.csv**

    Records of scheduled and completed patient appointments.

    appointment_id -> Unique appointment ID
    patient_id -> ID of the patient
    doctor_id -> ID of the attending doctor
    appointment_date -> Date of the appointment
    appointment_time -> Time of the appointment
    reason_for_visit -> Purpose of visit (e.g., checkup)
    status -> Status (Scheduled, Completed, Cancelled)

    **treatments.csv**

    Information about the treatments given during appointments.

    treatment_id -> Unique ID for each treatment
    appointment_id -> Associated appointment ID
    treatment_type -> Type of treatment (e.g., MRI, X-ray)
    description -> Notes or procedure details
    cost -> Cost of treatment
    treatment_date -> Date when treatment was given

    **billing.csv**

    Billing and payment details for treatments.

    bill_id -> Unique billing ID
    patient_id -> ID of the billed patient
    treatment_id -> ID of the related treatment
    bill_date -> Date of billing
    amount -> Total amount billed
    payment_method -> Mode of payment (Cash, Card, Insurance)
    payment_status -> Status of payment (Paid, Pending, Failed)

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations
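The SQL-style joins mentioned above map directly onto pandas; here is a minimal sketch with made-up rows (column names taken from the file descriptions above):

```python
import pandas as pd

# Made-up stand-ins for patients.csv and billing.csv (a few columns only)
patients = pd.DataFrame({
    "patient_id": [1, 2],
    "first_name": ["Asha", "Ravi"],
    "insurance_provider": ["Acme Health", "CarePlus"],
})
billing = pd.DataFrame({
    "bill_id": [10, 11, 12],
    "patient_id": [1, 1, 2],
    "amount": [250.0, 75.0, 400.0],
    "payment_status": ["Paid", "Pending", "Paid"],
})

# Left join billing onto patients (like SQL LEFT JOIN ... ON patient_id),
# then total billed amount per patient.
merged = billing.merge(patients, on="patient_id", how="left")
totals = merged.groupby("first_name")["amount"].sum()
```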

    Please note:

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  12. Classified Ads for Cars - unique maker/model/year

    • kaggle.com
    Updated Mar 9, 2019
    Cite
    Volodymyr Sergeyev (2019). Classified Ads for Cars - unique maker/model/year [Dataset]. https://www.kaggle.com/vsergeyev/classified-ads-for-cars-unique-makermodelyear/notebooks
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Volodymyr Sergeyev
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Thanks to Miroslav Zoricak and https://www.kaggle.com/mirosval/personal-cars-classifieds

    Inspiration

    • How many unique car makers are there?
    • How many models are there?
    • Learn pandas
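    A minimal pandas sketch for the first two questions (the column names maker and model are assumptions about this dataset's schema, and the rows are made up):

```python
import pandas as pd

# Toy stand-in for the classified-ads table; real column names may differ
ads = pd.DataFrame({
    "maker": ["Ford", "Ford", "Skoda", "Skoda", "BMW"],
    "model": ["Focus", "Fiesta", "Octavia", "Octavia", "320i"],
})

n_makers = ads["maker"].nunique()  # number of unique car makers
n_models = ads[["maker", "model"]].drop_duplicates().shape[0]  # unique maker/model pairs
print(n_makers, n_models)  # 3 4
```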
  13. Data from: EHR data

    • kaggle.com
    Updated Apr 30, 2025
    Bipul Shahi (2025). EHR data [Dataset]. https://www.kaggle.com/datasets/vipulshahi/ehr-data/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bipul Shahi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🔍 Dataset Overview

    Each patient in the dataset has 30 days of continuous health data. The goal is to predict if a patient will progress to a critical condition based on their vital signs, medication adherence, and symptoms recorded daily.

    There are 10 columns in the dataset:

    Column Name -> Description
    patient_id -> Unique identifier for each patient.
    day -> Day number (from 1 to 30) indicating sequential daily records.
    bp_systolic -> Systolic blood pressure (top number) in mm Hg. Higher values may indicate hypertension.
    bp_diastolic -> Diastolic blood pressure (bottom number) in mm Hg.
    heart_rate -> Heartbeats per minute. Elevated heart rate can signal stress, infection, or deterioration.
    respiratory_rate -> Breaths per minute. Elevated rates can indicate respiratory distress.
    temperature -> Body temperature in °F. Fever or hypothermia are signs of infection or inflammation.
    oxygen_saturation -> Percentage of oxygen in blood. Lower values are concerning (< 94%).
    med_adherence -> Patient’s medication adherence (between 0 and 1). Lower values may contribute to worsening.
    symptom_severity -> Subjective symptom rating (scale of 1–10). Higher means worse condition.
    progressed_to_critical -> Target label: 1 if patient deteriorated to a critical condition, else 0.

    🎯 Final Task (Prediction Objective)

    Problem Type: Binary classification with time-series data.

    Goal: Train deep learning models (RNN, LSTM, GRU) to learn temporal patterns from a patient's 30-day health history and predict whether the patient will progress to a critical condition.

    📈 How the Data is Used for Modeling

    Input: A 3D array shaped as (num_patients, 30, 8), where 30 = number of days (timesteps) and 8 = features per day (excluding ID, day, and target).
    Output: A binary label for each patient (0 or 1).

    🔄 Feature Contribution to Prediction

    Feature -> Why It Matters
    bp_systolic/dia -> Persistently high or rising BP may signal stress, cardiac issues, or deterioration.
    heart_rate -> A rising heart rate can indicate fever, infection, or organ distress.
    respiratory_rate -> Often increases early in critical illnesses like sepsis or COVID.
    temperature -> Fever is a key sign of infection. Chronic low/high temp may indicate underlying pathology.
    oxygen_saturation -> A declining oxygen level is a strong predictor of respiratory failure.
    med_adherence -> Poor medication adherence is often linked to worsening chronic conditions.
    symptom_severity -> Patient-reported worsening symptoms may precede measurable physiological changes.

    🛠 Tools You’ll Use

    Task -> Tool/Technique
    Data processing -> Pandas, NumPy, Scikit-learn
    Time series modeling -> Keras (using SimpleRNN, LSTM, GRU)
    Evaluation -> Accuracy, Loss, ROC Curve (optional)
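    The (num_patients, 30, 8) input shaping described above can be sketched with pandas and NumPy. The frame below is a tiny synthetic stand-in with two patients and three days each (the real data has 30 days per patient):

```python
import numpy as np
import pandas as pd

FEATURES = ["bp_systolic", "bp_diastolic", "heart_rate", "respiratory_rate",
            "temperature", "oxygen_saturation", "med_adherence", "symptom_severity"]

# Tiny synthetic stand-in: 2 patients x 3 days (the real data has 30 days)
rng = np.random.default_rng(0)
records = []
for pid in (1, 2):
    for day in (1, 2, 3):
        row = {"patient_id": pid, "day": day}
        row.update({f: float(rng.uniform()) for f in FEATURES})
        records.append(row)
df = pd.DataFrame(records)

# Sort so each patient's days are contiguous, then stack into (patients, days, features)
df = df.sort_values(["patient_id", "day"])
X = df[FEATURES].to_numpy().reshape(df["patient_id"].nunique(), 3, len(FEATURES))
print(X.shape)  # (2, 3, 8); with the full data this would be (num_patients, 30, 8)
```

    X can then be fed to a Keras SimpleRNN, LSTM, or GRU with input shape (timesteps, features).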

  14. Telecom Consumer Complaints

    • kaggle.com
    Updated May 21, 2020
    Aditya6196 (2020). Telecom Consumer Complaints [Dataset]. https://www.kaggle.com/aditya6196/telecom-consumer-complaints/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aditya6196
    Description

    DESCRIPTION

    Comcast is an American global telecommunication company. The firm has been providing terrible customer service. They continue to fall short despite repeated promises to improve. Only last month (October 2016) the authority fined them $2.3 million after receiving over 1,000 consumer complaints. The existing database will serve as a repository of public customer complaints filed against Comcast and will help pin down what is wrong with Comcast's customer service.

    Data Dictionary

    1. Ticket #: Ticket number assigned to each complaint
    2. Customer Complaint: Description of complaint
    3. Date: Date of complaint
    4. Time: Time of complaint
    5. Received Via: Mode of communication of the complaint
    6. City: Customer city
    7. State: Customer state
    8. Zipcode: Customer zip
    9. Status: Status of complaint
    10. Filing on behalf of someone

    Analysis Task

    To perform these tasks, you can use any of the different Python libraries such as NumPy, SciPy, Pandas, scikit-learn, matplotlib, and BeautifulSoup.

    • Import data into Python environment.
    • Provide the trend chart for the number of complaints at monthly and daily granularity levels.
    • Provide a table with the frequency of complaint types.

    • Which complaint types are maximum, i.e., around internet, network issues, or any other domains.
    • Create a new categorical variable with values Open and Closed: Open & Pending are to be categorized as Open, and Closed & Solved as Closed.
    • Provide state-wise status of complaints in a stacked bar chart. Use the categorized variable from Q3. Provide insights on:
      • Which state has the maximum complaints
      • Which state has the highest percentage of unresolved complaints
    • Provide the percentage of complaints resolved till date which were received through the Internet and customer care calls.
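    The recoding and state-wise tasks above can be sketched in pandas (column names follow the data dictionary; the rows are invented):

```python
import pandas as pd

complaints = pd.DataFrame({
    "State": ["Georgia", "Georgia", "Texas", "Florida"],
    "Status": ["Open", "Pending", "Solved", "Closed"],
})

# Open & Pending -> Open; Closed & Solved -> Closed
complaints["NewStatus"] = complaints["Status"].map(
    {"Open": "Open", "Pending": "Open", "Closed": "Closed", "Solved": "Closed"})

# State-wise counts, ready for a stacked bar chart via by_state.plot(kind="bar", stacked=True)
by_state = pd.crosstab(complaints["State"], complaints["NewStatus"])
unresolved_pct = by_state["Open"] / by_state.sum(axis=1) * 100
print(by_state)
print(unresolved_pct)
```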

  15. Tajweed Dataset

    • kaggle.com
    Updated Apr 6, 2025
    Ala'a Abdu Saleh Alawdi (2025). Tajweed Dataset [Dataset]. https://www.kaggle.com/datasets/alawdisoft/tajweed-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ala'a Abdu Saleh Alawdi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The provided code processes a Tajweed dataset, which appears to be a collection of audio recordings categorized by different Tajweed rules (Ikhfa, Izhar, Idgham, Iqlab). Let's break down the dataset's structure and the code's functionality:

    Dataset Structure:

    • Organized by Tajweed Rule and Sheikh: The dataset is structured into directories for each Tajweed rule (e.g., 'Ikhfa', 'Izhar'). Within each rule's directory, there are subdirectories representing different reciters (sheikhs). This hierarchical organization is crucial for creating a structured metadata file and for training machine learning models.
    • Audio Files: The audio files (presumably WAV or other supported formats) are stored within the sheikh's subdirectories. The original filenames are not standardized.
    • Multiple Sheikhs per Rule: The dataset includes multiple recitations for each rule from different sheikhs, offering diversity in pronunciation.
    • Google Drive Storage: The dataset is located on Google Drive, which requires mounting the drive to access the data within a Colab environment.

    Code Functionality:

    1. Initialization and Imports: The code begins with necessary imports (pandas, pydub) and mounts Google Drive. Pydub is used for audio file format conversion.

    2. Directory Listing: It initially checks if a specified directory exists (for example, Alaa_alhsri/Ikhfa) and lists its files, demonstrating basic file system access.

    3. Metadata Creation: The core of the script is the generation of metadata, which provides essential information about each audio file. The tajweed_paths dictionary maps each Tajweed rule to a list of paths, associating each path with the reciter's name.

      • Iterating through Paths: The code iterates through each Tajweed rule and its corresponding paths.
      • File Listing: Inside each directory, it iterates through the audio files.
      • Metadata Dictionary: For each audio file, it creates a metadata dictionary that includes:
        • global_id: A unique identifier for each audio file.
        • original_filename: The original filename of the audio file.
        • new_filename: A standardized filename that incorporates the Tajweed rule (label), sheikh's ID, audio number, and a global ID.
        • label: The Tajweed rule.
        • sheikh_id: A numerical identifier for each sheikh.
        • sheikh_name: The name of the reciter.
        • audio_number: A sequential number for the audio files within a specific sheikh and Tajweed rule combination.
        • original_path: Full path to the original audio file.
        • new_path: Full path to the intended location for the renamed and potentially converted audio file.
      • Pandas DataFrame: The metadata is collected in a list of dictionaries and then converted into a Pandas DataFrame for easier viewing and processing. This DataFrame is highly informative.
    4. File Renaming and Conversion:

      • File Renaming: (commented out) The code is able to rename the audio files to the standardized format defined in new_filename and store it in the designated directory.
      • Audio Conversion to WAV: The script then converts any files in the specified directories to .wav format, creating standardized files in a new output_dataset directory. The new filenames are based on rules, sheikh and a counter.
    5. Metadata Export: Finally, the compiled metadata is saved as a CSV file (metadata.csv) in the output directory. This CSV file is crucial for training any machine learning model using this data.
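    The metadata-building loop of step 3 can be sketched as follows. The directory layout and the tajweed_paths structure are assumptions modeled on the description above, the tree is created in a temp folder for illustration, and the pydub conversion of step 4 is omitted:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in directory tree: <rule>/<sheikh>/<audio files>
root = Path(tempfile.mkdtemp())
for rule in ("Ikhfa", "Izhar"):
    d = root / rule / "Alaa_alhsri"
    d.mkdir(parents=True)
    (d / "rec1.mp3").touch()

# Maps each Tajweed rule to (path, reciter name) pairs, as described
tajweed_paths = {rule: [(root / rule / "Alaa_alhsri", "Alaa_alhsri")]
                 for rule in ("Ikhfa", "Izhar")}

metadata, global_id = [], 0
for label, paths in tajweed_paths.items():
    for sheikh_id, (folder, sheikh_name) in enumerate(paths, start=1):
        for audio_number, f in enumerate(sorted(folder.iterdir()), start=1):
            global_id += 1
            metadata.append({
                "global_id": global_id,
                "original_filename": f.name,
                # Standardized name: rule, sheikh ID, audio number, global ID
                "new_filename": f"{label}_{sheikh_id}_{audio_number}_{global_id}.wav",
                "label": label,
                "sheikh_id": sheikh_id,
                "sheikh_name": sheikh_name,
                "audio_number": audio_number,
                "original_path": str(f),
            })
df = pd.DataFrame(metadata)
print(df[["global_id", "label", "new_filename"]])
```

    In the real script, df.to_csv("metadata.csv") would then produce the exported metadata file of step 5.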

  16. GitHub Commit Messages Dataset

    • kaggle.com
    Updated Apr 21, 2021
    Dhruvil Dave (2021). GitHub Commit Messages Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/2143532
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dhruvil Dave
    License

    Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description


    Image credits: https://github.com

    Introduction

    This is a dataset that contains all commit messages and their related metadata from 34 popular GitHub repositories. These repositories are:

    • tensorflow/tensorflow
    • pytorch/pytorch
    • torvalds/linux
    • python/cpython
    • rust-lang/rust
    • microsoft/TypeScript
    • microsoft/vscode
    • golang/go
    • numpy/numpy
    • scikit-learn/scikit-learn
    • openbsd/src
    • freebsd/freebsd-src
    • pandas-dev/pandas
    • scipy/scipy
    • tidyverse/ggplot2
    • kubernetes/kubernetes
    • postgres/postgres
    • nodejs/node
    • facebook/react
    • angular/angular
    • matplotlib/matplotlib
    • apache/httpd
    • nginx/nginx
    • opencv/opencv
    • ipython/ipython
    • rstudio/rstudio
    • jupyterlab/jupyterlab
    • gcc-mirror/gcc
    • apple/swift
    • denoland/deno
    • apache/spark
    • llvm/llvm-project
    • chromium/chromium
    • v8/v8

    Data as of Wed Apr 21 03:42:44 PM IST 2021

    Credits

    Image credits: Unsplash - plhnk

  17. RAPIDS

    • kaggle.com
    Updated Jun 29, 2021
    Chris Deotte (2021). RAPIDS [Dataset]. https://www.kaggle.com/cdeotte/rapids/tasks
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chris Deotte
    Description

    Use this dataset to install RAPIDS in Kaggle notebooks. Installation takes 1 minute. Add the following lines of code to your notebook and turn GPU on. Change rapids.21.06 below to the version desired. (Currently v21.06, v0.19, v0.18 and v0.17 are available).

    import sys
    !cp ../input/rapids/rapids.21.06 /opt/conda/envs/rapids.tar.gz
    !cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
    sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
    sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
    sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
    !cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/
    

    Read more about RAPIDS here. The RAPIDS libraries allow us to perform all our data science on GPUs including reading data, transforming data, modeling, validation, and prediction. The package cuDF provides Pandas functionality and cuML provides Scikit-learn functionality. Other packages provide additional tools.

    Since GPUs are faster than CPUs, we save time, save money, and can increase model accuracy by performing additional tasks like hyperparameter searches, feature engineering and selection, data augmentation, and ensembling with bagging and boosting.

  18. Coursera AI Global Skills Index 2019 data

    • kaggle.com
    Updated Dec 19, 2019
    Parul Pandey (2019). Coursera AI Global Skills Index 2019 data [Dataset]. https://www.kaggle.com/parulpandey/coursera-ai-global-skills-index-2019-data/kernels
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Parul Pandey
    Description

    Context

    Coursera is an online platform for higher education. The Coursera Global Skills Index (GSI) draws upon this rich data to benchmark 60 countries and 10 industries across Business, Technology, and Data Science skills to reveal skills development trends around the world.

    Content

    Coursera measured the skill proficiency of countries in AI overall and in the related skills of math, machine learning, statistics, statistical programming, and software engineering. These related skills cover the breadth of knowledge needed to build and deploy AI-powered technologies within organizations and society:

    • Math: the theoretical background necessary to conduct and apply AI research
    • Statistics: empirical skills needed to fit and measure the impact of AI models
    • Machine Learning: skills needed to build self-learning models like deep learning and other supervised models that power most AI applications today
    • Statistical Programming: programming skills needed to implement AI models, such as in Python and related packages like scikit-learn and pandas
    • Software Engineering: programming skills needed to design and scale AI-powered applications

    Acknowledgements

  19. Top 100 Canadian Beers

    • kaggle.com
    Updated May 8, 2017
    Sam Wong (2017). Top 100 Canadian Beers [Dataset]. https://www.kaggle.com/shwong/top-100-canadian-beers/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sam Wong
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Canada
    Description

    This is a dataset created as part of a tutorial on basic web scraping. Please visit One for the Road for the tutorial!

    Introduction

    The top 100 Canadian beers as ranked by visitors to BeerAdvocate.com. This dataset is intended only to help users learn how to scrape web data using BeautifulSoup and turn it into a Pandas dataframe.

    Content

    This dataset lists the top 100 Canadian beers:

    • Rank: rank, from 1 to 100, as rated by BeerAdvocate.com users
    • Name: name of the beer
    • Brewery: the brewery responsible for this delicious creation
    • Style: the style of the beer
    • ABV: Alcohol by Volume (%)
    • Score: Overall score determined by BeerAdvocate.com users
    • Ratings: Number of ratings
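    In the spirit of the tutorial, a minimal BeautifulSoup-to-DataFrame sketch (the HTML fragment below is made up; the real BeerAdvocate page markup differs):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment standing in for the scraped ranking table
html = """
<table>
  <tr><td>1</td><td>Péché Mortel</td><td>Dieu du Ciel!</td></tr>
  <tr><td>2</td><td>Fat Tug IPA</td><td>Driftwood Brewery</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]
df = pd.DataFrame(rows, columns=["Rank", "Name", "Brewery"])
print(df)
```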

    Acknowledgements

    Thanks to all the readers and contributors of BeerAdvocate, selflessly pouring, drinking, and reviewing beers for our benefit.

    Version 2 of this dataset was scraped on 5/08/2017 from https://www.beeradvocate.com/lists/ca/

  20. Kung Fu Panda

    • kaggle.com
    Updated Nov 7, 2017
    Zeeshan-ul-hassan Usmani (2017). Kung Fu Panda [Dataset]. https://www.kaggle.com/datasets/zusmani/kung-fu-panda/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zeeshan-ul-hassan Usmani
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Do you know what is common among Kung Fu Panda, Alvin and the Chipmunks, Monster Trucks, Trolls, Spongebob Movie and Monster Vs Aliens? They all were scripted by the same authors - Jonathan Aibel and Glenn Berger.

    Kung Fu Panda is a 2008 animated movie by DreamWorks Production. It has made $631 million and is one of DreamWorks' most successful films at the box office.

    There is much talk and discussion about this movie beyond cinema-goers. Some like to learn leadership lessons from it, and a few others try to link it with Christianity, Taoism, Mysticism and Islam.

    I was wondering if we can see the script from a data science perspective and answer some questions with significant implications for the movie and other industries.

    I welcome you all to do Data Science Martial Arts with Kung Fu Panda and see who survives.

    Content

    It’s a complete script of Kung Fu Panda 1 and 2 in CSV format with all background narrations, scene settings and movie dialogues by characters (Po, Master Shifu, Tai Lung, Tigress, Monkey, Viper, Oogway, Mr. Ping, Mantis and Crane).

    Acknowledgements

    Kung Fu Panda is a production by DreamWorks Studios. All scripts were gathered from online public sources like this and this.

    Inspiration

    Some ideas worth exploring:

    • Can we train a neural network to recognize the character by dialogue? For example, given any line from the script, the algorithm should be able to tell who is more likely to say it in the movie.

    • Can we make a word cloud for each character (and perhaps compare it with other movie characters by the same authors and see who is similar to whom)?

    • Can we train a chatbot for Oogway or Po so kids can talk to it and it would respond the same way as Oogway or Po would?

    • Can we calculate the average length of a dialogue?

    • Can we estimate the difficulty level of the vocabulary being used and perhaps compare it with movies of other genres?

    • Can we compare the script with some religious texts and find similarities?
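The first idea can be sketched with scikit-learn; the dialogue lines below are invented placeholders, not quotes from the script:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented placeholder lines standing in for the script's dialogue column
lines = ["noodles are my destiny", "I love noodles and dumplings",
         "inner peace", "there are no accidents",
         "noodles noodles", "yesterday is history inner peace"]
speakers = ["Po", "Po", "Oogway", "Oogway", "Po", "Oogway"]

# Bag-of-words (TF-IDF) + naive Bayes: a simple who-said-it baseline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(lines, speakers)
pred = model.predict(["give me more noodles"])[0]
print(pred)
```

With the real script, the dialogue column would replace lines and the character column would replace speakers.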
