This dataset was created by npscul
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Makar Baderko
Released under Data files © Original Authors
ABSTRACT In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from more than 10 types of IoT devices, such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, a water level detection sensor, a pH sensor meter, a soil moisture sensor, a heart rate sensor, a flame sensor, etc. Furthermore, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, information gathering, man-in-the-middle attacks, injection attacks, and malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from the 1176 features found. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.
Instructions:
Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.
Please kindly visit the Kaggle link for updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...
Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.
The details of the Edge-IIoTset dataset were published in the following paper. For academic/public use of these datasets, the authors have to cite the following paper:
Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809
Link to paper : https://ieeexplore.ieee.org/document/9751703
The directories of the Edge-IIoTset dataset include the following:
•File 1 (Normal traffic)
-File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.
-File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.
-File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.
-File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.
-File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.
-File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.
-File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.
-File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.
-File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.
-File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.
•File 2 (Attack traffic):
-File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.
-File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.
•File 3 (Selected dataset for ML and DL):
-File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for evaluating deep learning-based intrusion detection systems.
-File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for evaluating traditional machine learning-based intrusion detection systems.
Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:
from google.colab import files
!pip install -q kaggle
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"
!unzip DNN-EdgeIIoT-dataset.csv.zip
!rm DNN-EdgeIIoT-dataset.csv.zip
Step 2: Reading the dataset's CSV file into a Pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)
Step 3: Exploring some of the DataFrame's contents:
df.head(5)
print(df['Attack_type'].value_counts())
Step 4: Dropping data (columns, duplicated rows, NaN, Null, ...):
from sklearn.utils import shuffle
drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4","arp.dst.proto_ipv4",
"http.file_data","http.request.full_uri","icmp.transmit_timestamp",
"http.request.uri.query", "tcp.options","tcp.payload","tcp.srcport",
"tcp.dstport", "udp.port", "mqtt.msg"]
df.drop(drop_columns, axis=1, inplace=True)
df.dropna(axis=0, how='any', inplace=True)
df.drop_duplicates(subset=None, keep="first", inplace=True)
df = shuffle(df)
df.isna().sum()
print(df['Attack_type'].value_counts())
Step 5: Categorical data encoding (Dummy Encoding):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
def encode_text_dummy(df, name):
    # Replace a categorical column with one dummy column per category value
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
encode_text_dummy(df,'http.request.method')
encode_text_dummy(df,'http.referer')
encode_text_dummy(df,"http.request.version")
encode_text_dummy(df,"dns.qry.name.len")
encode_text_dummy(df,"mqtt.conack.flags")
encode_text_dummy(df,"mqtt.protoname")
encode_text_dummy(df,"mqtt.topic")
Step 6: Creation of the preprocessed dataset:
df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
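As an optional follow-up to Step 6 (not part of the original steps), the preprocessed DataFrame can feed a baseline classifier. This is a minimal sketch: the target column 'Attack_type' appears above, but the 'Attack_label' column, the model choice, and the split parameters are assumptions:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Separate features from the multi-class target; 'Attack_label' (a possible
# binary label column, an assumption here) is dropped if present.
X = df.drop(columns=['Attack_type', 'Attack_label'], errors='ignore')
y = LabelEncoder().fit_transform(df['Attack_type'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))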
For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com
More information about Dr. Mohamed Amine Ferrag is available at:
https://www.linkedin.com/in/Mohamed-Amine-Ferrag
https://dblp.uni-trier.de/pid/142/9937.html
https://www.researchgate.net/profile/Mohamed_Amine_Ferrag
https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao
https://www.scopus.com/authid/detail.uri?authorId=56115001200
https://publons.com/researcher/1322865/mohamed-amine-ferrag/
https://orcid.org/0000-0002-0632-3172
Last Updated: 27 Mar. 2023
https://creativecommons.org/publicdomain/zero/1.0/
16x112x112 tiles of images in PNG format from the PANDA prostate detection challenge.
The dataset contains 16 tiles of size 112x112 for every image of the original competition dataset. The tiles selected are the ones containing the most significant pixel information.
The data in this dataset was created with the following kernel: https://www.kaggle.com/fanconic/panda-20x112x112-tiles-for-efficientnetb0
Many thanks to @iafoss for the original kernel: https://www.kaggle.com/iafoss/panda-16x128x128-tiles You da real MVP!
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.
Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".
Slides from a March 2018 Strata talk about this dataset are also available.
Further reading on this dataset is in Chapter 6 of the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning; also watch Lesson 9 in Essential Machine Learning and AI with Python and Jupyter Notebook.
You can watch a breakdown of using cluster analysis on the Pragmatic AI YouTube channel
Learn to deploy a Kaggle project into a production machine learning system (sklearn + Flask + container) by reading Python for DevOps: Learn Ruthlessly Effective Automation, Chapter 14: MLOps and Machine Learning Engineering.
Use social media to predict a winning season with this notebook: https://github.com/noahgift/core-stats-datascience/blob/master/Lesson2_7_Trends_Supervized_Learning.ipynb
Learn to use the cloud for data analysis.
Data sources include ESPN, Basketball-Reference, Twitter, Five-ThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.
http://www.gnu.org/licenses/agpl-3.0.html
I developed artificial intelligence software that predicts your age and gender, with a 93% accuracy rate. I'm 21 years old, and it predicted my age correctly! I adjusted the algorithm and prepared the code. It is a deep learning system built on neural networks; I used convolutional layers from Convolutional Neural Networks. I am pleased to present this software for humanity. Doctoral students can use it in their theses, or various companies can use this software! Upload your photo and let it guess your age and gender!
Kind regards,
Emirhan BULUT
Head of AI & AI Inventor
Python 3.9.8
TensorFlow
Keras
OpenCV
MatPlotlib
NumPy
Pandas
Scikit-learn - (SKLEARN)
Project image: https://raw.githubusercontent.com/emirhanai/Age-and-Sex-Prediction-from-Image---Convolutional-Neural-Network-with-Artificial-Intelligence/main/Age%20and%20Sex%20Prediction%20from%20Image%20-%20Convolutional%20Neural%20Network%20with%20Artificial%20Intelligence.png
Name-Surname: Emirhan BULUT
Contact (Email) : emirhan@isap.solutions
LinkedIn : https://www.linkedin.com/in/artificialintelligencebulut/
Kaggle: https://www.kaggle.com/emirhanai
Official Website: https://www.emirhanbulut.com.tr
http://www.gnu.org/licenses/agpl-3.0.html
First version: Cryptocurrency Prediction with Artificial Intelligence (Deep Learning via LSTM Neural Networks) - Emirhan BULUT. I developed cryptocurrency prediction software with artificial intelligence (deep learning with LSTM neural networks). I predicted the fall on December 28, 2021 in the XRP/USDT pair with 98.5% accuracy. The completed software achieved a MAE score of 0.009179626158151918, an MSE score of 0.0002120391943355104, and 98.35% accuracy.
The XRP/USDT pair forecast for December 28, 2021 was correctly forecasted based on data from Binance.
Software codes and information are shared with you as open source code free of charge on GitHub and My Personal Web Address.
Happy learning!
Emirhan BULUT
Senior Artificial Intelligence Engineer & Inventor
Python 3.9.8
Tensorflow - Keras
NumPy
Matplotlib
Pandas
Scikit-learn - (SKLEARN)
Project image: https://raw.githubusercontent.com/emirhanai/Cryptocurrency-Prediction-with-Artificial-Intelligence/main/XRP-1%20-%20PREDICTION.png
Name-Surname: Emirhan BULUT
Contact (Email) : emirhan@isap.solutions
LinkedIn : https://www.linkedin.com/in/artificialintelligencebulut/
Kaggle: https://www.kaggle.com/emirhanai
Official Website: https://www.emirhanbulut.com.tr
https://creativecommons.org/publicdomain/zero/1.0/
This is a realistic and structured pizza sales dataset covering the time span from **2024 to 2025**. Whether you're a beginner in data science, a student working on a machine learning project, or an experienced analyst looking to test out time series forecasting and dashboard building, this dataset is for you.
📁 What’s Inside? The dataset contains rich details from a pizza business including:
✅ Order Dates & Times ✅ Pizza Names & Categories (Veg, Non-Veg, Classic, Gourmet, etc.) ✅ Sizes (Small, Medium, Large, XL) ✅ Prices ✅ Order Quantities ✅ Customer Preferences & Trends
It is neatly organized in Excel format and easy to use with tools like Python (Pandas), Power BI, Excel, or Tableau.
💡 **Why Use This Dataset?** This dataset is ideal for:
📈 Sales Analysis & Reporting 🧠 Machine Learning Models (demand forecasting, recommendations) 📅 Time Series Forecasting 📊 Data Visualization Projects 🍽️ Customer Behavior Analysis 🛒 Market Basket Analysis 📦 Inventory Management Simulations
🧠 Perfect For: Data Science Beginners & Learners BI Developers & Dashboard Designers MBA Students (Marketing, Retail, Operations) Hackathons & Case Study Competitions
pizza, sales data, excel dataset, retail analysis, data visualization, business intelligence, forecasting, time series, customer insights, machine learning, pandas, beginner friendly
The bank.csv dataset describes phone calls between customers and customer-care staff working for a Portuguese banking institution. The dataset records whether the customer signed up for a product such as a bank term deposit. Most fields contain 'yes' or 'no' values.
The main goal is to predict if clients will subscribe to a term deposit or not.
Bank Client Data:
1 - age: (numeric)
2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
4 - education: (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
5 - default: has credit in default? (categorical: no, yes, unknown)
6 - housing: has housing loan? (categorical: no, yes, unknown)
7 - loan: has personal loan? (categorical: no, yes, unknown)
Related with the Last Contact of the Current Campaign:
8 - contact: contact communication type (categorical: cellular, telephone)
9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other Attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)
Social and Economic Context Attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output Variable (Desired Target):
21 - y (deposit): has the client subscribed a term deposit? (binary: yes, no) -> column title changed from '***y***' to '***deposit***'
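For orientation, here is a minimal baseline sketch using the columns described above. The file name, separator, and model choice are assumptions, and 'duration' is dropped per the note in attribute 11:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('bank.csv')
df = df.drop(columns=['duration'])  # drop per the note above for a realistic model

# One-hot encode categoricals; target is the renamed 'deposit' column
X = pd.get_dummies(df.drop(columns=['deposit']))
y = (df['deposit'] == 'yes').astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))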
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
Column Description
patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address
**doctors.csv**
Details about the doctors working in the hospital.
Column Description
doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where doctor is based
email -> Official email address
**appointments.csv**
Records of scheduled and completed patient appointments.
Column Description
appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)
**treatments.csv**
Information about the treatments given during appointments.
Column Description
treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given
**billing.csv**
Billing and payment details for treatments.
Column Description
bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
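As a quick illustration of the relational structure described above, here is a minimal pandas sketch that joins four of the tables to total billed amounts per patient. Column names follow the descriptions above; everything else is illustrative:
import pandas as pd

patients = pd.read_csv('patients.csv')
appointments = pd.read_csv('appointments.csv')
treatments = pd.read_csv('treatments.csv')
billing = pd.read_csv('billing.csv')

# Walk the keys appointments -> treatments -> billing, then attach patients;
# suffixes avoid a clash on the patient_id column that billing also carries
merged = (appointments
          .merge(treatments, on='appointment_id')
          .merge(billing, on='treatment_id', suffixes=('', '_bill'))
          .merge(patients, on='patient_id'))

# Total billed amount per patient
totals = merged.groupby(['patient_id', 'first_name', 'last_name'])['amount'].sum()
print(totals.sort_values(ascending=False).head())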
Please note that:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
https://creativecommons.org/publicdomain/zero/1.0/
Thanks to Miroslav Zoricak and https://www.kaggle.com/mirosval/personal-cars-classifieds
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🔍 Dataset Overview
Each patient in the dataset has 30 days of continuous health data. The goal is to predict if a patient will progress to a critical condition based on their vital signs, medication adherence, and symptoms recorded daily.
There are 10 columns in the dataset:
patient_id: Unique identifier for each patient.
day: Day number (from 1 to 30) indicating sequential daily records.
bp_systolic: Systolic blood pressure (top number) in mm Hg. Higher values may indicate hypertension.
bp_diastolic: Diastolic blood pressure (bottom number) in mm Hg.
heart_rate: Heartbeats per minute. An elevated heart rate can signal stress, infection, or deterioration.
respiratory_rate: Breaths per minute. Elevated rates can indicate respiratory distress.
temperature: Body temperature in °F. Fever or hypothermia are signs of infection or inflammation.
oxygen_saturation: Percentage of oxygen in blood. Lower values (< 94%) are concerning.
med_adherence: Patient’s medication adherence (between 0 and 1). Lower values may contribute to worsening.
symptom_severity: Subjective symptom rating (scale of 1–10). Higher means worse condition.
progressed_to_critical: Target label: 1 if the patient deteriorated to a critical condition, else 0.
🎯 Final Task (Prediction Objective)
Problem Type: Binary classification with time-series data.
Goal: Train deep learning models (RNN, LSTM, GRU) to learn temporal patterns from a patient's 30-day health history and predict whether the patient will progress to a critical condition.
📈 How the Data is Used for Modeling
Input: A 3D array shaped as (num_patients, 30, 8), where 30 = number of days (timesteps) and 8 = features per day (excluding ID, day, and target).
Output: A binary label for each patient (0 or 1).
🔄 Feature Contribution to Prediction
bp_systolic/dia: Persistently high or rising BP may signal stress, cardiac issues, or deterioration.
heart_rate: A rising heart rate can indicate fever, infection, or organ distress.
respiratory_rate: Often increases early in critical illnesses like sepsis or COVID.
temperature: Fever is a key sign of infection. Chronic low/high temp may indicate underlying pathology.
oxygen_saturation: A declining oxygen level is a strong predictor of respiratory failure.
med_adherence: Poor medication adherence is often linked to worsening chronic conditions.
symptom_severity: Patient-reported worsening symptoms may precede measurable physiological changes.
🛠 Tools You’ll Use
Data processing: Pandas, NumPy, Scikit-learn
Time series modeling: Keras (using SimpleRNN, LSTM, GRU)
Evaluation: Accuracy, Loss, ROC Curve (optional)
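A minimal Keras sketch of this setup follows; the file name is hypothetical, and it assumes each patient has exactly 30 daily rows as described above:
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

df = pd.read_csv('patient_health_data.csv')  # hypothetical file name
features = ['bp_systolic', 'bp_diastolic', 'heart_rate', 'respiratory_rate',
            'temperature', 'oxygen_saturation', 'med_adherence', 'symptom_severity']

# Sort so each patient's 30 days are contiguous, then reshape to (patients, 30, 8)
df = df.sort_values(['patient_id', 'day'])
X = df[features].to_numpy().reshape(-1, 30, 8)
y = df.groupby('patient_id', sort=True)['progressed_to_critical'].first().to_numpy()

model = Sequential([
    LSTM(32, input_shape=(30, 8)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)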
DESCRIPTION
Comcast is an American global telecommunications company. The firm has been providing terrible customer service and continues to fall short despite repeated promises to improve. Only last month (October 2016), the authority fined them $2.3 million after receiving over 1,000 consumer complaints. The existing database will serve as a repository of public customer complaints filed against Comcast and will help pin down what is wrong with Comcast's customer service.
Data Dictionary
Analysis Task
To perform these tasks, you can use any of the different Python libraries such as NumPy, SciPy, Pandas, scikit-learn, matplotlib, and BeautifulSoup.
- Which complaint types are most frequent, i.e., around internet, network issues, or other domains?
- Create a new categorical variable with values Open and Closed: Open & Pending are categorized as Open, and Closed & Solved are categorized as Closed.
- Provide a state-wise status of complaints in a stacked bar chart, using the categorized variable from the previous task (see the sketch below). Provide insights on:
  - Which state has the maximum complaints
  - Which state has the highest percentage of unresolved complaints
- Provide the percentage of complaints resolved till date that were received through the Internet and customer care calls.
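A minimal pandas sketch of the categorization and stacked-bar tasks above; the file name and the 'Status'/'State' column names are assumptions:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Comcast.csv')  # hypothetical file name

# Collapse the four raw statuses into Open/Closed
df['Status_cat'] = df['Status'].replace(
    {'Open': 'Open', 'Pending': 'Open', 'Closed': 'Closed', 'Solved': 'Closed'})

# State-wise stacked bar chart of complaint status
state_status = df.groupby(['State', 'Status_cat']).size().unstack(fill_value=0)
state_status.plot(kind='bar', stacked=True, figsize=(14, 6))
plt.ylabel('Number of complaints')
plt.show()

# Share of unresolved (Open) complaints per state
print((state_status['Open'] / state_status.sum(axis=1)).sort_values(ascending=False).head())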
https://creativecommons.org/publicdomain/zero/1.0/
The provided code processes a Tajweed dataset, which appears to be a collection of audio recordings categorized by different Tajweed rules (Ikhfa, Izhar, Idgham, Iqlab). Let's break down the dataset's structure and the code's functionality:
Dataset Structure:
Code Functionality:
Initialization and Imports: The code begins with necessary imports (pandas, pydub) and mounts Google Drive. Pydub is used for audio file format conversion.
Directory Listing: It initially checks if a specified directory exists (for example, Alaa_alhsri/Ikhfa) and lists its files, demonstrating basic file system access.
Metadata Creation: The core of the script is the generation of metadata, which provides essential information about each audio file. The tajweed_paths dictionary maps each Tajweed rule to a list of paths, associating each path with the reciter's name. The metadata fields are:
global_id: A unique identifier for each audio file.
original_filename: The original filename of the audio file.
new_filename: A standardized filename that incorporates the Tajweed rule (label), sheikh's ID, audio number, and a global ID.
label: The Tajweed rule.
sheikh_id: A numerical identifier for each sheikh.
sheikh_name: The name of the reciter.
audio_number: A sequential number for the audio files within a specific sheikh and Tajweed rule combination.
original_path: Full path to the original audio file.
new_path: Full path to the intended location for the renamed and potentially converted audio file.
File Renaming and Conversion: Each audio file is renamed to its new_filename, stored in the designated directory, and converted to .wav format, creating standardized files in a new output_dataset directory. The new filenames are based on the rule, the sheikh, and a counter.
Metadata Export: Finally, the compiled metadata is saved as a CSV file (metadata.csv) in the output directory. This CSV file is crucial for training any machine learning model using this data.
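The conversion step described above can be sketched with pydub as follows. This is a minimal illustrative example: the input path, naming pattern, and output directory are assumptions, and pydub requires ffmpeg for non-WAV inputs:
import os
from pydub import AudioSegment

src = 'Alaa_alhsri/Ikhfa/example.mp3'  # hypothetical input file
dst_dir = 'output_dataset'
os.makedirs(dst_dir, exist_ok=True)

audio = AudioSegment.from_file(src)  # pydub auto-detects the input format
new_filename = 'Ikhfa_sheikh1_audio1_id0001.wav'  # illustrative rule/sheikh/number/ID pattern
audio.export(os.path.join(dst_dir, new_filename), format='wav')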
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This is a dataset that contains all commit messages and their related metadata from 34 popular GitHub repositories. These repositories are:
Data as of Wed Apr 21 03:42:44 PM IST 2021
Image credits: Unsplash - plhnk
Use this dataset to install RAPIDS in Kaggle notebooks. Installation takes 1 minute. Add the following lines of code to your notebook and turn GPU on. Change rapids.21.06 below to the desired version. (Currently v21.06, v0.19, v0.18 and v0.17 are available.)
import sys
!cp ../input/rapids/rapids.21.06 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/
Read more about RAPIDS here. The RAPIDS libraries allow us to perform all our data science on GPUs including reading data, transforming data, modeling, validation, and prediction. The package cuDF provides Pandas functionality and cuML provides Scikit-learn functionality. Other packages provide additional tools.
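For orientation, here is a minimal cuDF sketch to try after running the install cell above; the file path and column name are hypothetical:
import cudf

gdf = cudf.read_csv('../input/train.csv')  # hypothetical path; parsed on the GPU
print(gdf.groupby('target').size())        # pandas-like API; 'target' is a hypothetical column
pdf = gdf.to_pandas()                      # move back to CPU when a pandas-only library is needed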
Since GPUs are faster than CPUs, we save time, save money, and can increase model accuracy by performing additional tasks like hyperparameter searches, feature engineering and selection, data augmentation, and ensembling with bagging and boosting.
Coursera is an online platform for higher education. The Coursera Global Skills Index (GSI) draws upon this rich data to benchmark 60 countries and 10 industries across Business, Technology, and Data Science skills to reveal skills development trends around the world.
Coursera measured the skill proficiency of countries in AI overall and in the related skills of math, machine learning, statistics, statistical programming, and software engineering. These related skills cover the breadth of knowledge needed to build and deploy AI-powered technologies within organizations and society:
• **Math**: the theoretical background necessary to conduct and apply AI research
• **Statistics**: empirical skills needed to fit and measure the impact of AI models
• **Machine Learning**: skills needed to build self-learning models like deep learning and other supervised models that power most AI applications today
• **Statistical Programming**: programming skills needed to implement AI models, such as in Python and related packages like scikit-learn and pandas
• **Software Engineering**: programming skills needed to design and scale AI-powered applications
https://creativecommons.org/publicdomain/zero/1.0/
This is a dataset created as part of a tutorial on basic web scraping. Please visit One for the Road for the tutorial!
The top 100 Canadian beers as ranked by visitors to BeerAdvocate.com. This dataset is intended only to help users learn how to scrape web data using BeautifulSoup and turn it into a Pandas dataframe.
This dataset lists the top 100 Canadian beers:
Thanks to all the readers and contributors of BeerAdvocate, selflessly pouring, drinking, and reviewing beers for our benefit.
Version 2 of this dataset was scraped on 5/08/2017 from https://www.beeradvocate.com/lists/ca/
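A minimal scraping sketch in the spirit of the tutorial; the page's HTML structure is an assumption here and may have changed since 2017:
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get('https://www.beeradvocate.com/lists/ca/').text
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):  # assumption: one beer per table row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows)
print(df.head())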
https://creativecommons.org/publicdomain/zero/1.0/
Do you know what is common among Kung Fu Panda, Alvin and the Chipmunks, Monster Trucks, Trolls, The SpongeBob Movie, and Monsters vs. Aliens? They were all scripted by the same authors - Jonathan Aibel and Glenn Berger.
Kung Fu Panda is a 2008 animated movie by DreamWorks. It has made $631 million, and it's one of DreamWorks' most successful films at the box office.
There is much talk and discussion about this movie beyond cinema-goers. Some like to learn leadership lessons from it, and a few others try to link it with Christianity, Taoism, Mysticism, and Islam.
I was wondering if we can see the script from data science perspective and can answer some of the questions with significant implications in movie and other industries.
I welcome you all to do Data Science Martial Arts with Kung Fu Panda and see who survives.
It’s a complete script of Kung Fu Panda 1 and 2 in CSV format with all background narrations, scene settings, and movie dialogues by characters (Po, Master Shifu, Tai Lung, Tigress, Monkey, Viper, Oogway, Mr. Ping, Mantis and Crane).
Kung Fu Panda is a production by DreamWorks Studios. All scripts were gathered from online public sources like this and this.
Some ideas worth exploring:
• Can we train a neural network to recognize the character by dialogue? For example, given any line from the script, the algorithm should tell who is more likely to say it in the movie.
• Can we make a word cloud for each character (and perhaps compare it with other movie characters by the same authors and see who is similar to whom)?
• Can we train a chatbot for Oogway or Po so kids can talk to it and it would respond the same way as Oogway or Po would?
• Can we calculate the average length of dialogue? (See the sketch below.)
• Can we estimate the difficulty level of the vocabulary being used and perhaps compare it with movies of other genres?
• Can we compare the script with some religious texts and find similarities?
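A minimal pandas sketch for the average-dialogue-length idea above; the file and column names are assumptions:
import pandas as pd

df = pd.read_csv('kung_fu_panda.csv')  # hypothetical file name

# Average number of words per line, by character ('character' and 'dialogue' are assumed columns)
df['n_words'] = df['dialogue'].str.split().str.len()
print(df.groupby('character')['n_words'].mean().sort_values(ascending=False))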