67 datasets found
  1. Lending Club Loan Data Analysis - Deep Learning

    • kaggle.com
    Updated Aug 9, 2023
    Cite
    Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deependra Verma
    Description


    Create a model that predicts whether or not a loan will default, using the historical data.

    Problem Statement:

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features, which makes this problem more challenging.

    Domain: Finance

    Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

    Content:

    Dataset columns and definition:

    credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

    purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

    int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

    installment: The monthly installments owed by the borrower if the loan is funded.

    log.annual.inc: The natural log of the self-reported annual income of the borrower.

    dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

    fico: The FICO credit score of the borrower.

    days.with.cr.line: The number of days the borrower has had a credit line.

    revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

    revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

    inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

    delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

    pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
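    Two of these encodings can be sanity-checked with a short pandas sketch; the rows below are invented, not from the dataset:

```python
import numpy as np
import pandas as pd

# Invented example rows illustrating how two of the columns are encoded.
df = pd.DataFrame({
    "int.rate": [0.11, 0.14],                          # 11% and 14%, stored as proportions
    "log.annual.inc": [np.log(60000), np.log(85000)],  # natural log of annual income
})

# Recover the self-reported annual income from its natural log.
df["annual.inc"] = np.exp(df["log.annual.inc"])

# Express the interest rate as a percentage.
df["int.rate.pct"] = df["int.rate"] * 100
```

    Here `annual.inc` comes back as (approximately) 60000 and 85000, and `int.rate.pct` as 11 and 14.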

    Steps to perform:

    Perform exploratory data analysis and feature engineering, then build a deep learning model to predict whether or not a loan will default using the historical data.

    Tasks:

    1. Feature Transformation

    Transform categorical values into numerical (discrete) values.

    2. Exploratory Data Analysis

    Explore the different factors of the dataset.

    3. Additional Feature Engineering

    You will check the correlation between features and drop those features which have a strong correlation.

    This will help reduce the number of features and will leave you with the most relevant ones.

    4. Modeling

    After applying EDA and feature engineering, you are now ready to build the predictive model.

    In this part, you will create a deep learning model using Keras with the TensorFlow backend.
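    The correlation-based feature dropping described in the tasks above can be sketched with pandas; the frame and column names here are invented for illustration, not taken from the real dataset:

```python
import numpy as np
import pandas as pd

# Invented toy frame: "int_rate" is built to correlate strongly with "fico",
# while "dti" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "fico": a,
    "int_rate": -0.95 * a + 0.05 * rng.normal(size=200),
    "dti": rng.normal(size=200),
})

# Keep only the upper triangle of the absolute correlation matrix,
# so every pair of features is examined exactly once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair whose |correlation| exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)  # ['int_rate']
```

    The 0.9 threshold is a common rule of thumb, not something the project statement prescribes.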

  2. Google Analytics Capstone Project

    • kaggle.com
    Updated Dec 26, 2023
    Cite
    Frederic Xiong (2023). Google Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/fredericxiong/google-analytics-capstone-project
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frederic Xiong
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Created a dataset counting both the average and cumulative search interest in Generative AI and Large Language Models (respectively) across multiple regions, based on Google Trends, and created basic visuals for it through code.

  3. Spain 2019 Data for Data Science Project

    • kaggle.com
    Updated Dec 16, 2022
    Cite
    Juan Sebastian Moreno (2022). Spain 2019 Data for Data Science Project [Dataset]. https://www.kaggle.com/datasets/juansebastianmoreno/spain-2019-data-for-data-science-project/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Juan Sebastian Moreno
    Description

    Dataset

    This dataset was created by Juan Sebastian Moreno

    Contents

  4. Python for Data Science-Uber Drive Project

    • kaggle.com
    zip
    Updated May 26, 2021
    + more versions
    Cite
    Athisya Nadar (2021). Python for Data Science-Uber Drive Project [Dataset]. https://www.kaggle.com/athisyanadar/python-for-data-scienceuber-drive-project
    Explore at:
    zip (59869 bytes)
    Authors
    Athisya Nadar
    Description

    Dataset

    This dataset was created by Athisya Nadar

    Contents

    It contains the following files:

  5. EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Description

    ABSTRACT: In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely centralized and federated learning. Specifically, the proposed testbed is organized into seven layers: the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, the OPNFV platform, Hyperledger Sawtooth, digital twins, the ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from more than 10 types of IoT devices, such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, a water level detection sensor, a pH sensor meter, a soil moisture sensor, a heart rate sensor, a flame sensor, etc. We identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, information gathering, man-in-the-middle attacks, injection attacks, and malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from the 1176 features found. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is permitted after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his rights under copyright.

    The details of the Edge-IIoT dataset were published in the following paper. For academic/public use of this dataset, users must cite the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:

    from google.colab import files

    !pip install -q kaggle

    files.upload()

    !mkdir ~/.kaggle

    !cp kaggle.json ~/.kaggle/

    !chmod 600 ~/.kaggle/kaggle.json

    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"

    !unzip DNN-EdgeIIoT-dataset.csv.zip

    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the dataset's CSV file into a Pandas DataFrame:

    import pandas as pd

    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3: Exploring some of the DataFrame's contents:

    df.head(5)

    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (columns, duplicated rows, NaN, null, ...):

    from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                    "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp",
                    "http.request.uri.query", "tcp.options", "tcp.payload", "tcp.srcport",
                    "tcp.dstport", "udp.port", "mqtt.msg"]

    df.drop(drop_columns, axis=1, inplace=True)

    df.dropna(axis=0, how='any', inplace=True)

    df.drop_duplicates(subset=None, keep="first", inplace=True)

    df = shuffle(df)

    df.isna().sum()

    print(df['Attack_type'].value_counts())

    Step 5: Categorical data encoding (dummy encoding):

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn import preprocessing

    def encode_text_dummy(df, name):
        dummies = pd.get_dummies(df[name])
        for x in dummies.columns:
            dummy_name = f"{name}-{x}"
            df[dummy_name] = dummies[x]
        df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df,'http.request.method')

    encode_text_dummy(df,'http.referer')

    encode_text_dummy(df,"http.request.version")

    encode_text_dummy(df,"dns.qry.name.len")

    encode_text_dummy(df,"mqtt.conack.flags")

    encode_text_dummy(df,"mqtt.protoname")

    encode_text_dummy(df,"mqtt.topic")

    Step 6: Creation of the preprocessed dataset:

    df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
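    As a quick sanity check, the `encode_text_dummy` helper from Step 5 can be exercised on a tiny invented frame (the values below are made up):

```python
import pandas as pd

# Same dummy-encoding helper as defined in Step 5 above.
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Tiny invented frame with one categorical column.
df = pd.DataFrame({"mqtt.protoname": ["MQTT", "0", "MQTT"], "len": [10, 0, 12]})
encode_text_dummy(df, "mqtt.protoname")

print(list(df.columns))  # ['len', 'mqtt.protoname-0', 'mqtt.protoname-MQTT']
```

    Each category value becomes its own 0/1 column, and the original categorical column is dropped in place.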

    For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023

  6. Visualization of Eye-Tracking Scanpaths in Autism Spectrum Disorder: Image Dataset

    • figshare.com
    application/x-rar
    Updated May 30, 2023
    Cite
    Mahmoud Elbattah (2023). Visualization of Eye-Tracking Scanpaths in Autism Spectrum Disorder: Image Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.7073087.v1
    Explore at:
    application/x-rar
    Dataset provided by
    figshare
    Authors
    Mahmoud Elbattah
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We provide a dataset that includes visualizations of eye-tracking scanpaths, with a particular focus on Autism Spectrum Disorder (ASD). The key idea is to transform the dynamics of eye motion into visual patterns, so that diagnosis-related tasks can be approached using image analysis techniques. The image dataset is publicly available to be used by other studies aiming to explore the usability of eye tracking within the ASD context. It is believed that the dataset can allow for the development of further interesting applications using machine learning or image processing techniques. For more info, please refer to the publication below and the project website.

    Original Publication: Carette, R., Elbattah, M., Dequen, G., Guérin, J., & Cilia, F. (2019, February). Learning to predict autism spectrum disorder based on the visual patterns of eye-tracking scanpaths. In Proceedings of the 12th International Conference on Health Informatics (HEALTHINF 2019).

    Project Website:
    https://www.researchgate.net/project/Predicting-Autism-Spectrum-Disorder-Using-Machine-Learning-and-Eye-Tracking
    https://mahmoud-elbattah.github.io/ML4Autism/

  7. Archived Kickstarter Projects

    • kaggle.com
    Updated May 10, 2019
    Cite
    Ali Hilmi (2019). Archived Kickstarter Projects [Dataset]. https://www.kaggle.com/uysalah/archived-kickstarter-projects/activity
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Hilmi
    Description

    Context

    As a project manager and a keen data scientist, I was always curious about the factors behind a successful project. When I was a student at the New York City Data Science Academy, I finally had the chance to do research in this field.

    Content

    The contents of this CSV file are the output of a web-scraping project completed in May 2019 at NYCDSA. Since Kickstarter hides almost all completed projects from its website and search engine, I first scraped another website, Kicktraq, which keeps track of all completed projects on Kickstarter. Then I followed the link on Kicktraq that references the project's original Kickstarter page. The Kickstarter page provides some additional information regarding the pledge tiers, backers for each pledge tier, the full project description, the number of comments, updates, FAQs, etc. I finally combined all variables from both Kicktraq and Kickstarter for 8028 projects in one file.

    Inspiration

    Using this dataset, the research questions below can be investigated:

    1. What is the success/failure ratio for each U.S. State?
    2. What are the factors of a successful project?
    3. Is there a linear relationship between the variables?
    4. What kind of statistical model best describes the dataset?
    5. Given the variables, can you predict if a project will be successful?
  8. ‘Heart Disease Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Heart Disease Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-heart-disease-dataset-bab8/4f4113b0/?iid=016-185&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Heart Disease Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lykin22/heart-disease-dataset on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Overview

    The data science lifecycle is designed for big data issues and data science projects. Generally, a data science project consists of six steps: problem definition, data collection, data preparation, data exploration, data modelling, and model evaluation. In this project, I will go through these steps in order to build a heart disease classifier, using the UCI heart disease dataset.

    Description:

    This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). The names and social security numbers of the patients were recently removed from the database, replaced with dummy values. One file has been "processed", that one containing the Cleveland database. All four unprocessed files also exist in this directory. To see Test Costs (donated by Peter Turney), please see the folder "Costs"

    Dataset

    The dataset has 14 attributes:

    1. age: age in years
    2. sex: sex (1 = male; 0 = female)
    3. cp: chest pain type (Value 0: typical angina; Value 1: atypical angina; Value 2: non-anginal pain; Value 3: asymptomatic)
    4. trestbps: resting blood pressure in mm Hg on admission to the hospital
    5. chol: serum cholesterol in mg/dl
    6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    7. restecg: resting electrocardiographic results (Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: probable or definite left ventricular hypertrophy)
    8. thalach: maximum heart rate achieved
    9. exang: exercise induced angina (1 = yes; 0 = no)
    10. oldpeak: ST depression induced by exercise relative to rest
    11. slope: the slope of the peak exercise ST segment (Value 0: upsloping; Value 1: flat; Value 2: downsloping)
    12. ca: number of major vessels (0-3) colored by fluoroscopy
    13. thal: thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
    14. target: heart disease (1 = no, 2 = yes)
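    The common binarization described above (distinguishing presence, values 1-4, from absence, value 0) can be sketched with pandas; the column name and values below are invented for illustration:

```python
import pandas as pd

# Invented values for the 0-4 "goal" field described above.
df = pd.DataFrame({"goal": [0, 2, 1, 4, 0, 3]})

# Collapse the 0-4 scale into a binary label: 1 = presence, 0 = absence.
df["disease"] = (df["goal"] > 0).astype(int)

print(df["disease"].tolist())  # [0, 1, 1, 1, 0, 1]
```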

    If you find this dataset useful, please consider upvoting ❤️

    --- Original source retains full ownership of the source dataset ---

  9. ‘Coursera Course Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Coursera Course Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-coursera-course-dataset-839a/86aaffe7/?iid=003-724&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Coursera Course Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/siddharthm1698/coursera-course-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a dataset I generated during a hackathon for a project. I scraped the data from the official Coursera website. Our project aims to help any new learner find the right course to take by answering a few questions; it is an intelligent course recommendation system. Hence we had to scrape data from a few educational websites, and this is the data scraped from the Coursera website. For the project, visit: https://github.com/Siddharth1698/Coursu . Please do show your support by following us. I have just started to learn data science and hope this dataset will be helpful to someone for his/her personal purposes. The scraping code is here: https://github.com/Siddharth1698/Coursera-Course-Dataset . Article about the dataset generation: https://medium.com/analytics-vidhya/web-scraping-and-coursera-8db6af45d83f

    Content

    This dataset contains 6 main columns and 890 course records. The detailed description:

    1. course_title: the title of the course.
    2. course_organization: the organization conducting the course.
    3. course_Certificate_type: the different certification types available for the course.
    4. course_rating: the rating associated with each course.
    5. course_difficulty: the difficulty level of the course.
    6. course_students_enrolled: the number of students enrolled in the course.

    Inspiration

    This is just one of my first scraped datasets. Follow my GitHub for more: https://github.com/Siddharth1698

    --- Original source retains full ownership of the source dataset ---

  10. Top 250 Korean Dramas (KDrama) Dataset

    • opendatabay.com
    Updated Jun 8, 2025
    Cite
    Datasimple (2025). Top 250 Korean Dramas (KDrama) Dataset [Dataset]. https://www.opendatabay.com/data/consumer/da19780d-ee8b-428f-994b-cb432e9cd3ca
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset contains data from the top-ranked 250 Korean Dramas as per the MyDramaList website. The data has been collected and uploaded in the form of a CSV file and can be used to work on various Data Science Projects.

    The CSV file has 17 columns and 251 rows containing mostly textual data.

    Most of the data were collected from the MyDramaList website (https://mydramalist.com), and the data for the names of production companies were collected from Wikipedia (https://www.wikipedia.org). I wasn't sure how to scrape the data at the time, and hence I went all manual, copying and pasting the data using the cursor. (Yes, it was very tedious to manually copy and paste the data!)

    I was working on a content-based recommender system for Korean dramas and needed data to work with. The datasets available on Kaggle had only up to 100 K-drama titles. Not only that, but quite a few features deemed essential were also missing: synopsis, tags, director's name, cast names, production companies' names, and similar data weren't available in the pre-existing datasets.

    Original Data Source: Top 250 Korean Dramas (KDrama) Dataset

  11. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy; it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with high expectations, requiring applicants to possess comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. Whether you’re just starting out or have been in the field for a while, these tips will set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires a comprehensive level of technical preparation. Job seekers are often required to demonstrate their technical skills in order to show their ability to effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're proficient with data-manipulation tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in the use of SQL language to extract and process data from databases.

    5 Feature Engineering

    Understand and know the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.
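    As an illustration of this tip (the tip itself names no library; scikit-learn is an assumption here), the four metrics can be computed on invented labels like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Invented ground-truth labels and classifier predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.75
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```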

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

  12. Indonesia News Dataset (2024)

    • opendatabay.com
    Updated Jun 14, 2025
    Datasimple (2025). Indonesia News Dataset (2024) [Dataset]. https://www.opendatabay.com/data/consumer/71b802fc-33bb-466f-bd06-d0e415335f0b
    Dataset authored and provided by
    Datasimple
    Area covered
    Indonesia, Entertainment & Media Consumption
    Description

    📰 News Dataset: January 2024 - 5 September 2024

    Discover a comprehensive dataset comprising news articles from three prominent Indonesian outlets: Detik, Tempo, and Kompas. This dataset encapsulates the unfolding narratives and events from January 2024 to September 5th, 2024, offering a comprehensive view of the news landscape during this period.

    Each entry includes a title reflecting the essence of the news piece, a link directing to the original article, and the complete content offering in-depth insights. Additionally, categorization tags (Tag1-Tag5) accompany each article, facilitating easy sorting and analysis. This collection, licensed under CC BY-NC for EDUCATIONAL USE, not only serves as a valuable educational resource for data science projects but also supports detailed research on news trends.

    Original Data Source: Indonesia News Dataset (2024)

  13. Ramzi Project Data Analysis with R

    • kaggle.com
    Updated May 3, 2022
    Ramzi Arja 1998 (2022). Ramzi Project Data Analysis with R [Dataset]. https://www.kaggle.com/datasets/ramziarja1998/ramzi-project-data-analysis-with-r
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramzi Arja 1998
    Description

    As a data science student, and utilizing my skills in R, I created the following R Markdown to showcase my data analysis skills. Please feel free to run the entire script in your version of RStudio and read through the comments to get an idea of how I did my work. Your feedback is appreciated!

  14. Software Project Management Tools

    • kaggle.com
    Updated Sep 30, 2022
    PARV MODI (2022). Software Project Management Tools [Dataset]. https://www.kaggle.com/datasets/parvmodi/software-project-management-tools
    Dataset provided by
    Kaggle
    Authors
    PARV MODI
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Software Project Management Tool Recommendation System:

    This dataset contains details about the software project management tools selected by 90 students according to their preferences and convenience.

    Several analyses are possible with this dataset. The 90 students are split across 4 batches, so you can determine the most popular software project management tool within each batch.

    Similarly, you can find the best tool across all the batches, which students (and how many) chose the same tool, and much more!

    Future plans: I will incorporate other attributes of the tools, such as their cost and other unique features they offer.

    The recommendation system will then ask the user to select the features they want in a tool and, based on the selected features, recommend the best tool for them.

    Happy Learning PAM

  15. Analysis Bay Area Bike Share Udacity

    • kaggle.com
    Updated Nov 10, 2017
    Luiz Henrique Amorim (2017). Analysis Bay Area Bike Share Udacity [Dataset]. https://www.kaggle.com/luizoamorim/analysis-bay-area-bike-share-udacity/metadata
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luiz Henrique Amorim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    San Francisco Bay Area
    Description

    This dataset is from a Udacity data science course project: an analysis of open data from Bay Area Bike Share. Ford GoBike is the Bay Area's new bike share system, with thousands of public bikes for use across San Francisco, the East Bay and San Jose. The bike share is designed with convenience in mind; it's a fun and affordable way to get around town.

    The project covers the many analyses the course asked for, and I added two simple analyses of my own at the end. The first examines how rain influences trips and shows that rainy days reduce the number of trips. The second shows trip counts per weekday in San Francisco, as well as trip counts per subscriber type and per subscriber type for each weekday. Trips are fewer on weekends and more numerous on weekdays; the data also show that weekday trips are mostly made by annual subscribers, while weekend trips are mostly made by customers. I created some helper functions that may be useful to others, and the notebook includes many examples of Python and pandas usage.

    This was my first data science project. I continue to study and learn, and hope to improve more and more.

    Thanks.

  16. titanic_dataset

    • kaggle.com
    Updated Jun 7, 2024
    SURENDHAN (2024). titanic_dataset [Dataset]. https://www.kaggle.com/datasets/surendhan/titanic-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SURENDHAN
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The Titanic dataset on Kaggle is a well-known dataset used for machine learning and data science projects, especially for binary classification tasks. It includes data on the passengers of the Titanic, which sank on its maiden voyage in 1912. This dataset is often used to predict the likelihood of a passenger's survival based on various features. Here is a detailed description of the dataset:

    Overview The Titanic dataset includes information about the passengers on the Titanic, such as their demographic information, class, fare, and whether they survived the disaster. The goal is to predict the survival of the passengers.

    Files The dataset typically includes three files:

    - train.csv: The training set, which includes the features and the target variable (Survived).
    - test.csv: The test set, which includes the features but not the target variable. You use this file to make predictions that can be submitted to Kaggle.
    - gender_submission.csv: An example of a submission file in the correct format.

    Features

    The dataset contains the following columns:

    - PassengerId: Unique ID for each passenger.
    - Survived: Target variable (0 = No, 1 = Yes) indicating if the passenger survived.
    - Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
    - Name: Name of the passenger.
    - Sex: Gender of the passenger (male or female).
    - Age: Age of the passenger in years. Fractional values indicate age in months for infants.
    - SibSp: Number of siblings or spouses aboard the Titanic.
    - Parch: Number of parents or children aboard the Titanic.
    - Ticket: Ticket number.
    - Fare: Passenger fare.
    - Cabin: Cabin number.
    - Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
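    As a sketch of a first baseline on these columns, the rule mirrored by gender_submission.csv (predict survival for female passengers) takes only a few lines. Only the column names below come from the dataset; the passenger rows are made up.

```python
# A handful of invented rows using the dataset's column names.
passengers = [
    {"PassengerId": 1, "Sex": "male",   "Survived": 0},
    {"PassengerId": 2, "Sex": "female", "Survived": 1},
    {"PassengerId": 3, "Sex": "female", "Survived": 1},
    {"PassengerId": 4, "Sex": "male",   "Survived": 1},
]

# The gender baseline: predict 1 (survived) for female, 0 for male.
def predict(row):
    return 1 if row["Sex"] == "female" else 0

correct = sum(1 for p in passengers if predict(p) == p["Survived"])
accuracy = correct / len(passengers)
```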

  17. Construction/Project Management Report Examples

    • kaggle.com
    Updated Sep 16, 2021
    + more versions
    Clayton Miller (2021). Construction/Project Management Report Examples [Dataset]. https://www.kaggle.com/datasets/claytonmiller/construction-and-project-management-example-data/data
    Dataset provided by
    Kaggle
    Authors
    Clayton Miller
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Context

    Building construction projects generate huge amounts of data that can be leveraged to understand improvements in efficiency, cost savings, etc. There are several digital apps on the market that help construction project managers keep track of the details of the process.

    Content

    This is a simple data set from a number of construction sites, generated by project management field apps used for quality, safety and site management.

    Essentially, there are two files in this data set:
    - Forms – generated from checklists for quality/safety/site management
    - Tasks – action items typically used for quality snags/defects or safety issues

    Acknowledgements

    This data set was donated by Jason Rymer, a BIM Manager from Ireland, who was keen to see more construction-related data online for learning purposes.

    Inspiration

    The goal of this data set is to help construction industry professionals learn how to code and process data.

  18. EPX/USD Binance Historical Data for ANN

    • kaggle.com
    Updated May 21, 2024
    EMİRHAN BULUT (2024). EPX/USD Binance Historical Data for ANN [Dataset]. http://doi.org/10.34740/kaggle/dsv/8479299
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    EMİRHAN BULUT
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    About Dataset

    Context:

    This dataset provides comprehensive historical data for the EPX/USDT trading pair on Binance, dating from November 21, 2021, to May 21, 2024. It is particularly curated for facilitating advanced predictive analytics and machine learning projects, especially in the field of financial time series forecasting.

    Sources:

    The data was meticulously sourced from investing.com, a reliable platform for financial information and data analytics. It captures critical daily trading metrics, including the opening, closing, highest, and lowest prices, along with daily trading volume and percentage changes. This rich dataset is integral for constructing robust models that can predict future trading behaviors and trends.

    Inspiration:

    With a background in artificial intelligence and financial modeling, I have embarked on a project to predict the future prices of EPX/USDT using advanced neural network architectures. This project aims to leverage the power of several cutting-edge algorithms to create a robust forecasting backbone, combining:

    • Gated Recurrent Units (GRU): Employed to capture the complexities of sequential data while efficiently handling long-term dependencies.

    • Long Short-Term Memory (LSTM): Utilized to overcome the vanishing gradient problem, ensuring the model remembers essential patterns over extended periods.

    • Recurrent Neural Networks (RNN): Applied to process sequences of trading data, retaining the temporal dynamics and dependencies inherent in time series data.

    • Transformers: Integrated to benefit from their ability to handle both local and global dependencies in data, ensuring more accurate and contextually aware predictions.

    The synergy of these algorithms aims to forge a resilient and accurate predictive model, capable of anticipating price movements and trends for the month of June 2024. This project showcases the potential of deploying hybrid neural network architectures for tackling real-world financial forecasting challenges.

    Usage:

    Users can utilize this dataset to:

    • Conduct time series analysis and predictive modeling.

    • Train and evaluate various machine learning and deep learning models.

    • Develop custom financial forecasting tools and algorithms.

    • Enhance their understanding of cryptocurrency trading patterns and dynamics.

    With this dataset, the financial forecasting community can explore novel modeling techniques and validate their approaches against real-world data, contributing to the development of more precise and reliable predictive models.
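    As one example, the sliding-window preparation that recurrent models such as GRU and LSTM rely on can be sketched as follows; the closing prices here are invented, not taken from the dataset.

```python
def make_windows(series, window, horizon=1):
    """Pair each fixed-size window of past values with the value
    `horizon` steps ahead, which becomes the training target."""
    xs, ys = [], []
    for i in range(len(series) - window - horizon + 1):
        xs.append(series[i:i + window])
        ys.append(series[i + window + horizon - 1])
    return xs, ys

closes = [0.10, 0.12, 0.11, 0.13, 0.14, 0.12]  # made-up daily closes
x, y = make_windows(closes, window=3)
```

    Each (window, target) pair then becomes one training example for the sequence model.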

    Conclusion:

    This dataset not only serves as a vital resource for academic and professional research but also stands as a testament to the power of innovative neural network architectures in the realm of financial forecasting. Whether you are a novice data scientist eager to explore time series data or a seasoned researcher looking to refine your models, this dataset offers a valuable foundation for your endeavors.

  19. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analysis of large BigQuery datasets.
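    A minimal sketch of that query pattern. The table path follows the text above; `sample_repos` is just one example table, and the BigQuery client is passed in rather than constructed here so the snippet runs without GCP credentials.

```python
# Table path pattern: bigquery-public-data.github_repos.[TABLENAME]
TABLE = "bigquery-public-data.github_repos.sample_repos"

# Keep scans cheap on a 3TB+ dataset: select only needed columns and LIMIT.
QUERY = (
    f"SELECT repo_name, watch_count FROM `{TABLE}` "
    "ORDER BY watch_count DESC LIMIT 10"
)

def top_repos(client):
    """Run the query with a google.cloud.bigquery.Client supplied by
    the caller, so this sketch imports nothing GCP-specific itself."""
    return list(client.query(QUERY).result())
```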

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  20. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Authors
    Ajay
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners start with curated sets of data, but it's a well-known fact that in a real data science project most of the time is spent on collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the cars they want to sell, and then trained a deep neural network to identify the model of a car from its image. In my search I found that approximately 10 percent of the pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

    Content

    There are 4000 images of two popular Maruti Suzuki cars in India (Swift and WagonR), with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images and a test set with 800 images. The data was randomized before being split into training, validation and test sets.
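    The shuffle-then-split step described above can be sketched like this; the file names are placeholders, not the dataset's actual image names.

```python
import random

# Placeholder names standing in for the 4000 car images.
paths = [f"img_{i:04d}.jpg" for i in range(4000)]

random.seed(0)         # fixed seed so the split is reproducible
random.shuffle(paths)  # randomize before splitting, as described

# 2400 / 800 / 800, matching the dataset's split.
train, val, test = paths[:2400], paths[2400:3200], paths[3200:]
```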

    A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced image-classification techniques in PyTorch and Keras, such as data augmentation, dropout, batch normalization and transfer learning.

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we do better, and by how much?

    2. Is the data collected for the two car models representative of all such cars across the country, or is there sample bias?

    3. I would also like someone to extend the concept so that if a user uploads an incorrect car picture, the ML model automatically flags it, for example when the user uploads the wrong model or an image that is not a car at all.

Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning

Lending Club Loan Data Analysis - Deep Learning

Lending Club Loan Data Analysis - Deep Learning - AI Capstone Project

Dataset updated
Aug 9, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Deependra Verma
Description

DESCRIPTION

Create a model that predicts whether or not a loan will default, using the historical data.

Problem Statement:

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features that make the problem more challenging.

Domain: Finance

Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

Content:

Dataset columns and definition:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

installment: The monthly installments owed by the borrower if the loan is funded.

log.annual.inc: The natural log of the self-reported annual income of the borrower.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

fico: The FICO credit score of the borrower.

days.with.cr.line: The number of days the borrower has had a credit line.

revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Steps to perform:

Perform exploratory data analysis and feature engineering, then build a deep learning model to predict whether or not a loan will default, using the historical data.

Tasks:

  1. Feature Transformation

Transform categorical values into numerical values (discrete)
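A minimal sketch of such a transformation: one-hot encoding the `purpose` column using the category values listed in the dataset description.

```python
# The categorical `purpose` values from the dataset description.
PURPOSES = ["credit_card", "debt_consolidation", "educational",
            "major_purchase", "small_business", "all_other"]

def one_hot(purpose):
    """Map one categorical value to a 0/1 indicator vector."""
    return [1 if purpose == p else 0 for p in PURPOSES]

vec = one_hot("small_business")
```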

  2. Exploratory data analysis of different factors of the dataset.

  3. Additional Feature Engineering

You will check the correlation between features and drop those that are strongly correlated with one another.

This will help reduce the number of features and leave you with the most relevant ones.
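A sketch of correlation-based feature dropping in plain Python; the feature values and the 0.95 threshold are invented for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical columns: revol_util tracks revol_bal almost exactly.
features = {
    "int_rate":   [0.11, 0.12, 0.10, 0.14, 0.13],
    "revol_bal":  [1000, 2000, 1500, 4000, 3500],
    "revol_util": [0.10, 0.20, 0.15, 0.40, 0.35],
}

# Drop the second feature of each strongly correlated pair.
THRESHOLD = 0.95
names = list(features)
dropped = set()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if a not in dropped and b not in dropped:
            if abs(pearson(features[a], features[b])) > THRESHOLD:
                dropped.add(b)

kept = [n for n in names if n not in dropped]
```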

  4. Modeling

After applying EDA and feature engineering, you are now ready to build the predictive models

In this part, you will create a deep learning model using Keras with a TensorFlow backend.
