67 datasets found
  1. Lending Club Loan Data Analysis - Deep Learning

    • kaggle.com
    Updated Aug 9, 2023
    Cite
    Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deependra Verma
    Description


    Create a model that predicts whether or not a loan will default, using the historical data.

    Problem Statement:

    For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features, which makes this problem more challenging.

    Domain: Finance

    Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

    Content:

    Dataset columns and definition:

    credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

    purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

    int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

    installment: The monthly installments owed by the borrower if the loan is funded.

    log.annual.inc: The natural log of the self-reported annual income of the borrower.

    dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

    fico: The FICO credit score of the borrower.

    days.with.cr.line: The number of days the borrower has had a credit line.

    revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

    revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

    inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

    delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

    pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
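    Two of these encodings can be sanity-checked with a short pandas sketch; the rows below are invented, not from the dataset:

```python
import numpy as np
import pandas as pd

# Invented example rows illustrating how two of the columns are encoded.
df = pd.DataFrame({
    "int.rate": [0.11, 0.14],                          # 11% and 14%, stored as proportions
    "log.annual.inc": [np.log(60000), np.log(85000)],  # natural log of annual income
})

# Recover the self-reported annual income from its natural log.
df["annual.inc"] = np.exp(df["log.annual.inc"])

# Express the interest rate as a percentage.
df["int.rate.pct"] = df["int.rate"] * 100
```

    Here `annual.inc` comes back as (approximately) 60000 and 85000, and `int.rate.pct` as 11 and 14.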

    Steps to perform:

    Perform exploratory data analysis and feature engineering, then build a deep learning model to predict whether or not a loan will default using the historical data.

    Tasks:

    1. Feature Transformation

    Transform categorical values into numerical (discrete) values.

    2. Exploratory Data Analysis

    Explore the different factors of the dataset.

    3. Additional Feature Engineering

    You will check the correlation between features and drop those features which have a strong correlation.

    This will help reduce the number of features and will leave you with the most relevant ones.

    4. Modeling

    After applying EDA and feature engineering, you are now ready to build the predictive model.

    In this part, you will create a deep learning model using Keras with the TensorFlow backend.
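    The correlation-based feature dropping described in the tasks above can be sketched with pandas; the frame and column names here are invented for illustration, not taken from the real dataset:

```python
import numpy as np
import pandas as pd

# Invented toy frame: "int_rate" is built to correlate strongly with "fico",
# while "dti" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "fico": a,
    "int_rate": -0.95 * a + 0.05 * rng.normal(size=200),
    "dti": rng.normal(size=200),
})

# Keep only the upper triangle of the absolute correlation matrix,
# so every pair of features is examined exactly once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair whose |correlation| exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)  # ['int_rate']
```

    The 0.9 threshold is a common rule of thumb, not something the project statement prescribes.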

  2. Google Analytics Capstone Project

    • kaggle.com
    Updated Dec 26, 2023
    Cite
    Frederic Xiong (2023). Google Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/fredericxiong/google-analytics-capstone-project
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frederic Xiong
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Created a dataset counting both the average and cumulative search interest in Generative AI and Large Language Models (respectively) across multiple regions, based on Google Trends, and created basic visuals for it through code.

  3. Spain 2019 Data for Data Science Project

    • kaggle.com
    Updated Dec 16, 2022
    Cite
    Juan Sebastian Moreno (2022). Spain 2019 Data for Data Science Project [Dataset]. https://www.kaggle.com/datasets/juansebastianmoreno/spain-2019-data-for-data-science-project/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Juan Sebastian Moreno
    Description

    Dataset

    This dataset was created by Juan Sebastian Moreno

    Contents

  4. Python for Data Science-Uber Drive Project

    • kaggle.com
    zip
    Updated May 26, 2021
    + more versions
    Cite
    Athisya Nadar (2021). Python for Data Science-Uber Drive Project [Dataset]. https://www.kaggle.com/athisyanadar/python-for-data-scienceuber-drive-project
    Explore at:
    zip (59869 bytes)
    Authors
    Athisya Nadar
    Description

    Dataset

    This dataset was created by Athisya Nadar

    Contents

    It contains the following files:

  5. EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Description

    ABSTRACT: In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely centralized and federated learning. Specifically, the proposed testbed is organized into seven layers: the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, the OPNFV platform, Hyperledger Sawtooth, digital twins, the ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from more than 10 types of IoT devices, such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, a water level detection sensor, a pH sensor meter, a soil moisture sensor, a heart rate sensor, a flame sensor, etc. We identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats: DoS/DDoS attacks, information gathering, man-in-the-middle attacks, injection attacks, and malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from the 1176 features found. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is permitted after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his rights under copyright.

    The details of the Edge-IIoT dataset were published in the following paper. For academic/public use of this dataset, users must cite the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform:

    from google.colab import files

    !pip install -q kaggle

    files.upload()

    !mkdir ~/.kaggle

    !cp kaggle.json ~/.kaggle/

    !chmod 600 ~/.kaggle/kaggle.json

    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"

    !unzip DNN-EdgeIIoT-dataset.csv.zip

    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the dataset's CSV file into a Pandas DataFrame:

    import pandas as pd

    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3: Exploring some of the DataFrame's contents:

    df.head(5)

    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (columns, duplicated rows, NaN, null, ...):

    from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4", "arp.dst.proto_ipv4",
                    "http.file_data", "http.request.full_uri", "icmp.transmit_timestamp",
                    "http.request.uri.query", "tcp.options", "tcp.payload", "tcp.srcport",
                    "tcp.dstport", "udp.port", "mqtt.msg"]

    df.drop(drop_columns, axis=1, inplace=True)

    df.dropna(axis=0, how='any', inplace=True)

    df.drop_duplicates(subset=None, keep="first", inplace=True)

    df = shuffle(df)

    df.isna().sum()

    print(df['Attack_type'].value_counts())

    Step 5: Categorical data encoding (dummy encoding):

    import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn import preprocessing

    def encode_text_dummy(df, name):
        dummies = pd.get_dummies(df[name])
        for x in dummies.columns:
            dummy_name = f"{name}-{x}"
            df[dummy_name] = dummies[x]
        df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df,'http.request.method')

    encode_text_dummy(df,'http.referer')

    encode_text_dummy(df,"http.request.version")

    encode_text_dummy(df,"dns.qry.name.len")

    encode_text_dummy(df,"mqtt.conack.flags")

    encode_text_dummy(df,"mqtt.protoname")

    encode_text_dummy(df,"mqtt.topic")

    Step 6: Creation of the preprocessed dataset:

    df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
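    As a quick sanity check, the `encode_text_dummy` helper from Step 5 can be exercised on a tiny invented frame (the values below are made up):

```python
import pandas as pd

# Same dummy-encoding helper as defined in Step 5 above.
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Tiny invented frame with one categorical column.
df = pd.DataFrame({"mqtt.protoname": ["MQTT", "0", "MQTT"], "len": [10, 0, 12]})
encode_text_dummy(df, "mqtt.protoname")

print(list(df.columns))  # ['len', 'mqtt.protoname-0', 'mqtt.protoname-MQTT']
```

    Each category value becomes its own 0/1 column, and the original categorical column is dropped in place.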

    For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023

  6. Visualization of Eye-Tracking Scanpaths in Autism Spectrum Disorder: Image Dataset

    • figshare.com
    application/x-rar
    Updated May 30, 2023
    Cite
    Mahmoud Elbattah (2023). Visualization of Eye-Tracking Scanpaths in Autism Spectrum Disorder: Image Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.7073087.v1
    Explore at:
    application/x-rar
    Dataset provided by
    figshare
    Authors
    Mahmoud Elbattah
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We provide a dataset that includes visualizations of eye-tracking scanpaths, with a particular focus on Autism Spectrum Disorder (ASD). The key idea is to transform the dynamics of eye motion into visual patterns, so that diagnosis-related tasks can be approached using image analysis techniques. The image dataset is publicly available to be used by other studies aiming to explore the usability of eye tracking within the ASD context. It is believed that the dataset can allow for the development of further interesting applications using machine learning or image processing techniques. For more info, please refer to the publication below and the project website.

    Original Publication: Carette, R., Elbattah, M., Dequen, G., Guérin, J., & Cilia, F. (2019, February). Learning to predict autism spectrum disorder based on the visual patterns of eye-tracking scanpaths. In Proceedings of the 12th International Conference on Health Informatics (HEALTHINF 2019).

    Project Website:
    https://www.researchgate.net/project/Predicting-Autism-Spectrum-Disorder-Using-Machine-Learning-and-Eye-Tracking
    https://mahmoud-elbattah.github.io/ML4Autism/

  7. Archived Kickstarter Projects

    • kaggle.com
    Updated May 10, 2019
    Cite
    Ali Hilmi (2019). Archived Kickstarter Projects [Dataset]. https://www.kaggle.com/uysalah/archived-kickstarter-projects/activity
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Hilmi
    Description

    Context

    As a project manager and a keen data scientist, I was always curious about the factors behind a successful project. When I was a student at the New York City Data Science Academy, I finally had the chance to do research in this field.

    Content

    The contents of this CSV file are the output of a web-scraping project completed in May 2019 at NYCDSA. Since Kickstarter hides almost all completed projects from its website and search engine, I first scraped another website, Kicktraq, which keeps track of all completed projects on Kickstarter. Then I followed the link on Kicktraq that references the project's original Kickstarter page. The Kickstarter page provides some additional information regarding the pledge tiers, backers for each pledge tier, the full project description, the number of comments, updates, FAQs, etc. I finally combined all variables from both Kicktraq and Kickstarter for 8028 projects in one file.

    Inspiration

    Using this dataset, the research questions below can be investigated:

    1. What is the success/failure ratio for each U.S. State?
    2. What are the factors of a successful project?
    3. Is there a linear relationship between the variables?
    4. What kind of statistical model best describes the dataset?
    5. Given the variables, can you predict if a project will be successful?
  8. ‘Heart Disease Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Heart Disease Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-heart-disease-dataset-bab8/4f4113b0/?iid=016-185&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Heart Disease Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lykin22/heart-disease-dataset on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Overview

    The data science lifecycle is designed for big data issues and data science projects. Generally, a data science project consists of six steps: problem definition, data collection, data preparation, data exploration, data modelling, and model evaluation. In this project, I will go through these steps in order to build a heart disease classifier, using the UCI heart disease dataset.

    Description:

    This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0). The names and social security numbers of the patients were recently removed from the database, replaced with dummy values. One file has been "processed", that one containing the Cleveland database. All four unprocessed files also exist in this directory. To see Test Costs (donated by Peter Turney), please see the folder "Costs"

    Dataset

    The dataset has 14 attributes:

    1. age: age in years
    2. sex: sex (1 = male; 0 = female)
    3. cp: chest pain type (Value 0: typical angina; Value 1: atypical angina; Value 2: non-anginal pain; Value 3: asymptomatic)
    4. trestbps: resting blood pressure in mm Hg on admission to the hospital
    5. chol: serum cholesterol in mg/dl
    6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    7. restecg: resting electrocardiographic results (Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: probable or definite left ventricular hypertrophy)
    8. thalach: maximum heart rate achieved
    9. exang: exercise induced angina (1 = yes; 0 = no)
    10. oldpeak: ST depression induced by exercise relative to rest
    11. slope: the slope of the peak exercise ST segment (Value 0: upsloping; Value 1: flat; Value 2: downsloping)
    12. ca: number of major vessels (0-3) colored by fluoroscopy
    13. thal: thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
    14. target: heart disease (1 = no, 2 = yes)
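    The common binarization described above (distinguishing presence, values 1-4, from absence, value 0) can be sketched with pandas; the column name and values below are invented for illustration:

```python
import pandas as pd

# Invented values for the 0-4 "goal" field described above.
df = pd.DataFrame({"goal": [0, 2, 1, 4, 0, 3]})

# Collapse the 0-4 scale into a binary label: 1 = presence, 0 = absence.
df["disease"] = (df["goal"] > 0).astype(int)

print(df["disease"].tolist())  # [0, 1, 1, 1, 0, 1]
```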

    If you find this dataset useful, please consider upvoting ❤️

    --- Original source retains full ownership of the source dataset ---

  9. ‘Coursera Course Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Coursera Course Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-coursera-course-dataset-839a/86aaffe7/?iid=003-724&v=presentation
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Coursera Course Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/siddharthm1698/coursera-course-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a dataset I generated during a hackathon for a project. I scraped the data from the official Coursera website. Our project aims to help any new learner find the right course to take by answering a few questions; it is an intelligent course recommendation system. Hence we had to scrape data from a few educational websites, and this is the data scraped from the Coursera website. For the project, visit: https://github.com/Siddharth1698/Coursu . Please do show your support by following us. I have just started to learn data science and hope this dataset will be helpful to someone for his/her personal purposes. The scraping code is here: https://github.com/Siddharth1698/Coursera-Course-Dataset . Article about the dataset generation: https://medium.com/analytics-vidhya/web-scraping-and-coursera-8db6af45d83f

    Content

    This dataset contains 6 main columns and 890 course records. The detailed description:

    1. course_title: the title of the course.
    2. course_organization: the organization conducting the course.
    3. course_Certificate_type: the different certification types available for the course.
    4. course_rating: the rating associated with each course.
    5. course_difficulty: the difficulty level of the course.
    6. course_students_enrolled: the number of students enrolled in the course.

    Inspiration

    This is just one of my first scraped datasets. Follow my GitHub for more: https://github.com/Siddharth1698

    --- Original source retains full ownership of the source dataset ---

  10. Top 250 Korean Dramas (KDrama) Dataset

    • opendatabay.com
    Updated Jun 8, 2025
    Cite
    Datasimple (2025). Top 250 Korean Dramas (KDrama) Dataset [Dataset]. https://www.opendatabay.com/data/consumer/da19780d-ee8b-428f-994b-cb432e9cd3ca
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset contains data from the top-ranked 250 Korean Dramas as per the MyDramaList website. The data has been collected and uploaded in the form of a CSV file and can be used to work on various Data Science Projects.

    The CSV file has 17 columns and 251 rows containing mostly textual data.

    Most of the data were collected from the MyDramaList website (https://mydramalist.com), and the data for the names of production companies were collected from Wikipedia (https://www.wikipedia.org). I wasn't sure how to scrape the data at the time, and hence I went all manual, copying and pasting the data using the cursor. (Yes, it was very tedious to manually copy and paste the data!)

    I was working on a content-based recommender system for Korean dramas and needed data to work with. The datasets available on Kaggle had only up to 100 K-drama titles. Not only that, but quite a few features deemed essential were also missing: synopsis, tags, director's name, cast names, production companies' names, and similar data weren't available in the pre-existing datasets.

    Original Data Source: Top 250 Korean Dramas (KDrama) Dataset

  11. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy; it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with high expectations, requiring applicants to possess comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. Whether you’re just starting out or have been in the field for a while, these tips will set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires a comprehensive level of technical preparation. Job seekers are often required to demonstrate their technical skills in order to show their ability to effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're proficient with data-manipulation tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in the use of SQL language to extract and process data from databases.

    5 Feature Engineering

    Understand and know the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.
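    As an illustration of this tip (the tip itself names no library; scikit-learn is an assumption here), the four metrics can be computed on invented labels like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Invented ground-truth labels and classifier predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.75
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```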

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

  12. Indonesia News Dataset (2024)

    • opendatabay.com
    Updated Jun 14, 2025
    Datasimple (2025). Indonesia News Dataset (2024) [Dataset]. https://www.opendatabay.com/data/consumer/71b802fc-33bb-466f-bd06-d0e415335f0b
    Dataset authored and provided by
    Datasimple
    Area covered
    Indonesia, Entertainment & Media Consumption
    Description

    📰 News Dataset: January 2024 - 5 September 2024

    Discover a comprehensive dataset comprising news articles from three prominent Indonesian outlets: Detik, Tempo, and Kompas. This dataset encapsulates the unfolding narratives and events from January 2024 to September 5th, 2024, offering a comprehensive view of the news landscape during this period.

    Each entry includes a title reflecting the essence of the news piece, a link directing to the original article, and the complete content offering in-depth insights. Additionally, categorization tags (Tag1-Tag5) accompany each article, facilitating easy sorting and analysis. This collection, licensed under CC BY-NC for EDUCATIONAL USE, not only serves as a valuable educational resource for data science projects but also supports detailed research on news trends.

    Original Data Source: Indonesia News Dataset (2024)

  13. Ramzi Project Data Analysis with R

    • kaggle.com
    Updated May 3, 2022
    Ramzi Arja 1998 (2022). Ramzi Project Data Analysis with R [Dataset]. https://www.kaggle.com/datasets/ramziarja1998/ramzi-project-data-analysis-with-r
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramzi Arja 1998
    Description

    As a data science student, and utilizing my skills in R, I created the following R Markdown to showcase my data analysis skills. Please feel free to run the entire script in your version of RStudio and read through the comments to get an idea of how I did my work. Your feedback is appreciated!

  14. Software Project Management Tools

    • kaggle.com
    Updated Sep 30, 2022
    PARV MODI (2022). Software Project Management Tools [Dataset]. https://www.kaggle.com/datasets/parvmodi/software-project-management-tools
    Dataset provided by
    Kaggle
    Authors
    PARV MODI
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Software Project Management Tool Recommendation System:

    This dataset contains details about the software project management tools selected by 90 students according to their preferences and convenience.

    Several analyses are possible with this dataset. The 90 students are split across 4 batches, so you can determine the most popular software project management tool within each batch.

    Similarly, you can find the best tool across all the batches, which students (and how many) chose the same tool, and much more!

    Future plans: I will incorporate other attributes of the tools, such as their cost and other unique features they offer.

    The recommendation system will then ask the user to select the features they want in a tool and, based on the selected features, recommend the best tool for them.

    Happy Learning PAM

  15. Analysis Bay Area Bike Share Udacity

    • kaggle.com
    Updated Nov 10, 2017
    Luiz Henrique Amorim (2017). Analysis Bay Area Bike Share Udacity [Dataset]. https://www.kaggle.com/luizoamorim/analysis-bay-area-bike-share-udacity/metadata
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luiz Henrique Amorim
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    San Francisco Bay Area
    Description

    This dataset is from a Udacity data science course project: an analysis of open data from Bay Area Bike Share. Ford GoBike is the Bay Area's new bike share system, with thousands of public bikes for use across San Francisco, the East Bay and San Jose. The bike share is designed with convenience in mind; it's a fun and affordable way to get around town.

    The project covers the many analyses the course asked for, and I added two simple analyses of my own at the end. The first examines how rain influences trips and shows that rainy days reduce the number of trips. The second shows trip counts per weekday in San Francisco, as well as trip counts per subscriber type and per subscriber type for each weekday. Trips are fewer on weekends and more numerous on weekdays; the data also show that weekday trips are mostly made by annual subscribers, while weekend trips are mostly made by customers. I created some helper functions that may be useful to others, and the notebook includes many examples of Python and pandas usage.

    This was my first data science project. I continue to study and learn, and hope to improve more and more.

    Thanks.

  16. titanic_dataset

    • kaggle.com
    Updated Jun 7, 2024
    SURENDHAN (2024). titanic_dataset [Dataset]. https://www.kaggle.com/datasets/surendhan/titanic-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SURENDHAN
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The Titanic dataset on Kaggle is a well-known dataset used for machine learning and data science projects, especially for binary classification tasks. It includes data on the passengers of the Titanic, which sank on its maiden voyage in 1912. This dataset is often used to predict the likelihood of a passenger's survival based on various features. Here is a detailed description of the dataset:

    Overview The Titanic dataset includes information about the passengers on the Titanic, such as their demographic information, class, fare, and whether they survived the disaster. The goal is to predict the survival of the passengers.

    Files The dataset typically includes three files:

    - train.csv: The training set, which includes the features and the target variable (Survived).
    - test.csv: The test set, which includes the features but not the target variable. You use this file to make predictions that can be submitted to Kaggle.
    - gender_submission.csv: An example of a submission file in the correct format.

    Features

    The dataset contains the following columns:

    - PassengerId: Unique ID for each passenger.
    - Survived: Target variable (0 = No, 1 = Yes) indicating if the passenger survived.
    - Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
    - Name: Name of the passenger.
    - Sex: Gender of the passenger (male or female).
    - Age: Age of the passenger in years. Fractional values indicate age in months for infants.
    - SibSp: Number of siblings or spouses aboard the Titanic.
    - Parch: Number of parents or children aboard the Titanic.
    - Ticket: Ticket number.
    - Fare: Passenger fare.
    - Cabin: Cabin number.
    - Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
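    As a sketch of a first baseline on these columns, the rule mirrored by gender_submission.csv (predict survival for female passengers) takes only a few lines. Only the column names below come from the dataset; the passenger rows are made up.

```python
# A handful of invented rows using the dataset's column names.
passengers = [
    {"PassengerId": 1, "Sex": "male",   "Survived": 0},
    {"PassengerId": 2, "Sex": "female", "Survived": 1},
    {"PassengerId": 3, "Sex": "female", "Survived": 1},
    {"PassengerId": 4, "Sex": "male",   "Survived": 1},
]

# The gender baseline: predict 1 (survived) for female, 0 for male.
def predict(row):
    return 1 if row["Sex"] == "female" else 0

correct = sum(1 for p in passengers if predict(p) == p["Survived"])
accuracy = correct / len(passengers)
```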

  17. Construction/Project Management Report Examples

    • kaggle.com
    Updated Sep 16, 2021
    + more versions
    Clayton Miller (2021). Construction/Project Management Report Examples [Dataset]. https://www.kaggle.com/datasets/claytonmiller/construction-and-project-management-example-data/data
    Dataset provided by
    Kaggle
    Authors
    Clayton Miller
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Context

    Building construction projects generate huge amounts of data that can be leveraged to understand improvements in efficiency, cost savings, etc. There are several digital apps on the market that help construction project managers keep track of the details of the process.

    Content

    This is a simple data set from a number of construction sites, generated by project management field apps used for quality, safety and site management.

    Essentially, there are two files in this data set:
    - Forms – generated from checklists for quality/safety/site management
    - Tasks – action items typically used for quality snags/defects or safety issues

    Acknowledgements

    This data set was donated by Jason Rymer, a BIM Manager from Ireland, who was keen to see more construction-related data online for learning purposes.

    Inspiration

    The goal of this data set is to help construction industry professionals learn how to code and process data.

  18. EPX/USD Binance Historical Data for ANN

    • kaggle.com
    Updated May 21, 2024
    EMİRHAN BULUT (2024). EPX/USD Binance Historical Data for ANN [Dataset]. http://doi.org/10.34740/kaggle/dsv/8479299
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    EMİRHAN BULUT
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    About Dataset

    Context:

    This dataset provides comprehensive historical data for the EPX/USDT trading pair on Binance, dating from November 21, 2021, to May 21, 2024. It is particularly curated for facilitating advanced predictive analytics and machine learning projects, especially in the field of financial time series forecasting.

    Sources:

    The data was meticulously sourced from investing.com, a reliable platform for financial information and data analytics. It captures critical daily trading metrics, including the opening, closing, highest, and lowest prices, along with daily trading volume and percentage changes. This rich dataset is integral for constructing robust models that can predict future trading behaviors and trends.

    Inspiration:

    With a background in artificial intelligence and financial modeling, I have embarked on a project to predict the future prices of EPX/USDT using advanced neural network architectures. This project aims to leverage the power of several cutting-edge algorithms to create a robust forecasting backbone, combining:

    • Gated Recurrent Units (GRU): Employed to capture the complexities of sequential data while efficiently handling long-term dependencies.

    • Long Short-Term Memory (LSTM): Utilized to overcome the vanishing gradient problem, ensuring the model remembers essential patterns over extended periods.

    • Recurrent Neural Networks (RNN): Applied to process sequences of trading data, retaining the temporal dynamics and dependencies inherent in time series data.

    • Transformers: Integrated to benefit from their ability to handle both local and global dependencies in data, ensuring more accurate and contextually aware predictions.

    The synergy of these algorithms aims to forge a resilient and accurate predictive model, capable of anticipating price movements and trends for the month of June 2024. This project showcases the potential of deploying hybrid neural network architectures for tackling real-world financial forecasting challenges.

    Usage:

    Users can utilize this dataset to:

    • Conduct time series analysis and predictive modeling.

    • Train and evaluate various machine learning and deep learning models.

    • Develop custom financial forecasting tools and algorithms.

    • Enhance their understanding of cryptocurrency trading patterns and dynamics.

    With this dataset, the financial forecasting community can explore novel modeling techniques and validate their approaches against real-world data, contributing to the development of more precise and reliable predictive models.
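    As one example, the sliding-window preparation that recurrent models such as GRU and LSTM rely on can be sketched as follows; the closing prices here are invented, not taken from the dataset.

```python
def make_windows(series, window, horizon=1):
    """Pair each fixed-size window of past values with the value
    `horizon` steps ahead, which becomes the training target."""
    xs, ys = [], []
    for i in range(len(series) - window - horizon + 1):
        xs.append(series[i:i + window])
        ys.append(series[i + window + horizon - 1])
    return xs, ys

closes = [0.10, 0.12, 0.11, 0.13, 0.14, 0.12]  # made-up daily closes
x, y = make_windows(closes, window=3)
```

    Each (window, target) pair then becomes one training example for the sequence model.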

    Conclusion:

    This dataset not only serves as a vital resource for academic and professional research but also stands as a testament to the power of innovative neural network architectures in the realm of financial forecasting. Whether you are a novice data scientist eager to explore time series data or a seasoned researcher looking to refine your models, this dataset offers a valuable foundation for your endeavors.

  19. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analysis of large BigQuery datasets.
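    A minimal sketch of that query pattern. The table path follows the text above; `sample_repos` is just one example table, and the BigQuery client is passed in rather than constructed here so the snippet runs without GCP credentials.

```python
# Table path pattern: bigquery-public-data.github_repos.[TABLENAME]
TABLE = "bigquery-public-data.github_repos.sample_repos"

# Keep scans cheap on a 3TB+ dataset: select only needed columns and LIMIT.
QUERY = (
    f"SELECT repo_name, watch_count FROM `{TABLE}` "
    "ORDER BY watch_count DESC LIMIT 10"
)

def top_repos(client):
    """Run the query with a google.cloud.bigquery.Client supplied by
    the caller, so this sketch imports nothing GCP-specific itself."""
    return list(client.query(QUERY).result())
```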

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  20. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Authors
    Ajay
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners start with curated sets of data, but it's a well-known fact that in a real data science project most of the time is spent on collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the cars they want to sell, and then trained a deep neural network to identify the model of a car from its image. In my search I found that approximately 10 percent of the pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

    Content

    There are 4000 images of two popular Maruti Suzuki cars in India (Swift and WagonR), with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images and a test set with 800 images. The data was randomized before being split into training, validation and test sets.
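    The shuffle-then-split step described above can be sketched like this; the file names are placeholders, not the dataset's actual image names.

```python
import random

# Placeholder names standing in for the 4000 car images.
paths = [f"img_{i:04d}.jpg" for i in range(4000)]

random.seed(0)         # fixed seed so the split is reproducible
random.shuffle(paths)  # randomize before splitting, as described

# 2400 / 800 / 800, matching the dataset's split.
train, val, test = paths[:2400], paths[2400:3200], paths[3200:]
```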

    A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced image-classification techniques in PyTorch and Keras, such as data augmentation, dropout, batch normalization and transfer learning.

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we do better, and by how much?

    2. Is the data collected for the two car models representative of all such cars across the country, or is there sample bias?

    3. I would also like someone to extend the concept so that if a user uploads an incorrect car picture, the ML model automatically flags it, for example when the user uploads the wrong model or an image that is not a car at all.

Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning

Lending Club Loan Data Analysis - Deep Learning

Lending Club Loan Data Analysis - Deep Learning - AI Capstone Project

Dataset updated
Aug 9, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Deependra Verma
Description

DESCRIPTION

Create a model that predicts whether or not a loan will default, using the historical data.

Problem Statement:

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features that make the problem more challenging.

Domain: Finance

Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

Content:

Dataset columns and definition:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

installment: The monthly installments owed by the borrower if the loan is funded.

log.annual.inc: The natural log of the self-reported annual income of the borrower.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

fico: The FICO credit score of the borrower.

days.with.cr.line: The number of days the borrower has had a credit line.

revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Steps to perform:

Perform exploratory data analysis and feature engineering, then build a deep learning model to predict whether or not a loan will default, using the historical data.

Tasks:

  1. Feature Transformation

Transform categorical values into numerical values (discrete)
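A minimal sketch of such a transformation: one-hot encoding the `purpose` column using the category values listed in the dataset description.

```python
# The categorical `purpose` values from the dataset description.
PURPOSES = ["credit_card", "debt_consolidation", "educational",
            "major_purchase", "small_business", "all_other"]

def one_hot(purpose):
    """Map one categorical value to a 0/1 indicator vector."""
    return [1 if purpose == p else 0 for p in PURPOSES]

vec = one_hot("small_business")
```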

  2. Exploratory data analysis of different factors of the dataset.

  3. Additional Feature Engineering

You will check the correlation between features and drop those that are strongly correlated with one another.

This will help reduce the number of features and leave you with the most relevant ones.
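A sketch of correlation-based feature dropping in plain Python; the feature values and the 0.95 threshold are invented for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical columns: revol_util tracks revol_bal almost exactly.
features = {
    "int_rate":   [0.11, 0.12, 0.10, 0.14, 0.13],
    "revol_bal":  [1000, 2000, 1500, 4000, 3500],
    "revol_util": [0.10, 0.20, 0.15, 0.40, 0.35],
}

# Drop the second feature of each strongly correlated pair.
THRESHOLD = 0.95
names = list(features)
dropped = set()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if a not in dropped and b not in dropped:
            if abs(pearson(features[a], features[b])) > THRESHOLD:
                dropped.add(b)

kept = [n for n in names if n not in dropped]
```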

  4. Modeling

After applying EDA and feature engineering, you are now ready to build the predictive models

In this part, you will create a deep learning model using Keras with a TensorFlow backend.
