This dataset was created by Loulou
This dataset was used for training machine learning models to detect phishing attacks and for studying the explainability of these models. It was published in 2024 and covers both phishing and legitimate websites. Phishing samples were collected from PhishTank, whereas legitimate samples were collected from two sources, namely the Tranco and Alexa rankings of popular websites. The dataset is balanced, containing 5,000 phishing and 5,000 legitimate samples, each described by 74 features extracted from the entire URL as well as from the fully qualified domain name, pathname, filename, and parameters. Of these features, 70 are numerical and four are binary. The target variable is also binary.
Calzarossa, M., Giudici, P., Zieni, R.: Explainable machine learning for phishing feature detection. Quality and Reliability Engineering International 40, 362–373 (2024).
Zieni, Rasha (2024), “Zieni dataset for Phishing detection”, Mendeley Data, V1, doi: 10.17632/8mcz8jsgnb.1
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Medical-Grade Explainable AI Project Assets
This dataset contains comprehensive assets for a production-ready Explainable AI (XAI) heart disease prediction system achieving 94.1% accuracy with full model transparency.
📊 CONTEXT: Healthcare AI faces a critical "black box" problem where models make predictions without explanations. This project demonstrates how to build trustworthy medical AI using SHAP and LIME for real-time explainability.
🎯 PROJECT GOAL: Create a clinically deployable AI system that not only predicts heart disease with high accuracy but also provides interpretable explanations for each prediction, enabling doctor-AI collaboration.
🚀 KEY FEATURES:
- 94.1% prediction accuracy (XGBoost + Optuna)
- Real-time SHAP & LIME explanations
- FastAPI backend with medical validation
- Gradio clinical dashboard
- Full MLOps pipeline (MLflow tracking)
- 4-layer enterprise architecture
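For a concrete flavor of the XGBoost + SHAP pattern listed above, here is a minimal, hypothetical sketch. The file name heart_clean.csv comes from the asset list below; the target column name and hyperparameters are assumptions, not the project's actual configuration.

```python
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("heart_clean.csv")               # file from the asset list below
X, y = df.drop(columns=["target"]), df["target"]  # "target" is an assumed column name
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)     # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X_te)
shap.summary_plot(shap_values, X_te)      # global summary plot, as included in the assets
```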
📁 ASSETS INCLUDED:
- heart_clean.csv - Clinical dataset ready for analysis
- SHAP summary plots for global explainability
- Performance metrics and visualizations
- Architecture diagrams
- Model evaluation results
🔗 COMPANION RESOURCES:
- Live Demo: https://huggingface.co/spaces/Ariyan-Pro/HeartDisease-Predictor
- Notebook: https://www.kaggle.com/code/ariyannadeem/heart-disease-prediction-with-explainable-ai
- Source Code: https://github.com/Ariyan-Pro/ExplainableAI-HeartDisease
Perfect for learning medical AI implementation, explainable AI techniques, and production deployment.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 5 million synthetically generated financial transactions designed to simulate real-world behavior for fraud detection research and machine learning applications. Each transaction record includes fields such as:
Transaction Details: ID, timestamp, sender/receiver accounts, amount, type (deposit, transfer, etc.)
Behavioral Features: time since last transaction, spending deviation score, velocity score, geo-anomaly score
Metadata: location, device used, payment channel, IP address, device hash
Fraud Indicators: binary fraud label (is_fraud) and type of fraud (e.g., money laundering, account takeover)
The dataset follows realistic fraud patterns and behavioral anomalies, making it suitable for:
Binary and multiclass classification models
Fraud detection systems
Time-series anomaly detection
Feature engineering and model explainability
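As a starting point for the binary classification use case listed above, here is a minimal hedged sketch. Only is_fraud is a documented column name; the feature names and file name are hypothetical and must be adapted to the actual schema.

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")  # assumed file name
features = ["amount", "time_since_last_transaction", "spending_deviation_score",
            "velocity_score", "geo_anomaly_score"]  # hypothetical column names
X, y = df[features], df["is_fraud"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

clf = HistGradientBoostingClassifier()  # scales well to millions of rows
clf.fit(X_tr, y_tr)
print("PR-AUC:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

PR-AUC is reported rather than accuracy because fraud labels are typically heavily imbalanced.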
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains data on 6,607 students and the factors influencing their academic performance. It’s designed to help researchers, educators, and data scientists analyze how habits, environment, and background affect exam scores.
The dataset includes demographic, behavioral, and academic variables such as study hours, attendance, parental involvement, and more. You can use it for exploratory analysis, predictive modeling, and the example research ideas listed below.
Target Variable: Exam_Score
Example research ideas:
- What is the most influential factor affecting student performance?
- Can machine learning accurately predict academic success?
- How do socioeconomic and behavioral factors interact in education?
Data source: synthetic data generated for research and educational purposes.
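To illustrate, a minimal regression sketch against the documented target Exam_Score; the file name and feature columns are assumptions to adapt.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("student_performance.csv")   # assumed file name
X = df[["Hours_Studied", "Attendance"]]       # hypothetical feature names
y = df["Exam_Score"]                          # documented target variable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2:", r2_score(y_te, model.predict(X_te)))
```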
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview:
This dataset has been designed as a benchmark for AutoML and predictive modeling in the financial domain. It focuses on assessing credit risk by predicting whether a borrower will experience serious delinquency within two years. The data comprises a mix of financial metrics and personal attributes, which allow users to build and evaluate models for credit risk scoring.
Dataset Characteristics:
Column Descriptions:
The available columns use abbreviated names for ease of use.
Use Cases and Applications:
- Risk Management: Build and validate credit scoring models to forecast borrower default risks.
- AutoML Benchmarking: Evaluate and compare the performance of various AutoML frameworks on a structured, financial dataset.
- Academic Research: Explore trends and relationships in credit behavior, along with the predictive power of financial indicators.
- Model Interpretability: Given the regulated nature of financial models, this dataset provides an excellent context for testing feature importance and creating explainable AI solutions.
Additional Information:
- Preprocessing & Feature Engineering: Users are encouraged to perform exploratory data analysis, handle potential missing values or outliers, and experiment with scaling techniques and feature transformations.
- Regulatory Considerations: Since credit scoring models often require transparency, it’s important to incorporate techniques that ensure model interpretability.
- Benchmarking: Ideal for comparing traditional modeling techniques (like logistic regression) with modern approaches (such as gradient boosting and neural networks).
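In that spirit, a minimal sketch comparing the two model families; the file name and the target column name (derived from the "serious delinquency within two years" task) are assumptions.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("credit_risk.csv")  # assumed file name
X, y = df.drop(columns=["serious_dlqin2yrs"]), df["serious_dlqin2yrs"]  # assumed target name

models = {
    "logistic regression": make_pipeline(SimpleImputer(), StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "gradient boosting": make_pipeline(SimpleImputer(), GradientBoostingClassifier()),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: ROC-AUC = {auc:.3f}")
```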
This dataset is now available on Kaggle for anyone looking to experiment with or benchmark predictive models for credit risk analysis. Whether you're a data scientist, researcher, or financial analyst, the dataset provides a straightforward yet robust framework for exploring credit-related behavior and risk factors.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
This is a synthetic dataset for training and testing spam detection models. It contains 20,000 email samples, and each sample is described by five features and one label.
The five features are:
- num_links — drawn from a Poisson distribution with a rate (λ) of 1.5
- num_words
- has_offer
- sender_score
- all_caps
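A minimal classification sketch using the five documented feature names; the label column name and file name are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("spam.csv")  # assumed file name
X = df[["num_links", "num_words", "has_offer", "sender_score", "all_caps"]]
y = df["is_spam"]             # assumed label column name
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```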
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
This synthetic health dataset simulates real-world lifestyle and wellness data for individuals. It is designed to help data scientists, machine learning engineers, and students build and test health risk prediction models safely — without using sensitive medical data.
The dataset includes features such as age, weight, height, exercise habits, sleep hours, sugar intake, smoking, alcohol consumption, marital status, and profession, along with a synthetic health_risk label generated using a heuristic rule-based algorithm that mimics realistic risk behavior patterns.
| Column Name | Description | Type | Example |
|---|---|---|---|
| age | Age of the person (years) | Numeric | 35 |
| weight | Body weight in kilograms | Numeric | 70 |
| height | Height in centimeters | Numeric | 172 |
| exercise | Exercise frequency level | Categorical (none, low, medium, high) | medium |
| sleep | Average hours of sleep per night | Numeric | 7 |
| sugar_intake | Level of sugar consumption | Categorical (low, medium, high) | high |
| smoking | Smoking habit | Categorical (yes, no) | no |
| alcohol | Alcohol consumption habit | Categorical (yes, no) | yes |
| married | Marital status | Categorical (yes, no) | yes |
| profession | Type of work or profession | Categorical (office_worker, teacher, doctor, engineer, etc.) | teacher |
| bmi | Body Mass Index, weight (kg) / height (m)² | Numeric | 24.5 |
| health_risk | Target label showing overall health risk | Categorical (low, high) | high |
Health Risk Prediction:
Train classification models (Logistic Regression, RandomForest, XGBoost, CatBoost) to predict health risk (low / high).
Feature Importance Analysis: Identify which lifestyle factors most influence health risk.
Data Preprocessing & EDA Practice: Use this dataset for data cleaning, encoding, and visualization practice.
Model Explainability Projects: Use SHAP or LIME to explain how different lifestyle habits affect predictions.
Streamlit or Flask Web App Development: Build a real-time web app that predicts health risk from user input.
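Here is a minimal pipeline sketch for the use cases above; the column names come from the table, while the file name is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("health_risk.csv")  # assumed file name
cat_cols = ["exercise", "sugar_intake", "smoking", "alcohol", "married", "profession"]
num_cols = ["age", "weight", "height", "sleep", "bmi"]

pre = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
                        remainder="passthrough")  # numeric columns pass through untouched
model = Pipeline([("pre", pre), ("rf", RandomForestClassifier(random_state=0))])

X, y = df[cat_cols + num_cols], df["health_risk"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model.fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))
```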
Imagine you are a data scientist building a Health Risk Prediction App for a wellness startup. You want to analyze how exercise, sleep, and sugar intake affect overall health risk. This dataset helps you simulate those relationships without handling sensitive medical data.
License: CC0 (Public Domain). You are free to use this dataset for research, learning, or commercial projects.
Created by Arif Miah — Machine Learning Engineer | Kaggle Expert | Data Scientist. 📧 arifmiahcse@gmail.com
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.
| Column Name | Description |
|---|---|
| Transaction_ID | Unique identifier for each transaction |
| User_ID | Unique identifier for the user |
| Transaction_Amount | Amount of money involved in the transaction |
| Transaction_Type | Type of transaction (Online, In-Store, ATM, etc.) |
| Timestamp | Date and time of the transaction |
| Account_Balance | User's current account balance before the transaction |
| Device_Type | Type of device used (Mobile, Desktop, etc.) |
| Location | Geographical location of the transaction |
| Merchant_Category | Type of merchant (Retail, Food, Travel, etc.) |
| IP_Address_Flag | Whether the IP address was flagged as suspicious (0 or 1) |
| Previous_Fraudulent_Activity | Number of past fraudulent activities by the user |
| Daily_Transaction_Count | Number of transactions made by the user that day |
| Avg_Transaction_Amount_7d | User's average transaction amount in the past 7 days |
| Failed_Transaction_Count_7d | Count of failed transactions in the past 7 days |
| Card_Type | Type of payment card used (Credit, Debit, Prepaid, etc.) |
| Card_Age | Age of the card in months |
| Transaction_Distance | Distance between the user's usual location and transaction location |
| Authentication_Method | How the user authenticated (PIN, Biometric, etc.) |
| Risk_Score | Fraud risk score computed for the transaction |
| Is_Weekend | Whether the transaction occurred on a weekend (0 or 1) |
| Fraud_Label | Target variable (0 = Not Fraud, 1 = Fraud) |
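Given the table above, a minimal XGBoost sketch is possible with the documented column names; the file name and the use of XGBoost's native categorical support (xgboost ≥ 1.6) are assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("fraud_transactions.csv")  # assumed file name
df = df.drop(columns=["Transaction_ID", "User_ID", "Timestamp"])  # identifiers dropped for simplicity
cat_cols = ["Transaction_Type", "Device_Type", "Location",
            "Merchant_Category", "Card_Type", "Authentication_Method"]
df[cat_cols] = df[cat_cols].astype("category")

X, y = df.drop(columns=["Fraud_Label"]), df["Fraud_Label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = XGBClassifier(enable_categorical=True, tree_method="hist", eval_metric="auc")
clf.fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```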
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Please note that this is the original dataset with additional information and proper attribution. There is at least one other version of this dataset on Kaggle that was uploaded without permission. Please be fair and attribute the original author. This synthetic dataset is modeled after an existing milling machine and consists of 10,000 data points, stored as rows, with 14 features in columns.
The machine failure consists of five independent failure modes:
1. Tool wear failure (TWF): the tool will be replaced or fail at a randomly selected tool wear time between 200 and 240 minutes (120 times in our dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
2. Heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
3. Power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
4. Overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 for M, 13,000 for H), the process fails due to overstrain. This is true for 98 data points.
5. Random failures (RNF): each process has a 0.1% chance to fail regardless of its process parameters. This is the case for only 5 data points, fewer than could be expected for 10,000 data points in our dataset.
If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. It is therefore not transparent to the machine learning method which of the failure modes has caused the process to fail.
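The three deterministic rules above (HDF, PWF, OSF) can be recomputed directly from the process parameters; TWF and RNF are randomized and cannot. Below is a sketch whose column names follow the AI4I 2020 distribution of this dataset but should be verified against the actual file.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("ai4i2020.csv")  # assumed file name

# PWF: power = torque * rotational speed in rad/s; fails outside [3500, 9000] W
power_w = df["Torque [Nm]"] * df["Rotational speed [rpm]"] * 2 * np.pi / 60
pwf = (power_w < 3500) | (power_w > 9000)

# HDF: temperature difference below 8.6 K and rotational speed below 1380 rpm
hdf = ((df["Process temperature [K]"] - df["Air temperature [K]"]) < 8.6) \
      & (df["Rotational speed [rpm]"] < 1380)

# OSF: tool wear * torque above the per-variant limit (L/M/H)
osf_limit = df["Type"].map({"L": 11000, "M": 12000, "H": 13000})
osf = (df["Tool wear [min]"] * df["Torque [Nm]"]) > osf_limit

print(pwf.sum(), hdf.sum(), osf.sum())  # expected roughly 95, 115, 98 per the description
```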
This dataset is part of the following publication, please cite when using this dataset: S. Matzka, "Explainable Artificial Intelligence for Predictive Maintenance Applications," 2020 Third International Conference on Artificial Intelligence for Industries (AI4I), 2020, pp. 69-74, doi: 10.1109/AI4I49448.2020.00023.
The image of the milling process is the work of Daniel Smyth @ Pexels: https://www.pexels.com/de-de/foto/industrie-herstellung-maschine-werkzeug-10406128/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Oxford Parkinson's Disease Detection Dataset — UCI Machine Learning Repository
Original dataset link: https://archive.ics.uci.edu/dataset/174/parkinsons
Dataset Characteristics: Multivariate
Subject Area: Health and Medicine
Associated Tasks: Classification
Feature Type: Real
Instances: 197
Features: 22
Dataset Information
This dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.
The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).
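Given the format described above, a minimal loading-and-classification sketch follows; the file name matches the UCI distribution but is an assumption here. Because there are several recordings per patient, the split groups by subject to avoid leaking the same voice across train and test.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("parkinsons.data")  # assumed file name (UCI distribution)
# assumes the recording name ends with a per-recording index, e.g. subject_1
groups = df["name"].str.rsplit("_", n=1).str[0]
X, y = df.drop(columns=["name", "status"]), df["status"]

train_idx, test_idx = next(GroupShuffleSplit(test_size=0.25, random_state=0)
                           .split(X, y, groups=groups))
clf = RandomForestClassifier(random_state=0).fit(X.iloc[train_idx], y.iloc[train_idx])
print("Test accuracy:", clf.score(X.iloc[test_idx], y.iloc[test_idx]))
```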
Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).
Has Missing Values?
No
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
📄 Dataset Description: Iris Flower Dataset
The Iris dataset is one of the most famous and widely used datasets in the field of machine learning and pattern recognition. It was first introduced by the British biologist and statistician Ronald A. Fisher in 1936.
🌸 Dataset Overview
The dataset consists of 150 samples of iris flowers from three different species:
- Setosa
- Versicolor
- Virginica
Each sample contains four features (all numeric):
1. Sepal Length (cm)
2. Sepal Width (cm)
3. Petal Length (cm)
4. Petal Width (cm)
The target variable is the species of the flower.
📊 Dataset Characteristics
| Feature | Type | Description |
|---|---|---|
| Sepal Length | Float | Length of the sepal in cm |
| Sepal Width | Float | Width of the sepal in cm |
| Petal Length | Float | Length of the petal in cm |
| Petal Width | Float | Width of the petal in cm |
| Species | String | Category: Setosa, Versicolor, Virginica |
🔍 Applications
This dataset is commonly used for:
- Supervised learning (classification)
- Data visualization and EDA
- Algorithm comparison (e.g., Logistic Regression, SVM, KNN)
- Dimensionality reduction (e.g., PCA)
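Because scikit-learn bundles this dataset, the classic workflow fits in a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))
```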
✅ Why This Dataset?
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has developed into a new type of career in recent decades. YouTubers earn money through advertising revenue from YouTube videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority for YouTubers. Meanwhile, some of our friends are YouTubers or channel owners on other video-sharing platforms. This raised our interest in predicting the performance of a video. If creators can have a preliminary prediction and understanding of their videos' performance, they may adjust their videos to gain the most attention from the public.
You have been provided details on videos along with some features as well. Can you accurately predict the number of likes for each video using the set of input variables?
Train Set
video_id -> Identifier for each video
title -> Name of the Video on YouTube
channel_title -> Name of the Channel on YouTube
category_id -> Category of the Video (anonymous)
publish_date -> The date video was published
tags -> Different tags for the video
views -> Number of views received by the Video
dislikes -> Number of dislikes on the Video
comment_count -> Number of comments on the Video
description -> Textual description of the Video
country_code -> Country from which the Video was published
likes -> Number of Likes on the video
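A quick baseline sketch for the likes-prediction task, using only the numeric columns documented above; the train file name is an assumption, and the text fields (title, tags, description) are ignored here.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # assumed file name
X = df[["views", "dislikes", "comment_count"]]
y = df["likes"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```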
Thank You Analytics Vidhya for providing this dataset.
The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.
CIFAKE is a dataset that contains 60,000 synthetically-generated images and 60,000 real images (collected from CIFAR-10). Can computer vision techniques be used to detect when an image is real or has been generated by AI?
Further information on this dataset can be found here: Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
The dataset contains two classes - REAL and FAKE.
For REAL, we collected the images from Krizhevsky & Hinton's CIFAR-10 dataset
For the FAKE images, we generated the equivalent of CIFAR-10 with Stable Diffusion version 1.4
There are 100,000 images for training (50k per class) and 20,000 for testing (10k per class)
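A hypothetical PyTorch starter for the REAL-vs-FAKE task, assuming the images are arranged as train/REAL and train/FAKE (verify the folder layout in the actual download):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_ds = datasets.ImageFolder("train", transform=transforms.ToTensor())
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True)

model = nn.Sequential(  # tiny CNN for 32x32 RGB inputs
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in train_dl:  # one epoch over the training images
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```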
The dataset and all studies using it are linked using Papers with Code https://paperswithcode.com/dataset/cifake-real-and-ai-generated-synthetic-images
If you use this dataset, you must cite the following sources
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Bird, J.J. and Lotfi, A., 2024. CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access.
Real images are from Krizhevsky & Hinton (2009); fake images are from Bird & Lotfi (2024).
The updates to the dataset on 28 March 2023 did not change the image data itself; the ".jpeg" file extensions were renamed to ".jpg" and the root folder was re-uploaded to meet Kaggle's usability requirements.
This dataset is published under the same MIT license as CIFAR-10:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The dataset is provided as both a text file and a CSV file, which provide the following resources that can be used as inputs for model building:
A collection of website URLs for 11000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).
A code template containing these code blocks:
a. Import modules (Part 1)
b. Load data function + input/output field descriptions
The data set also serves as an input for project scoping and tries to specify the functional and non-functional requirements for it.
You are expected to write the code for a binary classification model (phishing website or not) using Python Scikit-Learn that trains on the data and calculates the accuracy score on the test data. You have to use one or more of the classification algorithms to train a model on the phishing website data set.
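One possible shape of the requested solution — a scikit-learn classifier over the 30 website parameters; the file name and the label column name are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("phishing.csv")  # assumed file name
X, y = df.drop(columns=["Result"]), df["Result"]  # assumed label column (1 or -1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```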
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
🌍 Storyline: The Digital Bazaar
In 2024, e-commerce platforms across Istanbul, Berlin, New York, London, and Paris began noticing strange transaction bursts. Some cards were tested with $1 purchases at midnight. Others shipped “gaming accessories” 5,000 km away. Promo codes were being reused by freshly created accounts.
To investigate these global patterns safely, this synthetic dataset recreates realistic fraud behavior across countries, channels, and user profiles — allowing anyone to build, test, and compare fraud-detection models without exposing any real user data.
💡 What makes it special
🧍‍♀️ 6,000 unique users performing ≈300,000 transactions
💳 Multiple transactions per user (40–60) → enables behavioral analysis
🧩 Strong feature correlations — not random noise
🌐 Cross-country dynamics (country, bin_country)
💸 Natural imbalance (~2% fraud), just like real financial systems
🕓 Time realism — night-time fraud spikes, daily rhythms
🧠 Feature explainability — easy to visualize, model, and interpret
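For example, the night-time pattern above can be checked directly by deriving an hour-of-day feature; the timestamp and is_fraud column names are assumptions for this dataset.

```python
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["timestamp"])  # assumed names
df["hour"] = df["timestamp"].dt.hour
df["is_night"] = df["hour"].between(0, 5).astype(int)

# the fraud rate at night vs. day should reflect the claimed spike
print(df.groupby("is_night")["is_fraud"].mean())
```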
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Hurricanes can cause upwards of 1,000 deaths and $50 billion in damages in a single event, and have been responsible for well over 160,000 deaths globally in recent history. During a tropical cyclone, humanitarian response efforts hinge on accurate risk approximation models that depend on wind speed measurements at different points in time throughout a storm’s life cycle.
For several decades, forecasters have relied on visual pattern recognition of complex cloud features in visible and infrared imagery. While the longevity of this technique indicates the strong relationship between spatial patterns and cyclone intensity, visual inspection is manual, subjective, and often leads to inconsistent estimates between even well-trained analysts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is created for Kaggle users who want to explore, experiment, and innovate in the field of stock prediction and financial machine learning. It bridges real-world financial data with practical modeling challenges, enabling you to build, test, and showcase predictive models that simulate professional-grade investment analysis.
Whether you're a beginner exploring quantitative finance or an experienced data scientist refining predictive strategies, this dataset offers a rich playground for uncovering insights and improving model performance.
This dataset allows contributors to focus on what matters most: building impactful models and sharing innovative approaches with the Kaggle community.
Data is aggregated from widely-used financial intelligence platforms:
FinancialModelingPrep (FMP): company profiles, financial fundamentals, financial ratios, S&P 500 performance indicators, and analyst ratings.
Alpha Vantage: daily historical OHLC and adjusted-close stock price data.
These sources ensure high relevance, broad coverage, and strong analytical value for market-based modeling.
The dataset is organized for intuitive exploration and modeling, with each record structured by stock ticker and time period. Feature categories follow the data sources above: company profiles and fundamentals, financial ratios, analyst ratings, and daily price history.
This design supports:
✔ Binary classification
✔ Regression modeling
✔ Time-series experimentation
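For the time-series experimentation listed above, evaluation should respect temporal order; scikit-learn's TimeSeriesSplit does this. All column names here are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

df = pd.read_csv("stock_data.csv").sort_values("date")  # assumed file and column names
X = df[["open", "high", "low", "close"]]                # hypothetical features
y = df["next_day_return"]                               # hypothetical target

scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")
print("R^2 per fold:", scores)
```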
The data has been preprocessed for usability and consistency, ensuring a smooth experience whether you're performing quick EDA or building production-ready models.
You are encouraged to experiment with the data and share your approaches. If you find this dataset valuable, feel free to follow for future releases and updates as this project evolves with enhanced features, expanded stock coverage, and refined modeling strategies.
This dataset is intended for research and educational purposes only and does not constitute financial or investment advice. Market conditions and external factors may significantly influence real-world outcomes.
✨ This dataset is part of an ongoing effort to build a transparent, reusable, and community-driven resource for advancing financial machine learning on Kaggle.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
💉 COVID-19 Vaccine Adverse Events (2020-2025): VAERS Real-World Surveillance Data
This dataset offers a critical, large-scale look into the real-world safety surveillance of COVID-19 vaccines, sourced from the Vaccine Adverse Event Reporting System (VAERS). Maintained by the CDC and FDA, this collection spans the unprecedented period of mass vaccination from 2020 through 2025, providing an invaluable resource for pharmacovigilance, public health research, and regulatory decision-making.
Key Features & Challenge
The dataset is a rich blend of structured and unstructured information detailing reported Adverse Drug Events (ADEs), which range from mild local reactions to severe, life-threatening complications.
Structured Data: Includes standardized symptom codes, offering a direct, quantitative view of reported reactions.
Free-Text Notes: Contains verbose, real-world symptom descriptions provided by reporters. This text is a "treasure trove" of granular context, including details on duration, intensity, and location of symptoms.
The Challenge: The structured entries are limited in scope. The free-text notes, while rich, are inherently noisy and lack standardized metadata such as clinical severity scores or age-specific pattern normalization.
Value to Data Scientists
This dataset presents a significant Natural Language Processing (NLP) and Machine Learning (ML) challenge:
Extracting Context: Develop models to effectively extract critical clinical context (e.g., "headache lasting three days, severe") from the raw, non-standardized free-text notes.
Standardizing Severity: Create predictive models to assign standardized severity and age-specific risk patterns to ADEs.
Informed Decision Making: The ultimate goal is to generate actionable, timely insights for regulators, healthcare providers, and pharmaceutical companies, improving both vaccine safety monitoring and public trust.
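As a taste of the context-extraction task, a minimal regex sketch over a free-text note; the example sentence is illustrative, not taken from the dataset.

```python
import re

note = "Patient reported severe headache lasting three days after second dose."

SEVERITY = re.compile(r"\b(mild|moderate|severe|life[- ]threatening)\b", re.I)
DURATION = re.compile(r"\blasting\s+([\w-]+)\s+(minutes?|hours?|days?|weeks?)\b", re.I)

sev = SEVERITY.search(note)
dur = DURATION.search(note)
print(sev.group(1) if sev else None)            # 'severe'
print(" ".join(dur.groups()) if dur else None)  # 'three days'
```

Real notes are far noisier, so production systems would pair simple rules like these with learned NER models.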
Dive into this dataset to apply your skills in advanced data cleaning, feature engineering, and state-of-the-art NLP to solve a crucial, high-impact public health challenge.