License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information, see https://doi.org/10.1016/j.cose.2024.104290.
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we deployed multiple agents and soluble agents within an infrastructure of more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods used to build the ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviours listed in the summary table, a set of time windows (TWs) is computed, each aggregating the user's behaviour within a fixed time interval. These TWs serve as the basis for training and evaluating various supervised and unsupervised methods.
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioural profiles. The indicators described in the TW are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.
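As an illustration of the TW representation (not the exact RBD24 pipeline, whose features are already precomputed in the Parquet files), here is a minimal pandas sketch that aggregates hypothetical log records into fixed one-hour windows per user; the column names are invented for the example:

```python
import pandas as pd

# Hypothetical raw log records; the real RBD24 logs come from HTTP, DNS, SSL, and SMTP sources.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 10:02", "2024-01-01 10:17", "2024-01-01 11:05"]),
    "user": ["u1", "u1", "u1"],
    "protocol": ["HTTP", "DNS", "SSL"],
    "bytes": [1200, 300, 5400],
})

# Aggregate each user's activity into fixed 1-hour time windows.
tw = (
    logs.set_index("timestamp")
        .groupby(["user", pd.Grouper(freq="1h")])
        .agg(events=("protocol", "count"), total_bytes=("bytes", "sum"))
        .reset_index()
)
print(tw)
```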
Parquet is a columnar storage format that offers good compression and efficiency, making it suitable for large datasets and complex analytical tasks. It is supported across many tools and languages, including Python: pandas can read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats; compared to row-based formats such as CSV, Parquet's columnar layout greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here is an example of how to load the data with pandas; make sure the `fastparquet` library is installed:
```python
import pandas as pd

# Read a Parquet file with the fastparquet engine
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
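The `pyarrow` engine works as well, and `read_parquet` also accepts a `columns` argument to limit memory usage on the larger files. A small sketch using one of the file names from the table above (check `df.columns` for the actual schema before selecting columns):

```python
import pandas as pd

# Read with pyarrow; pass a column list once you know the schema, e.g. columns=["some_feature", "label"]
df = pd.read_parquet("Crypto_desktop.parquet", engine="pyarrow", columns=None)
print(df.shape)
print(df.dtypes.head())
```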
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Aspect | Details | Notes |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze
print(df.groupby('cluster').mean())
```
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
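For items 2 and 5, a minimal sketch with scikit-learn's PCA and the silhouette score mentioned later in this description; it reuses the column names from the starter code above (if the file contains other non-numeric columns, such as a region label, drop those as well):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Compare candidate cluster counts with the silhouette score.
for k in range(3, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))

# Project onto two principal components and plot the 5-cluster solution.
coords = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=20)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```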
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset consists of 10,000,000 samples with 50 numerical features. Each feature has been randomly generated using a uniform distribution between 0 and 1. To add complexity, a hidden structure has been introduced in some of the features. Specifically, Feature 2 is related to Feature 1, making it a good candidate for regression analysis tasks. The other features remain purely random, allowing for the exploration of feature engineering and random data generation techniques.
This hidden structure allows you to test models on data where a simple pattern (between Feature 1 and Feature 2) exists, but with noise that can challenge more advanced models in finding the relationship.
| Feature Name | Description |
|---|---|
| feature_1 | Random number (0–1, uniform) |
| feature_2 | 2 × feature_1 + small noise (N(0, 0.05)) |
| feature_3–50 | Independent random numbers (0–1) |
This dataset is versatile and can be used for various machine learning tasks, including regression analysis on the hidden feature_1/feature_2 relationship, feature selection and feature engineering experiments, and testing how models behave on mostly random, noisy data.
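Because the generative process is fully described above, a small replica is enough to verify the hidden structure. The sketch below regenerates 10,000 rows following the same recipe and recovers a slope close to 2 with NumPy (the full dataset has 10,000,000 rows):

```python
import numpy as np

# Small-scale replica of the described generative process.
rng = np.random.default_rng(0)
n = 10_000
feature_1 = rng.uniform(0, 1, n)
feature_2 = 2 * feature_1 + rng.normal(0, 0.05, n)
others = rng.uniform(0, 1, (n, 48))  # feature_3 ... feature_50, purely random

# A least-squares fit of feature_2 on feature_1 should recover a slope near 2.
slope, intercept = np.polyfit(feature_1, feature_2, 1)
print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```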
This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains detailed flight performance and delay information for domestic flights in 2024, merged from monthly BTS TranStats files into a single cleaned dataset. It includes over 7 million rows and 35 columns, providing comprehensive information on scheduled and actual flight times, delays, cancellations, diversions, and distances between airports. The dataset is suitable for exploratory data analysis (EDA), machine learning tasks such as delay prediction, time series analysis, and airline/airport performance studies.
Monthly CSV files for January–December 2024 were downloaded from the BTS TranStats On-Time Performance database, and 35 relevant columns were selected. The monthly files were merged into a single dataset using pandas, with cleaning steps including standardizing column names to snake_case (e.g., flight_date, dep_delay), converting flight_date to ISO format (YYYY-MM-DD), converting cancelled and diverted to binary indicators (0/1), and filling missing values in delay-related columns (carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) with 0, while preserving all other values as in the original data.
Source: Available at BTS TranStats
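A hedged sketch of the merge-and-clean pipeline described above; the glob pattern `bts_2024_*.csv`, the snake_case helper, and the exact raw column forms are assumptions, and CamelCase handling is omitted:

```python
import glob
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    # Lower-case and replace separators with underscores (simplified).
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

# Merge the monthly BTS files into one DataFrame (file paths are hypothetical).
monthly = [pd.read_csv(path) for path in sorted(glob.glob("bts_2024_*.csv"))]
df = pd.concat(monthly, ignore_index=True)
df.columns = [to_snake_case(c) for c in df.columns]

# ISO dates, binary indicators, and zero-filled delay breakdowns.
df["fl_date"] = pd.to_datetime(df["fl_date"]).dt.strftime("%Y-%m-%d")
df[["cancelled", "diverted"]] = df[["cancelled", "diverted"]].astype(int)
delay_cols = ["carrier_delay", "weather_delay", "nas_delay", "security_delay", "late_aircraft_delay"]
df[delay_cols] = df[delay_cols].fillna(0)
```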
The repository contains the following files:
- flight_data_2024.csv — full cleaned dataset (~7M rows, 35 columns)
- flight_data_2024_sample.csv — sample dataset (10,000 rows)
- flight_data_2024_data_dictionary.csv — column names, data types, null percentage, and example values
- README.md — dataset overview and usage instructions
- LICENSE.txt — CC0 license
- dataset-metadata.json — Kaggle metadata for the dataset

| Column Name | Description |
|---|---|
| year | Year of flight |
| month | Month of flight (1–12) |
| day_of_month | Day of the month |
| day_of_week | Day of week (1=Monday … 7=Sunday) |
| fl_date | Flight date (YYYY-MM-DD) |
| op_unique_carrier | Unique carrier code |
| op_carrier_fl_num | Flight number for reporting airline |
| origin | Origin airport code |
| origin_city_name | Origin city name |
| origin_state_nm | Origin state name |
| dest | Destination airport code |
| dest_city_name | Destination city name |
| dest_state_nm | Destination state name |
| crs_dep_time | Scheduled departure time (local, hhmm) |
| dep_time | Actual departure time (local, hhmm) |
| dep_delay | Departure delay in minutes (negative if early) |
| taxi_out | Taxi out time in minutes |
| wheels_off | Wheels-off time (local, hhmm) |
| wheels_on | Wheels-on time (local, hhmm) |
| taxi_in | Taxi in time in minutes |
| crs_arr_time | Scheduled arrival time (local, hhmm) |
| arr_time | Actual arrival time (local, hhmm) |
| arr_delay | Arrival delay in minutes (negative if early) |
| cancelled | Cancelled flight indicator (0=No, 1=Yes) |
| cancellation_code | Reason for cancellation (if cancelled) |
| diverted | Diverted flight indicator (0=No, 1=Yes) |
| crs_elapsed_time | Scheduled elapsed time in minutes |
| actual_elapsed_time | Actual elapsed time in minutes |
| air_time | Flight time in minutes |
| distance | Distance between origin and destination (miles) |
| carrier_delay | Carrier-related delay in minutes |
| weather_delay | Weather-related delay in minutes |
| nas_delay | National Air System delay in minutes |
| security_delay | Security delay in minutes |
| late_aircraft_delay | Late aircraft delay in minutes |
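For example, using the cleaned file and the column names from the dictionary above, per-carrier delay statistics can be computed with pandas:

```python
import pandas as pd

# Mean arrival delay and cancellation rate per carrier for 2024.
df = pd.read_csv("flight_data_2024.csv", usecols=["op_unique_carrier", "arr_delay", "cancelled"])

summary = (
    df.groupby("op_unique_carrier")
      .agg(mean_arr_delay=("arr_delay", "mean"), cancel_rate=("cancelled", "mean"))
      .sort_values("mean_arr_delay", ascending=False)
)
print(summary.head(10))
```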
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:
Case Study: Employee Salary Analysis
In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.
Step 1: Data Collection and Preparation
- Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
- Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

Step 2: Data Exploration and Descriptive Statistics
- Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
- Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

Step 3: Analysis Using NumPy
- Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
- Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.

Step 4: Grouping and Aggregation
- Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

Step 5: Salary Forecasting (Optional)
- Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

Step 6: Insights and Recommendations
- Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
- Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
- Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

Tools Used:
- Pandas: For data manipulation, grouping, and descriptive analysis.
- NumPy: For numerical operations such as percentiles and correlations.
- Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
- Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
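A minimal sketch tying Steps 1–4 together, assuming a hypothetical employee_salaries.csv file with the column names used in the examples above (department, salary, years_of_experience):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: load and clean (file name is a placeholder).
df = pd.read_csv("employee_salaries.csv").drop_duplicates().dropna(subset=["salary"])

# Step 2: descriptive statistics.
print(df["salary"].describe())

# Step 3: quartiles and experience/salary correlation with NumPy.
print(np.percentile(df["salary"], [25, 50, 75]))
print(np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])

# Step 4: average salary per department.
print(df.groupby("department")["salary"].mean())

# Visual check for outliers by department.
sns.boxplot(x="department", y="salary", data=df)
plt.tight_layout()
plt.show()
```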
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset provides GIS data for administrative boundaries throughout Japan, including the names of the prefectures.
This geographic data for Japan can be combined with other statistical information to create intuitive and easy-to-understand plots.
For example, by combining this GIS data with information on the population of each prefecture, it is possible to see at a glance how many people are in any given prefecture.
GeoPandas is a Python library that extends pandas with GIS capabilities, allowing you to work with geographic data in tabular form, just like pandas.
Load the shp file with GeoPandas as follows:

```python
import geopandas as gpd

gdf = gpd.read_file("/kaggle/input/japan-national-land-numerical-data/N03-20240101_prefecture.shp")
```
This notebook also explains simple usage.
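As an example of combining the boundaries with other statistics, the sketch below joins a few illustrative prefecture population figures and draws a choropleth. The attribute name N03_001 (prefecture name) and the population numbers are assumptions for illustration; check `gdf.columns` for the actual attribute names:

```python
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("/kaggle/input/japan-national-land-numerical-data/N03-20240101_prefecture.shp")

# Illustrative population figures for a few prefectures (approximate values).
population = pd.DataFrame({
    "prefecture": ["北海道", "東京都", "大阪府"],
    "population": [5_140_000, 14_040_000, 8_780_000],
})

# Join on the prefecture-name attribute and plot a choropleth.
merged = gdf.merge(population, left_on="N03_001", right_on="prefecture", how="left")
merged.plot(column="population", legend=True, missing_kwds={"color": "lightgrey"})
```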
The folder contains several other file types besides the .shp file (such as .dbf, .shx, and .prj); all of them are required for GeoPandas to read the shapefile correctly.
The Technical Report of the Geospatial Information Authority of Japan publishes GIS data based on the National Land Numerical Information, which was used as the source for this dataset.
Source: Ministry of Land, Infrastructure, Transport and Tourism, National Land Numerical Information download site (https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-N03-2023.html)
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.
Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.
You can use Pandas Dataframe to read and manipulate this dataset.
Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and index into a tag column, the value we get back is a string rather than a list:

```python
data['tag'][0]
# "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
type(data['tag'][0])
# str
```

You can use ast.literal_eval to convert it back to a list:

```python
from ast import literal_eval

literal_eval(data['tag'][0])
# ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
type(literal_eval(data['tag'][0]))
# list
```
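To convert every row at once, one option is to apply literal_eval across the list-valued columns. This is a minimal sketch; the file name and the column names 'pos' and 'tag' are assumptions based on the description above, so adjust them to the actual CSV header:

```python
import pandas as pd
from ast import literal_eval

# Convert the stringified lists back to real Python lists for every row.
data = pd.read_csv("ner.csv")
for col in ["pos", "tag"]:
    if col in data.columns:
        data[col] = data[col].apply(literal_eval)

print(type(data["tag"][0]))  # <class 'list'>
```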
This dataset was derived from the Annotated Corpus for Named Entity Recognition by Abhinav Walia and then processed.
The Annotated Corpus for Named Entity Recognition is based on the GMB (Groningen Meaning Bank) corpus, annotated for entity classification with enhanced and popular features produced by applying natural language processing to the data set.
Essential info about entities:
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset includes YouTube trending videos statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and it's related to 19 countries:
The columns are the following:
country: the country in which the video was published.
video_id: video identification number. Each video has one; you can find it by right-clicking on a video and selecting 'Stats for nerds'.
title: title of the video.
publishedAt: publication date of the video.
channelId: identification number of the channel that published the video.
channelTitle: name of the channel that published the video.
categoryId: identification number of the video's category. Each number corresponds to a certain category; for example, 10 corresponds to the 'music' category. Check here for the complete list.
trending_date: trending date of the video.
tags: tags present in the video.
view_count: view count of the video.
comment_count: number of comments in the video.
thumbnail_link: the link of the image that appears before clicking the video.
comments_disabled: tells if the comments are disabled or not for a certain video.
ratings_disabled: tells if the rating is disabled or not for that video.
description: description below the video.
You can perform an exploratory data analysis of the dataset, working with Pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the Pandas functions.
Also, it's possible to analyze the titles, the tags and the description of the videos to search for relevant information.
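As a quick example of such an analysis, the sketch below finds the top categories by median view count per country; the file name is an assumption, while the column names follow the list above:

```python
import pandas as pd

# Hypothetical file name for the merged Mediterranean trending data.
df = pd.read_csv("trending_mediterranean_2022-11-07.csv")

# Top 3 categories per country, ranked by median view count.
top = (
    df.groupby(["country", "categoryId"])["view_count"]
      .median()
      .sort_values(ascending=False)
      .groupby(level="country")
      .head(3)
)
print(top)
```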
Remember to upvote if you found the dataset useful :).
The original data were scraped using a tool that you can find here.
Only the Mediterranean countries were considered and the datasets related to different countries were put together in one csv file, adding the country column.
The likes and dislikes columns were removed because, at the moment, they cannot be retrieved through the API.
https://github.com/mitchelljy/Trending-YouTube-Scraper https://github.com/mitchelljy/Trending-YouTube-Scraper/blob/master/LICENSE
Dataset Overview
| Attribute | Details |
|---|---|
| Time Span | 2015–2025 |
| Countries Included | 20 global economies |
| Total Records | 220 rows |
| Total Features | 12 quantitative & qualitative attributes |
| Data Type | Synthetic, statistically coherent |
| Tools Used | Python (Faker, NumPy, Pandas) |
| License | CC BY-NC 4.0 – Attribution Non-Commercial |
| Creator | Emirhan Akkuş – Kaggle Expert |
This dataset provides a macro-level simulation of how artificial intelligence and automation have transformed global workforce dynamics, productivity growth, and job distribution during the last decade. It is designed for predictive analytics, forecasting, visualization, and policy research applications.
Data Generation Process

| Step | Description |
| :-- | :-- |
| 1. Initialization | A baseline AI investment and automation rate were defined for each country (between 5–80 billion USD and 10–40%). |
| 2. Temporal Simulation | Yearly values were simulated for 2015–2025 using exponential and non-linear growth models with controlled noise. |
| 3. Correlation Modeling | Employment, productivity, and salary were dynamically linked to automation and AI investment levels. |
| 4. Randomization | Gaussian noise (±2%) was introduced to prevent perfect correlation and ensure natural variability. |
| 5. Policy Simulation | Synthetic indexes were calculated for AI readiness, policy maturity, and reskilling investment efforts. |
| 6. Export | Final data were consolidated and exported to CSV using Pandas for easy reproducibility. |
The dataset was generated to maintain internal coherence — as automation and AI investment increase, employment tends to slightly decline, productivity grows, and reskilling budgets expand proportionally.
Column Definitions

| Column | Description | Value Range / Type |
| :-- | :-- | :-- |
| Year | Observation year between 2015–2025 | Integer |
| Country | Country name | Categorical (20 unique) |
| AI_Investment_BillionUSD | Annual AI investment (in billions of USD) | Continuous (5–200) |
| Automation_Rate_Percent | Percentage of workforce automated | Continuous (10–95%) |
| Employment_Rate_Percent | Percentage of total population employed | Continuous (50–80%) |
| Average_Salary_USD | Mean annual salary in USD | Continuous (25,000–90,000) |
| Productivity_Index | Productivity score scaled 0–100 | Continuous |
| Reskilling_Investment_MillionUSD | Government/corporate reskilling investment | Continuous (100–5,000) |
| AI_Policy_Index | Policy readiness index (0–1) | Float |
| Job_Displacement_Million | Estimated number of jobs replaced by automation | Continuous (0–3 million) |
| Job_Creation_Million | New AI-driven jobs created | Continuous (0–4 million) |
| AI_Readiness_Score | Composite readiness and adoption index | Continuous (0–100) |
Each feature is designed to maintain realistic relationships between AI investments, automation, and socio-economic outcomes.
Analytical Applications

| Application Area | Example Analyses |
| :-- | :-- |
| Exploratory Data Analysis (EDA) | Study how AI investment evolves across countries, compare productivity and employment patterns, or compute correlation... |
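A small EDA sketch along those lines; the CSV file name is an assumption, while the column names follow the definitions above:

```python
import pandas as pd

# Hypothetical file name for the exported dataset.
df = pd.read_csv("global_ai_workforce_2015_2025.csv")

# Correlation between AI investment, automation, employment, and productivity.
cols = ["AI_Investment_BillionUSD", "Automation_Rate_Percent",
        "Employment_Rate_Percent", "Productivity_Index"]
print(df[cols].corr().round(2))

# Global trend: mean automation rate per year.
print(df.groupby("Year")["Automation_Rate_Percent"].mean().round(1))
```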
By Huggingface Hub [source]
This dataset was created to allow researchers to gain an in-depth understanding of the inner workings of human-generated movie reviews. With these train, test, and validation sets, researchers can explore different aspects of movie reviews, such as sentiment labels or the rationales behind them. By analyzing this information and finding patterns and correlations, insights can be discovered that may lead to models powerful enough to capture the importance of unique human perspectives when interpreting movie reviews. Any data scientist or researcher interested in AI applications is encouraged to take advantage of this dataset, which may provide useful insights into better understanding user intent when reviewing movies.
This dataset is intended to enable researchers and developers to uncover the rationales behind movie reviews. To use it effectively, you must understand the data format and how each column in the dataset works.
What does each column mean?
- review: The text of the movie review. (String)
- label: The sentiment label of the review (Positive, Negative, or Neutral). (String)
- validation.csv: The validation set which contains reviews, labels, and evidence which can be used to validate models developed for understanding human perspective on movie reviews.
- train.csv: The train set which contains reviews, labels as well as evidence used for training a model based on human annotations of movie reviews.
- test.csv: The test set which contains reviews, labels, and evidence that can be used to evaluate models on unseen data related to understanding human perspectives on movie reviews.
How do I use this dataset?
To get started with this dataset you need a working environment such as Python or R with access to the libraries needed for natural language processing (NLP). After setting up an environment with libraries that support NLP tasks, execute the following steps:

Import the CSV files into your workspace using the appropriate functions provided by your language's libraries, e.g., for Python use the pandas read_csv() method.

Preprocess the text data in the 'review' and 'label' columns by standardizing it, for example removing stopwords and converting words to lowercase; the linked resources list the preprocessing libraries available in Python.

Train and test ML algorithms using appropriate NLP feature extraction techniques (Bag of Words, TF-IDF, and Word2Vec are some examples; many more are available).

Measure performance after running experiments on the provided validation and test sets; we have also included precision-recall curves along with well-known metrics such as F1 score and accuracy, so you can easily analyze hyperparameter tuning and algorithm efficiency from the values you get while testing your ML algorithm. A minimal baseline following these steps is sketched after this section.

Recommendation systems are always fun! You could also build a simple machine-learning recommendation system by collecting user visit logs and hand-crafting new features.
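As a concrete starting point for the steps above, here is a minimal TF-IDF plus logistic regression baseline (one of the suggested feature-extraction options, not the authors' reference method), using the documented 'review' and 'label' columns of train.csv and test.csv:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# TF-IDF features feeding a simple linear classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(train["review"], train["label"])
print(classification_report(test["label"], model.predict(test["review"])))
```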
- Developing an automated movie review summarizer based on user ratings, that can accurately capture the salient points of a review and summarize it for moviegoers.
- Training a model to predict the sentiment of a review, by combining machine learning models with human-annotated rationales from this dataset.
- Building an AI system that can detect linguistic markers of deception in reviews (e.g., 'fake news', thin reviews etc) and issue warnings on possible fraudulent purchases or online reviews
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
🎯 1. Define the Goal
Ask yourself: what do you want to do with the data?
Examples:
📊 Analyze sales, profit, and inventory
🧠 Predict car prices based on features
🧾 Build a car showroom management system (SQL/Flask)
🖥️ Create a dashboard showing cars, sales, and customers
Tools You Can Use

| Goal | Tools |
| --- | --- |
| Data Creation | Excel / Python (Pandas) |
| Database | MySQL / SQLite / PostgreSQL |
| Dashboard | Power BI / Tableau / Streamlit / Flask |
| ML Models | scikit-learn (e.g., car price prediction) |
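If the goal is car price prediction, a scikit-learn sketch like the one below can serve as a starting point. Everything here is hypothetical: the cars.csv file and the brand, year, mileage, and price columns are placeholders for whatever dataset you create:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Placeholder data: replace file and column names with your own.
df = pd.read_csv("cars.csv")
X, y = df[["brand", "year", "mileage"]], df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical brand column, pass numeric columns through.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("brand", OneHotEncoder(handle_unknown="ignore"), ["brand"])],
        remainder="passthrough",
    )),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)
print("R2 on held-out data:", round(model.score(X_test, y_test), 3))
```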