License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information, see https://doi.org/10.1016/j.cose.2024.104290.
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we deployed multiple agents and soluble agents within an infrastructure of more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods used to build the ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviours listed in the summary table, a set of time windows (TWs) is computed, each aggregating the user's behaviour within a fixed time interval. These TWs serve as the basis for training and evaluating various supervised and unsupervised methods.
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioural profiles. The indicators described in the TW are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.
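As an illustration of the TW representation (not the exact RBD24 pipeline, whose features are already precomputed in the Parquet files), here is a minimal pandas sketch that aggregates hypothetical log records into fixed one-hour windows per user; the column names are invented for the example:

```python
import pandas as pd

# Hypothetical raw log records; the real RBD24 logs come from HTTP, DNS, SSL, and SMTP sources.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 10:02", "2024-01-01 10:17", "2024-01-01 11:05"]),
    "user": ["u1", "u1", "u1"],
    "protocol": ["HTTP", "DNS", "SSL"],
    "bytes": [1200, 300, 5400],
})

# Aggregate each user's activity into fixed 1-hour time windows.
tw = (
    logs.set_index("timestamp")
        .groupby(["user", pd.Grouper(freq="1h")])
        .agg(events=("protocol", "count"), total_bytes=("bytes", "sum"))
        .reset_index()
)
print(tw)
```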
Parquet is a columnar storage format that offers good compression and efficiency, making it suitable for large datasets and complex analytical tasks. It is supported across many tools and languages, including Python: pandas can read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats; compared to row-based formats such as CSV, Parquet's columnar layout greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here is an example of how to load the data with pandas; make sure the `fastparquet` library is installed:
```python
import pandas as pd

# Read a Parquet file with the fastparquet engine
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
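The `pyarrow` engine works as well, and `read_parquet` also accepts a `columns` argument to limit memory usage on the larger files. A small sketch using one of the file names from the table above (check `df.columns` for the actual schema before selecting columns):

```python
import pandas as pd

# Read with pyarrow; pass a column list once you know the schema, e.g. columns=["some_feature", "label"]
df = pd.read_parquet("Crypto_desktop.parquet", engine="pyarrow", columns=None)
print(df.shape)
print(df.dtypes.head())
```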
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Aspect | Details | Notes |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze
print(df.groupby('cluster').mean())
```
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
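For items 2 and 5, a minimal sketch with scikit-learn's PCA and the silhouette score mentioned later in this description; it reuses the column names from the starter code above (if the file contains other non-numeric columns, such as a region label, drop those as well):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Compare candidate cluster counts with the silhouette score.
for k in range(3, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))

# Project onto two principal components and plot the 5-cluster solution.
coords = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=20)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```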
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset consists of 10,000,000 samples with 50 numerical features. Each feature has been randomly generated using a uniform distribution between 0 and 1. To add complexity, a hidden structure has been introduced in some of the features. Specifically, Feature 2 is related to Feature 1, making it a good candidate for regression analysis tasks. The other features remain purely random, allowing for the exploration of feature engineering and random data generation techniques.
This hidden structure allows you to test models on data where a simple pattern (between Feature 1 and Feature 2) exists, but with noise that can challenge more advanced models in finding the relationship.
| Feature Name | Description |
|---|---|
| feature_1 | Random number (0–1, uniform) |
| feature_2 | 2 × feature_1 + small noise (N(0, 0.05)) |
| feature_3–50 | Independent random numbers (0–1) |
This dataset is versatile and can be used for various machine learning tasks, including regression analysis on the hidden feature_1/feature_2 relationship, feature selection and feature engineering experiments, and testing how models behave on mostly random, noisy data.
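Because the generative process is fully described above, a small replica is enough to verify the hidden structure. The sketch below regenerates 10,000 rows following the same recipe and recovers a slope close to 2 with NumPy (the full dataset has 10,000,000 rows):

```python
import numpy as np

# Small-scale replica of the described generative process.
rng = np.random.default_rng(0)
n = 10_000
feature_1 = rng.uniform(0, 1, n)
feature_2 = 2 * feature_1 + rng.normal(0, 0.05, n)
others = rng.uniform(0, 1, (n, 48))  # feature_3 ... feature_50, purely random

# A least-squares fit of feature_2 on feature_1 should recover a slope near 2.
slope, intercept = np.polyfit(feature_1, feature_2, 1)
print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```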
This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains detailed flight performance and delay information for domestic flights in 2024, merged from monthly BTS TranStats files into a single cleaned dataset. It includes over 7 million rows and 35 columns, providing comprehensive information on scheduled and actual flight times, delays, cancellations, diversions, and distances between airports. The dataset is suitable for exploratory data analysis (EDA), machine learning tasks such as delay prediction, time series analysis, and airline/airport performance studies.
Monthly CSV files for January–December 2024 were downloaded from the BTS TranStats On-Time Performance database, and 35 relevant columns were selected. The monthly files were merged into a single dataset using pandas, with cleaning steps including standardizing column names to snake_case (e.g., flight_date, dep_delay), converting flight_date to ISO format (YYYY-MM-DD), converting cancelled and diverted to binary indicators (0/1), and filling missing values in delay-related columns (carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) with 0, while preserving all other values as in the original data.
Source: Available at BTS TranStats
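A hedged sketch of the merge-and-clean pipeline described above; the glob pattern `bts_2024_*.csv`, the snake_case helper, and the exact raw column forms are assumptions, and CamelCase handling is omitted:

```python
import glob
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    # Lower-case and replace separators with underscores (simplified).
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

# Merge the monthly BTS files into one DataFrame (file paths are hypothetical).
monthly = [pd.read_csv(path) for path in sorted(glob.glob("bts_2024_*.csv"))]
df = pd.concat(monthly, ignore_index=True)
df.columns = [to_snake_case(c) for c in df.columns]

# ISO dates, binary indicators, and zero-filled delay breakdowns.
df["fl_date"] = pd.to_datetime(df["fl_date"]).dt.strftime("%Y-%m-%d")
df[["cancelled", "diverted"]] = df[["cancelled", "diverted"]].astype(int)
delay_cols = ["carrier_delay", "weather_delay", "nas_delay", "security_delay", "late_aircraft_delay"]
df[delay_cols] = df[delay_cols].fillna(0)
```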
The repository contains the following files:
- flight_data_2024.csv — full cleaned dataset (~7M rows, 35 columns)
- flight_data_2024_sample.csv — sample dataset (10,000 rows)
- flight_data_2024_data_dictionary.csv — column names, data types, null percentage, and example values
- README.md — dataset overview and usage instructions
- LICENSE.txt — CC0 license
- dataset-metadata.json — Kaggle metadata for the dataset

| Column Name | Description |
|---|---|
| year | Year of flight |
| month | Month of flight (1–12) |
| day_of_month | Day of the month |
| day_of_week | Day of week (1=Monday … 7=Sunday) |
| fl_date | Flight date (YYYY-MM-DD) |
| op_unique_carrier | Unique carrier code |
| op_carrier_fl_num | Flight number for reporting airline |
| origin | Origin airport code |
| origin_city_name | Origin city name |
| origin_state_nm | Origin state name |
| dest | Destination airport code |
| dest_city_name | Destination city name |
| dest_state_nm | Destination state name |
| crs_dep_time | Scheduled departure time (local, hhmm) |
| dep_time | Actual departure time (local, hhmm) |
| dep_delay | Departure delay in minutes (negative if early) |
| taxi_out | Taxi out time in minutes |
| wheels_off | Wheels-off time (local, hhmm) |
| wheels_on | Wheels-on time (local, hhmm) |
| taxi_in | Taxi in time in minutes |
| crs_arr_time | Scheduled arrival time (local, hhmm) |
| arr_time | Actual arrival time (local, hhmm) |
| arr_delay | Arrival delay in minutes (negative if early) |
| cancelled | Cancelled flight indicator (0=No, 1=Yes) |
| cancellation_code | Reason for cancellation (if cancelled) |
| diverted | Diverted flight indicator (0=No, 1=Yes) |
| crs_elapsed_time | Scheduled elapsed time in minutes |
| actual_elapsed_time | Actual elapsed time in minutes |
| air_time | Flight time in minutes |
| distance | Distance between origin and destination (miles) |
| carrier_delay | Carrier-related delay in minutes |
| weather_delay | Weather-related delay in minutes |
| nas_delay | National Air System delay in minutes |
| security_delay | Security delay in minutes |
| late_aircraft_delay | Late aircraft delay in minutes |
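For example, using the cleaned file and the column names from the dictionary above, per-carrier delay statistics can be computed with pandas:

```python
import pandas as pd

# Mean arrival delay and cancellation rate per carrier for 2024.
df = pd.read_csv("flight_data_2024.csv", usecols=["op_unique_carrier", "arr_delay", "cancelled"])

summary = (
    df.groupby("op_unique_carrier")
      .agg(mean_arr_delay=("arr_delay", "mean"), cancel_rate=("cancelled", "mean"))
      .sort_values("mean_arr_delay", ascending=False)
)
print(summary.head(10))
```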
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:
Case Study: Employee Salary Analysis
In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.
Step 1: Data Collection and Preparation
- Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
- Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

Step 2: Data Exploration and Descriptive Statistics
- Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
- Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

Step 3: Analysis Using NumPy
- Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
- Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.

Step 4: Grouping and Aggregation
- Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

Step 5: Salary Forecasting (Optional)
- Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

Step 6: Insights and Recommendations
- Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
- Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
- Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

Tools Used:
- Pandas: For data manipulation, grouping, and descriptive analysis.
- NumPy: For numerical operations such as percentiles and correlations.
- Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
- Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
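A minimal sketch tying Steps 1–4 together, assuming a hypothetical employee_salaries.csv file with the column names used in the examples above (department, salary, years_of_experience):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: load and clean (file name is a placeholder).
df = pd.read_csv("employee_salaries.csv").drop_duplicates().dropna(subset=["salary"])

# Step 2: descriptive statistics.
print(df["salary"].describe())

# Step 3: quartiles and experience/salary correlation with NumPy.
print(np.percentile(df["salary"], [25, 50, 75]))
print(np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])

# Step 4: average salary per department.
print(df.groupby("department")["salary"].mean())

# Visual check for outliers by department.
sns.boxplot(x="department", y="salary", data=df)
plt.tight_layout()
plt.show()
```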
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset provides GIS data for administrative boundaries throughout Japan, including the names of the prefectures.
This geographic data for Japan can be combined with other statistical information to create intuitive and easy-to-understand plots.
For example, by combining this GIS data with information on the population of each prefecture, it is possible to see at a glance how many people are in any given prefecture.
GeoPandas is a Python library that extends pandas with GIS capabilities, allowing you to work with geographic data in tabular form, just like pandas.
Load the shp file with GeoPandas as follows:

```python
import geopandas as gpd

gdf = gpd.read_file("/kaggle/input/japan-national-land-numerical-data/N03-20240101_prefecture.shp")
```
This notebook also explains simple usage.
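As an example of combining the boundaries with other statistics, the sketch below joins a few illustrative prefecture population figures and draws a choropleth. The attribute name N03_001 (prefecture name) and the population numbers are assumptions for illustration; check `gdf.columns` for the actual attribute names:

```python
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("/kaggle/input/japan-national-land-numerical-data/N03-20240101_prefecture.shp")

# Illustrative population figures for a few prefectures (approximate values).
population = pd.DataFrame({
    "prefecture": ["北海道", "東京都", "大阪府"],
    "population": [5_140_000, 14_040_000, 8_780_000],
})

# Join on the prefecture-name attribute and plot a choropleth.
merged = gdf.merge(population, left_on="N03_001", right_on="prefecture", how="left")
merged.plot(column="population", legend=True, missing_kwds={"color": "lightgrey"})
```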
The folder contains several other file types besides the .shp file (such as .dbf, .shx, and .prj); all of them are required for GeoPandas to read the shapefile correctly.
The Technical Report of the Geospatial Information Authority of Japan publishes GIS data based on the National Land Numerical Information, which was used as the source for this dataset.
Source: Ministry of Land, Infrastructure, Transport and Tourism, National Land Numerical Information download site (https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-N03-2023.html)
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.
Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.
You can use Pandas Dataframe to read and manipulate this dataset.
Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and index into a tag column, the value we get back is a string rather than a list:

```python
data['tag'][0]
# "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
type(data['tag'][0])
# str
```

You can use ast.literal_eval to convert it back to a list:

```python
from ast import literal_eval

literal_eval(data['tag'][0])
# ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
type(literal_eval(data['tag'][0]))
# list
```
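To convert every row at once, one option is to apply literal_eval across the list-valued columns. This is a minimal sketch; the file name and the column names 'pos' and 'tag' are assumptions based on the description above, so adjust them to the actual CSV header:

```python
import pandas as pd
from ast import literal_eval

# Convert the stringified lists back to real Python lists for every row.
data = pd.read_csv("ner.csv")
for col in ["pos", "tag"]:
    if col in data.columns:
        data[col] = data[col].apply(literal_eval)

print(type(data["tag"][0]))  # <class 'list'>
```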
This dataset was derived from the Annotated Corpus for Named Entity Recognition by Abhinav Walia and then processed.
The Annotated Corpus for Named Entity Recognition is based on the GMB (Groningen Meaning Bank) corpus, annotated for entity classification with enhanced and popular features produced by applying natural language processing to the data set.
Essential info about entities:
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset includes YouTube trending videos statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and it's related to 19 countries:
The columns are the following:
country: the country in which the video was published.
video_id: video identification number. Each video has one; you can find it by right-clicking on a video and selecting 'Stats for nerds'.
title: title of the video.
publishedAt: publication date of the video.
channelId: identification number of the channel that published the video.
channelTitle: name of the channel that published the video.
categoryId: identification number of the video's category. Each number corresponds to a certain category; for example, 10 corresponds to the 'music' category. Check here for the complete list.
trending_date: trending date of the video.
tags: tags present in the video.
view_count: view count of the video.
comment_count: number of comments in the video.
thumbnail_link: the link of the image that appears before clicking the video.
comments_disabled: tells if the comments are disabled or not for a certain video.
ratings_disabled: tells if the rating is disabled or not for that video.
description: description below the video.
You can perform an exploratory data analysis of the dataset, working with Pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the Pandas functions.
Also, it's possible to analyze the titles, the tags and the description of the videos to search for relevant information.
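As a quick example of such an analysis, the sketch below finds the top categories by median view count per country; the file name is an assumption, while the column names follow the list above:

```python
import pandas as pd

# Hypothetical file name for the merged Mediterranean trending data.
df = pd.read_csv("trending_mediterranean_2022-11-07.csv")

# Top 3 categories per country, ranked by median view count.
top = (
    df.groupby(["country", "categoryId"])["view_count"]
      .median()
      .sort_values(ascending=False)
      .groupby(level="country")
      .head(3)
)
print(top)
```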
Remember to upvote if you found the dataset useful :).
The original data were scraped using a tool that you can find here.
Only the Mediterranean countries were considered and the datasets related to different countries were put together in one csv file, adding the country column.
The likes and dislikes columns were removed because, at the moment, they cannot be retrieved through the API.
https://github.com/mitchelljy/Trending-YouTube-Scraper https://github.com/mitchelljy/Trending-YouTube-Scraper/blob/master/LICENSE
Dataset Overview
| Attribute | Details |
|---|---|
| Time Span | 2015–2025 |
| Countries Included | 20 global economies |
| Total Records | 220 rows |
| Total Features | 12 quantitative & qualitative attributes |
| Data Type | Synthetic, statistically coherent |
| Tools Used | Python (Faker, NumPy, Pandas) |
| License | CC BY-NC 4.0 – Attribution Non-Commercial |
| Creator | Emirhan Akkuş – Kaggle Expert |
This dataset provides a macro-level simulation of how artificial intelligence and automation have transformed global workforce dynamics, productivity growth, and job distribution during the last decade. It is designed for predictive analytics, forecasting, visualization, and policy research applications.
Data Generation Process

| Step | Description |
| :-- | :-- |
| 1. Initialization | A baseline AI investment and automation rate were defined for each country (between 5–80 billion USD and 10–40%). |
| 2. Temporal Simulation | Yearly values were simulated for 2015–2025 using exponential and non-linear growth models with controlled noise. |
| 3. Correlation Modeling | Employment, productivity, and salary were dynamically linked to automation and AI investment levels. |
| 4. Randomization | Gaussian noise (±2%) was introduced to prevent perfect correlation and ensure natural variability. |
| 5. Policy Simulation | Synthetic indexes were calculated for AI readiness, policy maturity, and reskilling investment efforts. |
| 6. Export | Final data were consolidated and exported to CSV using Pandas for easy reproducibility. |
The dataset was generated to maintain internal coherence — as automation and AI investment increase, employment tends to slightly decline, productivity grows, and reskilling budgets expand proportionally.
Column Definitions

| Column | Description | Value Range / Type |
| :-- | :-- | :-- |
| Year | Observation year between 2015–2025 | Integer |
| Country | Country name | Categorical (20 unique) |
| AI_Investment_BillionUSD | Annual AI investment (in billions of USD) | Continuous (5–200) |
| Automation_Rate_Percent | Percentage of workforce automated | Continuous (10–95%) |
| Employment_Rate_Percent | Percentage of total population employed | Continuous (50–80%) |
| Average_Salary_USD | Mean annual salary in USD | Continuous (25,000–90,000) |
| Productivity_Index | Productivity score scaled 0–100 | Continuous |
| Reskilling_Investment_MillionUSD | Government/corporate reskilling investment | Continuous (100–5,000) |
| AI_Policy_Index | Policy readiness index (0–1) | Float |
| Job_Displacement_Million | Estimated number of jobs replaced by automation | Continuous (0–3 million) |
| Job_Creation_Million | New AI-driven jobs created | Continuous (0–4 million) |
| AI_Readiness_Score | Composite readiness and adoption index | Continuous (0–100) |
Each feature is designed to maintain realistic relationships between AI investments, automation, and socio-economic outcomes.
Analytical Applications

| Application Area | Example Analyses |
| :-- | :-- |
| Exploratory Data Analysis (EDA) | Study how AI investment evolves across countries, compare productivity and employment patterns, or compute correlation... |
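A small EDA sketch along those lines; the CSV file name is an assumption, while the column names follow the definitions above:

```python
import pandas as pd

# Hypothetical file name for the exported dataset.
df = pd.read_csv("global_ai_workforce_2015_2025.csv")

# Correlation between AI investment, automation, employment, and productivity.
cols = ["AI_Investment_BillionUSD", "Automation_Rate_Percent",
        "Employment_Rate_Percent", "Productivity_Index"]
print(df[cols].corr().round(2))

# Global trend: mean automation rate per year.
print(df.groupby("Year")["Automation_Rate_Percent"].mean().round(1))
```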
By Huggingface Hub [source]
This dataset was created to allow researchers to gain an in-depth understanding of the inner workings of human-generated movie reviews. With these train, test, and validation sets, researchers can explore different aspects of movie reviews, such as sentiment labels or the rationales behind them. By analyzing this information and finding patterns and correlations, insights can be discovered that may lead to models powerful enough to capture the importance of unique human perspectives when interpreting movie reviews. Any data scientist or researcher interested in AI applications is encouraged to take advantage of this dataset, which may provide useful insights into better understanding user intent when reviewing movies.
This dataset is intended to enable researchers and developers to uncover the rationales behind movie reviews. To use it effectively, you must understand the data format and how each column in the dataset works.
What does each column mean?
- review: The text of the movie review. (String)
- label: The sentiment label of the review (Positive, Negative, or Neutral). (String)
- validation.csv: The validation set which contains reviews, labels, and evidence which can be used to validate models developed for understanding human perspective on movie reviews.
- train.csv: The train set which contains reviews, labels as well as evidence used for training a model based on human annotations of movie reviews.
- test.csv: The test set which contains reviews, labels, and evidence that can be used to evaluate models on unseen data related to understanding human perspectives on movie reviews.
How do I use this dataset?
To get started with this dataset you need a working environment such as Python or R with access to the libraries needed for natural language processing (NLP). After setting up an environment with libraries that support NLP tasks, execute the following steps:

Import the CSV files into your workspace using the appropriate functions provided by your language's libraries, e.g., for Python use the pandas read_csv() method.

Preprocess the text data in the 'review' and 'label' columns by standardizing it, for example removing stopwords and converting words to lowercase; the linked resources list the preprocessing libraries available in Python.

Train and test ML algorithms using appropriate NLP feature extraction techniques (Bag of Words, TF-IDF, and Word2Vec are some examples; many more are available).

Measure performance after running experiments on the provided validation and test sets; we have also included precision-recall curves along with well-known metrics such as F1 score and accuracy, so you can easily analyze hyperparameter tuning and algorithm efficiency from the values you get while testing your ML algorithm. A minimal baseline following these steps is sketched after this section.

Recommendation systems are always fun! You could also build a simple machine-learning recommendation system by collecting user visit logs and hand-crafting new features.
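As a concrete starting point for the steps above, here is a minimal TF-IDF plus logistic regression baseline (one of the suggested feature-extraction options, not the authors' reference method), using the documented 'review' and 'label' columns of train.csv and test.csv:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# TF-IDF features feeding a simple linear classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(train["review"], train["label"])
print(classification_report(test["label"], model.predict(test["review"])))
```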
- Developing an automated movie review summarizer based on user ratings, that can accurately capture the salient points of a review and summarize it for moviegoers.
- Training a model to predict the sentiment of a review, by combining machine learning models with human-annotated rationales from this dataset.
- Building an AI system that can detect linguistic markers of deception in reviews (e.g., 'fake news', thin reviews etc) and issue warnings on possible fraudulent purchases or online reviews
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
🎯 1. Define the Goal
Ask yourself: what do you want to do with the data?
Examples:
📊 Analyze sales, profit, and inventory
🧠 Predict car prices based on features
🧾 Build a car showroom management system (SQL/Flask)
🖥️ Create a dashboard showing cars, sales, and customers
Tools You Can Use

| Goal | Tools |
| --- | --- |
| Data Creation | Excel / Python (Pandas) |
| Database | MySQL / SQLite / PostgreSQL |
| Dashboard | Power BI / Tableau / Streamlit / Flask |
| ML Models | scikit-learn (e.g., car price prediction) |
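If the goal is car price prediction, a scikit-learn sketch like the one below can serve as a starting point. Everything here is hypothetical: the cars.csv file and the brand, year, mileage, and price columns are placeholders for whatever dataset you create:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Placeholder data: replace file and column names with your own.
df = pd.read_csv("cars.csv")
X, y = df[["brand", "year", "mileage"]], df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical brand column, pass numeric columns through.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("brand", OneHotEncoder(handle_unknown="ignore"), ["brand"])],
        remainder="passthrough",
    )),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)
print("R2 on held-out data:", round(model.score(X_test, y_test), 3))
```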