11 datasets found
  1. RBD24 - Risk Activities Dataset 2024

    • zenodo.org
    bin
    Updated Mar 4, 2025
    Cite
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime (2025). RBD24 - Risk Activities Dataset 2024 [Dataset]. http://doi.org/10.5281/zenodo.13787591
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.

    This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information, see https://doi.org/10.1016/j.cose.2024.104290

    Summary of the Datasets

    The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.

    Dataset | Entity Id | Observed Behaviour | Groundtruth | Sample Shape
    Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343
    Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956
    OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820
    OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639
    OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458
    OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930
    P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892
    P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688
    NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943
    NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434
    Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072
    Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968

    Methodology

    To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
    more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
    ground truth are as follows:

    - Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
    - IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.

    For each user exposed to the behaviors listed in the summary table, a set of time windows (TW) is computed, each aggregating
    user behavior within a fixed time interval. These TWs serve as the basis for training various supervised
    and unsupervised methods.
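
    As a rough illustration of the idea (not the authors' actual pipeline), the sketch below aggregates toy log events into fixed 30-minute windows per user with pandas; the column names (`user`, `timestamp`, `bytes_sent`) and the window length are assumptions for illustration only.

    ```python
    import pandas as pd

    # Toy log events; the real RBD24 pipeline aggregates HTTP, DNS, SSL and
    # SMTP logs, and these column names are hypothetical.
    logs = pd.DataFrame({
        "user": ["u1", "u1", "u2", "u1"],
        "timestamp": pd.to_datetime([
            "2024-06-01 10:02", "2024-06-01 10:17",
            "2024-06-01 10:05", "2024-06-01 11:40",
        ]),
        "bytes_sent": [1200, 300, 560, 90],
    })

    # Aggregate each user's activity into fixed 30-minute time windows (TW).
    tw = (
        logs.set_index("timestamp")
            .groupby("user")
            .resample("30min")["bytes_sent"]
            .agg(["count", "sum"])
            .reset_index()
    )
    print(tw)
    ```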

    Sample Representation

    The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two
    timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the
    construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated,
    interpretable features designed to describe device-level properties within the specified time frame. The most
    influential features are described below.

    • User: A unique hash value that identifies a user.
    • Timestamp: The timestamp of the window.
    • Features: The aggregated behavioral indicators computed over the window.
    • Label: 1 if the user exhibits compromised behavior, 0 otherwise. -1 indicates a TW with an unknown label.

    Dataset Format

    Parquet is a columnar storage format, which enhances efficiency and compression and makes it suitable for large datasets and complex analytical tasks. It is supported across various tools and languages, including Python: pandas can read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here's an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:

    ```python
    import pandas as pd

    # Read a Parquet file into a DataFrame using the fastparquet engine
    df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')

    ```

  2. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Explore at:
    Available download formats: zip (11274 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description


    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    Feature | Description | Range
    10 Features | Economic, environmental & social indicators | Realistically scaled
    300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions
    Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready
    No Missing Values | Clean, preprocessed data | Ready for analysis
    4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # Load and prepare
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)
    
    # Cluster
    kmeans = KMeans(n_clusters=5, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)
    
    # Analyze
    print(df.groupby('cluster').mean(numeric_only=True))  # numeric_only skips the text columns
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:
    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization (see the sketch below)
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics
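
    Building on the quick-start snippet above, one possible approach to outcome 2 is to project the scaled features onto two principal components and color the points by cluster. This is an illustrative sketch, not part of the dataset's starter code; it assumes `X_scaled` and `df['cluster']` from the quick-start example.

    ```python
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    # Project the scaled feature matrix onto two principal components.
    coords = PCA(n_components=2).fit_transform(X_scaled)

    # Color each city by the cluster assigned in the quick-start snippet.
    plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], cmap='tab10', s=20)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('City clusters in PCA space')
    plt.show()
    ```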

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    Cluster | Characteristics | Example Cities
    Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore
    Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities
    Developing Centers | Mid income, high density, poor air | Emerging markets
    Low-Income Suburban | Low infrastructure, income | Rural areas
    Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:
    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  3. 10 Million Number Dataset

    • kaggle.com
    zip
    Updated Apr 28, 2025
    Cite
    Mehedi Hasand1497 (2025). 10 Million Number Dataset [Dataset]. https://www.kaggle.com/datasets/mehedihasand1497/10-million-random-number-dataset-for-ml/data
    Explore at:
    Available download formats: zip (2285635720 bytes)
    Dataset updated
    Apr 28, 2025
    Authors
    Mehedi Hasand1497
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset: Random Data with Hidden Structure

    This dataset consists of 10,000,000 samples with 50 numerical features. Each feature has been randomly generated using a uniform distribution between 0 and 1. To add complexity, a hidden structure has been introduced in some of the features. Specifically, Feature 2 is related to Feature 1, making it a good candidate for regression analysis tasks. The other features remain purely random, allowing for the exploration of feature engineering and random data generation techniques.

    Key Features and Structure

    • Feature 1: A random number drawn from a uniform distribution between 0 and 1.
    • Feature 2: A function of Feature 1, specifically Feature 2 ≈ 2 × Feature 1 + small Gaussian noise (N(0, 0.05)). This introduces a hidden linear relationship with a small amount of noise for added realism.
    • Features 3 to 50: Independent random numbers generated between 0 and 1, with no relationship to each other or any other features.

    This hidden structure allows you to test models on data where a simple pattern (between Feature 1 and Feature 2) exists, but with noise that can challenge more advanced models in finding the relationship.
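
    As a quick sanity check of that hidden structure, the sketch below fits a least-squares line of feature_2 on feature_1 and should recover a slope near 2 and an intercept near 0. The CSV file name is a placeholder, and loading only the two relevant columns keeps memory use manageable for the 5+ GB file.

    ```python
    import numpy as np
    import pandas as pd

    # File name is a placeholder; load only the two related columns.
    df = pd.read_csv("random_dataset.csv", usecols=["feature_1", "feature_2"])

    # Least-squares fit: expect slope ~2, intercept ~0, noise sigma ~0.05.
    slope, intercept = np.polyfit(df["feature_1"], df["feature_2"], deg=1)
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
    print("correlation:", df["feature_1"].corr(df["feature_2"]))
    ```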

    Dataset Overview

    Feature Name | Description
    feature_1 | Random number (0–1, uniform)
    feature_2 | 2 × feature_1 + small noise (N(0, 0.05))
    feature_3–50 | Independent random numbers (0–1)
    • Rows: 10,000,000
    • Columns: 50
    • Format: CSV
    • File Size: 5.32 GB

    Intended Uses

    This dataset is versatile and can be used for various machine learning tasks, including:

    • Testing and benchmarking machine learning models: Evaluate model performance on large, randomly generated datasets.
    • Regression analysis practice: The relationship between Feature 1 and Feature 2 makes it ideal for testing regression models.
    • Feature engineering experiments: Explore techniques for selecting, transforming, or creating new features.
    • Random data generation research: Investigate methods for generating synthetic data and its applications.
    • Large-scale data processing testing: Test frameworks such as Pandas, Dask, and Spark for processing large datasets.

    Licensing

    This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.

    Learn more about the license here

  4. Flight Delay Dataset — 2024

    • kaggle.com
    zip
    Updated Sep 21, 2025
    Cite
    Hrishit Patil (2025). Flight Delay Dataset — 2024 [Dataset]. https://www.kaggle.com/datasets/hrishitpatil/flight-data-2024
    Explore at:
    Available download formats: zip (283545854 bytes)
    Dataset updated
    Sep 21, 2025
    Authors
    Hrishit Patil
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Flight Delay Dataset — 2024

    Description

    This dataset contains detailed flight performance and delay information for domestic flights in 2024, merged from monthly BTS TranStats files into a single cleaned dataset. It includes over 7 million rows and 35 columns, providing comprehensive information on scheduled and actual flight times, delays, cancellations, diversions, and distances between airports. The dataset is suitable for exploratory data analysis (EDA), machine learning tasks such as delay prediction, time series analysis, and airline/airport performance studies.

    Monthly CSV files for January–December 2024 were downloaded from the BTS TranStats On-Time Performance database, and 35 relevant columns were selected. The monthly files were merged into a single dataset using pandas, with cleaning steps including standardizing column names to snake_case (e.g., flight_date, dep_delay), converting flight_date to ISO format (YYYY-MM-DD), converting cancelled and diverted to binary indicators (0/1), and filling missing values in delay-related columns (carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay) with 0, while preserving all other values as in the original data.
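
    A minimal sketch of that kind of cleaning is shown below. It assumes a single raw monthly BTS file whose column names match those in the column table after lowercasing; the raw names vary by download, so treat this as an illustration rather than the author's actual script.

    ```python
    import pandas as pd

    # One raw monthly BTS file (file name is a placeholder).
    df = pd.read_csv("bts_ontime_2024_01.csv")

    # Standardize column names to snake_case.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # ISO flight date and 0/1 indicators for cancelled/diverted flights.
    df["fl_date"] = pd.to_datetime(df["fl_date"]).dt.strftime("%Y-%m-%d")
    df[["cancelled", "diverted"]] = df[["cancelled", "diverted"]].astype(int)

    # Fill missing values in the delay-breakdown columns with 0.
    delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
                  "security_delay", "late_aircraft_delay"]
    df[delay_cols] = df[delay_cols].fillna(0)
    ```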

    Source: Available at BTS TranStats

    File Description

    • flight_data_2024.csv — full cleaned dataset (~7M rows, 35 columns)
    • flight_data_2024_sample.csv — sample dataset (10,000 rows)
    • flight_data_2024_data_dictionary.csv — column names, data types, null percentage, and example values
    • README.md — dataset overview and usage instructions
    • LICENSE.txt — CC0 license
    • dataset-metadata.json — Kaggle metadata for the dataset

    Column Description

    Column Name | Description
    year | Year of flight
    month | Month of flight (1–12)
    day_of_month | Day of the month
    day_of_week | Day of week (1=Monday … 7=Sunday)
    fl_date | Flight date (YYYY-MM-DD)
    op_unique_carrier | Unique carrier code
    op_carrier_fl_num | Flight number for reporting airline
    origin | Origin airport code
    origin_city_name | Origin city name
    origin_state_nm | Origin state name
    dest | Destination airport code
    dest_city_name | Destination city name
    dest_state_nm | Destination state name
    crs_dep_time | Scheduled departure time (local, hhmm)
    dep_time | Actual departure time (local, hhmm)
    dep_delay | Departure delay in minutes (negative if early)
    taxi_out | Taxi-out time in minutes
    wheels_off | Wheels-off time (local, hhmm)
    wheels_on | Wheels-on time (local, hhmm)
    taxi_in | Taxi-in time in minutes
    crs_arr_time | Scheduled arrival time (local, hhmm)
    arr_time | Actual arrival time (local, hhmm)
    arr_delay | Arrival delay in minutes (negative if early)
    cancelled | Cancelled flight indicator (0=No, 1=Yes)
    cancellation_code | Reason for cancellation (if cancelled)
    diverted | Diverted flight indicator (0=No, 1=Yes)
    crs_elapsed_time | Scheduled elapsed time in minutes
    actual_elapsed_time | Actual elapsed time in minutes
    air_time | Flight time in minutes
    distance | Distance between origin and destination (miles)
    carrier_delay | Carrier-related delay in minutes
    weather_delay | Weather-related delay in minutes
    nas_delay | National Air System delay in minutes
    security_delay | Security delay in minutes
    late_aircraft_delay | Late aircraft delay in minutes

  5. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Explore at:
    Available download formats: zip (13105509 bytes)
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis

    In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation
    • Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    • Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics
    • Exploratory Data Analysis (EDA): Use Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    • Data Visualization: Leverage tools like Matplotlib or Seaborn to visualize salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) gives a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy
    • Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    • Correlation Analysis: Identify the relationship between variables such as experience and salary by computing correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals whether experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation
    • Salary by Department and Position: Using Pandas' groupby function, summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)
    • Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations
    • Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    • Salary Discrepancies: Highlight any salary discrepancies between departments or genders that may require further investigation.
    • Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:
    • Pandas: for data manipulation, grouping, and descriptive analysis.
    • NumPy: for numerical operations such as percentiles and correlations.
    • Matplotlib/Seaborn: for data visualization to highlight key patterns and trends.
    • Scikit-learn (optional): for building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy. A consolidated code sketch of Steps 1–4 follows below.

  6. Japan National Land Numerical Data🇯🇵

    • kaggle.com
    zip
    Updated May 12, 2024
    Cite
    kyotoman (2024). Japan National Land Numerical Data🇯🇵 [Dataset]. https://www.kaggle.com/datasets/tatsuokoshida/japan-national-land-numerical-data/data
    Explore at:
    Available download formats: zip (699666341 bytes)
    Dataset updated
    May 12, 2024
    Authors
    kyotoman
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    Description

    This dataset maintains GIS data for administrative boundaries throughout Japan, including the names of the prefectures.

    Purpose

    This geographic data for Japan can be combined with other statistical information to create intuitive and easy-to-understand plots.
    For example, by combining this GIS data with information on the population of each prefecture, it is possible to see at a glance how many people are in any given prefecture.

    How To Use

    GeoPandas is a GIS-oriented extension of pandas, a Python library that lets you work with data, including geographic data, in tabular form.
    Load the shp file in GeoPandas as follows:
    gdf = gpd.read_file("/kaggle/input/japan-national-land-numerical-data/N03-20240101_prefecture.shp")

    This notebook also explains simple usage.

    The folder contains many other file types besides the shp file, all of which GeoPandas needs in order to read the shapefile. A sketch of combining these boundaries with other statistics is shown below.
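
    For example (illustrative only): the join column name ("N03_001", the prefecture-name attribute in the N03 specification) and the population figures below are assumptions and should be checked against the actual file.

    ```python
    import geopandas as gpd
    import pandas as pd

    # Load the prefecture boundaries.
    gdf = gpd.read_file(
        "/kaggle/input/japan-national-land-numerical-data/N03-20240101_prefecture.shp"
    )

    # Hypothetical population table keyed on the prefecture-name column;
    # verify the attribute name ("N03_001") against the shapefile.
    pop = pd.DataFrame({
        "N03_001": ["東京都", "北海道"],
        "population": [14_050_000, 5_140_000],
    })
    merged = gdf.merge(pop, on="N03_001", how="left")

    # Choropleth of population by prefecture.
    merged.plot(column="population", legend=True, figsize=(8, 8))
    ```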

    Data Collection

    The Technical Report of the Geospatial Information Authority of Japan publishes GIS data based on numerical land information, which is what was used for this dataset.

    Source

    Ministry of Land, Infrastructure, Transport and Tourism (MLIT), National Land Numerical Information download site (https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-N03-2023.html)

  7. Named Entity Recognition (NER) Corpus

    • kaggle.com
    zip
    Updated Jan 14, 2022
    Cite
    Naser Al-qaydeh (2022). Named Entity Recognition (NER) Corpus [Dataset]. https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus
    Explore at:
    Available download formats: zip (4343548 bytes)
    Dataset updated
    Jan 14, 2022
    Authors
    Naser Al-qaydeh
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Task

    Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.

    Dataset

    Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.

    You can use Pandas Dataframe to read and manipulate this dataset.

    Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and index into a column, the value we get back is a string rather than a list:

    ```python
    data['tag'][0]
    # "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
    type(data['tag'][0])
    # str

    # Convert it back to a list with ast.literal_eval:
    from ast import literal_eval

    literal_eval(data['tag'][0])
    # ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
    type(literal_eval(data['tag'][0]))
    # list
    ```

    Acknowledgements

    This dataset is taken from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia and then processed.

    The Annotated Corpus for Named Entity Recognition is an annotated corpus for named entity recognition built on the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular natural-language-processing features applied to the data set.

    Essential info about entities:

    • geo = Geographical Entity
    • org = Organization
    • per = Person
    • gpe = Geopolitical Entity
    • tim = Time indicator
    • art = Artifact
    • eve = Event
    • nat = Natural Phenomenon
  8. YouTube Trending Videos of the Day

    • kaggle.com
    zip
    Updated Jul 11, 2022
    Cite
    Iron486 (2022). YouTube Trending Videos of the Day [Dataset]. https://www.kaggle.com/datasets/die9origephit/youtube-trending-videos-in-mediterranean-countries
    Explore at:
    Available download formats: zip (1935296 bytes)
    Dataset updated
    Jul 11, 2022
    Authors
    Iron486
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description


    The dataset includes YouTube trending video statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and is related to the following countries:

    • IT - Italy
    • ES - Spain
    • GR - Greece
    • HR - Croatia
    • TR - Turkey
    • AL - Albania
    • DZ - Algeria
    • EG - Egypt
    • LY - Libya
    • TN - Tunisia
    • MA - Morocco
    • IL - Israel
    • ME - Montenegro
    • LB - Lebanon
    • FR - France
    • BA - Bosnia and Herzegovina
    • MT - Malta
    • SI - Slovenia
    • CY - Cyprus
    • SY - Syria


    The columns are, instead, the following:

    • country: the country in which the video was published.

    • video_id: video identification number. Each video has one; you can find it by right-clicking on a video and selecting 'Stats for nerds'.

    • title: title of the video.

    • publishedAt: publication date of the video.

    • channelId: identification number of the channel that published the video.

    • channelTitle: name of the channel that published the video.

    • categoryId: identification number of the video's category. Each number corresponds to a certain category. For example, 10 corresponds to the 'music' category. Check here for the complete list.

    • trending_date: trending date of the video.

    • tags: tags present in the video.

    • view_count: view count of the video.

    • comment_count: number of comments in the video.

    • thumbnail_link: the link of the image that appears before clicking the video.

    • comments_disabled: tells if the comments are disabled or not for a certain video.

    • ratings_disabled: tells if the rating is disabled or not for that video.

    • description: description below the video.

    Inspiration

    You can perform an exploratory data analysis of the dataset, working with Pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the Pandas functions. It's also possible to analyze the titles, tags, and descriptions of the videos to search for relevant information. Remember to upvote if you found the dataset useful :).

    Collection methodology

    The original data were scraped using a tool that you can find here. Only the Mediterranean countries were considered, and the datasets for the different countries were merged into one CSV file, adding the country column (a sketch of this merge step is shown below). The likes and dislikes columns were removed because, at the moment, it's not possible to visualize them through the API.
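
    One possible version of that merge with pandas, assuming a per-country file-name pattern that is not part of the original scraper's output format:

    ```python
    from pathlib import Path
    import pandas as pd

    # Combine per-country trending CSVs into one file, adding a country column.
    frames = []
    for path in Path("trending").glob("*_trending.csv"):   # e.g. IT_trending.csv
        frame = pd.read_csv(path)
        frame["country"] = path.stem.split("_")[0]          # country code prefix
        frames.append(frame)

    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("youtube_trending_mediterranean.csv", index=False)
    ```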

    Acknowledgment

    https://github.com/mitchelljy/Trending-YouTube-Scraper https://github.com/mitchelljy/Trending-YouTube-Scraper/blob/master/LICENSE

  9. AI Workforce & Automation Dataset (2015–2025)

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    Emirhan Akkuş (2025). AI Workforce & Automation Dataset (2015–2025) [Dataset]. https://www.kaggle.com/emirhanakku/ai-workforce-and-automation-dataset-20152025
    Explore at:
    Available download formats: zip (7409 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    Emirhan Akkuş
    Description

    Dataset Overview

    Attribute | Details
    Time Span | 2015–2025
    Countries Included | 20 global economies
    Total Records | 220 rows
    Total Features | 12 quantitative & qualitative attributes
    Data Type | Synthetic, statistically coherent
    Tools Used | Python (Faker, NumPy, Pandas)
    License | CC BY-NC 4.0 – Attribution Non-Commercial
    Creator | Emirhan Akkuş – Kaggle Expert

    This dataset provides a macro-level simulation of how artificial intelligence and automation have transformed global workforce dynamics, productivity growth, and job distribution during the last decade. It is designed for predictive analytics, forecasting, visualization, and policy research applications.

    Data Generation Process

    Step | Description
    1. Initialization | A baseline AI investment and automation rate were defined for each country (between 5–80 billion USD and 10–40%).
    2. Temporal Simulation | Yearly values were simulated for 2015–2025 using exponential and non-linear growth models with controlled noise.
    3. Correlation Modeling | Employment, productivity, and salary were dynamically linked to automation and AI investment levels.
    4. Randomization | Gaussian noise (±2%) was introduced to prevent perfect correlation and ensure natural variability.
    5. Policy Simulation | Synthetic indexes were calculated for AI readiness, policy maturity, and reskilling investment efforts.
    6. Export | Final data were consolidated and exported to CSV using Pandas for easy reproducibility.

    The dataset was generated to maintain internal coherence — as automation and AI investment increase, employment tends to slightly decline, productivity grows, and reskilling budgets expand proportionally.

    Column Definitions

    Column | Description | Value Range / Type
    Year | Observation year between 2015–2025 | Integer
    Country | Country name | Categorical (20 unique)
    AI_Investment_BillionUSD | Annual AI investment (in billions of USD) | Continuous (5–200)
    Automation_Rate_Percent | Percentage of workforce automated | Continuous (10–95%)
    Employment_Rate_Percent | Percentage of total population employed | Continuous (50–80%)
    Average_Salary_USD | Mean annual salary in USD | Continuous (25,000–90,000)
    Productivity_Index | Productivity score scaled 0–100 | Continuous
    Reskilling_Investment_MillionUSD | Government/corporate reskilling investment | Continuous (100–5,000)
    AI_Policy_Index | Policy readiness index (0–1) | Float
    Job_Displacement_Million | Estimated number of jobs replaced by automation | Continuous (0–3 million)
    Job_Creation_Million | New AI-driven jobs created | Continuous (0–4 million)
    AI_Readiness_Score | Composite readiness and adoption index | Continuous (0–100)

    Each feature is designed to maintain realistic relationships between AI investments, automation, and socio-economic outcomes.
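
    To make the recipe concrete, here is a small illustrative re-creation of one country's trajectory using the same ingredients (exponential growth plus ±2% Gaussian noise). The base values and growth rates are assumptions, and this is not the creator's actual generation script.

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    years = np.arange(2015, 2026)

    # Exponential AI-investment growth with ±2% Gaussian noise (assumed base/rate).
    investment = 20.0 * (1 + 0.18) ** (years - 2015)
    investment *= rng.normal(1.0, 0.02, size=years.size)

    # Automation rises over time; employment is weakly, negatively linked to it.
    automation = np.clip(15 + 4.5 * (years - 2015), 10, 95)
    employment = np.clip(72 - 0.15 * automation, 50, 80)
    employment *= rng.normal(1.0, 0.02, size=years.size)

    df = pd.DataFrame({
        "Year": years,
        "Country": "Exampleland",
        "AI_Investment_BillionUSD": investment.round(2),
        "Automation_Rate_Percent": automation.round(1),
        "Employment_Rate_Percent": employment.round(2),
    })
    print(df.head())
    ```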

    Analytical Applications

    Application Area | Example Analyses
    Exploratory Data Analysis (EDA) | Study how AI investment evolves across countries, compare productivity and employment patterns, or compute correlation...

  10. Movie Rationales (Rationales For Movie Reviews)

    • kaggle.com
    zip
    Updated Nov 30, 2022
    Cite
    The Devastator (2022). Movie Rationales (Rationales For Movie Reviews) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-human-perspective-on-movie-reviews/discussion
    Explore at:
    Available download formats: zip (3187183 bytes)
    Dataset updated
    Nov 30, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Movie Rationales (Rationales For Movie Reviews)

    Human annotated rationales for movie reviews

    By Huggingface Hub [source]

    About this dataset

    This dataset was created to allow researchers to gain an in-depth understanding of the inner workings of human-generated movie reviews. With these train, test, and validation sets, researchers can explore different aspects of movie reviews, such as sentiment labels or the rationales behind them. By analyzing this information and finding patterns and correlations, insightful ideas can be discovered that lead to models powerful enough to capture the importance of unique human perspectives when interpreting movie reviews. Any data scientist or researcher interested in AI applications is encouraged to take advantage of this dataset, which may provide useful insights into better understanding user intent when reviewing movies.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is intended to enable researchers and developers to uncover the rationales behind movie reviews. To use it effectively, you must understand the data format and how each column in the dataset works.

    What does each column mean?

    • review: The text of the movie review. (String)
    • label: The sentiment label of the review (Positive, Negative, or Neutral). (String)
    • validation.csv: The validation set, which contains reviews, labels, and evidence that can be used to validate models developed for understanding the human perspective on movie reviews.
    • train.csv: The train set, which contains reviews, labels, and evidence used for training a model based on human annotations of movie reviews.
    • test.csv: The test set, which contains reviews, labels, and evidence that can be used to evaluate models on unseen data.

      How do I use this dataset?

      To get started with this dataset you need a working environment, such as Python or R, with access to the libraries needed for natural language processing (NLP). After setting up an environment with libraries that support NLP tasks, follow these steps:

      • Import the CSV files into your workspace using the appropriate functions of your language's libraries, e.g., in Python use pandas' read_csv() method.

      • Preprocess the text data in the 'review' and 'label' columns by standardizing it, for example removing stopwords from sentences and converting words to lowercase. Several preprocessing libraries are available in Python for this.

      • Train and test ML algorithms using appropriate NLP feature-extraction techniques (Bag of Words, TF-IDF, and Word2Vec are some examples; many more are available).

      • Measure performance after running experiments on the provided validation and test sets; precision-recall curves along with familiar metrics such as F1 score and accuracy make it easy to analyze hyperparameter tuning and algorithm efficiency from the values you obtain while testing.

      • Recommendation systems are always fun! You could also build a simple machine-learning recommendation system by collecting user visit logs and hand-crafting new features.

      A minimal baseline along these lines is sketched below.
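
      The sketch assumes the train.csv/test.csv splits and the 'review' and 'label' columns described above; TF-IDF with logistic regression is just one reasonable choice, not a prescribed approach.

      ```python
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import classification_report

      # Load the splits described above.
      train = pd.read_csv("train.csv")
      test = pd.read_csv("test.csv")

      # TF-IDF features over the raw review text (lowercased, stopwords removed).
      vec = TfidfVectorizer(lowercase=True, stop_words="english", max_features=20000)
      X_train = vec.fit_transform(train["review"])
      X_test = vec.transform(test["review"])

      # Simple linear baseline for sentiment classification.
      clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
      print(classification_report(test["label"], clf.predict(X_test)))
      ```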

    Research Ideas

    • Developing an automated movie review summarizer based on user ratings, that can accurately capture the salient points of a review and summarize it for moviegoers.
    • Training a model to predict the sentiment of a review, by combining machine learning models with human-annotated rationales from this dataset.
    • Building an AI system that can detect linguistic markers of deception in reviews (e.g., 'fake news', thin reviews etc) and issue warnings on possible fraudulent purchases or online reviews

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv | Column name | Description ...

  11. 🚗Car ShowRoom💸

    • kaggle.com
    zip
    Updated Nov 1, 2025
    Cite
    Omar Essa (2025). 🚗Car ShowRoom💸 [Dataset]. https://www.kaggle.com/datasets/jockeroika/car-showroom/suggestions
    Explore at:
    Available download formats: zip (312493 bytes)
    Dataset updated
    Nov 1, 2025
    Authors
    Omar Essa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🎯 1. Define the Goal

    Ask yourself: what do you want to do with the data?

    Examples:

    📊 Analyze sales, profit, and inventory

    🧠 Predict car prices based on features

    🧾 Build a car showroom management system (SQL/Flask)

    🖥️ Create a dashboard showing cars, sales, and customers

    Tools You Can Use

    Goal | Tools
    Data Creation | Excel / Python (Pandas)
    Database | MySQL / SQLite / PostgreSQL
    Dashboard | Power BI / Tableau / Streamlit / Flask
    ML Models | scikit-learn (e.g., car price prediction)
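
    As one example of the price-prediction goal, a short scikit-learn sketch could look like the following; the file name and the assumption of a numeric 'price' column plus mixed feature columns are hypothetical, so adjust the column names to the actual CSV.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Hypothetical schema: a numeric 'price' target plus mixed feature columns.
    df = pd.read_csv("car_showroom.csv")
    X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
    ```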

