100+ datasets found
  1. Large Customer Churn Analysis Dataset

    • kaggle.com
    zip
    Updated Dec 18, 2024
    Cite
    Hajra Amir (2024). Large Customer Churn Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/hajraamir21/large-customer-churn-analysis-dataset
    Available download formats: zip (17387 bytes)
    Dataset updated
    Dec 18, 2024
    Authors
    Hajra Amir
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains synthetic data generated for customer churn analysis. It includes 1000 entries representing customer information, such as demographics, account details, subscription types, and churn status. The data is ideal for predictive modeling, machine learning algorithms, and exploratory data analysis (EDA).

    Features:

    • CustomerID: A unique identifier for each customer.
    • Gender: Male or Female.
    • Age: Customer's age in years.
    • Geography: Country or region of the customer (e.g., Germany, France, UK).
    • Tenure: Number of months the customer has been with the company.
    • Contract: Type of subscription (Month-to-month, One-year, Two-year).
    • MonthlyCharges: The amount billed monthly.
    • TotalCharges: The total amount billed to date.
    • PaymentMethod: Method used for payments (e.g., Credit card, Direct debit).
    • IsActiveMember: Whether the customer is an active member (1 = Active, 0 = Inactive).
    • Churn: Indicates whether the customer has churned (Yes/No).
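As a rough sketch of the exploratory analysis this description mentions, the snippet below computes the churn rate per contract type with pandas. The column names follow the feature list in the description; the four rows are made-up stand-ins rather than records from the dataset, and in practice the data would be loaded with `pd.read_csv` from whatever CSV the archive contains (the file name is not stated here).

```python
import pandas as pd

# Illustrative rows only (not taken from the dataset); column names follow
# the feature list in the description.
df = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4],
        "Contract": ["Month-to-month", "Month-to-month", "One-year", "Two-year"],
        "MonthlyCharges": [70.0, 55.5, 40.0, 30.0],
        "Churn": ["Yes", "No", "No", "No"],
    }
)

# Map the Yes/No target to 1/0 so that the group mean is the churn rate.
df["ChurnFlag"] = (df["Churn"] == "Yes").astype(int)

# Churn rate per contract type.
churn_by_contract = df.groupby("Contract")["ChurnFlag"].mean()
print(churn_by_contract)
```

Encoding the target as 0/1 first keeps the aggregation a plain mean, which also makes the column usable as a label for a classifier later on.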

  2. A dataset to investigate ChatGPT for enhancing Students' Learning Experience...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 19, 2024
    Cite
    Schicchi, Daniele; Taibi, Davide (2024). A dataset to investigate ChatGPT for enhancing Students' Learning Experience via Concept Maps [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12076680
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Institute for Educational Technology, National Research Council of Italy
    Authors
    Schicchi, Daniele; Taibi, Davide
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was compiled to examine the use of ChatGPT 3.5 in educational settings, particularly for creating and personalizing concept maps. The data has been organized into three folders: Maps, Texts, and Questionnaires. The Maps folder contains the graphical representation of the concept maps and the PlantUML code for drawing them in Italian and English. The Texts folder contains the source text used as input for the map's creation. The Questionnaires folder includes the students' responses to the three administered questionnaires.

  3. DISL

    • huggingface.co
    Updated Jan 15, 2024
    Cite
    ASSERT | Research group at KTH Royal Institute of Technology (2024). DISL [Dataset]. https://huggingface.co/datasets/ASSERT-KTH/DISL
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 15, 2024
    Dataset authored and provided by
    ASSERT | Research group at KTH Royal Institute of Technology
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DISL

    The DISL dataset features a collection of 514506 unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts.

    Content

    The raw subset has full contract source code and is not deduplicated; it has 3,298,271 smart contracts the… See the full description on the dataset page: https://huggingface.co/datasets/ASSERT-KTH/DISL.

  4. Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...

    • springernature.figshare.com
    application/csv
    Updated Sep 24, 2024
    Cite
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
    Available download formats: application/csv
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

    We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

    We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].

    • Mixed MMQA-OMAQ is composed of the 140 question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifiers that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).
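The "MedicationQA_N" convention can be resolved back to a row position with a small helper. This is only a sketch: the function name is hypothetical, and it assumes the identifier is exactly the prefix followed by a decimal integer, as the description suggests.

```python
def medicationqa_row_index(identifier: str) -> int:
    """Return the 0-indexed MedicationQA row encoded in 'MedicationQA_N'."""
    prefix = "MedicationQA_"
    if not identifier.startswith(prefix):
        raise ValueError(f"not a MedicationQA identifier: {identifier!r}")
    # Everything after the prefix is the 0-indexed row number N.
    return int(identifier[len(prefix):])


print(medicationqa_row_index("MedicationQA_0"))   # first row of MedicationQA
print(medicationqa_row_index("MedicationQA_17"))  # eighteenth row
```

Because the index is 0-based, the result can be used directly with positional lookups such as `DataFrame.iloc` once the original MedicationQA table is loaded.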

    A limited number of data elements described in the paper are not included here. The following elements are excluded:

    1. The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

    2. The free-text comments written by raters during the ratings process.

    3. Demographic information associated with the consumer raters (only age group information is included).

    References

    1. Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

    2. Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

    3. Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

    4. Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

    5. Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

    Description of files and sheets

    1. Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.

    2. Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

    3. Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

    4. Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

    5. Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

    6. Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

    7. Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

    8. Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

    9. TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

    10. Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

    11. Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

    12. HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

    13. Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

    14. Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
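As an illustration of how the ratings files described above might be summarised, the sketch below tallies the three documented `bias_presence` values. Only that column name and its values come from the description; the four rows are invented stand-ins for ratings_independent.csv, which would normally be opened from disk instead of from an in-memory string.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for ratings_independent.csv; only the bias_presence
# column and its three possible values come from the description.
sample_csv = io.StringIO(
    "bias_presence\n"
    "No bias\n"
    "Minor bias\n"
    "No bias\n"
    "Severe bias\n"
)

# Count how often each rating value occurs.
counts = Counter(row["bias_presence"] for row in csv.DictReader(sample_csv))
total = sum(counts.values())

for value in ("No bias", "Minor bias", "Severe bias"):
    print(f"{value}: {counts[value]} ({counts[value] / total:.0%})")
```

Note that Mixed MMQA-OMAQ instances are triple-rated, so a per-question summary on the real file would need to aggregate ratings per question before counting.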

    Version history

    Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question.

    Version 1: Contained datasets of questions without ratings. Consistent with v1 available as a preprint on arXiv (https://arxiv.org/abs/2403.12025).

    WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

    NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

  5. CIFAR-100

    • datasets.activeloop.ai
    • universe.roboflow.com
    • +5 more
    deeplake
    Updated Feb 3, 2022
    Cite
    Alex Krizhevsky (2022). CIFAR-100 [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/cifar-100-dataset/
    Available download formats: deeplake
    Dataset updated
    Feb 3, 2022
    Authors
    Alex Krizhevsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 8, 2009
    Dataset funded by
    University of Toronto
    Description

    The CIFAR-100 dataset is a large dataset of labeled images. It is a popular dataset for machine learning and artificial intelligence research. The dataset consists of 60,000 32x32 color images. These images are split into 100 mutually exclusive classes, with 600 images per class. The classes cover animals, vehicles, and other objects.

  6. (HS 3) Large Spatial Sample Datasets in Maryland

    • search.dataone.org
    Updated Dec 30, 2023
    + more versions
    Cite
    Young-Don Choi (2023). (HS 3) Large Spatial Sample Datasets in Maryland [Dataset]. https://search.dataone.org/view/sha256%3A0c1fb040332273a6f82c7474485554a47dd112d2f39225f9ff00889d5db26581
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Young-Don Choi
    Description

    This HydroShare resource was created to share large spatial sample datasets in Maryland on GeoServer (https://geoserver.hydroshare.org/geoserver/web/wicket/bookmarkable/org.geoserver.web.demo.MapPreviewPage) and THREDDS (https://thredds.hydroshare.org/thredds/catalog/hydroshare/resources/catalog.html).

    Users can check the uploaded LSS datasets on HydroShare-GeoServer and THREDDS using this HS resource id.

    Then, through the RHESSys workflows, users can subset LSS datasets using OWSLib and xarray.

  7. Big-Math-RL-UNVERIFIED

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    SynthLabs (2025). Big-Math-RL-UNVERIFIED [Dataset]. https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED
    Dataset updated
    Apr 16, 2025
    Dataset authored and provided by
    SynthLabs
    Description

    Big-Math: UNVERIFIED

    WARNING: This dataset contains ONLY questions whose answers have not been verified to be correct. Use this dataset with caution.

    Dataset Creation

    Big-Math-Unverified is created as an offshoot of the Big-Math dataset. Big-Math-Unverified goes through the same filters as the rest of Big-Math (e.g., remove non-English, remove multiple choice, etc.), except that these problems were not solved in any of the… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED.

  8. DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

    • live.european-language-grid.eu
    binary format
    Updated Jun 15, 2023
    + more versions
    Cite
    (2023). DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22959
    Available download formats: binary format
    Dataset updated
    Jun 15, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
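The role of the Wikidata QID as a language-agnostic join key can be sketched with plain dictionaries. Everything here is hypothetical: the record layout and field names are illustrative and do not reflect DaMuEL's actual file format; only the idea of combining the knowledge base with per-language text through the QID comes from the description.

```python
# Hypothetical miniature records; the field names are illustrative and do
# not reflect DaMuEL's actual file format.
knowledge_base = {
    "Q42": {"type": "PER"},
    "Q90": {"type": "LOC"},
}
labels_by_language = {
    "en": {"Q42": "Douglas Adams", "Q90": "Paris"},
    "cs": {"Q90": "Paříž"},
}


def entity_view(qid: str, language: str) -> dict:
    """Combine language-agnostic facts with one language's label via the QID."""
    entry = dict(knowledge_base[qid])
    entry["qid"] = qid
    # The label is None when the entity has no text in the requested language.
    entry["label"] = labels_by_language.get(language, {}).get(qid)
    return entry


print(entity_view("Q90", "cs"))
print(entity_view("Q42", "cs"))
```

Keeping per-language text in separate mappings mirrors the "stored separately for each language" layout: adding a language never touches the shared knowledge base.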

  9. roberta-large

    • kaggle.com
    zip
    Updated Nov 17, 2022
    Cite
    Sergey Bochenkov (2022). roberta-large [Dataset]. https://www.kaggle.com/datasets/bachan/roberta-large
    Available download formats: zip (858807520 bytes)
    Dataset updated
    Nov 17, 2022
    Authors
    Sergey Bochenkov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Sergey Bochenkov

    Released under CC0: Public Domain

    Contents

  10. Data from: FloodCastBench: A Large-Scale Dataset and Foundation Models for...

    • zenodo.org
    zip
    Updated Nov 1, 2024
    + more versions
    Cite
    Qingsong Xu; Yilei Shi; Jie Zhao; Xiao Xiang Zhu (2024). FloodCastBench: A Large-Scale Dataset and Foundation Models for Flood Modeling and Forecasting [Dataset]. http://doi.org/10.5281/zenodo.14017092
    Available download formats: zip
    Dataset updated
    Nov 1, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Qingsong Xu; Yilei Shi; Jie Zhao; Xiao Xiang Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Effective flood forecasting is crucial for informed decision-making and emergency response. Existing flood datasets mainly describe flood events but lack dynamic process data suitable for machine learning (ML). This work introduces the FloodCastBench dataset, designed for ML-based flood modeling and forecasting, featuring four major flood events: Pakistan 2022, UK 2015, Australia 2022, and Mozambique 2019. FloodCastBench provides comprehensive low-fidelity and high-fidelity flood forecasting datasets specifically for ML.
    This dataset comprises three folders: the low-fidelity flood forecasting folder, the high-fidelity flood forecasting folder, and the relevant data folder. The low-fidelity flood forecasting folder includes data on the 2022 Pakistan flood and the 2019 Mozambique flood, both with a spatial resolution of 480 m. The high-fidelity flood forecasting folder contains two subfolders: one for the 2022 Australia flood and the 2015 UK flood with a spatial resolution of 30 m, and another for the same floods with a spatial resolution of 60 m. All data files are stored in TIFF format, with a temporal resolution of 300 seconds, and file names are numbered sequentially, incremented every 300 seconds until the simulation endpoint. The relevant data folder includes five subfiles: DEM, land use and land cover, rainfall data, georeferenced files, and initial condition files. The DEM, land use and land cover, rainfall, and initial condition data are all provided in TIFF format. The rainfall data is organized in a format of year-month-day-hour-minute-second. Georeferenced files provide geographic extent and spatial reference to support viewing and analysis of the associated TIFF files in GIS.
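Since frames are spaced 300 seconds apart, a frame's position in the sequence maps directly to elapsed simulation time. The helper below is a sketch under a stated assumption: it treats frame 0 as the start of the simulation, which the description does not confirm, and the function name is hypothetical.

```python
from datetime import timedelta

TIME_STEP_SECONDS = 300  # temporal resolution stated in the description


def elapsed_time(frame_index: int) -> timedelta:
    """Elapsed simulation time for a frame, assuming frame 0 is t = 0."""
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    return timedelta(seconds=frame_index * TIME_STEP_SECONDS)


# 12 frames * 300 s = 3600 s, i.e. one hour into the simulation.
print(elapsed_time(12))
```

If the file names turn out to encode seconds rather than a frame counter, the same relation applies with the multiplication removed.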
    FloodCastBench details the process of flood dynamics data acquisition, starting with input data preparation (e.g., topography, land use, rainfall) and flood measurement data collection (e.g., SAR-based maps, surveyed outlines) for hydrodynamic modeling. We deploy a widely recognized finite difference numerical solution to construct high-resolution spatiotemporal dynamic processes with 30-m spatial and 300-second temporal resolutions. Flood measurement data are used to calibrate the hydrodynamic model parameters and validate the flood inundation maps. Furthermore, we establish a benchmark of foundational models for neural flood forecasting using FloodCastBench, validating its effectiveness in supporting ML models for spatiotemporal, cross-regional, and downscaled flood forecasting.

  11. ai_music_large

    • huggingface.co
    Updated Dec 14, 2024
    Cite
    Jesse (2024). ai_music_large [Dataset]. https://huggingface.co/datasets/SleepyJesse/ai_music_large
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 14, 2024
    Authors
    Jesse
    Description

    AI/Human Music (Large variant)

    A dataset that comprises both AI-generated music and human-composed music. This is the "large" variant of the dataset, which is around 70GiB in size. It contains 10,000 audio files from humans and 10,000 audio files from AI. The distribution is: 256 are from SunoCaps, 4,872 are from Udio, and 4,872 are from MusicSet. Data sources for this dataset:

    https://huggingface.co/datasets/blanchon/udio_dataset… See the full description on the dataset page: https://huggingface.co/datasets/SleepyJesse/ai_music_large.

  12. RBD24 - Risk Activities Dataset 2024

    • zenodo.org
    bin
    Updated Mar 4, 2025
    Cite
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime (2025). RBD24 - Risk Activities Dataset 2024 [Dataset]. http://doi.org/10.5281/zenodo.13787591
    Available download formats: bin
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.

    This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data" published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290

    Summary of the Datasets

    The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.

    | DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
    |---|---|---|---|---|
    | Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
    | Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
    | OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
    | OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
    | OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
    | OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
    | P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
    | P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
    | NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
    | NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
    | Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
    | Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |

    Methodology

    To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
    more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
    ground truth are as follows:

    - Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
    - IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.

    For each user exposed to the behaviors stated in the summary table, different time windows (TWs) are computed, aggregating
    user behavior within a fixed time interval. These TWs serve as the basis for building various supervised
    and unsupervised methods.

    Sample Representation

    The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two
    timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the
    construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated
    interpretable features designed to describe device-level properties within the specified time frame. The most
    influential features are described below.

    • User: A unique hash value that identifies a user.
    • Timestamp: The timestamp of the windows.
    • Features
    • Label: 1 if the user exhibits compromised behavior, 0 otherwise. -1 indicates that it is a TW with an unknown label.
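The label convention above (1, 0, and -1 for unknown) suggests separating labelled from unlabelled time windows before training. The following is a minimal sketch: the lowercase column names `user`, `timestamp`, and `label` are assumptions about the Parquet schema, and the rows are invented for illustration.

```python
import pandas as pd

# Made-up time windows; the column names "user", "timestamp" and "label"
# are assumptions, only the label semantics (1, 0, -1) come from the text.
tw = pd.DataFrame(
    {
        "user": ["a1", "a1", "b2", "c3"],
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 01:00",
             "2024-01-01 00:00", "2024-01-01 00:00"]
        ),
        "label": [0, 1, -1, 0],
    }
)

labelled = tw[tw["label"] != -1]    # windows usable for supervised training
unlabelled = tw[tw["label"] == -1]  # unknown label, e.g. for unsupervised use

print(labelled["label"].value_counts().to_dict())
print(len(unlabelled))
```

Dropping the -1 rows before fitting a classifier avoids treating "unknown" as a third class by accident.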

    Dataset Format

    Parquet format uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with the pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here’s an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:

    ```python
    import pandas as pd

    # Reading a Parquet file
    df = pd.read_parquet(
        'path_to_your_file.parquet',
        engine='fastparquet'
    )
    ```

  13. Student Skill Gap Analysis

    • data.mendeley.com
    • kaggle.com
    Updated Apr 28, 2025
    Cite
    Bindu Garg (2025). Student Skill Gap Analysis [Dataset]. http://doi.org/10.17632/rv6scbpd7v.1
    Dataset updated
    Apr 28, 2025
    Authors
    Bindu Garg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is designed for skill gap analysis, focusing on evaluating the skill gap between students’ current skills and industry requirements. It provides insights into technical skills, soft skills, career interests, and challenges, helping in skill gap analysis to identify areas for improvement.

    By leveraging this dataset, educators, recruiters, and researchers can conduct skill gap analysis to assess students’ job readiness and tailor training programs accordingly. It serves as a valuable resource for identifying skill deficiencies and skill gaps, improving career guidance, and enhancing curriculum design through targeted skill gap analysis.

    The column descriptors are as follows:

    • Name - Student's full name.
    • email_id - Student's email address.
    • Year - The academic year the student is currently in (e.g., 1st Year, 2nd Year, etc.).
    • Current Course - The course the student is currently pursuing (e.g., B.Tech CSE, MBA, etc.).
    • Technical Skills - List of technical skills possessed by the student (e.g., Python, Data Analysis, Cloud Computing).
    • Programming Languages - Programming languages known by the student (e.g., Python, Java, C++).
    • Rating - Self-assessed rating of technical skills on a scale of 1 to 5.
    • Soft Skills - List of soft skills (e.g., Communication, Leadership, Teamwork).
    • Rating - Self-assessed rating of soft skills on a scale of 1 to 5.
    • Projects - Indicates whether the student has worked on any projects (Yes/No).
    • Career Interest - The student's preferred career path (e.g., Data Scientist, Software Engineer).
    • Challenges - Challenges faced while applying for jobs/internships (e.g., Lack of experience, Resume building issues).
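One way to start a skill gap analysis on these columns is to count how often each technical skill appears. This sketch assumes the list-valued columns are stored as comma-separated strings, which the description does not state; the two rows are invented.

```python
from collections import Counter

# Made-up rows; the comma-separated encoding of list-valued columns
# is an assumption, not something the description confirms.
rows = [
    {"Technical Skills": "Python, Data Analysis", "Projects": "Yes"},
    {"Technical Skills": "Python, Cloud Computing", "Projects": "No"},
]

# Split each list-valued cell and count individual skills across students.
skill_counts = Counter(
    skill.strip()
    for row in rows
    for skill in row["Technical Skills"].split(",")
    if skill.strip()
)

print(skill_counts.most_common())
```

Comparing such counts against a list of skills required by industry would highlight the gaps the dataset is meant to expose.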

  14. large-data

    • kaggle.com
    zip
    Updated Aug 13, 2024
    Cite
    AYUSH SINGH331 (2024). large-data [Dataset]. https://www.kaggle.com/datasets/ayushsingh331/large-data/versions/1
    Available download formats: zip (1203746376 bytes)
    Dataset updated
    Aug 13, 2024
    Authors
    AYUSH SINGH331
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by AYUSH SINGH331

    Released under MIT

    Contents

  15. N

    Big Flat, AR Age Group Population Dataset: A Complete Breakdown of Big Flat...

    • neilsberg.com
    csv, json
    Updated Jul 24, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Big Flat, AR Age Group Population Dataset: A Complete Breakdown of Big Flat Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2024 Edition [Dataset]. https://www.neilsberg.com/research/datasets/aa799f4c-4983-11ef-ae5d-3860777c1fe6/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Big Flat, Arkansas
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each age group. For ages 0 to 85, we divided the data into roughly five-year buckets; for ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Big Flat population distribution across 18 age groups. It lists the population in each age group along with that group's percentage of the total population of Big Flat. The dataset can be used to understand the population distribution of Big Flat by age. For example, using this dataset, we can identify the largest age group in Big Flat.

    Key observations

    The largest age group in Big Flat, AR was 15 to 19 years, with a population of 16 (25%), according to the ACS 2018-2022 5-Year Estimates. The smallest age group in Big Flat, AR was 5 to 9 years, with a population of 0 (0%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: This column shows the population of the specific age group in Big Flat.
    • % of Total Population: This column displays the population of each age group as a proportion of Big Flat's total population. Please note that the sum of all percentages may not equal one due to rounding of values.
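    As a minimal sketch of how the "% of Total Population" column could be derived, and why rounded shares may not add up exactly, the example below divides each group's count by the total. The counts are made up for illustration and are not taken from the dataset; `percent_of_total` is a hypothetical helper name, and shares are expressed on a 0-100 scale rather than 0-1.

```python
def percent_of_total(counts, ndigits=2):
    """Return each group's share of the total, as a percentage rounded to ndigits."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero when every group is empty.
        return {group: 0.0 for group in counts}
    return {group: round(100 * n / total, ndigits) for group, n in counts.items()}


# Hypothetical counts, chosen only to make the rounding effect visible.
counts = {"Under 5 years": 1, "5 to 9 years": 1, "10 to 14 years": 1}
shares = percent_of_total(counts)

for group, share in shares.items():
    print(f"{group}: {share}%")

# Each share rounds to 33.33, so the rounded shares sum to about 99.99, not 100.
print("sum of rounded shares:", sum(shares.values()))
```

    Rounding each share independently is what causes the small discrepancy; summing the unrounded shares would give 100 up to floating-point error.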

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Big Flat Population by Age. You can refer to it here

  16. N

    Big Flat, AR Population Dataset: Yearly Figures, Population Change, and...

    • neilsberg.com
    csv, json
    Updated Sep 18, 2023
    + more versions
    Cite
    Neilsberg Research (2023). Big Flat, AR Population Dataset: Yearly Figures, Population Change, and Percent Change Analysis [Dataset]. https://www.neilsberg.com/research/datasets/6d48c774-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Sep 18, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Big Flat, Arkansas
    Variables measured
    Annual Population Growth Rate, Population Between 2000 and 2022, Annual Population Growth Rate Percent
    Measurement technique
    The data presented in this dataset is derived from more than 20 years of U.S. Census Bureau Population Estimates Program (PEP) data, 2000-2022. To measure the variables, namely (a) population and (b) population change (absolute and as a percentage), we first analyzed and tabulated the data for each year between 2000 and 2022. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Big Flat population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, both in absolute terms and as a percentage. The dataset can be used to understand how the population of Big Flat has changed across the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing and, if it has changed, when it peaked or whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.

    Key observations

    In 2022, the population of Big Flat was 89, unchanged (0.00%) from 2021. Previously, in 2021, the population of Big Flat was 89, an increase of 1.14% compared to a population of 88 in 2020. Over the last 20-plus years, between 2000 and 2022, the population of Big Flat decreased by 15. In this period, the peak population was 111 in the year 2007. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

    Content

    When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

    Data Coverage:

    • From 2000 to 2022

    Variables / Data Columns

    • Year: This column displays the data year (Measured annually and for years 2000 to 2022)
    • Population: This column shows the population of Big Flat for the specific year.
    • Year on Year Change: This column displays the change in Big Flat population for each year compared to the previous year.
    • Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.
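    The two derived columns can be illustrated with a short sketch. It assumes the year-on-year change is the difference from the previous year and the percent change is that difference divided by the previous year's population; the function name `year_on_year` is hypothetical, and the three population figures are the 2020-2022 values quoted in the key observations.

```python
def year_on_year(populations):
    """Return (year, population, change, change_pct) rows in year order.

    The first year has no previous year, so its change fields are None.
    """
    rows = []
    prev = None
    for year in sorted(populations):
        pop = populations[year]
        if prev is None:
            rows.append((year, pop, None, None))
        else:
            change = pop - prev
            # Guard against a zero population in the previous year.
            pct = round(100 * change / prev, 2) if prev else None
            rows.append((year, pop, change, pct))
        prev = pop
    return rows


# Population figures for Big Flat quoted in the key observations.
populations = {2020: 88, 2021: 89, 2022: 89}
rows = year_on_year(populations)

for row in rows:
    print(row)
```

    With these inputs, the 2021 row shows a change of 1 (1.14%) and the 2022 row shows a change of 0 (0.0%), matching the figures in the key observations.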

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Big Flat Population by Year. You can refer to it here

  17. R

    3 Big Data Dataset

    • universe.roboflow.com
    zip
    Updated Oct 3, 2025
    Cite
    BIG DATA (2025). 3 Big Data Dataset [Dataset]. https://universe.roboflow.com/big-data-db8ne/3-big-data-myxfg/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 3, 2025
    Dataset authored and provided by
    BIG DATA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cats Bounding Boxes
    Description

    3 BIG DATA

    ## Overview
    
    3 BIG DATA is a dataset for object detection tasks - it contains Cats annotations for 943 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. N

    Big Falls Town, Wisconsin Census Bureau Gender Demographics and Population...

    • neilsberg.com
    Updated Feb 19, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Big Falls Town, Wisconsin Census Bureau Gender Demographics and Population Distribution Across Age Datasets [Dataset]. https://www.neilsberg.com/research/datasets/e173057a-52cf-11ee-804b-3860777c1fe6/
    Explore at:
    Dataset updated
    Feb 19, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Big Falls, Wisconsin
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Big Falls town population by gender and age. The dataset can be utilized to understand the gender distribution and demographics of Big Falls town.

    Content

    The dataset comprises the following two datasets, one for each theme:

    • Big Falls Town, Wisconsin Population Breakdown by Gender
    • Big Falls Town, Wisconsin Population Breakdown by Gender and Age

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

  19. R

    Large Labelled Datset Dataset

    • universe.roboflow.com
    zip
    Updated Mar 21, 2024
    Cite
    Unmasking Trash Empowering Automated Object Recognition with internet Intelligence Thesis (2024). Large Labelled Datset Dataset [Dataset]. https://universe.roboflow.com/unmasking-trash-empowering-automated-object-recognition-with-internet-intelligence-thesis/large-labelled-datset
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 21, 2024
    Dataset authored and provided by
    Unmasking Trash Empowering Automated Object Recognition with internet Intelligence Thesis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Waste Litter Garbage Trash Bounding Boxes
    Description

    Large Labelled Datset

    ## Overview
    
    Large Labelled Datset is a dataset for object detection tasks - it contains Waste Litter Garbage Trash annotations for 12,919 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. N

    Big Spring, TX Age Group Population Dataset: A complete breakdown of Big...

    • neilsberg.com
    csv, json
    Updated Sep 16, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neilsberg Research (2023). Big Spring, TX Age Group Population Dataset: A complete breakdown of Big Spring age demographics from 0 to 85 years, distributed across 18 age groups [Dataset]. https://www.neilsberg.com/research/datasets/6fe2e819-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Sep 16, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Texas, Big Spring
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each age group. For ages 0 to 85, we divided the data into roughly five-year buckets; for ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Big Spring population distribution across 18 age groups. It lists the population in each age group along with that group's percentage of the total population of Big Spring. The dataset can be used to understand the population distribution of Big Spring by age. For example, using this dataset, we can identify the largest age group in Big Spring.

    Key observations

    The largest age group in Big Spring, TX was 35-39 years, with a population of 2,222 (8.48%), according to the 2021 American Community Survey. The smallest age group in Big Spring, TX was 80-84 years, with a population of 286 (1.09%). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: This column shows the population of the specific age group in Big Spring.
    • % of Total Population: This column displays the population of each age group as a proportion of Big Spring's total population. Please note that the sum of all percentages may not equal one due to rounding of values.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Big Spring Population by Age. You can refer to it here
