100+ datasets found

Large Customer Churn Analysis Dataset
kaggle.com
zip
Updated Dec 18, 2024
Cite
Hajra Amir (2024). Large Customer Churn Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/hajraamir21/large-customer-churn-analysis-dataset
Explore at:
Available download formats: zip (17387 bytes)
Dataset updated
Dec 18, 2024
Authors
Hajra Amir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains synthetic data generated for customer churn analysis. It includes 1000 entries representing customer information, such as demographics, account details, subscription types, and churn status. The data is suited to predictive modeling, machine learning algorithms, and exploratory data analysis (EDA).

Features:
- CustomerID: A unique identifier for each customer.
- Gender: Male or Female.
- Age: Customer's age in years.
- Geography: Country or region of the customer (e.g., Germany, France, UK).
- Tenure: Number of months the customer has been with the company.
- Contract: Type of subscription (Month-to-month, One-year, Two-year).
- MonthlyCharges: The amount billed monthly.
- TotalCharges: The total amount billed to date.
- PaymentMethod: Method used for payments (e.g., Credit card, Direct debit).
- IsActiveMember: Whether the customer is an active member (1 = Active, 0 = Inactive).
- Churn: Indicates whether the customer has churned (Yes/No).
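As a minimal sketch of the kind of exploratory analysis the description has in mind, the snippet below computes a churn rate per contract type with pandas. The column names follow the feature list; the four rows are toy values invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration only; they are NOT taken from the dataset.
customers = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4],
        "Contract": ["Month-to-month", "Month-to-month", "One-year", "Two-year"],
        "Tenure": [3, 14, 20, 40],
        "MonthlyCharges": [70.0, 55.5, 80.0, 45.0],
        "Churn": ["Yes", "No", "No", "No"],
    }
)

# Encode the Yes/No target as 1/0 so it can be averaged.
customers["ChurnFlag"] = customers["Churn"].map({"Yes": 1, "No": 0})

# Share of churned customers per contract type.
churn_rate_by_contract = customers.groupby("Contract")["ChurnFlag"].mean()
print(churn_rate_by_contract)
```

With the real file, the toy frame would be replaced by a `pd.read_csv` call on the downloaded CSV, whose exact file name is not given here.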
A dataset to investigate ChatGPT for enhancing Students' Learning Experience...
data.niaid.nih.gov
zenodo.org
Updated Jun 19, 2024
Cite
Schicchi, Daniele; Taibi, Davide (2024). A dataset to investigate ChatGPT for enhancing Students' Learning Experience via Concept Maps [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12076680
Explore at:
Dataset updated
Jun 19, 2024
Dataset provided by
Institute for Educational Technology, National Research Council of Italy
Authors
Schicchi, Daniele; Taibi, Davide
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was compiled to examine the use of ChatGPT 3.5 in educational settings, particularly for creating and personalizing concept maps. The data has been organized into three folders: Maps, Texts, and Questionnaires. The Maps folder contains the graphical representation of the concept maps and the PlantUML code for drawing them in Italian and English. The Texts folder contains the source text used as input for the map's creation. The Questionnaires folder includes the students' responses to the three administered questionnaires.
DISL
huggingface.co
Updated Jan 15, 2024
Cite
ASSERT | Research group at KTH Royal Institute of Technology (2024). DISL [Dataset]. https://huggingface.co/datasets/ASSERT-KTH/DISL
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 15, 2024
Dataset authored and provided by
ASSERT | Research group at KTH Royal Institute of Technology
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
DISL

The DISL dataset features a collection of 514506 unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts.

Content

the raw subset has full contracts source code and it's not deduplicated, it has 3,298,271 smart contracts the… See the full description on the dataset page: https://huggingface.co/datasets/ASSERT-KTH/DISL.
Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...
springernature.figshare.com
application/csv
Updated Sep 24, 2024
Cite
Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
Explore at:
Available download formats: application/csv
Unique identifier
https://doi.org/10.6084/m9.figshare.26133973.v1
Dataset updated
Sep 24, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].

Mixed MMQA-OMAQ is composed of the 140-question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifiers that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).
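The "MedicationQA_N" convention can be decoded mechanically. The following is a minimal sketch, assuming identifiers follow exactly the pattern described above; the function name `medicationqa_row` is hypothetical and not part of the released files.

```python
def medicationqa_row(identifier: str) -> int:
    """Return the 0-indexed MedicationQA row encoded in an identifier."""
    prefix = "MedicationQA_"
    if not identifier.startswith(prefix):
        raise ValueError(f"not a MedicationQA identifier: {identifier!r}")
    # int() raises ValueError if the suffix is not a plain integer.
    return int(identifier[len(prefix):])


# The returned index can be used directly with 0-indexed row access,
# for example pandas' DataFrame.iloc on the original MedicationQA table.
print(medicationqa_row("MedicationQA_17"))
```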

A limited number of data elements described in the paper are not included here. The following elements are excluded:

The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

The free-text comments written by raters during the ratings process.

Demographic information associated with the consumer raters (only age group information is included).

References

[1] Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

[2] Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

[3] Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

[4] Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

[5] Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

Description of files and sheets

Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.

Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
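To illustrate how the `bias_presence` column described for ratings_independent.csv might be summarized, here is a minimal sketch with pandas. The column name and its three values come from the file description above; the four rows are toy values invented for illustration, and the real file has further columns not modeled here.

```python
import pandas as pd

# Toy ratings invented for illustration; they are NOT rows from the released file.
ratings = pd.DataFrame(
    {"bias_presence": ["No bias", "Minor bias", "No bias", "Severe bias"]}
)

# The three documented response levels, in order of severity.
levels = ["No bias", "Minor bias", "Severe bias"]

# Count ratings per level, keeping levels that never occur as zero.
counts = ratings["bias_presence"].value_counts().reindex(levels, fill_value=0)

# Share of ratings that report any bias (minor or severe).
any_bias_share = (ratings["bias_presence"] != "No bias").mean()
print(counts.to_dict(), any_bias_share)
```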

Version history

Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question.

Version 1: Contained datasets of questions without ratings. Consistent with v1 available as a preprint on arXiv (https://arxiv.org/abs/2403.12025).

WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.
CIFAR-100
datasets.activeloop.ai
universe.roboflow.com
+5more
deeplake
Updated Feb 3, 2022
Cite
Alex Krizhevsky (2022). CIFAR-100 [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/cifar-100-dataset/
Explore at:
Available download formats: deeplake
Dataset updated
Feb 3, 2022
Authors
Alex Krizhevsky
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Sep 8, 2009
Dataset funded by
University of Toronto
Description
The CIFAR-100 dataset is a large dataset of labeled images. It is a popular dataset for machine learning and artificial intelligence research. The dataset consists of 60,000 32x32 colour images. These images are split into 100 mutually exclusive classes, with 600 images per class. The classes cover animals, vehicles, and other objects.
(HS 3) Large Spatial Sample Datasets in Maryland
search.dataone.org
Updated Dec 30, 2023
Cite
Young-Don Choi (2023). (HS 3) Large Spatial Sample Datasets in Maryland [Dataset]. https://search.dataone.org/view/sha256%3A0c1fb040332273a6f82c7474485554a47dd112d2f39225f9ff00889d5db26581
Explore at:
Dataset updated
Dec 30, 2023
Dataset provided by
Hydroshare
Authors
Young-Don Choi
Area covered

Description
This HydroShare resource was created to share large spatial sample datasets in Maryland on GeoServer (https://geoserver.hydroshare.org/geoserver/web/wicket/bookmarkable/org.geoserver.web.demo.MapPreviewPage) and THREDDS (https://thredds.hydroshare.org/thredds/catalog/hydroshare/resources/catalog.html).

Users can check the uploaded LSS datasets on HydroShare-GeoServer and THREDDS using this HS resource id.

Then, through the RHESSys workflows, users can subset LSS datasets using OWSLib and xarray.
Big-Math-RL-UNVERIFIED
huggingface.co
Updated Apr 16, 2025
Cite
SynthLabs (2025). Big-Math-RL-UNVERIFIED [Dataset]. https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED
Explore at:
Dataset updated
Apr 16, 2025
Dataset authored and provided by
SynthLabs
Description
Big-Math: UNVERIFIED

WARNING: This dataset contains ONLY questions whose answers have not been verified to be correct. Use this dataset at your own risk.

Dataset Creation

Big-Math-Unverified is created as an offshoot of the Big-Math dataset (HuggingFace Dataset Link). Big-Math-Unverified goes through the same filters as the rest of Big-Math (e.g., remove non-English, remove multiple choice, etc.), except that these problems were not solved in any of the… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-UNVERIFIED.
DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking
live.european-language-grid.eu
binary format
Updated Jun 15, 2023
Cite
(2023). DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22959
Explore at:
Available download formats: binary format
Dataset updated
Jun 15, 2023
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
roberta-large
kaggle.com
zip
Updated Nov 17, 2022
Cite
Sergey Bochenkov (2022). roberta-large [Dataset]. https://www.kaggle.com/datasets/bachan/roberta-large
Explore at:
Available download formats: zip (858807520 bytes)
Dataset updated
Nov 17, 2022
Authors
Sergey Bochenkov
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Sergey Bochenkov

Released under CC0: Public Domain

Contents
Data from: FloodCastBench: A Large-Scale Dataset and Foundation Models for...
zenodo.org
zip
Updated Nov 1, 2024
Cite
Qingsong Xu; Yilei Shi; Jie Zhao; Xiao Xiang Zhu (2024). FloodCastBench: A Large-Scale Dataset and Foundation Models for Flood Modeling and Forecasting [Dataset]. http://doi.org/10.5281/zenodo.14017092
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14017092
Dataset updated
Nov 1, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Qingsong Xu; Yilei Shi; Jie Zhao; Xiao Xiang Zhu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Effective flood forecasting is crucial for informed decision-making and emergency response. Existing flood datasets mainly describe flood events but lack dynamic process data suitable for machine learning (ML). This work introduces the FloodCastBench dataset, designed for ML-based flood modeling and forecasting, featuring four major flood events: Pakistan 2022, UK 2015, Australia 2022, and Mozambique 2019. FloodCastBench provides comprehensive low-fidelity and high-fidelity flood forecasting datasets specifically for ML.

This dataset comprises three folders: the low-fidelity flood forecasting folder, the high-fidelity flood forecasting folder, and the relevant data folder. The low-fidelity flood forecasting folder includes data on the 2022 Pakistan flood and the 2019 Mozambique flood, both with a spatial resolution of 480 m. The high-fidelity flood forecasting folder contains two subfolders: one for the 2022 Australia flood and the 2015 UK flood with a spatial resolution of 30 m, and another for the same floods with a spatial resolution of 60 m. All data files are stored in TIFF format, with a temporal resolution of 300 seconds, and file names are numbered sequentially, incremented every 300 seconds until the simulation endpoint. The relevant data folder includes five subfiles: DEM, land use and land cover, rainfall data, georeferenced files, and initial condition files. The DEM, land use and land cover, rainfall, and initial condition data are all provided in TIFF format. The rainfall data is organized in a format of year-month-day-hour-minute-second. Georeferenced files provide geographic extent and spatial reference to support viewing and analysis of the associated TIFF files in GIS.
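The 300-second temporal resolution makes it straightforward to relate a frame's position in the sequence to simulation time. The sketch below assumes frames are counted from 0 at the simulation start, with one frame per 300-second step; the exact file-naming scheme is not spelled out here, so `frame_index` and `elapsed_time` are hypothetical names used only for illustration.

```python
from datetime import timedelta

STEP_SECONDS = 300  # temporal resolution stated in the description


def elapsed_time(frame_index: int, step_seconds: int = STEP_SECONDS) -> timedelta:
    """Simulation time elapsed at a given frame, assuming frame 0 is the start."""
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    return timedelta(seconds=frame_index * step_seconds)


# Number of frames needed to cover one simulated day at this resolution.
frames_per_day = 24 * 60 * 60 // STEP_SECONDS

print(elapsed_time(12), frames_per_day)
```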

FloodCastBench details the process of flood dynamics data acquisition, starting with input data preparation (e.g., topography, land use, rainfall) and flood measurement data collection (e.g., SAR-based maps, surveyed outlines) for hydrodynamic modeling. We deploy a widely recognized finite difference numerical solution to construct high-resolution spatiotemporal dynamic processes with 30-m spatial and 300-second temporal resolutions. Flood measurement data are used to calibrate the hydrodynamic model parameters and validate the flood inundation maps. Furthermore, we establish a benchmark of foundational models for neural flood forecasting using FloodCastBench, validating its effectiveness in supporting ML models for spatiotemporal, cross-regional, and downscaled flood forecasting.
ai_music_large
huggingface.co
Updated Dec 14, 2024
Cite
Jesse (2024). ai_music_large [Dataset]. https://huggingface.co/datasets/SleepyJesse/ai_music_large
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 14, 2024
Authors
Jesse
Description
AI/Human Music (Large variant)

A dataset that comprises both AI-generated music and human-composed music. This is the "large" variant of the dataset, which is around 70GiB in size. It contains 10,000 audio files from humans and 10,000 audio files from AI. The distribution is: 256 are from SunoCaps, 4,872 are from Udio, and 4,872 are from MusicSet. Data sources for this dataset:

https://huggingface.co/datasets/blanchon/udio_dataset… See the full description on the dataset page: https://huggingface.co/datasets/SleepyJesse/ai_music_large.
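As a quick consistency check on the stated distribution, the three source counts sum to 10,000, which matches the per-class file count given in the description. A minimal sketch of that arithmetic (the variable names are illustrative):

```python
# Per-source counts as stated in the description.
source_counts = {"SunoCaps": 256, "Udio": 4872, "MusicSet": 4872}

total = sum(source_counts.values())

# Fraction of the 10,000 files contributed by each source.
shares = {name: count / total for name, count in source_counts.items()}
print(total, shares)
```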

RBD24 - Risk Activities Dataset 2024

zenodo.org

bin

Updated Mar 4, 2025

Cite

Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime (2025). RBD24 - Risk Activities Dataset 2024 [Dataset]. http://doi.org/10.5281/zenodo.13787591

Explore at:

Available download formats: bin

Unique identifier

https://doi.org/10.5281/zenodo.13787591

Dataset updated

Mar 4, 2025

Dataset provided by

Zenodo http://zenodo.org/

Authors

Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Introduction

This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.

This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data" published in the journal Computers & Security. For more information, go to https://doi.org/10.1016/j.cose.2024.104290

Summary of the Datasets

The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.

DatasetId	Entity	Observed Behaviour	Groundtruth	Sample Shape
Crypto_desktop.parquet	DE	Miner Checking	IDS	0: 738/161202, 1: 11/1343
Crypto_smarphone.parquet	SM	Miner Checking	IDS	0: 613/180021, 1: 4/956
OutFlash_desktop.parquet	DE	Outdated software components	IDS	0: 738/161202, 1: 56/10820
OutFlash_smartphone.parquet	SM	Outdated software components	IDS	0: 613/180021, 1: 22/6639
OutTLS_desktop.parquet	DE	Outdated TLS protocol	IDS	0: 738/161202, 1: 18/2458
OutTLS_smartphone.parquet	SM	Outdated TLS protocol	IDS	0: 613/180021, 1: 11/2930
P2P_desktop.parquet	DE	P2P Activity	IDS	0: 738/161202, 1: 177/35892
P2P_smartphone.parquet	SM	P2P Activity	IDS	0: 613/180021, 1: 94/21688
NonEnc_desktop.parquet	DE	Non-encrypted password	IDS	0: 738/161202, 1: 291/59943
NonEnc_smaprthone.parquet	SM	Non-encrypted password	IDS	0: 613/180021, 1: 167/41434
Phishing_desktop.parquet	DE	Phishing email	Experimental Campaign	0: 98/13864, 1: 19/3072
Phishing_smartphone.parquet	SM	Phishing email	Experimental Campaign	0: 117/34006, 1: 26/8968
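The "Sample Shape" cells pack two count pairs per label into one string. The sketch below parses such a cell and derives the share of class-1 samples. It assumes that each pair reads as "users/samples" for the given label, which the table itself does not state explicitly; `parse_sample_shape` is a hypothetical helper.

```python
import re


def parse_sample_shape(cell: str) -> dict[int, tuple[int, int]]:
    """Parse a cell like '0: 738/161202, 1: 11/1343' into {label: (first, second)}."""
    pattern = re.compile(r"(\d+):\s*(\d+)/(\d+)")
    return {int(label): (int(a), int(b)) for label, a, b in pattern.findall(cell)}


# Cell copied from the Crypto_desktop.parquet row of the table.
shape = parse_sample_shape("0: 738/161202, 1: 11/1343")

# Assumption: the second number of each pair is the sample count for that label.
negatives = shape[0][1]
positives = shape[1][1]
positive_share = positives / (positives + negatives)
print(shape, round(positive_share, 4))
```

Under that reading, the positive class is a small minority in most files, which is worth keeping in mind when choosing evaluation metrics.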

Methodology

To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
ground truth are as follows:

- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.

For each user exposed to the behaviors stated in the summary table, different TWs are computed, aggregating
user behavior within a fixed time interval. These TWs serve as the input for various supervised
and unsupervised methods.

Sample Representation

The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two
timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the
construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated
interpretable features designed to describe device-level properties within the specified time frame. The most
influential features are described below.

- User: A unique hash value that identifies a user.
- Timestamp: The timestamp of the window.
- Features
- Label: 1 if the user exhibits compromised behavior, 0 otherwise. -1 indicates that it is a TW with an unknown label.
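Because a label of -1 marks time windows with an unknown label, supervised training would typically exclude those rows. A minimal sketch with pandas follows; the column names `User` and `Label` are taken from the field list above and are an assumption about the actual Parquet schema, and the rows are toy values invented for illustration.

```python
import pandas as pd

# Toy time windows invented for illustration; NOT rows from the dataset.
windows = pd.DataFrame(
    {
        "User": ["a1", "a1", "b2", "c3"],
        "Label": [0, 1, -1, 0],
    }
)

# Keep only windows with a known label (0 or 1) for supervised methods.
labeled = windows[windows["Label"] != -1]

# Windows with an unknown label can still feed unsupervised methods.
unlabeled = windows[windows["Label"] == -1]
print(len(labeled), len(unlabeled))
```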

Dataset Format

Parquet is a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with the pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here is an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:

```python
import pandas as pd

# Read a Parquet file into a DataFrame using the fastparquet engine
df = pd.read_parquet(
    'path_to_your_file.parquet',
    engine='fastparquet',
)
```

Student Skill Gap Analysis
data.mendeley.com
kaggle.com
Updated Apr 28, 2025
Cite
Bindu Garg (2025). Student Skill Gap Analysis [Dataset]. http://doi.org/10.17632/rv6scbpd7v.1
Explore at:
Unique identifier
https://doi.org/10.17632/rv6scbpd7v.1
Dataset updated
Apr 28, 2025
Authors
Bindu Garg
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is designed for skill gap analysis, focusing on evaluating the skill gap between students’ current skills and industry requirements. It provides insights into technical skills, soft skills, career interests, and challenges, helping in skill gap analysis to identify areas for improvement.

By leveraging this dataset, educators, recruiters, and researchers can conduct skill gap analysis to assess students’ job readiness and tailor training programs accordingly. It serves as a resource for identifying skill deficiencies, improving career guidance, and enhancing curriculum design.

The column descriptors are as follows:
- Name: Student's full name.
- email_id: Student's email address.
- Year: The academic year the student is currently in (e.g., 1st Year, 2nd Year, etc.).
- Current Course: The course the student is currently pursuing (e.g., B.Tech CSE, MBA, etc.).
- Technical Skills: List of technical skills possessed by the student (e.g., Python, Data Analysis, Cloud Computing).
- Programming Languages: Programming languages known by the student (e.g., Python, Java, C++).
- Rating: Self-assessed rating of technical skills on a scale of 1 to 5.
- Soft Skills: List of soft skills (e.g., Communication, Leadership, Teamwork).
- Rating: Self-assessed rating of soft skills on a scale of 1 to 5.
- Projects: Indicates whether the student has worked on any projects (Yes/No).
- Career Interest: The student's preferred career path (e.g., Data Scientist, Software Engineer).
- Challenges: Challenges faced while applying for jobs/internships (e.g., Lack of experience, Resume building issues).
large-data
kaggle.com
zip
Updated Aug 13, 2024
Cite
AYUSH SINGH331 (2024). large-data [Dataset]. https://www.kaggle.com/datasets/ayushsingh331/large-data/versions/1
Explore at:
Available download formats: zip (1203746376 bytes)
Dataset updated
Aug 13, 2024
Authors
AYUSH SINGH331
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by AYUSH SINGH331

Released under MIT

Contents
Big Flat, AR Age Group Population Dataset: A Complete Breakdown of Big Flat...
neilsberg.com
csv, json
Updated Jul 24, 2024
Cite
Neilsberg Research (2024). Big Flat, AR Age Group Population Dataset: A Complete Breakdown of Big Flat Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2024 Edition [Dataset]. https://www.neilsberg.com/research/datasets/aa799f4c-4983-11ef-ae5d-3860777c1fe6/
Explore at:
Available download formats: csv, json
Dataset updated
Jul 24, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Big Flat, Arkansas
Variables measured
Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. We divided the data into roughly 5-year buckets for ages between 0 and 85. For ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Big Flat population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of Big Flat. The dataset can be used to understand the population distribution of Big Flat by age. For example, using this dataset, we can identify the largest age group in Big Flat.

Key observations

The largest age group in Big Flat, AR was 15 to 19 years, with a population of 16 (25%), according to the ACS 2018-2022 5-Year Estimates. The smallest age group in Big Flat, AR was 5 to 9 years, with a population of 0 (0%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over
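
The 18 groups listed above follow from 5-year buckets plus one open-ended group for ages 85 and over. A minimal sketch of that bucketing, assuming ages are given in whole years; the function name `age_group` is hypothetical and is not part of the dataset or of any Neilsberg tooling:

```python
def age_group(age: int) -> str:
    """Map an age in whole years to one of the 18 age-group labels."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 5:
        return "Under 5 years"
    if age >= 85:
        # All ages of 85 and over share a single open-ended group.
        return "85 years and over"
    lower = (age // 5) * 5  # lower bound of the 5-year bucket
    return f"{lower} to {lower + 4} years"


# Collect the distinct labels in order of first appearance.
labels = []
for a in range(0, 120):
    label = age_group(a)
    if label not in labels:
        labels.append(label)

print(len(labels))    # number of distinct groups
print(age_group(17))  # label for a 17-year-old
```

The loop only confirms that the scheme yields the same number of groups as the list above; the published files already carry the group labels, so this mapping would matter only when re-deriving groups from individual ages.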

Variables / Data Columns

Age Group: This column displays the age group in consideration

Population: The population for the specific age group in Big Flat is shown in this column.

% of Total Population: This column displays the population of each age group as a proportion of Big Flat's total population. Please note that the sum of all percentages may not equal one due to rounding of values.
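
The relationship between the Population and % of Total Population columns can be sketched as follows. The counts below are toy values chosen for illustration, not the actual Big Flat figures, and the helper name `percent_shares` is hypothetical. The last lines illustrate why rounded percentages may not add up exactly.

```python
def percent_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each group's share of the total, as a percentage rounded to 2 decimals."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total population is zero; shares are undefined")
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


# Toy counts for illustration only (NOT the real dataset values).
toy_counts = {
    "Under 5 years": 12,
    "5 to 9 years": 0,
    "10 to 14 years": 4,
    "15 to 19 years": 16,
}
shares = percent_shares(toy_counts)
largest = max(toy_counts, key=toy_counts.get)  # group with the highest count
print(largest, shares[largest])

# Rounding effect: three equal groups each round to 33.33, so the sum falls short of 100.
thirds = percent_shares({"a": 1, "b": 1, "c": 1})
print(sum(thirds.values()))
```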

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Big Flat Population by Age.
Big Flat, AR Population Dataset: Yearly Figures, Population Change, and...
neilsberg.com
csv, json
Updated Sep 18, 2023
+ more versions
Cite
Neilsberg Research (2023). Big Flat, AR Population Dataset: Yearly Figures, Population Change, and Percent Change Analysis [Dataset]. https://www.neilsberg.com/research/datasets/6d48c774-3d85-11ee-9abe-0aa64bf2eeb2/
Explore at:
Available download formats: csv, json
Dataset updated
Sep 18, 2023
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Big Flat, Arkansas
Variables measured
Annual Population Growth Rate, Population Between 2000 and 2022, Annual Population Growth Rate Percent
Measurement technique
The data presented in this dataset is derived from more than 20 years of U.S. Census Bureau Population Estimates Program (PEP) data, 2000 - 2022. To measure the variables, namely (a) population and (b) population change (absolute and as a percentage), we initially analyzed and tabulated the data for each of the years between 2000 and 2022. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Big Flat population over the last 20 plus years. It lists the population for each year, along with the year-on-year change in population, as well as the change in percentage terms for each year. The dataset can be utilized to understand the population change of Big Flat across the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing. If there is a change, we can identify when the population peaked, or whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period of time.

Key observations

In 2022, the population of Big Flat was 89, unchanged (0.00%) from 2021. Previously, in 2021, the population of Big Flat was 89, an increase of 1.14% compared to a population of 88 in 2020. Over the last 20 plus years, between 2000 and 2022, the population of Big Flat decreased by 15. In this period, the peak population was 111 in the year 2007. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

Content

When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

Data Coverage:

From 2000 to 2022

Variables / Data Columns

Year: This column displays the data year (Measured annually and for years 2000 to 2022)

Population: The population of Big Flat for the specific year is shown in this column.

Year on Year Change: This column displays the change in Big Flat population for each year compared to the previous year.

Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Big Flat Population by Year.
3 Big Data Dataset
universe.roboflow.com
zip
Updated Oct 3, 2025
Cite
BIG DATA (2025). 3 Big Data Dataset [Dataset]. https://universe.roboflow.com/big-data-db8ne/3-big-data-myxfg/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Oct 3, 2025
Dataset authored and provided by
BIG DATA
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Cats Bounding Boxes
Description
3 BIG DATA

## Overview

3 BIG DATA is a dataset for object detection tasks. It contains Cats annotations for 943 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Big Falls Town, Wisconsin Census Bureau Gender Demographics and Population...
neilsberg.com
Updated Feb 19, 2024
+ more versions
Cite
Neilsberg Research (2024). Big Falls Town, Wisconsin Census Bureau Gender Demographics and Population Distribution Across Age Datasets [Dataset]. https://www.neilsberg.com/research/datasets/e173057a-52cf-11ee-804b-3860777c1fe6/
Explore at:
Dataset updated
Feb 19, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Big Falls, Wisconsin
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Big Falls town population by gender and age. The dataset can be utilized to understand the gender distribution and demographics of Big Falls town.

Content

The dataset comprises the following two datasets, one for each of these themes:

Big Falls Town, Wisconsin Population Breakdown by Gender

Big Falls Town, Wisconsin Population Breakdown by Gender and Age

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
Large Labelled Datset Dataset
universe.roboflow.com
zip
Updated Mar 21, 2024
Cite
Unmasking Trash Empowering Automated Object Recognition with internet Intelligence Thesis (2024). Large Labelled Datset Dataset [Dataset]. https://universe.roboflow.com/unmasking-trash-empowering-automated-object-recognition-with-internet-intelligence-thesis/large-labelled-datset
Explore at:
Available download formats: zip
Dataset updated
Mar 21, 2024
Dataset authored and provided by
Unmasking Trash Empowering Automated Object Recognition with internet Intelligence Thesis
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Waste Litter Garbage Trash Bounding Boxes
Description
Large Labelled Datset

## Overview

Large Labelled Datset is a dataset for object detection tasks. It contains Waste Litter Garbage Trash annotations for 12,919 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Big Spring, TX Age Group Population Dataset: A complete breakdown of Big...
neilsberg.com
csv, json
Updated Sep 16, 2023
+ more versions
Cite
Neilsberg Research (2023). Big Spring, TX Age Group Population Dataset: A complete breakdown of Big Spring age demographics from 0 to 85 years, distributed across 18 age groups [Dataset]. https://www.neilsberg.com/research/datasets/6fe2e819-3d85-11ee-9abe-0aa64bf2eeb2/
Explore at:
Available download formats: json, csv
Dataset updated
Sep 16, 2023
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Texas, Big Spring
Variables measured
Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. We divided ages between 0 and 85 into roughly 5-year buckets and aggregated all ages of 85 and over into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Big Spring population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of Big Spring. The dataset can be utilized to understand the population distribution of Big Spring by age. For example, using this dataset, we can identify the largest age group in Big Spring.

Key observations

The largest age group in Big Spring, TX was 35-39 years, with a population of 2,222 (8.48%), according to the 2021 American Community Survey. The smallest age group in Big Spring, TX was 80-84 years, with a population of 286 (1.09%). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
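
Because each figure pairs a count with a rounded percentage, the two quoted groups can be checked against each other: both should imply roughly the same total population. A rough sketch of that arithmetic, assuming the percentages are rounded to two decimals; the function name `implied_total_range` is hypothetical, and the result is a consistency check, not a published total:

```python
def implied_total_range(count: int, pct: float, decimals: int = 2) -> tuple[float, float]:
    """Range of totals consistent with `count` being `pct` percent after rounding."""
    half_step = 0.5 * 10 ** (-decimals)  # maximum rounding error of the percentage
    if pct - half_step <= 0:
        raise ValueError("percentage too small to bound the total")
    low = 100 * count / (pct + half_step)
    high = 100 * count / (pct - half_step)
    return low, high


# Figures quoted in the key observations for Big Spring, TX.
largest = implied_total_range(2222, 8.48)   # 35-39 years
smallest = implied_total_range(286, 1.09)   # 80-84 years

# The two ranges are consistent if they overlap.
overlap_low = max(largest[0], smallest[0])
overlap_high = min(largest[1], smallest[1])
consistent = overlap_low <= overlap_high
print(consistent, round(overlap_low), round(overlap_high))
```

The overlapping range points to a total population of roughly 26,200 for the two figures to agree.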

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over

Variables / Data Columns

Age Group: This column displays the age group in consideration

Population: The population for the specific age group in Big Spring is shown in this column.

% of Total Population: This column displays the population of each age group as a proportion of Big Spring's total population. Please note that the sum of all percentages may not equal one due to rounding of values.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Big Spring Population by Age.

Large Customer Churn Analysis Dataset

Predict Customer Churn with Synthetic Data

