roborovski/synthetic-tool-calls-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a simulated dataset based on both real phone interactions and conversations typically handled by real call centres. It comprises several simulated customer interactions with an agent representative, with both roles performed by actors. Phone call recordings were made using different mobile and landline devices. The scripting, for both customer and agent, aims to reproduce typical scenarios in telco-oriented call centre operations. Both raw waveform recordings and speech transcriptions are provided, the latter obtained with an automatic speech recognition (ASR) prototype developed by TID. Word segmentation timestamps are also provided for the recognized words, along with a per-token confidence score.
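Transcriptions with word timestamps and per-token confidence scores lend themselves to confidence-based filtering. The sketch below assumes a hypothetical record schema (`word`, `start_s`, `end_s`, `confidence`); the dataset's actual field names may differ.

```python
# Sketch: filtering ASR tokens by confidence. The record schema here
# (word, start_s, end_s, confidence) is an assumption for illustration,
# not the dataset's documented format.

def confident_words(tokens, threshold=0.8):
    """Keep recognized words whose per-token confidence meets the threshold."""
    return [t["word"] for t in tokens if t["confidence"] >= threshold]

transcript = [
    {"word": "hello",   "start_s": 0.12, "end_s": 0.45, "confidence": 0.97},
    {"word": "billing", "start_s": 0.50, "end_s": 1.02, "confidence": 0.61},
    {"word": "please",  "start_s": 1.10, "end_s": 1.48, "confidence": 0.88},
]

print(confident_words(transcript))  # low-confidence tokens are dropped
```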
roborovski/synthetic-tool-calls-v2-dpo-pairs dataset hosted on Hugging Face and contributed by the HF Datasets community
A series of synthesized sounds were played back from multiple locations around and within a grid of four DIFAR sonobuoys. Four type 53F DIFAR sonobuoys with attached SPOT GPS devices were deployed in a square with ~1 nmi (2 km) between each sonobuoy. Sound sources include periodic deployment of weighted light bulbs (producing an impulsive sound when the bulb implodes at depth) and synthetic tonal sounds broadcast through an underwater speaker. A time-synchronized multi-channel recording was made of the sounds received on the four sonobuoys. These experimental data have been used for multiple studies, including testing methods to estimate sonobuoy drift, testing novel methods for detection and localization of sonobuoy signals, and testing Acoustic Spatial Capture-Recapture (ASCR) methods for estimating call density. We encourage use of these data for additional research and development.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Rule-based Synthetic Data for Japanese GEC.
Dataset Contents: This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:
Synthetic Corpus - synthesized_data.tsv. This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence, while the second is the corresponding correction. These paired sentences are derived from data scraped from the keyword-lookup site
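A tab-delimited file of (erroneous, corrected) pairs like synthesized_data.tsv can be read with the standard `csv` module. The sample string below is a stand-in for the real file's contents.

```python
# Sketch: reading a tab-delimited parallel corpus where each line holds
# an erroneous sentence and its correction. The sample text is invented,
# not taken from synthesized_data.tsv.
import csv
import io

sample = "これわ例です\tこれは例です\nペンを買た\tペンを買った\n"  # stand-in data

def load_pairs(fileobj):
    reader = csv.reader(fileobj, delimiter="\t")
    return [(src, tgt) for src, tgt in reader]

pairs = load_pairs(io.StringIO(sample))
print(len(pairs))  # one (erroneous, corrected) tuple per line
```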
Title: System Call Traces from Real and Synthetic Sources
Description: This dataset comprises a collection of system call traces collected across various devices and environments. It includes both real-world system call sequences (captured from actual Android operating systems) and synthetically generated sequences designed to simulate realistic system behavior.
The data is structured to support a range of use cases, including:
Intrusion detection systems
Anomaly detection
Behavioral profiling of applications
The dataset is ideal for training and evaluating machine learning models that require low-level OS interaction data. By including both real and synthetic traces, it allows for balanced experimentation in controlled and uncontrolled conditions.
Features:
Real system call traces from multiple devices
Synthetic traces designed to mimic real patterns
Labelled for supervised learning tasks (if applicable)
Suitable for time-series, classification, or sequence modeling
Intended Use: This dataset can be used in academic research, cybersecurity benchmarking, and the development of intelligent system call analysis tools.
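One classic way to use such traces for intrusion detection is n-gram mismatch scoring over system-call sequences. The sketch below assumes traces are lists of syscall names; this format is an assumption, not the dataset's schema.

```python
# Sketch: a minimal n-gram mismatch score over system-call sequences,
# in the spirit of classic host-based anomaly detection. The trace
# format (lists of syscall names) is an assumption for illustration.

def ngrams(trace, n=3):
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def anomaly_score(trace, normal_profile, n=3):
    """Fraction of the trace's n-grams never seen in the normal profile."""
    grams = ngrams(trace, n)
    if not grams:
        return 0.0
    return sum(g not in normal_profile for g in grams) / len(grams)

normal = ngrams(["open", "read", "read", "write", "close"] * 4)
print(anomaly_score(["open", "read", "mmap", "execve", "close"], normal))
```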
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 1.0, September 2019.
Elizabeth Mendoza (1), Vincent Lostanlen (2, 3, 4), Justin Salamon (3, 4), Andrew Farnsworth (2), Steve Kelling (2), and Juan Pablo Bello (3, 4).
(1): Forest Hills High School, New York, NY, USA (2): Cornell Lab of Ornithology, Cornell University, Ithaca, NY, USA (3): Center for Urban Science and Progress, New York University, New York, NY, USA (4): Music and Audio Research Lab, New York University, New York, NY, USA
The BirdVox-scaper-10k dataset contains 9983 artificial soundscapes. Each soundscape lasts exactly ten seconds and contains one or several avian flight calls from up to 30 different species of New World warblers (Parulidae). Alongside each audio file, we include an annotation file describing the start time and end time of each flight call in the corresponding soundscape, as well as the species of warbler it belongs to.
In order to synthesize soundscapes in BirdVox-scaper-10k, we mixed natural sounds from various pre-recorded sources. First, we extracted isolated recordings of flight calls containing little or no background noise from the CLO-43SD dataset [1]. Second, we extracted 10-second "empty" acoustic scenes from the BirdVox-DCASE-20k dataset [2]. These acoustic scenes contain various sources of real-world background noise, including biophony (insects) and anthropophony (vehicles), yet are guaranteed to be devoid of any flight calls. Lastly, we "filled" each acoustic scene by mixing it with flight calls sampled at random.
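The "fill" step can be sketched as adding a short call into a longer background scene at a chosen onset. This is a deliberately simplified illustration (plain lists, no SNR control); the released dataset was built with a richer synthesis pipeline.

```python
# Sketch: mixing an isolated flight call into an "empty" acoustic scene
# at a random onset. Arrays are plain lists and no gain/SNR handling is
# done -- a simplification of the actual synthesis pipeline.
import random

def mix_call(scene, call, onset=None):
    """Add a short call into a longer background scene (lists of samples)."""
    if onset is None:
        onset = random.randrange(len(scene) - len(call) + 1)
    mixed = list(scene)
    for i, s in enumerate(call):
        mixed[onset + i] += s
    return mixed, onset

scene = [0.0] * 10        # 10 "samples" of background
call = [0.5, 0.5, 0.5]    # 3-sample flight call
mixed, onset = mix_call(scene, call, onset=4)
print(mixed)
```

The returned onset doubles as the start-time annotation that accompanies each soundscape.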
Although the BirdVox-scaper-10k dataset does not consist of natural recordings, we have taken several measures to ensure the plausibility of each synthesized soundscape, from both qualitative and quantitative standpoints.
The BirdVox-scaper-10k dataset can be used, among other things, for the research, development, and testing of bioacoustic classification models.
For details on the hardware of ROBIN recording units, we refer the reader to [2].
[1] J. Salamon, J. Bello. Fusing shallow and deep learning for bioacoustic bird species classification. Proc. IEEE ICASSP, 2017.
[2] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection. Proc. IEEE ICASSP, 2018.
[3] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.
@inproceedings{lostanlen2018icassp,
  title = {BirdVox-full-night: a dataset and benchmark for avian flight call detection},
  author = {Lostanlen, Vincent and Salamon, Justin and Farnsworth, Andrew and Kelling, Steve and Bello, Juan Pablo},
  booktitle = {Proc. IEEE ICASSP},
  year = {2018},
  publisher = {IEEE},
  address = {Calgary, Canada},
  month = {April},
}
https://spdx.org/licenses/etalab-2.0.html
This study sets out to establish the suitability of saliva-based whole-genome sequencing (WGS) through a comparison against blood-based WGS. To fully appraise the observed differences, we developed a novel technique of pseudo-replication. We also investigated the potential of characterizing individual salivary microbiomes from non-human DNA fragments found in saliva. We observed that the majority of discordant genotype calls between blood and saliva fell into known regions of the human genome that are typically sequenced with low confidence, and could be identified by quality control measures. Pseudo-replication demonstrated that the levels of discordance between blood- and saliva-derived WGS data were similar to what one would expect between technical replicates if an individual's blood or saliva had been sequenced twice. Finally, we successfully sequenced salivary microbiomes in parallel to human genomes, as demonstrated by a comparison against the Human Microbiome Project. A synthetic data set has been generated that allows the replication of our principal results without full disclosure of individual-level sequencing data. Read counts and relative abundances for the microbiome profiling analyses are similarly available.
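Blood-vs.-saliva concordance of the kind described above reduces to comparing genotype calls site by site. The sketch below assumes call sets keyed by variant site; this input format is illustrative, not the study's actual file format.

```python
# Sketch: genotype concordance between two call sets (e.g. blood vs.
# saliva), keyed by variant site. The dict-of-genotypes input format is
# an assumption for illustration.

def concordance(calls_a, calls_b):
    """Return (concordance rate, discordant sites) over shared sites."""
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return 0.0, []
    discordant = [site for site in sorted(shared)
                  if calls_a[site] != calls_b[site]]
    rate = 1 - len(discordant) / len(shared)
    return rate, discordant

blood  = {"chr1:1000": "A/G", "chr1:2000": "C/C", "chr2:500": "T/T"}
saliva = {"chr1:1000": "A/G", "chr1:2000": "C/T", "chr2:500": "T/T"}
rate, discordant = concordance(blood, saliva)
print(rate, discordant)
```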
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Multi-Turn Scam and Non-Scam Phone Conversation Dataset with Agentic Personalities
Dataset Description
The Synthetic Multi-Turn Scam and Non-Scam Phone Dialogue Dataset with Agentic Personalities is an enhanced collection of simulated phone conversations between two AI agents, one acting as a scammer or non-scammer and the other as an innocent receiver. Each dialogue is labeled as either a scam or non-scam interaction. This dataset is designed to help develop… See the full description on the dataset page: https://huggingface.co/datasets/BothBosu/multi-agent-scam-conversation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Automatic Identification System (AIS) allows vessels to share identification, characteristics, and location data through self-reporting. This information is periodically broadcast and can be received by other vessels with AIS transceivers, as well as ground or satellite sensors. Since the International Maritime Organisation (IMO) mandated AIS for vessels above 300 gross tonnage, extensive datasets have emerged, becoming a valuable resource for maritime intelligence.
Maritime collisions occur when two vessels collide or when a vessel collides with a floating or stationary object, such as an iceberg. Maritime collisions hold significant importance in the realm of marine accidents for several reasons:
Injuries and fatalities of vessel crew members and passengers.
Environmental effects, especially in cases involving large tanker ships and oil spills.
Direct and indirect economic losses on local communities near the accident area.
Adverse financial consequences for ship owners, insurance companies and cargo owners including vessel loss and penalties.
As sea routes become more congested and vessel speeds increase, the likelihood of significant accidents during a ship's operational life rises, with collisions between vessels becoming especially probable.
The development of solutions and models for the analysis, early detection and mitigation of vessel collision events is a significant step towards ensuring future maritime safety. In this context, a synthetic vessel proximity event dataset is created using real vessel AIS messages. The synthetic dataset of trajectories with reconstructed timestamps is generated so that a pair of trajectories reach simultaneously their intersection point, simulating an unintended proximity event (collision close call). The dataset aims to provide a basis for the development of methods for the detection and mitigation of maritime collisions and proximity events, as well as the study and training of vessel crews in simulator environments.
The dataset consists of 4658 samples/AIS messages of 213 unique vessels from the Aegean Sea. The steps that were followed to create the collision dataset are:
Given 2 vessels X (vessel_id1) and Y (vessel_id2) with their current known location (LATITUDE [lat], LONGITUDE [lon]):
Check if the trajectories of vessels X and Y are spatially intersecting.
If the trajectories of vessels X and Y intersect, temporally align vessel Y's timestamp at the intersection point with X's timestamp at the intersection point. The temporal alignment is performed so that the spatial intersection (nearest proximity point) occurs at the same time for both vessels.
Additionally, for each vessel pair, the timestamp of the proximity event differs from that of any later proximity event, so that different vessel trajectory pairs do not overlap temporally.
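The temporal-alignment step above amounts to shifting vessel Y's timestamps by a constant offset so both vessels reach the intersection simultaneously. A minimal sketch, assuming tracks are lists of (t, lon, lat) tuples and the intersection timestamps are already known:

```python
# Sketch of the temporal-alignment step: shift vessel Y's timestamps so
# that X and Y reach the (already computed) spatial intersection point at
# the same time. The (t, lon, lat) record format is an assumption.

def align_to_intersection(track_y, t_x_at_intersect, t_y_at_intersect):
    """Return Y's track with all timestamps shifted by a constant offset."""
    delta = t_x_at_intersect - t_y_at_intersect
    return [(t + delta, lon, lat) for t, lon, lat in track_y]

# Vessel Y passes the intersection (23.7, 37.9) at t=300; X passes it at t=500.
track_y = [(100, 23.5, 37.8), (300, 23.7, 37.9), (500, 23.9, 38.0)]
aligned = align_to_intersection(track_y, t_x_at_intersect=500, t_y_at_intersect=300)
print(aligned[1])
```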
Two CSV files are provided. vessel_positions.csv includes the AIS positions (vessel_id, t, lon, lat, heading, course, speed) of all vessels. Simulated_vessel_proximity_events.csv includes the id, position, and timestamp of each identified proximity event, along with the vessel_id numbers of the associated vessels. The total number of unintended proximity events in the dataset is 237. Examples of unintended vessel proximity events are visualized in the respective png and gif files.
The research leading to these results has received funding from the European Union's Horizon Europe Programme under the CREXDATA Project, grant agreement n° 101092749.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for 🥤SODA Synthetic Dialogue
Dataset Summary
🥤SODA Synthetic Dialogue is a set of synthetic dialogues between Assistant and User. In each conversation, User asks Assistant to perform summarization or story generation tasks based on a snippet of an existing dialogue, story, or from a title or theme. This data was created by synthesizing the dialogues in 🥤Soda and applying a set of templates to generate the conversation. The original research paper can be… See the full description on the dataset page: https://huggingface.co/datasets/emozilla/soda_synthetic_dialogue.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We created a Thai multi-turn conversation dataset from airesearch/WangchanThaiInstruct (Batch 1) using an LLM. It was generated synthetically with an open-source LLM in the Thai language.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a synthetic dataset to teach students about using clinical and genetic covariates to predict cardiovascular risk in a realistic (but synthetic) dataset. For the workshop materials, please go here: https://github.com/laderast/cvdNight1
Contents:
1) dataDictionary.pdf - PDF file describing all covariates in the synthetic dataset.
2) fullPatientData.csv - CSV file with multiple covariates.
3) genoData.csv - subset of patients in fullPatientData.csv with additional SNP calls.
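Since genoData.csv covers only a subset of the patients in fullPatientData.csv, a typical first step is a left join on a patient identifier. The column names below (`patient_id`, `age`, `sbp`, `rs123`) are assumptions; see dataDictionary.pdf for the real covariate definitions.

```python
# Sketch: left-joining SNP calls onto the full patient table on a shared
# key. Column names are hypothetical -- consult dataDictionary.pdf for
# the actual schema.
import csv
import io

full = "patient_id,age,sbp\np1,54,130\np2,61,145\n"  # stand-in file contents
geno = "patient_id,rs123\np2,AG\n"

def left_join(full_csv, geno_csv, key="patient_id"):
    geno_rows = {r[key]: r for r in csv.DictReader(io.StringIO(geno_csv))}
    out = []
    for row in csv.DictReader(io.StringIO(full_csv)):
        merged = dict(row)
        extra = geno_rows.get(row[key], {})
        merged.update({k: v for k, v in extra.items() if k != key})
        out.append(merged)
    return out

rows = left_join(full, geno)
print(rows[1].get("rs123"))  # genotype present only for the genotyped subset
```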
WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing — making it ideal for:
Training ASR (Automatic Speech Recognition) systems
Improving voice assistants and LLM audio understanding
Enhancing call center AI tools (e.g., sentiment analysis, intent detection)
Benchmarking conversational AI performance with real-world noise and context
We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery. Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.
Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack — realism, variation, and authenticity.
This is a synthetic data stream based on real-time cell network events. These events are picked up by the antennas closest to the mobile phone, thus providing an approximate location of the device. Every transaction of a mobile phone generates one of these events. A transaction can be, for instance, placing or receiving a call, sending or receiving an SMS, requesting a specific URL in your mobile phone browser, or sending a text message or a data transaction from/to any mobile phone app. There are also some synchronization events, for instance turning your mobile phone on or off, or switching between location area networks (relatively big geographical areas comprising several cell towers).
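Because each event only ties a device to a serving antenna, a rough position estimate is simply that antenna's location. The sketch below uses an invented antenna list and a flat-Earth distance for illustration.

```python
# Sketch: approximating a device's position as the nearest antenna that
# could serve its event. The antenna coordinates and the planar distance
# are simplifications for illustration.
import math

def nearest_antenna(event_lon, event_lat, antennas):
    """antennas: {cell_id: (lon, lat)} -> id of the closest cell."""
    return min(
        antennas,
        key=lambda cid: math.hypot(antennas[cid][0] - event_lon,
                                   antennas[cid][1] - event_lat),
    )

antennas = {"cellA": (23.70, 37.98), "cellB": (23.75, 37.95)}
print(nearest_antenna(23.71, 37.97, antennas))
```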
In doctor-patient conversations, identifying medically relevant information is crucial, posing the need for conversation summarization. In this work, we first propose a deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. Our system could enhance user experience from a business standpoint, while also reducing computational costs from a technical perspective. Second, we present VietMed-Sum which, to our knowledge, is the first speech summarization dataset for medical conversations. Third, we are the first to utilize LLMs and human annotators collaboratively to create gold-standard and synthetic summaries for medical conversation summarization.
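The local/global summary schedule described above can be sketched as windowing the utterance stream. The `summarize` function below is a placeholder for the actual summarization model.

```python
# Sketch of the local/global summary schedule: a local summary after
# every N utterances, plus one global summary at the end. `summarize`
# is a stand-in, not the system's actual model.

def summarize(utterances):
    return " | ".join(utterances)  # placeholder for an LLM summarizer

def stream_summaries(utterances, n=2):
    local = [summarize(utterances[i:i + n])
             for i in range(0, len(utterances), n)]
    return local, summarize(utterances)

local, global_summary = stream_summaries(["u1", "u2", "u3", "u4", "u5"], n=2)
print(len(local))  # ceil(5 / 2) local summaries, plus one global summary
```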
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Multiturn Multimodal
We want to generate synthetic data that is able to understand position and relationships across multiple images and multiple audio clips; an example is shown below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal
multi-images
synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main
Example data
{'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data-providing participants' incentive to cooperate. This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.
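The idea of computing global statistics from local sufficient statistics can be illustrated with mean and variance: each site ships only (count, sum, sum of squares), yet the aggregate matches a centralized computation exactly. This is a generic sketch of the principle, not the paper's specific algorithm.

```python
# Sketch: exact global mean/variance from per-site sufficient statistics
# (n, sum, sum of squares) -- the kind of aggregation that lets a
# distributed detector match its centralized counterpart.

def local_stats(xs):
    return len(xs), sum(xs), sum(x * x for x in xs)

def global_mean_var(stats):
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    sq = sum(s[2] for s in stats)
    mean = total / n
    var = sq / n - mean * mean  # population variance of the pooled data
    return mean, var

site_a, site_b = [1.0, 2.0, 3.0], [4.0, 5.0]
mean, var = global_mean_var([local_stats(site_a), local_stats(site_b)])
print(mean, var)  # identical to stats computed on the pooled data
```

A z-score anomaly threshold computed from this global mean and variance is then the same whether the data was pooled or left distributed.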
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The synthetic NCD dataset, based on Athens pilot cases, is a simulated yet realistic resource that can support the testing of AI and analytics algorithms for crime resolution.
The first version of the dataset (SAEX1) was generated to simulate terminal movements within a 2x2 km area over a four-hour window on July 28, 2023, from 08:00 to 12:00. It contains over 6.9 million signaling events across 32,485 terminals, with each terminal averaging 213 events. Each event logs a terminal's location, timestamp, and its proximity to a serving cell, along with the cell coordinates. To align with the investigation of a crime scene, the dataset was filtered to include only those individuals whose movement patterns intersected a predefined bounding box around Kerameikos, Athens, Greece, during the specified timeframe. This filtering process resulted in a refined dataset of trajectory data for 30,703 individuals. Figure 1 shows the final generated data for the SAEX1 dataset.
Figure 1: Initial SAEX Dataset.
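The bounding-box filter described for SAEX1 keeps only terminals whose trajectory enters the area of interest. The coordinates and box below are illustrative, not the actual Kerameikos bounds.

```python
# Sketch of the bounding-box filter: keep terminal ids whose events fall
# inside the area of interest. Box and event coordinates are invented
# for illustration.

def in_box(lon, lat, box):
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def filter_terminals(events, box):
    """events: iterable of (terminal_id, lon, lat) -> ids seen inside the box."""
    return {tid for tid, lon, lat in events if in_box(lon, lat, box)}

box = (23.70, 37.97, 23.72, 37.99)
events = [("t1", 23.71, 37.98), ("t2", 23.80, 37.90), ("t1", 23.69, 37.96)]
print(sorted(filter_terminals(events, box)))
```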
As part of the TRACY project, the implemented algorithms are being evaluated using real pilot cases provided by LEA Partners. This evaluation revealed the need for an update to the initial version of the simulated data, as the cases included additional information beyond the cell data. Specifically, they also featured Call Detail Records (CDRs), which encompass call logs, SMS messages, and mobile data. To address this, the initial data was enriched with these CDR details.
To accomplish this, a custom CDR Simulator was developed. The simulator generates random pairs of terminals that engage in actions (Calls or SMSs), along with the corresponding timeframes during which these actions occur. Additionally, the simulator selects specific terminals to assign browsing activities. Finally, the generated data is enriched with information about the first and last serving cells, ensuring the creation of a realistic synthetic case dataset.
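A minimal sketch of such a CDR simulator, in the spirit of the description above: random terminal pairs, an action type, and a timestamp inside the simulation window. Field names and parameters are illustrative, not those of the actual TRACY simulator.

```python
# Sketch of a CDR simulator: random terminal pairs engaging in CALL/SMS
# actions at random times within the simulation window. All field names
# are hypothetical.
import random

def simulate_cdrs(terminals, n_records, t_start, t_end, seed=0):
    rng = random.Random(seed)
    records = []
    for _ in range(n_records):
        src, dst = rng.sample(terminals, 2)  # two distinct terminals
        records.append({
            "src": src,
            "dst": dst,
            "action": rng.choice(["CALL", "SMS"]),
            "t": rng.randint(t_start, t_end),  # seconds into the window
        })
    return records

cdrs = simulate_cdrs(["t1", "t2", "t3", "t4"], 5, 0, 14400)  # 4-hour window
print(len(cdrs))
```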
Finally, since the real case data is provided in multiple files per provider, the same structure has been adopted in the simulated dataset. It is important to note that each file contains cell information only for terminals that are customers of the respective provider. For all other terminals, the cell information is left blank. This approach reflects the structure and content of the real cases and enables the use of the same data pipeline process developed within the TRACY project.
SAEX2 employs the same simulation settings as SAEX1. However, it involves a single criminal (ground truth) who follows a predetermined route within a specified timeframe, thereby enabling detection by the TRACY algorithm. Table 2 shows the dataset’s main characteristics.
Metric | Value
Simulation Area Width x Height (km) | 2x2
Simulation Start-time | 2023-07-28 08:00:00
Simulation End-time | 2023-07-28 12:00:00
Simulation Duration (minutes) | 240
Number of Terminals | 30692
Number of Events | 1469515
Number of Produced CSVs | 6 (2 per provider)
Table 2: SAEX2 Vital Statistics.
In the absence of real data for experimentation, having a synthetic yet realistic NCD dataset is crucial for developing and evaluating crime resolution techniques.
Open-source data has been used during the development of this synthetic dataset; more specifically, open-source data from OpenCellid, as well as open government data from the Hellenic Statistical Authority and the Antenna Construction Information Portal developed by the Hellenic Telecommunications and Post Commission (EETT).
TRACY is funded under DIGITAL-2022-DEPLOY-02-LAW-SECURITY-AI (GA: 101102641)