roborovski/synthetic-tool-calls-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a simulated dataset based on both real phone interactions and conversations typically handled by real call centres. It comprises several simulated customer interactions with an agent representative, with both roles performed by actors. Phone call recordings were made using different mobile and landline devices. The scripting, for both customer and agent, aims to reproduce typical scenarios in telco-oriented call centre operations. Both raw waveform recordings and speech transcriptions are provided, the latter obtained with an automatic speech recognition (ASR) prototype developed by TID. Word segmentation timestamps are also provided for the recognized words, along with a per-token confidence score.
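Transcriptions with word timestamps and per-token confidence scores lend themselves to confidence-based filtering. The sketch below assumes a hypothetical record schema (`word`, `start_s`, `end_s`, `confidence`); the dataset's actual field names may differ.

```python
# Sketch: filtering ASR tokens by confidence. The record schema here
# (word, start_s, end_s, confidence) is an assumption for illustration,
# not the dataset's documented format.

def confident_words(tokens, threshold=0.8):
    """Keep recognized words whose per-token confidence meets the threshold."""
    return [t["word"] for t in tokens if t["confidence"] >= threshold]

transcript = [
    {"word": "hello",   "start_s": 0.12, "end_s": 0.45, "confidence": 0.97},
    {"word": "billing", "start_s": 0.50, "end_s": 1.02, "confidence": 0.61},
    {"word": "please",  "start_s": 1.10, "end_s": 1.48, "confidence": 0.88},
]

print(confident_words(transcript))  # low-confidence tokens are dropped
```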
roborovski/synthetic-tool-calls-v2-dpo-pairs dataset hosted on Hugging Face and contributed by the HF Datasets community
A series of synthesized sounds were played back from multiple locations around and within a grid of four DIFAR sonobuoys. Four type 53F DIFAR sonobuoys with attached SPOT GPS devices were deployed in a square with ~1 nmi (2 km) between each sonobuoy. Sound sources include periodic deployment of weighted light bulbs (producing an impulsive sound when the bulb implodes at depth) and synthetic tonal sounds broadcast through an underwater speaker. A time-synchronized multi-channel recording was made of the sounds received on the four sonobuoys. These experimental data have been used for multiple studies, including testing methods to estimate sonobuoy drift, testing novel methods for detection and localization of sonobuoy signals, and testing Acoustic Spatial Capture-Recapture (ASCR) methods for estimating call density. We encourage use of these data for additional research and development.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Rule-based Synthetic Data for Japanese GEC.
Dataset Contents: This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:
Synthetic Corpus - synthesized_data.tsv. This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence, while the second is the corresponding correction. These paired sentences are derived from data scraped from the keyword-lookup site
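A tab-delimited file of (erroneous, corrected) pairs like synthesized_data.tsv can be read with the standard `csv` module. The sample string below is a stand-in for the real file's contents.

```python
# Sketch: reading a tab-delimited parallel corpus where each line holds
# an erroneous sentence and its correction. The sample text is invented,
# not taken from synthesized_data.tsv.
import csv
import io

sample = "これわ例です\tこれは例です\nペンを買た\tペンを買った\n"  # stand-in data

def load_pairs(fileobj):
    reader = csv.reader(fileobj, delimiter="\t")
    return [(src, tgt) for src, tgt in reader]

pairs = load_pairs(io.StringIO(sample))
print(len(pairs))  # one (erroneous, corrected) tuple per line
```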
Title: System Call Traces from Real and Synthetic Sources
Description: This dataset comprises a collection of system call traces collected across various devices and environments. It includes both real-world system call sequences (captured from actual Android operating systems) and synthetically generated sequences designed to simulate realistic system behavior.
The data is structured to support a range of use cases, including:
Intrusion detection systems
Anomaly detection
Behavioral profiling of applications
The dataset is ideal for training and evaluating machine learning models that require low-level OS interaction data. By including both real and synthetic traces, it allows for balanced experimentation in controlled and uncontrolled conditions.
Features:
Real system call traces from multiple devices
Synthetic traces designed to mimic real patterns
Labelled for supervised learning tasks (if applicable)
Suitable for time-series, classification, or sequence modeling
Intended Use: This dataset can be used in academic research, cybersecurity benchmarking, and the development of intelligent system call analysis tools.
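One classic way to use such traces for intrusion detection is n-gram mismatch scoring over system-call sequences. The sketch below assumes traces are lists of syscall names; this format is an assumption, not the dataset's schema.

```python
# Sketch: a minimal n-gram mismatch score over system-call sequences,
# in the spirit of classic host-based anomaly detection. The trace
# format (lists of syscall names) is an assumption for illustration.

def ngrams(trace, n=3):
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def anomaly_score(trace, normal_profile, n=3):
    """Fraction of the trace's n-grams never seen in the normal profile."""
    grams = ngrams(trace, n)
    if not grams:
        return 0.0
    return sum(g not in normal_profile for g in grams) / len(grams)

normal = ngrams(["open", "read", "read", "write", "close"] * 4)
print(anomaly_score(["open", "read", "mmap", "execve", "close"], normal))
```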
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 1.0, September 2019.
Elizabeth Mendoza (1), Vincent Lostanlen (2, 3, 4), Justin Salamon (3, 4), Andrew Farnsworth (2), Steve Kelling (2), and Juan Pablo Bello (3, 4).
(1): Forest Hills High School, New York, NY, USA (2): Cornell Lab of Ornithology, Cornell University, Ithaca, NY, USA (3): Center for Urban Science and Progress, New York University, New York, NY, USA (4): Music and Audio Research Lab, New York University, New York, NY, USA
The BirdVox-scaper-10k dataset contains 9983 artificial soundscapes. Each soundscape lasts exactly ten seconds and contains one or several avian flight calls from up to 30 different species of New World warblers (Parulidae). Alongside each audio file, we include an annotation file describing the start time and end time of each flight call in the corresponding soundscape, as well as the species of warbler it belongs to.
In order to synthesize soundscapes in BirdVox-scaper-10k, we mixed natural sounds from various pre-recorded sources. First, we extracted isolated recordings of flight calls containing little or no background noise from the CLO-43SD dataset [1]. Second, we extracted 10-second "empty" acoustic scenes from the BirdVox-DCASE-20k dataset [2]. These acoustic scenes contain various sources of real-world background noise, including biophony (insects) and anthropophony (vehicles), yet are guaranteed to be devoid of any flight calls. Lastly, we "filled" each acoustic scene by mixing it with flight calls sampled at random.
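The "fill" step can be sketched as adding a short call into a longer background scene at a chosen onset. This is a deliberately simplified illustration (plain lists, no SNR control); the released dataset was built with a richer synthesis pipeline.

```python
# Sketch: mixing an isolated flight call into an "empty" acoustic scene
# at a random onset. Arrays are plain lists and no gain/SNR handling is
# done -- a simplification of the actual synthesis pipeline.
import random

def mix_call(scene, call, onset=None):
    """Add a short call into a longer background scene (lists of samples)."""
    if onset is None:
        onset = random.randrange(len(scene) - len(call) + 1)
    mixed = list(scene)
    for i, s in enumerate(call):
        mixed[onset + i] += s
    return mixed, onset

scene = [0.0] * 10        # 10 "samples" of background
call = [0.5, 0.5, 0.5]    # 3-sample flight call
mixed, onset = mix_call(scene, call, onset=4)
print(mixed)
```

The returned onset doubles as the start-time annotation that accompanies each soundscape.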
Although the BirdVox-scaper-10k dataset does not consist of natural recordings, we have taken several measures to ensure the plausibility of each synthesized soundscape, from both qualitative and quantitative standpoints.
The BirdVox-scaper-10k dataset can be used, among other things, for the research, development, and testing of bioacoustic classification models.
For details on the hardware of ROBIN recording units, we refer the reader to [2].
[1] J. Salamon, J. Bello. Fusing shallow and deep learning for bioacoustic bird species classification. Proc. IEEE ICASSP, 2017.
[2] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection. Proc. IEEE ICASSP, 2018.
[3] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.
@inproceedings{lostanlen2018icassp,
  title = {BirdVox-full-night: a dataset and benchmark for avian flight call detection},
  author = {Lostanlen, Vincent and Salamon, Justin and Farnsworth, Andrew and Kelling, Steve and Bello, Juan Pablo},
  booktitle = {Proc. IEEE ICASSP},
  year = {2018},
  publisher = {IEEE},
  address = {Calgary, Canada},
  month = {April},
}
https://spdx.org/licenses/etalab-2.0.html
This study sets out to establish the suitability of saliva-based whole-genome sequencing (WGS) through a comparison against blood-based WGS. To fully appraise the observed differences, we developed a novel technique of pseudo-replication. We also investigated the potential of characterizing individual salivary microbiomes from non-human DNA fragments found in saliva. We observed that the majority of discordant genotype calls between blood and saliva fell into known regions of the human genome that are typically sequenced with low confidence, and could be identified by quality control measures. Pseudo-replication demonstrated that the levels of discordance between blood- and saliva-derived WGS data were similar to what one would expect between technical replicates if an individual's blood or saliva had been sequenced twice. Finally, we successfully sequenced salivary microbiomes in parallel to human genomes, as demonstrated by a comparison against the Human Microbiome Project. A synthetic data set has been generated that allows the replication of our principal results without full disclosure of individual-level sequencing data. Read counts and relative abundances for the microbiome profiling analyses are similarly available.
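Blood-vs.-saliva concordance of the kind described above reduces to comparing genotype calls site by site. The sketch below assumes call sets keyed by variant site; this input format is illustrative, not the study's actual file format.

```python
# Sketch: genotype concordance between two call sets (e.g. blood vs.
# saliva), keyed by variant site. The dict-of-genotypes input format is
# an assumption for illustration.

def concordance(calls_a, calls_b):
    """Return (concordance rate, discordant sites) over shared sites."""
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return 0.0, []
    discordant = [site for site in sorted(shared)
                  if calls_a[site] != calls_b[site]]
    rate = 1 - len(discordant) / len(shared)
    return rate, discordant

blood  = {"chr1:1000": "A/G", "chr1:2000": "C/C", "chr2:500": "T/T"}
saliva = {"chr1:1000": "A/G", "chr1:2000": "C/T", "chr2:500": "T/T"}
rate, discordant = concordance(blood, saliva)
print(rate, discordant)
```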
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Multi-Turn Scam and Non-Scam Phone Conversation Dataset with Agentic Personalities
Dataset Description
The Synthetic Multi-Turn Scam and Non-Scam Phone Dialogue Dataset with Agentic Personalities is an enhanced collection of simulated phone conversations between two AI agents, one acting as a scammer or non-scammer and the other as an innocent receiver. Each dialogue is labeled as either a scam or non-scam interaction. This dataset is designed to help develop… See the full description on the dataset page: https://huggingface.co/datasets/BothBosu/multi-agent-scam-conversation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Automatic Identification System (AIS) allows vessels to share identification, characteristics, and location data through self-reporting. This information is periodically broadcast and can be received by other vessels with AIS transceivers, as well as ground or satellite sensors. Since the International Maritime Organisation (IMO) mandated AIS for vessels above 300 gross tonnage, extensive datasets have emerged, becoming a valuable resource for maritime intelligence.
Maritime collisions occur when two vessels collide or when a vessel collides with a floating or stationary object, such as an iceberg. Maritime collisions hold significant importance in the realm of marine accidents for several reasons:
Injuries and fatalities of vessel crew members and passengers.
Environmental effects, especially in cases involving large tanker ships and oil spills.
Direct and indirect economic losses on local communities near the accident area.
Adverse financial consequences for ship owners, insurance companies and cargo owners including vessel loss and penalties.
As sea routes become more congested and vessel speeds increase, the likelihood of significant accidents during a ship's operational life rises, with collisions between vessels becoming especially probable.
The development of solutions and models for the analysis, early detection and mitigation of vessel collision events is a significant step towards ensuring future maritime safety. In this context, a synthetic vessel proximity event dataset is created using real vessel AIS messages. The synthetic dataset of trajectories with reconstructed timestamps is generated so that a pair of trajectories reach simultaneously their intersection point, simulating an unintended proximity event (collision close call). The dataset aims to provide a basis for the development of methods for the detection and mitigation of maritime collisions and proximity events, as well as the study and training of vessel crews in simulator environments.
The dataset consists of 4658 samples/AIS messages of 213 unique vessels from the Aegean Sea. The steps that were followed to create the collision dataset are:
Given 2 vessels X (vessel_id1) and Y (vessel_id2) with their current known location (LATITUDE [lat], LONGITUDE [lon]):
Check if the trajectories of vessels X and Y are spatially intersecting.
If the trajectories of vessels X and Y intersect, temporally align vessel Y's timestamp at the intersection point with X's timestamp at the intersection point. The temporal alignment is performed so that the spatial intersection (nearest proximity point) occurs at the same time for both vessels.
Additionally, for each vessel pair, the timestamp of the proximity event differs from that of any later proximity event, so that different vessel trajectory pairs do not overlap temporally.
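The temporal-alignment step above amounts to shifting vessel Y's timestamps by a constant offset so both vessels reach the intersection simultaneously. A minimal sketch, assuming tracks are lists of (t, lon, lat) tuples and the intersection timestamps are already known:

```python
# Sketch of the temporal-alignment step: shift vessel Y's timestamps so
# that X and Y reach the (already computed) spatial intersection point at
# the same time. The (t, lon, lat) record format is an assumption.

def align_to_intersection(track_y, t_x_at_intersect, t_y_at_intersect):
    """Return Y's track with all timestamps shifted by a constant offset."""
    delta = t_x_at_intersect - t_y_at_intersect
    return [(t + delta, lon, lat) for t, lon, lat in track_y]

# Vessel Y passes the intersection (23.7, 37.9) at t=300; X passes it at t=500.
track_y = [(100, 23.5, 37.8), (300, 23.7, 37.9), (500, 23.9, 38.0)]
aligned = align_to_intersection(track_y, t_x_at_intersect=500, t_y_at_intersect=300)
print(aligned[1])
```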
Two CSV files are provided. vessel_positions.csv includes the AIS positions (vessel_id, t, lon, lat, heading, course, speed) of all vessels. Simulated_vessel_proximity_events.csv includes the id, position, and timestamp of each identified proximity event, along with the vessel_id numbers of the associated vessels. The total number of unintended proximity events in the dataset is 237. Examples of unintended vessel proximity events are visualized in the respective png and gif files.
The research leading to these results has received funding from the European Union's Horizon Europe Programme under the CREXDATA Project, grant agreement n° 101092749.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for 🥤SODA Synthetic Dialogue
Dataset Summary
🥤SODA Synthetic Dialogue is a set of synthetic dialogues between Assistant and User. In each conversation, User asks Assistant to perform summarization or story generation tasks based on a snippet of an existing dialogue, story, or from a title or theme. This data was created by synthesizing the dialogues in 🥤Soda and applying a set of templates to generate the conversation. The original research paper can be… See the full description on the dataset page: https://huggingface.co/datasets/emozilla/soda_synthetic_dialogue.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We created a Thai multi-turn conversation dataset from airesearch/WangchanThaiInstruct (Batch 1) using an LLM. It was generated synthetically with an open-source LLM in the Thai language.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a synthetic dataset to teach students about using clinical and genetic covariates to predict cardiovascular risk in a realistic (but synthetic) dataset. For the workshop materials, please go here: https://github.com/laderast/cvdNight1
Contents:
1) dataDictionary.pdf - PDF file describing all covariates in the synthetic dataset.
2) fullPatientData.csv - CSV file with multiple covariates.
3) genoData.csv - subset of patients in fullPatientData.csv with additional SNP calls.
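Since genoData.csv covers only a subset of the patients in fullPatientData.csv, a typical first step is a left join on a patient identifier. The column names below (`patient_id`, `age`, `sbp`, `rs123`) are assumptions; see dataDictionary.pdf for the real covariate definitions.

```python
# Sketch: left-joining SNP calls onto the full patient table on a shared
# key. Column names are hypothetical -- consult dataDictionary.pdf for
# the actual schema.
import csv
import io

full = "patient_id,age,sbp\np1,54,130\np2,61,145\n"  # stand-in file contents
geno = "patient_id,rs123\np2,AG\n"

def left_join(full_csv, geno_csv, key="patient_id"):
    geno_rows = {r[key]: r for r in csv.DictReader(io.StringIO(geno_csv))}
    out = []
    for row in csv.DictReader(io.StringIO(full_csv)):
        merged = dict(row)
        extra = geno_rows.get(row[key], {})
        merged.update({k: v for k, v in extra.items() if k != key})
        out.append(merged)
    return out

rows = left_join(full, geno)
print(rows[1].get("rs123"))  # genotype present only for the genotyped subset
```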
WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing — making it ideal for:
Training ASR (Automatic Speech Recognition) systems
Improving voice assistants and LLM audio understanding
Enhancing call center AI tools (e.g., sentiment analysis, intent detection)
Benchmarking conversational AI performance with real-world noise and context
We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery. Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.
Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack — realism, variation, and authenticity.
This is a synthetic data stream based on real-time cell network events. These events are picked up by the antennas closest to the mobile phone, thus providing an approximate location of the device. Every transaction of a mobile phone generates one of these events. A transaction can be, for instance, placing or receiving a call, sending or receiving an SMS, requesting a specific URL in your mobile phone browser, or sending a text message or a data transaction from/to any mobile phone app. There are also some synchronization events, for instance turning your mobile phone on or off, or switching between location area networks (relatively big geographical areas comprising several cell towers).
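Because each event only ties a device to a serving antenna, a rough position estimate is simply that antenna's location. The sketch below uses an invented antenna list and a flat-Earth distance for illustration.

```python
# Sketch: approximating a device's position as the nearest antenna that
# could serve its event. The antenna coordinates and the planar distance
# are simplifications for illustration.
import math

def nearest_antenna(event_lon, event_lat, antennas):
    """antennas: {cell_id: (lon, lat)} -> id of the closest cell."""
    return min(
        antennas,
        key=lambda cid: math.hypot(antennas[cid][0] - event_lon,
                                   antennas[cid][1] - event_lat),
    )

antennas = {"cellA": (23.70, 37.98), "cellB": (23.75, 37.95)}
print(nearest_antenna(23.71, 37.97, antennas))
```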
In doctor-patient conversations, identifying medically relevant information is crucial, posing the need for conversation summarization. In this work, we first propose a deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. Our system could enhance user experience from a business standpoint, while also reducing computational costs from a technical perspective. Second, we present VietMed-Sum which, to our knowledge, is the first speech summarization dataset for medical conversations. Third, we are the first to utilize LLMs and human annotators collaboratively to create gold-standard and synthetic summaries for medical conversation summarization.
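The local/global summary schedule described above can be sketched as windowing the utterance stream. The `summarize` function below is a placeholder for the actual summarization model.

```python
# Sketch of the local/global summary schedule: a local summary after
# every N utterances, plus one global summary at the end. `summarize`
# is a stand-in, not the system's actual model.

def summarize(utterances):
    return " | ".join(utterances)  # placeholder for an LLM summarizer

def stream_summaries(utterances, n=2):
    local = [summarize(utterances[i:i + n])
             for i in range(0, len(utterances), n)]
    return local, summarize(utterances)

local, global_summary = stream_summaries(["u1", "u2", "u3", "u4", "u5"], n=2)
print(len(local))  # ceil(5 / 2) local summaries, plus one global summary
```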
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Multiturn Multimodal
We want to generate synthetic data that is able to understand position and relationships across multiple images and multiple audio clips; an example is shown below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal
multi-images
synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main
Example data
{'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data-providing participants' incentive to cooperate. This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.
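The idea of computing global statistics from local sufficient statistics can be illustrated with mean and variance: each site ships only (count, sum, sum of squares), yet the aggregate matches a centralized computation exactly. This is a generic sketch of the principle, not the paper's specific algorithm.

```python
# Sketch: exact global mean/variance from per-site sufficient statistics
# (n, sum, sum of squares) -- the kind of aggregation that lets a
# distributed detector match its centralized counterpart.

def local_stats(xs):
    return len(xs), sum(xs), sum(x * x for x in xs)

def global_mean_var(stats):
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    sq = sum(s[2] for s in stats)
    mean = total / n
    var = sq / n - mean * mean  # population variance of the pooled data
    return mean, var

site_a, site_b = [1.0, 2.0, 3.0], [4.0, 5.0]
mean, var = global_mean_var([local_stats(site_a), local_stats(site_b)])
print(mean, var)  # identical to stats computed on the pooled data
```

A z-score anomaly threshold computed from this global mean and variance is then the same whether the data was pooled or left distributed.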
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The synthetic NCD dataset, based on Athens pilot cases, is a simulated yet realistic resource that can support the testing of AI and analytics algorithms for crime resolution.
The first version of the dataset (SAEX1) was generated to simulate terminal movements within a 2x2 km area over a four-hour window on July 28, 2023, from 08:00 to 12:00. It contains over 6.9 million signaling events across 32,485 terminals, with each terminal averaging 213 events. Each event logs a terminal's location, timestamp, and its proximity to a serving cell, along with the cell coordinates. To align with the investigation of a crime scene, the dataset was filtered to include only those individuals whose movement patterns intersected a predefined bounding box around Kerameikos, Athens, Greece, during the specified timeframe. This filtering process resulted in a refined dataset of trajectory data for 30,703 individuals. Figure 1 shows the final generated data for the SAEX1 dataset.
Figure 1: Initial SAEX Dataset.
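The bounding-box filter described for SAEX1 keeps only terminals whose trajectory enters the area of interest. The coordinates and box below are illustrative, not the actual Kerameikos bounds.

```python
# Sketch of the bounding-box filter: keep terminal ids whose events fall
# inside the area of interest. Box and event coordinates are invented
# for illustration.

def in_box(lon, lat, box):
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def filter_terminals(events, box):
    """events: iterable of (terminal_id, lon, lat) -> ids seen inside the box."""
    return {tid for tid, lon, lat in events if in_box(lon, lat, box)}

box = (23.70, 37.97, 23.72, 37.99)
events = [("t1", 23.71, 37.98), ("t2", 23.80, 37.90), ("t1", 23.69, 37.96)]
print(sorted(filter_terminals(events, box)))
```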
As part of the TRACY project, the implemented algorithms are being evaluated using real pilot cases provided by LEA Partners. This evaluation revealed the need for an update to the initial version of the simulated data, as the cases included additional information beyond the cell data. Specifically, they also featured Call Detail Records (CDRs), which encompass call logs, SMS messages, and mobile data. To address this, the initial data was enriched with these CDR details.
To accomplish this, a custom CDR Simulator was developed. The simulator generates random pairs of terminals that engage in actions (Calls or SMSs), along with the corresponding timeframes during which these actions occur. Additionally, the simulator selects specific terminals to assign browsing activities. Finally, the generated data is enriched with information about the first and last serving cells, ensuring the creation of a realistic synthetic case dataset.
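A minimal sketch of such a CDR simulator, in the spirit of the description above: random terminal pairs, an action type, and a timestamp inside the simulation window. Field names and parameters are illustrative, not those of the actual TRACY simulator.

```python
# Sketch of a CDR simulator: random terminal pairs engaging in CALL/SMS
# actions at random times within the simulation window. All field names
# are hypothetical.
import random

def simulate_cdrs(terminals, n_records, t_start, t_end, seed=0):
    rng = random.Random(seed)
    records = []
    for _ in range(n_records):
        src, dst = rng.sample(terminals, 2)  # two distinct terminals
        records.append({
            "src": src,
            "dst": dst,
            "action": rng.choice(["CALL", "SMS"]),
            "t": rng.randint(t_start, t_end),  # seconds into the window
        })
    return records

cdrs = simulate_cdrs(["t1", "t2", "t3", "t4"], 5, 0, 14400)  # 4-hour window
print(len(cdrs))
```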
Finally, since the real case data is provided in multiple files per provider, the same structure has been adopted in the simulated dataset. It is important to note that each file contains cell information only for terminals that are customers of the respective provider. For all other terminals, the cell information is left blank. This approach reflects the structure and content of the real cases and enables the use of the same data pipeline process developed within the TRACY project.
SAEX2 employs the same simulation settings as SAEX1. However, it involves a single criminal (ground truth) who follows a predetermined route within a specified timeframe, thereby enabling detection by the TRACY algorithm. Table 2 shows the dataset’s main characteristics.
Metric | Value
Simulation Area Width x Height (km) | 2x2
Simulation Start-time | 2023-07-28 08:00:00
Simulation End-time | 2023-07-28 12:00:00
Simulation Duration (minutes) | 240
Number of Terminals | 30692
Number of Events | 1469515
Number of Produced CSVs | 6 (2 per provider)
Table 2: SAEX2 Vital Statistics.
In the absence of real data for experimentation, having a synthetic yet realistic NCD dataset is crucial for developing and evaluating crime resolution techniques.
Open-source data has been used during the development of this synthetic dataset; more specifically, open-source data from OpenCellid, as well as open government data from the Hellenic Statistical Authority and the Antenna Construction Information Portal developed by the Hellenic Telecommunications and Post Commission (EETT).
TRACY is funded under DIGITAL-2022-DEPLOY-02-LAW-SECURITY-AI (GA: 101102641)