90 datasets found
  1. synthetic-tool-calls-v2

    • huggingface.co
    Updated Mar 15, 2024
    Cite
    Brian Fitzgerald (2024). synthetic-tool-calls-v2 [Dataset]. https://huggingface.co/datasets/roborovski/synthetic-tool-calls-v2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2024
    Authors
    Brian Fitzgerald
    Description

    The roborovski/synthetic-tool-calls-v2 dataset is hosted on Hugging Face and contributed by the HF Datasets community.
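    Because the dataset is hosted on the Hugging Face Hub, it can be loaded with the datasets library; a minimal sketch (the available splits and columns are best checked on the dataset page rather than assumed):

    ```python
    from datasets import load_dataset

    # Pulls the dataset straight from the Hugging Face Hub (requires `pip install datasets`).
    ds = load_dataset("roborovski/synthetic-tool-calls-v2")
    print(ds)                          # shows the available splits, row counts, and column names

    first_split = list(ds.keys())[0]   # typically "train", but checked rather than assumed
    print(ds[first_split][0])          # inspect one example record
    ```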

  2. I-BiDaaS - TID - Synthetic Call Centre Data

    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Jordi Luque Serrano; Ioannis Arapakis (2024). I-BiDaaS - TID - Synthetic Call Centre Data [Dataset]. http://doi.org/10.5281/zenodo.4274454
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jordi Luque Serrano; Ioannis Arapakis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a simulated dataset based on both real phone interactions and conversations typically handled by real call centres. It comprises several simulated customer interactions with an agent representative, with both roles performed by actors. Phone call recordings were made using different mobile and landline devices. The scripting, for both customer and agent, aims to reproduce typical scenarios in telco-oriented call centre operations. Both raw waveform recordings and speech transcriptions are provided, the latter obtained with an automatic speech recognition (ASR) prototype developed by TID. Word segmentation timestamps are provided for the recognized words, and a confidence score is provided on a per-token basis.

  3. I-BiDaaS - TID - Synthetic Call Centre Data

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Jordi Luque Serrano (2024). I-BiDaaS - TID - Synthetic Call Centre Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4274453
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Ioannis Arapakis
    Jordi Luque Serrano
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a simulated dataset based on both real phone interactions and conversations typically handled by real call centres. It comprises several simulated customer interactions with an agent representative, with both roles performed by actors. Phone call recordings were made using different mobile and landline devices. The scripting, for both customer and agent, aims to reproduce typical scenarios in telco-oriented call centre operations. Both raw waveform recordings and speech transcriptions are provided, the latter obtained with an automatic speech recognition (ASR) prototype developed by TID. Word segmentation timestamps are provided for the recognized words, and a confidence score is provided on a per-token basis.

  4. synthetic-tool-calls-v2-dpo-pairs

    • huggingface.co
    Cite
    Brian Fitzgerald, synthetic-tool-calls-v2-dpo-pairs [Dataset]. https://huggingface.co/datasets/roborovski/synthetic-tool-calls-v2-dpo-pairs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Brian Fitzgerald
    Description

    The roborovski/synthetic-tool-calls-v2-dpo-pairs dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  5. Experimental testing of detection, localization, and call density of...

    • datasets.ai
    • catalog.ogopendata.com
    • +2more
    Updated Aug 8, 2024
    Cite
    National Oceanic and Atmospheric Administration, Department of Commerce (2024). Experimental testing of detection, localization, and call density of synthetic sounds using Navy surplus sonobuoys. [Dataset]. https://datasets.ai/datasets/experimental-testing-of-detection-localization-and-call-density-of-synthetic-sounds-using-navy-
    Explore at:
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    National Oceanic and Atmospheric Administration, Department of Commerce
    Description

    A series of synthesized sounds were played back from multiple locations around and within a grid of four DIFAR sonobuoys. Four type 53F DIFAR sonobuoys with attached SPOT GPS devices were deployed in a square with ~1 nmi (2 km) between each sonobuoy. Sound sources include periodic deployment of weighted light bulbs (producing an impulsive sound when the bulb implodes at depth) and synthetic tonal sounds broadcast through an underwater speaker. A time-synchronized multi-channel recording was made of the sounds received on the four sonobuoys. These experimental data have been used for multiple studies, including testing methods to estimate sonobuoy drift, testing novel methods for detection and localization of sonobuoy signals, and testing Acoustic Spatial Capture-Recapture (ASCR) methods for estimating call density. We encourage use of these methods for additional research and development.

  6. Rule-based Synthetic Data for Japanese GEC

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    tsv
    Updated Oct 28, 2023
    Cite
    (2023). Rule-based Synthetic Data for Japanese GEC [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7679
    Explore at:
    Available download formats: tsv
    Dataset updated
    Oct 28, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Title: Rule-based Synthetic Data for Japanese GEC. Dataset Contents: This dataset contains two parallel corpora intended for the training and evaluation of models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows: Synthetic Corpus - synthesized_data.tsv. This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab; the first sentence is the erroneous sentence, while the second is the corresponding correction. These paired sentences are derived from data scraped from the keyword-lookup site
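    Since the synthetic corpus is a plain tab-separated file (erroneous sentence, tab, corrected sentence), it can be read with the standard library alone; a minimal sketch, with the local file path assumed:

    ```python
    import csv

    # synthesized_data.tsv: one "erroneous<TAB>corrected" sentence pair per line.
    # The path is an assumption; point it at wherever the corpus was downloaded.
    pairs = []
    with open("synthesized_data.tsv", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 2:          # skip any malformed lines defensively
                pairs.append((row[0], row[1]))

    print(len(pairs), "sentence pairs loaded")   # the description states 2,179,130
    print(pairs[0])
    ```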

  7. Android System call Dataset

    • kaggle.com
    zip
    Updated Jun 11, 2025
    Cite
    Akarsh nair (2025). Android System call Dataset [Dataset]. https://www.kaggle.com/datasets/akarshnair/android-system-call-dataset
    Explore at:
    Available download formats: zip (435042128 bytes)
    Dataset updated
    Jun 11, 2025
    Authors
    Akarsh nair
    Description

    Title: System Call Traces from Real and Synthetic Sources

    Description: This dataset comprises a collection of system call traces collected across various devices and environments. It includes both real-world system call sequences (captured from actual Android operating systems) and synthetically generated sequences designed to simulate realistic system behavior.

    The data is structured to support a range of use cases, including:

    Intrusion detection systems
    Anomaly detection
    Behavioral profiling of applications

    The dataset is ideal for training and evaluating machine learning models that require low-level OS interaction data. By including both real and synthetic traces, it allows for balanced experimentation in controlled and uncontrolled conditions.

    Features:

    Real system call traces from multiple devices
    Synthetic traces designed to mimic real patterns
    Labelled for supervised learning tasks (where applicable)
    Suitable for time-series, classification, or sequence modeling (see the sketch below)

    Intended Use: This dataset can be used in academic research, cybersecurity benchmarking, and development of intelligent system call analysis tools.
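    As a sketch of the sequence modeling mentioned above, the snippet below turns system call name sequences into n-gram count features for a simple classifier. The trace format here (a space-separated sequence of call names plus a 0/1 label) is an assumption for illustration only; adapt the loading step to the actual layout of the archive.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical toy traces: each entry is a system call sequence plus a label
    # (0 = benign, 1 = anomalous). Replace with sequences parsed from the dataset.
    traces = [
        "openat read read write close",
        "openat mmap mprotect execve socket connect",
    ]
    labels = [0, 1]

    # A bag of syscall bigrams/trigrams is a common, simple baseline for trace classification.
    vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 3))
    X = vectorizer.fit_transform(traces)

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(vectorizer.transform(["openat read write close"])))
    ```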

  8. BirdVox-scaper-10k: a synthetic dataset for multilabel species...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Mendoza, Elizabeth (2020). BirdVox-scaper-10k: a synthetic dataset for multilabel species classification of flight calls from 10-second audio recordings [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2560772
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Lostanlen, Vincent
    Farnsworth, Andrew
    Salamon, Justin
    Kelling, Steve
    Bello, Juan Pablo
    Mendoza, Elizabeth
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BirdVox-scaper-10k: a synthetic dataset for multilabel species classification of flight calls from 10-second audio recordings

    Version 1.0, September 2019.

    Created By

    Elizabeth Mendoza (1), Vincent Lostanlen (2, 3, 4), Justin Salamon (3, 4), Andrew Farnsworth (2), Steve Kelling (2), and Juan Pablo Bello (3, 4).

    (1): Forest Hills High School, New York, NY, USA
    (2): Cornell Lab of Ornithology, Cornell University, Ithaca, NY, USA
    (3): Center for Urban Science and Progress, New York University, New York, NY, USA
    (4): Music and Audio Research Lab, New York University, New York, NY, USA

    https://wp.nyu.edu/birdvox

    Description

    The BirdVox-scaper-10k dataset contains 9983 artificial soundscapes. Each soundscape lasts exactly ten seconds and contains one or several avian flight calls from up to 30 different species of New World warblers (Parulidae). Alongside each audio file, we include an annotation file describing the start time and end time of each flight call in the corresponding soundscape, as well as the species of warbler it belongs to.

    In order to synthesize soundscapes in BirdVox-scaper-10k, we mixed natural sounds from various pre-recorded sources. First, we extracted isolated recordings of flight calls containing little or no background noise from the CLO-43SD dataset [1]. Secondly, we extracted 10-second "empty" acoustic scenes from the BirdVox-DCASE-20k dataset [2]. These acoustic scenes contain various sources of real-world background noise, including biophony (insects) and anthropophony (vehicles), yet are guaranteed to be devoid of any flight calls. Lastly, we "fill" each acoustic scene by mixing it with flight calls sampled at random.
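    The mixing procedure described above is essentially what the scaper soundscape synthesis library automates. The sketch below illustrates that general recipe only; the folder layout, number of events per scene, and SNR range are assumptions for illustration, not the exact parameters used to build BirdVox-scaper-10k.

    ```python
    import scaper

    # Assumed layout: isolated flight calls under foreground_calls/<species_label>/,
    # "empty" 10-second scenes under background_scenes/<label>/.
    sc = scaper.Scaper(duration=10.0, fg_path="foreground_calls", bg_path="background_scenes")
    sc.ref_db = -20

    # One background scene chosen at random, used in full.
    sc.add_background(label=("choose", []), source_file=("choose", []), source_time=("const", 0))

    # A few flight calls placed at random times with randomized SNR (values are illustrative).
    for _ in range(3):
        sc.add_event(
            label=("choose", []),          # pick a species at random
            source_file=("choose", []),
            source_time=("const", 0),
            event_time=("uniform", 0, 9),
            event_duration=("const", 0.5),
            snr=("uniform", 6, 24),
            pitch_shift=None,
            time_stretch=None,
        )

    # Writes the mixed audio plus a JAMS annotation with per-event start/end times and labels.
    sc.generate("soundscape_0001.wav", "soundscape_0001.jams")
    ```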

    Although the BirdVox-scaper-10k does not consist of natural recordings, we have taken several measures to ensure the plausibility of each synthesized soundscape, both from qualitative and quantitative standpoints.

    The BirdVox-scaper-10k dataset can be used, among other things, for the research, development, and testing of bioacoustic classification models.

    For details on the hardware of ROBIN recording units, we refer the reader to [2].

    [1] J. Salamon, J. Bello. Fusing shallow and deep learning for bioacoustic bird species classification. Proc. IEEE ICASSP, 2017.

    [2] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection. Proc. IEEE ICASSP, 2018.

    [3] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.

    @inproceedings{lostanlen2018icassp,
      title = {BirdVox-full-night: a dataset and benchmark for avian flight call detection},
      author = {Lostanlen, Vincent and Salamon, Justin and Farnsworth, Andrew and Kelling, Steve and Bello, Juan Pablo},
      booktitle = {Proc. IEEE ICASSP},
      year = {2018},
      published = {IEEE},
      venue = {Calgary, Canada},
      month = {April},
    }

  9. Synthetic data for GAZEL-ADN blood-saliva comparison

    • entrepot.recherche.data.gouv.fr
    tsv, vcf
    Updated Mar 13, 2023
    Cite
    Anthony Herzig (2023). Synthetic data for GAZEL-ADN blood-saliva comparison [Dataset]. http://doi.org/10.57745/MFIXFW
    Explore at:
    Available download formats: tsv (6740), vcf (8269859015)
    Dataset updated
    Mar 13, 2023
    Dataset provided by
    Recherche Data Gouv
    Authors
    Anthony Herzig
    License

    Etalab Open License 2.0 (https://spdx.org/licenses/etalab-2.0.html)

    Dataset funded by
    French Ministry of Research PFMG2025
    Description

    This study sets out to establish the suitability of saliva-based whole-genome sequencing (WGS) through a comparison against blood-based WGS. To fully appraise the observed differences, we developed a novel technique of pseudo-replication. We also investigated the potential of characterizing individual salivary microbiomes from non-human DNA fragments found in saliva. We observed that the majority of discordant genotype calls between blood and saliva fell into known regions of the human genome that are typically sequenced with low confidence and could be identified by quality control measures. Pseudo-replication demonstrated that the levels of discordance between blood- and saliva-derived WGS data were entirely similar to what one would expect between technical replicates if an individual's blood or saliva had been sequenced twice. Finally, we successfully sequenced salivary microbiomes in parallel to human genomes, as demonstrated by a comparison against the Human Microbiome Project. A synthetic data set has been generated that allows the replication of our principal results without a full disclosure of individual-level sequencing data. Read counts and relative abundances for the microbiome profiling analyses are similarly available.

  10. multi-agent-scam-conversation

    • huggingface.co
    Updated Jul 21, 2024
    Cite
    Pitipat Gumphusiri (2024). multi-agent-scam-conversation [Dataset]. https://huggingface.co/datasets/BothBosu/multi-agent-scam-conversation
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2024
    Authors
    Pitipat Gumphusiri
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Synthetic Multi-Turn Scam and Non-Scam Phone Conversation Dataset with Agentic Personalities

      Dataset Description
    

    The Synthetic Multi-Turn Scam and Non-Scam Phone Dialogue Dataset with Agentic Personalities is an enhanced collection of simulated phone conversations between two AI agents, one acting as a scammer or non-scammer and the other as an innocent receiver. Each dialogue is labeled as either a scam or non-scam interaction. This dataset is designed to help develop… See the full description on the dataset page: https://huggingface.co/datasets/BothBosu/multi-agent-scam-conversation.

  11. Synthetic AIS Dataset of Vessel Proximity Events

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Ilias Chamatidis (2024). Synthetic AIS Dataset of Vessel Proximity Events [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8358664
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Giannis Spiliopoulos
    Ilias Chamatidis
    Konstantina Bereta
    Georgios Grigoropoulos
    Manolis Kaliorakis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Automatic Identification System (AIS) allows vessels to share identification, characteristics, and location data through self-reporting. This information is periodically broadcast and can be received by other vessels with AIS transceivers, as well as ground or satellite sensors. Since the International Maritime Organisation (IMO) mandated AIS for vessels above 300 gross tonnage, extensive datasets have emerged, becoming a valuable resource for maritime intelligence.

    Maritime collisions occur when two vessels collide or when a vessel collides with a floating or stationary object, such as an iceberg. Maritime collisions hold significant importance in the realm of marine accidents for several reasons:

    Injuries and fatalities of vessel crew members and passengers.

    Environmental effects, especially in cases involving large tanker ships and oil spills.

    Direct and indirect economic losses on local communities near the accident area.

    Adverse financial consequences for ship owners, insurance companies and cargo owners including vessel loss and penalties.

    As sea routes become more congested and vessel speeds increase, the likelihood of significant accidents during a ship's operational life rises. The increasing congestion on sea lanes elevates the probability of accidents and especially collisions between vessels.

    The development of solutions and models for the analysis, early detection and mitigation of vessel collision events is a significant step towards ensuring future maritime safety. In this context, a synthetic vessel proximity event dataset is created using real vessel AIS messages. The synthetic dataset of trajectories with reconstructed timestamps is generated so that a pair of trajectories reach their intersection point simultaneously, simulating an unintended proximity event (a collision close call). The dataset aims to provide a basis for the development of methods for the detection and mitigation of maritime collisions and proximity events, as well as the study and training of vessel crews in simulator environments.

    The dataset consists of 4658 samples/AIS messages of 213 unique vessels from the Aegean Sea. The steps that were followed to create the collision dataset are:

    Given 2 vessels X (vessel_id1) and Y (vessel_id2) with their current known location (LATITUDE [lat], LONGITUDE [lon]):

    Check if the trajectories of vessels X and Y are spatially intersecting.

    If the trajectories of vessels X and Y intersect, temporally align vessel Y's timestamp at the intersection point with vessel X's timestamp at that point. The temporal alignment is performed so that the spatial intersection (nearest proximity point) occurs at the same time for both vessels.

    Additionally, for each vessel pair the timestamp of the proximity event differs from that of any later proximity event, so that different vessel trajectory pairs do not overlap temporally.

    Two CSV files are provided. vessel_positions.csv contains the AIS positions of all vessels (vessel_id, t, lon, lat, heading, course, speed). Simulated_vessel_proximity_events.csv contains the id, position and timestamp of each identified proximity event along with the vessel_id numbers of the associated vessels. The total number of unintended proximity events in the dataset is 237. Examples of unintended vessel proximity events are visualized in the accompanying png and gif files.
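    A minimal sketch of loading the two CSV files with pandas, assuming the column names listed above and that the events file identifies the paired vessels as vessel_id1 and vessel_id2 (as in the procedure described earlier):

    ```python
    import pandas as pd

    # Column names follow the description above; adjust if the delivered files differ.
    positions = pd.read_csv("vessel_positions.csv")
    events = pd.read_csv("Simulated_vessel_proximity_events.csv")

    print(len(events), "proximity events")   # the description states 237 in total

    # Trajectories of the two vessels involved in the first proximity event.
    ev = events.iloc[0]
    pair = positions[positions["vessel_id"].isin([ev["vessel_id1"], ev["vessel_id2"]])]
    print(pair.sort_values("t").head())
    ```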

    The research leading to these results has received funding from the European Union's Horizon Europe Programme under the CREXDATA Project, grant agreement n° 101092749.

  12. soda_synthetic_dialogue

    • huggingface.co
    Updated Feb 28, 2023
    Cite
    Jeffrey Quesnelle (2023). soda_synthetic_dialogue [Dataset]. https://huggingface.co/datasets/emozilla/soda_synthetic_dialogue
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2023
    Authors
    Jeffrey Quesnelle
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for 🥤SODA Synthetic Dialogue

      Dataset Summary
    

    🥤SODA Synthetic Dialogue is a set of synthetic dialogues between Assistant and User. In each conversation, User asks Assistant to perform summarization or story generation tasks based on a snippet of an existing dialogue, story, or from a title or theme. This data was created by synthesizing the dialogues in 🥤Soda and applying a set of templates to generate the conversation. The original research paper can be… See the full description on the dataset page: https://huggingface.co/datasets/emozilla/soda_synthetic_dialogue.

  13. WangchanThaiInstruct Multi-turn Conversation Dataset

    • zenodo.org
    • huggingface.co
    bin
    Updated Jul 30, 2024
    Cite
    Sirapatch Thammaleelakul; Wannaphong Phatthiyaphaibun (2024). WangchanThaiInstruct Multi-turn Conversation Dataset [Dataset]. http://doi.org/10.5281/zenodo.13132633
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sirapatch Thammaleelakul; Wannaphong Phatthiyaphaibun
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    WangchanThaiInstruct Multi-turn Conversation Dataset

    We created a Thai multi-turn conversation dataset from airesearch/WangchanThaiInstruct (Batch 1) using an LLM. It was generated synthetically with an open-source LLM for the Thai language.

  14. CVD Risk Prediction Synthetic Dataset

    • figshare.com
    pdf
    Updated Sep 25, 2017
    Cite
    Ted Laderas; David Dorr; Nicole Vasilevsky; Shannon McWeeney; Melissa Haendel; Bjorn Pederson (2017). CVD Risk Prediction Synthetic Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.5439991.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 25, 2017
    Dataset provided by
    figshare
    Authors
    Ted Laderas; David Dorr; Nicole Vasilevsky; Shannon McWeeney; Melissa Haendel; Bjorn Pederson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a synthetic dataset to teach students about using clinical and genetic covariates to predict cardiovascular risk in a realistic (but synthetic) dataset. For the workshop materials, please go here: https://github.com/laderast/cvdNight1

    Contents:
    1) dataDictionary.pdf - pdf file describing all covariates in the synthetic dataset.
    2) fullPatientData.csv - csv file with multiple covariates.
    3) genoData.csv - subset of patients in fullPatientData.csv with additional SNP calls.
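    A minimal sketch of combining the clinical and genetic tables with pandas, assuming the two files share a patient identifier column; the column name patid below is a guess, and dataDictionary.pdf documents the real covariate names:

    ```python
    import pandas as pd

    clinical = pd.read_csv("fullPatientData.csv")
    geno = pd.read_csv("genoData.csv")          # subset of patients with SNP calls

    # "patid" is a hypothetical join key; replace it with the identifier
    # documented in dataDictionary.pdf.
    merged = clinical.merge(geno, on="patid", how="left")
    print(merged.shape)
    print(merged.columns.tolist())
    ```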

  15. AI Training Data | Audio Data | Unique Consumer Sentiment Data: Recordings of...

    • datarade.ai
    .mp3, .wav
    Updated Dec 8, 2023
    Cite
    WiserBrand.com (2023). AI Training Data | Audio Data| Unique Consumer Sentiment Data: Recordings of the calls between consumers and companies [Dataset]. https://datarade.ai/data-products/ai-training-data-audio-data-unique-consumer-sentiment-data-wiserbrand-com
    Explore at:
    Available download formats: .mp3, .wav
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    WiserBrand.com
    Area covered
    United States of America
    Description

    WiserBrand offers a unique dataset of real consumer-to-business phone conversations. These high-quality audio recordings capture authentic interactions between consumers and support agents across industries. Unlike synthetic data or scripted samples, our dataset reflects natural speech patterns, emotion, intent, and real-world phrasing — making it ideal for:

    Training ASR (Automatic Speech Recognition) systems

    Improving voice assistants and LLM audio understanding

    Enhancing call center AI tools (e.g., sentiment analysis, intent detection)

    Benchmarking conversational AI performance with real-world noise and context

    We ensure strict data privacy: all personally identifiable information (PII) is removed before delivery. Recordings are produced on demand and can be tailored by vertical (e.g., telecom, finance, e-commerce) or use case.

    Whether you're building next-gen voice technology or need realistic conversational datasets to test models, this dataset provides what synthetic corpora lack — realism, variation, and authenticity.

  16. I-BiDaaS - TID - Synthetic Mobility Data

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Nov 15, 2020
    Cite
    Ioannis Arapakis; Jordi Luque Serrano (2020). I-BiDaaS - TID - Synthetic Mobility Data [Dataset]. http://doi.org/10.5281/zenodo.4274458
    Explore at:
    Dataset updated
    Nov 15, 2020
    Authors
    Ioannis Arapakis; Jordi Luque Serrano
    Description

    This is a synthetic data stream based on real-time cell network events. These events are picked up by the antennas closest to the mobile phone, thus providing an approximate location of the device. Every transaction of a mobile phone generates one of these events. A transaction can be, for instance, placing or receiving a call, sending or receiving an SMS, requesting a specific URL in the mobile phone browser, or sending a text message or data transaction from/to any mobile phone app. There are also some synchronization events, for instance turning the mobile phone on or off, or switching between location area networks (relatively large geographical areas comprising several cell towers).

  17. VietMed-Sum Dataset

    • paperswithcode.com
    Updated Jun 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Khai Le-Duc; Khai-Nguyen Nguyen; Long Vo-Dang; Truong-Son Hy (2024). VietMed-Sum Dataset [Dataset]. https://paperswithcode.com/dataset/vietmed-sum
    Explore at:
    Dataset updated
    Jun 21, 2024
    Authors
    Khai Le-Duc; Khai-Nguyen Nguyen; Long Vo-Dang; Truong-Son Hy
    Description

    In doctor-patient conversations, identifying medically relevant information is crucial, posing the need for conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. Our system could enhance user experience from a business standpoint, while also reducing computational costs from a technical perspective. Secondly, we present VietMed-Sum which, to our knowledge, is the first speech summarization dataset for medical conversations. Thirdly, we are the first to utilize LLM and human annotators collaboratively to create gold standard and synthetic summaries for medical conversation summarization.
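    A minimal sketch of the local/global summarization loop described above (generic wiring only, not the authors' deployed system; summarize stands in for whatever summarization model is used):

    ```python
    from typing import Callable, List, Tuple

    def rolling_summaries(utterances: List[str],
                          summarize: Callable[[str], str],
                          n: int = 5) -> Tuple[List[str], str]:
        """Emit a local summary after every n utterances and a global summary at the end."""
        local_summaries, buffer = [], []
        for utt in utterances:
            buffer.append(utt)
            if len(buffer) == n:                 # local summary every N utterances
                local_summaries.append(summarize(" ".join(buffer)))
                buffer = []
        if buffer:                               # summarize any trailing utterances
            local_summaries.append(summarize(" ".join(buffer)))
        global_summary = summarize(" ".join(utterances))   # global summary at the end
        return local_summaries, global_summary

    # Toy usage with a stand-in "model" that just truncates text.
    local_sums, global_sum = rolling_summaries(
        [f"utterance {i}" for i in range(12)],
        summarize=lambda text: text[:40] + "...",
        n=5,
    )
    print(len(local_sums), "local summaries;", "global:", global_sum)
    ```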

  18. synthetic-multiturn-multimodal

    • huggingface.co
    Updated Jan 28, 2024
    Cite
    Mesolitica (2024). synthetic-multiturn-multimodal [Dataset]. https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2024
    Dataset authored and provided by
    Mesolitica
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Multiturn Multimodal

    We want to generate synthetic data that is able to capture the position of and relationships between multiple images and multiple audio clips; an example is given below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal

      multi-images
    

    synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main

      Example data
    

    {'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.

  19. Solving a prisoner's dilemma in distributed anomaly detection

    • catalog.data.gov
    • datasets.ai
    • +5more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Solving a prisoner's dilemma in distributed anomaly detection [Dataset]. https://catalog.data.gov/dataset/solving-a-prisoners-dilemma-in-distributed-anomaly-detection
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data-providing participants' incentive to cooperate. This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.
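    The core idea of computing global statistics from local sufficient statistics can be illustrated with a simple mean/variance example; this is only the aggregation principle the framework relies on, not the paper's actual algorithm:

    ```python
    import numpy as np

    def local_sufficient_stats(x: np.ndarray):
        """Each site shares only (count, sum, sum of squares), never its raw data."""
        return len(x), float(x.sum()), float((x ** 2).sum())

    def global_mean_var(stats):
        """Combine per-site sufficient statistics into the exact global mean and variance."""
        n = sum(s[0] for s in stats)
        total = sum(s[1] for s in stats)
        total_sq = sum(s[2] for s in stats)
        mean = total / n
        var = total_sq / n - mean ** 2
        return mean, var

    # Three "sites" holding private data; the combined result matches a centralized computation.
    rng = np.random.default_rng(0)
    sites = [rng.normal(size=100), rng.normal(loc=1.0, size=250), rng.normal(scale=2.0, size=50)]
    mean, var = global_mean_var([local_sufficient_stats(x) for x in sites])

    pooled = np.concatenate(sites)
    print(mean, var)
    print(pooled.mean(), pooled.var())   # identical up to floating-point error
    ```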

  20. A Synthetic NCD based on Athens pilot cases

    • zenodo.org
    Updated Jun 13, 2025
    Cite
    Ioannis Fourfouris; Michaela Antonopoulou; Manolis Tsangaris; Pradeep Rangappa; Dairazalia Sanchez-Cortes; Petr Motlicek (2025). A Synthetic NCD based on Athens pilot cases [Dataset]. http://doi.org/10.5281/zenodo.15585355
    Explore at:
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ioannis Fourfouris; Michaela Antonopoulou; Manolis Tsangaris; Pradeep Rangappa; Dairazalia Sanchez-Cortes; Petr Motlicek
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Athens
    Description

    The synthetic NCD dataset, based on Athens pilot cases, is a simulated yet realistic resource that can support the testing of AI and analytics algorithms for crime resolution.

    The first version of the dataset (SAEX1) was generated to simulate terminal movements within a 2x2 km area over a four-hour window on July 28, 2023, from 08:00 to 12:00. It contains over 6.9 million signaling events across 32,485 terminals, with each terminal averaging 213 events. Each event logs a terminal's location, timestamp, and its proximity to a serving cell, along with the cell coordinates. To align with the investigation of a crime scene, the dataset was filtered to include only those individuals whose movement patterns intersected a predefined bounding box around Kerameikos, Athens, Greece, during the specified timeframe. This filtering process resulted in a refined dataset of trajectory data for 30,703 individuals. Figure 1 illustrates a visual representation of the final generated data for the SAEX1 dataset.



    Figure 1: Initial SAEX Dataset.

    As part of the TRACY project, the implemented algorithms are being evaluated using real pilot cases, provided by LEA Partners. This evaluation revealed the need for an update to the initial version of the simulated data, as the cases included additional information beyond the cell data. Specifically, they also featured Call Detail Records (CDRs), which encompassed call logs, SMS messages and mobile data. To address this, the initial data was enriched with the following details:

    • Provider Information: Identifying whether a terminal is a customer of one of the Greek telecommunications providers (Cosmote, Vodafone or Nova).
    • Actions: Indicating whether a terminal made or received a call, sent or received an SMS or engaged in internet browsing activities.

    To accomplish this, a custom CDR Simulator was developed. The simulator generates random pairs of terminals that engage in actions (Calls or SMSs), along with the corresponding timeframes during which these actions occur. Additionally, the simulator selects specific terminals to assign browsing activities. Finally, the generated data is enriched with information about the first and last serving cells, ensuring the creation of a realistic synthetic case dataset.
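    As a rough illustration of what such a CDR simulator does (this is not the TRACY implementation; every file name, field name, and identifier below is invented for the example), the sketch pairs random terminals, assigns each pair a call or SMS action with a timestamp inside the simulation window, and writes the result to a CSV:

    ```python
    import csv
    import random
    from datetime import datetime, timedelta

    random.seed(42)

    terminals = [f"T{i:05d}" for i in range(1000)]       # hypothetical terminal ids
    providers = ["Cosmote", "Vodafone", "Nova"]
    start = datetime(2023, 7, 28, 8, 0, 0)               # simulation window from the description
    window = timedelta(hours=4)

    rows = []
    for _ in range(5000):
        a, b = random.sample(terminals, 2)               # originating / receiving terminal
        rows.append({
            "terminal_a": a,
            "terminal_b": b,
            "provider_a": random.choice(providers),
            "action": random.choice(["CALL", "SMS"]),
            "timestamp": (start + timedelta(seconds=random.randint(0, int(window.total_seconds())))).isoformat(sep=" "),
        })

    with open("simulated_cdr.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    ```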

    Finally, since the real case data is provided in multiple files per provider, the same structure has been adopted in the simulated dataset. It is important to note that each file contains cell information only for terminals that are customers of the respective provider. For all other terminals, the cell information is left blank. This approach reflects the structure and content of the real cases and enables the use of the same data pipeline process developed within the TRACY project.

    SAEX2 employs the same simulation settings as SAEX1. However, it involves a single criminal (ground truth) who follows a predetermined route within a specified timeframe, thereby enabling detection by the TRACY algorithm. Table 2 shows the dataset’s main characteristics.

    Metric                              | Value
    Simulation Area Width x Height (km) | 2x2
    Simulation Start-time               | 2023-07-28 08:00:00
    Simulation End-time                 | 2023-07-28 12:00:00
    Simulation Duration (minutes)       | 240
    Number of Terminals                 | 30692
    Number of Events                    | 1469515
    Number of Produced CSVs             | 6 (2 per provider)

    Table 2: SAEX2 Vital Statistics.

    In the absence of real data for experimentation, having a synthetic yet realistic NCD dataset is crucial for developing and evaluating crime resolution techniques.

    Open-source data has been used during the development of this synthetic dataset; more specifically, open-source data from OpenCellid, as well as open government data from the Hellenic Statistical Authority and the Antenna Construction Information Portal developed by the Hellenic Telecommunications and Post Commission (EETT).

    TRACY is funded under DIGITAL-2022-DEPLOY-02-LAW-SECURITY-AI (GA: 101102641)
