100+ datasets found
  1. Data from: Shaping Multilingual Access through Respeaking Technology,...

    • openresearch.surrey.ac.uk
    • beta.ukdataservice.ac.uk
    Updated Jan 25, 2024
    Cite
    Elena Davitti; Anna-Stiina Wallinheimo (2024). Shaping Multilingual Access through Respeaking Technology, Project Data, 2021 [Dataset]. https://openresearch.surrey.ac.uk/esploro/outputs/dataset/Shaping-Multilingual-Access-through-Respeaking-Technology/99862365802346
    Dataset updated
    Jan 25, 2024
    Dataset provided by
    UK Data Service
    Authors
    Elena Davitti; Anna-Stiina Wallinheimo
    Time period covered
    2024
    Dataset funded by
    Economic and Social Research Council (United Kingdom, Swindon) - ESRC
    Description

    The recent global surge in audiovisual content has emphasized the importance of accessibility for wider audiences. The SMART project addressed this by exploring interlingual respeaking, a novel practice combining speech recognition technology with human interpreting and subtitling skills to produce real-time, high-quality speech-to-text services across languages. This method evolved from intralingual respeaking, which is widely used in broadcasting to create live subtitles for the deaf and hard-of-hearing. Interlingual respeaking, which involves translating live content into another language and subtitling it, could revolutionize subtitle production for foreign-language content, overcoming sensory and language barriers.

    Interlingual respeaking is defined as a type of simultaneous interpreting, producing text with minimal delay. It involves two shifts: interlingual (from one language to another) and intermodal (from spoken to written). This practice combines the challenges of simultaneous interpreting with the requirements of subtitling. Respeakers must accurately convey messages in another language to a speech recognition system, adding punctuation and making real-time edits for clarity and readability. This method leverages speech recognition technology and human translation skills to ensure efficient and high-quality translated subtitles.

    Interlingual respeaking offers immense potential for making multilingual content accessible to international and hearing-impaired audiences. It's particularly relevant for television, conferences, and live events. However, research into its feasibility, accuracy, and the skills required for language professionals is still in its early stages.

    The SMART project aimed to address these research gaps. It focused on the cognitive and interpersonal profiles needed for successful interlingual respeaking. The project extended a pilot study, including language professionals from interpreting, subtitling, translation, and intralingual respeaking, to explore how cognitive and interpersonal factors influence learning and performance in this field.

    The SMART project's main goals were to study interlingual respeaking's complexity, focusing on the acquisition and implementation of relevant skills, and the accuracy of the final subtitles. The research involved 23 postgraduate students with backgrounds in interpreting, subtitling, and intralingual respeaking.

    The research program examined three areas: process, product, and upskilling. It sought to understand the variables contributing to language professionals' performance, challenges faced during performance, and how performance can be sustained. Regarding the product, it aimed to identify factors affecting the accuracy of interlingual respeaking and the impact of various individual and content characteristics on accuracy. For upskilling, the focus was on the challenges and strengths of the training course.

    Key findings included the importance of working memory in predicting high performance and the enhancement of certain cognitive abilities through training. Interpersonal traits like conscientiousness and integrated regulation were also examined. In terms of product accuracy, the average was 95.37%, with omissions being the strongest negative predictor of accuracy. High performers outperformed low performers across all scenarios.

    The upskilling course was innovative, focusing on modular training and combining intralingual and interlingual practices. It addressed real-world challenges and was tailored to different professional backgrounds. The approach proved effective, with 82% of participants finding the course met their expectations and 86% acknowledging its challenging nature. The study confirmed the benefits of a modular and personalized training approach, highlighting the need for flexibility and adaptability to different skill levels and backgrounds.

  2. Multilingual Medical Text Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Multilingual Medical Text Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/64f1a101-d243-4290-a4fc-af738f8ba252
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Healthcare Providers & Services Utilization
    Description

    This dataset provides a curated collection of accurate and reliable medical translation data. It is an invaluable resource designed for medical professionals, researchers, and language experts. The data encompasses a wide array of medical topics, including diagnoses, treatment plans, clinical research findings, and pharmaceutical information [1]. It supports various languages spoken across the globe, facilitating cross-cultural comparisons and analysis. Each translation has been meticulously crafted by professional translators with specialist knowledge in the medical domain to ensure authenticity and fidelity to the original source text [1]. This dataset aims to improve understanding and communication within the healthcare sector globally, enhancing accessibility to vital medical information regardless of language barriers and ensuring precision in patient care [1].

    Columns

    • translation: Contains the original text in a specific language that requires translation [1].
    • translation: Contains the translated text in another language [1].
      • Note: The dataset contains 13,149 unique entries across these translation columns [2].

    Distribution

    The data is provided in a CSV file format (specifically, train.csv) [1]. The dataset contains 13,149 records [2].
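
Because both columns are listed under the same name ("translation"), a reader that keys rows by header name would keep only one of them. The following is a minimal sketch of reading such a file by column position instead. The two sample rows and their languages are hypothetical stand-ins, not taken from train.csv, and the helper name `read_pairs` is an assumption for illustration.

```python
import csv
import io

# Hypothetical two-row sample; the real train.csv is not reproduced here.
SAMPLE = (
    "translation,translation\n"
    '"The patient has a fever.","El paciente tiene fiebre."\n'
    '"Take one tablet daily.","Tome una tableta al día."\n'
)

def read_pairs(fileobj):
    """Return (source, target) tuples, reading columns by position."""
    reader = csv.reader(fileobj)
    next(reader, None)  # skip the header row
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

pairs = read_pairs(io.StringIO(SAMPLE))
record_count = len(pairs)

# DictReader keys each row by header name, so the second "translation"
# column overwrites the first and only one value per row survives.
dict_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
```

On the actual file, `record_count` would be expected to match the 13,149 records stated above, assuming one record per row.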

    Usage

    This dataset offers various ideal applications and use cases:

    • Natural Language Processing (NLP) Research: Suitable for training and evaluating NLP models specifically for medical translation tasks, aiding in the development of new algorithms and techniques to enhance accuracy and efficiency [1].
    • Machine Learning in Healthcare: Can be used to train machine learning algorithms for automatic translation of medical documents or text, thereby speeding up processes and providing healthcare professionals with timely access to essential information [1].
    • Development of Medical Translation Applications: Its accurate translations are beneficial for creating mobile or web-based applications that offer instant translation services for healthcare providers, patients, or anyone seeking reliable medical content translations [1].
    • Enhanced Global Communication: Supports improved communication with patients who speak different languages and facilitates the accurate transfer of vital medical information across borders [1].

    Coverage

    The dataset covers various languages spoken worldwide, enabling cross-cultural analysis and supporting global healthcare communications among diverse populations [1]. The region of coverage is Global [3].

    License

    CC0

    Who Can Use It

    • Medical Professionals: To enhance communication with patients speaking different languages or facilitate transfer of medical information [1].
    • Researchers: For training machine learning models to automate medical translation or conducting linguistic analyses [1].
    • Language Experts: As a reliable source of accurate medical translations [1].
    • Healthcare Providers: To improve patient care and understanding [1].
    • Individuals: Seeking accurate and reliable translations of medical content [1].

    Dataset Name Suggestions

    • Global Medical Translations
    • Accurate Healthcare Language Data
    • Clinical Translation Corpus
    • Multilingual Medical Text Dataset
    • Healthcare Communication Translations

    Attributes

    Original Data Source: Accurate Medical Translation Data

  3. Multilingual Machine Translation Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 24, 2025
    Cite
    Data Insights Market (2025). Multilingual Machine Translation Report [Dataset]. https://www.datainsightsmarket.com/reports/multilingual-machine-translation-531184
    Available download formats
    pdf, ppt, doc
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The multilingual machine translation (MMT) market is experiencing robust growth, driven by the increasing need for seamless global communication across diverse linguistic landscapes. The market, estimated at $15 billion in 2025, is projected to expand at a Compound Annual Growth Rate (CAGR) of 18% between 2025 and 2033, reaching approximately $50 billion by 2033. This expansion is fueled by several key factors. Firstly, the rising globalization of businesses necessitates efficient and cost-effective translation solutions for expanding into new markets. Secondly, technological advancements in neural machine translation (NMT) are significantly improving the accuracy and fluency of translated text, making MMT increasingly viable for various applications. The increasing availability of multilingual data sets further fuels NMT's progress, leading to more nuanced and contextually appropriate translations. Finally, the growing adoption of MMT across diverse sectors, including global communication, literary translation, technical documentation, and administrative processes, contributes to the market's rapid growth. However, the market faces certain restraints. Accuracy remains a challenge, particularly for complex or nuanced language, requiring ongoing refinement of NMT algorithms. Data bias and the preservation of cultural context in translation are also ongoing concerns. Furthermore, the security and privacy of sensitive data translated using MMT platforms require robust security protocols and regulations. Despite these challenges, the MMT market's positive trajectory is expected to continue, driven by continuous technological innovation, growing demand from various industries, and the ongoing expansion of global communication. 
The market segmentation, encompassing various application types (global communication, literary, professional, technical, administrative translation) and translation methodologies (rule-based, statistical, neural, hybrid), provides diverse opportunities for market players. The competition among established players like Google Translate, DeepL, and newer entrants like ChatGPT signifies a dynamic and innovative market landscape.
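
The growth figures in this description can be checked with ordinary compound-growth arithmetic. The sketch below is illustrative only: the function name `project` is an assumption, and the report does not state whether it counts seven or eight compounding periods between 2025 and 2033, so both are computed.

```python
def project(value, rate, years):
    """Compound `value` at annual `rate` for `years` periods."""
    return value * (1 + rate) ** years

start_usd_bn = 15.0  # 2025 estimate quoted in the description
cagr = 0.18          # growth rate quoted in the description

after_7 = project(start_usd_bn, cagr, 7)  # roughly 47.8
after_8 = project(start_usd_bn, cagr, 8)  # roughly 56.4
```

The quoted figure of approximately $50 billion falls between the two results, so it is broadly consistent with the stated rate under either convention.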

  4. Replication Data for: Attending multiple languages: the relation between...

    • dataverse.nl
    Updated Nov 8, 2022
    Cite
    Saskia Nijmeijer (2022). Replication Data for: Attending multiple languages: the relation between individual multilingual language use and attentional control [Dataset]. http://doi.org/10.34894/TMF6J8
    Available download formats
    csv(24445), csv(4214), application/matlab-mat(141003075)
    Dataset updated
    Nov 8, 2022
    Dataset provided by
    DataverseNL
    Authors
    Saskia Nijmeijer
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Individuals speaking multiple languages have been asserted to have a cognitive advantage, perhaps specifically in the domain of selective attention, although this claim has recently been challenged. The diversity of multilingual experiences and use seems of great importance here, and suggestions have been made that advantages especially emerge for individuals with higher ‘multilingual load’, referring to language experience and use factors including duration of multilingualism, number of languages mastered, and use of multiple languages in daily life. We captured multilingual language diversity using a language entropy measure, which combines several language use factors into one metric. We related individual differences in language entropy to selective attention as measured with an attentional blink (AB) task in 53 diverse multilingual individuals. During task performance, brain activity in the lateral prefrontal cortex was measured using fNIRS. We found no support for the claim that language diversity, or other individual factors related to language experience and use, influence AB magnitude. However, relations with T1 identification accuracy were observed and brain activity in the DLPFC during the attentional blink task also related to higher language diversity, jointly suggesting that language diversity may promote alertness and attention. This study is the first to relate simultaneous behavioral and brain attentional blink data to the language entropy measure.

    This is a dataset of 55 multilingual students, all of whom were enrolled in the English track of the psychology undergraduate degree program of the University of Groningen in the Netherlands. The dataset contains demographic information, and data on language use, experience, background and self-rated proficiency (assessed using a slightly modified version of the German LEAP-Q). Furthermore, there is data on language switching behavior, assessed using the Bilingual Switching Questionnaire. In addition to the self-reported language proficiency collected by means of the LEAP-Q questionnaires, objective language proficiency was assessed using the LexTALE language proficiency test. Participants performed an attentional blink (AB) task as a measure of selective attention. In addition to questionnaire and task data, brain activity during performance of the AB task was measured using Functional Near-Infrared Spectroscopy (fNIRS). fNIRS is a non-invasive technique that measures the level of oxygenated and deoxygenated hemoglobin in the cerebral blood flow.

  5. Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data|...

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data| Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Available download formats
    .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Puerto Rico, Pakistan, Taiwan, Chile, Poland, France, Luxembourg, Italy, Uruguay, Japan
    Description
    1. Specifications

    Format : 16kHz, 16bit, uncompressed wav, mono channel

    Recording environment : quiet indoor environment, without echo

    Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

    Speaker : native speaker, gender balance

    Device : Android mobile phone, iPhone

    Language : 100+ languages

    Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios : speech recognition, voiceprint recognition
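
A buyer may want to confirm that delivered clips match the stated format (16 kHz, 16-bit, mono, uncompressed WAV). The following is a minimal sketch using Python's standard `wave` module. The helper names and the generated silent clip are hypothetical stand-ins, since no sample file is included in this listing.

```python
import io
import wave

# Values taken from the listing's "Format" line: 16 kHz, 16-bit, mono.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 16000}

def wav_properties(fileobj):
    """Read channel count, sample width and frame rate from a WAV stream."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }

def make_silent_wav(seconds=1):
    """Build an in-memory WAV of silence (hypothetical stand-in for a real clip)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000 * seconds)
    buf.seek(0)
    return buf

props = wav_properties(make_silent_wav())
matches_spec = props == EXPECTED
```

For a real delivery, `wav_properties` could be called on each file opened in binary mode and compared against `EXPECTED` in the same way.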

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade

  6. Multilingual Spoken Words Corpus - MLCommons Association

    • console.cloud.google.com
    Updated Jul 10, 2022
    + more versions
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&inv=1&invt=Ab2sOw (2022). Multilingual Spoken Words Corpus - MLCommons Association [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/mswc
    Dataset updated
    Jul 10, 2022
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. It was generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset. Please see the paper for a detailed analysis of the contents of the data and methods for detecting potential outliers, along with baseline accuracy metrics on keyword spotting models trained from the dataset compared to models trained on a manually-recorded keyword dataset. The dataset was released by the MLCommons Association; latest information at mlcommons.org/words. This public dataset is hosted in Google Cloud Storage and available free to use. Use this quick start guide to learn how to access public datasets on Google Cloud Storage.

  7. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 21, 2023
    Cite
    Cizinsky, Ludek (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10413067
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Cizinsky, Ludek
    Nutter, Peter
    Senghaas, Mika
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

    Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

    Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

    Dataset Composition:

    curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

    curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    Fine-tuning and advancing Homepage2Vec or similar website classification models

    Research on LLM-generated datasets for text classification tasks

    Exploration of multilingual website classification

    Additional Information:

    Project and report repository: https://github.com/CS-433/ml-project-2-mlp

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  8. Multilingual Speech Analytics Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 28, 2025
    Cite
    Growth Market Reports (2025). Multilingual Speech Analytics Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/multilingual-speech-analytics-market
    Available download formats
    pdf, csv, pptx
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Multilingual Speech Analytics Market Outlook

    According to our latest research, the global multilingual speech analytics market size stood at USD 2.45 billion in 2024, reflecting robust adoption across diverse sectors. The market is projected to reach USD 10.92 billion by 2033, expanding at a remarkable CAGR of 18.1% during the forecast period. This growth is primarily driven by the escalating demand for advanced analytics solutions that enable organizations to extract actionable insights from voice data in multiple languages, thereby enhancing customer experience and operational efficiency.
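
The stated growth rate can be cross-checked from the two market-size figures. This is a minimal sketch, assuming nine compounding periods (2024 to 2033). The function name `implied_cagr` is chosen here for illustration.

```python
def implied_cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` periods."""
    return (end / start) ** (1 / years) - 1

# USD billions, as quoted: 2.45 in 2024 and 10.92 by 2033.
rate = implied_cagr(2.45, 10.92, 2033 - 2024)
rate_percent = round(rate * 100, 1)
```

Under that assumption the implied rate rounds to the 18.1% quoted in the paragraph above.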

    Several factors are fueling the rapid expansion of the multilingual speech analytics market. The proliferation of global businesses and the increasing need to cater to a linguistically diverse customer base have compelled organizations to invest in sophisticated speech analytics solutions. Companies are recognizing that understanding customer sentiment, intent, and feedback across various languages is critical for delivering personalized services and maintaining a competitive edge. Furthermore, the surge in omnichannel communication and the exponential growth of contact centers worldwide have intensified the requirement for real-time multilingual analytics, driving further market growth.

    Technological advancements play a pivotal role in the market’s trajectory. The integration of artificial intelligence, machine learning, and natural language processing into speech analytics platforms has significantly improved the accuracy and efficiency of multilingual transcription and sentiment analysis. These innovations have enabled businesses to automate complex processes, reduce manual intervention, and gain deeper insights from unstructured voice data. Additionally, the increasing adoption of cloud-based solutions has made these analytics tools more accessible and scalable for organizations of all sizes, fostering widespread market adoption.

    Another crucial growth factor is the rising emphasis on regulatory compliance and risk management across industries. Sectors such as BFSI, healthcare, and government are under mounting pressure to monitor and analyze customer interactions for compliance with data privacy laws and industry standards. Multilingual speech analytics solutions empower these organizations to detect fraudulent activities, ensure adherence to regulations, and mitigate risks by providing comprehensive analysis across multiple languages. This capability not only enhances security but also builds trust with clients and stakeholders, further bolstering market demand.

    From a regional perspective, North America currently dominates the multilingual speech analytics market owing to its advanced technological infrastructure and high concentration of global enterprises. However, Asia Pacific is emerging as a lucrative market, driven by the rapid digital transformation of businesses, expanding contact center operations, and the region’s vast linguistic diversity. Europe and the Middle East & Africa are also witnessing steady adoption, propelled by the growing focus on customer experience and regulatory compliance. The interplay of these regional dynamics is shaping a vibrant and competitive landscape for multilingual speech analytics worldwide.

    Component Analysis

    The component segment of the multilingual speech analytics market comprises software and services, both of which play integral roles in delivering comprehensive analytics solutions. The software sub-segment dominates the market, accounting for the majority of the revenue share in 2024. This dominance is attributed to the increasing sophistication of speech analytics platforms, which leverage advanced algorithms and machine learning to transcribe, analyze, and interpret voice data in real time. These software solutions are continuously evolving to support a broader range of languages and dialects, thereby expanding their applicability across global enterprises. The demand for user-friendly interfaces and seamless integration with existing CRM and contact center sy

  9. Parallel Corpus Data | 200 Million Pairs | Machine Translation Data |...

    • datarade.ai
    Updated Jan 29, 2024
    + more versions
    Cite
    Nexdata (2024). Parallel Corpus Data | 200 Million Pairs | Machine Translation Data | Natural Language Processing Data | Translation Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-parallel-corpus-data-200-million-pai-nexdata
    Available download formats
    .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 29, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Peru, Bosnia and Herzegovina, Saudi Arabia, South Africa, Israel, Russian Federation, Mexico, Switzerland, Colombia, Philippines
    Description
    1. Overview

    Off-the-shelf parallel corpus data (Translation Data) covers many fields including spoken language, traveling, medical treatment, news, and finance. Data cleaning, desensitization, and quality inspection have been carried out.

    2. Specifications

    Storage format : TXT

    Data content : Parallel Corpus Data

    Data size : 200 million pairs

    Language : 20 languages

    Application scenario : machine translation

    Accuracy rate : 90%

    3. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go Translation Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/nlu?source=Datarade

  10. NLP Multilingual Social Media - 60,000 categorized phrases in 6 languages...

    • opendatabay.com
    Updated May 25, 2025
    Cite
    AILinguaTech (2025). NLP Multilingual Social Media - 60,000 categorized phrases in 6 languages (English, Spanish, French, German, Chinese, Arabic) [Dataset]. https://www.opendatabay.com/data/dataset/39e13c0f-079b-457f-a8f0-76c98b4d27c9
    Dataset updated
    May 25, 2025
    Dataset authored and provided by
    AILinguaTech
    Area covered
    French, Social Media and Networking
    Description

    Multilingual Social Media Phrase Corpus

    • Size: 60,000 categorized phrases
    • Source: Social media platforms (e.g., Twitter, Reddit, Instagram)
    • Languages: 6 (English, Spanish, French, German, Chinese, Arabic)
    • Format: Structured text (Excel, CSV) with phrases categorized by field

    Data Composition

    • Phrases: Short text snippets (tweets, comments, captions) from public social media posts
    • Categories: Topic(s) (e.g., technology, health, politics)

    Collection Methodology

    • Period: Data spans 2020–2025
    • Process: Scraped via manual extraction with compliance to platform terms; anonymized to remove personal identifiers
    • Quality Control: Manual and automated checks for relevance, accuracy, and category consistency

    Key Features

    • Multilingual: Covers 6 major languages, enabling cross-lingual NLP applications
    • Scale: Large volume supports robust model training
    • Diversity: Varied platforms and user bases ensure broad representation
    • Categorized: Pre-labeled by topic, reducing preprocessing needs

    Applications

    • Sentiment Analysis: Gauge public opinion across languages
    • Trend Detection: Identify emerging topics or market shifts
    • Customer Insights: Analyze feedback for brand monitoring
    • Chatbot Training: Enhance multilingual conversational AI
    • Cross-Lingual Research: Study linguistic patterns or cultural differences

    Technical Details

    • Storage: Available in Excel (6.97 MB), CSV (600 MB)
    • Access: Download; available on a license basis
    • Schema:
      • phrase: Text content (string)
      • language: Language denoted by tab
      • category: Topic (string)

    Sample Entry: { "phrase": "Love the new phone update!","category": "phone/general", "Notes": "Phrase stating love for a new phone update"}
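
The sample entry and the listed schema do not use exactly the same field names, which is worth checking before building a loader. The sketch below parses the sample entry as JSON and compares its keys with the three schema fields. The variable names are assumptions for illustration.

```python
import json

# Field names from the "Schema" list in Technical Details.
SCHEMA_FIELDS = {"phrase", "language", "category"}

# The sample entry, copied as published.
SAMPLE_ENTRY = (
    '{ "phrase": "Love the new phone update!",'
    '"category": "phone/general", '
    '"Notes": "Phrase stating love for a new phone update"}'
)

entry = json.loads(SAMPLE_ENTRY)
missing = SCHEMA_FIELDS - set(entry)  # schema fields absent from the sample
extra = set(entry) - SCHEMA_FIELDS    # sample fields absent from the schema
```

Here the sample carries a "Notes" field that the schema does not list and has no "language" field. That may be because the schema describes language as "denoted by tab", though the listing does not say so explicitly.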

    Use Cases

    • Marketing: Track brand sentiment globally
    • Product Development: Identify user pain points from feedback
    • Research: Study social media trends across cultures
    • AI Development: Train NLP models for multilingual applications
    • AI Research: Creation or further development of models for research purposes

    Limitations

    • Bias: Reflects social media demographics, may skew younger
    • Noise: Some phrases may contain slang or errors
    • Coverage: Limited to 6 languages; other languages underrepresented
    • Privacy: Anonymized, but public post origins may limit sensitivity

    Getting Started

    Access: Access according to owner (AILinguaTech LLC) or via listed marketplace’s terms and conditions

  11. Multilingual Transcription Services Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 23, 2025
    Cite
    Data Insights Market (2025). Multilingual Transcription Services Report [Dataset]. https://www.datainsightsmarket.com/reports/multilingual-transcription-services-522981
    Available download formats
    doc, pdf, ppt
    Dataset updated
    May 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global multilingual transcription services market is experiencing robust growth, driven by the increasing demand for multilingual content across various sectors. The rising globalization of businesses, coupled with the expanding need for accessibility and inclusivity, fuels this demand. Industries like media and entertainment, legal, healthcare, and education are significant contributors, requiring accurate and timely transcriptions in multiple languages. Technological advancements, such as the development of sophisticated speech-to-text software and AI-powered translation tools, are further accelerating market expansion. While challenges exist, such as ensuring high accuracy across diverse languages and dialects and managing data privacy concerns, the overall market outlook remains positive. The description does not give a measured growth rate; it assumes a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, described as a reasonable estimate for a rapidly growing tech-enabled service sector, which would indicate substantial market expansion. This growth is fueled by continuous innovation in AI-powered transcription and translation, and a growing need for real-time multilingual communication. The market is segmented based on language pairs, service types (e.g., on-demand, project-based), industry verticals, and geographic regions. Key players such as Language Scientific, JR Language, and others are actively competing through strategic partnerships, technological advancements, and global expansion initiatives. The competitive landscape is dynamic, with both established players and emerging startups striving to offer superior quality, speed, and cost-effectiveness. Regional variations in market penetration exist, with North America and Europe currently leading, but developing economies in Asia and Latin America present significant untapped potential. The continued emphasis on accurate and efficient multilingual transcription services will solidify this market's trajectory and position it for long-term success.
Maintaining data security and accuracy will be critical for continued growth and market confidence.

  12. QAmeleon Dataset

    • kaggle.com
    Updated Aug 13, 2023
    Cite
    Awsaf (2023). QAmeleon Dataset [Dataset]. https://www.kaggle.com/awsaf49/qameleon-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Awsaf
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    QAmeleon introduces synthetic multilingual QA data in 8 languages, generated using PaLM-540B, a large language model. This dataset was generated by prompt tuning PaLM with only five examples per language. We use the synthetic data to finetune downstream QA models, leading to improved accuracy in comparison to English-only and translation-based baselines.

    This dataset contains a total of 47173 question-answer instances across 8 languages; the count per language follows.

    Source

    Link: https://github.com/google-research-datasets/QAmeleon

    Citation

    @misc{agrawal2022qameleon,
       title={QAmeleon: Multilingual QA with Only 5 Examples}, 
       author={Priyanka Agrawal and Chris Alberti and Fantine Huot and Joshua Maynez and Ji Ma and Sebastian Ruder and Kuzman Ganchev and Dipanjan Das and Mirella Lapata},
       year={2022},
       eprint={2211.08264},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    
  13. Language Detection API Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 23, 2025
    Cite
    Data Insights Market (2025). Language Detection API Report [Dataset]. https://www.datainsightsmarket.com/reports/language-detection-api-1965815
    Explore at:
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Language Detection API market is experiencing robust growth, driven by the increasing need for globalized communication and the surge in multilingual content across various sectors. The market, estimated at $1.5 billion in 2025, is projected to maintain a healthy Compound Annual Growth Rate (CAGR) of 20% from 2025 to 2033, reaching approximately $7 billion by 2033. This expansion is fueled by several key factors. Firstly, the rise of e-commerce and digital marketing necessitates accurate and efficient language identification for targeted advertising and personalized user experiences. Secondly, the increasing volume of unstructured data generated across social media, customer service interactions, and other online platforms requires sophisticated language detection capabilities for effective analysis and processing. Thirdly, advancements in Natural Language Processing (NLP) and machine learning are continually improving the accuracy and speed of language detection APIs, making them more accessible and cost-effective. The growing adoption of cloud-based solutions further contributes to market growth, as businesses can leverage these services without significant upfront investments. Major players such as AWS, Google Cloud, Microsoft Azure, and IBM Watson dominate the market, offering comprehensive and reliable language detection services. However, several smaller, specialized providers are also emerging, focusing on niche applications or specific language sets. Competitive pressures are pushing innovation, leading to the development of more accurate, faster, and cost-effective solutions. While data privacy and security concerns pose potential restraints, the market's overall growth trajectory remains positive, primarily driven by the increasing demand for multilingual applications across various industries, including healthcare, finance, and customer service. 
Future growth will likely be influenced by factors such as advancements in multilingual NLP, the increasing adoption of AI-powered solutions, and evolving global regulatory landscapes around data privacy.
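The CAGR projection quoted above can be checked with a short calculation. This is a sketch that assumes 8 annual compounding periods between 2025 and 2033; the function names are illustrative and not part of the report. Under that assumption, $1.5 billion at 20% compounds to roughly $6.45 billion, somewhat below the "approximately $7 billion" stated in the description.

```python
def project_market_size(base_value, cagr, years):
    """Compound a base value forward: base * (1 + cagr) ** years."""
    return base_value * (1.0 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Invert the compounding formula to recover the annual growth rate."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the description: $1.5 billion in 2025, 20% CAGR through 2033.
# Treating 2025 -> 2033 as 8 compounding periods is an assumption.
projected_2033 = project_market_size(1.5, 0.20, 2033 - 2025)
print(f"Projected 2033 size: ${projected_2033:.2f} billion")
```

The same two functions apply to the other market reports in this listing, since each states a base-year size, a CAGR, and an end-year figure.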

  14. NLP Multilingual Social Media - 10,000 categorized phrases in Chinese

    • opendatabay.com
    Updated May 30, 2025
    Cite
    AILinguaTech (2025). NLP Multilingual Social Media - 10,000 categorized phrases in Chinese [Dataset]. https://www.opendatabay.com/data/ai-ml/044de711-6fa9-421d-8f83-353d8d73ecf8
    Explore at:
    Dataset updated
    May 30, 2025
    Dataset authored and provided by
    AILinguaTech
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    Multilingual Social Media Phrase Corpus

    • Size: 10,000 categorized phrases
    • Source: Social media platforms (e.g., Twitter, Reddit, Instagram)
    • Languages: 1 (Chinese)
    • Format: Structured text (Excel, CSV) with phrases categorized by field

    Data Composition

    • Phrases: Short text snippets (tweets, comments, captions) from public social media posts
    • Categories: Topic(s) (e.g., technology, health, politics)

    Collection Methodology

    • Period: Data spans 2020–2025
    • Process: Scraped via manual extraction in compliance with platform terms; anonymized to remove personal identifiers
    • Quality Control: Manual and automated checks for relevance, accuracy, and category consistency

    Key Features

    • Multilingual: Covers 1 major language, enabling cross-lingual NLP applications
    • Scale: Large volume supports robust model training
    • Diversity: Varied platforms and user bases ensure broad representation
    • Categorized: Pre-labeled by topic, reducing preprocessing needs

    Applications

    • Sentiment Analysis: Gauge public opinion across languages
    • Trend Detection: Identify emerging topics or market shifts
    • Customer Insights: Analyze feedback for brand monitoring
    • Chatbot Training: Enhance multilingual conversational AI
    • Cross-Lingual Research: Study linguistic patterns or cultural differences

    Technical Details

    • Storage: Available in Excel (768 KB)
    • Access: Download; available on a license basis
    • Schema:
      • phrase: Text content (string)
      • language: Language denoted by tab
      • category: Topic (string)

    Sample Entry: {"phrase": "Love the new phone update!", "category": "phone/general", "Notes": "Phrase stating love for a new phone update"}

    Use Cases

    • Marketing: Track brand sentiment globally
    • Product Development: Identify user pain points from feedback
    • Research: Study social media trends across cultures
    • AI Development: Train NLP models for multilingual applications
    • AI Research: Creation or further development of models for research purposes

    Limitations

    • Bias: Reflects social media demographics, may skew younger
    • Noise: Some phrases may contain slang or errors
    • Coverage: Limited to 1 language; other languages underrepresented
    • Privacy: Anonymized, but public post origins may limit sensitivity

    Getting Started

    Access: Access according to owner (AILinguaTech LLC) or via listed marketplace’s terms and conditions

  15. Multilingual Machine Translation Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Multilingual Machine Translation Report [Dataset]. https://www.marketreportanalytics.com/reports/multilingual-machine-translation-52682
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The multilingual machine translation (MMT) market is experiencing robust growth, driven by the increasing demand for cross-lingual communication across various sectors. The global market, estimated at $15 billion in 2025, is projected to expand at a compound annual growth rate (CAGR) of 20% through 2033, reaching approximately $60 billion. This surge is fueled by several key factors. Firstly, the globalization of businesses necessitates efficient and cost-effective translation solutions for international expansion. Secondly, the rise of e-commerce and digital content creation necessitates seamless cross-lingual communication with diverse customer bases. Thirdly, advancements in neural machine translation (NMT) are leading to significant improvements in translation accuracy and fluency, making MMT more accessible and reliable. Finally, the increasing availability of multilingual datasets is further fueling the development of more sophisticated and accurate translation models. The market is segmented by application (global communication, literary translation, professional translation, technical translation, and administrative translation) and by type (rule-based, statistical, neural, and hybrid machine translation). North America and Europe currently dominate the market share, but Asia-Pacific is anticipated to witness significant growth due to its expanding digital economy and growing adoption of multilingual technologies. While the market exhibits strong growth potential, challenges remain. These include addressing the complexities of handling nuanced linguistic features, particularly in low-resource languages. Ensuring data privacy and security, especially for sensitive business and personal information being translated, is another key concern. Furthermore, achieving perfect translation accuracy across all languages and contexts remains a significant ongoing challenge. 
Overcoming these limitations through continued research and development in NMT and hybrid approaches, coupled with addressing ethical considerations surrounding data usage and bias within algorithms, will be crucial for sustained market growth and broader adoption of MMT solutions. Competition among existing established players and new entrants such as ChatGPT is intensifying, further driving innovation and improving the quality and affordability of MMT services.

  16. Multilingual NER Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Multilingual NER Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/multilingual-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Multilingual NER Dataset

    Multilingual NER Dataset for Named Entity Recognition

    By Babelscape (From Huggingface) [source]

    About this dataset

    The Babelscape/wikineural NER Dataset is a comprehensive and diverse collection of multilingual text data specifically designed for the task of Named Entity Recognition (NER). It offers an extensive range of labeled sentences in nine different languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

    Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

    This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

    By using this dataset for NER tasks, users can enhance their understanding of how named entities are recognized across different languages. Furthermore, it enables benchmarking performance comparisons between various NER models developed for specific languages or trained on multiple languages simultaneously.

    Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset provides a highly informative and versatile resource that can contribute to advancements in natural language processing and information extraction applications on a global scale.

    How to use the dataset

    • Understand the Data Structure:

      • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
      • Each sentence is represented by three columns: tokens, ner_tags, and lang.
      • The tokens column contains the individual words or characters in each labeled sentence.
      • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
      • The lang column specifies the language of each sentence.
    • Explore Different Languages:

      • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
      • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
    • Preprocessing and Cleaning:

      • Before training your NER models or applying any NLP techniques to this dataset, it's essential to preprocess and clean the data.
      • Consider removing any unnecessary punctuation marks or special characters unless they carry significant meaning in certain languages.
    • Training Named Entity Recognition Models:

      • Data Splitting: Divide the dataset into training, validation, and testing sets based on your requirements using appropriate ratios.
      • Feature Extraction: Prepare input features from tokenized text data, such as word embeddings or character-level representations, depending on your model choice.
      • Model Training: Utilize state-of-the-art NER models (e.g., LSTM-CRF, Transformer-based models) to train on the labeled sentences and ner_tags columns.
      • Evaluation: Evaluate your trained model's performance using the provided validation dataset or test datasets specific to each language.

    • Applying Pretrained Models:

      • Instead of training a model from scratch, you can leverage existing pretrained models like BERT or GPT-2, or spaCy's named entity recognition capabilities.
      • Fine-tune these pre-trained models on your specific NER task using the labeled
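The data structure and splitting steps in the list above can be sketched in plain Python. The rows below are invented examples in the documented three-column layout (`tokens`, `ner_tags`, `lang`); the integer tag values are placeholders rather than the dataset's real label map, and the helper names and the 80/10/10 ratio are assumptions.

```python
import random

# Invented rows in the documented layout; tag integers are placeholders.
EN_ROW = {"tokens": ["Paris", "is", "big"], "ner_tags": [1, 0, 0], "lang": "en"}
DE_ROW = {"tokens": ["Berlin", "ist", "groß"], "ner_tags": [1, 0, 0], "lang": "de"}
ROWS = [EN_ROW, DE_ROW] * 5  # 10 rows: 5 English, 5 German


def filter_by_lang(rows, lang):
    """Keep only the sentences written in the given language code."""
    return [row for row in rows if row["lang"] == lang]


def split_rows(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the rows and cut it into train/validation/test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_rows(ROWS)
german = filter_by_lang(ROWS, "de")
print(len(train), len(val), len(test), len(german))
```

Fixing the seed keeps the split reproducible, which matters when comparing models trained on the same data.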

    Research Ideas

    • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
    • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
    • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis...
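The precision, recall, and F1 measures mentioned in the list above can be illustrated with a simplified token-level calculation. This is a sketch only: NER is usually scored at the entity (span) level, and the string tags such as `B-PER` and the function name `token_prf` are assumptions, not the dataset's actual label format.

```python
def token_prf(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 over non-outside tags."""
    pairs = list(zip(gold, pred))
    true_pos = sum(1 for g, p in pairs if g != outside and g == p)
    false_pos = sum(1 for g, p in pairs if p != outside and g != p)
    false_neg = sum(1 for g, p in pairs if g != outside and g != p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: one correct entity token, one missed, one spurious.
gold_tags = ["B-PER", "O", "B-LOC", "O"]
pred_tags = ["B-PER", "O", "O", "B-ORG"]
precision, recall, f1 = token_prf(gold_tags, pred_tags)
print(precision, recall, f1)
```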
  17. XBMU-MC: A Multilingual Parallel Corpus

    • scidb.cn
    Updated May 19, 2025
    Cite
    Yan Qidong; Basangzhuzha; Baimaquzha; Akbar Yimit; Madinam; Mubarak Ablikim; Surina; Aliya; Ma Ning (2025). XBMU-MC: A Multilingual Parallel Corpus [Dataset]. http://doi.org/10.57760/sciencedb.25100
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 19, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Yan Qidong; Basangzhuzha; Baimaquzha; Akbar Yimit; Madinam; Mubarak Ablikim; Surina; Aliya; Ma Ning
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The XBMU-MC Multilingual Parallel Corpus consists of 22,000 high-quality parallel samples covering the Chinese-Tibetan, Chinese-Uyghur, and Chinese-Mongolian low-resource language pairs. Each data sample contains a text pair in the source and target languages, where the source language is Chinese and the target language is Tibetan, Uyghur, or Mongolian. Each sample has a uniform structure with three fields: instruction, input, and output. The instruction field describes the type or requirement of the translation task, the input field contains the original text in the source language, and the output field is the translated text in the target language. To ensure the quality and consistency of the data, each translation pair is manually reviewed and automatically evaluated to ensure alignment accuracy between source and target languages as well as translation accuracy. The dataset covers a wide range of fields such as culture, science and technology, and society, and each translated text contains different expressions and language structures, which helps to enhance the robustness and generalization ability of the model. The dataset is stored in standard JSON format, which is convenient for subsequent task processing, analysis, and model training.
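A minimal sketch of reading records with the three described fields. The description names the fields but not the file layout, so the top-level JSON array, the placeholder strings, and the helper name `load_pairs` are assumptions made for illustration.

```python
import json

# Hypothetical record; placeholder strings stand in for real corpus text.
SAMPLE_JSON = """[
  {"instruction": "Translate the following Chinese text into Tibetan.",
   "input": "<Chinese source text>",
   "output": "<Tibetan translation>"}
]"""

REQUIRED_FIELDS = ("instruction", "input", "output")


def load_pairs(json_text):
    """Parse records and return (source, target) pairs, checking field names."""
    pairs = []
    for record in json.loads(json_text):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        pairs.append((record["input"], record["output"]))
    return pairs


pairs = load_pairs(SAMPLE_JSON)
print(pairs)
```

Validating the field names up front makes malformed records fail early instead of surfacing later during training.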

  18. NLP Multilingual Social Media - 10,000 categorized phrases in Arabic

    • opendatabay.com
    Updated Jun 9, 2025
    Cite
    AILinguaTech (2025). NLP Multilingual Social Media - 10,000 categorized phrases in Arabic [Dataset]. https://www.opendatabay.com/data/ai-ml/c36d116f-5d7e-471c-81bc-9a9e14a2073c
    Explore at:
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    AILinguaTech
    Area covered
    Social Media and Networking
    Description

    Multilingual Social Media Phrase Corpus

    • Size: 10,000 categorized phrases
    • Source: Social media platforms (e.g., Twitter, Reddit, Instagram)
    • Languages: 1 (Arabic)
    • Format: Structured text (Excel, CSV) with phrases categorized by field

    Data Composition

    • Phrases: Short text snippets (tweets, comments, captions) from public social media posts
    • Categories: Topic(s) (e.g., technology, health, politics)

    Collection Methodology

    • Period: Data spans 2020–2025
    • Process: Scraped via manual extraction in compliance with platform terms; anonymized to remove personal identifiers
    • Quality Control: Manual and automated checks for relevance, accuracy, and category consistency

    Key Features

    • Multilingual: Covers 1 major language, enabling cross-lingual NLP applications
    • Scale: Large volume supports robust model training
    • Diversity: Varied platforms and user bases ensure broad representation
    • Categorized: Pre-labeled by topic, reducing preprocessing needs

    Applications

    • Sentiment Analysis: Gauge public opinion across languages
    • Trend Detection: Identify emerging topics or market shifts
    • Customer Insights: Analyze feedback for brand monitoring
    • Chatbot Training: Enhance multilingual conversational AI
    • Cross-Lingual Research: Study linguistic patterns or cultural differences

    Technical Details

    • Storage: Available in Excel (791 KB)
    • Access: Download; available on a license basis
    • Schema:
      • phrase: Text content (string)
      • language: Language denoted by tab
      • category: Topic (string)

    Sample Entry: {"phrase": "Love the new phone update!", "category": "phone/general", "Notes": "Phrase stating love for a new phone update"}

    Use Cases

    • Marketing: Track brand sentiment globally
    • Product Development: Identify user pain points from feedback
    • Research: Study social media trends across cultures
    • AI Development: Train NLP models for multilingual applications
    • AI Research: Creation or further development of models for research purposes

    Limitations

    • Bias: Reflects social media demographics, may skew younger
    • Noise: Some phrases may contain slang or errors
    • Coverage: Limited to 1 language; other languages underrepresented
    • Privacy: Anonymized, but public post origins may limit sensitivity

    Getting Started

    Access: Access according to owner (AILinguaTech LLC) or via listed marketplace’s terms and conditions

  19. Multilingual Offline Translator Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 1, 2025
    Cite
    Data Insights Market (2025). Multilingual Offline Translator Report [Dataset]. https://www.datainsightsmarket.com/reports/multilingual-offline-translator-1319299
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    May 1, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global multilingual offline translator market is experiencing robust growth, driven by increasing international travel, globalization of businesses, and the rising demand for seamless communication across language barriers. The market, estimated at $5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $15 billion by 2033. This growth is fueled by several key factors. Firstly, technological advancements are leading to more accurate and efficient translation algorithms, smaller and more portable devices, and improved user interfaces. Secondly, the increasing adoption of smartphones and other mobile devices provides a readily available platform for these translation tools, expanding their accessibility to a broader user base. Thirdly, the rising demand for multilingual communication in various sectors, including tourism, international trade, education, and healthcare, is significantly driving market expansion. The segment witnessing the highest growth is online sales, owing to the convenience and reach offered by digital platforms. Multi-line scan translators are also gaining popularity due to their ability to handle multiple languages simultaneously. However, challenges like the need for continuous improvement in translation accuracy, offline data limitations, and the potential for high initial investment costs represent restraints to market growth. Despite these restraints, the market presents significant opportunities. Companies like Moaan (Xiaomi), HONOR, Readboy, ROOBO, Youdao Dictionary, Lenovo, iFlytek, and eKamus are actively competing to innovate and capture market share. Regional dominance is expected to be shared, with North America and Asia-Pacific leading the charge, followed by Europe and other regions. Future growth will likely depend on continuous technological innovation, strategic partnerships, and the successful penetration of emerging markets. 
Further development in offline capabilities, leveraging AI and machine learning, will likely be pivotal for sustained market expansion. The focus will be on enhancing accuracy, improving voice recognition, and broadening language support to cater to diverse linguistic needs across the globe.

  20. 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech Recognition Data| Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-8khz-tele-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Argentina, United Arab Emirates, Philippines, Romania, Czech Republic, Vietnam, Poland, Singapore, Netherlands, United States of America
    Description
    1. Specifications

    Format: 8 kHz, 8-bit, u-law/a-law PCM, mono channel

    Environment: quiet indoor environment, without echo

    Recording content: no preset linguistic data; dozens of topics are specified, and the speakers hold dialogues on those topics while the recording is performed

    Demographics: speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

    Annotation: transcription text, speaker identification, gender, and noise symbols

    Device: telephony recording system

    Language: 100+ languages

    Application scenarios: speech recognition; voiceprint recognition

    Accuracy rate: the word accuracy rate is not less than 98%
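The stated audio format implies a rough storage footprint. The sketch below assumes the 15,000 hours from the listing title are stored as raw 8 kHz, 8-bit, mono samples with no container headers, compression, or annotation files; the function name is illustrative.

```python
SAMPLE_RATE_HZ = 8_000   # 8 kHz
BYTES_PER_SAMPLE = 1     # 8-bit u-law/a-law
CHANNELS = 1             # mono
HOURS = 15_000           # from the listing title


def raw_audio_bytes(hours, sample_rate_hz, bytes_per_sample, channels):
    """Size of uncompressed audio: seconds * samples/second * bytes/sample * channels."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


total_bytes = raw_audio_bytes(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
print(f"{total_bytes / 1e9:.0f} GB")
```

Under these assumptions the audio payload alone comes to about 432 GB (decimal), which is useful when planning download and storage.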

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 1 million hours of Audio Data, and 800 TB of Annotated Imagery Data. This ready-to-go Machine Learning (ML) Data supports instant delivery and can quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
