100+ datasets found
  1. turkish-sentiment-analysis-dataset

    • huggingface.co
    Updated Jun 21, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore, I created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  2. Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Available download formats: zip
    Dataset updated
    Apr 12, 2017
    Dataset provided by
    Hacettepe University: http://www.hacettepe.edu.tr/
    Authors
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We selected two of the most popular movie and hotel recommendation websites among those with a high Alexa ranking: “beyazperde.com” for movie reviews and “otelpuan.com” for hotel reviews. The reviews of 5,660 movies were investigated. All 220,000 extracted reviews had already been rated by their authors on a scale of 1 to 5 stars. As most of the reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews rated 1 or 2 stars, so we randomly selected 26,700 of the 130,210 positive reviews rated 4 or 5 stars. Overall, 53,400 movie reviews with an average length of 33 words were selected. The same approach was applied to the hotel reviews, except that these had been rated with numbers between 0 and 100 instead of stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews rated from 0 to 40, we selected 5,800 of the 6,499 positive reviews rated from 80 to 100. The average length of the 11,600 selected positive and negative hotel reviews was 74 words, more than twice that of the movie reviews.
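
    The balancing step described above (keep every negative review, then draw an equally sized random sample of positive reviews) can be sketched as follows. This is a minimal illustration, not the authors' code: the `balance` helper, the `(text, stars)` tuple layout, the toy data, and the fixed random seed are all assumptions. Only the star thresholds (1 to 2 negative, 4 to 5 positive) come from the description.

```python
import random

def balance(reviews, seed=0):
    """Downsample the positive class to the size of the negative class.

    `reviews` is a list of (text, stars) pairs with stars in 1..5.
    Three-star reviews are dropped, since the description only uses
    1-2 stars (negative) and 4-5 stars (positive).
    """
    negatives = [r for r in reviews if r[1] <= 2]
    positives = [r for r in reviews if r[1] >= 4]
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    sampled = rng.sample(positives, k=min(len(negatives), len(positives)))
    return negatives + sampled

# Hypothetical toy data standing in for the real reviews.
toy = [("bad", 1), ("poor", 2), ("meh", 3),
       ("good", 4), ("great", 5), ("superb", 5), ("nice", 4)]
balanced = balance(toy)
```

    Applied to the figures in the description, the same procedure keeps 26,700 negative reviews and samples 26,700 positive ones, which gives the reported total of 53,400 movie reviews.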

  3. Turkish-Dialog-Dataset

    • kaggle.com
    Updated May 17, 2023
    Cite
    (2023). Turkish-Dialog-Dataset [Dataset]. https://www.kaggle.com/datasets/talharzgarakku/turkish-dialog-dataset
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 17, 2023
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Introducing the Turkish Dialog Dataset

    The Turkish Dialog Dataset is a new resource for researchers and developers working on natural language processing (NLP) and machine learning (ML) projects. This dataset contains a large collection of conversational data in Turkish, providing a valuable resource for training and testing NLP and ML models.

    The dataset includes conversations from a variety of sources, including the translated Cornell Movie Dialog dataset, the Ubuntu Dialog dataset, and special datasets. The data has been carefully curated and annotated to ensure high quality and accuracy.

    One of the key features of the Turkish Dialog Dataset is its focus on real-world conversational data. This makes it an ideal resource for developing NLP and ML models that can understand and generate natural-sounding Turkish text.

    This dataset can be used to train more sophisticated models that can understand the context of a conversation.

    Overall, the Turkish Dialog Dataset is an exciting new resource for anyone working on NLP or ML projects in Turkish. Its large size and high quality make it an invaluable tool for developing advanced models that can understand and generate natural-sounding Turkish text.

  4. Turkish Language Speech Datasets | NLP, Conversational AI & Machine Learning...

    • shaip.com
    • pl.shaip.com
    • +26 more
    Updated Feb 10, 2023
    Cite
    Shaip (2023). Turkish Language Speech Datasets | NLP, Conversational AI & Machine Learning [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Enhance your Conversational AI model with our Off-the-Shelf Turkish Language Dataset (Turkish Language Speech Datasets). Shaip's high-quality audio datasets are a quick and effective solution for model training.

  5. GlobalPhone Turkish

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    Cite
    ELRA (European Language Resources Association) (2017). GlobalPhone Turkish [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0206/
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Turkish corpus was produced using the Zaman newspaper. It contains recordings of 100 speakers (28 males, 72 females) recorded in Istanbul, Turkey. The following age distribution has been obtained: 30 speakers are below 19, 30 speakers are between 20 and 29, 23 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 3 speakers are over 50.

  6. Travel Call Center Speech Data: Turkish (Turkey)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Travel Call Center Speech Data: Turkish (Turkey) [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/travel-call-center-conversation-turkish-turkey
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    Türkiye
    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Turkish Language Call Center Speech Dataset for the Travel domain. It is a specialized and comprehensive collection of voice data designed to enhance the development of call center speech recognition models specifically for the Travel industry.

    With high-quality call center audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms in the Travel domain. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Turkish language spoken in Turkey.

    Speech Data:

    This training dataset comprises 30 hours of call center audio recordings covering various topics and scenarios related to the Travel domain, to build robust and accurate customer service speech technology.

    To curate realistic call center interactions, we collaborated with a diverse network of 60 expert native Turkish speakers from different states/provinces of Turkey. This collaborative effort ensures a balanced representation of Turkish accents, dialects, and demographics, promoting inclusivity and reducing biases in the dataset.

    Each audio recording captures the essence of unscripted and spontaneous conversations between call center agents and customers, with an average duration ranging from 5 to 15 minutes per call. The dataset includes both inbound and outbound calls, covering scenarios such as inquiries, promotional offers, complaints, technical support, and more. Additionally, the dataset contains call center conversations with both positive and negative outcomes, providing a diverse and realistic dataset.

    The speech data is available in WAV format with stereo channels, a bit depth of 16 bits, and a sample rate of 8 kHz, ensuring high-quality audio for accurate analysis. The recording environment is generally quiet, without background noise and echo.
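
    The stated audio properties (stereo, 16-bit, 8 kHz WAV) can be checked on a file with Python's standard `wave` module. The sketch below is an illustration only: `describe_wav` is a hypothetical helper, and the one-second clip of silence is generated in memory so the example runs without access to the dataset.

```python
import io
import wave

def describe_wav(fileobj):
    """Return (channels, bit_depth, sample_rate, duration_seconds) of a WAV file."""
    with wave.open(fileobj, "rb") as w:
        channels = w.getnchannels()
        bit_depth = w.getsampwidth() * 8   # sample width is reported in bytes
        rate = w.getframerate()
        duration = w.getnframes() / rate
    return channels, bit_depth, rate, duration

# Build one second of stereo, 16-bit, 8 kHz silence as a stand-in recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)       # 2 bytes = 16 bits
    w.setframerate(8000)
    w.writeframes(b"\x00" * (8000 * 2 * 2))  # frames * channels * bytes per sample
buf.seek(0)

info = describe_wav(buf)
```

    Passing a path to a real recording instead of `buf` should work the same way, since `wave.open` accepts either a file name or a file-like object.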

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This includes the participant’s age, gender, country, state, and dialect. Additionally, it includes metadata like domain, topic, call type, outcome, bit depth, and sample rate for each conversation.

    The metadata serves as a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Turkish language call center speech recognition models for the Travel domain.

    Transcription:

    To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags, covering both the agent and customer conversations.

    These ready-to-use transcriptions accelerate the development of Travel call center conversational AI and ASR models for the Turkish language.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our call center voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This Travel call center audio dataset is created by FutureBeeAI and is available for commercial use!

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, or building state-of-the-art voice assistants to improve customer experiences in the Travel sector, our dataset serves as a trusted resource to meet your goals.

  7. Turkish-English Dataset

    • kaggle.com
    zip
    Updated Jul 20, 2022
    + more versions
    Cite
    Enes Kulak (2022). Turkish-English Dataset [Dataset]. https://www.kaggle.com/datasets/eneskulak/turkishenglish-dataset
    Available download formats: zip (39208289 bytes)
    Dataset updated
    Jul 20, 2022
    Authors
    Enes Kulak
    Description

    Dataset

    This dataset was created by Enes Kulak

  8. Wake Word Turkish Dataset

    • shaip.com
    Updated Oct 12, 2023
    Cite
    Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Dataset updated
    Oct 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Wake Word Turkish Dataset is a collection of audio recordings specifically curated for training and evaluating wake word detection systems in the Turkish language. This dataset includes a variety of speakers, environments, and scenarios to ensure robustness and effectiveness in wake word detection algorithms. It serves as a valuable resource for researchers and developers working on voice-controlled systems and natural language processing applications in Turkish.

  9. Turkish Dataset for Identification of Author Gender - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 24, 2023
    + more versions
    Cite
    (2023). Turkish Dataset for Identification of Author Gender - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/65f524b1-e6d3-50b5-a959-19109494daca
    Dataset updated
    Oct 24, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The IAG-TNKU Dataset is a large collection of Turkish news articles that can be used in Turkish text classification tasks such as identifying author gender in Turkish news. The text data, belonging to 32 female and 38 male authors, was extracted from the archive of a newspaper (www.hurriyet.com.tr) for the interval between 08.11.1997 and 24.04.2019. The dataset, divided between males and females in a balanced way, consists of a total of 43,292 articles.

    How to use the IAG-TNKU Dataset:
    1. Unzip the compressed resources.
    2. There are two folders (Females and Males).
    3. Each folder contains a set of article files in .txt format corresponding to its category.
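
    Given the folder layout described above (one `Females` and one `Males` folder of .txt files), a labeled sample list could be built roughly as follows. This is a sketch under assumptions: `load_articles` is a hypothetical helper, UTF-8 encoding is assumed rather than documented, and a tiny temporary directory stands in for the unzipped dataset.

```python
from pathlib import Path
import tempfile

def load_articles(root):
    """Read every .txt file under root/Females and root/Males.

    Returns a list of (text, label) pairs, where label is the folder name.
    """
    samples = []
    for label in ("Females", "Males"):
        for path in sorted((Path(root) / label).glob("*.txt")):
            # UTF-8 is an assumption; adjust if the real files differ.
            samples.append((path.read_text(encoding="utf-8"), label))
    return samples

# Hypothetical stand-in for the unzipped dataset directory.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in (("Females", "örnek yazı"), ("Males", "başka yazı")):
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "0001.txt").write_text(text, encoding="utf-8")
    demo = load_articles(tmp)
```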

  10. Data from: Turkish-English parallel corpus MaCoCu-tr-en 2.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Apr 26, 2023
    + more versions
    Cite
    Marta Bañón; Malina Chichirau; Miquel Esplà-Gomis; Mikel L. Forcada; Aarón Galiano-Jiménez; Cristian García-Romero; Taja Kuzman; Nikola Ljubešić; Rik van Noord; Leopoldo Pla Sempere; Gema Ramírez-Sánchez; Peter Rupnik; Vít Suchomel; Antonio Toral; Jaume Zaragoza-Bernabeu (2023). Turkish-English parallel corpus MaCoCu-tr-en 2.0 [Dataset]. https://www.clarin.si/repository/xmlui/handle/11356/1816
    Dataset updated
    Apr 26, 2023
    Authors
    Marta Bañón; Malina Chichirau; Miquel Esplà-Gomis; Mikel L. Forcada; Aarón Galiano-Jiménez; Cristian García-Romero; Taja Kuzman; Nikola Ljubešić; Rik van Noord; Leopoldo Pla Sempere; Gema Ramírez-Sánchez; Peter Rupnik; Vít Suchomel; Antonio Toral; Jaume Zaragoza-Bernabeu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Turkish-English parallel corpus MaCoCu-tr-en 2.0 was built by crawling the “.tr” and “.cy” internet top-level domains in 2021, extending the crawl dynamically to other domains as well.

    The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicated paragraphs and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and BicleanerAI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.

    The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. They both consist of pairs of source and target segments (one or several sentences) and additional metadata. The following metadata is included in both sentence-level formats:
    - source and target document URL;
    - paragraph ID which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3” which means “paragraph 35 out of 77, sentence 1 out of 3”);
    - quality score as provided by the tool Bicleaner AI (a likelihood of a pair of sentences being mutual translations, provided with a score between 0 and 1);
    - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
    - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
    - translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified by using a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines if the translation has been produced by a machine-translation system;
    - a DSI class (“dsi”): information whether the segment is connected to any of Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal), defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
    - English language variant: the language variant of English (British or American, using a lexicon-based English variety classifier - https://pypi.org/project/abclf/) was identified on document and domain level.
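
    The paragraph ID format quoted above (“p35:77s1/3”) can be unpacked with a short regular expression. The sketch below is illustrative: the `ParagraphID` type and `parse_paragraph_id` function are hypothetical names, and the pattern is inferred solely from the single example given in the description, so real IDs may have variants it does not cover.

```python
import re
from typing import NamedTuple

class ParagraphID(NamedTuple):
    paragraph: int         # position of the paragraph in the document
    paragraphs_total: int  # number of paragraphs in the document
    sentence: int          # position of the sentence in the paragraph
    sentences_total: int   # number of sentences in the paragraph

# Pattern inferred from the example "p35:77s1/3" (an assumption).
_PATTERN = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")

def parse_paragraph_id(value):
    """Split an ID such as 'p35:77s1/3' into its four integer fields."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized paragraph ID: {value!r}")
    return ParagraphID(*(int(group) for group in match.groups()))

pid = parse_paragraph_id("p35:77s1/3")
```

    Here `pid` holds paragraph 35 of 77 and sentence 1 of 3, matching the reading given in the description.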

    Furthermore, the sentence-level TXT format provides additional metadata:
    - web domain of the text;
    - source and target document title;
    - the date when the original file was retrieved;
    - the original type of the file (e.g., “html”), from which the sentence was extracted;
    - paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext);
    - information whether the sentence is a heading or not in the original document.

    The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American).

    As opposed to the previous version, this version has more accurate metadata on languages of the texts, which was achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools, used for web corpora creation and curation, have been updated as well, resulting in an even cleaner corpus. The new version also provides additional metadata, such as the position of the sentence in the paragraph and document, and information whether the sentence is related to a DSI. Moreover, the corpus is now also provided in a document-level format.

    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  11. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...

    • b2find.dkrz.de
    Updated Nov 2, 2023
    + more versions
    Cite
    (2023). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/6524b0d6-b642-5dfa-8084-ab210abc2387
    Dataset updated
    Nov 2, 2023
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization. First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers approximately 23M named entities. By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result of this process, we introduce 3 versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences for each version (the number varies between versions), while the English collections contain more than 7M sentences. We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types into "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained version. Note that this process also eliminated many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.

    All processes are explained in our published white paper for Turkish; however, the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) do not change for English.

  12. Turkish web corpus MaCoCu-tr 1.0

    • live.european-language-grid.eu
    • clarin.si
    xml
    Updated Apr 26, 2022
    + more versions
    Cite
    (2022). Turkish web corpus MaCoCu-tr 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/19770
    Available download formats: xml
    Dataset updated
    Apr 26, 2022
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" internet top-level domain in 2021, extending the crawl dynamically to other domains as well (https://github.com/macocu/MaCoCu-crawler).

    Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies.

    Each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score (based on a language model). The text of each document is divided into paragraphs that are accompanied by metadata on the information whether a paragraph is a heading or not, metadata on the paragraph quality and fluency, the automatically identified language of the text in the paragraph, and information whether the paragraph contains personal information.

    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

  13. Telecom Call Center Speech Data: Turkish (Turkey)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Telecom Call Center Speech Data: Turkish (Turkey) [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/telecom-call-center-conversation-turkish-turkey
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    Türkiye
    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Turkish Language Call Center Speech Dataset for the Telecom domain. It is a specialized and comprehensive collection of voice data designed to enhance the development of call center speech recognition models specifically for the Telecom industry.

    With high-quality call center audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms in the Telecom domain. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Turkish language spoken in Turkey.

    Speech Data:

    This training dataset comprises 30 hours of call center audio recordings covering various topics and scenarios related to the Telecom domain, to build robust and accurate customer service speech technology.

    To curate realistic call center interactions, we collaborated with a diverse network of 60 expert native Turkish speakers from different states/provinces of Turkey. This collaborative effort ensures a balanced representation of Turkish accents, dialects, and demographics, promoting inclusivity and reducing biases in the dataset.

    Each audio recording captures the essence of unscripted and spontaneous conversations between call center agents and customers, with an average duration ranging from 5 to 15 minutes per call. The dataset includes both inbound and outbound calls, covering scenarios such as inquiries, promotional offers, complaints, technical support, and more. Additionally, the dataset contains call center conversations with both positive and negative outcomes, providing a diverse and realistic dataset.

    The speech data is available in WAV format with stereo channels, a bit depth of 16 bits, and a sample rate of 8 kHz, ensuring high-quality audio for accurate analysis. The recording environment is generally quiet, without background noise and echo.

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This includes the participant’s age, gender, country, state, and dialect. Additionally, it includes metadata like domain, topic, call type, outcome, bit depth, and sample rate for each conversation.

    The metadata serves as a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Turkish language call center speech recognition models for the Telecom domain.

    Transcription:

    To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags, covering both the agent and customer conversations.

    These ready-to-use transcriptions accelerate the development of Telecom call center conversational AI and ASR models for the Turkish language.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our call center voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This Telecom call center audio dataset is created by FutureBeeAI and is available for commercial use!

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, or building state-of-the-art voice assistants to improve customer experiences in the Telecom sector, our dataset serves as a trusted resource to meet your goals.

  14. P

    Turkish Punctuation Restoration Dataset

    • paperswithcode.com
    Updated Sep 14, 2023
    Cite
    (2023). Turkish Punctuation Restoration Dataset [Dataset]. https://paperswithcode.com/dataset/turkish-punctuation-restoration
    Explore at:
    Dataset updated
    Sep 14, 2023
    Description

    We have prepared a dataset using publicly available TED Talks transcripts [27] and selected the Turkish corpus. The resulting Turkish punctuation restoration dataset currently consists of 146K sentences and 1.8M tokens. The train, validation, and test splits have ratios of 0.8, 0.1, and 0.1, respectively. Data files contain two columns. The first column has the tokens separated by white space. The second column includes tags for each token.
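
    The two-column layout and the 0.8/0.1/0.1 split described above can be sketched as follows. This is an illustration under stated assumptions: the description does not say how the two columns are delimited, so a tab separator is assumed here, and the tag names in the usage note are hypothetical placeholders rather than the dataset's actual label set.

```python
def parse_line(line, column_sep="\t"):
    """Split one data line into (token, tag) pairs.

    Assumption: the token column and the tag column are separated by
    `column_sep`, and items inside each column are whitespace-separated.
    """
    tokens_field, tags_field = line.rstrip("\n").split(column_sep, 1)
    tokens = tokens_field.split()
    tags = tags_field.split()
    if len(tokens) != len(tags):
        raise ValueError("token and tag counts differ")
    return list(zip(tokens, tags))


def split_sizes(n, ratios=(0.8, 0.1, 0.1)):
    """Return (train, validation, test) sizes for n sentences."""
    train = round(n * ratios[0])
    validation = round(n * ratios[1])
    # The test split takes the remainder so the three sizes sum to n.
    return train, validation, n - train - validation
```

    For example, `parse_line("merhaba dünya\tO PERIOD")` pairs each token with its tag, where `O` and `PERIOD` are made-up tags for illustration only.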

  15. T

    Turkey Turkish Airlines: AS: International Flights: North America

    • ceicdata.com
    Updated Aug 8, 2018
    + more versions
    Cite
    CEICdata.com (2018). Turkey Turkish Airlines: AS: International Flights: North America [Dataset]. https://www.ceicdata.com/en/turkey/airlines-statistics-turkish-airlines
    Explore at:
    Dataset updated
    Aug 8, 2018
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2009 - Dec 1, 2017
    Area covered
    Türkiye
    Variables measured
    Vehicle Traffic
    Description

    Turkish Airlines: AS: International Flights: North America data was reported at 23,448,567.000 Unit/km th in 2017. This records a decrease from the previous number of 25,047,950.000 Unit/km th for 2016. Turkish Airlines: AS: International Flights: North America data is updated yearly, averaging 13,715,060.000 Unit/km th from Dec 2009 to 2017, with 9 observations. The data reached an all-time high of 25,047,950.000 Unit/km th in 2016 and a record low of 3,345,763.000 Unit/km th in 2009. Turkish Airlines: AS: International Flights: North America data remains in active status in CEIC and is reported by Turkish Airlines, Incorporation. The data is categorized under Global Database’s Turkey – Table TR.TA011: Airlines Statistics: Turkish Airlines.

  16. h

    stsb-mt-turkish

    • huggingface.co
    Updated Dec 25, 2021
    Cite
    Emrecan Çelik (2021). stsb-mt-turkish [Dataset]. https://huggingface.co/datasets/emrecan/stsb-mt-turkish
    Explore at:
    Dataset updated
    Dec 25, 2021
    Authors
    Emrecan Çelik
    Description

    STSb Turkish

    Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the STSb English dataset. This dataset is not reviewed by expert human translators. Uploaded from this repository.

  17. T

    Turkey Imports

    • tradingeconomics.com
    • fr.tradingeconomics.com
    • +16more
    csv, excel, json, xml
    Updated Apr 4, 2024
    Cite
    TRADING ECONOMICS (2024). Turkey Imports [Dataset]. https://tradingeconomics.com/turkey/imports
    Explore at:
    Available download formats: xml, excel, csv, json
    Dataset updated
    Apr 4, 2024
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 31, 1957 - Mar 31, 2024
    Area covered
    Türkiye
    Description

    Imports in Turkey increased to 30100 USD Million in March from 27852.81 USD Million in February of 2024. This dataset provides the latest reported value for - Turkey Imports - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.

  18. E

    Turkish Speecon database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). Turkish Speecon database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0178/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Turkish Speecon database is divided into 2 sets:

    1) The first set comprises the recordings of 550 adult Turkish speakers (280 males, 270 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
    2) The second set comprises the recordings of 50 child Turkish speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children room).

    This database is partitioned into 28 DVDs (first set) and 4 DVDs (second set).

    The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.

    Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.

    Each speaker uttered the following items:

    Calibration data:
    • 6 noise recordings
    • The “silence word” recording

    Free spontaneous items (adults only):
    • 3 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)

    17 Elicited spontaneous items (adults only):
    • 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language

    Read speech:
    • 30 phonetically rich sentences uttered by adults and 60 uttered by children
    • 5 phonetically rich words (adults only)
    • 4 isolated digits
    • 1 isolated digit sequence
    • 4 connected digit sequences
    • 1 telephone number
    • 3 natural numbers
    • 1 money amount
    • 2 time phrases (T1: analogue, T2: digital)
    • 3 dates (D1: analogue, D2: relative and general date, D3: digital)
    • 3 letter sequences
    • 1 proper name
    • 2 city or street names
    • 2 questions
    • 2 special keyboard characters
    • 1 Web address
    • 1 email address
    • 222 application specific words and phrases per session (adults)
    • 74 toy commands, 14 general commands, 31 phone commands and 4 application word synonyms (children)

    The following age distribution has been obtained:
    • Adults: 244 speakers are between 15 and 30, 235 speakers are between 31 and 45, and 71 speakers are over 46.
    • Children: 25 speakers are between 8 and 10, 25 speakers are between 11 and 15.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
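
    The sample encoding described above (16 bit, uncompressed unsigned integers, lo-hi byte order) corresponds to unsigned 16-bit little-endian values. A minimal sketch of decoding such raw bytes with Python's standard `struct` module follows; it assumes the signal bytes have already been read from a file and contain no header, which the description does not confirm, and the function name is a hypothetical helper.

```python
import struct


def decode_u16_le(raw):
    """Decode raw bytes as unsigned 16-bit little-endian (lo-hi) samples."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    count = len(raw) // 2
    # "<" selects little-endian byte order, "H" an unsigned 16-bit integer.
    return list(struct.unpack("<%dH" % count, raw))
```

    Multi-channel layout and any header handling are left out here because the listing does not specify them.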

  19. F

    Turkish Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/turkish-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Turkish Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Turkish language, advancing the field of artificial intelligence.

    Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Turkish. Each question comes with a context paragraph from which the answer is drawn. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Turkish speakers, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph answers. The answers also contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Turkish Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.

    Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Turkish version is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Turkish Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  20. P

    NLI-TR Dataset

    • paperswithcode.com
    + more versions
    Cite
    Emrah Budur; Rıza Özçelik; Tunga Güngör; Christopher Potts, NLI-TR Dataset [Dataset]. https://paperswithcode.com/dataset/nli-tr
    Explore at:
    Authors
    Emrah Budur; Rıza Özçelik; Tunga Güngör; Christopher Potts
    Description

    Natural Language Inference in Turkish (NLI-TR) provides translations of two large English NLI datasets into Turkish; a team of experts validated their translation quality and fidelity to the original labels.

Cite
Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset

turkish-sentiment-analysis-dataset

Turkish Sentiment Dataset

winvoker/turkish-sentiment-analysis-dataset

Explore at:
6 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jun 21, 2022
Authors
Batuhan
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dataset

This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be an entirely neutral sentence. For such cases there were no data I could find. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
