100+ datasets found
  1. All Turkish Words Dataset 📃🖊️

    • kaggle.com
    Updated Mar 14, 2024
    Cite
    Enis Tuna (2024). All Turkish Words Dataset 📃🖊️ [Dataset]. https://www.kaggle.com/datasets/enistuna/all-turkish-words-dataset
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Enis Tuna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ALL TURKISH WORDS DATASET

    This dataset contains all the Turkish words I've managed to fetch from the web. The dataset has approximately 7 million lines of Turkish word tokens, each separated by " " so it is easier to read.

    Some words are different variations of the same word e.g. "araba", "arabada", "arabadan". Feel free to use lemmatization algorithms to reduce the data size.
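
The suggestion above can be illustrated with a minimal Python sketch. It is not real lemmatization: the suffix list is a tiny, hypothetical sample (locative and ablative endings only), and the sketch assumes the tokens arrive as one space-separated string, as the description states. A proper Turkish morphological analyzer would be needed for real use, because this heuristic over-strips some words.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix list; longer suffixes come first
# so that "dan" is tried before "da".
SUFFIXES = ("dan", "den", "tan", "ten", "da", "de", "ta", "te")


def crude_stem(token: str, min_stem: int = 2) -> str:
    """Strip at most one known suffix; a toy stand-in for lemmatization."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def group_variants(text: str) -> dict[str, set[str]]:
    """Split space-separated tokens and group them by crude stem."""
    groups: dict[str, set[str]] = defaultdict(set)
    for token in text.split():
        groups[crude_stem(token)].add(token)
    return dict(groups)


sample = "araba arabada arabadan ev evde"
groups = group_variants(sample)
```

Here the three forms quoted in the description end up under the single key `araba`, so keeping one entry per key shrinks the word list.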

    I believe this dataset could be improved upon. It certainly is not finished. I will update this dataset if I can get my hands on new words in the future.

    My LinkedIn: https://www.linkedin.com/in/enistuna/
    My GitHub: https://github.com/enistuna

  2. stsb-mt-turkish

    • huggingface.co
    Updated Dec 25, 2021
    Cite
    Emrecan Çelik (2021). stsb-mt-turkish [Dataset]. https://huggingface.co/datasets/emrecan/stsb-mt-turkish
    Dataset updated
    Dec 25, 2021
    Authors
    Emrecan Çelik
    Description

    STSb Turkish

    Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the STSb English dataset. This dataset is not reviewed by expert human translators. Uploaded from this repository.

  3. turkish-sentiment-analysis-dataset

    • huggingface.co
    Updated Jun 21, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be a completely neutral sentence, and I could find no data for such cases. Therefore, I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  4. Turkish Language Speech Datasets | NLP, Conversational AI & Machine Learning...

    • mg.shaip.com
    • uz.shaip.com
    • +71 more
    Updated Dec 9, 2024
    Cite
    Shaip (2024). Turkish Language Speech Datasets | NLP, Conversational AI & Machine Learning [Dataset]. https://mg.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Dataset updated
    Dec 9, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Türkiye
    Description

    Enhance your Conversational AI model with our Off-the-Shelf Turkish Language Dataset (Turkish Language Speech Datasets). Shaip high-quality audio datasets are a quick and effective solution for model training.

  5. GlobalPhone Turkish

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    + more versions
    Cite
    GlobalPhone Turkish [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1917
    Available download formats: audio format
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
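
As a rough sizing aid, the audio parameters quoted above (16-bit, 16 kHz, mono) translate into storage as follows. This is a back-of-the-envelope sketch that assumes raw, uncompressed PCM with no file headers; the corpus is actually distributed in compressed form, so real download sizes will differ.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz, as stated in the description
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHANNELS = 1             # mono


def pcm_bytes(hours: float) -> int:
    """Size in bytes of raw PCM audio of the given duration."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


bytes_per_hour = pcm_bytes(1)         # 115_200_000 bytes, about 115 MB
whole_corpus_bytes = pcm_bytes(450)   # 51_840_000_000 bytes, about 52 GB
```

Since the description says "over 450 hours", the second figure is a lower bound for the uncompressed corpus.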

    The data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

    The Turkish corpus was produced using the Zaman newspaper. It contains recordings of 100 speakers (28 males, 72 females) recorded in Istanbul, Turkey. The following age distribution has been obtained: 30 speakers are below 19, 30 speakers are between 20 and 29, 23 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 3 speakers are over 50.

  6. Turkish Wikipedia Dataset

    • kaggle.com
    Updated Mar 19, 2024
    Cite
    Osman Kagan Kurnaz (2024). Turkish Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/osmankagankurnaz/turkish-wikipedia-dataset
    Dataset updated
    Mar 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Osman Kagan Kurnaz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
    • The articles in this dataset are not specifically tagged for a particular task and the dataset is untagged.
    • This dataset is written in Turkish and was created by a team of volunteers using community engagement methods.
    • This dataset is an original dataset created from the Turkish Wikipedia.

    Thanks for using the Turkish Wikipedia dataset! We hope it will be useful for your language modeling and text generation tasks.

    Because the Turkish Wikipedia dataset was not on Kaggle, I took a dataset shared on Hugging Face, merged it into 2 parquet files, and shared it on Kaggle. The link below leads to the version of the dataset shared on Hugging Face. I would also like to thank https://huggingface.co/musabg for creating this dataset.

    Original link to this dataset: https://huggingface.co/datasets/musabg/wikipedia-tr

  7. data-turkish-class

    • huggingface.co
    Updated Feb 25, 2023
    Cite
    savc (2023). data-turkish-class [Dataset]. https://huggingface.co/datasets/pnrr/data-turkish-class
    Dataset updated
    Feb 25, 2023
    Authors
    savc
    License

    https://choosealicense.com/licenses/other/

    Description

    pnrr/data-turkish-class dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. General Domain Scripted Monologue Speech Data: Turkish (Turkey)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). General Domain Scripted Monologue Speech Data: Turkish (Turkey) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-turkish-turkey
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    Türkiye
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish Scripted Monologue Speech Dataset for the General Domain. This meticulously curated dataset is designed to advance the development of General domain Turkish language speech recognition models.

    Speech Data

    This training dataset comprises over 6,000 high-quality scripted prompt recordings in Turkish. These recordings cover various General domain topics and scenarios, designed to build robust and accurate speech technology.

    Participant Diversity:
    • Speakers: 60 native Turkish speakers from different regions of Turkey.
    • Regions: Ensures a balanced representation of Turkish accents, dialects, and demographics.
    • Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
    Recording Details:
    • Recording Nature: Audio recordings of scripted prompts/monologues.
    • Audio Duration: Average duration of 5 to 30 seconds per recording.
    • Formats: WAV format with mono channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.
    • Environment: Recordings are conducted in quiet settings without background noise and echo.
    Topic Diversity: The dataset encompasses a wide array of topics and conversational scenarios from the General domain. Topics include:
    • Daily Conversations
    • Topic-Specific Conversations
    • General Information and Advice
    • Idioms and Sayings
    Other Elements: To enhance realism and utility, the scripted prompts incorporate various elements commonly encountered in general interactions:
    • Names: Region-specific names of males and females in various formats.
    • Addresses: Region-specific addresses in different spoken formats.
    • Dates & Times: Inclusion of date and time in various contexts.
    • Organization Names: Names of different types of organizations.
    • Numbers & Currencies: Various numbers and currencies in domain-specific interactions.

    Each scripted prompt is crafted to reflect real-life scenarios encountered in the General domain, ensuring applicability in training robust natural language processing and speech recognition models.
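
The recording format stated above (mono WAV, 16-bit, 8 kHz or 16 kHz) can be checked programmatically before training. The following sketch uses Python's standard `wave` module; the helper names `matches_spec` and `make_silent_wav` are hypothetical, and the silent in-memory file only stands in for a real recording, since the dataset's actual file layout is not described here.

```python
import io
import wave

ALLOWED_RATES = {8_000, 16_000}  # sample rates listed in the description


def matches_spec(wav_file) -> bool:
    """Return True if the WAV is mono, 16-bit, and at an allowed sample rate.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bits
            and wav.getframerate() in ALLOWED_RATES
        )


def make_silent_wav(rate: int, channels: int = 1, seconds: float = 0.1) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory (a stand-in for a real file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


ok_16k = matches_spec(make_silent_wav(16_000))
ok_44k = matches_spec(make_silent_wav(44_100))
```

In practice `matches_spec` would be called with the path of each delivered recording, and any file that fails the check would be resampled or excluded.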

    Transcription Data

    In addition to high-quality audio recordings, the dataset includes meticulously prepared text files with verbatim transcriptions of each audio file. These transcriptions are essential for training accurate and robust speech recognition models.

    Content: Each text file contains the exact scripted prompt corresponding to its audio file, ensuring consistency.
    Format: Transcriptions are provided in plain text (.TXT) format, with files named to match their associated audio files for easy reference.
    Quality: All transcriptions are verified for accuracy and consistency by native Turkish transcribers.

    Metadata

    The dataset provides comprehensive metadata for each audio recording and participant:

    Participant Metadata: Unique identifier, age, gender, country, state, and dialect.
    Other Metadata:

  9. Real Estate Call Center Speech Data: Turkish (Turkey)

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Real Estate Call Center Speech Data: Turkish (Turkey) [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-turkish-turkey
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    Türkiye
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish Call Center Speech Dataset for the Real Estate domain designed to enhance the development of call center speech recognition models specifically for the Real Estate industry. This dataset is meticulously curated to support advanced speech recognition, natural language processing, conversational AI, and generative voice AI algorithms.

    Speech Data:

    This training dataset comprises 30 Hours of call center audio recordings covering various topics and scenarios related to the Real Estate domain, designed to build robust and accurate customer service speech technology.

    Participant Diversity:
    • Speakers: 60 expert native Turkish speakers from the FutureBeeAI Community.
    • Regions: Different states/provinces of Turkey, ensuring a balanced representation of Turkish accents, dialects, and demographics.
    • Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
    Recording Details:
    • Conversation Nature: Unscripted and spontaneous conversations between call center agents and customers.
    • Call Duration: Average duration of 5 to 15 minutes per call.
    • Formats: WAV format with stereo channels, a bit depth of 16 bits, and sample rates of 8 kHz and 16 kHz.
    • Environment: Without background noise and without echo.

    Topic Diversity

    This dataset offers a diverse range of conversation topics, call types, and outcomes, including both inbound and outbound calls with positive, neutral, and negative outcomes.

    Inbound Calls:
    • Property Inquiry
    • Rental Property Search & Availability
    • Renovation Inquiries
    • Property Features & Amenities Inquiry
    • Investment Property Analysis & Advice
    • Property History & Ownership Details, and many more
    Outbound Calls:
    • New Property Listing Update
    • Post Purchase Follow-ups
    • Investment Opportunities & Property Recommendations
    • Property Value Updates
    • Customer Satisfaction Surveys, and many more

    This extensive coverage ensures the dataset includes realistic call center scenarios, which is essential for developing effective customer support speech recognition models.

    Transcription

    To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. These transcriptions feature:

    • Speaker-wise Segmentation: Time-coded segments for both agents and customers.
    • Non-Speech Labels: Tags and labels for non-speech elements.
    • Word Error Rate: Word error rate is less than 5% thanks to the dual layer of QA.

    These ready-to-use transcriptions accelerate the development of the Real Estate domain call center conversational AI and ASR models for the Turkish language.
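
For readers unfamiliar with the word error rate quoted above, here is a minimal Python sketch of the standard definition: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The source does not say how its figure was measured, so this is an illustration of the metric, not the vendor's procedure; the sample phrases are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution and one deletion against 4 reference words.
wer = word_error_rate("bu ev satılık mı", "bu evi satılık")
```

A transcription that meets the stated quality bar would score below 0.05 on this measure when compared against a trusted reference.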

    Metadata

    The dataset provides comprehensive metadata for each conversation and participant:

    Participant Metadata: Unique identifier, age, gender, country, state, district, accent and dialect.
    Conversation Metadata: Domain, topic, call type, outcome/sentiment, bit depth, and sample rate.

    This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Turkish call center speech recognition models.

    Usage and

  10. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...

    • data.mendeley.com
    Updated Feb 9, 2017
    + more versions
    Cite
    H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
    Dataset updated
    Feb 9, 2017
    Authors
    H. Bahadir Sahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

    First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers contain approximately 23M named entities.

    By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent (b) domain-independent. We produce two different versions by post-processing raw collections. As a result of this process, we introduced 3 versions of TWNERTC and EWNERTC: (a) raw (b) domain-dependent post-processed (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (varies between versions), while English collections contain more than 7M sentences.

    We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types into "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained version. Note that this process also eliminated many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and they have fewer sentences than the "Fine-Grained NER" versions.

    All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.

  11. turkish-sentiment-analysis-dataset_ENG

    • huggingface.co
    Updated Dec 10, 2024
    Cite
    turkish-sentiment-analysis-dataset_ENG [Dataset]. https://huggingface.co/datasets/naytin/turkish-sentiment-analysis-dataset_ENG
    Dataset updated
    Dec 10, 2024
    Authors
    inayet Cizmeci
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    naytin/turkish-sentiment-analysis-dataset_ENG dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. Turkish-Biomedical-corpus-trM

    • huggingface.co
    Updated Apr 4, 2023
    Cite
    hazal türkmen (2023). Turkish-Biomedical-corpus-trM [Dataset]. https://huggingface.co/datasets/hazal/Turkish-Biomedical-corpus-trM
    Dataset updated
    Apr 4, 2023
    Authors
    hazal türkmen
    Description

    hazal/Turkish-Biomedical-corpus-trM dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. NLI-TR Dataset

    • paperswithcode.com
    + more versions
    Cite
    Emrah Budur; Rıza Özçelik; Tunga Güngör; Christopher Potts, NLI-TR Dataset [Dataset]. https://paperswithcode.com/dataset/nli-tr
    Authors
    Emrah Budur; Rıza Özçelik; Tunga Güngör; Christopher Potts
    Description

    Natural Language Inference in Turkish (NLI-TR) provides Turkish translations of two large English NLI datasets; a team of experts validated their translation quality and fidelity to the original labels.

  14. Wake Word Turkish Dataset | Shaip

    • mg.shaip.com
    • kn.shaip.com
    • +69 more
    Updated Dec 8, 2024
    Cite
    Shaip (2024). Wake Word Turkish Dataset | Shaip [Dataset]. https://mg.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Dataset updated
    Dec 8, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Wake Word Turkish Dataset is a collection of audio recordings specifically curated for training and evaluating wake word detection systems in the Turkish language. This dataset includes a variety of speakers, environments, and scenarios to ensure robustness and effectiveness in wake word detection algorithms. It serves as a valuable resource for researchers and developers working on voice-controlled systems and natural language processing applications in Turkish.

  15. Turkish web corpus MaCoCu-tr 1.0

    • live.european-language-grid.eu
    • clarin.si
    xml
    Updated Apr 26, 2022
    + more versions
    Cite
    (2022). Turkish web corpus MaCoCu-tr 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/19770
    Available download formats: xml
    Dataset updated
    Apr 26, 2022
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" internet top-level domain in 2021, extending the crawl dynamically to other domains as well (https://github.com/macocu/MaCoCu-crawler).

    Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies.

    Each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score (based on a language model). The text of each document is divided into paragraphs that are accompanied by metadata on the information whether a paragraph is a heading or not, metadata on the paragraph quality and fluency, the automatically identified language of the text in the paragraph, and information whether the paragraph contains personal information.

    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

  16. 504 Hours - Turkish(Turkey) Real-world Casual Conversation and Monologue...

    • m.nexdata.ai
    Updated Feb 9, 2024
    Cite
    Nexdata (2024). 504 Hours - Turkish(Turkey) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1324
    Dataset updated
    Feb 9, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    World, Türkiye
    Variables measured
    Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    The Turkish (Turkey) Real-world Casual Conversation and Monologue speech dataset covers self-media, conversation, live, and other generic domains, mirroring real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers, which is intended to improve model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.

  17. Turkish Intervention: Central Bank of Turkey Purchases of USD (Millions of...

    • fred.stlouisfed.org
    json
    Updated Jan 2, 2025
    Cite
    (2025). Turkish Intervention: Central Bank of Turkey Purchases of USD (Millions of USD) [Dataset]. https://fred.stlouisfed.org/series/TRINTDEXR
    Available download formats: json
    Dataset updated
    Jan 2, 2025
    License

    https://fred.stlouisfed.org/legal/#copyright-citation-required

    Description

    Graph and download economic data for Turkish Intervention: Central Bank of Turkey Purchases of USD (Millions of USD) (TRINTDEXR) from 2002-01-01 to 2025-01-02 about intervention, Turkey, banks, and depository institutions.

  18. Turkish Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/turkish-closed-ended-question-answer-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Turkish Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Turkish language, advancing the field of artificial intelligence.

    Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Turkish. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Turkish people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraphs types of answers. The answers contain text strings, numerical values, date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Turkish Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.

    Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Turkish version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Turkish Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  19. turkish-english-translate

    • huggingface.co
    Updated Feb 22, 2025
    + more versions
    Cite
    turkish-english-translate [Dataset]. https://huggingface.co/datasets/hasancanonder/turkish-english-translate
    Dataset updated
    Feb 22, 2025
    Authors
    Hasan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    hasancanonder/turkish-english-translate dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. flickr-turkish-dataset

    • kaggle.com
    Updated Aug 5, 2022
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Aug 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Enes Kulak
    Description

    Dataset

    This dataset was created by Enes Kulak

    Contents
