100+ datasets found
  1. Turkish Tweets Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2021
    Cite
    Anil Guven (2021). Turkish Tweets Dataset [Dataset]. https://www.kaggle.com/datasets/anil1055/turkish-tweet-dataset
    Explore at:
    Available download formats: zip (170312 bytes)
    Dataset updated
    Apr 9, 2021
    Authors
    Anil Guven
    Description

    The dataset has 5 emotion labels: anger, happy, distinguish, surprise and fear. There are 800 tweets for each label, so the dataset contains 4,000 tweets in total.

    You can use the dataset in many areas, such as sentiment analysis, emotion analysis and topic modeling.

    Info: Hashtags and usernames were removed from the dataset. The dataset has been used in several studies, listed below.
    - (Please cite this article.) Güven, Z. A., Diri, B., & Çakaloğlu, T. (2020). Comparison of n-stage Latent Dirichlet Allocation versus other topic modeling methods for emotion analysis. Journal of the Faculty of Engineering and Architecture of Gazi University. https://doi.org/10.17341/gazimmfd.556104
    - Güven, Z. A., Diri, B., & Çakaloğlu, T. (2019). Emotion Detection with n-stage Latent Dirichlet Allocation for Turkish Tweets. Academic Platform Journal of Engineering and Science. https://doi.org/10.21541/apjes.459447
    - Guven, Z. A., Diri, B., & Cakaloglu, T. (2019). Comparison Method for Emotion Detection of Twitter Users. Proceedings - 2019 Innovations in Intelligent Systems and Applications Conference, ASYU 2019. https://doi.org/10.1109/ASYU48272.2019.8946435

  2. turkish-sentiment-analysis-dataset

    • huggingface.co
    • kaggle.com
    Updated Jun 21, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels, positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  3. turkish-offensive-language-detection

    • huggingface.co
    Updated Sep 15, 2022
    + more versions
    Cite
    Toygar Tanyel (2022). turkish-offensive-language-detection [Dataset]. https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 15, 2022
    Authors
    Toygar Tanyel
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Summary

    This dataset is an enhanced version of existing offensive language studies. Existing studies are highly imbalanced, and solving this problem is too costly. To solve this, we proposed a contextual data mining method for dataset augmentation. Our method basically spares us from retrieving random tweets and labeling them individually. We can directly access almost exactly hate-related tweets and label them directly without any further human interaction in order to solve imbalanced… See the full description on the dataset page: https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection.

  4. NER Dataset(Turkish)

    • kaggle.com
    zip
    Updated Apr 14, 2024
    Cite
    Akay (2024). NER Dataset(Turkish) [Dataset]. https://www.kaggle.com/datasets/akay16/ner-datasetturkish
    Explore at:
    Available download formats: zip (9149708 bytes)
    Dataset updated
    Apr 14, 2024
    Authors
    Akay
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data split:

    • 18,000 train
    • 1000 test
    • 1000 dev

    Labels:

    • CARDINAL
    • DATE
    • EVENT
    • FAC
    • GPE
    • LANGUAGE
    • LAW
    • LOC
    • MONEY
    • NORP
    • ORDINAL
    • ORG
    • PERCENT
    • PERSON
    • PRODUCT
    • QUANTITY
    • TIME
    • TITLE
    • WORK_OF_ART

    I do not own this dataset. I changed the original format for easier use and turned it into a csv file and a .spacy file.

    You can reach the original version of the dataset from the link below:

    https://github.com/turkish-nlp-suite/Turkish-Wiki-NER-Dataset

  5. Turkish Wikipedia Dataset

    • kaggle.com
    zip
    Updated Mar 19, 2024
    Cite
    Osman Kagan Kurnaz (2024). Turkish Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/osmankagankurnaz/turkish-wikipedia-dataset
    Explore at:
    Available download formats: zip (458865119 bytes)
    Dataset updated
    Mar 19, 2024
    Authors
    Osman Kagan Kurnaz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
    • The articles in this dataset are not specifically tagged for a particular task and the dataset is untagged.
    • This dataset is written in Turkish and was created by a team of volunteers using community engagement methods.
    • This dataset is an original dataset created from the Turkish Wikipedia.

    Thanks for using the Turkish Wikipedia dataset! We hope it will be useful for your language modeling and text generation tasks.

    Since the Turkish Wikipedia dataset is not on Kaggle, I took a shared dataset on Huggingface. I merged the shared dataset as 2 parquet files and shared it on Kaggle. You can go to the version of the dataset shared on Huggingface from the link below. I would also like to thank https://huggingface.co/musabg for creating this dataset.

    Original link to this dataset: https://huggingface.co/datasets/musabg/wikipedia-tr

  6. Genius-Turkish-Dataset

    • kaggle.com
    zip
    Updated Nov 2, 2025
    + more versions
    Cite
    Mustafa Kemal Çıngıl (2025). Genius-Turkish-Dataset [Dataset]. https://www.kaggle.com/datasets/mustafakemal0146/genius-turkish-dataset
    Explore at:
    Available download formats: zip (22818735 bytes)
    Dataset updated
    Nov 2, 2025
    Authors
    Mustafa Kemal Çıngıl
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    TURKISH SONG LYRICS FROM GENIUS DATASET

    DATASET DESCRIPTION

    This dataset contains a comprehensive collection of 44,692 Turkish song lyrics, extracted from the larger "Genius Song Lyrics with Language Information" dataset available on Kaggle. The original 9.07 GB dataset was filtered to include only songs identified with the language code 'tr' (Turkish), making it a clean and focused resource for Turkish Natural Language Processing (NLP) tasks.

    HOW TO USE

    You can easily load this dataset using the Hugging Face datasets library.

    Example Python code:

    ```python
    from datasets import load_dataset

    # Load the dataset from the Hugging Face Hub
    dataset = load_dataset("mustafakemal0146/Genius-Turkish-Dataset")

    # Example: access the lyrics of the first song in the training split
    print(dataset['train'][0]['lyrics'])
    ```

    DATASET STRUCTURE

    The dataset consists of a single CSV file, loaded as the train split, with the following columns:

    • title (string): The title of the song.
    • tag (string): The genre tag associated with the song (e.g., 'rap', 'pop').
    • artist (string): The name of the primary artist.
    • year (int): The release year of the song.
    • views (int): The number of views on Genius.com.
    • features (string): A string representation of featuring artists.
    • lyrics (string): The full lyrics of the song.
    • id (int): A unique identifier from the original dataset.
    • language_cld3 (string): Language code detected by the CLD3 model (all 'tr').
    • language_ft (string): Language code detected by the FastText model (all 'tr').
    • language (string): Final aggregated language code (all 'tr').

    DATA SOURCE AND CURATION

    This dataset is a curated subset of the "Genius Song Lyrics with Language Information" (Link: https://www.kaggle.com/datasets/pavanelisetty/genius-song-lyrics-with-language-information) dataset on Kaggle, originally collected from Genius.com. The filtering process involved reading the main song_lyrics.csv file and selecting all rows where the language column was equal to 'tr'.
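
    The filtering step described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the author's actual script: the in-memory sample rows and the helper name filter_by_language are hypothetical, and only the language column name and the 'tr' code come from the description.

```python
import csv
import io

# Tiny in-memory stand-in for song_lyrics.csv (hypothetical rows, not real data).
SAMPLE_CSV = """title,artist,language,lyrics
Song A,Artist 1,tr,la la
Song B,Artist 2,en,na na
Song C,Artist 3,tr,da da
"""


def filter_by_language(csv_file, language_code="tr"):
    """Return the rows whose 'language' column equals language_code."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["language"] == language_code]


turkish_rows = filter_by_language(io.StringIO(SAMPLE_CSV))
print([row["title"] for row in turkish_rows])
```

    For the real 9.07 GB file, reading row by row as above (rather than loading everything at once) keeps memory use low; pass an open file handle instead of the StringIO object.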

    LICENSE

    The original dataset on Kaggle does not specify a license. This curated version is shared under the "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)" (Link: https://creativecommons.org/licenses/by-nc-sa/4.0/) license, assuming it will be used for non-commercial research and educational purposes. Please refer to the original data source for any commercial use inquiries.

    Created by MustafaKemal0146 (Hugging Face Profile: https://huggingface.co/MustafaKemal0146)

  7. Turkish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:

    - Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.

    - Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.

    - Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.

    Recording Details:

    - Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.

    - Duration: Each conversation ranges from 15 to 60 minutes.

    - Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.

    - Environment: Quiet, echo-free settings with no background noise.
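
    A quick way to confirm that a recording matches the stated audio format (stereo, 16-bit, 16 kHz) is to read its header with Python's standard wave module. The sketch below is an assumption-laden illustration: the helper names describe_wav and make_silent_wav are hypothetical, and a generated silent clip stands in for a real file from the dataset.

```python
import io
import wave


def describe_wav(wav_file):
    """Return (channels, bit depth, sample rate in Hz) for a WAV path or file object."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def make_silent_wav(channels=2, sample_width=2, rate=16000, seconds=1):
    """Build an in-memory silent WAV so the check can run without real data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample: 2 bytes = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * channels * sample_width * rate * seconds)
    buffer.seek(0)
    return buffer


EXPECTED_FORMAT = (2, 16, 16000)  # stereo, 16-bit, 16 kHz, as listed above
print(describe_wav(make_silent_wav()) == EXPECTED_FORMAT)
```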

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:

    - Family & Relationships

    - Food & Recipes

    - Education & Career

    - Healthcare Discussions

    - Social Issues

    - Technology & Gadgets

    - Travel & Local Culture

    - Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:

    - Speaker-segmented dialogues

    - Time-coded utterances

    - Non-speech elements (pauses, laughter, etc.)

    - High transcription accuracy, achieved through a double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
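
    The word error rate (WER) quoted above is conventionally the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition; it is not FutureBeeAI's QA tooling, and the function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One dropped word out of four reference words.
print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))
```

    Real evaluations usually normalize case and punctuation before comparing; that step is left out here to keep the sketch short.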

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.

    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    License

    This Turkish General Conversation Dataset is created by FutureBeeAI and is available for commercial licensing.

  8. stsb-mt-turkish

    • huggingface.co
    Updated Dec 25, 2021
    Cite
    Emrecan Çelik (2021). stsb-mt-turkish [Dataset]. https://huggingface.co/datasets/emrecan/stsb-mt-turkish
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2021
    Authors
    Emrecan Çelik
    Description

    STSb Turkish

    Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the STSb English dataset. This dataset is not reviewed by expert human translators. Uploaded from this repository.

  9. Turkish General Domain Scripted Monologue Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Domain Scripted Monologue Speech Data [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Turkish Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Turkish language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Turkish speech data.

    Speech Data

    This dataset features over 6,000 high-quality scripted monologue recordings in Turkish. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.

    Participant Diversity

    - Speakers: 60 native Turkish speakers

    - Regions: Broad regional coverage ensures diverse accents and dialects

    - Demographics: Participants aged 18 to 70, with a 60:40 male-to-female ratio

    Recording Specifications

    - Recording Type: Scripted monologues and prompt-based recordings

    - Audio Duration: 5 to 30 seconds per file

    - Format: WAV, mono channel, 16-bit, 8 kHz & 16 kHz sample rates

    - Environment: Clean, noise-free conditions to ensure clarity and usability

    Topic Coverage

    The dataset covers a wide variety of general conversation scenarios, including:

    Daily Conversations

    Topic-Specific Discussions

    General Knowledge and Advice

    Idioms and Sayings

    Contextual Features

    To enhance authenticity, the prompts include:

    Names: Male and female names specific to different regions of Turkey

    Addresses: Commonly used address formats in daily Turkish speech

    Dates & Times: References used in general scheduling and time expressions

    Organization Names: Names of businesses, institutions, and other entities

    Numbers & Currencies: Mentions of quantities, prices, and monetary values

    Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.

    Transcription

    Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.

    Content: Exact match to the spoken audio

    Format: Plain text (.TXT), named identically to the corresponding audio file

    Quality Control: All transcripts are validated by native Turkish transcribers
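
    Because each transcript is named identically to its audio file, the two can be paired by file stem. The sketch below shows one way to do that; the helper name and the example file names are hypothetical, since the listing does not show the actual naming scheme.

```python
from pathlib import Path


def pair_audio_with_transcripts(file_names):
    """Map each .wav name to the .txt name sharing its stem (None if missing)."""
    names = list(file_names)
    transcripts = {
        Path(name).stem: name
        for name in names
        if Path(name).suffix.lower() == ".txt"
    }
    return {
        name: transcripts.get(Path(name).stem)
        for name in names
        if Path(name).suffix.lower() == ".wav"
    }


# Hypothetical file names; the second clip has no transcript.
pairs = pair_audio_with_transcripts(["clip_001.wav", "clip_001.txt", "clip_002.wav"])
print(pairs)
```

    Entries mapped to None flag audio files whose transcript is missing, which is worth checking before training.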

    Metadata

    Rich metadata is included for detailed filtering and analysis:

    Speaker Metadata: Unique speaker ID, age, gender, region, and dialect

    Audio Metadata: Prompt transcript, recording setup, device specs, sample rate, bit depth, and format.

    License

    This dataset is developed and owned by FutureBeeAI and is available for commercial use, offering high-value resources for enterprises and research organizations developing Turkish speech technologies.

  10. All Turkish Words Dataset 📃🖊️

    • kaggle.com
    zip
    Updated Mar 14, 2024
    Cite
    Enis Tuna (2024). All Turkish Words Dataset 📃🖊️ [Dataset]. https://www.kaggle.com/datasets/enistuna/all-turkish-words-dataset
    Explore at:
    Available download formats: zip (42391799 bytes)
    Dataset updated
    Mar 14, 2024
    Authors
    Enis Tuna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ALL TURKISH WORDS DATASET

    This dataset contains all the Turkish words I've managed to fetch from the web. The dataset has approximately 7 million lines of Turkish word tokens, each separated by " " so it is easier to read.

    Some words are different variations of the same word e.g. "araba", "arabada", "arabadan". Feel free to use lemmatization algorithms to reduce the data size.

    I believe this dataset could be improved upon. It certainly is not finished. I will update this dataset if I can get my hands on new words in the future.

    My LinkedIn: https://www.linkedin.com/in/enistuna/
    My GitHub: https://github.com/enistuna

  11. Turkish Dataset for Identification of Author Gender

    • data.mendeley.com
    Updated Jul 6, 2020
    Cite
    Pınar Tüfekci (2020). Turkish Dataset for Identification of Author Gender [Dataset]. http://doi.org/10.17632/8f93rjhgjk.1
    Explore at:
    Dataset updated
    Jul 6, 2020
    Authors
    Pınar Tüfekci
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The IAG-TNKU Dataset is a large collection of Turkish news articles that can be used in Turkish text classification tasks such as identifying author gender in Turkish news. The text data belong to 32 female and 38 male authors and were extracted from the archive of a newspaper (www.hurriyet.com.tr) for the interval 08.11.1997 to 24.04.2019. The dataset, divided between male and female authors in a balanced way, consists of a total of 43,292 articles.

    How to use the IAG-TNKU Dataset:

    1. Unzip compressed resources.
    2. There are two folders (Females and Males).
    3. Each folder contains a set of article files in .txt format corresponding to its category.
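
    The folder layout in the steps above maps naturally onto (text, label) pairs, with the folder name as the label. The sketch below assumes UTF-8 encoded files, which the description does not confirm; the function name and the placeholder files created for the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def load_labeled_articles(root):
    """Return (text, label) pairs, using each sub-folder name as the label."""
    samples = []
    for label_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for article in sorted(label_dir.glob("*.txt")):
            # Assumption: files are UTF-8; adjust the encoding if they are not.
            samples.append((article.read_text(encoding="utf-8"), label_dir.name))
    return samples


# Demo on a throwaway directory with placeholder files (not real articles).
with tempfile.TemporaryDirectory() as tmp:
    for label in ("Females", "Males"):
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "article_1.txt").write_text("örnek metin", encoding="utf-8")
    demo_samples = load_labeled_articles(tmp)

print(demo_samples)
```
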
  12. Turkish Language Speech Datasets | NLP, Conversational AI & Machine Learning...

    • shaip.com
    Updated Feb 10, 2023
    Cite
    Shaip (2023). Turkish Language Speech Datasets | NLP, Conversational AI & Machine Learning [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Türkiye
    Description

    Enhance your Conversational AI model with our Off-the-Shelf Turkish Language Dataset (Turkish Language Speech Datasets). Shaip high-quality audio datasets are a quick

  13. Turkish Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/turkish-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Turkish General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Turkish usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Turkish conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Turkish speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700

    Turns per Chat: Up to 50 dialogue turns

    Contributors: 200 native Turkish speakers from the FutureBeeAI Crowd Community

    Format: TXT, DOCS, JSON or CSV (customizable)

    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies

    Health and wellness

    Children and parenting

    Family life and relationships

    Food and cooking

    Education and studying

    Festivals and traditions

    Environment and daily life

    Internet and tech usage

    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Turkish usage with:

    Colloquial expressions and local dialect influence

    Domain-relevant terminology

    Language-specific grammar, phrasing, and sentence flow

    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references

    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age

    Gender

    Country/Region

    Chat Domain

    Chat Topic

    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness

    Format checks for chat turns and metadata

    Linguistic verification by native speakers

    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Licensing

    This dataset is developed and owned by FutureBeeAI and is available under a commercial license. Custom licensing terms can be provided for academic, research, or enterprise clients.

  14. Turkish Offensive Language Dataset

    • megatek.ai
    bin
    Cite
    Turkish Offensive Language Dataset [Dataset]. https://megatek.ai/en/dataset/turkish-offensive-language-dataset/
    Explore at:
    Available download formats: bin
    License

    https://www.apache.org/licenses/LICENSE-2.0

    Description

    The Turkish Offensive Language Dataset is a Turkish-language dataset collected from Twitter, designed for training models in offensive language detection, hate speech detection, and text classification tasks. Created by Gülzade Evni and Zeynep Baydemir, the dataset covers multiple subcategories of harmful content including racism, profanity, insult, and sexism.

    It consists of seven files and is distributed under the Apache-2.0 license, making it openly available for research and development purposes. The dataset is intended for practitioners and researchers working on natural language processing for Turkish social media content, addressing a recognized gap in low-resource language resources for content moderation applications.

  15. turkish-academic-theses-dataset

    • huggingface.co
    Updated Nov 10, 2025
    Cite
    Umut Ertuğrul Daşgın (2025). turkish-academic-theses-dataset [Dataset]. https://huggingface.co/datasets/umutertugrul/turkish-academic-theses-dataset
    Explore at:
    Dataset updated
    Nov 10, 2025
    Authors
    Umut Ertuğrul Daşgın
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📚 Turkish Academic Theses Abstracts (TR/EN)

    A large-scale bilingual (Turkish–English) collection of abstracts from Turkish academic theses (YÖK Ulusal Tez Merkezi). This dataset focuses only on abstracts, provided in Turkish (abstract_tr) and English (abstract_en), suitable for summarization, translation, classification, and retrieval.

    Records: ~650k abstracts (TR & EN)
    Format: Parquet (.parquet), fast and analytics-friendly
    Language: 🇹🇷 Turkish + 🇬🇧 English (parallel abstracts… See the full description on the dataset page: https://huggingface.co/datasets/umutertugrul/turkish-academic-theses-dataset.

  16. Turkish Call Center Conversations

    • kaggle.com
    zip
    Updated Jul 1, 2025
    Cite
    Anıl Sevinc (2025). Turkish Call Center Conversations [Dataset]. https://www.kaggle.com/datasets/anills/turkish-call-center-conversations
    Explore at:
    Available download formats: zip (442962 bytes)
    Dataset updated
    Jul 1, 2025
    Authors
    Anıl Sevinc
    Description

    Turkish Multi-Domain Customer Service Conversations Dataset

    This dataset includes Turkish-language dialogues between customers and representatives across various service domains, such as finance, e-commerce, technical support, and general inquiries.

    Each conversation contains multiple turns and is structured with clearly labeled speaker roles (customer or representative), making it suitable for Natural Language Processing (NLP) tasks related to dialogue systems, intent detection, and chatbot development.

    🔹 Dataset Structure

    • conversation_id: Unique identifier for each conversation
    • category: Service domain label (e.g. Finance, Technical Support)
    • speaker: Role of the speaker (customer or representative)
    • text: Utterance in Turkish
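
    Since each row holds a single utterance, a common first step is to regroup rows into whole conversations by conversation_id. The sketch below uses the four columns listed above; the sample rows are invented placeholders and the helper name is hypothetical. It also assumes rows already appear in turn order, which the description does not state.

```python
from collections import defaultdict

# Invented placeholder rows that follow the four listed columns.
sample_rows = [
    {"conversation_id": "1", "category": "Finance", "speaker": "customer", "text": "Merhaba"},
    {"conversation_id": "1", "category": "Finance", "speaker": "representative", "text": "Buyurun"},
    {"conversation_id": "2", "category": "Technical Support", "speaker": "customer", "text": "İyi günler"},
]


def group_by_conversation(rows):
    """Collect (speaker, text) turns per conversation_id, preserving row order."""
    conversations = defaultdict(list)
    for row in rows:
        conversations[row["conversation_id"]].append((row["speaker"], row["text"]))
    return dict(conversations)


conversations = group_by_conversation(sample_rows)
for conversation_id, turns in conversations.items():
    print(conversation_id, turns)
```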

    🔍 Use Cases

    • Chatbot training
    • Intent classification
    • Dialogue summarization
    • Speaker role detection
    • Turkish NLP pretraining/finetuning

    🧾 License

    CC BY 4.0 — Free to use with attribution

    🙋‍♂️ Creator

    Prepared by Anıl as part of a research and educational NLP project.

  17. Turkish Call Center Data for Delivery & Logistics AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Call Center Data for Delivery & Logistics AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/delivery-call-center-conversation-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Turkish Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Turkish-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.

    Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Turkish speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed delivery resolutions, offering a rich, real-world training base for AI models.

    Participant Diversity:

    - Speakers: 60 native Turkish speakers from our verified contributor pool.

    - Regions: Multiple provinces of Turkey for accent and dialect diversity.

    - Participant Profile: Gender distribution of 60% male and 40% female, with ages ranging from 18 to 70.

    Recording Details:

    - Conversation Nature: Naturally flowing, unscripted customer-agent dialogues.

    - Call Duration: 5 to 15 minutes on average.

    - Audio Format: Stereo WAV, 16-bit depth, recorded at 8 kHz and 16 kHz.

    - Recording Environment: Captured in clean, noise-free, echo-free conditions.
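
    To check that a downloaded file matches the stated audio format, the WAV header can be read with Python's standard `wave` module. This is a minimal sketch: the helper names `describe_wav` and `matches_spec` are hypothetical, and the file path is whatever the buyer's copy uses.

```python
import wave

def describe_wav(path):
    """Read the header fields relevant to the recording details above."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }

def matches_spec(info):
    """True for stereo, 16-bit audio sampled at 8 kHz or 16 kHz."""
    return (
        info["channels"] == 2
        and info["bit_depth"] == 16
        and info["sample_rate_hz"] in (8000, 16000)
    )
```

    Calling `matches_spec(describe_wav(path))` on each recording gives a quick sanity check before the audio is fed into an ASR pipeline.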

    Topic Diversity

    This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.

    Inbound Calls:

    - Order Tracking

    - Delivery Complaints

    - Undeliverable Addresses

    - Return Process Enquiries

    - Delivery Method Selection

    - Order Modifications, and more

    Outbound Calls:

    - Delivery Confirmations

    - Subscription Offer Calls

    - Incorrect Address Follow-ups

    - Missed Delivery Notifications

    - Delivery Feedback Surveys

    - Out-of-Stock Alerts, and others

    This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.

    Transcription

    All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.

    Transcription Includes:

    - Speaker-Segmented Dialogues

    - Time-coded Segments

    - Non-speech Tags (e.g., pauses, noise)

    - High transcription accuracy with word error rate under 5% via dual-layer quality checks.

    These transcriptions support fast, reliable model development for Turkish voice AI applications in the delivery sector.
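
    The word error rate quoted above is conventionally computed as the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below shows that standard definition. The listing does not say how the provider measured its figure, so this is illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of three reference words.
print(word_error_rate("siparişim nerede acaba", "siparişim nerde acaba"))
```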

    Metadata

    Detailed metadata is included for each participant and conversation:

    Participant Metadata: ID, age, gender, region, accent, dialect.

    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical attributes.

    This metadata aids in training specialized models, filtering demographics, and running advanced analytics.

    License

    This Delivery and Logistics domain dataset is commercially licensed and ready for use in ASR, NLP, and voice automation projects in Turkish.

  18. Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Available download formats: zip
    Dataset updated
    Apr 12, 2017
    Dataset provided by
    Hacettepe University (http://hacettepe.edu.tr/)
    Authors
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We selected two of the most popular movie and hotel review websites among those ranked highly by Alexa: “beyazperde.com” for movie reviews and “otelpuan.com” for hotel reviews. Reviews of 5,660 movies were examined. All 220,000 extracted reviews had already been rated by their authors on a scale of 1 to 5 stars. Because most reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews (1 or 2 stars), so we randomly selected 26,700 of the 130,210 positive reviews (4 or 5 stars). In total, 53,400 movie reviews with an average length of 33 words were selected. The same approach was applied to hotel reviews, except that hotel reviews were rated with a number between 0 and 100 rather than stars. From 18,478 reviews of 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews (rated 0 to 40), we selected 5,800 of the 6,499 positive reviews (rated 80 to 100). The 11,600 selected hotel reviews have an average length of 74 words, more than twice that of the movie reviews.
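
    The balancing procedure described above amounts to random undersampling of the majority class. A minimal Python sketch follows, using the star thresholds from the description. The function names, the fixed seed, and the sample reviews are assumptions for illustration, not the authors' code.

```python
import random

def label_from_stars(stars):
    """Map a 1-5 star rating to a label; 3-star reviews are left out."""
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None

def build_balanced_set(reviews, seed=0):
    """Randomly undersample so both classes have the same size.

    `reviews` is a list of (text, stars) pairs; `seed` is a hypothetical
    parameter added here only to make the sketch reproducible.
    """
    positive = [text for text, stars in reviews if label_from_stars(stars) == "positive"]
    negative = [text for text, stars in reviews if label_from_stars(stars) == "negative"]
    n = min(len(positive), len(negative))
    rng = random.Random(seed)
    return rng.sample(positive, n), rng.sample(negative, n)

# Made-up reviews: four positive, two negative, one neutral.
sample = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5), ("f", 5), ("g", 4)]
pos, neg = build_balanced_set(sample)
print(len(pos), len(neg))
```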

  19. Turkish Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/turkish-open-ended-question-answer-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Turkish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Turkish language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Turkish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Turkish speakers, drawing on diverse sources such as books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels: easy, medium, and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Questions are further classified into fact-based and opinion-based categories. The dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different answer formats: single words, short phrases, single sentences, and paragraphs. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Turkish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
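
    As an illustration of how the annotation fields could be used, the sketch below filters records by arbitrary field values and counts them by complexity. The field names come from the description above, but the sample records and their values (for example "easy" or "direct") are hypothetical, since the listing does not show the exact value vocabulary.

```python
import json
from collections import Counter

# Hypothetical records using a subset of the annotation fields named above.
SAMPLE_JSON = """[
  {"id": "q1", "language": "Turkish", "domain": "history",
   "complexity": "easy", "question_type": "direct", "answer_type": "single word"},
  {"id": "q2", "language": "Turkish", "domain": "science",
   "complexity": "hard", "question_type": "multiple-choice", "answer_type": "paragraph"},
  {"id": "q3", "language": "Turkish", "domain": "science",
   "complexity": "easy", "question_type": "true/false", "answer_type": "single word"}
]"""

def filter_records(records, **criteria):
    """Keep records whose fields equal every given key=value criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

records = json.loads(SAMPLE_JSON)
complexity_counts = Counter(r["complexity"] for r in records)
easy_science = filter_records(records, complexity="easy", domain="science")
print(complexity_counts, [r["id"] for r in easy_science])
```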

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and the answers are written in grammatically correct Turkish, free of spelling and grammatical errors. No copyrighted, toxic, or harmful content was used in building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can use this fully labeled Turkish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  20. Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün...

    • zenodo.org
    bin
    Updated Aug 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    M. Ali Bayram (2024). Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti [Dataset]. http://doi.org/10.5281/zenodo.13375018
    Available download formats: bin
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    M. Ali Bayram
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Important Note: An error was found in the answer column of this dataset and has been fixed in the latest version. It is very important to use the latest version.

    Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti (Turkish MMLU: The Most Comprehensive and Original Turkish Dataset for AI and Academic Applications)

    Turkish MMLU is a comprehensive and original resource designed for training, fine-tuning, and evaluating AI models in Turkish. With 293,468 questions, it is described as the most extensive collection in its field, covering a wide range of academic and professional subjects relevant to Turkey, including major exams such as TUS (Medical Specialization Examination), KPSS (Public Personnel Selection Examination), and many others.

    Key Features:

    Completely Original Content: The dataset is entirely created from original Turkish sources, ensuring authenticity and relevance to the Turkish context. It has not been translated from other languages, which is crucial for maintaining the integrity of the language data.

    Extensive Data Volume: With nearly 300,000 questions, the dataset offers a substantial corpus for training models, enabling deep learning algorithms to gain a nuanced understanding of the Turkish language across diverse topics.

    Detailed Structure: The dataset is organized into six key columns:

    ‘bölüm’ (section): Indicates the broader exam or category.

    ‘konu’ (subject): Specifies the topic within the section.

    ‘soru’ (question): The question text itself.

    ‘cevap’ (answer): The correct answer to the question.

    ‘aciklama’ (explanation): Provides additional context or reasoning for the answer, crucial for models to understand the logic behind correct responses.

    ‘secenekler’ (options): The possible answer choices, essential for multiple-choice formats.

    Wide Range of Sections and Subjects: The dataset includes 67 sections covering over 800 unique subjects. These sections span from specialized medical fields in TUS to general knowledge and vocational exams like KPSS and Ehliyet, ensuring that the dataset reflects the complexity and breadth of Turkish academic and professional content.
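
    The six columns listed above map naturally onto a multiple-choice prompt. The following Python is a minimal sketch of that mapping. It assumes `secenekler` is a list of option strings and `cevap` is an option letter; the listing does not specify either encoding, and the sample row is invented for illustration.

```python
# Hypothetical row using the six column names from the description.
# The value formats (list of options, letter answer) are assumptions.
SAMPLE_ROW = {
    "bölüm": "KPSS",
    "konu": "Coğrafya",
    "soru": "Türkiye'nin başkenti neresidir?",
    "secenekler": ["İstanbul", "Ankara", "İzmir", "Bursa"],
    "cevap": "B",
    "aciklama": "Ankara 1923'ten beri başkenttir.",
}

def format_question(row):
    """Render section, subject, question, and lettered options as text."""
    lines = [f"[{row['bölüm']} / {row['konu']}]", row["soru"]]
    for letter, option in zip("ABCDE", row["secenekler"]):
        lines.append(f"{letter}) {option}")
    return "\n".join(lines)

prompt = format_question(SAMPLE_ROW)
print(prompt)
print("Cevap:", SAMPLE_ROW["cevap"])
```

    Keeping `cevap` and `aciklama` out of the prompt lets the same row serve for evaluation, with the answer held back, or for fine-tuning, with the answer appended as the target.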

    Dataset Source and Usage:

    Data Source: The dataset is compiled from publicly available data on the internet. While care has been taken to ensure that the data is original, there may be instances where some questions contain copyrighted material. If any copyright holders identify their material within the dataset, they are encouraged to contact the author, and the specific question will be promptly removed.

    Non-Commercial Use: This dataset is strictly intended for research and academic purposes. It cannot be used for commercial purposes under any circumstances.

    Importance for AI Models:

    1. Training: The vast number of questions, coupled with detailed explanations, makes this dataset an invaluable resource for training AI models to understand and process Turkish at a high level. The diversity of topics also ensures that the model is exposed to a wide range of vocabulary, concepts, and linguistic structures.

    2. Fine-Tuning: For researchers and developers looking to fine-tune existing models, such as GPT, BERT, or other transformer-based architectures, this dataset offers domain-specific content that can significantly enhance performance in areas like medical language processing, legal text analysis, or general-purpose Turkish language understanding.

    3. Evaluation: The Turkish MMLU dataset is ideal for evaluating the performance of AI models in Turkish. With its rich content and structured format, it allows for rigorous testing across various subjects, helping to measure how well a model can comprehend and generate accurate responses in Turkish.

    4. Real-World Application: Beyond academic research, this dataset is also highly applicable in developing AI-powered tools for exam preparation, automated tutoring systems, and educational applications that require a deep understanding of the Turkish language and its diverse domains.

    Example Sections:

    Medical Exams (TUS): Includes specialized subjects such as Farmakoloji, Patoloji, Mikrobiyoloji, and more, which are critical for training models intended for medical documentation or decision support systems.

    Public and Professional Exams (KPSS): Encompasses a wide array of subjects like Genel Kültür, Tarih, Coğrafya, and Vatandaşlık, making it valuable for general-purpose models.

    Diverse Topics: Ranging from Dini Bilgiler and Futbol to İlahiyat and İşletme, this dataset provides a robust foundation for models that need to handle a variety of real-world questions in Turkish.

    Potential Uses:

    Model Training: Utilize the dataset to train AI models from scratch, providing a foundational understanding of Turkish in both general and specialized contexts.

    Fine-Tuning Pre-Trained Models: Enhance existing models by fine-tuning them on this dataset, allowing them to achieve better performance in Turkish language tasks.

    Evaluation and Benchmarking: Test and benchmark the capabilities of AI models, ensuring they meet the necessary standards for comprehension and response generation in Turkish.

    AI-Powered Educational Tools: Develop intelligent tutoring systems or exam preparation tools that can assist students and professionals in mastering complex subjects.

    Conclusion:

    The Turkish MMLU dataset is not just a collection of questions and answers; it is a comprehensive and original tool designed to advance the development of AI in the Turkish language. Whether you are training new models, fine-tuning existing ones, or evaluating their performance, this dataset offers the depth and breadth needed to push the boundaries of natural language processing in Turkish. Its originality and extensive scope make it an indispensable resource for anyone working in this field.
