100+ datasets found
  1. turkish-sentiment-analysis-dataset

    • huggingface.co
    • kaggle.com
    Updated Jun 21, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be a completely neutral sentence, and I could not find any data for such cases. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  2. stsb-mt-turkish

    • huggingface.co
    Updated Dec 25, 2021
    Cite
    Emrecan Çelik (2021). stsb-mt-turkish [Dataset]. https://huggingface.co/datasets/emrecan/stsb-mt-turkish
    Explore at:
    Croissant
    Dataset updated
    Dec 25, 2021
    Authors
    Emrecan Çelik
    Description

    STSb Turkish

    Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the STSb English dataset. This dataset is not reviewed by expert human translators. Uploaded from this repository.

  3. Turkish Wikipedia Dataset

    • kaggle.com
    zip
    Updated Mar 19, 2024
    Cite
    Osman Kagan Kurnaz (2024). Turkish Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/osmankagankurnaz/turkish-wikipedia-dataset
    Explore at:
    Available download formats: zip (458865119 bytes)
    Dataset updated
    Mar 19, 2024
    Authors
    Osman Kagan Kurnaz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
    • The articles in this dataset are not specifically tagged for a particular task and the dataset is untagged.
    • This dataset is written in Turkish and was created by a team of volunteers using community engagement methods.
    • This dataset is an original dataset created from the Turkish Wikipedia.

    Thanks for using the Turkish Wikipedia dataset! We hope it will be useful for your language modeling and text generation tasks.

    Since a Turkish Wikipedia dataset was not available on Kaggle, I took a dataset shared on Hugging Face, merged it into 2 Parquet files, and shared it on Kaggle. The version shared on Hugging Face is available at the link below. I would also like to thank https://huggingface.co/musabg for creating this dataset.

    Original link to this dataset: https://huggingface.co/datasets/musabg/wikipedia-tr

  4. Turkish Book Data Set

    • kaggle.com
    zip
    Updated Jan 12, 2024
    Cite
    Muhammed İbrahim Top (2024). Turkish Book Data Set [Dataset]. https://www.kaggle.com/datasets/muhammedbrahimtop/turkish-book-data-set
    Explore at:
    Available download formats: zip (17318668 bytes)
    Dataset updated
    Jan 12, 2024
    Authors
    Muhammed İbrahim Top
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Turkish Book Data Set

    This dataset is a comprehensive compilation of Turkish books obtained through web scraping from the internet. Each record in the dataset contains essential information such as the book's title, author, publisher, publication year, page count, category, description, and image URL.

    This rich dataset can be utilized in various applications, particularly in the analysis through methods such as classification, content-based recommendation algorithms, and natural language processing (NLP). For researchers, students, and data scientists, this dataset serves as a valuable resource for exploring Turkish literature, generating book recommendations, or developing machine learning models.

  5. Turkish Polite Dataset

    • kaggle.com
    zip
    Updated Apr 17, 2025
    Cite
    Yunus Emre Akca (2025). Turkish Polite Dataset [Dataset]. https://www.kaggle.com/datasets/yunusemreakca/turkish-polite-dataset
    Explore at:
    Available download formats: zip (98871 bytes)
    Dataset updated
    Apr 17, 2025
    Authors
    Yunus Emre Akca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains chat-based dialogs in Turkish, written in a natural, polite, and supportive style. The interactions between the user and the chatbot aim to provide information and support on different topics. The dataset is suitable for Turkish natural language processing (NLP) projects and can be used in areas such as chatbots, language modeling, and text analysis.

  6. Genius-Turkish-Dataset

    • kaggle.com
    zip
    Updated Nov 2, 2025
    + more versions
    Cite
    Mustafa Kemal Çıngıl (2025). Genius-Turkish-Dataset [Dataset]. https://www.kaggle.com/datasets/mustafakemal0146/genius-turkish-dataset
    Explore at:
    Available download formats: zip (22818735 bytes)
    Dataset updated
    Nov 2, 2025
    Authors
    Mustafa Kemal Çıngıl
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    TURKISH SONG LYRICS FROM GENIUS DATASET

    DATASET DESCRIPTION

    This dataset contains a comprehensive collection of 44,692 Turkish song lyrics, extracted from the larger "Genius Song Lyrics with Language Information" dataset available on Kaggle. The original 9.07 GB dataset was filtered to include only songs identified with the language code 'tr' (Turkish), making it a clean and focused resource for Turkish Natural Language Processing (NLP) tasks.

    HOW TO USE

    You can easily load this dataset using the Hugging Face datasets library.

    Example Python code:

    ```python
    from datasets import load_dataset

    # Load the dataset from the Hugging Face Hub
    dataset = load_dataset("mustafakemal0146/Genius-Turkish-Dataset")

    # Example: access the lyrics of the first song in the training split
    print(dataset['train'][0]['lyrics'])
    ```

    DATASET STRUCTURE

    The dataset consists of a single CSV file, loaded as the train split, with the following columns:

    • title (string): The title of the song.
    • tag (string): The genre tag associated with the song (e.g., 'rap', 'pop').
    • artist (string): The name of the primary artist.
    • year (int): The release year of the song.
    • views (int): The number of views on Genius.com.
    • features (string): A string representation of featuring artists.
    • lyrics (string): The full lyrics of the song.
    • id (int): A unique identifier from the original dataset.
    • language_cld3 (string): Language code detected by the CLD3 model (all 'tr').
    • language_ft (string): Language code detected by the FastText model (all 'tr').
    • language (string): Final aggregated language code (all 'tr').

    DATA SOURCE AND CURATION

    This dataset is a curated subset of the "Genius Song Lyrics with Language Information" (Link: https://www.kaggle.com/datasets/pavanelisetty/genius-song-lyrics-with-language-information) dataset on Kaggle, originally collected from Genius.com. The filtering process involved reading the main song_lyrics.csv file and selecting all rows where the language column was equal to 'tr'.
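
    The filtering step described above can be sketched with Python's standard csv module. This is only an illustration, not the author's actual script: the in-memory sample rows and the helper name filter_by_language are assumptions, and only the language column and the 'tr' code come from the description.

```python
import csv
import io


def filter_by_language(rows, code="tr", column="language"):
    """Keep only the rows whose language column equals the given code."""
    return [row for row in rows if row.get(column) == code]


# Hypothetical stand-in for song_lyrics.csv (the original dataset is
# reported as 9.07 GB, so a real run would stream it from disk instead).
sample_csv = io.StringIO(
    "title,artist,language\n"
    "Song A,Artist 1,tr\n"
    "Song B,Artist 2,en\n"
    "Song C,Artist 3,tr\n"
)

turkish_rows = filter_by_language(csv.DictReader(sample_csv))
print([row["title"] for row in turkish_rows])
```

    For a file of the original size, passing the csv.DictReader directly (as here) avoids loading the whole file before filtering, although the filtered result is still held in memory.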

    LICENSE

    The original dataset on Kaggle does not specify a license. This curated version is shared under the "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)" (Link: https://creativecommons.org/licenses/by-nc-sa/4.0/) license, assuming it will be used for non-commercial research and educational purposes. Please refer to the original data source for any commercial use inquiries.

    Created by MustafaKemal0146 (Hugging Face Profile: https://huggingface.co/MustafaKemal0146)

  7. Turkish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:

    - Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.

    - Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.

    - Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.

    Recording Details:

    - Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.

    - Duration: Each conversation ranges from 15 to 60 minutes.

    - Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.

    - Environment: Quiet, echo-free settings with no background noise.
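
    The stated audio format (stereo, 16-bit, 16 kHz) can be checked with Python's standard wave module. The sketch below is an assumption-laden illustration: it builds a silent in-memory WAV as a stand-in for a real recording, and describe_wav is a hypothetical helper name.

```python
import io
import wave


def describe_wav(fileobj):
    """Return (channels, bit depth, sample rate in Hz) of a WAV file."""
    with wave.open(fileobj, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


# Build one second of silent stereo, 16-bit, 16 kHz audio in memory
# as a hypothetical stand-in for a recording from the dataset.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
    wav.setframerate(16000)
    wav.writeframes(b"\x00" * (16000 * 2 * 2))
buffer.seek(0)

wav_info = describe_wav(buffer)
print(wav_info)
```

    With real files, describe_wav also accepts a file path, so the same check can be run over a directory of recordings before training.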

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:

    - Family & Relationships

    - Food & Recipes

    - Education & Career

    - Healthcare Discussions

    - Social Issues

    - Technology & Gadgets

    - Travel & Local Culture

    - Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:

    - Speaker-segmented dialogues

    - Time-coded utterances

    - Non-speech elements (pauses, laughter, etc.)

    - High transcription accuracy, achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
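
    Word error rate (WER), mentioned in the highlights above, is conventionally the word-level edit distance between a reference and a hypothesis divided by the number of reference words. The sketch below shows that standard definition; how the vendor measures its own figure is not stated, and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed ref prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical sentences, not taken from the dataset: one word is dropped.
wer = word_error_rate("bugün hava çok güzel", "bugün hava güzel")
print(f"WER = {wer:.2f}")
```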

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.

    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    License

    This Turkish General Conversation Dataset is created by FutureBeeAI and is available for commercial licensing.

  8. Turkish Call Center Conversations

    • kaggle.com
    zip
    Updated Jul 1, 2025
    Cite
    Anıl Sevinc (2025). Turkish Call Center Conversations [Dataset]. https://www.kaggle.com/datasets/anills/turkish-call-center-conversations
    Explore at:
    Available download formats: zip (442962 bytes)
    Dataset updated
    Jul 1, 2025
    Authors
    Anıl Sevinc
    Description

    Turkish Multi-Domain Customer Service Conversations Dataset

    This dataset includes Turkish-language dialogues between customers and representatives across various service domains, such as finance, e-commerce, technical support, and general inquiries.

    Each conversation contains multiple turns and is structured with clearly labeled speaker roles (customer or representative), making it suitable for Natural Language Processing (NLP) tasks related to dialogue systems, intent detection, and chatbot development.

    🔹 Dataset Structure

    • conversation_id: Unique identifier for each conversation
    • category: Service domain label (e.g. Finance, Technical Support)
    • speaker: Role of the speaker (customer or representative)
    • text: Utterance in Turkish
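
    Because each row holds a single utterance, a common first step is to regroup rows into whole conversations. The sketch below assumes the four columns listed above; the sample rows and the helper name group_conversations are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows using the four documented columns.
rows = [
    {"conversation_id": "1", "category": "Finance",
     "speaker": "customer", "text": "Merhaba, kartım bloke oldu."},
    {"conversation_id": "1", "category": "Finance",
     "speaker": "representative", "text": "Hemen kontrol ediyorum."},
    {"conversation_id": "2", "category": "Technical Support",
     "speaker": "customer", "text": "İnternetim çalışmıyor."},
]


def group_conversations(rows):
    """Collect (speaker, text) turns per conversation, preserving row order."""
    conversations = defaultdict(list)
    for row in rows:
        turn = (row["speaker"], row["text"])
        conversations[row["conversation_id"]].append(turn)
    return dict(conversations)


conversations = group_conversations(rows)
for conversation_id, turns in conversations.items():
    print(conversation_id, len(turns))
```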

    🔍 Use Cases

    • Chatbot training
    • Intent classification
    • Dialogue summarization
    • Speaker role detection
    • Turkish NLP pretraining/finetuning

    🧾 License

    CC BY 4.0 — Free to use with attribution

    🙋‍♂️ Creator

    Prepared by Anıl as part of a research and educational NLP project.

  9. turkish-offensive-language-detection

    • huggingface.co
    Updated Sep 15, 2022
    + more versions
    Cite
    Toygar Tanyel (2022). turkish-offensive-language-detection [Dataset]. https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection
    Explore at:
    Croissant
    Dataset updated
    Sep 15, 2022
    Authors
    Toygar Tanyel
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Summary

    This dataset is an enhanced version of existing offensive language studies. Existing studies are highly imbalanced, and solving this problem is too costly. To address this, we proposed a contextual data mining method for dataset augmentation. Our method basically saves us from retrieving random tweets and labeling them individually. We can directly access almost exact hate-related tweets and label them directly without any further human interaction in order to solve imbalanced… See the full description on the dataset page: https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection.

  10. instruction-turkish

    • huggingface.co
    Updated Apr 7, 2026
    Cite
    Ahmet (2026). instruction-turkish [Dataset]. https://huggingface.co/datasets/atasoglu/instruction-turkish
    Explore at:
    Croissant
    Dataset updated
    Apr 7, 2026
    Authors
    Ahmet
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a machine-translated Turkish version of HuggingFaceH4/instruction-dataset. Translated with googletrans==3.1.0a0.

  11. Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 12, 2017
    Dataset provided by
    Hacettepe University (http://hacettepe.edu.tr/)
    Authors
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We selected two of the most popular movie and hotel recommendation websites among those ranked highly by Alexa: “beyazperde.com” for movie reviews and “otelpuan.com” for hotel reviews. The reviews of 5,660 movies were investigated. All 220,000 extracted reviews had already been rated by their authors on a scale of 1 to 5 stars. As most of the reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews rated 1 or 2 stars, so we randomly selected 26,700 of the 130,210 positive reviews rated 4 or 5 stars. Overall, 53,400 movie reviews with an average length of 33 words were selected. The same approach was used for hotel reviews, except that hotel reviews were rated from 0 to 100 instead of with stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews rated 0 to 40, we selected 5,800 of the 6,499 positive reviews rated 80 to 100. The average length of the 11,600 selected positive and negative hotel reviews was 74 words, more than twice that of the movie reviews.
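
    The balancing step described above amounts to random downsampling of the larger class. The following sketch reproduces the movie-review counts with stand-in integer IDs; the helper name, the fixed seed, and the use of random.sample are assumptions, since the authors do not describe their sampling code.

```python
import random


def balance_by_downsampling(positives, negatives, seed=0):
    """Randomly downsample the larger class to the size of the smaller one."""
    rng = random.Random(seed)
    size = min(len(positives), len(negatives))
    return rng.sample(positives, size), rng.sample(negatives, size)


# Stand-in review IDs using the movie-review counts quoted above.
positive_ids = list(range(130_210))
negative_ids = list(range(26_700))

pos_sample, neg_sample = balance_by_downsampling(positive_ids, negative_ids)
print(len(pos_sample) + len(neg_sample))
```

    The total of 26,700 + 26,700 matches the 53,400 selected movie reviews; the same function applied with a target of 5,800 per class would give the 11,600 hotel reviews.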

  12. Turkish Offensive Language Dataset

    • megatek.ai
    bin
    Cite
    (2026). Turkish Offensive Language Dataset [Dataset]. https://megatek.ai/en/dataset/turkish-offensive-language-dataset/
    Explore at:
    Available download formats: bin
    License

    https://www.apache.org/licenses/LICENSE-2.0

    Description

    The Turkish Offensive Language Dataset is a Turkish-language dataset collected from Twitter, designed for training models in offensive language detection, hate speech detection, and text classification tasks. Created by Gülzade Evni and Zeynep Baydemir, the dataset covers multiple subcategories of harmful content including racism, profanity, insult, and sexism.

    It consists of seven files and is distributed under the Apache-2.0 license, making it openly available for research and development purposes. The dataset is intended for practitioners and researchers working on natural language processing for Turkish social media content, addressing a recognized gap in low-resource language resources for content moderation applications.

  13. Turkish General Domain Scripted Monologue Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Domain Scripted Monologue Speech Data [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/general-scripted-speech-monologues-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Turkish Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Turkish language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Turkish speech data.

    Speech Data

    This dataset features over 6,000 high-quality scripted monologue recordings in Turkish. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.

    Participant Diversity

    - Speakers: 60 native Turkish speakers

    - Regions: Broad regional coverage ensures diverse accents and dialects

    - Demographics: Participants aged 18 to 70, with a 60:40 male-to-female ratio

    Recording Specifications

    - Recording Type: Scripted monologues and prompt-based recordings

    - Audio Duration: 5 to 30 seconds per file

    - Format: WAV, mono channel, 16-bit, 8 kHz & 16 kHz sample rates

    - Environment: Clean, noise-free conditions to ensure clarity and usability

    Topic Coverage

    The dataset covers a wide variety of general conversation scenarios, including:

    Daily Conversations

    Topic-Specific Discussions

    General Knowledge and Advice

    Idioms and Sayings

    Contextual Features

    To enhance authenticity, the prompts include:

    Names: Male and female names specific to different regions of Turkey

    Addresses: Commonly used address formats in daily Turkish speech

    Dates & Times: References used in general scheduling and time expressions

    Organization Names: Names of businesses, institutions, and other entities

    Numbers & Currencies: Mentions of quantities, prices, and monetary values

    Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.

    Transcription

    Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.

    Content: Exact match to the spoken audio

    Format: Plain text (.TXT), named identically to the corresponding audio file

    Quality Control: All transcripts are validated by native Turkish transcribers
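
    Since each transcript is described as a .TXT file named identically to its audio file, recordings and transcripts can be paired by base name. The sketch below assumes that convention; the file names and the helper pair_audio_with_transcripts are hypothetical.

```python
from pathlib import PurePath


def pair_audio_with_transcripts(filenames):
    """Map each .wav file to the .txt file that shares its base name."""
    names = [PurePath(name) for name in filenames]
    transcripts = {p.stem: p.name for p in names if p.suffix.lower() == ".txt"}
    return {
        p.name: transcripts[p.stem]
        for p in names
        if p.suffix.lower() == ".wav" and p.stem in transcripts
    }


# Hypothetical file names; the real naming scheme is not documented here.
files = ["spk01_0001.wav", "spk01_0001.txt", "spk01_0002.wav"]
pairs = pair_audio_with_transcripts(files)
print(pairs)
```

    Audio files without a matching transcript (spk01_0002.wav here) are left out of the mapping, which makes missing pairs easy to spot by comparing counts.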

    Metadata

    Rich metadata is included for detailed filtering and analysis:

    Speaker Metadata: Unique speaker ID, age, gender, region, and dialect

    Audio Metadata: Prompt transcript, recording setup, device specs, sample rate, bit depth, and format.

    License

    This dataset is developed and owned by FutureBeeAI and is available for commercial use, offering high-value resources for enterprises and research organizations developing Turkish speech technologies.

  14. Turkish Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/turkish-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Turkish General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Turkish usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Turkish conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Turkish speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700

    Turns per Chat: Up to 50 dialogue turns

    Contributors: 200 native Turkish speakers from the FutureBeeAI Crowd Community

    Format: TXT, DOCS, JSON or CSV (customizable)

    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies

    Health and wellness

    Children and parenting

    Family life and relationships

    Food and cooking

    Education and studying

    Festivals and traditions

    Environment and daily life

    Internet and tech usage

    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Turkish usage with:

    Colloquial expressions and local dialect influence

    Domain-relevant terminology

    Language-specific grammar, phrasing, and sentence flow

    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references

    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age

    Gender

    Country/Region

    Chat Domain

    Chat Topic

    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness

    Format checks for chat turns and metadata

    Linguistic verification by native speakers

    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Licensing

    This dataset is developed and owned by FutureBeeAI and is available under a commercial license. Custom licensing terms can be provided for academic, research, or enterprise clients.

  15. Turkish Question answering sentiment dataset (SCD)

    • ieee-dataport.org
    Updated May 17, 2023
    Cite
    kadir tohma (2023). Turkish Question answering sentiment dataset (SCD) [Dataset]. https://ieee-dataport.org/documents/turkish-question-answering-sentiment-dataset-scd
    Explore at:
    Dataset updated
    May 17, 2023
    Authors
    kadir tohma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    containing a total of 13

  16. Tr Sign Language Dataset

    • kaggle.com
    zip
    Updated May 25, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Berkay Kocaoglu (2020). Tr Sign Language Dataset [Dataset]. https://www.kaggle.com/datasets/berkaykocaoglu/tr-sign-language
    Explore at:
    Available download formats: zip (1141490648 bytes)
    Dataset updated
    May 25, 2020
    Authors
    Berkay Kocaoglu
    Description

    Dataset

    This dataset was created by Berkay Kocaoglu

    Contents

  17. Turkish_Speech_Corpus

    • huggingface.co
    • kaggle.com
    Updated Dec 17, 2025
    Cite
    Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (2025). Turkish_Speech_Corpus [Dataset]. https://huggingface.co/datasets/issai/Turkish_Speech_Corpus
    Explore at:
    Croissant
    Dataset updated
    Dec 17, 2025
    Dataset authored and provided by
    Institute of Smart Systems and Artificial Intelligence, Nazarbayev University
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Turkish Speech Corpus (TSC)

    This repository presents an open-source Turkish Speech Corpus, introduced in the paper "Multilingual Speech Recognition for Turkic Languages". The corpus contains 218.2 hours of transcribed speech with 186,171 utterances and was the largest publicly available Turkish dataset of its kind at the time.
    GitHub Repository: https://github.com/IS2AI/TurkicASR
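
    As a quick sanity check on the corpus statistics quoted above, the average utterance length follows directly from the total duration and the utterance count. This is simple arithmetic on the stated figures, not a number reported by the authors.

```python
# Figures quoted in the corpus description.
total_hours = 218.2
utterances = 186_171

# Convert hours to seconds, then divide by the number of utterances.
average_seconds = total_hours * 3600 / utterances
print(f"{average_seconds:.2f} s per utterance")
```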

      Citation
    

    @Article{info14020074… See the full description on the dataset page: https://huggingface.co/datasets/issai/Turkish_Speech_Corpus.

  18. Turkish Scene Text Recognition Dataset

    • gts.ai
    json
    Updated Dec 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED (2024). Turkish Scene Text Recognition Dataset [Dataset]. https://gts.ai/dataset-download/turkish-scene-text-recognition-dataset/
    Explore at:
    Available download formats: json
    Dataset updated
    Dec 13, 2024
    Dataset authored and provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A rich collection of Turkish text images including street signs, advertisements, and real-world scenes, with diverse fonts, sizes, and orientations, optimized for OCR and computer vision research.

  19. 1620 Hours - Turkish(Turkey) Real-world Casual Conversation and Monologue...

    • nexdata.ai
    Updated Feb 9, 2024
    Cite
    Nexdata (2024). 1620 Hours - Turkish(Turkey) Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1324
    Explore at:
    Dataset updated
    Feb 9, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    This Turkish (Turkey) real-world casual conversation and monologue speech dataset covers self-media, conversation, live, and other generic domains, and mirrors real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, which is intended to enhance model performance on real and complex tasks. Its quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.

  20. Turkish-Alpaca

    • huggingface.co
    Updated Aug 15, 2023
    Cite
    TÜBİTAK Science High School AI Club (2023). Turkish-Alpaca [Dataset]. https://huggingface.co/datasets/TFLai/Turkish-Alpaca
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 15, 2023
    Dataset authored and provided by
    TÜBİTAK Science High School AI Club
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Stanford alpaca turkish: Stanford Alpaca

Cite
Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset

turkish-sentiment-analysis-dataset

Turkish Sentiment Dataset

winvoker/turkish-sentiment-analysis-dataset

Explore at:
31 scholarly articles cite this dataset (View in Google Scholar)
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 21, 2022
Authors
Batuhan
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically

Description

Dataset

This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels: positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore, I created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
