The dataset consists of 5 emotion labels: anger, happy, distinguish, surprise, and fear. There are 800 tweets per label, so the dataset contains 4,000 tweets in total.
You can use the dataset in many areas, such as sentiment analysis, emotion analysis, and topic modeling.
Note: Hashtags and usernames have been removed from the dataset. The dataset has been used in several studies; if you use it, please cite the following articles:
- Güven, Z. A., Diri, B., & Çakaloğlu, T. (2020). Comparison of n-stage Latent Dirichlet Allocation versus other topic modeling methods for emotion analysis. Journal of the Faculty of Engineering and Architecture of Gazi University. https://doi.org/10.17341/gazimmfd.556104
- Güven, Z. A., Diri, B., & Çakaloğlu, T. (2019). Emotion Detection with n-stage Latent Dirichlet Allocation for Turkish Tweets. Academic Platform Journal of Engineering and Science. https://doi.org/10.21541/apjes.459447
- Güven, Z. A., Diri, B., & Çakaloğlu, T. (2019). Comparison Method for Emotion Detection of Twitter Users. Proceedings - 2019 Innovations in Intelligent Systems and Applications Conference, ASYU 2019. https://doi.org/10.1109/ASYU48272.2019.8946435
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data split:
Labels:
I do not **own** this dataset. I changed the original format for easier use and converted it into a **csv** and a .spacy file.
You can reach the original version of the dataset from the **link** below:
https://github.com/turkish-nlp-suite/Turkish-Wiki-NER-Dataset
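If you use the .spacy file, it can be loaded with spaCy's DocBin. Below is a minimal round-trip sketch; the dataset's actual file name is an assumption, so the example serializes a tiny DocBin of its own instead:

```python
import spacy
from spacy.tokens import DocBin

# A blank Turkish pipeline supplies the shared vocab for deserialization.
nlp = spacy.blank("tr")

# Reading the dataset's own file would look like (file name is an assumption):
#   doc_bin = DocBin().from_disk("turkish_wiki_ner.spacy")
# For illustration, round-trip a tiny DocBin instead:
doc_bin = DocBin(docs=[nlp("Ankara Türkiye'nin başkentidir.")])
doc_bin.to_disk("sample.spacy")

docs = list(DocBin().from_disk("sample.spacy").get_docs(nlp.vocab))
for doc in docs:
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])
```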
Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
Dataset Summary
This dataset is an enhanced version of existing offensive-language datasets. Existing datasets are highly imbalanced, and fixing this by hand is costly. To address it, we propose a contextual data-mining method for dataset augmentation. Rather than retrieving random tweets and labeling them individually, the method retrieves tweets that are almost certainly hate-related, so they can be labeled directly without further human interaction… See the full description on the dataset page: https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Thanks for using the Turkish Wikipedia dataset! We hope it will be useful for your language modeling and text generation tasks.
Since the Turkish Wikipedia dataset was not on Kaggle, I took a dataset shared on Hugging Face, merged it into 2 parquet files, and shared it on Kaggle. You can reach the version shared on Hugging Face from the link below. I would also like to thank https://huggingface.co/musabg for creating this dataset.
Original link to this dataset: https://huggingface.co/datasets/musabg/wikipedia-tr
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Stanford Alpaca Turkish: a Turkish version of Stanford Alpaca.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
This dataset is a machine-translated version of HuggingFaceH4/instruction-dataset into Turkish, translated with googletrans==3.1.0a0.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Turkish (Turkey) Scripted Monologue Dataset: high-quality Turkish scripted monologue data for AI & speech models.
- Title: Turkish (Turkey) Language Dataset
- Dataset Type: Scripted Monologue
- Country: Turkey
- Description: The Scripted Monologue dataset consists…
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Turkish Wake Word Dataset: high-quality Turkish wake word data for AI & speech models.
- Title: Wake Word Turkish Language Dataset
- Dataset Type: Wake Word
- Description: Wake Words / Voice Commands / Trigger Words…
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Turkish Wake Word Dataset: high-quality Turkish wake word data for AI & speech models.
- Title: Wake Word Turkish Language Dataset
- Dataset Type: Wake Word
- Description: Wake Words / Voice Commands / Trigger Words…
License: https://choosealicense.com/licenses/other/
Compilation of Bilkent Turkish Writings Dataset
Dataset Description
This is a comprehensive compilation of Turkish creative writings from Bilkent University's Turkish 101 and Turkish 102 courses (2014–2025). The dataset contains 9,119 student writings created by students and instructors, focusing on creativity, content, composition, grammar, spelling, and punctuation development. Note: This dataset is a compilation and digitization of publicly available writings… See the full description on the dataset page: https://huggingface.co/datasets/selimfirat/bilkent-turkish-writings-dataset.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
Participant Diversity:
- Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.
- Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.
- Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
Recording Details:
- Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
- Duration: Each conversation ranges from 15 to 60 minutes.
- Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
- Environment: Quiet, echo-free settings with no background noise.
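Given the stated format (stereo, 16-bit, 16 kHz), the raw data rate of the recordings can be worked out directly; a quick sanity check:

```python
# Raw data rate implied by the recording specs above:
# 16,000 samples/s x 2 bytes/sample (16-bit) x 2 channels (stereo).
sample_rate = 16_000
bytes_per_sample = 2   # 16-bit depth
channels = 2           # stereo WAV

bytes_per_second = sample_rate * bytes_per_sample * channels
mb_per_hour = bytes_per_second * 3600 / 1_000_000
print(bytes_per_second, round(mb_per_hour, 1))  # 64000 bytes/s, ~230.4 MB per hour
```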
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Sample Topics Include:
- Family & Relationships
- Food & Recipes
- Education & Career
- Healthcare Discussions
- Social Issues
- Technology & Gadgets
- Travel & Local Culture
- Shopping & Marketplace Experiences, and many more.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
Transcription Highlights:
- Speaker-segmented dialogues
- Time-coded utterances
- Non-speech elements (pauses, laughter, etc.)
- High transcription accuracy, achieved through a double QA pass; average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
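As a sketch of how such a transcription could be consumed in a pipeline, the snippet below reconstructs speaker-segmented, time-coded utterances from a JSON record. The field names here are assumptions for illustration, not the actual FutureBeeAI schema:

```python
import json

# A hypothetical transcription record; the real schema may differ.
raw = json.dumps({
    "audio_file": "conversation_001.wav",
    "segments": [
        {"speaker": "SPK1", "start": 0.00, "end": 3.41, "text": "Merhaba, nasılsın?"},
        {"speaker": "SPK2", "start": 3.55, "end": 6.10, "text": "İyiyim, teşekkürler."},
    ],
})

transcript = json.loads(raw)  # stand-in for json.load(open(path))

# Print each utterance with its speaker tag and time codes.
for seg in transcript["segments"]:
    print(f'[{seg["start"]:.2f}-{seg["end"]:.2f}] {seg["speaker"]}: {seg["text"]}')
```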
The dataset comes with granular metadata for both speakers and recordings:
Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
Recording Metadata: Topic, duration, audio format, device type, and sample rate.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This Turkish General Conversation Dataset is created by FutureBeeAI and is available for commercial licensing.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Turkish General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Turkish usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Turkish conversations covering a broad spectrum of everyday topics.
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Turkish speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Words per Chat: 300–700
Turns per Chat: Up to 50 dialogue turns
Contributors: 200 native Turkish speakers from the FutureBeeAI Crowd Community
Format: TXT, DOCS, JSON or CSV (customizable)
Structure: Each record contains the full chat, topic tag, and metadata block
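A minimal sketch of reading one such record, assuming a hypothetical JSON layout with a metadata block and a list of turns (the vendor's actual field names may differ):

```python
import json

# Hypothetical chat record; field names are assumptions, not the vendor schema.
raw = json.dumps({
    "metadata": {"age": 29, "gender": "female", "region": "Izmir",
                 "domain": "general", "topic": "Food and cooking", "dialect": "Aegean"},
    "turns": [
        {"speaker": "A", "text": "Akşam yemeği için ne pişiriyorsun?"},
        {"speaker": "B", "text": "Menemen yapacağım, domatesler çok taze."},
    ],
})

chat = json.loads(raw)

# Derive simple statistics of the kind listed above (turns, words per chat).
word_count = sum(len(turn["text"].split()) for turn in chat["turns"])
print(chat["metadata"]["topic"], len(chat["turns"]), word_count)
```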
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
Music, books, and movies
Health and wellness
Children and parenting
Family life and relationships
Food and cooking
Education and studying
Festivals and traditions
Environment and daily life
Internet and tech usage
Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level Turkish usage with:
Colloquial expressions and local dialect influence
Domain-relevant terminology
Language-specific grammar, phrasing, and sentence flow
Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
Representation of different writing styles and input quirks to ensure training data realism
Every chat instance is accompanied by structured metadata, which includes:
Participant Age
Gender
Country/Region
Chat Domain
Chat Topic
Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
Manual review for content completeness
Format checks for chat turns and metadata
Linguistic verification by native speakers
Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is developed and owned by FutureBeeAI and is available under a commercial license. Custom licensing terms can be provided for academic, research, or enterprise clients.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Turkish Wake Word Dataset: high-quality Turkish wake word data for AI & speech models.
- Title: Wake Word Turkish Language Dataset
- Dataset Type: Wake Word
- Description: Wake Words / Voice Commands / Trigger Words…
MIT License: https://opensource.org/licenses/MIT
Turkish Speech Corpus (TSC)
This repository presents an open-source Turkish Speech Corpus, introduced in "Multilingual Speech Recognition for Turkic Languages". The corpus contains 218.2 hours of transcribed speech across 186,171 utterances and was, at the time of publication, the largest publicly available Turkish dataset of its kind.
Paper: Multilingual Speech Recognition for Turkic Languages.
GitHub Repository: https://github.com/IS2AI/TurkicASR
Citation
@Article{info14020074… See the full description on the dataset page: https://huggingface.co/datasets/issai/Turkish_Speech_Corpus.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
We selected the two most popular movie and hotel recommendation websites among those with a high ranking on Alexa: "beyazperde.com" for movie reviews and "otelpuan.com" for hotel reviews.

We examined the reviews of 5,660 movies. All 220,000 extracted reviews had already been rated by their authors on a 1-to-5-star scale. Because most of the reviews were positive, we selected as many positive reviews as negative ones to keep the set balanced: there were 26,700 negative reviews (rated 1 or 2 stars), so we randomly selected 26,700 of the 130,210 positive reviews (rated 4 or 5 stars). Overall, 53,400 movie reviews with an average length of 33 words were selected.

We treated the hotel reviews similarly, except that they had been rated on a 0–100 scale instead of stars. From 18,478 reviews extracted from 550 hotels, we selected a balanced set of positive and negative reviews: there were only 5,802 negative reviews (rated 0–40), so we selected 5,800 of the 6,499 positive reviews (rated 80–100). The average length of the 11,600 selected hotel reviews was 74 words, more than twice that of the movie reviews.
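The balancing procedure described above (keep the minority class and randomly down-sample the majority class to the same size) can be sketched as follows, with synthetic star ratings standing in for the scraped reviews:

```python
import random

random.seed(42)  # reproducible illustration

# Synthetic reviews: stars 1-2 are negative, 4-5 are positive.
reviews = [{"id": i, "stars": random.choice([1, 2, 4, 5])} for i in range(1000)]
negatives = [r for r in reviews if r["stars"] <= 2]
positives = [r for r in reviews if r["stars"] >= 4]

# Down-sample both classes to the size of the smaller one.
n = min(len(negatives), len(positives))
balanced = random.sample(negatives, n) + random.sample(positives, n)

print(len(negatives), len(positives), len(balanced))
```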
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
## Overview
Turkish Tiel is a dataset for object detection tasks - it contains Money annotations for 604 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
containing a total of 13
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
NLI-TR is a revolutionary set of two datasets that provide an unparalleled opportunity for the natural language processing and machine learning community to conduct inference research in the Turkish Language. The datasets - SNLI-TR and MNLI-TR - contain carefully curated natural language inference data that have been translated into Turkish. With NLI-TR, researchers can explore the exciting prospects of developing automated models tailored to make inferences on texts produced in this vibrant language. Moreover, they can also investigate how models trained on data from one language fare when applied in another, a valuable insight into cross-lingual generalization capabilities. NLI-TR offers both seasoned and budding researchers an unprecedented platform to further our understanding of natural language inferencing capability
How To Use The NLI-TR Dataset to Unlock Turkish NLI Research
Welcome to the exciting world of natural language inference (NLI) research! If you’re looking for a great dataset to use for your research in this field, the NLI-TR dataset is a perfect starting point. This guide will provide an overview of how you can use the data from this dataset to uncover new insights about NLI tasks in Turkish.
The NLI-TR dataset contains two large-scale datasets intended for natural language inference tasks: SNLI-TR and MNLI-TR. Both offer researchers an opportunity to explore NLI research in Turkish, with examples ranging from sentence-paraphrasing and classification tasks to question-answering scenarios using various NLP techniques.
Using the Data:
The data provided in this dataset includes both training and validation sets, making it easy for researchers who are just getting started. The snli_tr_train.csv file is used as input for training your models, while snli_tr_validation can be used for testing or validating model accuracy on unseen data. Additionally, the multinli_tr_validation_{matched / mismatched}.csv files offer further validation of how well your trained models perform on more complex scenarios such as sentence paraphrasing or question answering.
Each record includes four columns: premise, hypothesis, label, (and domain). The premise column specifies the information provided before asking a question or making an inference; think of it as the context that explains why one statement implies another. The hypothesis column provides what lies at the heart of the inference: the conclusion reached from the facts given before it. Finally, the label column denotes whether the two sentences entail each other (ENTAILMENT), contradict each other (CONTRADICTION), or are unrelated (NEUTRAL). Some authors also assign a domain label when necessary; this mostly applies when inferring between sentences across different semantic domains such as weather, sports, or finance.
- Developing an NLI-based Turkish language question answering system.
- Training a sentiment analysis algorithm to identify sentiment in text written in Turkish.
- Building a machine-learning chatbot that uses NLI to understand conversational context and respond accordingly to users conversing in Turkish.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: snli_tr_train.csv | Column name | Description | |:---------------|:------------------------------------------------------------------------------------------------------------------------------------------...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Önemli Not: Bu veri setinin cevap sütununda bir hata tespit edildi ve bu hata yeni sürümünde düzeltildi. Bu nedenle, son sürümünün kullanılması büyük önem taşımaktadır.
Important Note: An error was found in the answer column of this dataset and has been fixed in the latest version. It is therefore very important to use the latest version.
Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti (Turkish MMLU: The Most Comprehensive and Original Turkish Dataset for AI and Academic Applications)
The Turkish MMLU: The Most Comprehensive and Original Turkish Dataset for AI and Academic Applications dataset is a comprehensive and original resource, specifically designed for training, fine-tuning, and evaluating AI models in Turkish. With 293,468 questions, this dataset stands as the most extensive collection in its field, covering a wide range of academic and professional subjects relevant to Turkey, including major exams like TUS (Medical Specialization Examination), KPSS (Public Personnel Selection Examination), and many others.
Key Features:
• Completely Original Content: The dataset is entirely created from original Turkish sources, ensuring authenticity and relevance to the Turkish context. It has not been translated from other languages, which is crucial for maintaining the integrity of the language data.
• Extensive Data Volume: With nearly 300,000 questions, the dataset offers a substantial corpus for training models, enabling deep learning algorithms to gain a nuanced understanding of the Turkish language across diverse topics.
• Detailed Structure: The dataset is organized into six key columns:
• ‘bölüm’ (section): Indicates the broader exam or category.
• ‘konu’ (subject): Specifies the topic within the section.
• ‘soru’ (question): The question text itself.
• ‘cevap’ (answer): The correct answer to the question.
• ‘aciklama’ (explanation): Provides additional context or reasoning for the answer, crucial for models to understand the logic behind correct responses.
• ‘secenekler’ (options): The possible answer choices, essential for multiple-choice formats.
• Wide Range of Sections and Subjects: The dataset includes 67 sections covering over 800 unique subjects. These sections span from specialized medical fields in TUS to general knowledge and vocational exams like KPSS and Ehliyet, ensuring that the dataset reflects the complexity and breadth of Turkish academic and professional content.
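A minimal sketch of working with the six described columns; the inline rows are invented for illustration, and the "|" delimiter for the options column is an assumption:

```python
import csv
import io
from collections import Counter

# Tiny inline sample mirroring the six columns described above (rows invented).
sample = io.StringIO(
    "bölüm,konu,soru,cevap,aciklama,secenekler\n"
    "KPSS,Tarih,İstanbul hangi yıl fethedildi?,1453,"
    "Fatih Sultan Mehmet dönemi,\"1453|1454|1455|1456\"\n"
    "TUS,Farmakoloji,Parasetamolün etki mekanizması nedir?,COX inhibisyonu,"
    "Merkezi COX inhibisyonu,\"A|B|C|D\"\n"
)

rows = list(csv.DictReader(sample))

# Count questions per section and split one row's options for a multiple-choice view.
by_section = Counter(row["bölüm"] for row in rows)
options = rows[0]["secenekler"].split("|")  # delimiter is an assumption
print(dict(by_section), options)
```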
Dataset Source and Usage:
• Data Source: The dataset is compiled from publicly available data on the internet. While care has been taken to ensure that the data is original, there may be instances where some questions contain copyrighted material. If any copyright holders identify their material within the dataset, they are encouraged to contact the author, and the specific question will be promptly removed.
• Non-Commercial Use: This dataset is strictly intended for research and academic purposes. It cannot be used for commercial purposes under any circumstances.
Importance for AI Models:
1. Training: The vast number of questions, coupled with detailed explanations, makes this dataset an invaluable resource for training AI models to understand and process Turkish at a high level. The diversity of topics also ensures that the model is exposed to a wide range of vocabulary, concepts, and linguistic structures.
2. Fine-Tuning: For researchers and developers looking to fine-tune existing models, such as GPT, BERT, or other transformer-based architectures, this dataset offers domain-specific content that can significantly enhance performance in areas like medical language processing, legal text analysis, or general-purpose Turkish language understanding.
3. Evaluation: The Turkish MMLU dataset is ideal for evaluating the performance of AI models in Turkish. With its rich content and structured format, it allows for rigorous testing across various subjects, helping to measure how well a model can comprehend and generate accurate responses in Turkish.
4. Real-World Application: Beyond academic research, this dataset is also highly applicable in developing AI-powered tools for exam preparation, automated tutoring systems, and educational applications that require a deep understanding of the Turkish language and its diverse domains.
Example Sections:
• Medical Exams (TUS): Includes specialized subjects such as Farmakoloji, Patoloji, Mikrobiyoloji, and more, which are critical for training models intended for medical documentation or decision support systems.
• Public and Professional Exams (KPSS): Encompasses a wide array of subjects like Genel Kültür, Tarih, Coğrafya, and Vatandaşlık, making it valuable for general-purpose models.
• Diverse Topics: Ranging from Dini Bilgiler and Futbol to İlahiyat and İşletme, this dataset provides a robust foundation for models that need to handle a variety of real-world questions in Turkish.
Potential Uses:
• Model Training: Utilize the dataset to train AI models from scratch, providing a foundational understanding of Turkish in both general and specialized contexts.
• Fine-Tuning Pre-Trained Models: Enhance existing models by fine-tuning them on this dataset, allowing them to achieve better performance in Turkish language tasks.
• Evaluation and Benchmarking: Test and benchmark the capabilities of AI models, ensuring they meet the necessary standards for comprehension and response generation in Turkish.
• AI-Powered Educational Tools: Develop intelligent tutoring systems or exam preparation tools that can assist students and professionals in mastering complex subjects.
Conclusion:
The Turkish MMLU dataset is not just a collection of questions and answers; it is a comprehensive and original tool designed to advance the development of AI in the Turkish language. Whether you are training new models, fine-tuning existing ones, or evaluating their performance, this dataset offers the depth and breadth needed to push the boundaries of natural language processing in Turkish. Its originality and extensive scope make it an indispensable resource for anyone working in this field.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Turkish Book Data Set
This dataset is a comprehensive compilation of Turkish books obtained through web scraping from the internet. Each record in the dataset contains essential information such as the book's title, author, publisher, publication year, page count, category, description, and image URL.
This rich dataset can be used in various applications, particularly analyses using methods such as classification, content-based recommendation algorithms, and natural language processing (NLP). For researchers, students, and data scientists, it is a valuable resource for exploring Turkish literature, generating book recommendations, or developing machine learning models.