Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset
This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels, positive and negative. However, user input can be an entirely neutral sentence, and I could not find any data for such cases. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
STSb Turkish
Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the STSb English dataset. This dataset is not reviewed by expert human translators. Uploaded from this repository.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Thanks for using the Turkish Wikipedia dataset! We hope it will be useful for your language modeling and text generation tasks.
Since the Turkish Wikipedia dataset was not on Kaggle, I took a dataset shared on Hugging Face, merged it into 2 Parquet files, and shared it on Kaggle. You can reach the version shared on Hugging Face from the link below. I would also like to thank https://huggingface.co/musabg for creating this dataset.
Original link to this dataset: https://huggingface.co/datasets/musabg/wikipedia-tr
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Turkish Book Data Set
This dataset is a comprehensive compilation of Turkish books obtained through web scraping from the internet. Each record in the dataset contains essential information such as the book's title, author, publisher, publication year, page count, category, description, and image URL.
This rich dataset can be utilized in various applications, particularly in the analysis through methods such as classification, content-based recommendation algorithms, and natural language processing (NLP). For researchers, students, and data scientists, this dataset serves as a valuable resource for exploring Turkish literature, generating book recommendations, or developing machine learning models.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains chat-based dialogs in the Turkish language. The dialogs are written in a particularly natural, polite and supportive style. The interactions between the user and the chatbot aim to provide information and support on different topics. This dataset is suitable for Turkish language processing (NLP) projects and can be used in areas such as chatbots, language modeling and text analysis.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TURKISH SONG LYRICS FROM GENIUS DATASET
DATASET DESCRIPTION
This dataset contains a comprehensive collection of 44,692 Turkish song lyrics, extracted from the larger "Genius Song Lyrics with Language Information" dataset available on Kaggle. The original 9.07 GB dataset was filtered to include only songs identified with the language code 'tr' (Turkish), making it a clean and focused resource for Turkish Natural Language Processing (NLP) tasks.
HOW TO USE
You can easily load this dataset using the Hugging Face datasets library.
Example Python code:
```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("mustafakemal0146/Genius-Turkish-Dataset")

# Print the lyrics of the first song in the train split
print(dataset['train'][0]['lyrics'])
```
DATASET STRUCTURE
The dataset consists of a single CSV file, loaded as the train split, with the following columns:
DATA SOURCE AND CURATION
This dataset is a curated subset of the "Genius Song Lyrics with Language Information" (Link: https://www.kaggle.com/datasets/pavanelisetty/genius-song-lyrics-with-language-information) dataset on Kaggle, originally collected from Genius.com. The filtering process involved reading the main song_lyrics.csv file and selecting all rows where the language column was equal to 'tr'.
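The filtering step described above can be sketched roughly as follows. This is only an illustration using Python's standard csv module, not the author's actual script, which is not shown. The sample rows and the `title` column are invented for the example; only the `language` and `lyrics` column names come from the description.

```python
import csv
import io

# Tiny stand-in for song_lyrics.csv; the rows and the "title" column are invented.
SAMPLE_CSV = """title,lyrics,language
Song A,la la la,en
Song B,bir iki üç,tr
Song C,uno dos tres,es
Song D,dört beş altı,tr
"""

def filter_by_language(csv_file, code="tr"):
    """Return the rows whose 'language' column equals the given code."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["language"] == code]

turkish_rows = filter_by_language(io.StringIO(SAMPLE_CSV))
print(len(turkish_rows))
```

Because `csv.DictReader` reads one row at a time, the same function could in principle be pointed at an open handle to the full file, keeping only the matching rows in memory.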
LICENSE
The original dataset on Kaggle does not specify a license. This curated version is shared under the "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)" (Link: https://creativecommons.org/licenses/by-nc-sa/4.0/) license, assuming it will be used for non-commercial research and educational purposes. Please refer to the original data source for any commercial use inquiries.
Created by MustafaKemal0146 (Hugging Face Profile: https://huggingface.co/MustafaKemal0146)
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
Participant Diversity:
- Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.
- Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.
- Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
Recording Details:
- Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
- Duration: Each conversation ranges from 15 to 60 minutes.
- Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
- Environment: Quiet, echo-free settings with no background noise.
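Before training, it can be useful to confirm that a file matches the stated audio format. The following is a minimal sketch using Python's standard `wave` module. The helper name `describe_wav` is hypothetical, and the clip is generated in memory as a stand-in, since no real file from the dataset is available here.

```python
import io
import wave

def describe_wav(file_obj):
    """Return (channels, bits per sample, sample rate in Hz) for a WAV file."""
    with wave.open(file_obj, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()

# Build a one-second silent stereo, 16-bit, 16 kHz clip in memory as a
# stand-in for a real recording from the dataset.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)        # 2 bytes per sample = 16 bits
    wav.setframerate(16000)
    wav.writeframes(b"\x00" * (16000 * 2 * 2))
buffer.seek(0)

wav_info = describe_wav(buffer)
print(wav_info)
```

`describe_wav` also accepts a file path, so the same check could be run over a folder of downloaded recordings.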
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Sample Topics Include:
- Family & Relationships
- Food & Recipes
- Education & Career
- Healthcare Discussions
- Social Issues
- Technology & Gadgets
- Travel & Local Culture
- Shopping & Marketplace Experiences, and many more.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
Transcription Highlights:
- Speaker-segmented dialogues
- Time-coded utterances
- Non-speech elements (pauses, laughter, etc.)
- High transcription accuracy, achieved through a double QA pass, with average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
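For readers unfamiliar with the WER figure quoted above: word error rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that textbook definition. How the provider normalizes text before scoring is not stated, so this is an assumption-laden illustration, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Invented example: the hypothesis drops one of six reference words.
reference = "bugün hava çok güzel ve güneşli"
hypothesis = "bugün hava çok güzel güneşli"
wer = word_error_rate(reference, hypothesis)
print(round(wer, 3))
```

A WER below 5% therefore corresponds to fewer than one word-level error per twenty reference words on average.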
The dataset comes with granular metadata for both speakers and recordings:
Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
Recording Metadata: Topic, duration, audio format, device type, and sample rate.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This Turkish General Conversation Dataset is created by FutureBeeAI and is available for commercial licensing.
This dataset includes Turkish-language dialogues between customers and representatives across various service domains, such as finance, e-commerce, technical support, and general inquiries.
Each conversation contains multiple turns and is structured with clearly labeled speaker roles (customer or representative), making it suitable for Natural Language Processing (NLP) tasks related to dialogue systems, intent detection, and chatbot development.
conversation_id: Unique identifier for each conversation
category: Service domain label (e.g. Finance, Technical Support)
speaker: Role of the speaker (customer or representative)
text: Utterance in Turkish
CC BY 4.0 (free to use with attribution)
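Since each row holds a single utterance, a typical first step is to regroup rows into whole conversations. A minimal sketch, assuming the rows arrive as dictionaries keyed by the four documented fields and are already in turn order; the sample rows and ID values are invented.

```python
from collections import defaultdict

# Invented rows following the four documented fields.
rows = [
    {"conversation_id": "c1", "category": "Finance",
     "speaker": "customer", "text": "Merhaba, kartım bloke oldu."},
    {"conversation_id": "c1", "category": "Finance",
     "speaker": "representative", "text": "Hemen yardımcı olayım."},
    {"conversation_id": "c2", "category": "Technical Support",
     "speaker": "customer", "text": "İnternetim çalışmıyor."},
]

def group_conversations(rows):
    """Collect (speaker, text) turns per conversation_id, preserving row order."""
    conversations = defaultdict(list)
    for row in rows:
        turn = (row["speaker"], row["text"])
        conversations[row["conversation_id"]].append(turn)
    return dict(conversations)

conversations = group_conversations(rows)
for conversation_id, turns in conversations.items():
    print(conversation_id, len(turns))
```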
Prepared by Anıl as part of a research and educational NLP project.
Attribution 2.0 (CC BY 2.0) https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Dataset Summary
This dataset is an enhanced version of those used in existing offensive language studies. Existing datasets are highly imbalanced, and fixing this problem manually is too costly. To address it, we propose a contextual data mining method for dataset augmentation. Our method avoids retrieving random tweets and labeling them individually. Instead, we can directly access tweets that are almost certainly hate related and label them without any further human interaction in order to solve imbalanced… See the full description on the dataset page: https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is a machine-translated Turkish version of HuggingFaceH4/instruction-dataset. Translated with googletrans==3.1.0a0.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We selected the two most popular movie and hotel recommendation websites among those with a high Alexa ranking: “beyazperde.com” for movie reviews and “otelpuan.com” for hotel reviews. The reviews of 5,660 movies were investigated. All 220,000 extracted reviews had already been rated by their authors on a scale of 1 to 5 stars. As most of the reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews rated 1 or 2 stars, so we randomly selected 26,700 of the 130,210 positive reviews rated 4 or 5 stars. Overall, 53,400 movie reviews with an average length of 33 words were selected. The same procedure was applied to the hotel reviews, except that these had been rated with numbers between 0 and 100 instead of stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews rated 0 to 40, we selected 5,800 of the 6,499 positive reviews rated 80 to 100. The average length of the 11,600 selected positive and negative hotel reviews was 74 words, more than twice that of the movie reviews.
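The balancing procedure described here amounts to random undersampling of the larger class. The sketch below illustrates the idea with placeholder integer IDs in place of real reviews. The function name, the optional `size` argument, and the fixed seed are assumptions made for the example, not details taken from the original study.

```python
import random

def balance_by_undersampling(positives, negatives, size=None, seed=0):
    """Randomly draw the same number of items from each class.

    If size is None, the smaller class determines the sample size.
    """
    rng = random.Random(seed)
    if size is None:
        size = min(len(positives), len(negatives))
    return rng.sample(positives, size), rng.sample(negatives, size)

# Placeholder IDs standing in for reviews; counts match the movie figures.
positive_ids = list(range(130210))
negative_ids = list(range(26700))
sampled_pos, sampled_neg = balance_by_undersampling(positive_ids, negative_ids)
print(len(sampled_pos) + len(sampled_neg))
```

For the hotel reviews, the quoted totals suggest the per-class size was set to 5,800 rather than the full 5,802 negatives, which would correspond to passing `size=5800` in this sketch.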
https://www.apache.org/licenses/LICENSE-2.0
The Turkish Offensive Language Dataset is a Turkish-language dataset collected from Twitter, designed for training models in offensive language detection, hate speech detection, and text classification tasks. Created by Gülzade Evni and Zeynep Baydemir, the dataset covers multiple subcategories of harmful content including racism, profanity, insult, and sexism.
It consists of seven files and is distributed under the Apache-2.0 license, making it openly available for research and development purposes. The dataset is intended for practitioners and researchers working on natural language processing for Turkish social media content, addressing a recognized gap in low-resource language resources for content moderation applications.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Turkish Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Turkish language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Turkish speech data.
This dataset features over 6,000 high-quality scripted monologue recordings in Turkish. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.
Participant Diversity
- Speakers: 60 native Turkish speakers
- Regions: Broad regional coverage ensures diverse accents and dialects
- Demographics: Participants aged 18 to 70, with a 60:40 male-to-female ratio
Recording Specifications
- Recording Type: Scripted monologues and prompt-based recordings
- Audio Duration: 5 to 30 seconds per file
- Format: WAV, mono channel, 16-bit, 8 kHz & 16 kHz sample rates
- Environment: Clean, noise-free conditions to ensure clarity and usability
The dataset covers a wide variety of general conversation scenarios, including:
Daily Conversations
Topic-Specific Discussions
General Knowledge and Advice
Idioms and Sayings
To enhance authenticity, the prompts include:
Names: Male and female names specific to different regions of Turkey
Addresses: Commonly used address formats in daily Turkish speech
Dates & Times: References used in general scheduling and time expressions
Organization Names: Names of businesses, institutions, and other entities
Numbers & Currencies: Mentions of quantities, prices, and monetary values
Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.
Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.
Content: Exact match to the spoken audio
Format: Plain text (.TXT), named identically to the corresponding audio file
Quality Control: All transcripts are validated by native Turkish transcribers
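Because each transcript shares its file name with the corresponding audio file, pairing the two can be done by file stem. A minimal sketch, assuming a flat folder layout and lowercase `.wav` and `.txt` extensions, neither of which is confirmed by the description; the file names and sentence are invented.

```python
import tempfile
from pathlib import Path

def pair_audio_with_transcripts(folder):
    """Map each .wav file name to the text of the .txt file sharing its stem."""
    pairs = {}
    for audio_path in sorted(Path(folder).glob("*.wav")):
        transcript_path = audio_path.with_suffix(".txt")
        if transcript_path.exists():
            text = transcript_path.read_text(encoding="utf-8")
            pairs[audio_path.name] = text.strip()
    return pairs

# Demo on a throwaway folder with invented file names and content.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "prompt_001.wav").write_bytes(b"")
    Path(folder, "prompt_001.txt").write_text(
        "Yarın saat üçte buluşalım.", encoding="utf-8")
    Path(folder, "prompt_002.wav").write_bytes(b"")  # no transcript on purpose
    pairs = pair_audio_with_transcripts(folder)

print(pairs)
```

Audio files without a matching transcript are skipped silently here; a stricter pipeline might log or raise on them instead.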
Rich metadata is included for detailed filtering and analysis:
Speaker Metadata: Unique speaker ID, age, gender, region, and dialect
Audio Metadata: Prompt transcript, recording setup, device specs, sample rate, bit depth, and format.
This dataset is developed and owned by FutureBeeAI and is available for commercial use, offering high-value resources for enterprises and research organizations developing Turkish speech technologies.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Turkish General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Turkish usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Turkish conversations covering a broad spectrum of everyday topics.
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native Turkish speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Words per Chat: 300–700
Turns per Chat: Up to 50 dialogue turns
Contributors: 200 native Turkish speakers from the FutureBeeAI Crowd Community
Format: TXT, DOCS, JSON or CSV (customizable)
Structure: Each record contains the full chat, topic tag, and metadata block
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
Music, books, and movies
Health and wellness
Children and parenting
Family life and relationships
Food and cooking
Education and studying
Festivals and traditions
Environment and daily life
Internet and tech usage
Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level Turkish usage with:
Colloquial expressions and local dialect influence
Domain-relevant terminology
Language-specific grammar, phrasing, and sentence flow
Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
Representation of different writing styles and input quirks to ensure training data realism
Every chat instance is accompanied by structured metadata, which includes:
Participant Age
Gender
Country/Region
Chat Domain
Chat Topic
Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
Manual review for content completeness
Format checks for chat turns and metadata
Linguistic verification by native speakers
Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is developed and owned by FutureBeeAI and is available under a commercial license. Custom licensing terms can be provided for academic, research, or enterprise clients.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
containing a total of 13
This dataset was created by Berkay Kocaoglu
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Turkish Speech Corpus (TSC)
This repository presents an open-source Turkish Speech Corpus, introduced in "Multilingual Speech Recognition for Turkic Languages". The corpus contains 218.2 hours of transcribed speech with 186,171 utterances and is the largest publicly available Turkish dataset of its kind at that time.
Paper: Multilingual Speech Recognition for Turkic Languages.
GitHub Repository: https://github.com/IS2AI/TurkicASR
Citation
@Article{info14020074… See the full description on the dataset page: https://huggingface.co/datasets/issai/Turkish_Speech_Corpus.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A rich collection of Turkish text images including street signs, advertisements, and real-world scenes, with diverse fonts, sizes, and orientations, optimized for OCR and computer vision research.
Turkish (Turkey) real-world casual conversation and monologue speech dataset covering self-media, conversation, live, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes. Our datasets all comply with GDPR, CCPA, and PIPL.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Stanford alpaca turkish: Stanford Alpaca