Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset
This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels, positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. I therefore created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for "turkish-nlp-suite/turkish-wikiNER"
Dataset Summary
Turkish NER dataset from Wikipedia sentences. 20.000 sentences are sampled and re-annotated from Kuzgunlar NER dataset. Annotations are done by Co-one. Many thanks to them for their contributions. This dataset is also used in our brand new spaCy Turkish packages.
Dataset Instances
An instance of this dataset looks as follows: { "tokens": ["Çekimler", "5", "Temmuz", "2005", "tarihinde"… See the full description on the dataset page: https://huggingface.co/datasets/turkish-nlp-suite/turkish-wikiNER.
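As a minimal sketch of how such an instance might be consumed, the snippet below groups tokens into entity spans. It assumes a parallel `tags` list in BIO format; both the field name and the tag values are hypothetical, since the instance above is truncated before its annotations.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current.append(token)
        else:
            # Outside any entity (or a stray I- tag); flush and reset.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], ""
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Tokens come from the truncated instance above; the tags are hypothetical.
tokens = ["Çekimler", "5", "Temmuz", "2005", "tarihinde"]
tags = ["O", "B-DATE", "I-DATE", "I-DATE", "O"]
print(bio_to_spans(tokens, tags))
```

The actual label set and field names should be checked on the dataset page before relying on this layout.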
http://www.apache.org/licenses/LICENSE-2.0.txt
A human-annotated morphosyntactic treebank for Turkish.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish (Turkey) TTS Dataset for AI & Speech Models. Title: Turkish Turkey Language Dataset. Dataset Type: TTS. Description: Single-utterance recordings, which tend to fall in the…
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "turkishneuralvoice"
Dataset Overview
Dataset Name: Turkish Neural Voice
Description: This dataset contains Turkish audio samples generated using Microsoft Text to Speech services. The dataset includes audio files and their corresponding transcriptions.
Dataset Structure
Configs:
default
Data Files:
Split: train; Path: data/train-*
Dataset Info:
Features: audio (audio file), transcription (corresponding text transcription)
Splits: train… See the full description on the dataset page: https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish (Turkey) TTS Dataset for AI & Speech Models. Title: Turkish Turkey Language Dataset. Dataset Type: TTS. Description: Single-utterance recordings, which tend to fall in the…
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
https://www.futurebeeai.com/policies/ai-data-license-agreement
This Turkish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Turkish-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
The dataset features 30 hours of dual-channel call center recordings between native Turkish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
Such domain-rich variety ensures model generalization across common real estate support conversations.
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
These transcriptions streamline ASR and NLP development for Turkish real estate voice applications.
Detailed metadata accompanies each participant and conversation:
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Turkish Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.
Dataset Content & Diversity: Containing more than 2,000 images, this Turkish OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Turkish text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these sticky notes were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.
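A minimal sketch of how such a metadata CSV could be used to select images by orientation follows. The column names (`file_name`, `orientation`, `country`, `language`, `device`) and the sample rows are assumptions for illustration; the actual header of the delivered file is not given here.

```python
import csv
import io
from typing import List

# Hypothetical metadata excerpt; column names and values are assumptions.
SAMPLE_CSV = """file_name,orientation,country,language,device
note_0001.jpg,portrait,Turkey,Turkish,Android
note_0002.heic,landscape,Turkey,Turkish,iOS
note_0003.jpg,portrait,Turkey,Turkish,iOS
"""


def select_by_orientation(csv_text: str, orientation: str) -> List[str]:
    """Return the file names whose orientation column matches the given value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["file_name"] for row in reader if row["orientation"] == orientation]


print(select_by_orientation(SAMPLE_CSV, "portrait"))
```

The same pattern extends to filtering by device or any other column once the real header is known.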
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.
If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kovid
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The NLI-TR dataset, comprising two distinct datasets known as SNLI-TR and MNLI-TR, provides an unparalleled opportunity for research within the natural language processing (NLP) and machine learning communities. Its primary purpose is to facilitate natural language inference research in the Turkish language. The datasets consist of meticulously curated natural language inference data, which has been carefully translated into Turkish from original English sources. This resource enables researchers to develop automated models specifically tailored for making inferences on texts in this vibrant language. Furthermore, it offers valuable insights into cross-lingual generalisation capabilities, allowing investigation into how models trained on data from one language perform when applied to another. It supports tasks ranging from sentence paraphrasing and classification to question answering scenarios, featuring Turkish sentences labelled to indicate whether a premise and hypothesis entail, contradict, or are neutral towards each other.
The dataset records typically include the following columns:
The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, slni_tr_validation for testing or validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on complex scenarios. The multinli_tr_train.csv file contains Turkish sentences with their corresponding labels. The dataset is considered large-scale, with the multinli_tr_train.csv file, for instance, containing approximately 392,700 records.
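As a rough sketch of a first sanity check on one of these CSV files, the snippet below counts how many records carry each label. The column names (`premise`, `hypothesis`, `label`) and the sample rows are assumptions, since the column list is not reproduced here; an in-memory sample stands in for a file such as multinli_tr_train.csv.

```python
import csv
import io
from collections import Counter
from typing import TextIO

# Hypothetical rows; the column names are assumptions, not the documented schema.
SAMPLE_CSV = """premise,hypothesis,label
Bir adam gitar çalıyor.,Bir adam müzik yapıyor.,entailment
Bir adam gitar çalıyor.,Adam uyuyor.,contradiction
Bir kadın koşuyor.,Kadın maratona hazırlanıyor.,neutral
"""


def label_distribution(handle: TextIO, label_column: str = "label") -> Counter:
    """Count the records per label in an NLI-style CSV file."""
    return Counter(row[label_column] for row in csv.DictReader(handle))


# With a real file, one would pass:
#   open("multinli_tr_train.csv", newline="", encoding="utf-8")
counts = label_distribution(io.StringIO(SAMPLE_CSV))
print(counts)
```

A heavily skewed distribution, or labels outside the three expected classes, would be worth investigating before training.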
This dataset is ideal for various applications and use cases in NLP and machine learning:
The dataset's scope is primarily focused on the Turkish language, making it relevant for global use. The data has been translated from English sources, expanding its utility for cross-lingual studies. A specific time range or demographic scope for the data collection is not detailed in the available sources.
CC0
The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:
Original Data Source: NLI-TR (Turkish NLI Research)
A comprehensive Turkish dataset for question-answering tasks in the medical domain
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Turkish Lira Detection is a dataset for object detection tasks - it contains Banknote annotations for 4,531 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers approximately 23M named entities.
By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent (b) domain-independent. We produce two different versions by post-processing raw collections. As a result of this process, we introduced 3 versions of TWNERTC and EWNERTC: (a) raw (b) domain-dependent post-processed (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (varies between versions), while English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained type. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.
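The reduction from fine-grained to coarse-grained types can be sketched as a lookup table, with annotations that have no coarse counterpart being dropped. The fine-grained type names and the mapping below are hypothetical; the collections' actual type inventory and mapping are not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical fine-grained types mapped to the four coarse-grained classes.
FINE_TO_COARSE: Dict[str, str] = {
    "football_player": "person",
    "politician": "person",
    "university": "organization",
    "river": "location",
    "film": "misc",
}


def coarsen(annotations: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (entity_text, fine_type) pairs to coarse types, dropping unmapped ones."""
    result: List[Tuple[str, str]] = []
    for text, fine_type in annotations:
        coarse_type = FINE_TO_COARSE.get(fine_type)
        if coarse_type is not None:
            result.append((text, coarse_type))
    return result


# "drug" has no entry in the hypothetical table, so that annotation is dropped.
example = [("Kızılırmak", "river"), ("Boğaziçi Üniversitesi", "university"), ("aspirin", "drug")]
print(coarsen(example))
```

Dropping unmapped annotations mirrors the note above that some domains and fine-grained annotations do not survive the reduction.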
All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Turkish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.
Dataset Content & Diversity: Containing a total of 5,000 images, this Turkish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Turkish text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Turkish people to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture all these images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes device information, source type (newspaper, magazine, or book), and image type (portrait or landscape), among other fields. Each image is named to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Turkish text recognition models.
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Turkish crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific requirements using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this image dataset to improve the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to enhanced language understanding and processing starts here.