100+ datasets found

turkish-sentiment-analysis-dataset
huggingface.co
Updated Jun 21, 2022
Cite
Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
Dataset updated
Jun 21, 2022
Authors
Batuhan
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset

This dataset contains positive, negative, and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels, positive and negative. However, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore I created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from a Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
turkish-wikiNER
huggingface.co
Updated Aug 19, 2024
Cite
Turkish NLP Suite (2024). turkish-wikiNER [Dataset]. https://huggingface.co/datasets/turkish-nlp-suite/turkish-wikiNER
Dataset updated
Aug 19, 2024
Dataset authored and provided by
Turkish NLP Suite
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for "turkish-nlp-suite/turkish-wikiNER"

Dataset Summary

Turkish NER dataset from Wikipedia sentences. 20,000 sentences are sampled and re-annotated from the Kuzgunlar NER dataset. Annotations are done by Co-one. Many thanks to them for their contributions. This dataset is also used in our brand new spaCy Turkish packages.

Dataset Instances

An instance of this dataset looks as follows: { "tokens": ["Çekimler", "5", "Temmuz", "2005", "tarihinde"… See the full description on the dataset page: https://huggingface.co/datasets/turkish-nlp-suite/turkish-wikiNER.
Turkish Web Treebank
github.com
Cite
Turkish Web Treebank [Dataset]. https://github.com/google-research-datasets/turkish-treebanks
License
http://www.apache.org/licenses/LICENSE-2.0.txt
Description
A human-annotated morphosyntactic treebank for Turkish.
Turkish Turkey Dataset
shaip.com
Updated Feb 10, 2023
Cite
Shaip (2023). Turkish Turkey Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
Dataset updated
Feb 10, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Turkey TTS Dataset for AI & Speech Models. Title: Turkish Turkey Language Dataset. Dataset type: TTS. Description: Single-utterance recordings, which tend to fall in the…
Wake Word Turkish Dataset
shaip.com
Updated Oct 12, 2023
Cite
Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
Dataset updated
Oct 12, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
turkishvoicedataset
huggingface.co
Updated Jun 14, 2023
Cite
EREN FAZLIOĞLU (2023). turkishvoicedataset [Dataset]. https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset
Dataset updated
Jun 14, 2023
Authors
EREN FAZLIOĞLU
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for "turkishneuralvoice"

Dataset Overview

Dataset Name: Turkish Neural Voice
Description: This dataset contains Turkish audio samples generated using Microsoft Text to Speech services. The dataset includes audio files and their corresponding transcriptions.

Dataset Structure

Configs:

default

Data Files:

Split: train
Path: data/train-*

Dataset Info:

Features:
audio: Audio file
transcription: Corresponding text transcription

Splits: train… See the full description on the dataset page: https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset.
Turkish Turkey Dataset
sm.shaip.com
Updated Dec 5, 2024
Cite
Shaip (2024). Turkish Turkey Dataset [Dataset]. https://sm.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
Dataset updated
Dec 5, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Turkey TTS Dataset for AI & Speech Models. Title: Turkish Turkey Language Dataset. Dataset type: TTS. Description: Single-utterance recordings, which tend to fall in the…
Wake Lo Lus Turkish Dataset
hmn.shaip.com
Updated Aug 6, 2024
Cite
Shaip (2024). Wake Lo Lus Turkish Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
Dataset updated
Aug 6, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
Turkish Call Center Data for Realestate AI
futurebeeai.com
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Turkish Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-turkish-turkey
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
This Turkish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Turkish-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, making it well suited to building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
Speech Data
The dataset features 30 hours of dual-channel call center recordings between native Turkish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
•Participant Diversity:
•Speakers: 60 native Turkish speakers from our verified contributor community.
•Regions: Representing different provinces across Turkey to ensure accent and dialect variation.
•Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
•Recording Details:
•Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
•Call Duration: Average 5–15 minutes per call.
•Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
•Recording Environment: Captured in noise-free and echo-free conditions.
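One way to confirm that a delivered file matches the stated audio format (stereo, 16-bit, 8 kHz or 16 kHz) is to read its header with Python's standard `wave` module. The sketch below is illustrative only: the helper names `check_wav_format` and `write_silent_wav` are assumptions and are not part of any FutureBeeAI tooling.

```python
import wave

# Sample rates named in the listing above.
EXPECTED_RATES = {8000, 16000}


def check_wav_format(path):
    """Return True if the WAV header says stereo, 16-bit, 8 kHz or 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 2          # stereo (dual-channel)
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wav.getframerate() in EXPECTED_RATES
        )


def write_silent_wav(path, channels, rate, seconds=1):
    """Write a silent 16-bit WAV file; used only to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)
```

Calling `check_wav_format` on each file before training makes it easy to catch mono or resampled recordings early.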

Topic Diversity
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
•Inbound Calls:
•Property Inquiries
•Rental Availability
•Renovation Consultation
•Property Features & Amenities
•Investment Property Evaluation
•Ownership History & Legal Info, and more
•Outbound Calls:
•New Listing Notifications
•Post-Purchase Follow-ups
•Property Recommendations
•Value Updates
•Customer Satisfaction Surveys, and others
Such domain-rich variety ensures model generalization across common real estate support conversations.
Transcription
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-coded Segments
•Non-speech Tags (e.g., background noise, pauses)
•High transcription accuracy with word error rate below 5% via dual-layer human review.
These transcriptions streamline ASR and NLP development for Turkish real estate voice applications.
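The listing does not say how the word error rate was measured. A common definition is the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below assumes that definition; the function name `word_error_rate` and the whitespace tokenization are illustrative choices.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a rate below 5% means fewer than one word-level edit per twenty reference words.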
Metadata
Detailed metadata accompanies each participant and conversation:
•Participant Metadata: ID, age, gender, location, accent, and dialect.
•Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

This enables smart filtering, dialect-focused model training, and structured dataset exploration.
Usage and Applications
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
Wake Shoko Turkish Dataset
sn.shaip.com
Updated Sep 11, 2024
Cite
Shaip (2024). Wake Shoko Turkish Dataset [Dataset]. https://sn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
Dataset updated
Sep 11, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
Turkish Handwritten Sticky Notes OCR Image Dataset
futurebeeai.com
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Turkish Handwritten Sticky Notes OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-sticky-notes-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Turkish Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.
Dataset Content & Diversity:
Containing more than 2,000 images, this Turkish OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow fewer than three unique images per handwriting, so the dataset covers many different writers. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Turkish text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these sticky notes were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.
Update & Custom Collection:
We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.
If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License:
This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.
Corona Virus (COVID-19) Turkish Tweets Dataset
ieee-dataport.org
Updated May 19, 2020
Cite
Ibrahim Sabuncu (2020). Corona Virus (COVID-19) Turkish Tweets Dataset [Dataset]. https://ieee-dataport.org/open-access/corona-virus-covid-19-turkish-tweets-dataset-0
Dataset updated
May 19, 2020
Authors
Ibrahim Sabuncu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Covid
Wake Word Turkish Dataset
so.shaip.com
Updated Jul 30, 2024
Cite
Shaip (2024). Wake Word Turkish Dataset [Dataset]. https://so.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
Dataset updated
Jul 30, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset.
Weke Okwu Turkish Dataset
ig.shaip.com
Updated Dec 8, 2024
Cite
Shaip (2024). Weke Okwu Turkish Dataset [Dataset]. https://ig.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
Dataset updated
Dec 8, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word…
Wake Word Turkish Dataset
zu.shaip.com
Updated Aug 25, 2024
Cite
Shaip (2024). Wake Word Turkish Dataset [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
Dataset updated
Aug 25, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
High-Quality Turkish Wake Word Dataset for AI & Speech Models. Title: Wake Word Turkish Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word...
Turkish Natural Language Inference Dataset
opendatabay.com
Updated Jul 5, 2025
Cite
Datasimple (2025). Turkish Natural Language Inference Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/f4951f96-ebbc-43bf-bed5-36dce9796e6e
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Education & Learning Analytics
Description
The NLI-TR dataset, comprising two distinct datasets known as SNLI-TR and MNLI-TR, is a resource for the natural language processing (NLP) and machine learning communities. Its primary purpose is to facilitate natural language inference research in the Turkish language. The datasets consist of curated natural language inference data translated into Turkish from the original English sources. This resource enables researchers to develop automated models specifically tailored for making inferences on Turkish texts. It also supports the study of cross-lingual generalisation, that is, how models trained on data from one language perform when applied to another. It supports tasks ranging from sentence paraphrasing and classification to question answering, featuring Turkish sentence pairs labelled to indicate whether the premise entails, contradicts, or is neutral towards the hypothesis.

Columns

The dataset records typically include the following columns:

premise: This column contains sentences written in Turkish. These sentences have been translated from the English sources used for the original SNLI and MNLI datasets. It serves as the contextual information or the initial statement from which an inference is to be made.

hypothesis: This column also contains sentences in Turkish, translated from the English SNLI and MNLI datasets. It represents the conclusion or the statement whose relationship to the premise is being assessed.

label: This column assigns a relationship between the premise and hypothesis. Possible values include:

'entailment': The hypothesis logically follows from the premise.

'contradiction': The hypothesis directly contradicts the premise.

'neutral': The premise neither entails nor contradicts the hypothesis.

domain: An optional column assigned by some authors, primarily used when inferences are made between sentences across different semantic domains, such as weather, sports, or finance.

Distribution

The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, snli_tr_validation for testing or validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on complex scenarios. The multinli_tr_train.csv file contains Turkish sentences with their corresponding labels. The dataset is large-scale: the multinli_tr_train.csv file, for instance, contains approximately 392,700 records.
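
As a rough illustration of working with files laid out this way, the sketch below counts labels in a CSV that has a header row with `premise`, `hypothesis`, and `label` columns. That layout is an assumption based on the column description above, and the actual files may differ. The function name `count_labels` and the three sample rows are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Label values listed in the column description above.
VALID_LABELS = {"entailment", "contradiction", "neutral"}


def count_labels(csv_file):
    """Count NLI labels in an open CSV file with a 'label' column.

    Rows whose label is missing or unrecognized are tallied under '<other>'.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        label = (row.get("label") or "").strip()
        counts[label if label in VALID_LABELS else "<other>"] += 1
    return counts


# Invented sample rows (not from the dataset), used only to show the call.
SAMPLE = """premise,hypothesis,label
Bir adam koşuyor.,Bir kişi hareket ediyor.,entailment
Bir adam koşuyor.,Adam uyuyor.,contradiction
Bir adam koşuyor.,Adam maratona hazırlanıyor.,neutral
"""

counts = count_labels(io.StringIO(SAMPLE))
```

For a file on disk, the same function can be called as `count_labels(open(path, newline="", encoding="utf-8"))`, where `path` points to one of the CSV files.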

Usage

This dataset is ideal for various applications and use cases in NLP and machine learning:

Developing Natural Language Inference (NLI)-based question answering systems for the Turkish language.

Training sentiment analysis algorithms to discern sentiment in Turkish text.

Building Machine Learning Chatbots that leverage NLI to understand conversational context and respond appropriately in Turkish.

Conducting general NLI research in Turkish.

Investigating cross-lingual generalisation capabilities of NLP models.

Tasks such as sentence paraphrasing, classification, and other NLP techniques applied to Turkish text.

Coverage

The dataset's scope is primarily focused on the Turkish language, making it relevant for global use. The data has been translated from English sources, expanding its utility for cross-lingual studies. A specific time range or demographic scope for the data collection is not detailed in the available sources.

License

CC0

Who Can Use It

The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:

The natural language processing (NLP) community.

The machine learning community.

Seasoned and budding researchers looking to delve into NLI tasks.

Developers aiming to create automated models for Turkish language inference.

Academics and practitioners exploring the cross-lingual generalisation capabilities of models.

Anyone working on NLP tasks in Turkish, such as sentence paraphrasing, text classification, or question answering.

Dataset Name Suggestions

NLI-TR (Turkish NLI Research)

Turkish Natural Language Inference Dataset

SNLI-TR and MNLI-TR Turkish Data

Turkish Textual Entailment Data

Attributes

Original Data Source: NLI-TR (Turkish NLI Research)
MedTurkQuAD: Medical Turkish Question-Answering Dataset
paperswithcode.com
Updated Oct 15, 2024
Cite
Mert Incidelen; Murat Aydogan (2024). MedTurkQuAD: Medical Turkish Question-Answering Dataset [Dataset]. https://paperswithcode.com/dataset/medturkquad-medical-turkish-question
Dataset updated
Oct 15, 2024
Authors
Mert Incidelen; Murat Aydogan
Description
A comprehensive Turkish dataset for question-answering tasks in the medical domain
Turkish Lira Detection Dataset
universe.roboflow.com
Updated Apr 21, 2023
Cite
tltltltl (2023). Turkish Lira Detection Dataset [Dataset]. https://universe.roboflow.com/tltltltl/turkish-lira-detection
Available download formats: zip
Dataset updated
Apr 21, 2023
Dataset authored and provided by
tltltltl
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Türkiye
Variables measured
Banknote Bounding Boxes
Description
Turkish Lira Detection

## Overview
Turkish Lira Detection is a dataset for object detection tasks - it contains Banknote annotations for 4,531 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).
English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...
data.mendeley.com
Updated Feb 9, 2017
Cite
H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
Unique identifier
https://doi.org/10.17632/cdcztymf4k.1
Dataset updated
Feb 9, 2017
Authors
H. Bahadir Sahin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers have approximately 23M named entities.

By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent (b) domain-independent. We produce two different versions by post-processing raw collections. As a result of this process, we introduced 3 versions of TWNERTC and EWNERTC: (a) raw (b) domain-dependent post-processed (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (varies between versions), while English collections contain more than 7M sentences.

We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types into "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained type. Note that this process also eliminated many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.
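
The reduction from fine-grained to coarse-grained types can be pictured as a lookup table. The sketch below is only an illustration under stated assumptions: the fine-grained type names in `FINE_TO_COARSE` are hypothetical placeholders, not the dataset's actual types, and dropping unmapped annotations is one plausible reading of "eliminated", not the authors' documented procedure.

```python
# The four coarse-grained types named in the description.
COARSE_TYPES = {"organization", "person", "location", "misc"}

# Hypothetical fine-grained types, invented for this example.
FINE_TO_COARSE = {
    "politician": "person",
    "football_team": "organization",
    "city": "location",
    "film": "misc",
}


def coarsen(annotations, mapping=FINE_TO_COARSE):
    """Map (entity_text, fine_type) pairs to (entity_text, coarse_type).

    Pairs whose fine-grained type has no coarse counterpart are dropped.
    """
    coarse = []
    for text, fine_type in annotations:
        coarse_type = mapping.get(fine_type)
        if coarse_type is not None:
            coarse.append((text, coarse_type))
    return coarse
```

For instance, `coarsen([("Ankara", "city"), ("X", "unmapped_type")])` keeps only the first pair, relabelled as a location.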

All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.
Turkish Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Turkish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Turkish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.
Dataset Content & Diversity:
Containing a total of 5,000 images, this Turkish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow fewer than five unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible Turkish text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All these images were captured by native Turkish people to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture all these images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Turkish text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native Turkish crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific requirements using our crowd community.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Turkish language. Your journey to enhanced language understanding and processing starts here.

17 scholarly articles cite the turkish-sentiment-analysis-dataset (per Google Scholar).
