100+ datasets found
  1. h

    turkish-sentiment-analysis-dataset

    • huggingface.co
    Updated Jun 21, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive , negative and notr sentences from several data sources given in the references. In the most sentiment models , there are only two labels; positive and negative. However , user input can be totally notr sentence. For such cases there were no data I could find. Therefore I created this dataset with 3 class. Positive and negative sentences are listed below. Notr examples are extraced from turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  2. Turkish Tweets Dataset

    • kaggle.com
    Updated Apr 9, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anil Guven (2021). Turkish Tweets Dataset [Dataset]. https://www.kaggle.com/datasets/anil1055/turkish-tweet-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Anil Guven
    Description

    Dataset consists of 5 emotion labels. These labels are anger, happy, distinguish, surprise and fear. There are 800 tweets in the dataset for each label. Hence, total tweet count is 4000 for dataset.

    You can use the data set in many areas such as sentiment, emotion analysis and topic modeling.

    Info: Hashtags and usernames was removed in the dataset. Dataset has used many studies and researches. These researches are followed as: -(please citation this article) Güven, Z. A., Diri, B., & Cąkaloglu, T. (2020). Comparison of n-stage Latent Dirichlet Allocation versus other topic modeling methods for emotion analysis. Journal of the Faculty of Engineering and Architecture of Gazi University. https://doi.org/10.17341/gazimmfd.556104 -Güven, Z. A., Diri, B., & Çakaloğlu, T. (2019). Emotion Detection with n-stage Latent Dirichlet Allocation for Turkish Tweets. Academic Platform Journal of Engineering and Science. https://doi.org/10.21541/apjes.459447 -Guven, Z. A., Diri, B., & Cakaloglu, T. (2019). Comparison Method for Emotion Detection of Twitter Users. Proceedings - 2019 Innovations in Intelligent Systems and Applications Conference, ASYU 2019. https://doi.org/10.1109/ASYU48272.2019.8946435

  3. h

    turkish-toxic-language

    • huggingface.co
    Updated Apr 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Overfit-GM (2023). turkish-toxic-language [Dataset]. https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 10, 2023
    Dataset authored and provided by
    Overfit-GM
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Turkish Texts for Toxic Language Detection

      Dataset Description
    
    
    
    
    
      Dataset Summary
    

    This text dataset is a collection of Turkish texts that have been merged from various existing offensive language datasets found online. The dataset contains a total of 77,800 instances, each labeled as either offensive or not offensive. To ensure the dataset's completeness, we utilized multiple transformer models to augment the dataset with pseudo labels. The resulting dataset is… See the full description on the dataset page: https://huggingface.co/datasets/Overfit-GM/turkish-toxic-language.

  4. g

    Turkish Web Treebank

    • github.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Turkish Web Treebank [Dataset]. https://github.com/google-research-datasets/turkish-treebanks
    Explore at:
    License

    http://www.apache.org/licenses/LICENSE-2.0.txthttp://www.apache.org/licenses/LICENSE-2.0.txt

    Description

    A human-annotated morphosyntactic treebank for Turkish.

  5. s

    Turkish Turkey Dataset

    • shaip.com
    Updated Feb 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2023). Turkish Turkey Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Home Turkish Turkey DatasetTürkiye Türkiye Veri KümesiHigh-Quality Turkish Turkey TTS Dataset for AI & Speech Models Contact Us OverviewTitleTurkish Turkey Language DatasetDataset TypeTTSDescriptionSingle-utterance recordings, which tend to fall in the…

  6. h

    turkishvoicedataset

    • huggingface.co
    Updated Jun 14, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    EREN FAZLIOĞLU (2023). turkishvoicedataset [Dataset]. https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 14, 2023
    Authors
    EREN FAZLIOĞLU
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for "turkishneuralvoice"

      Dataset Overview
    

    Dataset Name: Turkish Neural Voice Description: This dataset contains Turkish audio samples generated using Microsoft Text to Speech services. The dataset includes audio files and their corresponding transcriptions.

      Dataset Structure
    

    Configs:

    default

    Data Files:

    Split: train Path: data/train-*

    Dataset Info:

    Features: audio: Audio file transcription: Corresponding text transcription

    Splits: train… See the full description on the dataset page: https://huggingface.co/datasets/erenfazlioglu/turkishvoicedataset.

  7. h

    turkish_product_reviews

    • huggingface.co
    Updated Mar 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fatih Barmanbay (2025). turkish_product_reviews [Dataset]. https://huggingface.co/datasets/fthbrmnby/turkish_product_reviews
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 28, 2025
    Authors
    Fatih Barmanbay
    License

    https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for Turkish Product Reviews

      Dataset Summary
    

    This Turkish Product Reviews Dataset contains 235.165 product reviews collected online. There are 220.284 positive, 14881 negative reviews.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    The dataset is based on Turkish.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    Example 1: sentence: beklentimin altında bir ürün kaliteli değil sentiment: 0 (negative) Example 2:… See the full description on the dataset page: https://huggingface.co/datasets/fthbrmnby/turkish_product_reviews.

  8. s

    Wake Word Turkish Dataset

    • shaip.com
    Updated Oct 12, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Oct 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Home Wake Word Turkish DatasetHigh-Quality Turkish Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleWake Word Turkish Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word…

  9. F

    Turkish Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Turkish Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/turkish-general-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Turkish General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Turkish usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Turkish conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native Turkish speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native Turkish speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Turkish usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots
    <div

  10. R

    Turkish Number Plates Dataset

    • universe.roboflow.com
    zip
    Updated Jan 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    plakatanima (2024). Turkish Number Plates Dataset [Dataset]. https://universe.roboflow.com/plakatanima-vnt3k/turkish-number-plates
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 7, 2024
    Dataset authored and provided by
    plakatanima
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Plate Bounding Boxes
    Description

    Turkish Number Plates

    ## Overview
    
    Turkish Number Plates is a dataset for object detection tasks - it contains Plate annotations for 2,246 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
    
  11. s

    Wake Lo Lus Turkish Dataset

    • hmn.shaip.com
    Updated Aug 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Wake Lo Lus Turkish Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Tsev Wake Lo Lus Turkish DatasetHigh-Quality Turkish Wake Word Dataset rau AI & Cov Qauv Hais Lus Hu rau Peb Txheej TxheemTitleWake Lo Lus Turkish Lus DatasetDataset HomWake WordDescriptionWake Words / Voice Command / Trigger Word…

  12. h

    Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 12, 2017
    Dataset authored and provided by
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We have selected two most popular movie and hotel recommendation websites from those which attain a high rate in the Alexa website. We selected “beyazperde.com” and “otelpuan.com” for movie and hotel reviews, respectively. The reviews of 5,660 movies were investigated. The all 220,000 extracted reviews had been already rated by own authors using stars 1 to 5. As most of the reviews were positive, we selected the positive reviews as much as the negative ones to provide a balanced situation. The total of negative reviews rated by 1 or 2 stars were 26,700, thus, we randomly selected 26,700 out of 130,210 positive reviews rated by 4 or 5 stars. Overall, 53,400 movie reviews by the average length of 33 words were selected. The similar manner was used to hotel reviews with the difference that the hotel reviews had been rated by the numbers between 0 and 100 instead of stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews using 0 to 40 rating, we selected 5800 out of 6499 positive reviews rated from 80 to 100. The average length of all 11,600 selected positive and negative hotel reviews were 74 which is more than two times of the movie reviews.

  13. s

    I-Turkish Turkey Dataset

    • zu.shaip.com
    Updated Aug 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). I-Turkish Turkey Dataset [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Türkiye
    Description

    Ikhaya Lasekhaya laseTurkey I-Turkey DatasetTürkiye Türkiye Veri KümesiIkhwalithi ephezulu ye-Turkish Turkey Idathasethi ye-TTS ye-AI namamodeli wenkulumo Xhumana nathi UhlolojikeleleTitleTurkish Turkey Language DatasetIdathasethi UhloboTTSDescriptionUkurekhodwa kwezwi elilodwa, okuvamise ukuwa...

  14. R

    Turkish Sign Language Dataset

    • universe.roboflow.com
    zip
    Updated Mar 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    deneme (2025). Turkish Sign Language Dataset [Dataset]. https://universe.roboflow.com/deneme-uw6ew/turkish-sign-language
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 22, 2025
    Dataset authored and provided by
    deneme
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    A Bounding Boxes
    Description

    Turkish Sign Language

    ## Overview
    
    Turkish Sign Language is a dataset for object detection tasks - it contains A annotations for 331 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
    
  15. F

    Turkish Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Turkish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Contain & Diversity:

    Containing a total of 5000 images, this Turkish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, call outs, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcases distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition model we allow limited (less than five) unique images from a single resource. Stringent measures have been taken to exclude any personal identifiable information (PII), and in each image a minimum of 80% space is contain visible Turkish text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.

    All these images were captured by native Turkish people to ensure the text quality, avoid toxic content and PII text. We used latest iOS and android mobile devices above 5MP camera to click all these images to maintain the image quality. In this training dataset images are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Turkish text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Turkish crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Turkish language. Your journey to enhanced language understanding and processing starts here.

  16. s

    Wake Word Turkish Dataset

    • zu.shaip.com
    Updated Aug 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Wake Word Turkish Dataset [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Aug 25, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Ikhaya Wake IZwi I-Turkish DatasetIkhwalithi ephezulu ye-Turkish Wake Word Dataset ye-AI namamodeli Enkulumo Xhumana nathi UhlolojikeleleTitleWake Izwi Lase-Turkish Language DatasetDataset TypeWake WordDescriptionWake Amagama / Umyalo Wezwi / Qalisa Igama...

  17. s

    Takelaka data Turkish Turkey

    • mg.shaip.com
    Updated Dec 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Takelaka data Turkish Turkey [Dataset]. https://mg.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Dec 9, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Türkiye
    Description

    Home Tiorka Tiorkia DatasetTürkiye Türkiye Veri Kümesi Tiorka Tiorka TTS manana kalitao avo lenta ho an'ny AI sy maodely kabary Mifandraisa aminay OverviewTitleTitle Tiorkia Fiteny Dataset TypeTTSDescriptionFiraketana fitenenana tokana, izay mirona amin'ny…

  18. F

    Turkish Shopping List OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Turkish Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-shopping-list-ocr-image-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Contain & Diversity:

    Containing more than 2000 images, this Turkish OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text, including sentences, and individual item name words, quantity, comments, etc on shopping lists. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow limited (less than three) unique images in a single handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of space contains visible Turkish text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these shopping lists were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.

    If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.

  19. i

    Corona Virus (COVID-19) Turkish Tweets Dataset

    • ieee-dataport.org
    Updated May 19, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ibrahim Sabuncu (2020). Corona Virus (COVID-19) Turkish Tweets Dataset [Dataset]. https://ieee-dataport.org/open-access/corona-virus-covid-19-turkish-tweets-dataset-0
    Explore at:
    Dataset updated
    May 19, 2020
    Authors
    Ibrahim Sabuncu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Kovid

  20. s

    Wake Word Turkish Dataset

    • so.shaip.com
    Updated Jul 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shaip (2024). Wake Word Turkish Dataset [Dataset]. https://so.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Jul 30, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Home Wake Word DatasetTurkiga-Tayo-sare oo tayo leh Qalabka Tooska ah ee Turkiga ee AI & Qaababka Hadalka Nala soo xidhiidh DulmarkaTitleWake Word Xogta Luuqadda Turkiga

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset

turkish-sentiment-analysis-dataset

Turkish Sentiment Dataset

winvoker/turkish-sentiment-analysis-dataset

Explore at:
17 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 21, 2022
Authors
Batuhan
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dataset

This dataset contains positive , negative and notr sentences from several data sources given in the references. In the most sentiment models , there are only two labels; positive and negative. However , user input can be totally notr sentence. For such cases there were no data I could find. Therefore I created this dataset with 3 class. Positive and negative sentences are listed below. Notr examples are extraced from turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

Search
Clear search
Close search
Google apps
Main menu