100+ datasets found
  1. Turkish Tweets Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2021
    Cite
    Anil Guven (2021). Turkish Tweets Dataset [Dataset]. https://www.kaggle.com/datasets/anil1055/turkish-tweet-dataset
    Available download formats: zip (170312 bytes)
    Dataset updated
    Apr 9, 2021
    Authors
    Anil Guven
    Description

    The dataset has 5 emotion labels: anger, happy, distinguish, surprise and fear. There are 800 tweets for each label, so the dataset contains 4000 tweets in total.

    You can use the dataset in areas such as sentiment analysis, emotion analysis and topic modeling.

    Info: Hashtags and usernames were removed from the dataset. The dataset has been used in several studies, listed below:
    - (please cite this article) Güven, Z. A., Diri, B., & Çakaloğlu, T. (2020). Comparison of n-stage Latent Dirichlet Allocation versus other topic modeling methods for emotion analysis. Journal of the Faculty of Engineering and Architecture of Gazi University. https://doi.org/10.17341/gazimmfd.556104
    - Güven, Z. A., Diri, B., & Çakaloğlu, T. (2019). Emotion Detection with n-stage Latent Dirichlet Allocation for Turkish Tweets. Academic Platform Journal of Engineering and Science. https://doi.org/10.21541/apjes.459447
    - Guven, Z. A., Diri, B., & Cakaloglu, T. (2019). Comparison Method for Emotion Detection of Twitter Users. Proceedings - 2019 Innovations in Intelligent Systems and Applications Conference, ASYU 2019. https://doi.org/10.1109/ASYU48272.2019.8946435

  2. stsb-mt-turkish

    • huggingface.co
    Updated Feb 1, 2022
    Cite
    Emrecan Çelik (2022). stsb-mt-turkish [Dataset]. https://huggingface.co/datasets/emrecan/stsb-mt-turkish
    Dataset updated
    Feb 1, 2022
    Authors
    Emrecan Çelik
    Description

    STSb Turkish

    Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the English STSb dataset and has not been reviewed by expert human translators. Uploaded from this repository.

  3. turkish-sentiment-analysis-dataset

    • huggingface.co
    • kaggle.com
    Updated Jun 21, 2022
    Cite
    Batuhan (2022). turkish-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset
    Dataset updated
    Jun 21, 2022
    Authors
    Batuhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset contains positive, negative and neutral ("notr") sentences from several data sources given in the references. Most sentiment models have only two labels, positive and negative. However, user input can be a completely neutral sentence, and I could not find any data for such cases. Therefore I created this dataset with 3 classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.

  4. All Turkish Words Dataset 📃🖊️

    • kaggle.com
    zip
    Updated Mar 14, 2024
    Cite
    Enis Tuna (2024). All Turkish Words Dataset 📃🖊️ [Dataset]. https://www.kaggle.com/datasets/enistuna/all-turkish-words-dataset
    Available download formats: zip (42391799 bytes)
    Dataset updated
    Mar 14, 2024
    Authors
    Enis Tuna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ALL TURKISH WORDS DATASET

    This dataset contains all the Turkish words I've managed to fetch from the web. The dataset has approximately 7 million lines of Turkish word tokens, each separated by " " so it is easier to read.

    Some words are different variations of the same word e.g. "araba", "arabada", "arabadan". Feel free to use lemmatization algorithms to reduce the data size.
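
As a rough illustration of how inflected variants such as "araba", "arabada" and "arabadan" could be collapsed, here is a minimal sketch. It is not a real lemmatizer: the suffix list, `naive_stem` and `group_by_stem` are hypothetical names introduced for this example, and the rule ignores vowel harmony, consonant changes and most Turkish suffixes, so it will produce false matches on real data.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix list (ablative and locative endings only).
# Longer suffixes come first so "dan" is tried before "da".
SUFFIXES = ("dan", "den", "tan", "ten", "da", "de", "ta", "te")


def naive_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip at most one known suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Group surface forms under the stem produced by naive_stem."""
    groups = defaultdict(list)
    for word in words:
        groups[naive_stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["araba", "arabada", "arabadan", "ev", "evde"])
print(groups)
```

For serious size reduction, a proper morphological analyzer for Turkish would be a better fit than suffix stripping.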

    I believe this dataset could be improved upon. It certainly is not finished. I will update this dataset if I can get my hands on new words in the future.

    My LinkedIn: https://www.linkedin.com/in/enistuna/
    My GitHub: https://github.com/enistuna

  5. Turkish Turkey Dataset

    • shaip.com
    Updated Feb 10, 2023
    Cite
    Shaip (2023). Turkish Turkey Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Turkey Scripted Monologue Dataset for AI & Speech Models
    Title (Language): Turkish Turkey Language Dataset
    Dataset Types: Scripted Monologue
    Country: Turkey
    Description: The Scripted Monologue dataset consists…

  6. Turkish Idioms and Proverbs

    • kaggle.com
    zip
    Updated Feb 2, 2023
    Cite
    Emre Okcular (2023). Turkish Idioms and Proverbs [Dataset]. https://www.kaggle.com/datasets/emreokcular/turkish-idioms-and-proverbs
    Available download formats: zip (766620 bytes)
    Dataset updated
    Feb 2, 2023
    Authors
    Emre Okcular
    Description

    A useful resource for Turkish NLP tasks. The dataset was collected using a TDK (Türk Dil Kurumu, the Turkish Language Association) endpoint.

    It contains Turkish proverbs and idioms together with their meanings.

    Disclaimer: It might contain missing or wrong idioms/proverbs.

  7. Wake Word Turkish Dataset

    • shaip.com
    Updated Oct 12, 2023
    Cite
    Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Dataset updated
    Oct 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  8. Turkish Number Plates Dataset

    • universe.roboflow.com
    zip
    Updated Jan 7, 2024
    Cite
    plakatanima (2024). Turkish Number Plates Dataset [Dataset]. https://universe.roboflow.com/plakatanima-vnt3k/turkish-number-plates
    Available download formats: zip
    Dataset updated
    Jan 7, 2024
    Dataset authored and provided by
    plakatanima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Plate Bounding Boxes
    Description

    Turkish Number Plates

    ## Overview
    
    Turkish Number Plates is a dataset for object detection tasks - it contains Plate annotations for 2,246 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  9. Turkish Speecon database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Feb 22, 2007
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). Turkish Speecon database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0178/
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Turkish Speecon database is divided into 2 sets:
    1) The first set comprises the recordings of 550 adult Turkish speakers (280 males, 270 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place).
    2) The second set comprises the recordings of 50 child Turkish speakers (25 boys, 25 girls), recorded over 4 microphone channels in 1 recording environment (children room).

    This database is partitioned into 28 DVDs (first set) and 4 DVDs (second set).

    The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.

    Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.

    Each speaker uttered the following items:
    Calibration data: 6 noise recordings
    The “silence word” recording
    Free spontaneous items (adults only): 3 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)
    17 Elicited spontaneous items (adults only): 3 dates, 2 times, 3 proper names, 2 city name, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language
    Read speech:
      30 phonetically rich sentences uttered by adults and 60 uttered by children
      5 phonetically rich words (adults only)
      4 isolated digits
      1 isolated digit sequence
      4 connected digit sequences
      1 telephone number
      3 natural numbers
      1 money amount
      2 time phrases (T1 : analogue, T2 : digital)
      3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)
      3 letter sequences
      1 proper name
      2 city or street names
      2 questions
      2 special keyboard characters
      1 Web address
      1 email address
      222 application specific words and phrases per session (adults)
      74 toy commands, 14 general commands, 31 phone commands and 4 application word synonyms (children)

    The following age distribution has been obtained:
    Adults: 244 speakers are between 15 and 30, 235 speakers are between 31 and 45, and 71 speakers are over 46.
    Children: 25 speakers are between 8 and 10, 25 speakers are between 11 and 15.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
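
The sample format described above (16 kHz, 16 bit, unsigned integers, lo-hi byte order) can be decoded with the standard library. The sketch below is only an illustration: it assumes a headerless stream of raw samples, which the description does not confirm, and `decode_u16le` and `duration_seconds` are hypothetical helper names. The actual file layout should be checked against the database documentation.

```python
import struct


def decode_u16le(raw: bytes) -> list[int]:
    """Decode raw bytes as unsigned 16-bit little-endian (lo-hi) samples."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    count = len(raw) // 2
    # "<" selects little-endian byte order, "H" is an unsigned 16-bit integer.
    return list(struct.unpack(f"<{count}H", raw))


def duration_seconds(raw: bytes, sample_rate: int = 16_000) -> float:
    """Duration of a single-channel buffer at the given sample rate."""
    return (len(raw) // 2) / sample_rate


# Three hand-written samples: low byte first, then high byte.
raw = bytes([0x00, 0x80, 0xFF, 0xFF, 0x01, 0x00])
samples = decode_u16le(raw)
print(samples)
```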

  10. Wake Word Turkish Dataset

    • ny.shaip.com
    Updated Sep 20, 2025
    Cite
    Shaip (2025). Wake Word Turkish Dataset [Dataset]. https://ny.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  11. Turkish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-turkish-turkey
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
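
The stated audio format (stereo WAV, 16-bit depth, 16 kHz) can be verified before training with Python's standard `wave` module. This is a minimal sketch, not part of the dataset's tooling: `wav_matches_spec` and `make_silent_wav` are hypothetical helpers, and the in-memory silent file stands in for a real recording.

```python
import io
import wave


def wav_matches_spec(wav_bytes: bytes, channels: int = 2,
                     sample_width_bytes: int = 2,
                     sample_rate: int = 16_000) -> bool:
    """Return True if the WAV header matches the expected format."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return (wav.getnchannels() == channels
                and wav.getsampwidth() == sample_width_bytes
                and wav.getframerate() == sample_rate)


def make_silent_wav(channels: int, sample_width_bytes: int,
                    sample_rate: int, frames: int = 160) -> bytes:
    """Build a short silent WAV in memory, used here only as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (frames * channels * sample_width_bytes))
    return buffer.getvalue()


stereo_ok = wav_matches_spec(make_silent_wav(2, 2, 16_000))
mono_ok = wav_matches_spec(make_silent_wav(1, 2, 16_000))
print(stereo_ok, mono_ok)
```

For files on disk, the same check works by passing a path to `wave.open` instead of a `BytesIO` object.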

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Turkish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Turkish.
    Voice Assistants: Build smart assistants capable of understanding natural Turkish conversations.

  12. Turkish Shopping List OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-shopping-list-ocr-image-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Content & Diversity:

    Containing more than 2000 images, this Turkish OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll find a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, and comments. The images showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Turkish text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these shopping lists were written and images were captured by native Turkish people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.

    If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.

  13. Turkish_Speech_Corpus

    • huggingface.co
    • kaggle.com
    Updated Dec 17, 2025
    Cite
    Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (2025). Turkish_Speech_Corpus [Dataset]. https://huggingface.co/datasets/issai/Turkish_Speech_Corpus
    Dataset updated
    Dec 17, 2025
    Dataset authored and provided by
    Institute of Smart Systems and Artificial Intelligence, Nazarbayev University
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Turkish Speech Corpus (TSC)

    This repository presents an open-source Turkish Speech Corpus, introduced in "Multilingual Speech Recognition for Turkic Languages". The corpus contains 218.2 hours of transcribed speech with 186,171 utterances and is the largest publicly available Turkish dataset of its kind at that time. Paper: Multilingual Speech Recognition for Turkic Languages.
    GitHub Repository: https://github.com/IS2AI/TurkicASR

      Citation
    

    @Article{info14020074… See the full description on the dataset page: https://huggingface.co/datasets/issai/Turkish_Speech_Corpus.

  14. turkish-academic-theses-dataset

    • huggingface.co
    Updated Nov 10, 2025
    Cite
    Umut Ertuğrul Daşgın (2025). turkish-academic-theses-dataset [Dataset]. https://huggingface.co/datasets/umutertugrul/turkish-academic-theses-dataset
    Dataset updated
    Nov 10, 2025
    Authors
    Umut Ertuğrul Daşgın
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📚 Turkish Academic Theses Abstracts (TR/EN)

    A large-scale bilingual (Turkish–English) collection of abstracts from Turkish academic theses (YÖK Ulusal Tez Merkezi). This dataset focuses only on abstracts, provided in Turkish (abstract_tr) and English (abstract_en), suitable for summarization, translation, classification, and retrieval.

    Records: ~650k abstracts (TR & EN)
    Format: Parquet (.parquet), fast & analytics-friendly
    Language: 🇹🇷 Turkish + 🇬🇧 English (parallel abstracts… See the full description on the dataset page: https://huggingface.co/datasets/umutertugrul/turkish-academic-theses-dataset.

  15. Turkish Question answering sentiment dataset (SCD)

    • ieee-dataport.org
    Updated Jul 3, 2023
    Cite
    kadir tohma (2023). Turkish Question answering sentiment dataset (SCD) [Dataset]. https://ieee-dataport.org/documents/turkish-question-answering-sentiment-dataset-scd
    Dataset updated
    Jul 3, 2023
    Authors
    kadir tohma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    containing a total of 13

  16. Turkish Polite Dataset

    • kaggle.com
    zip
    Updated Apr 17, 2025
    Cite
    Yunus Emre Akca (2025). Turkish Polite Dataset [Dataset]. https://www.kaggle.com/datasets/yunusemreakca/turkish-polite-dataset
    Available download formats: zip (98871 bytes)
    Dataset updated
    Apr 17, 2025
    Authors
    Yunus Emre Akca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains chat-based dialogs in the Turkish language. The dialogs are written in a natural, polite and supportive style. The interactions between the user and the chatbot aim to provide information and support on different topics. This dataset is suitable for Turkish natural language processing (NLP) projects and can be used in areas such as chatbots, language modeling and text analysis.

  17. Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Available download formats: zip
    Dataset updated
    Apr 12, 2017
    Dataset provided by
    Hacettepe University: http://hacettepe.edu.tr/
    Authors
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We selected the two most popular movie and hotel recommendation websites among those with a high Alexa ranking: “beyazperde.com” for movie reviews and “otelpuan.com” for hotel reviews. The reviews of 5,660 movies were investigated. All 220,000 extracted reviews had already been rated by their authors with 1 to 5 stars. As most of the reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews rated 1 or 2 stars, so we randomly selected 26,700 of the 130,210 positive reviews rated 4 or 5 stars. Overall, 53,400 movie reviews with an average length of 33 words were selected. The same approach was used for hotel reviews, except that hotel reviews had been rated with numbers between 0 and 100 instead of stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews rated 0 to 40, we selected 5,800 of the 6,499 positive reviews rated 80 to 100. The average length of the 11,600 selected positive and negative hotel reviews was 74 words, more than twice that of the movie reviews.
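
The balancing step for the movie reviews can be sketched as random downsampling of the larger class. This is an illustration under assumptions, not the authors' code: `balance_by_downsampling` is a hypothetical helper, the fixed seed is arbitrary, and integer IDs stand in for the actual review texts.

```python
import random


def balance_by_downsampling(positive, negative, seed=0):
    """Randomly sample both classes down to the size of the smaller one."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n = min(len(positive), len(negative))
    return rng.sample(positive, n), rng.sample(negative, n)


# Placeholder IDs using the movie-review counts quoted in the description.
positive_ids = list(range(130_210))  # reviews rated 4 or 5 stars
negative_ids = list(range(26_700))   # reviews rated 1 or 2 stars

pos_sel, neg_sel = balance_by_downsampling(positive_ids, negative_ids)
print(len(pos_sel), len(neg_sel), len(pos_sel) + len(neg_sel))
```

Sampling without replacement keeps every selected review unique, which matches the idea of choosing a subset of the existing positive reviews.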

  18. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...

    • data.mendeley.com
    Updated Feb 9, 2017
    Cite
    H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
    Dataset updated
    Feb 9, 2017
    Authors
    H. Bahadir Sahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

    First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named entities and the English gazetteers contain approximately 23M named entities.

    By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent (b) domain-independent. We produce two different versions by post-processing raw collections. As a result of this process, we introduced 3 versions of TWNERTC and EWNERTC: (a) raw (b) domain-dependent post-processed (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (varies between versions), while English collections contain more than 7M sentences.

    We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types into "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained type. Note that this process also eliminated many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.

    All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.

  19. interpress_news_category_tr

    • huggingface.co
    • opendatalab.com
    Updated Mar 7, 2013
    Cite
    Yavuz Kömeçoğlu (2013). interpress_news_category_tr [Dataset]. https://huggingface.co/datasets/yavuzkomecoglu/interpress_news_category_tr
    Dataset updated
    Mar 7, 2013
    Authors
    Yavuz Kömeçoğlu
    License

    https://choosealicense.com/licenses/unknown/

    Description

    It is a Turkish news dataset consisting of 273,601 news articles in 17 categories, compiled from print media and news websites between 2010 and 2017 by the Interpress (https://www.interpress.com/) media monitoring company.

  20. turkish-ocr

    • huggingface.co
    Updated Nov 4, 2025
    Cite
    C. Emre Karataş (2025). turkish-ocr [Dataset]. https://huggingface.co/datasets/emredeveloper/turkish-ocr
    Dataset updated
    Nov 4, 2025
    Authors
    C. Emre Karataş
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Turkish OCR Synthetic Dataset

    A synthetic dataset for Turkish OCR (Optical Character Recognition) containing 1,000 realistic images with Turkish text.

      Features
    

    Text types: Handwritten (70%) and printed (30%)
    Full Turkish character support: Ç, ç, Ğ, ğ, İ, ı, Ö, ö, Ş, ş, Ü, ü
    Realistic backgrounds: Lined paper, grid paper, and textured paper
    Augmentations: Motion blur, noise, perspective distortion, and ink artifacts

      Data Format
    

    Each sample contains:

    image:… See the full description on the dataset page: https://huggingface.co/datasets/emredeveloper/turkish-ocr.
