100+ datasets found
  1. Turkish Tweets Dataset

    • kaggle.com
    zip
    Updated Apr 9, 2021
    Cite
    Anil Guven (2021). Turkish Tweets Dataset [Dataset]. https://www.kaggle.com/datasets/anil1055/turkish-tweet-dataset
    Explore at:
    Available download formats: zip (170312 bytes)
    Dataset updated
    Apr 9, 2021
    Authors
    Anil Guven
    Description

    The dataset consists of 5 emotion labels: anger, happy, distinguish, surprise and fear. There are 800 tweets for each label, so the dataset contains 4000 tweets in total.

    You can use the dataset in many areas, such as sentiment analysis, emotion analysis and topic modeling.

    Info: Hashtags and usernames were removed from the dataset. The dataset has been used in several studies, listed below.

    - (Please cite this article.) Güven, Z. A., Diri, B., & Çakaloğlu, T. (2020). Comparison of n-stage Latent Dirichlet Allocation versus other topic modeling methods for emotion analysis. Journal of the Faculty of Engineering and Architecture of Gazi University. https://doi.org/10.17341/gazimmfd.556104
    - Güven, Z. A., Diri, B., & Çakaloğlu, T. (2019). Emotion Detection with n-stage Latent Dirichlet Allocation for Turkish Tweets. Academic Platform Journal of Engineering and Science. https://doi.org/10.21541/apjes.459447
    - Guven, Z. A., Diri, B., & Cakaloglu, T. (2019). Comparison Method for Emotion Detection of Twitter Users. Proceedings - 2019 Innovations in Intelligent Systems and Applications Conference, ASYU 2019. https://doi.org/10.1109/ASYU48272.2019.8946435

  2. NER Dataset(Turkish)

    • kaggle.com
    zip
    Updated Apr 14, 2024
    Cite
    Akay (2024). NER Dataset(Turkish) [Dataset]. https://www.kaggle.com/datasets/akay16/ner-datasetturkish
    Explore at:
    Available download formats: zip (9149708 bytes)
    Dataset updated
    Apr 14, 2024
    Authors
    Akay
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data split:

    • 18,000 train
    • 1000 test
    • 1000 dev

    Labels:

    • CARDINAL
    • DATE
    • EVENT
    • FAC
    • GPE
    • LANGUAGE
    • LAW
    • LOC
    • MONEY
    • NORP
    • ORDINAL
    • ORG
    • PERCENT
    • PERSON
    • PRODUCT
    • QUANTITY
    • TIME
    • TITLE
    • WORK_OF_ART

    I do not **own** this dataset. I changed the original format for easier use and converted it into **csv** and .spacy files.

    You can reach the original version of the dataset from the **link** below:

    https://github.com/turkish-nlp-suite/Turkish-Wiki-NER-Dataset

  3. turkish-offensive-language-detection

    • huggingface.co
    Updated Sep 15, 2022
    Cite
    Toygar Tanyel (2022). turkish-offensive-language-detection [Dataset]. https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 15, 2022
    Authors
    Toygar Tanyel
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Summary

    This dataset is an enhanced version of existing offensive language datasets. Existing datasets are highly imbalanced, and solving this problem is too costly. To address this, we proposed a contextual data mining method for dataset augmentation. Our method removes the need to retrieve random tweets and label them individually. We can directly access tweets that are almost exactly hate-related and label them without any further human interaction in order to solve imbalanced… See the full description on the dataset page: https://huggingface.co/datasets/Toygar/turkish-offensive-language-detection.

  4. Turkish Wikipedia Dataset

    • kaggle.com
    zip
    Updated Mar 19, 2024
    Cite
    Osman Kagan Kurnaz (2024). Turkish Wikipedia Dataset [Dataset]. https://www.kaggle.com/datasets/osmankagankurnaz/turkish-wikipedia-dataset
    Explore at:
    Available download formats: zip (458865119 bytes)
    Dataset updated
    Mar 19, 2024
    Authors
    Osman Kagan Kurnaz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
    • The articles in this dataset are not specifically tagged for a particular task and the dataset is untagged.
    • This dataset is written in Turkish and was created by a team of volunteers using community engagement methods.
    • This dataset is an original dataset created from the Turkish Wikipedia.

    Thanks for using the Turkish Wikipedia dataset! We hope it will be useful for your language modeling and text generation tasks.

    Since the Turkish Wikipedia dataset was not on Kaggle, I took a dataset shared on Hugging Face, merged it into 2 parquet files and shared it on Kaggle. You can reach the version shared on Hugging Face from the link below. I would also like to thank https://huggingface.co/musabg for creating this dataset.

    Original link to this dataset: https://huggingface.co/datasets/musabg/wikipedia-tr

  5. Turkish-Alpaca

    • huggingface.co
    Updated Aug 31, 2023
    Cite
    TÜBİTAK Science High School AI Club (2023). Turkish-Alpaca [Dataset]. https://huggingface.co/datasets/TFLai/Turkish-Alpaca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2023
    Dataset authored and provided by
    TÜBİTAK Science High School AI Club
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Stanford alpaca turkish: Stanford Alpaca

  6. instruction-turkish

    • huggingface.co
    Updated Feb 7, 2026
    Cite
    Ahmet (2026). instruction-turkish [Dataset]. https://huggingface.co/datasets/atasoglu/instruction-turkish
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2026
    Authors
    Ahmet
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a machine-translated Turkish version of HuggingFaceH4/instruction-dataset. It was translated with googletrans==3.1.0a0.

  7. Turkish Turkey Dataset

    • shaip.com
    Updated Feb 10, 2023
    Cite
    Shaip (2023). Turkish Turkey Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/turkish-turkey-dataset/
    Explore at:
    Dataset updated
    Feb 10, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Turkey Scripted Monologue Dataset for AI & Speech Models

    Overview
    Title (Language): Turkish Turkey Language Dataset
    Dataset Types: Scripted Monologue
    Country: Turkey
    Description: The Scripted Monologue dataset consists…

  8. Wake Word Turkish Dataset

    • shaip.com
    Updated Oct 12, 2023
    Cite
    Shaip (2023). Wake Word Turkish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Oct 12, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models

    Overview
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  9. Wake Lo Lus Turkish Dataset

    • hmn.shaip.com
    Updated Aug 6, 2024
    Cite
    Shaip (2024). Wake Lo Lus Turkish Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models

    Overview
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  10. bilkent-turkish-writings-dataset

    • huggingface.co
    Updated May 24, 2025
    Cite
    Selim F. Yilmaz (2025). bilkent-turkish-writings-dataset [Dataset]. https://huggingface.co/datasets/selimfirat/bilkent-turkish-writings-dataset
    Explore at:
    Dataset updated
    May 24, 2025
    Authors
    Selim F. Yilmaz
    License

    https://choosealicense.com/licenses/other/

    Description

    Compilation of Bilkent Turkish Writings Dataset

    Dataset Description

    This is a comprehensive compilation of Turkish creative writings from Bilkent University's Turkish 101 and Turkish 102 courses (2014-2025). The dataset contains 9119 student writings originally created by students and instructors, focusing on creativity, content, composition, grammar, spelling, and punctuation development. Note: This dataset is a compilation and digitization of publicly available writings… See the full description on the dataset page: https://huggingface.co/datasets/selimfirat/bilkent-turkish-writings-dataset.

  11. Turkish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-turkish-turkey
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Turkish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Turkish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Turkish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Turkish speech models that understand and respond to authentic Turkish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Turkish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:

    - Speakers: 60 verified native Turkish speakers from FutureBeeAI’s contributor community.

    - Regions: Representing various provinces of Turkey to ensure dialectal diversity and demographic balance.

    - Demographics: A gender ratio of 60% male to 40% female, with participant ages ranging from 18 to 70 years.

    Recording Details:

    - Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.

    - Duration: Each conversation ranges from 15 to 60 minutes.

    - Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.

    - Environment: Quiet, echo-free settings with no background noise.
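
    As a rough illustration of what the recording format above implies for storage, the following sketch estimates the size of the audio. It assumes uncompressed PCM and ignores WAV headers and transcription files; the description does not state the actual download size.

```python
# Back-of-the-envelope size of uncompressed PCM audio,
# using the figures quoted in the description.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sample rate
BYTES_PER_SAMPLE = 2      # 16-bit depth
CHANNELS = 2              # stereo
HOURS = 30                # total duration stated in the description

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
total_bytes = bytes_per_second * HOURS * 3600

print(bytes_per_second)              # 64000
print(total_bytes)                   # 6912000000
print(round(total_bytes / 1e9, 2))   # 6.91 (GB, decimal)
```

    Under these assumptions the full corpus would take roughly 6.9 GB before any compression.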

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:

    - Family & Relationships

    - Food & Recipes

    - Education & Career

    - Healthcare Discussions

    - Social Issues

    - Technology & Gadgets

    - Travel & Local Culture

    - Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:

    - Speaker-segmented dialogues

    - Time-coded utterances

    - Non-speech elements (pauses, laughter, etc.)

    - High transcription accuracy, achieved through a double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
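
    For readers unfamiliar with the WER figure quoted above, here is a minimal sketch of the usual word error rate definition: word-level edit distance divided by the number of reference words. The description does not say how FutureBeeAI computes its figure, so the function and the example sentences below are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five.
reference = "bugün hava çok güzel değil"
hypothesis = "bugün hava çok güzel değildi"
print(word_error_rate(reference, hypothesis))  # 0.2
```

    A WER below 5% therefore means fewer than one word-level error per twenty reference words on average.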

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.

    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    License

    This Turkish General Conversation Dataset is created by FutureBeeAI and is available for commercial licensing.

  12. Turkish Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/turkish-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Turkish General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Turkish usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Turkish conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native Turkish speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700

    Turns per Chat: Up to 50 dialogue turns

    Contributors: 200 native Turkish speakers from the FutureBeeAI Crowd Community

    Format: TXT, DOCS, JSON or CSV (customizable)

    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    - Music, books, and movies

    - Health and wellness

    - Children and parenting

    - Family life and relationships

    - Food and cooking

    - Education and studying

    - Festivals and traditions

    - Environment and daily life

    - Internet and tech usage

    - Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Turkish usage with:

    - Colloquial expressions and local dialect influence

    - Domain-relevant terminology

    - Language-specific grammar, phrasing, and sentence flow

    - Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references

    - Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    - Participant Age

    - Gender

    - Country/Region

    - Chat Domain

    - Chat Topic

    - Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
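
    The kind of metadata-based filtering mentioned above could look like the sketch below. The record layout and every field name (`chat`, `topic`, `metadata`, `age`, `gender`, and so on) are hypothetical; the description lists which fields exist but not how they are named or stored.

```python
from collections import Counter

# Hypothetical records: field names and values are placeholders that mirror
# the metadata fields listed in the description, not the real schema.
records = [
    {"chat": "<chat text 1>", "topic": "Food and cooking",
     "metadata": {"age": 24, "gender": "female", "region": "<region>",
                  "domain": "General", "dialect": "<dialect>"}},
    {"chat": "<chat text 2>", "topic": "Health and wellness",
     "metadata": {"age": 41, "gender": "male", "region": "<region>",
                  "domain": "General", "dialect": "<dialect>"}},
    {"chat": "<chat text 3>", "topic": "Food and cooking",
     "metadata": {"age": 29, "gender": "male", "region": "<region>",
                  "domain": "General", "dialect": "<dialect>"}},
]


def filter_records(records, *, min_age=None, max_age=None, gender=None):
    """Return the records whose metadata satisfies every given constraint."""
    selected = []
    for record in records:
        meta = record["metadata"]
        if min_age is not None and meta["age"] < min_age:
            continue
        if max_age is not None and meta["age"] > max_age:
            continue
        if gender is not None and meta["gender"] != gender:
            continue
        selected.append(record)
    return selected


under_30 = filter_records(records, max_age=30)
topic_counts = Counter(record["topic"] for record in under_30)
print(len(under_30))       # 2
print(dict(topic_counts))  # {'Food and cooking': 2}
```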

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    - Manual review for content completeness

    - Format checks for chat turns and metadata

    - Linguistic verification by native speakers

    - Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Licensing

    This dataset is developed and owned by FutureBeeAI and is available under a commercial license. Custom licensing terms can be provided for academic, research, or enterprise clients.

  13. Wake Word Turkish Dataset

    • ny.shaip.com
    Updated Sep 20, 2025
    Cite
    Shaip (2025). Wake Word Turkish Dataset [Dataset]. https://ny.shaip.com/offerings/speech-data-catalog/wake-word-turkish-dataset/
    Explore at:
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Turkish Wake Word Dataset for AI & Speech Models

    Overview
    Title: Wake Word Turkish Language Dataset
    Dataset Type: Wake Word
    Description: Wake Words / Voice Command / Trigger Word…

  14. Turkish_Speech_Corpus

    • huggingface.co
    • kaggle.com
    Updated Dec 17, 2025
    Cite
    Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (2025). Turkish_Speech_Corpus [Dataset]. https://huggingface.co/datasets/issai/Turkish_Speech_Corpus
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 17, 2025
    Dataset authored and provided by
    Institute of Smart Systems and Artificial Intelligence, Nazarbayev University
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Turkish Speech Corpus (TSC)

    This repository presents an open-source Turkish Speech Corpus, introduced in "Multilingual Speech Recognition for Turkic Languages". The corpus contains 218.2 hours of transcribed speech with 186,171 utterances and was, at the time of publication, the largest publicly available Turkish dataset of its kind. Paper: Multilingual Speech Recognition for Turkic Languages.
    GitHub Repository: https://github.com/IS2AI/TurkicASR

    Citation

    @Article{info14020074… See the full description on the dataset page: https://huggingface.co/datasets/issai/Turkish_Speech_Corpus.

  15. Turkish Sentiment Analysis Dataset

    • humirapps.cs.hacettepe.edu.tr
    zip
    Updated Apr 12, 2017
    Cite
    Hacettepe University Multimedia Information Retrieval Laboratory (2017). Turkish Sentiment Analysis Dataset [Dataset]. http://doi.org/10.1109/SITIS.2016.57
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 12, 2017
    Dataset provided by
    Hacettepe University: http://hacettepe.edu.tr/
    Authors
    Hacettepe University Multimedia Information Retrieval Laboratory
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We selected the two most popular movie and hotel recommendation websites among those with a high Alexa ranking: "beyazperde.com" for movie reviews and "otelpuan.com" for hotel reviews. The reviews of 5,660 movies were examined. All 220,000 extracted reviews had already been rated by their authors on a scale of 1 to 5 stars. As most of the reviews were positive, we selected as many positive reviews as negative ones to obtain a balanced set. There were 26,700 negative reviews rated 1 or 2 stars, so we randomly selected 26,700 of the 130,210 positive reviews rated 4 or 5 stars. Overall, 53,400 movie reviews with an average length of 33 words were selected.

    The same approach was used for hotel reviews, except that hotel reviews had been rated with numbers between 0 and 100 instead of stars. From 18,478 reviews extracted from 550 hotels, a balanced set of positive and negative reviews was selected. As there were only 5,802 negative hotel reviews rated 0 to 40, we selected 5,800 of the 6,499 positive reviews rated 80 to 100. The average length of the 11,600 selected positive and negative hotel reviews was 74 words, more than twice that of the movie reviews.
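
    The balancing step described above is random undersampling of the majority class. The sketch below shows the idea on toy data; the record layout, the helper names, the fixed seed, and the choice to drop 3-star reviews are assumptions made for illustration, not details taken from the original study.

```python
import random
from collections import Counter


def label_from_stars(stars):
    """Map a 1-5 star rating to a sentiment label; 3-star reviews are left out."""
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None


def balance_by_undersampling(reviews, seed=0):
    """Keep every review of the smaller class and randomly sample
    an equal number of reviews from the larger class."""
    positives = [r for r in reviews if r["label"] == "positive"]
    negatives = [r for r in reviews if r["label"] == "negative"]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy ratings standing in for scraped reviews: 5 negative, 4 neutral, 25 positive.
star_ratings = [1] * 3 + [2] * 2 + [3] * 4 + [4] * 10 + [5] * 15
reviews = []
for review_id, stars in enumerate(star_ratings):
    label = label_from_stars(stars)
    if label is not None:
        reviews.append({"id": review_id, "stars": stars, "label": label})

balanced = balance_by_undersampling(reviews)
label_counts = Counter(r["label"] for r in balanced)
print(len(balanced), label_counts["positive"], label_counts["negative"])  # 10 5 5
```

    Applied to the movie reviews, the same procedure would keep all 26,700 negative reviews and draw 26,700 of the 130,210 positive ones.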

  16. Turkish Tiel Dataset

    • universe.roboflow.com
    zip
    Updated Dec 13, 2023
    Cite
    edoski (2023). Turkish Tiel Dataset [Dataset]. https://universe.roboflow.com/edoski/turkish-tiel
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 13, 2023
    Dataset authored and provided by
    edoski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Money Bounding Boxes
    Description

    Turkish Tiel

    ## Overview
    
    Turkish Tiel is a dataset for object detection tasks - it contains Money annotations for 604 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Turkish Question answering sentiment dataset (SCD)

    • ieee-dataport.org
    Updated Jul 3, 2023
    Cite
    kadir tohma (2023). Turkish Question answering sentiment dataset (SCD) [Dataset]. https://ieee-dataport.org/documents/turkish-question-answering-sentiment-dataset-scd
    Explore at:
    Dataset updated
    Jul 3, 2023
    Authors
    kadir tohma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    containing a total of 13

  18. NLI-TR (Turkish NLI Research)

    • kaggle.com
    zip
    Updated Dec 6, 2022
    Cite
    The Devastator (2022). NLI-TR (Turkish NLI Research) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-turkish-nli-research-with-the-nli-tr-d
    Explore at:
    Available download formats: zip (43062479 bytes)
    Dataset updated
    Dec 6, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NLI-TR (Turkish NLI Research)

    Unleash Your NLI Research in Turkish Language!

    By Huggingface Hub [source]

    About this dataset

    NLI-TR is a revolutionary set of two datasets that provide an unparalleled opportunity for the natural language processing and machine learning community to conduct inference research in the Turkish Language. The datasets - SNLI-TR and MNLI-TR - contain carefully curated natural language inference data that have been translated into Turkish. With NLI-TR, researchers can explore the exciting prospects of developing automated models tailored to make inferences on texts produced in this vibrant language. Moreover, they can also investigate how models trained on data from one language fare when applied in another, a valuable insight into cross-lingual generalization capabilities. NLI-TR offers both seasoned and budding researchers an unprecedented platform to further our understanding of natural language inferencing capability

    How to use the dataset

    How To Use The NLI-TR Dataset to Unlock Turkish NLI Research

    Welcome to the exciting world of natural language inference (NLI) research! If you’re looking for a great dataset to use for your research in this field, the NLI-TR dataset is a perfect starting point. This guide will provide an overview of how you can use the data from this dataset to uncover new insights about NLI tasks in Turkish.

    The NLI-TR dataset contains two large-scale datasets intended for natural language inference tasks: SNLI-TR and MNLI-TR. Both datasets offer researchers an opportunity to explore natural language inference (NLI) research in the Turkish language, with examples ranging from sentence paraphrasing and classification tasks to question answering scenarios using various NLP techniques.

    Using the Data:

    The data provided in this dataset includes both training and validation sets, making it easy for researchers who are just getting started with their projects. The snli_tr_train.csv file is used as input for training your models, while snli_tr_validation can be used for testing or validating model accuracy on unseen data. Additionally, the multinli_tr_validation_{matched / mismatched}.csv files offer further validation of how well your trained models perform on more complex scenarios such as sentence paraphrasing or question answering tasks.

    Each record includes up to four columns: premise, hypothesis, label, and (where present) domain. The premise column provides the context from which an inference is drawn. The hypothesis column provides the statement to be judged against that context. The label column denotes whether the premise entails the hypothesis (ENTAILMENT), contradicts it (CONTRADICTION), or does neither (NEUTRAL). A domain label has also been assigned by some authors when necessary; this mostly applies when inferring between sentences across different semantic domains such as weather, sports, or finance.
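
    To make the record layout above concrete, here is a small sketch that parses rows with those column names and checks the label distribution. The rows are invented for illustration, and the real files may encode labels differently (for example as integers), so the string labels here are an assumption based on the description.

```python
import csv
import io
from collections import Counter

# Hypothetical rows using the column names from the description.
sample_csv = """premise,hypothesis,label
Bir adam gitar çalıyor.,Bir adam müzik yapıyor.,ENTAILMENT
Bir adam gitar çalıyor.,Adam uyuyor.,CONTRADICTION
Bir adam gitar çalıyor.,Adam bir konserde.,NEUTRAL
"""

VALID_LABELS = {"ENTAILMENT", "CONTRADICTION", "NEUTRAL"}

rows = list(csv.DictReader(io.StringIO(sample_csv)))
label_counts = Counter(row["label"] for row in rows)
invalid_rows = [row for row in rows if row["label"] not in VALID_LABELS]

print(len(rows))           # 3
print(dict(label_counts))  # {'ENTAILMENT': 1, 'CONTRADICTION': 1, 'NEUTRAL': 1}
print(len(invalid_rows))   # 0
```

    To read an actual file, `io.StringIO(sample_csv)` would be replaced by an opened file handle, for example `open("snli_tr_train.csv", encoding="utf-8", newline="")`.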

    Research Ideas

    • Developing an NLI-based Turkish language question answering system.
    • Training a sentiment analysis algorithm to identify sentiment in text written in Turkish.
    • Building a Machine Learning Chatbot that uses NLI to understand conversational context and respond accordingly for users intending to converse in the Turkish language

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: snli_tr_train.csv | Column name | Description | |:---------------|:------------------------------------------------------------------------------------------------------------------------------------------...

  19. Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün...

    • zenodo.org
    bin
    Updated Aug 27, 2024
    Cite
    M. Ali Bayram (2024). Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti [Dataset]. http://doi.org/10.5281/zenodo.13375018
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    M. Ali Bayram
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Important Note: An error was found in the answer column of this dataset and has been fixed in the latest version. It is very important to use the latest version.

    Turkish MMLU: Yapay Zeka ve Akademik Uygulamalar İçin En Kapsamlı ve Özgün Türkçe Veri Seti (Turkish MMLU: The Most Comprehensive and Original Turkish Dataset for AI and Academic Applications)

    The Turkish MMLU dataset is a comprehensive and original resource, specifically designed for training, fine-tuning, and evaluating AI models in Turkish. With 293,468 questions, this dataset stands as the most extensive collection in its field, covering a wide range of academic and professional subjects relevant to Turkey, including major exams like TUS (Medical Specialization Examination), KPSS (Public Personnel Selection Examination), and many others.

    Key Features:

    Completely Original Content: The dataset is entirely created from original Turkish sources, ensuring authenticity and relevance to the Turkish context. It has not been translated from other languages, which is crucial for maintaining the integrity of the language data.

    Extensive Data Volume: With nearly 300,000 questions, the dataset offers a substantial corpus for training models, enabling deep learning algorithms to gain a nuanced understanding of the Turkish language across diverse topics.

    Detailed Structure: The dataset is organized into six key columns:

    ‘bölüm’ (section): Indicates the broader exam or category.

    ‘konu’ (subject): Specifies the topic within the section.

    ‘soru’ (question): The question text itself.

    ‘cevap’ (answer): The correct answer to the question.

    ‘aciklama’ (explanation): Provides additional context or reasoning for the answer, crucial for models to understand the logic behind correct responses.

    ‘secenekler’ (options): The possible answer choices, essential for multiple-choice formats.
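
The six columns listed above can be mirrored in a small record type. The following is a minimal sketch, not the author's loader: it assumes each row arrives as a dict keyed by the Turkish column names and that `secenekler` is a list of strings (the actual storage format is not stated in the description). `Question`, `parse_row`, `format_prompt`, and `SAMPLE_ROW` are hypothetical names, and the sample row is invented for illustration.

```python
from dataclasses import dataclass
from typing import List

# Column names as given in the dataset description.
COLUMNS = ["bölüm", "konu", "soru", "cevap", "aciklama", "secenekler"]


@dataclass
class Question:
    section: str        # 'bölüm'
    subject: str        # 'konu'
    question: str       # 'soru'
    answer: str         # 'cevap'
    explanation: str    # 'aciklama'
    options: List[str]  # 'secenekler' (assumed to be a list of strings)


def parse_row(row: dict) -> Question:
    """Convert one raw row, keyed by the Turkish column names, to a Question."""
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return Question(
        section=row["bölüm"],
        subject=row["konu"],
        question=row["soru"],
        answer=row["cevap"],
        explanation=row["aciklama"],
        options=list(row["secenekler"]),
    )


def format_prompt(q: Question) -> str:
    """Render the question followed by lettered options, one per line."""
    lines = [q.question]
    for i, option in enumerate(q.options):
        lines.append(f"{chr(ord('A') + i)}) {option}")
    return "\n".join(lines)


# Invented example row; it is not taken from the dataset.
SAMPLE_ROW = {
    "bölüm": "KPSS",
    "konu": "Coğrafya",
    "soru": "Türkiye'nin başkenti neresidir?",
    "cevap": "Ankara",
    "aciklama": "Ankara, 1923'ten beri başkenttir.",
    "secenekler": ["İstanbul", "Ankara", "İzmir"],
}

if __name__ == "__main__":
    print(format_prompt(parse_row(SAMPLE_ROW)))
```

If the options turn out to be stored as a single delimited string, `parse_row` would need a splitting step before building the list.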

    Wide Range of Sections and Subjects: The dataset includes 67 sections covering over 800 unique subjects. These sections span from specialized medical fields in TUS to general knowledge and vocational exams like KPSS and Ehliyet, ensuring that the dataset reflects the complexity and breadth of Turkish academic and professional content.

    Dataset Source and Usage:

    Data Source: The dataset is compiled from publicly available data on the internet. While care has been taken to ensure that the data is original, there may be instances where some questions contain copyrighted material. If any copyright holders identify their material within the dataset, they are encouraged to contact the author, and the specific question will be promptly removed.

    Non-Commercial Use: This dataset is strictly intended for research and academic purposes. It cannot be used for commercial purposes under any circumstances.

    Importance for AI Models:

    1. Training: The vast number of questions, coupled with detailed explanations, makes this dataset an invaluable resource for training AI models to understand and process Turkish at a high level. The diversity of topics also ensures that the model is exposed to a wide range of vocabulary, concepts, and linguistic structures.

    2. Fine-Tuning: For researchers and developers looking to fine-tune existing models, such as GPT, BERT, or other transformer-based architectures, this dataset offers domain-specific content that can significantly enhance performance in areas like medical language processing, legal text analysis, or general-purpose Turkish language understanding.

    3. Evaluation: The Turkish MMLU dataset is ideal for evaluating the performance of AI models in Turkish. With its rich content and structured format, it allows for rigorous testing across various subjects, helping to measure how well a model can comprehend and generate accurate responses in Turkish.

    4. Real-World Application: Beyond academic research, this dataset is also highly applicable in developing AI-powered tools for exam preparation, automated tutoring systems, and educational applications that require a deep understanding of the Turkish language and its diverse domains.
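
For the evaluation use case, one way to test "across various subjects" is to report accuracy per section. The sketch below assumes predictions have already been collected as (section, correct answer, predicted answer) triples; the function name `accuracy_by_section` and the toy results are hypothetical, and exact string matching after trimming whitespace is a simplifying assumption rather than a scoring rule defined by the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_section(
    results: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Return the fraction of correct predictions for each section.

    Each item in `results` is (section, gold_answer, predicted_answer).
    Answers are compared exactly after stripping surrounding whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for section, gold, predicted in results:
        total[section] += 1
        if predicted.strip() == gold.strip():
            correct[section] += 1
    return {section: correct[section] / total[section] for section in total}


# Toy results, invented for illustration only.
TOY_RESULTS = [
    ("TUS", "B", "B"),
    ("TUS", "C", "A"),
    ("KPSS", "A", "A "),
]

if __name__ == "__main__":
    for section, acc in accuracy_by_section(TOY_RESULTS).items():
        print(f"{section}: {acc:.2f}")
```

Case folding is deliberately left out, because lower-casing Turkish text (for example `I` versus `ı`) depends on locale rules that Python's default `str.lower()` does not follow.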

    Example Sections:

    Medical Exams (TUS): Includes specialized subjects such as Farmakoloji, Patoloji, Mikrobiyoloji, and more, which are critical for training models intended for medical documentation or decision support systems.

    Public and Professional Exams (KPSS): Encompasses a wide array of subjects like Genel Kültür, Tarih, Coğrafya, and Vatandaşlık, making it valuable for general-purpose models.

    Diverse Topics: Ranging from Dini Bilgiler and Futbol to İlahiyat and İşletme, this dataset provides a robust foundation for models that need to handle a variety of real-world questions in Turkish.

    Potential Uses:

    Model Training: Utilize the dataset to train AI models from scratch, providing a foundational understanding of Turkish in both general and specialized contexts.

    Fine-Tuning Pre-Trained Models: Enhance existing models by fine-tuning them on this dataset, allowing them to achieve better performance in Turkish language tasks.

    Evaluation and Benchmarking: Test and benchmark the capabilities of AI models, ensuring they meet the necessary standards for comprehension and response generation in Turkish.

    AI-Powered Educational Tools: Develop intelligent tutoring systems or exam preparation tools that can assist students and professionals in mastering complex subjects.

    Conclusion:

    The Turkish MMLU dataset is not just a collection of questions and answers; it is a comprehensive and original tool designed to advance the development of AI in the Turkish language. Whether you are training new models, fine-tuning existing ones, or evaluating their performance, this dataset offers the depth and breadth needed to push the boundaries of natural language processing in Turkish. Its originality and extensive scope make it an indispensable resource for anyone working in this field.

  20. Turkish Book Data Set

    • kaggle.com
    zip
    Updated Jan 12, 2024
    Cite
    Muhammed İbrahim Top (2024). Turkish Book Data Set [Dataset]. https://www.kaggle.com/datasets/muhammedbrahimtop/turkish-book-data-set
    Explore at:
    Available download formats: zip (17318668 bytes)
    Dataset updated
    Jan 12, 2024
    Authors
    Muhammed İbrahim Top
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Turkish Book Data Set

    This dataset is a comprehensive compilation of Turkish books obtained through web scraping from the internet. Each record in the dataset contains essential information such as the book's title, author, publisher, publication year, page count, category, description, and image URL.

    The dataset can be used in applications such as classification, content-based recommendation, and natural language processing (NLP). For researchers, students, and data scientists, it is a resource for exploring Turkish literature, generating book recommendations, or developing machine learning models.

