100+ datasets found
  1. Bitext-customer-support-llm-chatbot-training-dataset

    • huggingface.co
    • opendatalab.com
    Updated Oct 1, 2024
    Cite
    Bitext (2024). Bitext-customer-support-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 1, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.
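
    Several of the Hugging Face entries in this list expose their metadata in the Croissant format (JSON-LD). As a minimal sketch of consuming that metadata, the snippet below parses an illustrative Croissant-style document with the standard json module; the record-set field names ("instruction", "response") are assumptions, not confirmed contents of the actual Croissant file.

```python
import json

# Illustrative Croissant-style JSON-LD snippet (hypothetical values; real
# metadata comes from the dataset's "Croissant" download link).
croissant_doc = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "Bitext-customer-support-llm-chatbot-training-dataset",
  "license": "https://choosealicense.com/licenses/cdla-sharing-1.0/",
  "recordSet": [
    {"name": "default", "field": [{"name": "instruction"}, {"name": "response"}]}
  ]
}
"""

meta = json.loads(croissant_doc)
print(meta["name"])
# Collect the field names declared in each record set
fields = [f["name"] for rs in meta["recordSet"] for f in rs["field"]]
print(fields)
```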

  2. Image and Video Description Data | 1 PB | Multimodal Data | GenAI | LLM Data...

    • datarade.ai
    Updated Jan 3, 2025
    Cite
    Nexdata (2025). Image and Video Description Data | 1 PB | Multimodal Data | GenAI | LLM Data | Large Language Model(LLM) Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-image-and-video-description-data-1-pb-multimoda-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Netherlands, Finland, Ecuador, Malta, Mexico, Czech Republic, United Arab Emirates, Israel, Belgium, Canada
    Description
    1. Image Description Data
    Data Size: 500 million pairs
    Image Type: generic scenes (portraits, landscapes, animals, etc.), human actions, picture books, magazines, PPT & charts, app screenshots, etc.
    Resolution: 4K+
    Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
    Description Length: text length is no less than 250 words
    Format: the image format is .jpg, the annotation format is .json, and the description format is .txt

    2. Video Description Data
    Data Size: 10 million pairs
    Video Type: generic scenes (portraits, landscapes, animals, etc.), ads, TV sports, documentaries
    Resolution: 1080p+
    Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
    Description Length: text length is no less than 250 words
    Format: .mp4, .mov, .avi and other common video formats; .xlsx for annotation files

    3. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
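
    The delivery format above (.jpg image, .json annotation, .txt description) implies one file triple per sample. A small sketch of grouping such a delivery by file stem, using fabricated file names and contents since the actual directory layout is not specified in the listing:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def pair_samples(root: Path):
    """Group files by stem into image/annotation/description triples,
    following the listed delivery format (.jpg / .json / .txt)."""
    triples = {}
    for p in root.iterdir():
        triples.setdefault(p.stem, {})[p.suffix] = p
    # Keep only stems that have all three files
    return {s: t for s, t in triples.items()
            if {".jpg", ".json", ".txt"} <= set(t)}

with TemporaryDirectory() as d:
    root = Path(d)
    # Fabricate one sample (hypothetical file names and contents)
    (root / "img_0001.jpg").write_bytes(b"\xff\xd8\xff")  # JPEG magic bytes only
    (root / "img_0001.json").write_text(json.dumps({"lang": "English"}))
    (root / "img_0001.txt").write_text("A description of at least 250 words ...")
    pairs = pair_samples(root)
    print(sorted(pairs))
```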

  3. cncf-question-and-answer-dataset-for-llm-training

    • huggingface.co
    Updated Nov 29, 2020
    Cite
    Kubermatic (2020). cncf-question-and-answer-dataset-for-llm-training [Dataset]. https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training
    Explore at:
    Croissant
    Dataset updated
    Nov 29, 2020
    Dataset authored and provided by
    Kubermatic
    Description

    CNCF QA Dataset for LLM Tuning

      Description
    

    This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed into the LLM model. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.

  4. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Iceland, Equatorial Guinea, Benin, Djibouti, Belize, Russian Federation, Saudi Arabia, Antigua and Barbuda, Qatar, Colombia
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  5. Telugu Open Ended Classification Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Telugu Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/telugu-open-ended-classification-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Telugu Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.

    Dataset Content:

    This open-ended classification dataset comprises a diverse set of prompts and responses, where each prompt contains the input text to be classified and may also contain a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both the prompts and completions are in the Telugu language. As this is an open-ended dataset, no answer options are given as part of the prompt to choose the right classification category from.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Telugu people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Prompt Diversity:

    To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats:

    To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Telugu Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
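
    Given the annotation details listed above, a CSV delivery can be filtered with the standard csv module. The column names and rows below are hypothetical stand-ins, since the exact headers of the FutureBeeAI files are not published in this listing:

```python
import csv
import io

# Hypothetical CSV matching the annotation fields described above
raw = """id,prompt,prompt_type,prompt_length,prompt_complexity,domain,response,response_type,rich_text
tel-0001,Classify the sentiment of this review ...,instruction,short,easy,science,positive,single-word,false
tel-0002,Which era does this event belong to ...,continuation,medium,hard,history,medieval,short-phrase,false
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Select only the hard prompts, e.g. to build a challenge subset
hard = [r["id"] for r in rows if r["prompt_complexity"] == "hard"]
print(hard)
```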

    Quality and Accuracy:

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Telugu version is grammatically accurate, with no spelling or grammatical errors. No copyrighted, toxic, or harmful content was used during the construction of this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Telugu Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  6. cncf-raw-data-for-llm-training

    • huggingface.co
    Updated May 27, 2024
    Cite
    Kubermatic (2024). cncf-raw-data-for-llm-training [Dataset]. https://huggingface.co/datasets/Kubermatic/cncf-raw-data-for-llm-training
    Explore at:
    Croissant
    Dataset updated
    May 27, 2024
    Dataset authored and provided by
    Kubermatic
    Description

    CNCF Raw Data for LLM Training

      Description
    

    This dataset, named cncf-raw-data-for-llm-training, consists of markdown (MD) and PDF content extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. The data was collected by fetching MD and PDF files from different CNCF project repositories and converting them into JSON format. This dataset is intended as raw data for training large language models (LLMs). The dataset includes… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-raw-data-for-llm-training.

  7. Japanese Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.

    Dataset Content:

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraphs types of answers. The answers contain text strings, numerical values, date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
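
    A JSON delivery with the annotation details listed above can be filtered with the standard json module. The field names and records below are assumptions for illustration; the actual keys in the FutureBeeAI files may differ:

```python
import json

# Illustrative records using the annotation fields listed above
records = json.loads("""
[
  {"id": "ja-0001", "context": "…", "question": "…", "question_type": "true/false",
   "question_complexity": "easy", "domain": "science", "answer": "true",
   "answer_type": "single-word", "rich_text": false},
  {"id": "ja-0002", "context": "…", "question": "…", "question_type": "direct",
   "question_complexity": "hard", "domain": "history", "answer": "…",
   "answer_type": "paragraph", "rich_text": true}
]
""")

# Select only the true/false questions, e.g. for a binary-QA fine-tuning subset
true_false = [r["id"] for r in records if r["question_type"] == "true/false"]
print(true_false)
```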

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Japanese version is grammatically accurate, with no spelling or grammatical errors. No toxic or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  8. cve-llm-training

    • huggingface.co
    Updated Jul 12, 2024
    Cite
    Chiranjeevi G (2024). cve-llm-training [Dataset]. http://doi.org/10.57967/hf/3627
    Explore at:
    Croissant
    Dataset updated
    Jul 12, 2024
    Authors
    Chiranjeevi G
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    CVE-llm_dataset

    This dataset is intended to train an LLM on strictly CVE-focused inputs and outputs.

      Data extraction:
    

    For the data extraction, I first downloaded the CVE database from the NVD lists and then loaded it using cve_dataset_2.py and cve_dataset.py; the two scripts produce different datasets, one for LLaMA and the other for OpenAI GPT. The CVE JSON files are laid out in this format:

    cves/
    ├─ 1999/
    │  ├─ 0xxx/
    │  │  ├─ CVE-1999-0001.json
    │  │  ├─ ...

    … See the full description on the dataset page: https://huggingface.co/datasets/morpheuslord/cve-llm-training.
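
    The year/prefix layout above maps cleanly onto a recursive glob. A sketch of collecting records from such a tree, using a fabricated one-file tree since the real NVD download is large (record contents here are placeholders, not real CVE data):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def collect_cves(root: Path):
    """Walk the cves/<year>/<Nxxx>/CVE-*.json layout and yield
    (cve_id, record) pairs."""
    for path in sorted(root.glob("*/*/CVE-*.json")):
        yield path.stem, json.loads(path.read_text())

with TemporaryDirectory() as d:
    root = Path(d)
    # Fabricate a tree matching the layout
    leaf = root / "1999" / "0xxx"
    leaf.mkdir(parents=True)
    (leaf / "CVE-1999-0001.json").write_text(json.dumps({"id": "CVE-1999-0001"}))
    ids = [cve_id for cve_id, _ in collect_cves(root)]
    print(ids)
```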

  9. LLM - Detect AI Generated Text Dataset

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
    Explore at:
    Croissant
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil thite
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

    The dataset contains more than 28,000 essays, both student-written and AI-generated.

    Features:
    1. text: the essay text
    2. generated: the target label (0 = human-written essay, 1 = AI-generated essay)
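
    The two-column schema (text, generated) can be split by label with the standard csv module. The rows below are a tiny fabricated stand-in for the real Kaggle CSV:

```python
import csv
import io

# Tiny stand-in for the described schema: text plus a 0/1 generated label
raw = """text,generated
An essay a student wrote about rivers.,0
An essay produced by a language model.,1
Another handwritten piece on history.,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Partition by the target label (values arrive as strings from csv)
human = [r for r in rows if r["generated"] == "0"]
ai = [r for r in rows if r["generated"] == "1"]
print(len(human), len(ai))
```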

  10. Large Language Model (LLM) Data | 10 Million POI Average Noise Levels | 35 B...

    • storefront.silencio.network
    Updated Apr 9, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Data | 10 Million POI Average Noise Levels | 35 B + Data Points | 100% Traceable Consent [Dataset]. https://storefront.silencio.network/products/ai-training-data-global-hyper-local-average-noise-levels-silencio-network
    Explore at:
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Anguilla, Chile, French Guiana, Central African Republic, Timor-Leste, Hungary, Svalbard and Jan Mayen, Mauritania, Faroe Islands, Uzbekistan
    Description

    Silencio provides the world’s largest real-world street and venue noise-level dataset, combining over 35 billion datapoints with AI-powered interpolation. Fully anonymized, user-consented, and ready for AI training, urban analysis, and mobility insights. Available in raw format.

  11. Bitext-retail-banking-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Bitext (2024). Bitext-retail-banking-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset
    Explore at:
    Croissant
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail Banking] sector can be easily achieved using our two-step approach to LLM Fine-Tuning.… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.

  12. Unsupervised Speech Data |1 Million Hours | Spontaneous Speech | LLM |...

    • datarade.ai
    Updated Jan 3, 2025
    Cite
    Nexdata (2025). Unsupervised Speech Data |1 Million Hours | Spontaneous Speech | LLM | Pre-training |Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-unsupervised-speech-data-1-million-ho-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Macao, France, Argentina, Ireland, United States of America, Switzerland, Germany, Peru, Taiwan, Netherlands
    Description
    1. Specifications
    Format: 16kHz, 16bit, wav, mono channel

    Content category: Dialogue or monologue in several common domains, such as daily vlogs, travel, podcast, technology, beauty, etc

    Language: English(USA, UK, Canada, Australia, India, Philippine, etc.), French, German, Japanese, Arabic(MSA, Gulf, Levantine, Egyptian accents, etc.), etc.

    Recording condition: Mixed(indoor, public place, entertainment,etc.)

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of speech data, and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
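
    The spec above (16 kHz, 16-bit, WAV, mono) can be checked programmatically on delivery. A sketch with the standard wave module, writing a short silent clip in the stated format and validating it; the file itself is fabricated, only the spec values come from the listing:

```python
import wave
from tempfile import NamedTemporaryFile

# Delivery spec quoted in the listing: 16 kHz sample rate, 16-bit (2 bytes), mono
SPEC = {"framerate": 16_000, "sampwidth": 2, "nchannels": 1}

def matches_spec(path: str) -> bool:
    """Check a WAV file's header against the delivery spec."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == SPEC["framerate"]
                and w.getsampwidth() == SPEC["sampwidth"]
                and w.getnchannels() == SPEC["nchannels"])

# Write a 0.1 s silent clip in the stated format, then verify it
with NamedTemporaryFile(suffix=".wav", delete=False) as f:
    with wave.open(f, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16_000)
        w.writeframes(b"\x00\x00" * 1600)  # 1600 frames = 0.1 s at 16 kHz
    path = f.name

print(matches_spec(path))
```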
  13. In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition...

    • datarade.ai
    Updated Dec 14, 2023
    Cite
    Nexdata (2023). In-Cabin Speech Data | 15,000 Hours | AI Training Data | Speech Recognition Data | Audio Data |Natural Language Processing (NLP) Data [Dataset]. https://datarade.ai/data-products/nexdata-in-car-speech-data-15-000-hours-audio-ai-ml-t-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 14, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Switzerland, Poland, Argentina, Russian Federation, Egypt, Romania, Austria, Turkey, Netherlands, Germany
    Description
    1. Specifications
    Format : audio: 48kHz, 16bit, uncompressed wav, mono channel; video: MP4

    Recording Environment : in-car; 1 quiet scene, 1 low-noise scene, 3 medium-noise scenes and 2 high-noise scenes

    Recording Content : covers 5 fields: navigation, multimedia, telephone, car control, and question answering; 500 sentences per person

    Speaker : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

    Device : High fidelity microphone; Binocular camera

    Language : 20 languages

    Transcription content : text

    Accuracy rate : 98%

    Application scenarios : speech recognition, Human-computer interaction; Natural language processing and text analysis; Visual content understanding, etc.

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go Natural Language Processing (NLP) data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  14. Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data|...

    • datarade.ai
    Updated Dec 11, 2023
    Cite
    Nexdata (2023). Scripted Monologues Speech Data | 65,000 Hours | Generative AI Audio Data| Speech Recognition Data | Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-read-speech-data-65-000-hours-aud-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Puerto Rico, Uruguay, Luxembourg, Taiwan, Japan, Pakistan, France, Poland, Chile, Italy
    Description
    1. Specifications
    Format : 16kHz, 16bit, uncompressed wav, mono channel

    Recording environment : quiet indoor environment, without echo

    Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters

    Speaker : native speaker, gender balance

    Device : Android mobile phone, iPhone

    Language : 100+ languages

    Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers

    Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)

    Application scenarios : speech recognition, voiceprint recognition

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go Machine Learning (ML) data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  15. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • datarade.ai
    Updated Dec 10, 2023
    Cite
    Nexdata (2023). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Saudi Arabia, Germany, Indonesia, Malaysia, Canada, Ecuador, Vietnam, Korea (Republic of), Austria, Turkey
    Description
    1. Specifications
    Format : 16kHz, 16bit, uncompressed wav, mono channel;

    Environment : quiet indoor environment, without echo;

    Recording content : no preset linguistic data; dozens of topics are specified, and the speakers converse on those topics while the recording is performed;

    Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

    Annotation : annotating for the transcription text, speaker identification, gender and noise symbols;

    Device : Android mobile phone, iPhone;

    Language : 100+ Languages;

    Application scenarios : speech recognition; voiceprint recognition;

    Accuracy rate : the word accuracy rate is not less than 98%

    2. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go Machine Learning (ML) data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
  16. Finnish Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Finnish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/finnish-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Finnish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Finnish language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Finnish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Finnish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains the question with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraph types of answers. The answer contains text strings, numerical values, date and time formats as well. Such diversity strengthens the Language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Finnish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
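    As an illustration, a record carrying these annotation fields might be loaded and filtered like this. The field values and the helper function are hypothetical; only the field names come from the dataset description above:

    ```python
    import json

    # Hypothetical record: the field names match the annotation details listed
    # above, but the values are invented for illustration only.
    record_json = """
    {
      "id": "fi-qa-000001",
      "language": "fi",
      "domain": "science",
      "question_length": 54,
      "prompt_type": "instruction",
      "question_category": "fact-based",
      "question_type": "direct",
      "complexity": "medium",
      "answer_type": "paragraph",
      "rich_text": false
    }
    """

    def filter_by_complexity(records, level):
        """Return only records whose 'complexity' annotation matches `level`."""
        return [r for r in records if r.get("complexity") == level]

    records = [json.loads(record_json)]
    print(filter_by_complexity(records, "medium")[0]["id"])  # fi-qa-000001
    ```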

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in Finnish are grammatically accurate, without word or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Finnish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  17. Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI...

    • datarade.ai
    Updated Dec 8, 2023
    Nexdata (2023). Re-ID Data | 600,000 ID | CCTV Data |Computer Vision Data| Identity Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-re-id-data-60-000-id-image-video-ai-ml-train-nexdata
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 8, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Luxembourg, Ecuador, Turkmenistan, Cuba, Bolivia (Plurinational State of), Russian Federation, Sri Lanka, Trinidad and Tobago, United Arab Emirates, Portugal
    Description
    1. Specifications

    Data size: 60,000 IDs

    Population distribution: race distribution of Asian, Caucasian, and Black people; gender distribution of male and female; age distribution from children to the elderly

    Collecting environment: indoor and outdoor scenes (such as supermarkets, malls, and residential areas)

    Data diversity: different ages, time periods, cameras, human body orientations and postures, and collecting environments

    Device: surveillance cameras with an image resolution of at least 1,920 × 1,080

    Data format: the image data format is .jpg; the annotation file format is .json

    Annotation content: human body rectangular bounding boxes and 15 human body attributes

    Quality requirements: a rectangular bounding box of a human body is qualified when its deviation is no more than 3 pixels, and the qualified rate of bounding boxes shall not be lower than 97%; annotation accuracy of attributes is over 97%
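    The stated QC rule (a box qualifies when its deviation is at most 3 pixels, and at least 97% of boxes must qualify) can be sketched as a simple batch check. The function names are illustrative, not part of the dataset tooling:

    ```python
    def box_qualified(deviation_px, max_dev=3):
        """A bounding box passes QC when its deviation is at most `max_dev` pixels."""
        return deviation_px <= max_dev

    def qualified_rate(deviations):
        """Fraction of boxes whose deviation is within tolerance."""
        if not deviations:
            return 0.0
        return sum(box_qualified(d) for d in deviations) / len(deviations)

    # Example batch: 97 boxes within 3 px, 3 boxes outside the tolerance.
    deviations = [1] * 97 + [5] * 3
    rate = qualified_rate(deviations)
    print(rate >= 0.97)  # the batch exactly meets the 97% requirement
    ```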

    About Nexdata: Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. These ready-to-go identity data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/computervision?source=Datarade
  18. Chain-of-Thought collection

    • kaggle.com
    Updated Jun 19, 2023
    Konrad Banachewicz (2023). Chain-of-Thought collection [Dataset]. http://identifiers.org/arxiv:2305.140
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Kaggle
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", which includes 1.88M CoT rationales extracted across 1,060 tasks: https://arxiv.org/abs/2305.14045

    From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning; how can we instill the same capability of step-by-step reasoning on unseen tasks into LMs with fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, improving the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT gives LMs stronger few-shot learning capabilities, resulting in improvements of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.

  19. CISA TTP Articles Data Set

    • zenodo.org
    csv, json
    Updated May 11, 2025
    Dženan Hamzić; Dženan Hamzić; Florian Skopik; Florian Skopik; Markus Wurzenberger; Markus Wurzenberger; Max Landauer; Max Landauer (2025). CISA TTP Articles Data Set [Dataset]. http://doi.org/10.5281/zenodo.14659512
    Available download formats: json, csv
    Dataset updated
    May 11, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Dženan Hamzić; Dženan Hamzić; Florian Skopik; Florian Skopik; Markus Wurzenberger; Markus Wurzenberger; Max Landauer; Max Landauer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 27, 2024
    Description

    This dataset contains 77 cybersecurity articles crawled from the public CISA website. All of these articles were publicly available at the time of crawling, without any subscription or paid services. They were published from July 2020 to February 2024 and were selected for this dataset if they explicitly mentioned MITRE ATT&CK TTPs (Tactics, Techniques, and Procedures).

    The data set supports research in the domain of Cyber Threat Intelligence as it may act as a ground truth for TTP labeling. Specifically, this dataset is designed to facilitate research and analysis related to the identification and classification of TTPs in cybersecurity advisories.

    Each crawled article is represented by the following four columns:

    • RawText: The unfiltered text extracted from the main content of each article (class: "l-full_main").
    • TTP: A set of MITRE ATT&CK TTP (Tactics, Techniques, and Procedures) IDs identified within the article's RawText. These IDs are extracted using the regex pattern: (?:TA\d{4}|T\d{4,5}(?:\.\d{3})?).
    • CleanText: A cleaned version of the RawText, with tables and TTP IDs removed for clarity.
    • URL: The URL of the original article.
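    As a sketch, the regex pattern stated above can be applied in pure Python to recover the TTP column from an article's raw text. The sample advisory text and function name are invented for illustration:

    ```python
    import re

    # The extraction pattern stated in the dataset description:
    # tactic IDs (TA####) or technique IDs (T####[#][.###]).
    TTP_PATTERN = re.compile(r"(?:TA\d{4}|T\d{4,5}(?:\.\d{3})?)")

    def extract_ttps(text):
        """Return the set of MITRE ATT&CK TTP IDs mentioned in `text`."""
        return set(TTP_PATTERN.findall(text))

    sample = ("The actors used PowerShell (T1059.001) for execution (TA0002) "
              "and exfiltrated data over C2 channels (T1041).")
    print(sorted(extract_ttps(sample)))  # ['T1041', 'T1059.001', 'TA0002']
    ```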

    About the crawling process

    All advisories were gathered on September 27th, 2024 from the CISA website by sifting through all advisory URLs backwards in time until 2020. All articles that explicitly mentioned TTPs were selected for the dataset. To detect the presence of TTP IDs, the main content of each article was checked for any of the following phrases:

    • "MITRE ATT&CK Tactics and Techniques"
    • "Tactics and Techniques"
    • "MITRE ATT&CK Techniques"

    The dataset is available in CSV and JSON formats, both containing the same data.

    Acknowledgments: Funded by the European Union under the European Defence Fund (GA no. 101121403 - NEWSROOM and GA no. 101121418 - EUCINF). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. This work is co-funded by the Austrian FFG Kiras project ASOC (GA no. FO999905301).

  20. CulturaX Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Dec 18, 2024
    Thuat Nguyen; Chien Van Nguyen; Viet Dac Lai; Hieu Man; Nghia Trung Ngo; Franck Dernoncourt; Ryan A. Rossi; Thien Huu Nguyen (2024). CulturaX Dataset [Dataset]. https://paperswithcode.com/dataset/culturax
    Dataset updated
    Dec 18, 2024
    Authors
    Thuat Nguyen; Chien Van Nguyen; Viet Dac Lai; Hieu Man; Nghia Trung Ngo; Franck Dernoncourt; Ryan A. Rossi; Thien Huu Nguyen
    Description

    We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for large language model (LLM) development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. We employ MinHash at document level to achieve fuzzy deduplication for the datasets in different languages. Our data cleaning framework includes diverse criteria and threshold selections, guided by extensive data samples, ensuring comprehensive noise filtering in various aspects. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs.
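    Document-level MinHash deduplication, as mentioned above, can be sketched in a few lines of pure Python. This is a toy illustration of the general technique only: the function names are our own, and large-scale pipelines typically use a dedicated library (e.g. datasketch) with LSH banding rather than pairwise signature comparison:

    ```python
    import hashlib

    def shingles(text, k=3):
        """Character k-shingles of a document (word shingles also work)."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash_signature(doc_shingles, num_hashes=64):
        """For each seeded hash function, keep the minimum hash value
        observed over the document's shingle set."""
        sig = []
        for seed in range(num_hashes):
            sig.append(min(
                int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in doc_shingles))
        return sig

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of matching signature slots approximates Jaccard similarity."""
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
    c = minhash_signature(shingles("completely unrelated text about parquet files"))
    # Near-duplicates score much higher than unrelated documents, so a
    # similarity threshold can flag fuzzy duplicates for removal.
    print(estimated_jaccard(a, b) > estimated_jaccard(a, c))
    ```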

    Our dataset combines the most recent iteration of mC4 (version 3.1.0) [1] with all accessible OSCAR corpora up to the present year, including 20.19, 21.09, 22.01, and 23.01 [2]. After deep cleaning and deduplication, CulturaX involves 16TB of data in the parquet format (expanding to 27TB when unpacked). More than half of our dataset is dedicated to non-English languages to significantly boost the data size and enhance the feasibility of training models in multilingual scenarios.

    To obtain perplexity scores for data cleaning, we train a SentencePiece tokenizer and 5-gram Kneser-Ney language models as provided in the KenLM library [3] using the 20230501 dumps of Wikipedia. Our KenLM models are also released in HuggingFace: https://huggingface.co/uonlp/kenlm.

    Details for the dataset can be found in our technical paper: https://arxiv.org/abs/2309.09400 and https://huggingface.co/datasets/uonlp/CulturaX
