Image Description Data
Data Size: 500 million pairs
Image Type: generic scenes (portraits, landscapes, animals, etc.), human action, picture books, magazines, PPT & charts, app screenshots, etc.
Resolution: 4K+
Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
Description Length: text length is no less than 250 words
Format: the image format is .jpg, the annotation format is .json, and the description format is .txt
Video Description Data
Data Size: 10 million pairs
Image Type: generic scenes (portraits, landscapes, animals, etc.), ads, TV sports, documentaries
Resolution: 1080p+
Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
Description Length: text length is no less than 250 words
Format: .mp4, .mov, .avi and other common formats; .xlsx (annotation file format)
About Nexdata
Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and can quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
Off-the-shelf 1PB image and video description data covers multiple scenes, languages, and domains.
Environment : quiet indoor environment, without echo;
Recording content : no preset script; dozens of topics are specified, and the speakers hold dialogues on those topics while being recorded;
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Annotation : annotating for the transcription text, speaker identification, gender and noise symbols;
Device : Android mobile phone, iPhone;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%
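The listing does not say how the word accuracy rate is computed. A minimal sketch, assuming it is defined as 1 minus the word error rate (word-level edit distance divided by the number of reference words), could look like the following; the function names and the whitespace tokenization are illustrative assumptions, not the vendor's method.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (word edit distance / reference word count)."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return 1.0 - word_edit_distance(ref_words, hyp_words) / len(ref_words)


def meets_word_accuracy(reference: str, hypothesis: str,
                        threshold: float = 0.98) -> bool:
    """Check a transcript against the 98% figure quoted in the listing."""
    return word_accuracy_rate(reference, hypothesis) >= threshold
```

For example, `word_accuracy_rate("turn on the radio", "turn on a radio")` counts one substitution against four reference words. Languages without whitespace word boundaries would need a different tokenizer.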
Nexdata has 35,000 hours of off-the-shelf Machine Learning (ML) data of 16 kHz conversational speech, covering 100+ languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, Russian, etc.
Recording Environment : In-car; 1 quiet scene, 1 low noise scene, 3 medium noise scenes and 2 high noise scenes
Recording Content : It covers 5 fields: navigation field, multimedia field, telephone field, car control field and question and answer field; 500 sentences per person
Speaker : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Device : High fidelity microphone; Binocular camera
Language : 20 languages
Transcription content : text
Accuracy rate : 98%
Application scenarios : speech recognition, Human-computer interaction; Natural language processing and text analysis; Visual content understanding, etc.
Off-the-shelf unsupervised speech dataset of 1 million hours, covering 10+ languages (English, French, German, Japanese, Arabic, Mandarin, etc., 100,000 hours each). The content covers dialogues or monologues in 28 common domains, such as daily vlogs, travel, podcasts, technology, beauty, etc.
Recording environment : quiet indoor environment, without echo
Recording content (read speech) : economy, entertainment, news, oral language, numbers, letters
Speaker : native speaker, gender balance
Device : Android mobile phone, iPhone
Language : 100+ languages
Transcription content : text, time point of speech data, 5 noise symbols, 5 special identifiers
Accuracy rate : 95% (the accuracy rate of noise symbols and other identifiers is not included)
Application scenarios : speech recognition, voiceprint recognition
https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset Card for SIFT-50M
SIFT-50M (Speech Instruction Fine-Tuning) is a 50-million-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). It is built from publicly available speech corpora containing a total of 14K hours of speech and leverages LLMs and off-the-shelf expert models. The dataset spans five languages, covering diverse aspects of speech understanding and controllable speech generation instructions. SIFT-50M… See the full description on the dataset page: https://huggingface.co/datasets/amazon-agi/SIFT-50M.
Recording environment : professional recording studio.
Recording content : general narrative sentences, interrogative sentences, etc.
Speaker : native speaker
Annotation Feature : word transcription, part-of-speech, phoneme boundary, four-level accents, four-level prosodic boundary.
Device : Microphone
Language : American English, British English, Japanese, French, Dutch, Cantonese, Canadian French, Australian English, Italian, New Zealand English, Spanish, Mexican Spanish
Application scenarios : speech synthesis
Accuracy rate:
Word transcription: the sentence accuracy rate is not less than 99%.
Part-of-speech annotation: the sentence accuracy rate is not less than 98%.
Phoneme annotation: the sentence accuracy rate is not less than 98% (the error rate of voiced and swallowed phonemes is not included, because the labelling is more subjective).
Accent annotation: the word accuracy rate is not less than 95%.
Prosodic boundary annotation: the sentence accuracy rate is not less than 97%.
Phoneme boundary annotation: the phoneme accuracy rate is not less than 95% (the error range of the boundary is within 5%).
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction (ACL 2025)
arXiv link: https://arxiv.org/abs/2504.15573
Github: https://github.com/YJiangcm/WebR
Leveraging an off-the-shelf LLM, WebR transforms raw web documents into high-quality instruction-response pairs. It strategically assigns each document as either an instruction or a response to trigger the process of web reconstruction. We released our generated datasets on Huggingface:
Dataset Generator Size… See the full description on the dataset page: https://huggingface.co/datasets/YuxinJiang/WebR-Pro-100k.
Environment : quiet indoor environment, without echo;
Recording content : no preset script; dozens of topics are specified, and the speakers hold dialogues on those topics while being recorded;
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Annotation : annotating for the transcription text, speaker identification, gender and noise symbols;
Device : Telephony recording system;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%
Recording environment : quiet indoor environment, without echo
Recording content (read speech) : general category; human-machine interaction category
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Device : Android mobile phone, iPhone;
Language : English-Korean, English-Japanese, German-English, Hong Kong Cantonese-English, Taiwanese-English
Application scenarios : speech recognition; voiceprint recognition.
Accuracy rate : 97%
MixEval is a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96 correlation with Chatbot Arena) while running locally and quickly (6% the time and cost of running MMLU), with its queries being stably and effortlessly updated every month to avoid contamination.
Recording condition: Phone recording system, with low background noise (call center scenario)
Recording content: Spontaneous inbound and outbound callings in typical domain, such as finance, real-estate, sale, health, insurance, telecom
Language: English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, Arabic, Dutch, Swedish, Norwegian and etc.
Features of annotation: Transcription text, timestamp, speaker ID, gender, noise, PII redacted
Accuracy: Word Accuracy Rate (WAR) 98%
Population distribution : the race distribution is Asians, Caucasians and black people, the gender distribution is male and female, the age distribution is from children to the elderly
Collecting environment : including indoor and outdoor scenes (such as supermarket, mall and residential area, etc.)
Data diversity : different ages, different time periods, different cameras, different human body orientations and postures, different collecting environments
Device : surveillance cameras, the image resolution is not less than 1,920 × 1,080
Data format : the image data format is .jpg, the annotation file format is .json
Annotation content : human body rectangular bounding boxes, 15 human body attributes
Quality Requirements : A rectangular bounding box of a human body is qualified when the deviation is not more than 3 pixels, and the qualified rate of the bounding boxes shall not be lower than 97%; annotation accuracy of attributes is over 97%
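The quality requirement above can be read as a simple acceptance check. The sketch below is only an illustration under stated assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples in pixels, "deviation" is taken to be the largest absolute difference on any edge against a trusted reference box, and all function names are hypothetical. The listing itself does not define how deviation is measured.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def max_edge_deviation(annotated: Box, reference: Box) -> float:
    """Largest absolute difference between corresponding box edges."""
    return max(abs(a - r) for a, r in zip(annotated, reference))


def qualified_rate(annotated_boxes: Sequence[Box],
                   reference_boxes: Sequence[Box],
                   max_deviation_px: float = 3.0) -> float:
    """Fraction of annotated boxes within the allowed pixel deviation."""
    if len(annotated_boxes) != len(reference_boxes):
        raise ValueError("box lists must have the same length")
    if not annotated_boxes:
        raise ValueError("at least one box is required")
    qualified = sum(
        1
        for annotated, reference in zip(annotated_boxes, reference_boxes)
        if max_edge_deviation(annotated, reference) <= max_deviation_px
    )
    return qualified / len(annotated_boxes)


def batch_passes(annotated_boxes: Sequence[Box],
                 reference_boxes: Sequence[Box],
                 min_rate: float = 0.97) -> bool:
    """Apply the 97% qualified-rate floor quoted in the listing."""
    return qualified_rate(annotated_boxes, reference_boxes) >= min_rate
```

A batch where every box is within 3 pixels of its reference passes; a batch where half the boxes drift by 5 pixels has a qualified rate of 0.5 and does not.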
Population distribution : gender distribution: gender balanced; race distribution: Caucasians, blacks, Indians, Asians; age distribution: aged from 18 to 60
Collection environment : In-car Cameras
Collection diversity : multiple races, multiple age periods, multiple time periods and behaviors (Dangerous behavior, Fatigue behavior, Visual movement behavior)
Device : binocular camera with RGB and infrared channels, the resolution is 640x480
Collection time : day, evening and night
Image parameter : the video format is .avi
Accuracy : the accuracy of each person's action is greater than 95%; the accuracy of label annotation is not less than 95%
Data size : 20,000 ID
Race distribution : Asian, Caucasian, Black, Brown
Gender distribution : male, female
Age distribution : from teenagers to the elderly, mainly young and middle-aged
Collection environment : indoor office scenes, such as meeting rooms, coffee shops, libraries, bedrooms, etc.
Collection diversity : diverse coverage of races, age groups and scenes
Collection equipment : cellphone, using the cellphone to simulate the perspective of the laptop camera in online conference scenes
Data format : .mp4, .mov
Accuracy rate : the accuracy exceeds 97%
Population distribution : race distribution: Asians, Caucasians, black people; gender distribution: gender balance; age distribution: from children to the elderly, with young and middle-aged people in the majority
Collection environment : indoor scenes, outdoor scenes
Collection diversity : various postures, expressions, lighting conditions, scenes, time periods and distances
Collection device : iPhone, Android phone, iPad
Collection time : daytime, night
Image Parameter : the video format is .mov or .mp4, the image format is .jpg
Accuracy : the accuracy of actions exceeds 97%
Recording environment : quiet indoor environment, low background noise, without echo.
Recording content (read speech) : generic category; human-machine interaction category; smart home command and control category; in-car command and control category; numbers.
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Device : Android mobile phone, iPhone.
Language : American English, British English, Canadian English, Australian English, French English, German English, Spanish English, Italian English, Portuguese English, Russian English, Indian English, Japanese English, Korean English, Singaporean English, etc.
Application scenarios : speech recognition; voiceprint recognition.
Recording environment: Low background noise;
Recording content: including live broadcasts, variety shows, speeches, etc.;
Language: English, French, German, Japanese, Portuguese, Dutch, Turkish, Korean, Vietnamese, Indonesian, Malay, Thai, Burmese, Arabic, etc.
Features of annotation: Transcription text, timestamp, speaker ID, gender, noise
Accuracy rate: Word Accuracy Rate (WAR) 98%