Dataset Card for "gpt4-llm-cleaned-chatml"
Data preprocessing pipeline: https://github.com/AlekseyKorshuk/chat-data-pipeline
This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.
It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).
AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.
For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).
Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.
Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.
We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Empathetic Dialogues for LLM
This repository contains a reformatted version of the Empathetic Dialogues dataset, tailored for seamless integration with Language Model (LLM) training and inference. The original dataset's format posed challenges for direct application in LLM tasks, prompting us to restructure and clean the data.
Data Restructuring
We have implemented the following changes to enhance the dataset's usability:
Merged dialogues with the same conv_id… See the full description on the dataset page: https://huggingface.co/datasets/Estwld/empathetic_dialogues_llm.
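A rough sketch of this merging step is shown below; the column names (conv_id, utterance_idx, utterance) follow the original Empathetic Dialogues CSV layout, and the file path is illustrative rather than part of this release.

```python
import pandas as pd

# Illustrative sketch, not the exact restructuring script used for this release.
# Column names follow the original Empathetic Dialogues CSV; the path is a placeholder.
df = pd.read_csv("empatheticdialogues/train.csv", on_bad_lines="skip")
dialogues = (
    df.sort_values(["conv_id", "utterance_idx"])
      .groupby("conv_id")["utterance"]
      .apply(list)
      .reset_index(name="turns")
)
print(dialogues.head())
```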
New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'
These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share here. I also added following public datasets (please upvote!): - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset
All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)
Enjoy ❤️
Version 2 update:
- removed NaNs and duplicated/short generations
- applied the cleaning procedure from @nbroad's notebook - give it an upvote please!
- added a model column to indicate the model family used in the generations (see the sketch below)
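A minimal pandas sketch of the cleanup steps above, assuming illustrative file and column names (the actual notebook referenced may differ):

```python
import pandas as pd

# Hypothetical sketch of the v2 cleanup; the file name, column names, and the
# length threshold are assumptions, not the exact procedure from the notebook.
df = pd.read_csv("daigt_v2_train.csv")
df = df.dropna(subset=["text"])            # remove NaNs
df = df.drop_duplicates(subset=["text"])   # remove duplicated generations
df = df[df["text"].str.len() > 200]        # drop very short generations
df.to_csv("daigt_v2_train_clean.csv", index=False)
```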
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.
What Makes Our Data Unique?
Scale and Coverage:
- A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
- Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.
Rich Attributes for Training Models:
- Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
- Tailored for training models in NLP, recommendation systems, and predictive algorithms.
Compliance and Quality:
- Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
- Extensive data cleaning and validation processes ensure reliability and accuracy.
Annotation-Ready:
- Pre-structured and formatted datasets that are easily ingestible into AI workflows.
- Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
How Is the Data Sourced?
- Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
- Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.
This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.
Primary Use Cases and Verticals
Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.
Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.
B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.
HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.
How This Product Fits Into Xverum’s Broader Data Offering
Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.
Why Choose Xverum?
- Experience and Expertise: A trusted name in structured web data with a proven track record.
- Flexibility: Datasets can be tailored for any AI/ML application.
- Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
- Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.
Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.
Contact us for sample datasets or to discuss your specific needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides outage data and incident data of 3 LLM providers (OpenAI, Anthropic, Character.AI) across 8 LLM services (OpenAI API, ChatGPT, DALLE, Playground, Anthropic API, Claude, Console, Character.AI) collected until 2024-08-31.
Data sources:
service_provider = [openai, anthropic, character.ai]
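As a minimal sketch, assuming the records ship as a CSV with a service_provider column (the file name and schema are assumptions):

```python
import pandas as pd

# Hypothetical example: count outage/incident records per provider.
# The file name and column name are assumptions based on the description above.
outages = pd.read_csv("llm_outages.csv")
print(outages["service_provider"].value_counts())
```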
Documents:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic dataset created with GPT-4o
Synthetic dataset of text2cypher over 16 different graph schemas. Questions were generated using GPT-4-turbo, and the corresponding Cypher statements with gpt-4o using Chain of Thought. Here, there are only questions that return results when queried against the database. For more information visit: https://github.com/neo4j-labs/text2cypher/tree/main/datasets/synthetic_gpt4o_demodbs Dataset is available as train.csv. Columns are the following:… See the full description on the dataset page: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean.
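A minimal loading sketch, assuming the train split is exposed through the Hugging Face datasets library under the repository named above:

```python
from datasets import load_dataset

# The split name "train" is assumed from the train.csv file mentioned above.
ds = load_dataset("tomasonjo/text2cypher-gpt4o-clean", split="train")
print(ds[0])  # one question/Cypher pair with its schema metadata
```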
-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.
-Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.
-RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or provides multi-factor scoring. By training annotators to align with values and utilizing a multi-person fitting approach, the quality of feedback can be improved.
-Compliance: All the Large Language Model (LLM) data is collected with proper authorization.
-Quality: Multiple rounds of quality inspections ensure high-quality data output.
-Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
-Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.
About Nexdata
Nexdata is equipped with professional data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet Large Language Model (LLM) data collection requirements in various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services, such as speech, image, video, point cloud and Natural Language Processing (NLP) data, etc. Please visit us at https://www.nexdata.ai/?source=Datarade
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BRIGHT benchmark
BRIGHT+ is an upgraded version of the BRIGHT benchmark, specifically designed to support reasoning-intensive retrieval in realistic settings. It is constructed by applying MARCUS, a multi-agent LLM-based clean-and-split pipeline, to the original BRIGHT dataset. BRIGHT+ addresses key limitations of the original web-crawled corpus—such as redundant boilerplate content and fragmented semantic units—by applying targeted structural cleaning and LLM-based semantic… See the full description on the dataset page: https://huggingface.co/datasets/Helios1208/BRIGHT-Plus.
CorpusNÓS is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.
We happily announce that we are introducing a new version of the CorpusNÓS. After improving our text cleaning and processing methods in our cleaning pipeline, we have decided to release this new version of the corpus, which reflects those enhancements.
This new version contains the same files as the previous one and holds the same distribution of the data; however, we have changed the format from plain text (.txt) to JSONL (.jsonl) so that future cleaning processes can be performed more easily and relevant metadata can be included. As of now, some example entries from CorpusNós have the following structure:
{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}
{"id": 581, "text": "Feliz 2008 a tódolos nosos lectores Agora que remata 2007, un ano cheo de novidades tecnolóxicas que difundimos a través deste espazo dixital, queremos desexar a tódolos que non seguen con fidelidade unha boa despedida do ano e un feliz aninovo. Nós volveremos o mércores, 2 de xaneiro, á nosa actividade ordinaria, cumprindo coa nosa labor informativa para que as novas tecnolóxicas de Galicia e en galego cheguen ós nosos lectores puntualmente.", "num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}
In the plain text version, the delimiter between documents was two consecutive newline characters. In the JSONL version, each document is a JSON object with its corresponding id; it also includes the number of words in each document and, in some cases, the pyplexity score and the language tag.
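A minimal sketch of reading the JSONL release (the file name is illustrative, and the filtering thresholds are assumptions):

```python
import json

# Illustrative file name; point this at the file from corpusnos_v2_jsonl.
kept = []
with open("corpusnos_v2.jsonl", encoding="utf-8") as fh:
    for line in fh:
        doc = json.loads(line)
        # "pyplexity_score" and "lang" are only present in some entries (see above).
        if doc.get("lang", "gl") == "gl" and doc.get("pyplexity_score", 0) < 1000:
            kept.append(doc)

print(len(kept), "documents kept")
```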
This new version of CorpusNós has undergone a heavier deduplication process than the previous one: more exact-match duplicates as well as partial duplicates have been removed from the corpus, and therefore the number of documents and tokens in this version has decreased. The current statistics are:
Subcorpus: Data obtained via transfer agreement

| Genre | Nº tokens | Nº documents |
|---|---|---|
| Books | 7.217.626 | 103 |
| Research articles | 2.638.537 | 635 |
| Press | 92.661.350 | 161.760 |
| Governmental | 221.565.059 | 527.699 |
| Web contents | 15.471.132 | 41.276 |
| Encyclopedic | 4.799.214 | 47.396 |
| Subtotal | 332.721.231 | 777.583 |

Subcorpus: Public data

| Genre | Nº tokens | Nº documents |
|---|---|---|
| Press and blogs | 142.238.181 | 598.375 |
| Encyclopedic | 48.260.708 | 148.560 |
| Web crawls | 1.205.699.835 | 2.850.604 |
| Translation corpora | 106.555.883 | 3.544.026 |
| Subtotal | 1.502.754.607 | 7.141.565 |
| Total (both subcorpora) | 1.835.475.838 | 7.919.148 |
The TXT version is still available in the corpusnos_v1_txt zip file and maintains the same structure as before (documents are separated by two consecutive newline characters), but this version has not gone through the improved cleaning process mentioned above.
Please note that if you want to download or use the newest version, you have to download corpusnos_v2_jsonl.
Note: Some of the files referenced may be missing in this version of the corpus due to pending transfer agreements; they will be included in a future version of the corpus as soon as they are available for publishing.
Note: The following subcorpora have different licenses, corresponding to their original licenses as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).
Please refer to our paper for more details, CorpusNÓS: A massive Galician corpus for training large language models.
If you use this data in your work, please cite:
de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.
Funding
This corpus was compiled and developed within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública (Funded by the EU - NextGenerationEU) within the framework of the ILENIA project, reference 2022/TL22/00215336.
Live Face Anti-Spoof Dataset
A live face dataset is crucial for advancing computer vision tasks such as face detection, anti-spoofing detection, and face recognition. The Live Face Anti-Spoof Dataset offered by Ainnotate is specifically designed to train algorithms for anti-spoofing purposes, ensuring that AI systems can accurately differentiate between real and fake faces in various scenarios.
Key Features:
- Comprehensive Video Collection: The dataset features thousands of videos showcasing a diverse range of individuals, including males and females, with and without glasses. It also includes men with beards, mustaches, and clean-shaven faces.
- Lighting Conditions: Videos are captured in both indoor and outdoor environments, ensuring that the data covers a wide range of lighting conditions, making it highly applicable for real-world use.
- Data Collection Method: Our datasets are gathered through a community-driven approach, leveraging our extensive network of over 700k users across various Telegram apps. This method ensures that the data is not only diverse but also ethically sourced with full consent from participants, providing reliable and real-world applicable data for training AI models.
- Versatility: This dataset is ideal for training models in face detection, anti-spoofing, and face recognition tasks, offering robust support for these essential computer vision applications.
In addition to the Live Face Anti-Spoof Dataset, FileMarket provides specialized datasets across various categories to support a wide range of AI and machine learning projects:
- Object Detection Data: Perfect for training AI in image and video analysis.
- Machine Learning (ML) Data: Offers a broad spectrum of applications, from predictive analytics to natural language processing (NLP).
- Large Language Model (LLM) Data: Designed to support text generation, chatbots, and machine translation models.
- Deep Learning (DL) Data: Essential for developing complex neural networks and deep learning models.
- Biometric Data: Includes diverse datasets for facial recognition, fingerprint analysis, and other biometric applications.
This live face dataset, alongside our other specialized data categories, empowers your AI projects by providing high-quality, diverse, and comprehensive datasets. Whether your focus is on anti-spoofing detection, face recognition, or other biometric and machine learning tasks, our data offerings are tailored to meet your specific needs.
ODC-By: https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
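Given the corpus size, streaming is the usual way to sample it; a minimal sketch using the datasets library (the "sample-10BT" subset name is an assumption, see the dataset page for the available configurations):

```python
from datasets import load_dataset

# Stream 🍷 FineWeb rather than downloading the full corpus locally.
# The "sample-10BT" subset name is an assumption; check the dataset page for configs.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
for i, doc in enumerate(fw):
    print(doc["text"][:200])
    if i == 2:
        break
```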
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This benchmark dataset consists of 1401 medical question-and-answer pairs, primarily in Norwegian (Bokmål and Nynorsk), designed for evaluating Large Language Models (LLMs). The content originates from publicly available sources containing medical exam questions and has undergone cleaning and preprocessing. The dataset is structured in JSON format, with each record containing the source document name, the question number (where available), the question text, the reference answer text, and, for multiple-choice questions, the incorrect answer options. It is suitable for use within evaluation frameworks such as lm-evaluation-harness (GitHub with config and code example: https://github.com/kelkalot/normedqa) to assess model capabilities in medical knowledge retrieval and reasoning specific to the Norwegian context.
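A minimal sketch of turning one record into a multiple-choice prompt; the JSON field names below are assumptions inferred from the description above, not the released schema:

```python
import json
import random

# Field names ("question", "reference_answer", "wrong_answers") are assumptions.
with open("normedqa.json", encoding="utf-8") as fh:
    records = json.load(fh)

rec = records[0]
options = [rec["reference_answer"]] + rec.get("wrong_answers", [])
random.shuffle(options)
prompt = rec["question"] + "\n" + "\n".join(
    f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
)
print(prompt)
```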
Dataset Card for "somos-clean-alpaca-es"
This dataset is a Spanish translation of the Clean Alpaca dataset and serves as the reference for the collaborative effort to clean and improve the dataset during the Somos NLP 2023 Hackathon. Note: participation in the hackathon is not required to contribute to this task. The more people and teams take part, the higher the quality of the final dataset, and therefore of the LLM we train, so join in! We explain how… See the full description on the dataset page: https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Web scraping of comments: we extracted the available Nepali comments from Hugging Face and Kaggle and preprocessed them, removing HTML tags, usernames, links, emoji, and non-Nepali characters, followed by lemmatization. After that, the clean dataset was labeled with an LLM-as-a-judge approach using the OpenAI o3 model (premium), which cost around $20 for the labeling. Different embedding models were then used for embedding. This will act as a benchmark dataset for Nepali sentiment analysis. Please give the author credit if you use it.
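A rough sketch of the cleaning steps described above (this is an illustration, not the authors' actual script; lemmatization and the LLM labeling step are omitted, and "non-Nepali characters" is interpreted here as anything outside the Devanagari block):

```python
import re

def clean_comment(text: str) -> str:
    """Illustrative cleanup following the steps described above."""
    text = re.sub(r"<[^>]+>", " ", text)             # HTML tags
    text = re.sub(r"@\w+", " ", text)                # usernames
    text = re.sub(r"https?://\S+", " ", text)        # links
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)  # drop emoji and non-Devanagari characters
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("<p>@user राम्रो छ 👍 https://example.com</p>"))
```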
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cleaned Tabula Muris Senis Single-Cell Data and other aging datasets
This dataset contains LLM-cleaned single-cell transcriptomic annotations from the Tabula Muris Senis project, specifically for mouse tissues processed with SmartSeq2, and ALL OTHER DATASETS WITH AGING IN THE FILENAME :-) . The cleaning and annotation were performed using large language models (OpenAI and Claude), enabling enriched metadata and corrected cell type labels.
🧬 Over 1.3 million rows and 78.17 GB… See the full description on the dataset page: https://huggingface.co/datasets/longevity-db/cleaned_data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the data supporting our study, in which we use a large language model (LLM) to analyze the impact of congressional witness testimony. The dataset has been curated and structured to facilitate reproducibility and encourage further research in this domain.
The repository includes the results of our study (see `Results.zip`), the fine-tuning corpus (see `Model Training Data.zip`), and the Witness and Legislative History and Impact Corpus (WLHIC), which can be subdivided into the Witness Corpus (WC) and the Legislative History and Impact Corpus (LHIC). For the LHIC and WC, we provide cleaned JSONL files containing the full datasets, individual text files of each document, and accompanying metadata (see `WLHIC data.zip`). To ensure comprehensive accessibility, we also include the original PDF versions of the documents in these corpora (see `WLHIC Raw Files.zip`).
We also provide the sentence transformer model resulting from the extended pretraining process and the model resulting from the fine tuning process. Both are accessible in `Models.zip`.
Researchers can use the provided data to replicate our findings and verify the results of our analysis. The cleaned data can also be regenerated by applying the cleaning scripts provided in the code repository to the LHIC and WC text files. While slight variations in results may occur when replicating the study from scratch due to the stochastic nature of LLM training, these differences are minimal and do not affect the substantive findings.
We encourage the use of this dataset for reproducibility studies and to inspire further exploration of LLM applications in political science research. By publishing this data, we aim to promote transparency, collaboration, and innovation in the field.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is the dataset for the KDD'25 (dataset and benchmark track) full paper "T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training".
T3Set (Table Tennis Training) is a multimodal dataset with aligned video-sensor-text data in table tennis training. The key features of T3Set include (1) temporal alignment between sensor data, video data, and text data, and (2) high-quality targeted suggestions that are consistent with a predefined suggestion taxonomy.
The scripts we used for the dataset construction and data cleaning processes are provided in the GitHub repo: https://github.com/jima-cs/t3set
If you find this dataset useful, please cite our paper:
```bibtex
@inproceedings{ma2025t3set,
  title={T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training},
  author={Ji Ma and Jiale Wu and Haoyu Wang and Yanze Zhang and Xiao Xie and Zheng Zhou and Jiachen Wang and Yingcai Wu},
  year={2025},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
  doi={10.1145/3711896.3737407},
  pages={}
}
```
MZ94 - LLM Training Dataset
Overview
This dataset contains crawled documentation from https://infozone.atlassian.net/wiki/spaces/MD94/, formatted for LLM training and RAG systems.
Dataset Statistics
Total Pages: 4109
Total Words: 1005892
Total Chunks: 2420
Crawled: 2025-06-24 04:55:10
Directory Structure
/llm_ready/
Plain text files optimized for LLM training:
- Clean, formatted text content
- Consistent structure with headers
- Document… See the full description on the dataset page: https://huggingface.co/datasets/ratanon/mz94-documentation.
Dataset Card for "gpt4-llm-cleaned-chatml"
Data preprocessing pipeline: https://github.com/AlekseyKorshuk/chat-data-pipeline