Dataset Card for "gpt4-llm-cleaned-chatml"
Data preprocessing pipeline: https://github.com/AlekseyKorshuk/chat-data-pipeline
This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.
It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).
AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.
For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).
Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.
Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.
We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Empathetic Dialogues for LLM
This repository contains a reformatted version of the Empathetic Dialogues dataset, tailored for seamless integration with Language Model (LLM) training and inference. The original dataset's format posed challenges for direct application in LLM tasks, prompting us to restructure and clean the data.
Data Restructuring
We have implemented the following changes to enhance the dataset's usability:
Merged dialogues with the same conv_id… See the full description on the dataset page: https://huggingface.co/datasets/Estwld/empathetic_dialogues_llm.
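A rough sketch of this merging step is shown below; the column names (conv_id, utterance_idx, utterance) follow the original Empathetic Dialogues CSV layout, and the file path is illustrative rather than part of this release.

```python
import pandas as pd

# Illustrative sketch, not the exact restructuring script used for this release.
# Column names follow the original Empathetic Dialogues CSV; the path is a placeholder.
df = pd.read_csv("empatheticdialogues/train.csv", on_bad_lines="skip")
dialogues = (
    df.sort_values(["conv_id", "utterance_idx"])
      .groupby("conv_id")["utterance"]
      .apply(list)
      .reset_index(name="turns")
)
print(dialogues.head())
```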
New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'
These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share here. I also added following public datasets (please upvote!): - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset
All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)
Enjoy ❤️
Version 2 update:
- removed NaNs and duplicated/short generations
- applied the cleaning procedure from @nbroad's notebook - give it an upvote please!
- added a model column to indicate the model family used in the generations (see the sketch below)
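A minimal pandas sketch of the cleanup steps above, assuming illustrative file and column names (the actual notebook referenced may differ):

```python
import pandas as pd

# Hypothetical sketch of the v2 cleanup; the file name, column names, and the
# length threshold are assumptions, not the exact procedure from the notebook.
df = pd.read_csv("daigt_v2_train.csv")
df = df.dropna(subset=["text"])            # remove NaNs
df = df.drop_duplicates(subset=["text"])   # remove duplicated generations
df = df[df["text"].str.len() > 200]        # drop very short generations
df.to_csv("daigt_v2_train_clean.csv", index=False)
```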
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.
What Makes Our Data Unique?
Scale and Coverage:
- A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
- Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.
Rich Attributes for Training Models:
- Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
- Tailored for training models in NLP, recommendation systems, and predictive algorithms.
Compliance and Quality:
- Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
- Extensive data cleaning and validation processes ensure reliability and accuracy.
Annotation-Ready:
- Pre-structured and formatted datasets that are easily ingestible into AI workflows.
- Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
How Is the Data Sourced?
- Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
- Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.
This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.
Primary Use Cases and Verticals
Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.
Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.
B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.
HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.
How This Product Fits Into Xverum’s Broader Data Offering
Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.
Why Choose Xverum?
- Experience and Expertise: A trusted name in structured web data with a proven track record.
- Flexibility: Datasets can be tailored for any AI/ML application.
- Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
- Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.
Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.
Contact us for sample datasets or to discuss your specific needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides outage data and incident data of 3 LLM providers (OpenAI, Anthropic, Character.AI) across 8 LLM services (OpenAI API, ChatGPT, DALLE, Playground, Anthropic API, Claude, Console, Character.AI) collected until 2024-08-31.
Data sources:
service_provider = [openai, anthropic, character.ai]
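As a minimal sketch, assuming the records ship as a CSV with a service_provider column (the file name and schema are assumptions):

```python
import pandas as pd

# Hypothetical example: count outage/incident records per provider.
# The file name and column name are assumptions based on the description above.
outages = pd.read_csv("llm_outages.csv")
print(outages["service_provider"].value_counts())
```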
Documents:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic dataset created with GPT-4o
Synthetic dataset of text2cypher over 16 different graph schemas. Questions were generated using GPT-4-turbo, and the corresponding Cypher statements with gpt-4o using Chain of Thought. Here, there are only questions that return results when queried against the database. For more information visit: https://github.com/neo4j-labs/text2cypher/tree/main/datasets/synthetic_gpt4o_demodbs Dataset is available as train.csv. Columns are the following:… See the full description on the dataset page: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean.
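A minimal loading sketch, assuming the train split is exposed through the Hugging Face datasets library under the repository named above:

```python
from datasets import load_dataset

# The split name "train" is assumed from the train.csv file mentioned above.
ds = load_dataset("tomasonjo/text2cypher-gpt4o-clean", split="train")
print(ds[0])  # one question/Cypher pair with its schema metadata
```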
-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.
-Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.
-RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or provides multi-factor scoring. By training annotators to align with values and utilizing a multi-person fitting approach, the quality of feedback can be improved.
-Compliance: All the Large Language Model (LLM) data is collected with proper authorization.
-Quality: Multiple rounds of quality inspections ensure high-quality data output.
-Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
-Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.
About Nexdata
Nexdata is equipped with professional data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet Large Language Model (LLM) data collection requirements in various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services, such as speech, image, video, point cloud and Natural Language Processing (NLP) data, etc. Please visit us at https://www.nexdata.ai/?source=Datarade
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BRIGHT benchmark
BRIGHT+ is an upgraded version of the BRIGHT benchmark, specifically designed to support reasoning-intensive retrieval in realistic settings. It is constructed by applying MARCUS, a multi-agent LLM-based clean-and-split pipeline, to the original BRIGHT dataset. BRIGHT+ addresses key limitations of the original web-crawled corpus—such as redundant boilerplate content and fragmented semantic units—by applying targeted structural cleaning and LLM-based semantic… See the full description on the dataset page: https://huggingface.co/datasets/Helios1208/BRIGHT-Plus.
CorpusNÓS is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.
We happily announce that we are introducing a new version of the CorpusNÓS. After improving our text cleaning and processing methods in our cleaning pipeline, we have decided to release this new version of the corpus, which reflects those enhancements.
This new version contains the same files as the previous one and holds the same distribution of the data; however, we have changed the format from plain text (.txt) to JSONL (.jsonl) so that future cleaning processes can be performed more easily and relevant metadata can be included. As of now, some example entries from CorpusNós have the following structure:
{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}
{"id": 581, "text": "Feliz 2008 a tódolos nosos lectores Agora que remata 2007, un ano cheo de novidades tecnolóxicas que difundimos a través deste espazo dixital, queremos desexar a tódolos que non seguen con fidelidade unha boa despedida do ano e un feliz aninovo. Nós volveremos o mércores, 2 de xaneiro, á nosa actividade ordinaria, cumprindo coa nosa labor informativa para que as novas tecnolóxicas de Galicia e en galego cheguen ós nosos lectores puntualmente.", "num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}
In the plain text version, the delimiter between documents was two consecutive newline characters. In the JSONL version, each document is a JSON object with its corresponding id; it also includes the number of words in each document and, in some cases, the pyplexity score and the language tag.
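A minimal sketch of reading the JSONL release (the file name is illustrative, and the filtering thresholds are assumptions):

```python
import json

# Illustrative file name; point this at the file from corpusnos_v2_jsonl.
kept = []
with open("corpusnos_v2.jsonl", encoding="utf-8") as fh:
    for line in fh:
        doc = json.loads(line)
        # "pyplexity_score" and "lang" are only present in some entries (see above).
        if doc.get("lang", "gl") == "gl" and doc.get("pyplexity_score", 0) < 1000:
            kept.append(doc)

print(len(kept), "documents kept")
```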
This new version of CorpusNós has undergone a heavier deduplication process than the previous one: more exact-match duplicates as well as partial duplicates have been removed from the corpus, and therefore the number of documents and tokens in this version has decreased. The current statistics are:
Subcorpus: Data obtained via transfer agreement

| Genre | Nº tokens | Nº documents |
|---|---|---|
| Books | 7.217.626 | 103 |
| Research articles | 2.638.537 | 635 |
| Press | 92.661.350 | 161.760 |
| Governmental | 221.565.059 | 527.699 |
| Web contents | 15.471.132 | 41.276 |
| Encyclopedic | 4.799.214 | 47.396 |
| Subtotal | 332.721.231 | 777.583 |

Subcorpus: Public data

| Genre | Nº tokens | Nº documents |
|---|---|---|
| Press and blogs | 142.238.181 | 598.375 |
| Encyclopedic | 48.260.708 | 148.560 |
| Web crawls | 1.205.699.835 | 2.850.604 |
| Translation corpora | 106.555.883 | 3.544.026 |
| Subtotal | 1.502.754.607 | 7.141.565 |
| Total (both subcorpora) | 1.835.475.838 | 7.919.148 |
The TXT version is still available in the corpusnos_v1_txt zip file and maintains the same structure as before (documents are separated by two consecutive newline characters), but this version has not gone through the improved cleaning process mentioned above.
Please note that if you want to download or use the newest version, you have to download corpusnos_v2_jsonl.
Note: Some of the files referenced may be missing in this version of the corpus due to pending transfer agreements; they will be included in a future version of the corpus as soon as they are available for publishing.
Note: The following subcorpora have different licenses, corresponding to their original licenses as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).
Please refer to our paper for more details, CorpusNÓS: A massive Galician corpus for training large language models.
If you use this data in your work, please cite:
de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.
Funding
This corpus was compiled and developed within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública (Funded by the EU - NextGenerationEU) within the framework of the ILENIA project, reference 2022/TL22/00215336.
Live Face Anti-Spoof Dataset
A live face dataset is crucial for advancing computer vision tasks such as face detection, anti-spoofing detection, and face recognition. The Live Face Anti-Spoof Dataset offered by Ainnotate is specifically designed to train algorithms for anti-spoofing purposes, ensuring that AI systems can accurately differentiate between real and fake faces in various scenarios.
Key Features:
- Comprehensive Video Collection: The dataset features thousands of videos showcasing a diverse range of individuals, including males and females, with and without glasses. It also includes men with beards, mustaches, and clean-shaven faces.
- Lighting Conditions: Videos are captured in both indoor and outdoor environments, ensuring that the data covers a wide range of lighting conditions, making it highly applicable for real-world use.
- Data Collection Method: Our datasets are gathered through a community-driven approach, leveraging our extensive network of over 700k users across various Telegram apps. This method ensures that the data is not only diverse but also ethically sourced with full consent from participants, providing reliable and real-world applicable data for training AI models.
- Versatility: This dataset is ideal for training models in face detection, anti-spoofing, and face recognition tasks, offering robust support for these essential computer vision applications.
In addition to the Live Face Anti-Spoof Dataset, FileMarket provides specialized datasets across various categories to support a wide range of AI and machine learning projects:
- Object Detection Data: Perfect for training AI in image and video analysis.
- Machine Learning (ML) Data: Offers a broad spectrum of applications, from predictive analytics to natural language processing (NLP).
- Large Language Model (LLM) Data: Designed to support text generation, chatbots, and machine translation models.
- Deep Learning (DL) Data: Essential for developing complex neural networks and deep learning models.
- Biometric Data: Includes diverse datasets for facial recognition, fingerprint analysis, and other biometric applications.
This live face dataset, alongside our other specialized data categories, empowers your AI projects by providing high-quality, diverse, and comprehensive datasets. Whether your focus is on anti-spoofing detection, face recognition, or other biometric and machine learning tasks, our data offerings are tailored to meet your specific needs.
ODC-By: https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
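Given the corpus size, streaming is the usual way to sample it; a minimal sketch using the datasets library (the "sample-10BT" subset name is an assumption, see the dataset page for the available configurations):

```python
from datasets import load_dataset

# Stream 🍷 FineWeb rather than downloading the full corpus locally.
# The "sample-10BT" subset name is an assumption; check the dataset page for configs.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
for i, doc in enumerate(fw):
    print(doc["text"][:200])
    if i == 2:
        break
```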
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This benchmark dataset consists of 1401 medical question-and-answer pairs, primarily in Norwegian (Bokmål and Nynorsk), designed for evaluating Large Language Models (LLMs). The content originates from publicly available sources containing medical exam questions and has undergone cleaning and preprocessing. The dataset is structured in JSON format, with each record containing the source document name, the question number (where available), the question text, the reference answer text, and, for multiple-choice questions, the incorrect answer options. It is suitable for use within evaluation frameworks such as lm-evaluation-harness (GitHub with config and code example: https://github.com/kelkalot/normedqa) to assess model capabilities in medical knowledge retrieval and reasoning specific to the Norwegian context.
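A minimal sketch of turning one record into a multiple-choice prompt; the JSON field names below are assumptions inferred from the description above, not the released schema:

```python
import json
import random

# Field names ("question", "reference_answer", "wrong_answers") are assumptions.
with open("normedqa.json", encoding="utf-8") as fh:
    records = json.load(fh)

rec = records[0]
options = [rec["reference_answer"]] + rec.get("wrong_answers", [])
random.shuffle(options)
prompt = rec["question"] + "\n" + "\n".join(
    f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
)
print(prompt)
```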
Dataset Card for "somos-clean-alpaca-es"
This dataset is a Spanish translation of the Clean Alpaca dataset and serves as the reference for the collaborative effort to clean and improve the dataset during the Somos NLP 2023 Hackathon. Note: participation in the hackathon is not required to contribute to this task. The more people and teams take part, the higher the quality of the final dataset, and therefore of the LLM we train, so join in! We explain how… See the full description on the dataset page: https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Web scraping of comments: we extracted the available Nepali comments from Hugging Face and Kaggle and preprocessed them, removing HTML tags, usernames, links, emoji, and non-Nepali characters, followed by lemmatization. After that, the clean dataset was labeled with an LLM-as-a-judge approach using the OpenAI o3 model (premium), which cost around $20 for the labeling. Different embedding models were then used for embedding. This will act as a benchmark dataset for Nepali sentiment analysis. Please give the author credit if you use it.
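A rough sketch of the cleaning steps described above (this is an illustration, not the authors' actual script; lemmatization and the LLM labeling step are omitted, and "non-Nepali characters" is interpreted here as anything outside the Devanagari block):

```python
import re

def clean_comment(text: str) -> str:
    """Illustrative cleanup following the steps described above."""
    text = re.sub(r"<[^>]+>", " ", text)             # HTML tags
    text = re.sub(r"@\w+", " ", text)                # usernames
    text = re.sub(r"https?://\S+", " ", text)        # links
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)  # drop emoji and non-Devanagari characters
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("<p>@user राम्रो छ 👍 https://example.com</p>"))
```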
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cleaned Tabula Muris Senis Single-Cell Data and other aging datasets
This dataset contains LLM-cleaned single-cell transcriptomic annotations from the Tabula Muris Senis project, specifically for mouse tissues processed with SmartSeq2, and ALL OTHER DATASETS WITH AGING IN THE FILENAME :-) . The cleaning and annotation were performed using large language models (OpenAI and Claude), enabling enriched metadata and corrected cell type labels.
🧬 Over 1.3 million rows and 78.17 GB… See the full description on the dataset page: https://huggingface.co/datasets/longevity-db/cleaned_data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the data supporting our study, in which we use a large language model (LLM) to analyze the impact of congressional witness testimony. The dataset has been curated and structured to facilitate reproducibility and encourage further research in this domain.
The repository includes the results of our study (see `Results.zip`), the fine-tuning corpus (see `Model Training Data.zip`), and the Witness and Legislative History and Impact Corpus (WLHIC), which can be subdivided into the Witness Corpus (WC) and the Legislative History and Impact Corpus (LHIC). For the LHIC and WC, we provide cleaned JSONL files containing the full datasets, individual text files of each document, and accompanying metadata (see `WLHIC data.zip`). To ensure comprehensive accessibility, we also include the original PDF versions of the documents in these corpora (see `WLHIC Raw Files.zip`).
We also provide the sentence transformer model resulting from the extended pretraining process and the model resulting from the fine tuning process. Both are accessible in `Models.zip`.
Researchers can use the provided data to replicate our findings and verify the results of our analysis. The cleaned data can also be regenerated by applying the cleaning scripts provided in the code repository to the LHIC and WC text files. While slight variations in results may occur when replicating the study from scratch due to the stochastic nature of LLM training, these differences are minimal and do not affect the substantive findings.
We encourage the use of this dataset for reproducibility studies and to inspire further exploration of LLM applications in political science research. By publishing this data, we aim to promote transparency, collaboration, and innovation in the field.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is the dataset for the KDD'25 (dataset and benchmark track) full paper "T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training".
T3Set (Table Tennis Training) is a multimodal dataset with aligned video-sensor-text data in table tennis training. The key features of T3Set include (1) temporal alignment between sensor data, video data, and text data, and (2) high-quality targeted suggestions that are consistent with a predefined suggestion taxonomy.
The scripts we used for the dataset construction and data cleaning processes are provided in the GitHub repo: https://github.com/jima-cs/t3set
If you find this dataset useful, please cite our paper:
```bibtex
@inproceedings{ma2025t3set,
  title={T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training},
  author={Ji Ma and Jiale Wu and Haoyu Wang and Yanze Zhang and Xiao Xie and Zheng Zhou and Jiachen Wang and Yingcai Wu},
  year={2025},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
  doi={10.1145/3711896.3737407},
  pages={}
}
```
MZ94 - LLM Training Dataset
Overview
This dataset contains crawled documentation from https://infozone.atlassian.net/wiki/spaces/MD94/, formatted for LLM training and RAG systems.
Dataset Statistics
Total Pages: 4109
Total Words: 1005892
Total Chunks: 2420
Crawled: 2025-06-24 04:55:10
Directory Structure
/llm_ready/
Plain text files optimized for LLM training:
- Clean, formatted text content
- Consistent structure with headers
- Document… See the full description on the dataset page: https://huggingface.co/datasets/ratanon/mz94-documentation.
Dataset Card for "gpt4-llm-cleaned-chatml"
Data preprocessing pipeline: https://github.com/AlekseyKorshuk/chat-data-pipeline