63 datasets found
  1. gpt4-llm-cleaned-chatml

    • huggingface.co
    Updated Jul 26, 2023
    Cite
    Aleksey Korshuk (2023). gpt4-llm-cleaned-chatml [Dataset]. https://huggingface.co/datasets/AlekseyKorshuk/gpt4-llm-cleaned-chatml
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2023
    Authors
    Aleksey Korshuk
    Description

    Dataset Card for "gpt4-llm-cleaned-chatml"

    Data preprocessing pipeline: https://github.com/AlekseyKorshuk/chat-data-pipeline
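
    As a quick orientation, here is a minimal sketch of loading this dataset with the Hugging Face `datasets` library; the `train` split name and the column layout are assumptions, so inspect the schema before relying on specific fields:

    ```python
    # Minimal sketch (assumes the `datasets` library and a `train` split).
    from datasets import load_dataset

    ds = load_dataset("AlekseyKorshuk/gpt4-llm-cleaned-chatml", split="train")
    print(ds.column_names)  # discover the actual schema
    print(ds[0])            # inspect one ChatML-formatted record
    ```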

  2. Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...

    • datarade.ai
    .json, .csv
    + more versions
    Cite
    Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
    Explore at:
    .json, .csv (available download formats)
    Dataset authored and provided by
    Coresignal
    Area covered
    Guinea-Bissau, Guatemala, Chile, Hungary, Namibia, Saint Barthélemy, Niue, Panama, Andorra, Guadeloupe
    Description

    This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

    It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

    AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

    For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select a suitable delivery frequency (quarterly, monthly, or weekly).

    Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.

  3. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    .json, .csv, .xls (available download formats)
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Belize, Colombia, Saudi Arabia, Antigua and Barbuda, Iceland, Equatorial Guinea, Russian Federation, Qatar, Benin, Djibouti
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  4. empathetic_dialogues_llm

    • huggingface.co
    Updated Jul 2, 2024
    Cite
    zhangyiqun (2024). empathetic_dialogues_llm [Dataset]. https://huggingface.co/datasets/Estwld/empathetic_dialogues_llm
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 2, 2024
    Authors
    zhangyiqun
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Empathetic Dialogues for LLM

     This repository contains a reformatted version of the Empathetic Dialogues dataset, tailored for seamless integration with Language Model (LLM) training and inference. The original dataset's format posed challenges for direct application in LLM tasks, prompting us to restructure and clean the data. 

      Data Restructuring
    

     We have implemented the following changes to enhance the dataset's usability: 

    Merged dialogues with the same conv_id… See the full description on the dataset page: https://huggingface.co/datasets/Estwld/empathetic_dialogues_llm.
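
    As an illustration of the restructuring step described above, here is a sketch of grouping utterance rows that share a conv_id into one dialogue; the column names are assumptions about the original Empathetic Dialogues layout, not this dataset's actual schema:

    ```python
    # Hypothetical illustration of merging dialogues by conv_id with pandas.
    import pandas as pd

    rows = pd.DataFrame({
        "conv_id": ["hit:0", "hit:0", "hit:1"],
        "utterance": ["I feel nervous.", "Why is that?", "Great day today!"],
    })
    dialogues = rows.groupby("conv_id")["utterance"].apply(list)
    print(dialogues["hit:0"])  # ['I feel nervous.', 'Why is that?']
    ```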

  5. daigt-v3-train-dataset

    • kaggle.com
    Updated Dec 28, 2023
    Cite
    Darek Kłeczek (2023). daigt-v3-train-dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Darek Kłeczek
    Description

    New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'

    These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):

    - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
    - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
    - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
    - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset

    All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)

    Enjoy ❤️

    Version 2 update:

    - removed NaNs and duplicated/short generations
    - applied the cleaning procedure from @nbroad's notebook - give it an upvote please!
    - added a model column to indicate the model family used in generations
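
    A hedged sketch of working with the model column mentioned above; the filename and the text column are placeholders, so adjust them to the actual files in the Kaggle dataset:

    ```python
    # Sketch: inspect generations per model family and mirror the v2-update cleaning.
    import pandas as pd

    df = pd.read_csv("daigt_v3_train.csv")           # placeholder filename
    print(df["model"].value_counts())                # generations per model family
    df = df.dropna().drop_duplicates(subset="text")  # "text" column is an assumption
    ```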

  6. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    .json, .csv (available download formats)
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Norway, Dominican Republic, Sint Maarten (Dutch part), Cook Islands, Oman, India, Barbados, Western Sahara, Jordan, United Kingdom
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage:
    - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
    - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models:
    - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
    - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality:
    - Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
    - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready:
    - Pre-structured and formatted datasets that are easily ingestible into AI workflows.
    - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced?
    - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
    - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.

    This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum?
    - Experience and Expertise: A trusted name in structured web data with a proven track record.
    - Flexibility: Datasets can be tailored for any AI/ML application.
    - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
    - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  7. LLM Service Outages and Incident Reports

    • zenodo.org
    Updated Jan 28, 2025
    Cite
    Xiaoyu Chu (2025). LLM Service Outages and Incident Reports [Dataset]. http://doi.org/10.5281/zenodo.14018219
    Explore at:
    Dataset updated
    Jan 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiaoyu Chu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 31, 2024
    Description

    This dataset provides outage data and incident data for three LLM providers (OpenAI, Anthropic, Character.AI) across eight LLM services (OpenAI API, ChatGPT, DALLE, Playground, Anthropic API, Claude, Console, Character.AI), collected until 2024-08-31.

    Data sources:

    • Outage: https://status.{service_provider}.com/uptime
    • Incident Reports: https://status.{service_provider}.com/history

    service_provider = [openai, anthropic, character.ai]

    Documents:

    • ./raw_data/*: Raw data for outages and incidents, stored by service.
    • ./clean_data/*: Cleaned data for characterization, aggregated across all services.
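
    The two URL templates above expand mechanically over the three providers; here is a small sketch that only formats the documented patterns (it does not scrape the pages):

    ```python
    # Expand the documented status-page URL templates per provider.
    providers = ["openai", "anthropic", "character.ai"]
    for provider in providers:
        print(f"https://status.{provider}.com/uptime")   # outage data
        print(f"https://status.{provider}.com/history")  # incident reports
    ```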

  8. text2cypher-gpt4o-clean

    • huggingface.co
    Updated May 23, 2024
    Cite
    Tomaž Bratanič (2024). text2cypher-gpt4o-clean [Dataset]. https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2024
    Authors
    Tomaž Bratanič
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Synthetic dataset created with GPT-4o

    Synthetic dataset of text2cypher over 16 different graph schemas. Questions were generated using GPT-4-turbo, and the corresponding Cypher statements with gpt-4o using Chain of Thought. Here, there are only questions that return results when queried against the database. For more information visit: https://github.com/neo4j-labs/text2cypher/tree/main/datasets/synthetic_gpt4o_demodbs Dataset is available as train.csv. Columns are the following:… See the full description on the dataset page: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean.
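
    Since the column list is truncated in this description, a safer move is to load train.csv and inspect the schema directly rather than assume it; a minimal pandas sketch:

    ```python
    # Sketch: download train.csv from the dataset repo first, then inspect it.
    import pandas as pd

    df = pd.read_csv("train.csv")   # path to the downloaded file
    print(df.columns.tolist())      # the actual column list
    print(df.head(1))
    ```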

  9. Foundation Model Data Collection and Data Annotation | Large Language...

    • datarade.ai
    Updated Jan 25, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://datarade.ai/data-products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Taiwan, El Salvador, Kyrgyzstan, Spain, Czech Republic, Portugal, Maldives, Russian Federation, Azerbaijan, Ireland
    Description
    1. Overview

    -Unsupervised Learning: For the training data required in unsupervised learning, Nexdata delivers data collection and cleaning services for both single-modal and cross-modal data. We provide Large Language Model (LLM) Data cleaning and personnel support services based on the specific data types and characteristics of the client's domain.

    -SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.

    -Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.

    -RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to rules provided by the client, or provides multi-factor scoring. By training annotators to align with values and utilizing a multi-person fitting approach, the quality of feedback can be improved.

    2. Our Capacity

    -Global Resources: Global resources covering hundreds of languages worldwide

    -Compliance: All the Large Language Model (LLM) Data is collected with proper authorization

    -Quality: Multiple rounds of quality inspections ensure high-quality data output

    -Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.

    -Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.

    3. About Nexdata

    Nexdata is equipped with professional data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet the Large Language Model (LLM) Data collection requirements in various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) Data annotation services, such as speech, image, video, point cloud and Natural Language Processing (NLP) Data, etc. Please visit us at https://www.nexdata.ai/?source=Datarade

  10. BRIGHT-Plus

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Liyang Chen (2025). BRIGHT-Plus [Dataset]. https://huggingface.co/datasets/Helios1208/BRIGHT-Plus
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    Liyang Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BRIGHT benchmark

    BRIGHT+ is an upgraded version of the BRIGHT benchmark, specifically designed to support reasoning-intensive retrieval in realistic settings. It is constructed by applying MARCUS, a multi-agent LLM-based clean-and-split pipeline, to the original BRIGHT dataset. BRIGHT+ addresses key limitations of the original web-crawled corpus—such as redundant boilerplate content and fragmented semantic units—by applying targeted structural cleaning and LLM-based semantic… See the full description on the dataset page: https://huggingface.co/datasets/Helios1208/BRIGHT-Plus.

  11. Data from: Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training

    • investigacion.usc.es
    Updated 2024
    + more versions
    Cite
    de-Dios-Flores, Iria; Paniagua Suárez, Silvia; Bardanca, Daniel; Gamallo, Pablo; García, Marcos; Ramom Pichel Campos, José; Carbajal Pérez, Cristina; Moscoso Sánchez, Antonio; Francisco Marini, Jose Javier; Canosa Pérez, Cristian (2024). Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training [Dataset]. https://investigacion.usc.es/documentos/668fc40fb9e7c03b01bd388b?lang=es
    Explore at:
    Dataset updated
    2024
    Authors
    de-Dios-Flores, Iria; Paniagua Suárez, Silvia; Bardanca, Daniel; Gamallo, Pablo; García, Marcos; Ramom Pichel Campos, José; Carbajal Pérez, Cristina; Moscoso Sánchez, Antonio; Francisco Marini, Jose Javier; Canosa Pérez, Cristian
    Area covered
    Galicia
    Description

    CorpusNÓS is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.

    We happily announce that we are introducing a new version of the CorpusNÓS. After improving our text cleaning and processing methods in our cleaning pipeline, we have decided to release this new version of the corpus, which reflects those enhancements.

    This new version contains the same files as the previous one and holds the same distribution of the data; however, we decided to change the format from plain text (.txt) to JSONL (.jsonl) so future cleaning processes can be performed easily and relevant metadata can be included. As of now, some example entries from the CorpusNós have the following structure:

    {"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}

    {"id": 581, "text": "Feliz 2008 a tódolos nosos lectores Agora que remata 2007, un ano cheo de novidades tecnolóxicas que difundimos a través deste espazo dixital, queremos desexar a tódolos que non seguen con fidelidade unha boa despedida do ano e un feliz aninovo. Nós volveremos o mércores, 2 de xaneiro, á nosa actividade ordinaria, cumprindo coa nosa labor informativa para que as novas tecnolóxicas de Galicia e en galego cheguen ós nosos lectores puntualmente.", "num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}

    In the plain text version, the delimiter between different documents was two consecutive newlines ("\n\n"). In the JSONL version, each document is a JSON object with its corresponding id, but it also includes the number of words in each document and, in some cases, the pyplexity score and the language tag.
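
    A minimal sketch of reading the v2 JSONL release follows; the filename is a placeholder, and per the description only id, text, and num_words are guaranteed, while pyplexity_score and lang appear on some entries only, hence the defensive lookup:

    ```python
    # Sketch: stream the JSONL corpus and keep documents matching simple criteria.
    import json

    kept = []
    with open("corpusnos_v2.jsonl", encoding="utf-8") as f:  # placeholder filename
        for line in f:
            doc = json.loads(line)
            # `lang` is optional, so default to "gl" when it is absent.
            if doc["num_words"] >= 50 and doc.get("lang", "gl") == "gl":
                kept.append(doc)
    print(len(kept))
    ```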

    This new version of CorpusNós has undergone a heavier process of deduplication than the previous one. This means that more exact match duplications as well as partial duplications have been removed from the corpus and, therefore, the number of documents and tokens in this version has decreased and the current statistics are:

    Subcorpus: Data obtained via transfer agreement

    Genre               Nº tokens       Nº documents
    Books               7.217.626       103
    Research articles   2.638.537       635
    Press               92.661.350      161.760
    Governmental        221.565.059     527.699
    Web contents        15.471.132      41.276
    Encyclopedic        4.799.214       47.396
    Subtotal            332.721.231     777.583

    Subcorpus: Public data

    Genre               Nº tokens       Nº documents
    Press and blogs     142.238.181     598.375
    Encyclopedic        48.260.708      148.560
    Web crawls          1.205.699.835   2.850.604
    Translation corpora 106.555.883     3.544.026
    Subtotal            1.502.754.607   7.141.565

    Total               1.835.475.838   7.919.148

    The TXT version is still available as the corpusnos_v1_txt zip file and maintains the same structure as before (documents are divided by two newlines, "\n\n"), but this version has not gone through the improved cleaning process mentioned above.

    Please note that if you want to download or use the newest version, you have to download corpusnos_v2_jsonl.

    Note: Some of the files referenced may be missing in this version of the corpus due to pending transfer agreements; they will be included in a future version of the corpus as soon as they are available for publishing. Note: the following subcorpora have different licenses, corresponding to their original licenses as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

    Please refer to our paper for more details, CorpusNÓS: A massive Galician corpus for training large language models.

    If you use this data in your work, please cite:

    de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.

    Funding

    This corpus was compiled and developed within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215336.

  12. FileMarket | Dataset for Face Anti-Spoofing (Videos) in Computer Vision...

    • datarade.ai
    Updated Jul 10, 2024
    Cite
    FileMarket (2024). FileMarket | Dataset for Face Anti-Spoofing (Videos) in Computer Vision Applications | Machine Learning (ML) Data | Deep Learning (DL) Data [Dataset]. https://datarade.ai/data-products/filemarket-dataset-for-face-anti-spoofing-videos-in-compu-filemarket
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jul 10, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Cabo Verde, United Republic of Tanzania, Libya, Ukraine, Sao Tome and Principe, Russian Federation, South Sudan, Guinea-Bissau, Mauritania, Germany
    Description

    Live Face Anti-Spoof Dataset

    A live face dataset is crucial for advancing computer vision tasks such as face detection, anti-spoofing detection, and face recognition. The Live Face Anti-Spoof Dataset offered by Ainnotate is specifically designed to train algorithms for anti-spoofing purposes, ensuring that AI systems can accurately differentiate between real and fake faces in various scenarios.

    Key Features:

    - Comprehensive Video Collection: The dataset features thousands of videos showcasing a diverse range of individuals, including males and females, with and without glasses. It also includes men with beards, mustaches, and clean-shaven faces.
    - Lighting Conditions: Videos are captured in both indoor and outdoor environments, ensuring that the data covers a wide range of lighting conditions, making it highly applicable for real-world use.
    - Data Collection Method: Our datasets are gathered through a community-driven approach, leveraging our extensive network of over 700k users across various Telegram apps. This method ensures that the data is not only diverse but also ethically sourced with full consent from participants, providing reliable and real-world applicable data for training AI models.
    - Versatility: This dataset is ideal for training models in face detection, anti-spoofing, and face recognition tasks, offering robust support for these essential computer vision applications.

    In addition to the Live Face Anti-Spoof Dataset, FileMarket provides specialized datasets across various categories to support a wide range of AI and machine learning projects:

    - Object Detection Data: Perfect for training AI in image and video analysis.
    - Machine Learning (ML) Data: Offers a broad spectrum of applications, from predictive analytics to natural language processing (NLP).
    - Large Language Model (LLM) Data: Designed to support text generation, chatbots, and machine translation models.
    - Deep Learning (DL) Data: Essential for developing complex neural networks and deep learning models.
    - Biometric Data: Includes diverse datasets for facial recognition, fingerprint analysis, and other biometric applications.

    This live face dataset, alongside our other specialized data categories, empowers your AI projects by providing high-quality, diverse, and comprehensive datasets. Whether your focus is on anti-spoofing detection, face recognition, or other biometric and machine learning tasks, our data offerings are tailored to meet your specific needs.

  13. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Dataset authored and provided by
    FineData
    License

    ODC-By (https://choosealicense.com/licenses/odc-by/)

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
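
    At this scale the dataset cannot be downloaded casually, so streaming mode, which reads records lazily, is the natural way to peek at it; the config name and text field below are assumptions, so check the dataset page for the available dumps:

    ```python
    # Sketch: lazily stream a few FineWeb records instead of downloading 15T+ tokens.
    from itertools import islice
    from datasets import load_dataset

    fw = load_dataset("HuggingFaceFW/fineweb", name="default",
                      split="train", streaming=True)
    for doc in islice(fw, 3):
        print(doc["text"][:80])  # `text` field assumed from typical web corpora
    ```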

  14. Norwegian Medical Question Answering Dataset - NorMedQA

    • zenodo.org
    json
    Updated May 5, 2025
    + more versions
    Cite
    Michael A. Riegler (2025). Norwegian Medical Question Answering Dataset - NorMedQA [Dataset]. http://doi.org/10.5281/zenodo.15346484
    Explore at:
    json (available download formats)
    Dataset updated
    May 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael A. Riegler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This benchmark dataset consists of 1401 medical question-and-answer pairs, primarily in Norwegian (Bokmål and Nynorsk), designed for evaluating Large Language Models (LLMs). The content originates from publicly available sources containing medical exam questions and has undergone cleaning and preprocessing. The dataset is structured in JSON format, with each record containing the source document name, the question number (where available), the question text, the reference answer text, and, for multiple-choice questions, the wrong answer texts. It is suitable for use within evaluation frameworks such as lm-evaluation-harness (GitHub with config and code example: https://github.com/kelkalot/normedqa) to assess model capabilities in medical knowledge retrieval and reasoning specific to the Norwegian context.
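
    A sketch of reading the records as described; the filename and key names are hypothetical stand-ins for the fields the description lists (source document name, question number, question text, reference answer, wrong answers), so verify them against the actual file:

    ```python
    # Sketch: load the NorMedQA JSON and print a few records.
    import json

    with open("normedqa.json", encoding="utf-8") as f:  # placeholder filename
        records = json.load(f)

    for rec in records[:3]:
        # Hypothetical keys; check the real schema before use.
        print(rec.get("source"), rec.get("question_number"), rec.get("question"))
    ```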

  15. somos-clean-alpaca-es

    • huggingface.co
    Updated Mar 24, 2023
    Cite
    SomosNLP (2023). somos-clean-alpaca-es [Dataset]. https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 24, 2023
    Dataset authored and provided by
    SomosNLP
    Description

    Dataset Card for "somos-clean-alpaca-es"

    This dataset is a translation of the Clean Alpaca dataset into Spanish and serves as the reference for the collaborative effort to clean and improve the dataset during the Somos NLP 2023 Hackathon. Note: you do not need to participate in the hackathon to contribute to this task. The more people and teams take part, the higher the quality of the final dataset, and therefore of the LLM we train, so join in! We explain how… See the full description on the dataset page: https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es.

  16. Sentiment Analysis Nepali Dataset by SamirWagle

    • kaggle.com
    Updated Jul 7, 2025
    Cite
    Samir Wagle (2025). Sentiment Analysis Nepali Dataset by SamirWagle [Dataset]. https://www.kaggle.com/datasets/sameerwagle/sentimentanalysisnepalidataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samir Wagle
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    The comments were web-scraped, combining the available Nepali comments from existing Hugging Face and Kaggle datasets, and then preprocessed: HTML tags, usernames, links, emoji, and non-Nepali characters were removed, and lemmatization was applied. The resulting clean dataset was labeled with an LLM-as-a-judge approach using OpenAI's o3 model (premium), which cost around $20 for labeling. Different embedding models were then used for embedding. This will act as a benchmark dataset for sentiment analysis of Nepali text. Please give the author credit if you use it.
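
    For illustration, here is a labeling loop in the spirit of the LLM-as-a-judge approach the author describes; the prompt, label set, and use of the chat completions endpoint are assumptions, not the author's actual pipeline:

    ```python
    # Hypothetical LLM-as-a-judge sentiment labeling sketch (requires OPENAI_API_KEY).
    from openai import OpenAI

    client = OpenAI()

    def label_sentiment(comment: str) -> str:
        resp = client.chat.completions.create(
            model="o3",  # the description mentions OpenAI's o3 model
            messages=[
                {"role": "system",
                 "content": "Label the sentiment of this Nepali comment as "
                            "positive, negative, or neutral. Reply with one word."},
                {"role": "user", "content": comment},
            ],
        )
        return resp.choices[0].message.content.strip().lower()

    print(label_sentiment("आज मौसम राम्रो छ"))  # e.g. "positive"
    ```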

  17. cleaned_data

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    2025 Longevity x AI Hackathon (2025). cleaned_data [Dataset]. https://huggingface.co/datasets/longevity-db/cleaned_data
    Explore at:
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    2025 Longevity x AI Hackathon
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Cleaned Tabula Muris Senis Single-Cell Data and other aging datasets

    This dataset contains LLM-cleaned single-cell transcriptomic annotations from the Tabula Muris Senis project, specifically for mouse tissues processed with SmartSeq2, and ALL OTHER DATASETS WITH AGING IN THE FILENAME :-) . The cleaning and annotation were performed using large language models (OpenAI and Claude), enabling enriched metadata and corrected cell type labels.

    🧬 Over 1.3 million rows and 78.17 GB… See the full description on the dataset page: https://huggingface.co/datasets/longevity-db/cleaned_data.

  18. Data from: Congressional Witnesses Matter: Proving Witness Testimony Impact...

    • zenodo.org
    zip
    Updated Dec 6, 2024
    Cite
    Collin Coil; Caroline Bruckner; Nicholas Chen; Elizabeth Keith; Karen O'Connor (2024). Congressional Witnesses Matter: Proving Witness Testimony Impact Using Large Language Models [Dataset]. http://doi.org/10.5281/zenodo.14291000
    Explore at:
    zip (available download formats)
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Collin Coil; Caroline Bruckner; Nicholas Chen; Elizabeth Keith; Karen O'Connor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository provides the data supporting our study, in which we use large language models (LLMs) to analyze the impact of congressional witness testimony. The dataset has been curated and structured to facilitate reproducibility and encourage further research in this domain.

    The repository includes the results of our study (see `Results.zip`), the fine-tuning corpus (see `Model Training Data.zip`), and the Witness and Legislative History and Impact Corpus (WLHIC), which can be subdivided into the Witness Corpus (WC) and the Legislative History and Impact Corpus (LHIC). For the LHIC and WC, we provide cleaned JSONL files containing the full datasets, individual text files of each document, and accompanying metadata (see `WLHIC data.zip`). To ensure comprehensive accessibility, we also include the original PDF versions of the documents in these corpora (see `WLHIC Raw Files.zip`).

    We also provide the sentence transformer model resulting from the extended pretraining process and the model resulting from the fine-tuning process. Both are accessible in `Models.zip`.

    Researchers can use the provided data to replicate our findings and verify the results of our analysis. The cleaned data can also be regenerated by applying the cleaning scripts provided in the code repository to the LHIC and WC text files. While slight variations in results may occur when replicating the study from scratch due to the stochastic nature of LLM training, these differences are minimal and do not affect the substantive findings.

    We encourage the use of this dataset for reproducibility studies and to inspire further exploration of LLM applications in political science research. By publishing this data, we aim to promote transparency, collaboration, and innovation in the field.

  19. T3Set: Table Tennis Training Multimodal Dataset

    • zenodo.org
    bin, pdf, zip
    Updated May 27, 2025
    Cite
    Ji Ma (2025). T3Set: Table Tennis Training Multimodal Dataset [Dataset]. http://doi.org/10.5281/zenodo.15516144
    Explore at:
    bin, pdf, zip (available download formats)
    Dataset updated
    May 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ji Ma
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    May 26, 2025
    Description

    This is the dataset for the KDD'25 (dataset and benchmark track) full paper "T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training".

    T3Set (Table Tennis Training) is a multimodal dataset with aligned video-sensor-text data in table tennis training. The key features of T3Set include (1) temporal alignment between sensor data, video data, and text data, and (2) high-quality targeted suggestions that are consistent with a predefined suggestion taxonomy.

    The scripts we used for the dataset construction and data cleaning processes are provided in the GitHub repo: https://github.com/jima-cs/t3set

    If you find this dataset useful, please cite our paper:
    ```bibtex
    @inproceedings{ma2025t3set,
      title={T3Set: A Multimodal Dataset with Targeted Suggestions for LLM-based Virtual Coach in Table Tennis Training},
      author={Ji Ma and Jiale Wu and Haoyu Wang and Yanze Zhang and Xiao Xie and Zheng Zhou and Jiachen Wang and Yingcai Wu},
      year={2025},
      booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
      doi={10.1145/3711896.3737407}
    }
    ```

  20. mz94-documentation

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    ratan0n (2025). mz94-documentation [Dataset]. https://huggingface.co/datasets/ratanon/mz94-documentation
    Explore at:
    Dataset updated
    Jun 24, 2025
    Authors
    ratan0n
    Description

    MZ94 - LLM Training Dataset

      Overview
    

    This dataset contains crawled documentation from https://infozone.atlassian.net/wiki/spaces/MD94/, formatted for LLM training and RAG systems.

      Dataset Statistics

    Total Pages: 4109
    Total Words: 1005892
    Total Chunks: 2420
    Crawled: 2025-06-24 04:55:10

      Directory Structure

      /llm_ready/

    Plain text files optimized for LLM training:

    - Clean, formatted text content
    - Consistent structure with headers
    - Document… See the full description on the dataset page: https://huggingface.co/datasets/ratanon/mz94-documentation.
