100+ datasets found
  1. filter-bad-models

    • huggingface.co
    Updated Jul 20, 2023
    Cite
    Huggingface Projects (2023). filter-bad-models [Dataset]. https://huggingface.co/datasets/huggingface-projects/filter-bad-models
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Huggingface Projects
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    huggingface-projects/filter-bad-models dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. profanity-filter

    • huggingface.co
    Updated Jul 15, 2023
    Cite
    The Shawarma Man (2023). profanity-filter [Dataset]. https://huggingface.co/datasets/shawarmas/profanity-filter
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 15, 2023
    Authors
    The Shawarma Man
    Description

    shawarmas/profanity-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. amazon-product-data-filter

    • huggingface.co
    Updated Nov 14, 2023
    + more versions
    Cite
    amazon-product-data-filter [Dataset]. https://huggingface.co/datasets/iarbel/amazon-product-data-filter
    Dataset updated
    Nov 14, 2023
    Authors
    Iftach Arbel
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Dataset Card for "amazon-product-data-filter"

      Dataset Summary
    

    The Amazon Product Dataset contains product listing data from the Amazon US website. It can be used for various NLP and classification tasks, such as text generation, product type classification, attribute extraction, image recognition and more.

      Languages
    

    The text in the dataset is in English.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    Each data point provides product information, such… See the full description on the dataset page: https://huggingface.co/datasets/iarbel/amazon-product-data-filter.

  4. s1-advanced-filter-plus

    • huggingface.co
    Updated Mar 11, 2025
    + more versions
    Cite
    Quy-Anh Dang (2025). s1-advanced-filter-plus [Dataset]. https://huggingface.co/datasets/quyanh/s1-advanced-filter-plus
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 11, 2025
    Authors
    Quy-Anh Dang
    Description

    quyanh/s1-advanced-filter-plus dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Cite
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Dataset updated
    Jul 10, 2022
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  6. PERSONA

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    SynthLabs (2025). PERSONA [Dataset]. https://huggingface.co/datasets/SynthLabsAI/PERSONA
    Dataset updated
    Apr 16, 2025
    Dataset authored and provided by
    SynthLabs
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Dataset Card for PERSONAS (Prism Filter)

    PERSONAS (Prism filter) is one of the largest datasets of synthetic preferences, with over 200k preferences across thousands of questions and 1k personas. Details on the PERSONAS dataset can be found in the paper linked from the dataset page. Note that you MUST also fill out the form on the authors' site to receive access to the full dataset; the form is also linked from the dataset page.

      Dataset Details
    
    
    
    
    
    
    
      Dataset Description
    

    The personas dataset is a pluralistic… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/PERSONA.

  7. generation-semantic-memorization-filters

    • huggingface.co
    + more versions
    Cite
    USVSN Sai Prashanth, generation-semantic-memorization-filters [Dataset]. https://huggingface.co/datasets/usvsnsp/generation-semantic-memorization-filters
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    USVSN Sai Prashanth
    Description

    usvsnsp/generation-semantic-memorization-filters dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. sft-spin-filter

    • huggingface.co
    Updated Sep 20, 2024
    + more versions
    Cite
    Yifan Wang (2024). sft-spin-filter [Dataset]. https://huggingface.co/datasets/AmberYifan/sft-spin-filter
    Dataset updated
    Sep 20, 2024
    Authors
    Yifan Wang
    Description

    AmberYifan/sft-spin-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. the-stack-v2-filter

    • huggingface.co
    Cite
    the-stack-v2-filter [Dataset]. https://huggingface.co/datasets/luowenyang/the-stack-v2-filter
    Authors
    Tim Turing
    Description

    luowenyang/the-stack-v2-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. oo1-uuid-filter

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Yizhi Li (2025). oo1-uuid-filter [Dataset]. https://huggingface.co/datasets/yizhilll/oo1-uuid-filter
    Dataset updated
    Jul 8, 2025
    Authors
    Yizhi Li
    Description

    yizhilll/oo1-uuid-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Maze-Reasoning-filter

    • huggingface.co
    Updated Feb 9, 2025
    Cite
    Menlo Research (2025). Maze-Reasoning-filter [Dataset]. https://huggingface.co/datasets/Menlo/Maze-Reasoning-filter
    Dataset updated
    Feb 9, 2025
    Dataset authored and provided by
    Menlo Research
    Description

    Menlo/Maze-Reasoning-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. ov-kit-doc-files-filter-v2

    • huggingface.co
    Updated May 28, 2024
    + more versions
    Cite
    PE-NLP (2024). ov-kit-doc-files-filter-v2 [Dataset]. https://huggingface.co/datasets/pe-nlp/ov-kit-doc-files-filter-v2
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 28, 2024
    Dataset authored and provided by
    PE-NLP
    Description

    pe-nlp/ov-kit-doc-files-filter-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. sft-all-v0.1-clean-filter-r1

    • huggingface.co
    Updated Mar 1, 2025
    Cite
    II Vietnam (2025). sft-all-v0.1-clean-filter-r1 [Dataset]. https://huggingface.co/datasets/II-Vietnam/sft-all-v0.1-clean-filter-r1
    Dataset updated
    Mar 1, 2025
    Dataset authored and provided by
    II Vietnam
    Description

    II-Vietnam/sft-all-v0.1-clean-filter-r1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. sft-all-v0.1-clean-filter-r1

    • huggingface.co
    Updated Mar 1, 2025
    + more versions
    Cite
    sw_tuenguyen (2025). sft-all-v0.1-clean-filter-r1 [Dataset]. https://huggingface.co/datasets/meoconxinhxan/sft-all-v0.1-clean-filter-r1
    Dataset updated
    Mar 1, 2025
    Authors
    sw_tuenguyen
    Description

    meoconxinhxan/sft-all-v0.1-clean-filter-r1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. record-test-filter

    • huggingface.co
    Updated Jun 26, 2025
    Cite
    Suhail (2025). record-test-filter [Dataset]. https://huggingface.co/datasets/suhaild/record-test-filter
    Dataset updated
    Jun 26, 2025
    Authors
    Suhail
    Description

    suhaild/record-test-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. sft-question-medical-synthetic-filter

    • huggingface.co
    Updated Jun 21, 2025
    + more versions
    Cite
    Trần Phước Sang (2025). sft-question-medical-synthetic-filter [Dataset]. https://huggingface.co/datasets/phuocsang/sft-question-medical-synthetic-filter
    Dataset updated
    Jun 21, 2025
    Authors
    Trần Phước Sang
    Description

    phuocsang/sft-question-medical-synthetic-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. waon-cc-audio-without-filter

    • huggingface.co
    Updated May 29, 2025
    Cite
    speed (2025). waon-cc-audio-without-filter [Dataset]. https://huggingface.co/datasets/speed/waon-cc-audio-without-filter
    Dataset updated
    May 29, 2025
    Authors
    speed
    Description

    speed/waon-cc-audio-without-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. translation-filter

    • huggingface.co
    Updated Feb 1, 2024
    + more versions
    Cite
    Satpal Singh Rathore (2024). translation-filter [Dataset]. https://huggingface.co/datasets/satpalsr/translation-filter
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2024
    Authors
    Satpal Singh Rathore
    Description

    satpalsr/translation-filter dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. GigaVerbo-Text-Filter

    • huggingface.co
    Updated Nov 13, 2024
    Cite
    Tucano (2024). GigaVerbo-Text-Filter [Dataset]. https://huggingface.co/datasets/TucanoBR/GigaVerbo-Text-Filter
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 13, 2024
    Dataset authored and provided by
    Tucano
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    GigaVerbo Text-Filter

      Dataset Summary
    

    GigaVerbo Text-Filter is a dataset with 110,000 randomly selected samples from 9 subsets of GigaVerbo (i.e., specifically those that were not synthetic). This dataset was used to train the text-quality filters described in "Tucano: Advancing Neural Text Generation for Portuguese". To create the text embeddings, we used sentence-transformers/LaBSE. All scores were generated by GPT-4o.

      Supported Tasks and Leaderboards… See the full description on the dataset page: https://huggingface.co/datasets/TucanoBR/GigaVerbo-Text-Filter.
    
  20. db-sfw-512px-character-filter

    • huggingface.co
    Updated Mar 12, 2024
    + more versions
    Cite
    Hayden Donnelly (2024). db-sfw-512px-character-filter [Dataset]. https://huggingface.co/datasets/hayden-donnelly/db-sfw-512px-character-filter
    Dataset updated
    Mar 12, 2024
    Authors
    Hayden Donnelly
    Description

    Danbooru SFW 512px Character Filter

    This dataset is meant to be used for training a simple binary classifier that can filter the Danbooru SFW 2021 dataset. It is similar to db-sfw-512-general-filter-dataset but it has different class criteria. Just like the general dataset there are two classes: "accepted" and "rejected", with "accepted" representing samples that should pass through the filter and "rejected" representing samples that should not. To be accepted, a sample should… See the full description on the dataset page: https://huggingface.co/datasets/hayden-donnelly/db-sfw-512px-character-filter.
