100+ datasets found

filter-bad-models
huggingface.co
Updated Jul 20, 2023
Cite
Huggingface Projects (2023). filter-bad-models [Dataset]. https://huggingface.co/datasets/huggingface-projects/filter-bad-models
Dataset updated
Jul 20, 2023
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Huggingface Projects
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
huggingface-projects/filter-bad-models dataset hosted on Hugging Face and contributed by the HF Datasets community
profanity-filter
huggingface.co
Updated Jul 15, 2023
Cite
The Shawarma Man (2023). profanity-filter [Dataset]. https://huggingface.co/datasets/shawarmas/profanity-filter
Dataset updated
Jul 15, 2023
Authors
The Shawarma Man
Description
shawarmas/profanity-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
amazon-product-data-filter
huggingface.co
Updated Nov 14, 2023
+ more versions
Cite
amazon-product-data-filter [Dataset]. https://huggingface.co/datasets/iarbel/amazon-product-data-filter
Dataset updated
Nov 14, 2023
Authors
Iftach Arbel
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
Dataset Card for "amazon-product-data-filter"

Dataset Summary

The Amazon Product Dataset contains product listing data from the Amazon US website. It can be used for various NLP and classification tasks, such as text generation, product type classification, attribute extraction, image recognition and more.

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

Each data point provides product information, such… See the full description on the dataset page: https://huggingface.co/datasets/iarbel/amazon-product-data-filter.
s1-advanced-filter-plus
huggingface.co
Updated Mar 11, 2025
+ more versions
Cite
Quy-Anh Dang (2025). s1-advanced-filter-plus [Dataset]. https://huggingface.co/datasets/quyanh/s1-advanced-filter-plus
Dataset updated
Mar 11, 2025
Authors
Quy-Anh Dang
Description
quyanh/s1-advanced-filter-plus dataset hosted on Hugging Face and contributed by the HF Datasets community
pile-of-law
huggingface.co
opendatalab.com
Updated Jul 10, 2022
Cite
Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
Dataset updated
Jul 10, 2022
Dataset authored and provided by
Pile of Law
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
PERSONA
huggingface.co
Updated Apr 16, 2025
Cite
SynthLabs (2025). PERSONA [Dataset]. https://huggingface.co/datasets/SynthLabsAI/PERSONA
Dataset updated
Apr 16, 2025
Dataset authored and provided by
SynthLabs
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
Dataset Card for PERSONAS (Prism Filter)

PERSONAS (Prism filter) is one of the largest datasets of synthetic preferences, with over 200k preferences over thousands of questions and 1k personas. Details on the PERSONAS dataset can be found in the accompanying paper. Note that you MUST also fill out the form on our site to receive access to the full dataset.

Dataset Details

Dataset Description

The personas dataset is a pluralistic… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/PERSONA.
generation-semantic-memorization-filters
huggingface.co
+ more versions
Cite
USVSN Sai Prashanth, generation-semantic-memorization-filters [Dataset]. https://huggingface.co/datasets/usvsnsp/generation-semantic-memorization-filters
Authors
USVSN Sai Prashanth
Description
usvsnsp/generation-semantic-memorization-filters dataset hosted on Hugging Face and contributed by the HF Datasets community
sft-spin-filter
huggingface.co
Updated Sep 20, 2024
+ more versions
Cite
Yifan Wang (2024). sft-spin-filter [Dataset]. https://huggingface.co/datasets/AmberYifan/sft-spin-filter
Dataset updated
Sep 20, 2024
Authors
Yifan Wang
Description
AmberYifan/sft-spin-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
the-stack-v2-filter
huggingface.co
Cite
the-stack-v2-filter [Dataset]. https://huggingface.co/datasets/luowenyang/the-stack-v2-filter
Authors
Tim Turing
Description
luowenyang/the-stack-v2-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
oo1-uuid-filter
huggingface.co
Updated Jul 8, 2025
Cite
Yizhi Li (2025). oo1-uuid-filter [Dataset]. https://huggingface.co/datasets/yizhilll/oo1-uuid-filter
Dataset updated
Jul 8, 2025
Authors
Yizhi Li
Description
yizhilll/oo1-uuid-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
Maze-Reasoning-filter
huggingface.co
Updated Feb 9, 2025
Cite
Menlo Research (2025). Maze-Reasoning-filter [Dataset]. https://huggingface.co/datasets/Menlo/Maze-Reasoning-filter
Dataset updated
Feb 9, 2025
Dataset authored and provided by
Menlo Research
Description
Menlo/Maze-Reasoning-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
ov-kit-doc-files-filter-v2
huggingface.co
Updated May 28, 2024
+ more versions
Cite
PE-NLP (2024). ov-kit-doc-files-filter-v2 [Dataset]. https://huggingface.co/datasets/pe-nlp/ov-kit-doc-files-filter-v2
Dataset updated
May 28, 2024
Dataset authored and provided by
PE-NLP
Description
pe-nlp/ov-kit-doc-files-filter-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
sft-all-v0.1-clean-filter-r1
huggingface.co
Updated Mar 1, 2025
Cite
II Vietnam (2025). sft-all-v0.1-clean-filter-r1 [Dataset]. https://huggingface.co/datasets/II-Vietnam/sft-all-v0.1-clean-filter-r1
Dataset updated
Mar 1, 2025
Dataset authored and provided by
II Vietnam
Description
II-Vietnam/sft-all-v0.1-clean-filter-r1 dataset hosted on Hugging Face and contributed by the HF Datasets community
sft-all-v0.1-clean-filter-r1
huggingface.co
Updated Mar 1, 2025
+ more versions
Cite
sw_tuenguyen (2025). sft-all-v0.1-clean-filter-r1 [Dataset]. https://huggingface.co/datasets/meoconxinhxan/sft-all-v0.1-clean-filter-r1
Dataset updated
Mar 1, 2025
Authors
sw_tuenguyen
Description
meoconxinhxan/sft-all-v0.1-clean-filter-r1 dataset hosted on Hugging Face and contributed by the HF Datasets community
record-test-filter
huggingface.co
Updated Jun 26, 2025
Cite
Suhail (2025). record-test-filter [Dataset]. https://huggingface.co/datasets/suhaild/record-test-filter
Dataset updated
Jun 26, 2025
Authors
Suhail
Description
suhaild/record-test-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
sft-question-medical-synthetic-filter
huggingface.co
Updated Jun 21, 2025
+ more versions
Cite
Trần Phước Sang (2025). sft-question-medical-synthetic-filter [Dataset]. https://huggingface.co/datasets/phuocsang/sft-question-medical-synthetic-filter
Dataset updated
Jun 21, 2025
Authors
Trần Phước Sang
Description
phuocsang/sft-question-medical-synthetic-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
waon-cc-audio-without-filter
huggingface.co
Updated May 29, 2025
Cite
speed (2025). waon-cc-audio-without-filter [Dataset]. https://huggingface.co/datasets/speed/waon-cc-audio-without-filter
Dataset updated
May 29, 2025
Authors
speed
Description
speed/waon-cc-audio-without-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
translation-filter
huggingface.co
Updated Feb 1, 2024
+ more versions
Cite
Satpal Singh Rathore (2024). translation-filter [Dataset]. https://huggingface.co/datasets/satpalsr/translation-filter
Dataset updated
Feb 1, 2024
Authors
Satpal Singh Rathore
Description
satpalsr/translation-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
GigaVerbo-Text-Filter
huggingface.co
Updated Nov 13, 2024
Cite
Tucano (2024). GigaVerbo-Text-Filter [Dataset]. https://huggingface.co/datasets/TucanoBR/GigaVerbo-Text-Filter
Dataset updated
Nov 13, 2024
Dataset authored and provided by
Tucano
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
GigaVerbo Text-Filter

Dataset Summary

GigaVerbo Text-Filter is a dataset with 110,000 randomly selected samples from 9 subsets of GigaVerbo (i.e., specifically those that were not synthetic). This dataset was used to train the text-quality filters described in "Tucano: Advancing Neural Text Generation for Portuguese". To create the text embeddings, we used sentence-transformers/LaBSE. All scores were generated by GPT-4o.

Supported Tasks and Leaderboards… See the full description on the dataset page: https://huggingface.co/datasets/TucanoBR/GigaVerbo-Text-Filter.
db-sfw-512px-character-filter
huggingface.co
Updated Mar 12, 2024
+ more versions
Cite
Hayden Donnelly (2024). db-sfw-512px-character-filter [Dataset]. https://huggingface.co/datasets/hayden-donnelly/db-sfw-512px-character-filter
Dataset updated
Mar 12, 2024
Authors
Hayden Donnelly
Description
Danbooru SFW 512px Character Filter

This dataset is meant to be used for training a simple binary classifier that can filter the Danbooru SFW 2021 dataset. It is similar to db-sfw-512-general-filter-dataset but it has different class criteria. Just like the general dataset there are two classes: "accepted" and "rejected", with "accepted" representing samples that should pass through the filter and "rejected" representing samples that should not. To be accepted, a sample should… See the full description on the dataset page: https://huggingface.co/datasets/hayden-donnelly/db-sfw-512px-character-filter.
