MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
huggingface-projects/filter-bad-models dataset hosted on Hugging Face and contributed by the HF Datasets community
shawarmas/profanity-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for "amazon-product-data-filter"
Dataset Summary
The Amazon Product Dataset contains product listing data from the Amazon US website. It can be used for various NLP and classification tasks, such as text generation, product type classification, attribute extraction, image recognition and more.
Languages
The text in the dataset is in English.
Dataset Structure
Data Instances
Each data point provides product information, such… See the full description on the dataset page: https://huggingface.co/datasets/iarbel/amazon-product-data-filter.
quyanh/s1-advanced-filter-plus dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for PERSONAS (Prism Filter)
PERSONAS (Prism Filter) is one of the largest datasets of synthetic preferences, with over 200k preferences spanning thousands of questions and 1k personas. Details on the PERSONAS dataset can be found in the linked paper. Note that you MUST also fill out the form on our site to receive access to the full dataset.
Dataset Details
Dataset Description
The personas dataset is a pluralistic… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/PERSONA.
usvsnsp/generation-semantic-memorization-filters dataset hosted on Hugging Face and contributed by the HF Datasets community
AmberYifan/sft-spin-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
yizhilll/oo1-uuid-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
Menlo/Maze-Reasoning-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
pe-nlp/ov-kit-doc-files-filter-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
II-Vietnam/sft-all-v0.1-clean-filter-r1 dataset hosted on Hugging Face and contributed by the HF Datasets community
meoconxinhxan/sft-all-v0.1-clean-filter-r1 dataset hosted on Hugging Face and contributed by the HF Datasets community
suhaild/record-test-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
phuocsang/sft-question-medical-synthetic-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
speed/waon-cc-audio-without-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
satpalsr/translation-filter dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GigaVerbo Text-Filter
Dataset Summary
GigaVerbo Text-Filter is a dataset of 110,000 randomly selected samples from 9 subsets of GigaVerbo (specifically, those that were not synthetic). This dataset was used to train the text-quality filters described in "Tucano: Advancing Neural Text Generation for Portuguese". To create the text embeddings, we used sentence-transformers/LaBSE. All scores were generated by GPT-4o.
Supported Tasks and Leaderboards… See the full description on the dataset page: https://huggingface.co/datasets/TucanoBR/GigaVerbo-Text-Filter.
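As a rough illustration of how embeddings paired with quality scores can be turned into a text-quality filter, here is a minimal sketch. It is not the method from the Tucano paper: the nearest-centroid classifier, the 0.5 score threshold, the two-dimensional toy vectors, and the function names `train_centroids` and `is_high_quality` are all assumptions made for this example.

```python
from math import dist


def train_centroids(embeddings, scores, threshold=0.5):
    """Average the embeddings of high- and low-scoring samples separately.

    The threshold is an assumed cut-off, not a value from the dataset card.
    """
    groups = {"high": [], "low": []}
    for vec, score in zip(embeddings, scores):
        groups["high" if score >= threshold else "low"].append(vec)

    centroids = {}
    for label, vecs in groups.items():
        if not vecs:
            raise ValueError(f"no training samples for class {label!r}")
        # Column-wise mean of all vectors in this group.
        centroids[label] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return centroids


def is_high_quality(vec, centroids):
    """Keep a sample if its embedding lies closer to the high-score centroid."""
    return dist(vec, centroids["high"]) < dist(vec, centroids["low"])


# Toy stand-ins for real sentence embeddings and quality scores.
toy_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_scores = [0.95, 0.80, 0.10, 0.30]

centroids = train_centroids(toy_embeddings, toy_scores)
keep = is_high_quality([0.7, 0.3], centroids)
drop = is_high_quality([0.3, 0.7], centroids)
```

In practice the vectors would come from an embedding model and the classifier would likely be something stronger; the sketch only shows the shape of the pipeline: scores become labels, embeddings become features.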
Danbooru SFW 512px Character Filter
This dataset is meant to be used for training a simple binary classifier that can filter the Danbooru SFW 2021 dataset. It is similar to db-sfw-512-general-filter-dataset but it has different class criteria. Just like the general dataset there are two classes: "accepted" and "rejected", with "accepted" representing samples that should pass through the filter and "rejected" representing samples that should not. To be accepted, a sample should… See the full description on the dataset page: https://huggingface.co/datasets/hayden-donnelly/db-sfw-512px-character-filter.
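The two-class scheme described above can be sketched as a small filtering helper. This is only an illustration: `apply_filter` and `toy_classifier` are hypothetical names, and the toy rule (accept even ids) is a stand-in for a trained classifier, not the dataset's actual acceptance criteria, which are truncated in the excerpt.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

ACCEPTED = "accepted"
REJECTED = "rejected"


def apply_filter(samples: Iterable[T], classify: Callable[[T], str]) -> Iterator[T]:
    """Yield only the samples that the classifier labels as accepted."""
    for sample in samples:
        label = classify(sample)
        if label not in (ACCEPTED, REJECTED):
            raise ValueError(f"unexpected class label: {label!r}")
        if label == ACCEPTED:
            yield sample


def toy_classifier(sample_id: int) -> str:
    """Hypothetical stand-in for a trained binary classifier."""
    return ACCEPTED if sample_id % 2 == 0 else REJECTED


kept = list(apply_filter(range(6), toy_classifier))
```

Keeping the classifier behind a plain callable means the same helper would work whether the labels come from a neural network or a hand-written rule.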