https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
Deduplicated imdb
This dataset is a deduplicated version of imdb using semantic deduplication with SemHash.
Deduplication Details
Method: deduplicate
Column: text
Original size: 25,000 samples
Deduplicated size: 24,830 samples
Duplicate ratio: 0.68%
Reduction: 0.68%
Date processed: 2025-06-27
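The duplicate ratio above follows directly from the two sizes. The short check below is only an illustrative sketch: the helper name duplicate_ratio is hypothetical and is not part of SemHash or the processing script.

```python
def duplicate_ratio(original_size: int, deduplicated_size: int) -> float:
    """Return the share of removed samples as a percentage of the original size."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    removed = original_size - deduplicated_size
    return 100.0 * removed / original_size


# Sizes taken from the deduplication details listed above.
original_size = 25_000
deduplicated_size = 24_830

removed = original_size - deduplicated_size
ratio = duplicate_ratio(original_size, deduplicated_size)
print(f"{removed} samples removed, {ratio:.2f}% of the original")
```

Because the ratio and the reduction are both measured against the original 25,000 samples, they are the same number here.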
How to use
from datasets import load_dataset
# Use the full repository id (namespace/name) from the dataset page URL.
dataset = load_dataset("davanstrien/imdb-deduplicated")
Processing script
This dataset was created using the… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/imdb-deduplicated.