2 datasets found

Data from: imdb
huggingface.co
Updated Aug 3, 2003
Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
imdb-deduplicated
huggingface.co
Cite
Daniel van Strien, imdb-deduplicated [Dataset]. https://huggingface.co/datasets/davanstrien/imdb-deduplicated
Authors
Daniel van Strien
Description
Deduplicated imdb

This dataset is a deduplicated version of imdb using semantic deduplication with SemHash.
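SemHash's own API is not shown on this page. As a rough illustration of similarity-based deduplication in general, here is a minimal sketch that swaps in a much simpler technique: greedy filtering on word-set Jaccard similarity. The function names, the 0.9 threshold, and the sample reviews are assumptions made for this example, not the settings used to build the dataset.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text already kept.

    This is a lexical stand-in for semantic deduplication: SemHash compares
    embeddings, whereas this sketch compares lowercase word sets.
    """
    kept, kept_tokens = [], []
    for text in texts:
        tokens = set(text.lower().split())
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(text)
            kept_tokens.append(tokens)
    return kept


# Hypothetical sample reviews; the second differs from the first only in case.
reviews = [
    "Great movie , loved it",
    "great movie , loved it",
    "Terrible plot",
]
unique_reviews = deduplicate(reviews)
print(len(reviews), "->", len(unique_reviews))
```

The pairwise comparison is quadratic in the number of kept texts, so this is only suitable for small samples.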

Deduplication Details

Method: deduplicate

Column: text

Original size: 25,000 samples

Deduplicated size: 24,830 samples

Duplicate ratio: 0.68%

Reduction: 0.68%

Date processed: 2025-06-27
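
The reported ratio follows directly from the two sizes listed above. A quick check, with variable names chosen here purely for illustration:

```python
# Sizes as listed in the deduplication details.
original_size = 25_000
deduplicated_size = 24_830

# Number of samples dropped, and their share of the original data.
removed = original_size - deduplicated_size
duplicate_ratio = removed / original_size

print(f"Removed {removed} samples ({duplicate_ratio:.2%})")
```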

How to use

from datasets import load_dataset

# Use the full repository ID (owner/name) from the dataset URL
dataset = load_dataset("davanstrien/imdb-deduplicated")

Processing script

This dataset was created using the… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/imdb-deduplicated.
20 scholarly articles cite the stanfordnlp/imdb dataset (per Google Scholar).