2 datasets found
  1. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Dataset authored and provided by: Stanford NLP
    License: https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
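As a rough illustration of the split layout described above, the sketch below counts rows and labels per split. The helper names `summarize_splits` and `load_imdb` are hypothetical, and the `text` and `label` field names are assumptions about the dataset schema rather than something this listing states. The Hub download is wrapped in a function so the counting logic can be tried offline on stand-in rows.

```python
from collections import Counter


def summarize_splits(splits):
    """Return {split_name: (row_count, Counter of labels)} for each split."""
    summary = {}
    for name, rows in splits.items():
        # Assumed schema: each row is a mapping with a "label" field.
        labels = Counter(row["label"] for row in rows)
        summary[name] = (len(rows), labels)
    return summary


def load_imdb():
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the rest of the sketch runs without the package.
    from datasets import load_dataset
    return load_dataset("stanfordnlp/imdb")


# Tiny stand-in rows (not real reviews) so the helper can be tried offline.
toy = {
    "train": [{"text": "Loved it.", "label": 1}, {"text": "Dreadful.", "label": 0}],
    "test": [{"text": "A delight.", "label": 1}],
}
toy_summary = summarize_splits(toy)
print(toy_summary["train"][0])  # number of rows in the toy train split
```

With network access, `summarize_splits(load_imdb())` should report the per-split sizes, which can then be compared against the 25,000 training and 25,000 test reviews quoted in the summary.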

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  2. imdb-deduplicated

    • huggingface.co
    Daniel van Strien, imdb-deduplicated [Dataset]. https://huggingface.co/datasets/davanstrien/imdb-deduplicated
    Authors: Daniel van Strien
    Description

    Deduplicated imdb

    This dataset is a deduplicated version of imdb using semantic deduplication with SemHash.

      Deduplication Details
    

    Method: deduplicate

    Column: text

    Original size: 25,000 samples

    Deduplicated size: 24,830 samples

    Duplicate ratio: 0.68%

    Reduction: 0.68%

    Date processed: 2025-06-27
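The reported duplicate ratio can be checked directly from the two sizes listed above. This is only a worked arithmetic check, and the variable names are illustrative.

```python
original_size = 25_000       # "Original size" from the card
deduplicated_size = 24_830   # "Deduplicated size" from the card

removed = original_size - deduplicated_size
duplicate_ratio = removed / original_size

print(f"removed: {removed}")                      # removed: 170
print(f"duplicate ratio: {duplicate_ratio:.2%}")  # duplicate ratio: 0.68%
```

The 170 removed samples account for both the duplicate ratio and the reduction of 0.68%, since both are measured against the original 25,000 samples.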

      How to use
    

    from datasets import load_dataset

    # Use the full repository ID (owner/name) from the dataset URL.
    dataset = load_dataset("davanstrien/imdb-deduplicated")

      Processing script
    

    This dataset was created using the… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/imdb-deduplicated.



Data from: imdb

IMDB

stanfordnlp/imdb

20 scholarly articles cite this dataset.