10 datasets found
  1. Webis Cross-Lingual Sentiment Dataset 2010 (Webis-CLS-10)

    gz, txt
    Updated Jul 16, 2010
  2. Webis-CLS-10

    Updated 2010
  3. ITOP Dataset

    gz, jpg
    Updated Oct 8, 2016
  4. Sentiment Analysis of movie review

    Updated Nov 8, 2020
  5. Women's E-Commerce Clothing Reviews

    Updated Feb 3, 2018
  6. COVID-19 Open Research Dataset Challenge (CORD-19)

    Updated Apr 25, 2022
  7. Website Phishing Dataset

    Updated May 4, 2019
  8. Predict Click through rate (CTR) for a website

    Updated Jun 20, 2019
  9. Rotten Tomatoes movies and critic reviews dataset

    Updated Nov 4, 2020
  10. URL Classification Dataset [DMOZ]

    Updated Aug 19, 2018

Prettenhofer, Peter; Stein, Benno (2010). Webis Cross-Lingual Sentiment Dataset 2010 (Webis-CLS-10) [Dataset].

Webis Cross-Lingual Sentiment Dataset 2010 (Webis-CLS-10)

4 scholarly articles cite this dataset
Available download formats: txt, gz
Dataset updated
Jul 16, 2010
Dataset provided by
Bauhaus-Universität Weimar
Prettenhofer, Peter; Stein, Benno

Attribution 4.0 (CC BY 4.0)
License information was derived automatically


The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese.

For more information on the construction of the dataset, see Prettenhofer and Stein (2010) or the enclosed readme files. If you still have a question after reading the paper and the readme files, please contact Peter Prettenhofer.

We provide the dataset in two formats: (1) a processed format, which corresponds to the preprocessing (tokenization, etc.) used in Prettenhofer and Stein (2010); and (2) an unprocessed format, which contains the full text of the reviews (e.g., for machine translation or feature engineering).

The dataset was first used by Prettenhofer and Stein (2010). It consists of Amazon product reviews for three product categories (books, DVDs, and music) written in four languages: English, German, French, and Japanese. The German, French, and Japanese reviews were crawled from Amazon in November 2009. The English reviews were sampled from the Multi-Domain Sentiment Dataset (Blitzer et al., 2007). For each language-category pair there are three sets of documents: training, test, and unlabeled. The training and test sets comprise 2,000 documents each, whereas the number of unlabeled documents varies from 9,000 to 170,000.
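
The split structure described above can be tallied with a short Python sketch. It only enumerates the language-category pairs and the fixed training and test sizes stated in the description. The short language codes and category keys are hypothetical labels chosen for illustration, not the names used in the distributed files, and the unlabeled sets are left out because their sizes vary by pair.

```python
from itertools import product

# Hypothetical identifiers for illustration; the actual directory and
# file names in the download may differ (see the enclosed readme files).
LANGUAGES = ["en", "de", "fr", "jp"]
CATEGORIES = ["books", "dvd", "music"]

# Sizes stated in the dataset description: 2,000 training and
# 2,000 test documents for every language-category pair.
SPLIT_SIZES = {"train": 2000, "test": 2000}


def labeled_layout():
    """Map each (language, category) pair to its labeled split sizes."""
    return {
        (lang, cat): dict(SPLIT_SIZES)
        for lang, cat in product(LANGUAGES, CATEGORIES)
    }


layout = labeled_layout()
pairs = len(layout)
labeled_total = sum(sum(sizes.values()) for sizes in layout.values())

print(pairs)          # 12 language-category pairs
print(labeled_total)  # 48000 labeled documents in total
```

Under these figures, the labeled portion accounts for 48,000 of the roughly 800,000 reviews, so most of the corpus consists of unlabeled documents.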
