29 datasets found
  1. Webis-Sentences-17

    • webis.de
    Updated Feb 27, 2017
  2. Webis-Simple-Sentences-17 Corpus

    • zenodo.org
    • search.datacite.org
    Updated Feb 27, 2017
  3. Webis-QSpell-17

    • webis.de
    • zenodo.org
    Updated 2017
  4. Webis-Mnemonics-17

    • webis.de
    • zenodo.org
    Updated 2017
  5. Race and the criminal justice system statistics 2018

    • www.gov.uk
    Updated Nov 28, 2019
  6. Women and the criminal justice system 2017

    • www.gov.uk
    Updated Nov 29, 2018
  7. Youth Justice statistics: 2017 to 2018

    • www.gov.uk
    Updated Jan 30, 2020
  8. SNAP Memetracker

    • www.kaggle.com
    Updated Nov 21, 2016
  9. Bilingual English-Icelandic parallel corpus from Nordisk eTax website

    • data.europa.eu
    Updated Oct 10, 2019
  10. Phrases in email subject lines

    • www.getresponse.com
    Updated Aug 12, 2019
  11. Data for: Language Models, Surprisal and Fantasy in Slavic...

    • data.mendeley.com
    • search.datacite.org
    Updated Aug 29, 2018
  12. Pre-trained Word Vectors for Spanish

    • www.kaggle.com
    Updated Aug 9, 2017
  13. Ambulance Services, England - 2014-15

    • digital.nhs.uk
    Updated Jun 17, 2015
  14. Universal Inspirational Quotes

    • rapidapi.com
    Updated Jun 2, 2018
  15. Bilingual hr-en parallel corpus from the National and University Library in...

    • data.europa.eu
    Updated Oct 10, 2019
  16. Blueways Conservation Decision Support Tool

    • geospatial.tnc.org
    Updated Oct 30, 2019
  17. Federal Justice Statistics Program Data Series

    • www.da-ra.de
    Updated Mar 2, 1990
  18. Archival Version

    • www.da-ra.de
    • www.icpsr.umich.edu
    • +1 more
    Updated Feb 17, 1999
  19. A Canadian French Emotional Speech Dataset

    • figshare.com
    • zenodo.org
    Updated Dec 17, 2019
  20. Percentage of Detected and Sanctioned Offences, Borough

    • data.gov.uk
    • datahub.ckan.io
    • +1 more
    Updated Mar 23, 2017
  21. VidTIMIT Audio-Video Dataset

    • www.kaggle.com
    Updated Dec 30, 2018
  22. HF radar daily averaged surface currents from the MOOSE MEDTLN sites (Toulon...

    • erddap.osupytheas.fr
    Updated Aug 23, 2018
  23. Archival Version

    • www.da-ra.de
    • www.icpsr.umich.edu
    • +2 more
    Updated Oct 2, 1993
  24. Archival Version

    • www.da-ra.de
    • www.childandfamilydataarchive.org
    • +2 more
    Updated Jul 28, 1998
  25. Archival Version

    • www.da-ra.de
    • www.childandfamilydataarchive.org
    • +2 more
    Updated Nov 2, 1999
  26. National Jail Census Series

    • www.da-ra.de
    Updated Jul 13, 1996
  27. Archival Version

    • www.da-ra.de
    • www.icpsr.umich.edu
    • +3 more
    Updated Jul 13, 1996
  28. Techniques for Assessing the Accuracy of Recidivism Prediction Scales,...

    • www.da-ra.de
    Updated Oct 2, 1993
  29. Data Fusion Contest 2017 (DFC2017)

    • search.datacite.org
    Updated Oct 29, 2019

Webis-Sentences-17

  • Dataset updated Feb 27, 2017
Dataset provided by
Bauhaus University, Weimar (http://www.uni-weimar.de/)
The Web Technology & Information Systems Network
Authors
Stein, Benno; Kiesel, Johannes; Lucks, Stefan
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Webis-Sentences-17 corpus is a collection of 3,369,618,811 sentences extracted from the ClueWeb12 web crawl. It is designed to support statistical analyses of human-written sentences. More details on the sentence extraction can be found in the associated publication. The Webis-Simple-Sentences-17 corpus contains 471,085,690 English sentences from the Webis-Sentences-17 corpus. The sentences were sampled to match the level of sentence complexity of sentences that people make up as memory aids for remembering passwords. Sentence complexity was measured as syllables per word. Both corpora are split into training and test sets, as used in the associated publication. The test set is extracted from part 00 of ClueWeb12, while the training set is extracted from the other parts.
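
To make the complexity measure and the split rule in the description concrete, here is a minimal Python sketch. It is not the corpus authors' code: the vowel-group syllable counter is a rough heuristic chosen only for illustration, and the function names and the representation of ClueWeb12 part identifiers as strings such as "00" are assumptions. The associated publication remains the reference for the actual procedure.

```python
import re

# Assumption: syllables are approximated by runs of vowels.
# This is a crude heuristic, not the method used to build the corpus.
VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word: str) -> int:
    """Estimate the syllables in one word by counting vowel groups."""
    w = word.lower()
    n = len(VOWEL_GROUP.findall(w))
    # Discount a trailing silent 'e' ("make"), but keep "-le" endings ("simple").
    if w.endswith("e") and not w.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def syllables_per_word(sentence: str) -> float:
    """Average estimated syllables per alphabetic word in a sentence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)


def split_for_part(part: str) -> str:
    """Map a ClueWeb12 part identifier to its split: part 00 is the test set."""
    return "test" if part == "00" else "train"


if __name__ == "__main__":
    print(syllables_per_word("remember the password"))
    print(split_for_part("00"), split_for_part("07"))
```

A lower syllables-per-word score indicates a simpler sentence under this measure. A dictionary-based syllable counter would be more accurate than the vowel-group heuristic used here.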
