3 datasets found
  1. Google Books Ngrams

    • registry.opendata.aws
    Updated Apr 20, 2018
    Cite
    Not managed (2018). Google Books Ngrams [Dataset]. https://registry.opendata.aws/google-ngrams/
    Dataset updated: Apr 20, 2018
    Dataset provided by: Not managed
    License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/ (license information was derived automatically)

    Description

    N-grams are fixed-size tuples of items; here the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five items (here, five words). The n-grams in this dataset were produced by sliding a window over the text of the books and emitting one record each time the window advanced by a token.
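    The sliding-window extraction described above can be sketched in a few lines of Python (a minimal illustration; the whitespace tokenizer is an assumption for clarity, not the pipeline Google actually used):

```python
def ngrams(text, n=5):
    """Slide an n-token window over the text, emitting one tuple per position."""
    tokens = text.split()  # naive whitespace tokenization (illustrative assumption)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("to be or not to be", n=2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```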

  2. ngram-google-2012

    • huggingface.co
    Cite
    Daniel Gustaw, ngram-google-2012 [Dataset]. https://huggingface.co/datasets/gustawdaniel/ngram-google-2012
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors: Daniel Gustaw
    License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/ (license information was derived automatically)

    Description

    python -m spacy download en_core_web_sm

    Titles: jq -s '.[].title' raw/dict.jsonl

    returns

    "English" "English One Million" "American English" "British English" "English Fiction" "Chinese (simplified)" "French" "German" "Hebrew" "Italian" "Russian" "Spanish"
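    For readers without jq, a pure-Python equivalent is straightforward (a sketch, assuming raw/dict.jsonl holds one JSON object per line, as the `-s` slurp usage suggests):

```python
import json

def titles(path="raw/dict.jsonl"):
    """Collect the "title" field of every JSON object in a JSONL file,
    mirroring: jq -s '.[].title' raw/dict.jsonl"""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line)["title"] for line in fh if line.strip()]
```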

    Spellcheck: https://pypi.org/project/pyspellchecker/ supports English ('en'), Spanish ('es'), French ('fr'), Portuguese ('pt'), German ('de'), Russian ('ru'), and Arabic ('ar')

    Sets now:

    "English" - en "Spanish" - es "French" - fr "German"… See the full description on the dataset page: https://huggingface.co/datasets/gustawdaniel/ngram-google-2012.

  3. Google Books N-Grams

    • marketplace.sshopencloud.eu
    Updated Sep 10, 2018
    Cite
    (2018). Google Books N-Grams [Dataset]. https://marketplace.sshopencloud.eu/dataset/G4YeiU
    Dataset updated: Sep 10, 2018
    Description

    The Google Books Ngram Viewer is optimized for quick inquiries into the usage of small sets of phrases. If you're interested in performing a large-scale analysis on the underlying data, you might prefer to download a portion of the corpora yourself. Or all of it, if you have the bandwidth and space. We're happy to oblige. These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets).
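    For anyone downloading the raw corpora, the Version 2 (20120701) files are tab-separated with one record per line: n-gram, year, match count, volume count. A minimal parsing sketch under that assumed four-column layout:

```python
from typing import NamedTuple

class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    volume_count: int

def parse_line(line: str) -> NgramRecord:
    """Parse one tab-separated Version 2 ngram record."""
    ngram, year, matches, volumes = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(matches), int(volumes))

rec = parse_line("circumvallate\t1978\t335\t91")
print(rec.year, rec.match_count)  # 1978 335
```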

  Not seeing a result you expected?
    Learn how you can add new datasets to our index.
