Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
N-grams are fixed-size tuples of items; in this dataset the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words. The n-grams in this dataset were produced by sliding a window over the text of the books and emitting one record each time the window advances by a token.
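The sliding-window extraction described above can be sketched in a few lines of plain Python (the tokenizer here is a simple whitespace split, an assumption for illustration; the real corpus uses Google's own tokenization):

```python
def ngrams(tokens, n):
    """Slide a window of width n over the token list, yielding one tuple per position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox jumps over".split()
for gram in ngrams(words, 5):
    print(" ".join(gram))
# Each new token shifts the window by one, producing one record per position.
```

A six-word sentence therefore yields exactly two 5-grams, one per window position.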
python -m spacy download en_core_web_sm
Titles:
jq -s '.[].title' raw/dict.jsonl
returns:
"English"
"English One Million"
"American English"
"British English"
"English Fiction"
"Chinese (simplified)"
"French"
"German"
"Hebrew"
"Italian"
"Russian"
"Spanish"
Spellcheck (https://pypi.org/project/pyspellchecker/): English - 'en', Spanish - 'es', French - 'fr', Portuguese - 'pt', German - 'de', Russian - 'ru', Arabic - 'ar'
Sets now:
"English" - en "Spanish" - es "French" - fr "German"… See the full description on the dataset page: https://huggingface.co/datasets/gustawdaniel/ngram-google-2012.
The Google Books Ngram Viewer is optimized for quick inquiries into the usage of small sets of phrases. If you're interested in performing a large scale analysis on the underlying data, you might prefer to download a portion of the corpora yourself. Or all of it, if you have the bandwidth and space. We're happy to oblige. These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets).
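For a bulk download as described above, the file names can be generated programmatically. This sketch assumes the naming pattern used by the 2012 release (googlebooks-&lt;corpus&gt;-all-&lt;n&gt;gram-&lt;version&gt;-&lt;index&gt;.gz); verify the exact pattern and index letters against the official download page before use:

```python
# Sketch of building download URLs for the Version 2 (20120701) n-gram files.
# The base URL and file-naming pattern are assumptions; check them against
# the Google Books Ngram downloads page.
BASE = "http://storage.googleapis.com/books/ngrams/books"

def ngram_url(corpus="eng", n=1, index="a", version="20120701"):
    """Build the URL for one shard of the n-gram corpus."""
    return f"{BASE}/googlebooks-{corpus}-all-{n}gram-{version}-{index}.gz"

print(ngram_url())
print(ngram_url(corpus="eng", n=5, index="aa"))
```

The version identifier in the file name is what makes the datasets persistent: a later scan would be published under a new date stamp rather than overwriting these files.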