Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
N-grams are fixed-size tuples of items; in this dataset the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words. The n-grams in this dataset were produced by sliding a window over the text of the books and emitting one record each time the window advances by a token.
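The sliding-window extraction described above can be sketched in a few lines of plain Python (the tokenizer here is a simple whitespace split, an assumption for illustration; the real corpus uses Google's own tokenization):

```python
def ngrams(tokens, n):
    """Slide a window of width n over the token list, yielding one tuple per position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox jumps over".split()
for gram in ngrams(words, 5):
    print(" ".join(gram))
# Each new token shifts the window by one, producing one record per position.
```

A six-word sentence therefore yields exactly two 5-grams, one per window position.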
python -m spacy download en_core_web_sm
Titles:
jq -s '.[].title' raw/dict.jsonl
returns:
"English"
"English One Million"
"American English"
"British English"
"English Fiction"
"Chinese (simplified)"
"French"
"German"
"Hebrew"
"Italian"
"Russian"
"Spanish"
Spellcheck (https://pypi.org/project/pyspellchecker/): English - 'en', Spanish - 'es', French - 'fr', Portuguese - 'pt', German - 'de', Russian - 'ru', Arabic - 'ar'
Sets now:
"English" - en "Spanish" - es "French" - fr "German"… See the full description on the dataset page: https://huggingface.co/datasets/gustawdaniel/ngram-google-2012.
The Google Books Ngram Viewer is optimized for quick inquiries into the usage of small sets of phrases. If you're interested in performing a large scale analysis on the underlying data, you might prefer to download a portion of the corpora yourself. Or all of it, if you have the bandwidth and space. We're happy to oblige. These datasets were generated in July 2012 (Version 2) and July 2009 (Version 1); we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20120701 and 20090715 for the current sets).
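For a bulk download as described above, the file names can be generated programmatically. This sketch assumes the naming pattern used by the 2012 release (googlebooks-&lt;corpus&gt;-all-&lt;n&gt;gram-&lt;version&gt;-&lt;index&gt;.gz); verify the exact pattern and index letters against the official download page before use:

```python
# Sketch of building download URLs for the Version 2 (20120701) n-gram files.
# The base URL and file-naming pattern are assumptions; check them against
# the Google Books Ngram downloads page.
BASE = "http://storage.googleapis.com/books/ngrams/books"

def ngram_url(corpus="eng", n=1, index="a", version="20120701"):
    """Build the URL for one shard of the n-gram corpus."""
    return f"{BASE}/googlebooks-{corpus}-all-{n}gram-{version}-{index}.gz"

print(ngram_url())
print(ngram_url(corpus="eng", n=5, index="aa"))
```

The version identifier in the file name is what makes the datasets persistent: a later scan would be published under a new date stamp rather than overwriting these files.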