The One Billion Word dataset is a benchmark dataset for language modeling. The training/held-out data was produced from the WMT 2011 News Crawl data using a combination of Bash shell and Perl scripts.
A text corpus with almost one billion words of training data for benchmarking statistical language models. The scale of approximately one billion words strikes a balance between keeping the benchmark relevant in a world of abundant data and keeping it easy for researchers to evaluate their modeling approaches. Monolingual English data was obtained from the WMT11 website and prepared using a variety of best practices for machine-learning dataset preparation.
The 1 Billion Word Language Model Benchmark is a dataset used for measuring progress in statistical language modeling, consisting of a large collection of text data.
The azhang42/one-billion-words-test dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
101 Billion Arabic Words Dataset
Updates
Maintenance Status: Actively maintained.
Update Frequency: Weekly updates to refine data quality and expand coverage.
Upcoming Version
Cleaner Version: A further-cleaned version of the dataset is in preparation; it adds a UUID column for better data traceability and management.
Dataset Details
The 101 Billion Arabic Words Dataset is curated by the Clusterlab team and consists of 101… See the full description on the dataset page: https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
In 1953, Michael West published a remarkable list of several thousand important vocabulary words known as the General Service List (GSL). Based on more than two decades of pre-computer corpus research, input from other famous early-20th-century researchers such as Harold Palmer, and several vocabulary conferences sponsored by the Carnegie Foundation in the 1930s, the GSL was designed to be more than simply a list of high-frequency words: its primary purpose was to combine objective and subjective criteria to produce a list of words that would be of “general service” to learners of English as a foreign language. However, as useful and helpful as this list has been over the decades, it has also been criticized for (1) being based on a corpus that is now considered quite dated, (2) being too small by modern standards (the initial work on the GSL was based on a 2.5-million-word corpus collected under a grant from the Rockefeller Foundation in 1938), and (3) not clearly defining what constitutes a “word”.
In March of 2013, on the 60th anniversary of West’s publication of the GSL, my colleagues (Dr. Brent Culligan & Joseph Phillips of Aoyama Gakuin Women’s Junior College) and I (Dr. Charles Browne, Meiji Gakuin University) announced the creation of a New General Service List (NGSL), one that is based on a carefully selected 273 million-word subsection of the 2 billion word Cambridge English Corpus (CEC).
New General Service List by Browne, C., Culligan, B., and Phillips, J. is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at http://www.charlie-browne.com.
NOTE: Development of the NGSL was made possible through approved access to the Cambridge English Corpus (CEC). The CEC is a multi-billion word computer database of contemporary spoken and written English. It includes British English, American English and other varieties of English. It also includes the Cambridge Learner Corpus, developed in collaboration with the University of Cambridge ESOL Examinations. Cambridge University Press has built up the CEC to provide evidence about language use that helps to produce better language teaching materials.
The Corpus of Contemporary American English (COCA) contains about 1 billion words in nearly 500,000 texts from 1990 to 2019 -- which are nearly evenly divided between spoken, fiction, magazines, newspapers, academic journals, blogs, other web pages, and TV/movie subtitles (120-130 million words in each genre). In addition, there are 20 million words from each year from 1990 to 2019 (with the same genre balance each year). From the COCA website: "The Corpus of Contemporary American English (COCA) is the only large and 'representative' corpus of American English. COCA is probably the most widely-used corpus of English, and it is related to many other corpora of English that we have created. These corpora were formerly known as the 'BYU Corpora', and they offer unparalleled insight into variation in English." (https://www.english-corpora.org/coca/)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency: how often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. More information on these files and the code used to generate them is available on Peter Norvig's website.
The code used to generate this dataset is distributed under the MIT License.
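As a small, hedged illustration of how a unigram-count dataset like this is typically consumed, the Python sketch below loads a word/count file and converts raw counts to relative frequencies. The tab-separated two-column layout and the file name count_1w.txt are assumptions for illustration, not a description of the actual distribution.

```python
# Minimal sketch: load a unigram count file (assumed layout: word<TAB>count per line)
# and convert raw counts into relative frequencies.

def load_unigram_counts(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            counts[word] = int(count)
    return counts

if __name__ == "__main__":
    counts = load_unigram_counts("count_1w.txt")  # hypothetical file name
    total = sum(counts.values())
    for word in ("the", "of", "and"):
        if word in counts:
            print(f"{word}: {counts[word] / total:.4%} of all tokens")
```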
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The 30 high frequency words used in Experiment 2 to generate the experimental stimuli alongside their SUBTLEX frequency per million words.
Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
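Since the article tag names are not given in the description above, the following Python sketch relies only on the documented "ENDOFARTICLE." end-of-article marker and simply strips XML markup; it is a rough illustration of how one of the 57 text files might be split into articles, and the file name is hypothetical.

```python
# Rough sketch: split one corpus file into article texts using the documented
# "ENDOFARTICLE." marker; XML tags are stripped with a regex because the exact
# tag names are not specified in the dataset description.
import re

def iter_articles(path):
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    for chunk in raw.split("ENDOFARTICLE."):
        text = re.sub(r"<[^>]+>", " ", chunk)  # drop surrounding XML markup
        text = " ".join(text.split())          # normalise whitespace
        if text:
            yield text

if __name__ == "__main__":
    for i, article in enumerate(iter_articles("spanish_corpus_part_01.xml")):  # hypothetical name
        print(i, article[:80])
        if i >= 2:
            break
```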
Dataset origin: https://www.clarin.si/repository/xmlui/handle/11356/1912
Description
ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and… See the full description on the dataset page: https://huggingface.co/datasets/FrancophonIA/ParlaMint_4.1.
The Corpus of Contemporary American English (COCA) contains more than one billion words of text (20 million words each year from 1990 to 2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, television and movie subtitles, blogs, and other web pages.
For an overview of COCA and its affordances, please see the supporting files.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
Hungarian is a Finno-Ugric language, a member of the Uralic language family. The Hungarian name for the language is Magyar.
The Finno-Ugric languages also include Finnish, Estonian, Lappic (Sámi) and some other languages spoken in Russia; Khanty and Mansi are the most closely related to Hungarian.
Although Hungarian, unlike most other European languages, is not an Indo-European language, its vocabulary contains many words from Slavic and Turkic languages and also from German.
This dataset contains so many words because Hungarian is an agglutinative language: words are built from multiple morphemes that determine their meaning, and all of these morphemes (including stems and affixes) remain essentially unchanged when they are joined.
Agglutinative languages tend to have a high rate of affixes or morphemes per word and to be very regular, with very few irregular verbs.
This dataset was inspired by the work of Rachel Tatman.
The Language Model Council (LMC) is a novel benchmarking framework proposed to address the challenge of ranking Large Language Models (LLMs) on highly subjective tasks [1]. These tasks can include areas such as emotional intelligence, creative writing, or persuasiveness, which often lack majoritarian human agreement [1].
The LMC operates through a democratic process [1]: 1. Formulate a test set through equal participation, 2. Administer the test among council members, and 3. Evaluate responses as a collective jury [1].
The council is composed of the newest LLMs, which are deployed on an open-ended emotional intelligence task: responding to interpersonal dilemmas [1]. The LMC has been shown to produce rankings that are more separable, more robust, and less biased than those from any individual LLM judge [1]. It is also more consistent with a human-established leaderboard than other benchmarks are [1].
This benchmarking framework is designed to encourage the healthy development of the field, particularly through the lens of mathematical reasoning tasks [2]. It provides a more comprehensive and fair comparison of different LLMs, especially for tasks that are highly subjective and often lack majoritarian human agreement [1].
(1) Language Model Council: Benchmarking Foundation Models on Highly .... https://arxiv.org/abs/2406.08598. (2) GitHub - GAIR-NLP/benbench: Benchmarking Benchmark Leakage in Large .... https://github.com/GAIR-NLP/benbench. (3) LLM Benchmarks: Understanding Language Model Performance. https://humanloop.com/blog/llm-benchmarks. (4) One Billion Word Benchmark for Measuring Progress in Statistical .... https://research.google.com/pubs/pub41880.html.
Which country has the most Facebook users? There are more than 383 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context, if India's Facebook audience were a country, it would be ranked third in terms of largest population worldwide. Apart from India, several other markets have more than 100 million Facebook users each: the United States, Indonesia, and Brazil, with 196.9 million, 122.3 million, and 111.65 million Facebook users respectively.
Facebook – the most used social media. Meta, the company previously called Facebook, owns four of the most popular social media platforms worldwide: WhatsApp, Facebook Messenger, Facebook, and Instagram. As of the third quarter of 2021, there were around 3.5 billion cumulative monthly users of the company's products worldwide. With around 2.9 billion monthly active users, Facebook is the most popular social media platform worldwide. With an audience of this scale, it is no surprise that the vast majority of Facebook's revenue is generated through advertising.
Facebook usage by device. As of July 2021, 98.5 percent of active users accessed their Facebook account from mobile devices; in fact, around 81.8 percent of Facebook audiences worldwide access the platform only via mobile phone. Facebook is not only available through mobile browsers, as the company has published several mobile apps for users to access its products and services. As of the third quarter of 2021, the four core Meta products led the ranking of the most downloaded mobile apps worldwide, with WhatsApp amassing approximately six billion downloads.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data set contains frequency counts of target words in 175 million academic abstracts published in all fields of knowledge. We quantify the prevalence of words denoting prejudice against ethnicity, gender, sexual orientation, gender identity, minority religious sentiment, age, body weight and disability in SSORC abstracts over the period 1970-2020. We then examine the relationship between the prevalence of such terms in the academic literature and their concomitant prevalence in news media content. We also analyze the temporal dynamics of an additional set of terms associated with social justice discourse in both the scholarly literature and in news media content. A few additional words not denoting prejudice are also available since they are used in the manuscript for illustration purposes.
The list of academic abstracts analyzed in this work was taken from the Semantic Scholar Open Research Corpus (SSORC). The corpus contains, as of 2020, over 175 million academic abstracts, and associated metadata, published in all fields of knowledge. The raw data is provided by Semantic Scholar in accessible JSON format.
Textual content included in our analysis is circumscribed to the scholarly articles’ titles and abstracts and does not include other article elements such as main body of text or references section. Thus, we use frequency counts derived from academic articles’ titles and abstracts as a proxy for word prevalence in those articles. This proxy was used because the SSORC corpus does not provide the entire text body of the indexed articles. Targeted textual content was located in JSON data and sorted by year to facilitate chronological analysis. Tokens were lowercased prior to estimating frequency counts.
Yearly relative frequencies of a target word or n-gram in the SSORC corpus were estimated by dividing the number of occurrences of the target word/n-gram in all scholarly articles within a given year by the total number of all words in all articles of that year. This method of estimating word frequencies accounts for variable volume of total scientific output over time. This approach has been shown before to accurately capture the temporal dynamics of historical events and social trends in news media corpora.
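The estimate described above is a simple per-year ratio. The Python sketch below illustrates it under the assumption that the abstracts are available as (year, text) pairs; that input structure, and whitespace tokenisation, are simplifications rather than the authors' actual pipeline.

```python
# Sketch of the yearly relative-frequency estimate: occurrences of a target word
# in all abstracts of a year divided by the total number of words in that year.
from collections import defaultdict

def yearly_relative_frequency(records, target):
    target = target.lower()
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        tokens = text.lower().split()      # tokens are lowercased before counting
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in totals if totals[year]}

if __name__ == "__main__":
    sample = [(2019, "Prejudice in media discourse"),
              (2019, "A study of word frequency"),
              (2020, "Prejudice and social justice discourse")]
    print(yearly_relative_frequency(sample, "prejudice"))
```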
It is possible that a small percentage of scholarly articles in the SSORC corpus contain incorrect or missing data. For earlier years in the SSORC corpus, abstract information is sometimes missing and only the article's title is available. As a result, the total and target word count metrics for a small subset of academic abstracts might not be precise. In a data analysis of 175 million scientific abstracts, manually checking the accuracy of frequency counts for every single academic abstract is unfeasible, and one hundred percent accuracy at capturing abstracts' content might be elusive due to a small number of erroneous outlier cases in the raw data. Overall, however, we are confident that our frequency metrics are representative of word prevalence in academic content, as illustrated by Figure 2 in the main manuscript, which shows the chronological prevalence in the SSORC corpus of several terms associated with different disciplines of scientific/academic knowledge.
Factor analysis of the frequency count time series was carried out only after Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test confirmed the suitability of the data for factor analysis. A single factor derived from the frequency count time series of prejudice-denoting terms was extracted from each corpus (academic abstracts and news media content). The same procedure was applied to the terms denoting social justice discourse. A factor loading cutoff of 0.5 was used to ascribe terms to a factor. Cronbach's alphas computed to determine whether the resulting factors were coherent were extremely high (>0.95).
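A hedged sketch of this procedure is shown below using the third-party factor_analyzer package; the package choice, the DataFrame layout (one column per term, one row per year) and the helper names are assumptions for illustration, not the authors' actual code.

```python
# Sketch of the suitability checks, single-factor extraction, loading cutoff and
# reliability check described above. `df`: hypothetical DataFrame with one column
# per prejudice-denoting term and one row per year of frequency data.
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the summed score)
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

def single_factor(df: pd.DataFrame, loading_cutoff: float = 0.5) -> dict:
    _, bartlett_p = calculate_bartlett_sphericity(df)  # suitability check 1
    _, kmo_overall = calculate_kmo(df)                 # suitability check 2
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(df)
    loadings = pd.Series(fa.loadings_[:, 0], index=df.columns)
    kept = loadings[loadings.abs() >= loading_cutoff].index.tolist()
    return {"bartlett_p": bartlett_p, "kmo": kmo_overall,
            "loadings": loadings, "terms_on_factor": kept,
            "cronbach_alpha": cronbach_alpha(df[kept])}
```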
The textual content of news and opinion articles from the outlets listed in Figure 5 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used word frequency counts derived from the original sources. Textual content included in our analysis is circumscribed to article headlines and the main body of text and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a given year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there are at least 1 million words of article content from that outlet.
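A hedged Python sketch of this extraction step is shown below; the lxml library, the example XPath expression, and the (outlet, year, html) input structure are illustrative assumptions rather than the expressions actually used for any specific outlet.

```python
# Sketch: locate article text in raw HTML with an outlet-specific XPath expression,
# lowercase tokens, and keep only outlet-years with at least one million words.
from collections import defaultdict
from lxml import html

ARTICLE_XPATH = "//div[@class='article-body']//p//text()"  # hypothetical example

def article_tokens(raw_html: str):
    tree = html.fromstring(raw_html)
    return " ".join(tree.xpath(ARTICLE_XPATH)).lower().split()

def outlet_year_word_totals(pages):
    # pages: iterable of (outlet, year, raw_html) tuples (assumed structure)
    totals = defaultdict(int)
    for outlet, year, raw_html in pages:
        totals[(outlet, year)] += len(article_tokens(raw_html))
    # discard outlet-years with sparse content (< 1 million words)
    return {key: n for key, n in totals.items() if n >= 1_000_000}
```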
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
The compressed files in this data set are listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript and raw data metrics
-scholarlyArticlesContainingTargetWords.rar contains the IDs of each analyzed abstract in the SSORC corpus and the counts of target words and total words for each scholarly article
-targetWordsInMediaArticlesCounts.rar contains counts of target words in news outlets articles as well as total counts of words in articles
In a small percentage of news articles, outlet specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles might not be precise.
In a data analysis of millions of news articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles' content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Rozado, Al-Gharbi, and Halberstadt, "Prevalence of Prejudice-Denoting Words in News Media Discourse" for supporting evidence).
31/08/2022 Update: There is a new way to download the Semantic Scholar Open Research Corpus (see https://github.com/allenai/s2orc). This updated version states that the corpus contains 136M+ paper nodes. However, when I downloaded a previous version of the corpus in 2021 from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I counted 175M unique identifiers. The URL of the previous version of the corpus is no longer active, but it has been cached by the Internet Archive at https://web.archive.org/web/20201030131959/http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/. I haven't had the time to look at the specific reason for the mismatch, but perhaps the newer version of the corpus has cleaned out many noisy entries from the previous version, which often contained entries with missing abstracts. Filtering out entries in low-prevalence languages other than English might be another reason. In any case, Figure 2 of the main manuscript of this work (at https://www.nas.org/academic-questions/35/2/themes-in-academic-literature-prejudice-and-social-justice) should provide support for the validity of the frequency counts.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated.
The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all of the information from the source TEI.
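For illustration, the Python sketch below reads a file in the vertical format under the common concordancer convention of one token per line with tab-separated positional attributes (here assumed to be token, MSD and lemma) and structural tags such as <s>…</s> on lines of their own; the actual column order in the jos1M distribution may differ.

```python
# Rough reader for a "vertical" corpus file, assuming token<TAB>MSD<TAB>lemma lines
# and structural tags (<s>, </s>, <p>, ...) on their own lines.
def read_vertical(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("<"):                 # structural markup
                if line.startswith("</s") and current:
                    sentences.append(current)
                    current = []
                continue
            fields = line.split("\t")
            if len(fields) >= 3:
                token, msd, lemma = fields[:3]       # assumed column order
                current.append((token, msd, lemma))
    if current:
        sentences.append(current)
    return sentences
```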
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
COVID-19 is the first known coronavirus pandemic. Nevertheless, the seasonal circulation of the four milder coronaviruses of humans – OC43, NL63, 229E and HKU1 – raises the possibility that these viruses are the descendants of more ancient coronavirus pandemics. This proposal arises by analogy to the observed descent of seasonal influenza subtypes H2N2 (now extinct), H3N2 and H1N1 from the pandemic strains of 1957, 1968 and 2009, respectively. Recent historical revisionist speculation has focussed on the influenza pandemic of 1889-1892, based on molecular phylogenetic reconstructions that show the emergence of human coronavirus OC43 around that time, probably by zoonosis from cattle. If the “Russian influenza”, as The Times named it in early 1890, was not influenza but was instead caused by a coronavirus, the origins of the other three milder human coronaviruses may also have left a residue of clinical evidence in the 19th century medical literature and popular press. In this paper, we search digitised 19th century British newspapers for evidence of previously unsuspected coronavirus pandemics. We conclude that there is little or no corpus linguistic signal in the UK national press for large-scale outbreaks of unidentified respiratory disease for the period 1785 to 1890.
Methods
The data file is a spreadsheet used to record queries made via CQPweb (https://cqpweb.lancs.ac.uk).
Search Terms
For clarity, in the ensuing descriptions, we use bold font for search terms and italic font for collocates and other quotations. Based on clinical descriptions of COVID-19 (reviewed by Cevik et al., 2020), we identified the following search terms: 1) “cough”, 2) “fever”, 3) “pneumonia”. To avoid confusion with years when influenza pandemics may have occurred, we added 4) “influenza” and 5) “epidemic”. Any combination of terms 1 to 3 co-occurring with term 4 alone, or with terms 4 and 5 together, would be indicative of a respiratory outbreak caused by, or at least attributed to, influenza. By contrast, any combination of terms 1 to 3 co-occurring with term 5 alone, or without either of terms 4 and 5, would suggest a respiratory disease that was not confidently identified as influenza at the time. Such an outbreak would provide a candidate coronavirus epidemic for further investigation.
Newspapers
Newspapers and years searched were as follows: Belfast Newsletter (1828-1900), The Era (1838-1900), Glasgow Herald (1820-1900), Hampshire & Portsmouth Telegraph (1799-1900), Ipswich Journal (1800-1900), Liverpool Mercury (1811-1900), Northern Echo (1870-1900), Pall Mall Gazette (1865-1900), Reynold’s Daily (1850-1900), Western Mail (1869-1900) and The Times (1785-2009). The search in The Times was extended to 2009 in order to provide a comparison with the 20th century. Searches were performed using Lancaster University’s instance of the CQPweb (Corpus Query Processor) corpus analysis software (https://cqpweb.lancs.ac.uk/; Hardie, 2012). CQPweb’s database is populated from the newspapers listed using optical character recognition (OCR), so for older publications in particular, some errors may be present (McEnery et al., 2019).
Statistics
The occurrence of each of the five search terms was calculated per million words within the annual output of each publication, in CQPweb. This is compared to a background distribution constituting the corresponding words per million for each search term over the total year range for each newspaper.
Within the annual distributions, for each search term and each newspaper, we determined the years lying in the top 1% (i.e. p<0.05 after application of a Bonferroni correction), following Gabrielatos et al. (2012). These are deemed to be years when that search term was in statistically significant usage above its background level for the newspaper in which it occurs. For years when search terms were significantly elevated, we also calculated collocates at range n. Collocates, in corpus linguistics, are other words found at statistically significant usage, over their own background levels, in a window from n positions to the left to n positions to the right of the search term. In other words, they are found in significant proximity to the search term. A default value of n=10 was used throughout, unless specified. Collocation analysis therefore assists in showing how a search term associates with other words within a corpus, providing information about the context in which that search term is used. CQPweb provides a log ratio method for the quantification of the strength of collocation.
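The sketch below illustrates, in Python, the per-million-words normalisation and the top-1%-of-the-annual-distribution criterion described above, together with a simplified log-ratio calculation; the input structure and the simplifications are assumptions, and CQPweb's own implementation differs in detail.

```python
# Sketch: words-per-million normalisation, flagging years in the top 1% of the
# annual distribution for a term/newspaper, and a simplified log-ratio measure.
import math

def per_million(yearly_counts):
    # yearly_counts: {year: (term_hits, total_words)} for one term in one newspaper (assumed)
    return {year: 1_000_000 * hits / total
            for year, (hits, total) in yearly_counts.items() if total}

def top_one_percent_years(yearly_counts, quantile=0.99):
    freqs = per_million(yearly_counts)
    values = sorted(freqs.values())
    cutoff = values[min(len(values) - 1, int(quantile * len(values)))]
    return sorted(year for year, f in freqs.items() if f >= cutoff)

def log_ratio(freq_near_term, freq_background):
    # binary log of a collocate's relative frequency near the search term
    # versus its background relative frequency (simplified form)
    return math.log2(freq_near_term / freq_background)
```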