The One Billion Word dataset is a benchmark dataset for language modeling. The training/held-out data was produced from the WMT 2011 News Crawl data using a combination of Bash shell and Perl scripts.
A text corpus with almost one billion words of training data for benchmarking statistical language models. The scale of approximately one billion words strikes a balance between keeping the benchmark relevant in a world of abundant data and keeping it easy for researchers to evaluate their modeling approaches. Monolingual English data was obtained from the WMT11 website and prepared using a variety of best practices for machine-learning dataset preparation.
The 1 Billion Word Language Model Benchmark is a dataset used for measuring progress in statistical language modeling, consisting of a large collection of text data.
The azhang42/one-billion-words-test dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
101 Billion Arabic Words Dataset
Updates
Maintenance Status: Actively maintained.
Update Frequency: Weekly updates to refine data quality and expand coverage.
Upcoming Version
Cleaner Version: A further-cleaned version of the dataset is in preparation; it adds a UUID column for better data traceability and management.
Dataset Details
The 101 Billion Arabic Words Dataset is curated by the Clusterlab team and consists of 101… See the full description on the dataset page: https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
In 1953, Michael West published a remarkable list of several thousand important vocabulary words known as the General Service List (GSL). Based on more than two decades of pre-computer corpus research, input from other famous early-20th-century researchers such as Harold Palmer, and several vocabulary conferences sponsored by the Carnegie Foundation in the 1930s, the GSL was designed to be more than simply a list of high-frequency words: its primary purpose was to combine objective and subjective criteria to produce a list of words that would be of “general service” to learners of English as a foreign language. However, as useful and helpful as this list has been over the decades, it has also been criticized for (1) being based on a corpus that is now considered quite dated, (2) being too small by modern standards (the initial work on the GSL was based on a 2.5-million-word corpus collected under a grant from the Rockefeller Foundation in 1938), and (3) not clearly defining what constitutes a “word”.
In March of 2013, on the 60th anniversary of West’s publication of the GSL, my colleagues (Dr. Brent Culligan & Joseph Phillips of Aoyama Gakuin Women’s Junior College) and I (Dr. Charles Browne, Meiji Gakuin University) announced the creation of a New General Service List (NGSL), one that is based on a carefully selected 273 million-word subsection of the 2 billion word Cambridge English Corpus (CEC).
New General Service List by Browne, C., Culligan, B., and Phillips, J. is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at http://www.charlie-browne.com.
NOTE: Development of the NGSL was made possible through approved access to the Cambridge English Corpus (CEC). The CEC is a multi-billion word computer database of contemporary spoken and written English. It includes British English, American English and other varieties of English. It also includes the Cambridge Learner Corpus, developed in collaboration with the University of Cambridge ESOL Examinations. Cambridge University Press has built up the CEC to provide evidence about language use that helps to produce better language teaching materials.
The Corpus of Contemporary American English (COCA) contains about 1 billion words in nearly 500,000 texts from 1990 to 2019 -- which are nearly evenly divided between spoken, fiction, magazines, newspapers, academic journals, blogs, other web pages, and TV/movie subtitles (120-130 million words in each genre). In addition, there are 20 million words from each year from 1990 to 2019 (with the same genre balance each year). From the COCA website: "The Corpus of Contemporary American English (COCA) is the only large and 'representative' corpus of American English. COCA is probably the most widely-used corpus of English, and it is related to many other corpora of English that we have created. These corpora were formerly known as the 'BYU Corpora', and they offer unparalleled insight into variation in English." (https://www.english-corpora.org/coca/)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency: how often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. More information on these files and the code used to generate them is available on Peter Norvig's website.
The code used to generate this dataset is distributed under the MIT License.
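As a small, hedged illustration of how a unigram-count dataset like this is typically consumed, the Python sketch below loads a word/count file and converts raw counts to relative frequencies. The tab-separated two-column layout and the file name count_1w.txt are assumptions for illustration, not a description of the actual distribution.

```python
# Minimal sketch: load a unigram count file (assumed layout: word<TAB>count per line)
# and convert raw counts into relative frequencies.

def load_unigram_counts(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            counts[word] = int(count)
    return counts

if __name__ == "__main__":
    counts = load_unigram_counts("count_1w.txt")  # hypothetical file name
    total = sum(counts.values())
    for word in ("the", "of", "and"):
        if word in counts:
            print(f"{word}: {counts[word] / total:.4%} of all tokens")
```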
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The 30 high frequency words used in Experiment 2 to generate the experimental stimuli alongside their SUBTLEX frequency per million words.
Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
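Since the article tag names are not given in the description above, the following Python sketch relies only on the documented "ENDOFARTICLE." end-of-article marker and simply strips XML markup; it is a rough illustration of how one of the 57 text files might be split into articles, and the file name is hypothetical.

```python
# Rough sketch: split one corpus file into article texts using the documented
# "ENDOFARTICLE." marker; XML tags are stripped with a regex because the exact
# tag names are not specified in the dataset description.
import re

def iter_articles(path):
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    for chunk in raw.split("ENDOFARTICLE."):
        text = re.sub(r"<[^>]+>", " ", chunk)  # drop surrounding XML markup
        text = " ".join(text.split())          # normalise whitespace
        if text:
            yield text

if __name__ == "__main__":
    for i, article in enumerate(iter_articles("spanish_corpus_part_01.xml")):  # hypothetical name
        print(i, article[:80])
        if i >= 2:
            break
```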
Dataset origin: https://www.clarin.si/repository/xmlui/handle/11356/1912
Description
ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and… See the full description on the dataset page: https://huggingface.co/datasets/FrancophonIA/ParlaMint_4.1.
The Corpus of Contemporary American English (COCA) contains more than one billion words of text (20 million words each year from 1990 to 2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, television and movie subtitles, blogs, and other web pages.
For an overview of COCA and its affordances, please see the supporting files.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
Hungarian is a Finno-Ugric language, a member of the Uralic language family. The Hungarian name for the language is Magyar.
The Finno-Ugric languages also include Finnish, Estonian, Lappic (Sámi) and some other languages spoken in Russia; Khanty and Mansi are the most closely related to Hungarian.
Although Hungarian, unlike most other European languages, is not an Indo-European language, its vocabulary contains many words from Slavic and Turkic languages and also from German.
This dataset contains so many words because Hungarian is an agglutinative language: words are built from multiple morphemes that determine their meaning, and all of these morphemes (including stems and affixes) remain essentially unchanged when they are joined.
Agglutinative languages tend to have a high rate of affixes or morphemes per word and to be very regular, with very few irregular verbs.
This dataset was inspired by the work of Rachel Tatman.
The Language Model Council (LMC) is a novel benchmarking framework proposed to address the challenge of ranking Large Language Models (LLMs) on highly subjective tasks [1]. These tasks can include areas such as emotional intelligence, creative writing, or persuasiveness, which often lack majoritarian human agreement [1].
The LMC operates through a democratic process [1]: 1. Formulate a test set through equal participation, 2. Administer the test among council members, and 3. Evaluate responses as a collective jury [1].
The council is composed of the newest LLMs, which are deployed on an open-ended emotional intelligence task: responding to interpersonal dilemmas [1]. The LMC has been shown to produce rankings that are more separable, more robust, and less biased than those from any individual LLM judge [1]. It is also more consistent with a human-established leaderboard than other benchmarks are [1].
This benchmarking framework is designed to encourage the healthy development of the field, particularly through the lens of mathematical reasoning tasks [2]. It provides a more comprehensive and fair comparison of different LLMs, especially for tasks that are highly subjective and often lack majoritarian human agreement [1].
(1) Language Model Council: Benchmarking Foundation Models on Highly .... https://arxiv.org/abs/2406.08598. (2) GitHub - GAIR-NLP/benbench: Benchmarking Benchmark Leakage in Large .... https://github.com/GAIR-NLP/benbench. (3) LLM Benchmarks: Understanding Language Model Performance. https://humanloop.com/blog/llm-benchmarks. (4) One Billion Word Benchmark for Measuring Progress in Statistical .... https://research.google.com/pubs/pub41880.html.
Which country has the most Facebook users? There are more than 383 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context, if India's Facebook audience were a country, it would be ranked third in terms of largest population worldwide. Apart from India, several other markets have more than 100 million Facebook users each: the United States, Indonesia, and Brazil, with 196.9 million, 122.3 million, and 111.65 million Facebook users respectively.
Facebook – the most used social media. Meta, the company previously called Facebook, owns four of the most popular social media platforms worldwide: WhatsApp, Facebook Messenger, Facebook, and Instagram. As of the third quarter of 2021, there were around 3.5 billion cumulative monthly users of the company's products worldwide. With around 2.9 billion monthly active users, Facebook is the most popular social media platform worldwide. With an audience of this scale, it is no surprise that the vast majority of Facebook's revenue is generated through advertising.
Facebook usage by device. As of July 2021, 98.5 percent of active users accessed their Facebook account from mobile devices; in fact, around 81.8 percent of Facebook audiences worldwide access the platform only via mobile phone. Facebook is not only available through mobile browsers, as the company has published several mobile apps for users to access its products and services. As of the third quarter of 2021, the four core Meta products led the ranking of the most downloaded mobile apps worldwide, with WhatsApp amassing approximately six billion downloads.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data set contains frequency counts of target words in 175 million academic abstracts published in all fields of knowledge. We quantify the prevalence of words denoting prejudice against ethnicity, gender, sexual orientation, gender identity, minority religious sentiment, age, body weight and disability in SSORC abstracts over the period 1970-2020. We then examine the relationship between the prevalence of such terms in the academic literature and their concomitant prevalence in news media content. We also analyze the temporal dynamics of an additional set of terms associated with social justice discourse in both the scholarly literature and in news media content. A few additional words not denoting prejudice are also available since they are used in the manuscript for illustration purposes.
The list of academic abstracts analyzed in this work was taken from the Semantic Scholar Open Research Corpus (SSORC). The corpus contains, as of 2020, over 175 million academic abstracts, and associated metadata, published in all fields of knowledge. The raw data is provided by Semantic Scholar in accessible JSON format.
Textual content included in our analysis is circumscribed to the scholarly articles’ titles and abstracts and does not include other article elements such as main body of text or references section. Thus, we use frequency counts derived from academic articles’ titles and abstracts as a proxy for word prevalence in those articles. This proxy was used because the SSORC corpus does not provide the entire text body of the indexed articles. Targeted textual content was located in JSON data and sorted by year to facilitate chronological analysis. Tokens were lowercased prior to estimating frequency counts.
Yearly relative frequencies of a target word or n-gram in the SSORC corpus were estimated by dividing the number of occurrences of the target word/n-gram in all scholarly articles within a given year by the total number of all words in all articles of that year. This method of estimating word frequencies accounts for variable volume of total scientific output over time. This approach has been shown before to accurately capture the temporal dynamics of historical events and social trends in news media corpora.
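The estimate described above is a simple per-year ratio. The Python sketch below illustrates it under the assumption that the abstracts are available as (year, text) pairs; that input structure, and whitespace tokenisation, are simplifications rather than the authors' actual pipeline.

```python
# Sketch of the yearly relative-frequency estimate: occurrences of a target word
# in all abstracts of a year divided by the total number of words in that year.
from collections import defaultdict

def yearly_relative_frequency(records, target):
    target = target.lower()
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        tokens = text.lower().split()      # tokens are lowercased before counting
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in totals if totals[year]}

if __name__ == "__main__":
    sample = [(2019, "Prejudice in media discourse"),
              (2019, "A study of word frequency"),
              (2020, "Prejudice and social justice discourse")]
    print(yearly_relative_frequency(sample, "prejudice"))
```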
It is possible that a small percentage of scholarly articles in the SSORC corpus contain incorrect or missing data. For earlier years in the SSORC corpus, abstract information is sometimes missing and only the article's title is available. As a result, the total and target word count metrics for a small subset of academic abstracts might not be precise. In a data analysis of 175 million scientific abstracts, manually checking the accuracy of frequency counts for every single academic abstract is unfeasible, and one hundred percent accuracy at capturing abstracts' content might be elusive due to a small number of erroneous outlier cases in the raw data. Overall, however, we are confident that our frequency metrics are representative of word prevalence in academic content, as illustrated by Figure 2 in the main manuscript, which shows the chronological prevalence in the SSORC corpus of several terms associated with different disciplines of scientific/academic knowledge.
Factor analysis of the frequency count time series was carried out only after Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test confirmed the suitability of the data for factor analysis. A single factor derived from the frequency count time series of prejudice-denoting terms was extracted from each corpus (academic abstracts and news media content). The same procedure was applied to the terms denoting social justice discourse. A factor loading cutoff of 0.5 was used to ascribe terms to a factor. Cronbach's alphas computed to determine whether the resulting factors were coherent were extremely high (>0.95).
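A hedged sketch of this procedure is shown below using the third-party factor_analyzer package; the package choice, the DataFrame layout (one column per term, one row per year) and the helper names are assumptions for illustration, not the authors' actual code.

```python
# Sketch of the suitability checks, single-factor extraction, loading cutoff and
# reliability check described above. `df`: hypothetical DataFrame with one column
# per prejudice-denoting term and one row per year of frequency data.
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the summed score)
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

def single_factor(df: pd.DataFrame, loading_cutoff: float = 0.5) -> dict:
    _, bartlett_p = calculate_bartlett_sphericity(df)  # suitability check 1
    _, kmo_overall = calculate_kmo(df)                 # suitability check 2
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(df)
    loadings = pd.Series(fa.loadings_[:, 0], index=df.columns)
    kept = loadings[loadings.abs() >= loading_cutoff].index.tolist()
    return {"bartlett_p": bartlett_p, "kmo": kmo_overall,
            "loadings": loadings, "terms_on_factor": kept,
            "cronbach_alpha": cronbach_alpha(df[kept])}
```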
The textual content of news and opinion articles from the outlets listed in Figure 5 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used word frequency counts derived from the original sources. Textual content included in our analysis is circumscribed to article headlines and the main body of text and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a given year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there are at least 1 million words of article content from that outlet.
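A hedged Python sketch of this extraction step is shown below; the lxml library, the example XPath expression, and the (outlet, year, html) input structure are illustrative assumptions rather than the expressions actually used for any specific outlet.

```python
# Sketch: locate article text in raw HTML with an outlet-specific XPath expression,
# lowercase tokens, and keep only outlet-years with at least one million words.
from collections import defaultdict
from lxml import html

ARTICLE_XPATH = "//div[@class='article-body']//p//text()"  # hypothetical example

def article_tokens(raw_html: str):
    tree = html.fromstring(raw_html)
    return " ".join(tree.xpath(ARTICLE_XPATH)).lower().split()

def outlet_year_word_totals(pages):
    # pages: iterable of (outlet, year, raw_html) tuples (assumed structure)
    totals = defaultdict(int)
    for outlet, year, raw_html in pages:
        totals[(outlet, year)] += len(article_tokens(raw_html))
    # discard outlet-years with sparse content (< 1 million words)
    return {key: n for key, n in totals.items() if n >= 1_000_000}
```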
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
The compressed files in this data set are listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript and raw data metrics
-scholarlyArticlesContainingTargetWords.rar contains the IDs of each analyzed abstract in the SSORC corpus and the counts of target words and total words for each scholarly article
-targetWordsInMediaArticlesCounts.rar contains counts of target words in news outlets articles as well as total counts of words in articles
In a small percentage of news articles, outlet specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles might not be precise.
In a data analysis of millions of news articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles' content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Rozado, Al-Gharbi, and Halberstadt, "Prevalence of Prejudice-Denoting Words in News Media Discourse" for supporting evidence).
31/08/2022 Update: There is a new way to download the Semantic Scholar Open Research Corpus (see https://github.com/allenai/s2orc). This updated version states that the corpus contains 136M+ paper nodes. However, when I downloaded a previous version of the corpus in 2021 from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I counted 175M unique identifiers. The URL of the previous version of the corpus is no longer active, but it has been cached by the Internet Archive at https://web.archive.org/web/20201030131959/http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/. I haven't had the time to look at the specific reason for the mismatch, but perhaps the newer version of the corpus has cleaned out many noisy entries from the previous version, which often contained entries with missing abstracts. Filtering out entries in low-prevalence languages other than English might be another reason. In any case, Figure 2 of the main manuscript of this work (at https://www.nas.org/academic-questions/35/2/themes-in-academic-literature-prejudice-and-social-justice) should provide support for the validity of the frequency counts.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated.
The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all of the information from the source TEI.
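For illustration, the Python sketch below reads a file in the vertical format under the common concordancer convention of one token per line with tab-separated positional attributes (here assumed to be token, MSD and lemma) and structural tags such as <s>…</s> on lines of their own; the actual column order in the jos1M distribution may differ.

```python
# Rough reader for a "vertical" corpus file, assuming token<TAB>MSD<TAB>lemma lines
# and structural tags (<s>, </s>, <p>, ...) on their own lines.
def read_vertical(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("<"):                 # structural markup
                if line.startswith("</s") and current:
                    sentences.append(current)
                    current = []
                continue
            fields = line.split("\t")
            if len(fields) >= 3:
                token, msd, lemma = fields[:3]       # assumed column order
                current.append((token, msd, lemma))
    if current:
        sentences.append(current)
    return sentences
```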
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
COVID-19 is the first known coronavirus pandemic. Nevertheless, the seasonal circulation of the four milder coronaviruses of humans – OC43, NL63, 229E and HKU1 – raises the possibility that these viruses are the descendants of more ancient coronavirus pandemics. This proposal arises by analogy to the observed descent of seasonal influenza subtypes H2N2 (now extinct), H3N2 and H1N1 from the pandemic strains of 1957, 1968 and 2009, respectively. Recent historical revisionist speculation has focussed on the influenza pandemic of 1889-1892, based on molecular phylogenetic reconstructions that show the emergence of human coronavirus OC43 around that time, probably by zoonosis from cattle. If the “Russian influenza”, as The Times named it in early 1890, was not influenza but was instead caused by a coronavirus, the origins of the other three milder human coronaviruses may also have left a residue of clinical evidence in the 19th century medical literature and popular press. In this paper, we search digitised 19th century British newspapers for evidence of previously unsuspected coronavirus pandemics. We conclude that there is little or no corpus linguistic signal in the UK national press for large-scale outbreaks of unidentified respiratory disease for the period 1785 to 1890.
Methods
The data file is a spreadsheet used to record queries made via CQPweb (https://cqpweb.lancs.ac.uk).
Search Terms
For clarity, in the ensuing descriptions, we use bold font for search terms and italic font for collocates and other quotations. Based on clinical descriptions of COVID-19 (reviewed by Cevik et al., 2020), we identified the following search terms: 1) “cough”, 2) “fever”, 3) “pneumonia”. To avoid confusion with years when influenza pandemics may have occurred, we added 4) “influenza” and 5) “epidemic”. Any combination of terms 1 to 3 co-occurring with term 4 alone, or with terms 4 and 5 together, would be indicative of a respiratory outbreak caused by, or at least attributed to, influenza. By contrast, any combination of terms 1 to 3 co-occurring with term 5 alone, or without either of terms 4 and 5, would suggest a respiratory disease that was not confidently identified as influenza at the time. Such an outbreak would provide a candidate coronavirus epidemic for further investigation.
Newspapers
Newspapers and years searched were as follows: Belfast Newsletter (1828-1900), The Era (1838-1900), Glasgow Herald (1820-1900), Hampshire & Portsmouth Telegraph (1799-1900), Ipswich Journal (1800-1900), Liverpool Mercury (1811-1900), Northern Echo (1870-1900), Pall Mall Gazette (1865-1900), Reynold’s Daily (1850-1900), Western Mail (1869-1900) and The Times (1785-2009). The search in The Times was extended to 2009 in order to provide a comparison with the 20th century. Searches were performed using Lancaster University’s instance of the CQPweb (Corpus Query Processor) corpus analysis software (https://cqpweb.lancs.ac.uk/; Hardie, 2012). CQPweb’s database is populated from the newspapers listed using optical character recognition (OCR), so for older publications in particular, some errors may be present (McEnery et al., 2019).
Statistics
The occurrence of each of the five search terms was calculated per million words within the annual output of each publication, in CQPweb. This is compared to a background distribution constituting the corresponding words per million for each search term over the total year range for each newspaper.
Within the annual distributions, for each search term and each newspaper, we determined the years lying in the top 1% (i.e. p<0.05 after application of a Bonferroni correction), following Gabrielatos et al. (2012). These are deemed to be years when that search term was in statistically significant usage above its background level for the newspaper in which it occurs. For years when search terms were significantly elevated, we also calculated collocates at range n. Collocates, in corpus linguistics, are other words found at statistically significant usage, over their own background levels, in a window from n positions to the left to n positions to the right of the search term. In other words, they are found in significant proximity to the search term. A default value of n=10 was used throughout, unless specified. Collocation analysis therefore assists in showing how a search term associates with other words within a corpus, providing information about the context in which that search term is used. CQPweb provides a log ratio method for the quantification of the strength of collocation.
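The sketch below illustrates, in Python, the per-million-words normalisation and the top-1%-of-the-annual-distribution criterion described above, together with a simplified log-ratio calculation; the input structure and the simplifications are assumptions, and CQPweb's own implementation differs in detail.

```python
# Sketch: words-per-million normalisation, flagging years in the top 1% of the
# annual distribution for a term/newspaper, and a simplified log-ratio measure.
import math

def per_million(yearly_counts):
    # yearly_counts: {year: (term_hits, total_words)} for one term in one newspaper (assumed)
    return {year: 1_000_000 * hits / total
            for year, (hits, total) in yearly_counts.items() if total}

def top_one_percent_years(yearly_counts, quantile=0.99):
    freqs = per_million(yearly_counts)
    values = sorted(freqs.values())
    cutoff = values[min(len(values) - 1, int(quantile * len(values)))]
    return sorted(year for year, f in freqs.items() if f >= cutoff)

def log_ratio(freq_near_term, freq_background):
    # binary log of a collocate's relative frequency near the search term
    # versus its background relative frequency (simplified form)
    return math.log2(freq_near_term / freq_background)
```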