69 datasets found
  1. English-Valid-Words

    • huggingface.co
    Updated Sep 7, 2024
    Cite
    Maxim Belikov (2024). English-Valid-Words [Dataset]. https://huggingface.co/datasets/Maximax67/English-Valid-Words
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 7, 2024
    Authors
    Maxim Belikov
    License

    The Unlicense (https://choosealicense.com/licenses/unlicense/)

    Description

    English Valid Words

    This repository contains CSV files with valid English words along with their frequency, stem, and stem valid probability. Dataset Github link: https://github.com/Maximax67/English-Valid-Words

      Files included
    

    valid_words_sorted_alphabetically.csv:

    • N: Counter for each word entry.
    • Word: The English word itself.
    • Frequency count: The number of occurrences of the word in the 1-grams dataset.
    • Stem: The stem of the word.
    • Stem valid probability: Probability… See the full description on the dataset page: https://huggingface.co/datasets/Maximax67/English-Valid-Words.
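As a sketch, a file with this column layout can be parsed with Python's csv module. The header spellings and sample rows below are assumptions based on the description, not taken from the actual CSV:

```python
import csv
import io

# Hypothetical sample rows in the column layout listed above; the exact
# header spellings are assumptions, not taken from the real file.
sample = io.StringIO(
    "N,Word,Frequency count,Stem,Stem valid probability\n"
    "1,running,123456,run,0.98\n"
    "2,zebra,7890,zebra,1.0\n"
)

# Index each row by the word itself, converting numeric fields.
words = {}
for row in csv.DictReader(sample):
    words[row["Word"]] = {
        "frequency": int(row["Frequency count"]),
        "stem": row["Stem"],
        "stem_valid_probability": float(row["Stem valid probability"]),
    }

print(words["running"])
```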

  2. Word frequencies from Project Gutenberg English texts

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Isabel Moreno-Sanchez; Francesc Font-Clos; Álvaro Corral (2023). Word frequencies from Project Gutenberg English texts [Dataset]. http://doi.org/10.6084/m9.figshare.1515919.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Isabel Moreno-Sanchez; Francesc Font-Clos; Álvaro Corral
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A compressed folder containing 31075 numerical vectors. Each one represents the word frequencies of an e-book from Project Gutenberg written in English. Vectors are named with the ID number of their corresponding text in Project Gutenberg.

  3. English Subtitle Word Frequency

    • kaggle.com
    Updated Aug 13, 2020
    Cite
    Luke Vanhaezebrouck (2020). English Subtitle Word Frequency [Dataset]. https://www.kaggle.com/lukevanhaezebrouck/subtlex-word-frequency/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luke Vanhaezebrouck
    Description

    Word Frequency based on American English subtitles (SUBTLEX)

    Word frequency is an important variable in cognitive processing. High-frequency words are perceived and produced faster and more efficiently than low-frequency words. At the same time, they are easier to recall but more difficult to recognize in episodic memory tasks.

    Brysbaert & New compiled a new frequency measure on the basis of American subtitles (51 million words in total). There are two measures:

    • The frequency per million words, called SUBTLEX (Subtitle frequency: word form frequency)
    • The percentage of films in which a word occurs, called SUBTLEX (Subtitle frequency: contextual diversity; see Adelman, Brown, & Quesada (2006) for the qualities of this measure).

    Columns

    1. Word. This starts with a capital when the word more often starts with an uppercase letter than with a lowercase letter.
    2. FREQcount. This is the number of times the word appears in the corpus (i.e., on the total of 51 million words).
    3. CDcount. This is the number of films in which the word appears (i.e., it has a maximum value of 8,388).
    4. FREQlow. This is the number of times the word appears in the corpus starting with a lowercase letter. This allows users to further match their stimuli.
    5. CDlow. This is the number of films in which the word appears starting with a lowercase letter.
    6. SUBTLWF. This is the word frequency per million words. It is the measure you would preferably use in your manuscripts, because it is a standard measure of word frequency independent of the corpus size. It is given with two-digit precision, in order not to lose precision of the frequency counts.
    7. Lg10WF. This value is based on log10(FREQcount+1) and has four-digit precision.
    8. SUBTLCD. This indicates in what percentage of the films the word appears. This value has two-digit precision in order not to lose information.
    9. Lg10CD. This value is based on log10(CDcount+1) and has four-digit precision. It is the best value to use if you want to match words on word frequency.
    10. Dom_PoS_SUBTLEX. The dominant (most frequent) Part of Speech of each entry
    11. Freq_dom_PoS_SUBTLEX. The frequency of the dominant Part of Speech
    12. Percentage_dom_PoS. The relative frequency of the dominant Part of Speech
    13. All_PoS_SUBTLEX. All Parts of Speech observed for the entry
    14. All_freqs_SUBTLEX. The frequencies of each Part of Speech
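As a sketch, the derived columns (6)–(9) can be recomputed from the raw counts in (2) and (3), using the corpus totals stated in the description (51 million words, 8,388 films). The sample counts below are hypothetical:

```python
import math

TOTAL_WORDS = 51_000_000  # corpus size stated in the description
TOTAL_FILMS = 8_388       # maximum CDcount stated in the description

def derived_measures(freq_count, cd_count):
    """Recompute the derived SUBTLEX columns from the raw counts."""
    return {
        # SUBTLWF: word frequency per million words (two-digit precision)
        "SUBTLWF": round(freq_count / (TOTAL_WORDS / 1_000_000), 2),
        # Lg10WF: log10(FREQcount + 1), four-digit precision
        "Lg10WF": round(math.log10(freq_count + 1), 4),
        # SUBTLCD: percentage of films the word appears in
        "SUBTLCD": round(100 * cd_count / TOTAL_FILMS, 2),
        # Lg10CD: log10(CDcount + 1), four-digit precision
        "Lg10CD": round(math.log10(cd_count + 1), 4),
    }

# Hypothetical word with 51,000 occurrences appearing in 4,194 films.
print(derived_measures(51_000, 4_194))
```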

    Sorted Dataset

    • Only includes words with a FREQcount greater than 1.
    • Is sorted based on the CDcount then alphabetically.

    Source

    This data set is taken from the Ghent University "SUBTLEX-US American Word Frequency" list compiled by Brysbaert & New. See the SUBTLEX-US website and the full analysis paper by Brysbaert & New for details.

  4. English Word Frequency

    • kaggle.com
    Updated Sep 6, 2017
    Cite
    Rachael Tatman (2017). English Word Frequency [Dataset]. https://www.kaggle.com/datasets/rtatman/english-word-frequency/discussion?sortBy=hot
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 6, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rachael Tatman
    Description

    Context:

    How frequently a word occurs in a language is an important piece of information for natural language processing and linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency. How often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.

    Content:

    This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.

    Acknowledgements:

    Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.

    The code used to generate this dataset is distributed under the MIT License.

    Inspiration:

    • Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like Japanese?
    • What differences are there between the very frequent words in this dataset, and the frequent words in other corpora, such as the Brown Corpus or the TIMIT corpus? What might these differences tell us about how language is used?
  5. Multi-LEX: a database of multi-word frequencies (English files)

    • data.niaid.nih.gov
    Updated Oct 17, 2022
    + more versions
    Cite
    Marjorie Armando (2022). Multi-LEX: a database of multi-word frequencies (English files) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7214222
    Explore at:
    Dataset updated
    Oct 17, 2022
    Dataset provided by
    Jonathan Grainger
    Marjorie Armando
    Stephane Dufau
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Written word frequency is a key variable used in many psycholinguistic studies and is central in explaining visual word recognition. Indeed, methodological advances on single word frequency estimates have helped to uncover novel language-related cognitive processes, fostering new ideas and studies. In an attempt to support and promote research on a related emerging topic, visual multi-word recognition, we extracted from the exhaustive Google Ngram datasets a selection of millions of multi-word sequences and computed their associated frequency estimate. Such sequences are presented with Part-of-Speech information for each individual word. An online behavioral investigation making use of the French 4-gram lexicon in a grammatical decision task was carried out. The results show an item-level frequency effect of word sequences. Moreover, the proposed datasets were found useful during the stimulus selection phase, allowing more precise control of the multi-word characteristics.

  6. Kuçera Francis 1967 English Words Frequency Estimates

    • figshare.com
    txt
    Updated May 31, 2023
    + more versions
    Cite
    Henry Kuçera; W. Nelson Francis (2023). Kuçera Francis 1967 English Words Frequency Estimates [Dataset]. http://doi.org/10.6084/m9.figshare.7471193.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Henry Kuçera; W. Nelson Francis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Frequency count of the Brown corpus of present-day American English. Available with prior consent of depositor for research purposes only.

  7. Word Frequency Database

    • wordunscrambler.net
    Updated Jul 6, 2025
    Cite
    (2025). Word Frequency Database [Dataset]. https://www.wordunscrambler.net/random-word-generator/
    Explore at:
    Dataset updated
    Jul 6, 2025
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Database of English words categorized by usage frequency

  8. The database of English words and their Croatian equivalents

    • figshare.com
    xlsx
    Updated May 30, 2023
    Cite
    Irena Bogunović; Jasmina Jelčić Čolakovac; Mirjana Borucinsky (2023). The database of English words and their Croatian equivalents [Dataset]. http://doi.org/10.6084/m9.figshare.20014712.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Irena Bogunović; Jasmina Jelčić Čolakovac; Mirjana Borucinsky
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The current database contains English words which appear in Croatian in their original, unadapted form (e.g. show, boxer, zombie, skin, etc.). The list of words is based on The Database of English Words in Croatian (Bogunović & Kučić 2022; https://repository.pfri.uniri.hr/islandora/object/pfri:2495), and was further complemented with words obtained from the corpus hrWaC (Ljubešić & Erjavec 2011; Ljubešić & Klubička 2014) using the platform SketchEngine (Kilgarriff et al. 2004). The same platform was used to check the list of English words against the corpora ENGRI (Bogunović et al. 2021; Bogunović & Kučić 2021) and hrWaC by consulting concordances and using CQL. The tagger Xf was used to filter out all English sentences embedded in Croatian texts. Corpus results were then manually checked using the random sample and filter tools to remove, e.g., proper nouns, false cognates, false pairs, etc. The database also lists Croatian equivalents (and corresponding frequencies in the corpora) for each English word, if they exist in Croatian. The choice of the Croatian equivalent depended greatly on the available corpus data on word frequency as well as Croatian online dictionaries. Furthermore, single-word and multi-word English expressions are represented separately in the database for reasons of visual transparency and simplification of word search. We would like to stress that the database by no means represents a final product and is not a definitive representation of data on English words in Croatian; it is, however, representative of their current status in the Croatian language. Further efforts will be made to update the database and incorporate new data.

  9. Data from: Dataset for classifying English words into difficulty levels by undergraduate and postgraduate students

    • data.mendeley.com
    Updated Oct 24, 2023
    + more versions
    Cite
    Nisar Kangoo (2023). Dataset for classifying English words into difficulty levels by undergraduate and postgraduate students [Dataset]. http://doi.org/10.17632/p2wrs7hm4z.4
    Explore at:
    Dataset updated
    Oct 24, 2023
    Authors
    Nisar Kangoo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains English words in column B. For each word, the other columns give its frequency (fre), length (len), part of speech (PS), the number of undergraduate students who marked it difficult (difficult_ug), and the number of postgraduate students who marked it difficult (difficult_pg). The dataset has a total of 5368 unique words. The words marked as difficult by undergraduate students number 680, and those marked as difficult by postgraduate students number 151; all the remaining words, viz. 4537, are easy and hence were not marked as difficult by either undergraduate or postgraduate students. A hyphen (-) in the difficult_ug column means that the word was not present in the text circulated to undergraduate students; likewise, a hyphen (-) in the difficult_pg column means the word was not present in the text circulated to postgraduate students. The data was collected from students of Jammu and Kashmir (a Union Territory of India), latitude and longitude (32.2778° N, 75.3412° E). The attached files are as follows: the dataset_english CSV file is the original dataset containing each English word, its length, frequency, part of speech, and the number of undergraduate and postgraduate students who marked it as difficult.
    The dataset_numerical CSV file contains the original dataset with the string fields transformed into numerical values. The English language difficulty level measurement -Questionnaire (1-6) & PG1,PG2,PG3,PG4 .docx files contain the questionnaires supplied to students of college and university to underline difficult words in the English text. The IGNOU English.zip file contains the Indira Gandhi National Open University (IGNOU) English textbooks for graduation and postgraduation students. The texts for the above questionnaires were taken from these IGNOU English textbooks.

  10. Tamil-English Code-Switching Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Tamil-English Code-Switching Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/04384d71-a1ca-42de-8d03-927122b6cf1e
    Explore at:
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    This dataset offers a rich collection of over 600,000 unique Tanglish words and their cleaned forms. These words were extracted from a large body of more than 650,000 comments and transcripts gathered from 1,260 videos. It serves as a valuable resource for Natural Language Processing (NLP) tasks, particularly those involving Tamil-English mixed text, often referred to as "Tanglish." Key features include a substantial lexicon, preprocessed and cleaned text to ensure high-quality inputs for machine learning, and specific focus on Tamil-English text, making it useful for multilingual and low-resource NLP research. It is applicable to tasks such as text classification, sentiment analysis, and transliteration.

    Columns

    • word: Represents a unique Tanglish or Tamil term.
    • count: Indicates the frequency of the specific word within the source corpus.

    Distribution

    The dataset is typically provided in a CSV format. It comprises over 600,000 unique Tanglish words, derived from over 650,000 comments and transcripts. While the exact number of rows in the full dataset is not specified, it represents a substantial collection of word-frequency pairs. The sample provided shows a structure of word and its corresponding count. The dataset was listed on 08/06/2025.

    Usage

    This dataset is ideal for various applications and use cases, including:

    • Building and refining language models tailored for Tanglish.
    • Creating datasets for machine translation and transliteration projects.
    • Advancing linguistic studies focused on code-switching and low-resource languages.
    • General NLP tasks such as text classification, sentiment analysis, and transliteration.

    Coverage

    The dataset's regional coverage is global. Its linguistic scope is focused on Tamil-English mixed text, specifically "Tanglish." The data originates from comments and transcripts collected from 1,260 videos. Specific notes on data availability for certain groups or years are not detailed beyond the general collection from video comments.

    License

    CC0

    Who Can Use It

    This dataset is particularly useful for:

    • The Natural Language Processing (NLP) community.
    • Researchers and developers working on regional and multilingual languages.
    • Individuals or teams focused on building and fine-tuning language models for Tanglish.
    • Those developing solutions for machine translation and transliteration tasks involving Tamil-English content.
    • Linguists interested in code-switching phenomena and low-resource language studies.

    Dataset Name Suggestions

    • Tamil-Tanglish Word Frequency Lexicon
    • YouTube Comments Tanglish Word Counts
    • Tanglish NLP Lexicon
    • Multilingual Social Media Word List
    • Tamil-English Code-Switching Dataset

    Attributes

    Original Data Source: Tamil and Tanglish Transliterated Words Dataset

  11. National Institute of the Korean Language Corpus

    • kaggle.com
    Updated Oct 6, 2017
    Cite
    Rachael Tatman (2017). National Institute of the Korean Language Corpus [Dataset]. https://www.kaggle.com/rtatman/national-institute-of-the-korean-language-corpus/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rachael Tatman
    Description

    Context:

    How frequently a word occurs in a language is an important piece of information for natural language processing and linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing.

    This dataset contains frequency information on Korean, which is spoken by 80 million people. For each item, both the frequency (number of times it occurs in the corpus) and its relative rank to other lemmas is provided.

    Content:

    This dataset contains six sub-files with frequency information. The files have been renamed in English, but no changes have been made to the file contents. The files and their headers are listed below. The text in this dataset is UTF-8.

    • Frequency by Jamo (letter)
      • 순위: Rank
      • 빈도: Frequency
      • 위치: Location
      • 자모: Jamo (Hangul letter)
    • Frequency
      • 순위: Rank
      • 빈도: Frequency
      • 항목: Item
      • 범주: Category
    • Frequency by Syllable
      • 순위: Rank
      • 빈도: Frequency
      • 음절: Syllable
    • Borrowings
      • 순위: Rank
      • 빈도: Frequency
      • 항목: Item
      • 풀이: Root
    • Non Standard Words
      • 순위: Rank
      • 빈도: Frequency
      • 어휘: Vocabulary
      • 풀이: Notes
      • 품사: Part of Speech
    • Frequency (Longer version)
      • 순위: Rank
      • 빈도: Frequency
      • 항목: Item
      • 범주: Category
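As a sketch, the Korean header glossary above can be captured as a lookup table for renaming columns after loading the UTF-8 files. Treating 항목 uniformly as "item" and 풀이 as "notes" is an assumption, since the listing glosses them differently per file:

```python
# Lookup table built from the header glossary above; glosses that vary
# between files (항목, 풀이) are collapsed to a single English name here.
HEADER_TRANSLATIONS = {
    "순위": "rank",
    "빈도": "frequency",
    "위치": "location",
    "자모": "jamo",
    "음절": "syllable",
    "항목": "item",
    "범주": "category",
    "어휘": "vocabulary",
    "풀이": "notes",
    "품사": "part_of_speech",
}

def translate_headers(headers):
    """Map UTF-8 Korean column names to English, leaving unknowns as-is."""
    return [HEADER_TRANSLATIONS.get(h, h) for h in headers]

print(translate_headers(["순위", "빈도", "음절"]))
```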

    Acknowledgements:

    This dataset was collected and made available by the National Institute of Korean Language. The dataset and additional documentation (in Korean) can be found here.

    This dataset is distributed under a Korean Open Government License, type 4. It may be redistributed with attribution, without derivatives, and not for commercial purposes.

    Inspiration:

    • What are the most frequent jamo (Hangul characters) in Korean? Least frequent?
    • What qualities do borrowed words have?
    • Is there a relationship between word length and frequency?


  12. SUBTLEX-ESP

    • osf.io
    Updated Apr 19, 2025
    Cite
    Marc Brysbaert (2025). SUBTLEX-ESP [Dataset]. https://osf.io/xp6sz
    Explore at:
    Dataset updated
    Apr 19, 2025
    Dataset provided by
    Center For Open Science
    Authors
    Marc Brysbaert
    Description

    No description was included in this dataset collected from the OSF.

  13. Table_1_A comparative analysis of the COVID-19 Infodemic in English and Chinese: insights from social media textual data.docx

    • figshare.com
    • frontiersin.figshare.com
    docx
    Updated Nov 15, 2023
    Cite
    Jia Luo; Daiyun Peng; Lei Shi; Didier El Baz; Xinran Liu (2023). Table_1_A comparative analysis of the COVID-19 Infodemic in English and Chinese: insights from social media textual data.docx [Dataset]. http://doi.org/10.3389/fpubh.2023.1281259.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Frontiers
    Authors
    Jia Luo; Daiyun Peng; Lei Shi; Didier El Baz; Xinran Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The COVID-19 infodemic, characterized by the rapid spread of misinformation and unverified claims related to the pandemic, presents a significant challenge. This paper presents a comparative analysis of the COVID-19 infodemic in the English and Chinese languages, utilizing textual data extracted from social media platforms. To ensure a balanced representation, two infodemic datasets were created by augmenting previously collected social media textual data. Through word frequency analysis, the 30 most frequently occurring infodemic words are identified, shedding light on prevalent discussions surrounding the infodemic. Moreover, topic clustering analysis uncovers thematic structures and provides a deeper understanding of primary topics within each language context. Additionally, sentiment analysis enables comprehension of the emotional tone associated with COVID-19 information on social media platforms in English and Chinese. This research contributes to a better understanding of the COVID-19 infodemic phenomenon and can guide the development of strategies to combat misinformation during public health crises across different languages.

  14. CELEX Dataset

    • paperswithcode.com
    Updated Feb 20, 2021
    Cite
    Hr Baayen (2021). CELEX Dataset [Dataset]. https://paperswithcode.com/dataset/celex
    Explore at:
    Dataset updated
    Feb 20, 2021
    Authors
    Hr Baayen
    Description

    The CELEX database comprises three searchable lexical databases: Dutch, English, and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class), and word frequency.

  15. The frequency of number words in English and Spanish.

    • figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Michael Ramscar; Melody Dye; Hanna Muenke Popick; Fiona O'Donnell-McCarthy (2023). The frequency of number words in English and Spanish. [Dataset]. https://figshare.com/articles/dataset/_The_frequency_of_number_words_in_English_and_Spanish_/423578
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Michael Ramscar; Melody Dye; Hanna Muenke Popick; Fiona O'Donnell-McCarthy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The table shows the spoken frequency counts of numbers 1–7 as they occur prenominally (e.g., "six hats"). The counts are taken from the 385-million-word Corpus of Contemporary American English (COCA) [53] and the 100-million-word Corpus del Español (CORDES) [54], respectively. (Note: the English-Spanish comparison is slightly complicated because "uno" is gendered in Spanish: it takes the form "una" with some nouns, and "una" is not used exclusively as a number word. The figure for "uno" presented here is a weighted estimate: number-word+noun sequences : tokens of each number word in the corpus.) (* = estimate.)

  16. Data from: CELEX2

    • borealisdata.ca
    Updated Apr 17, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    R H. Baayen; R Piepenbrock; L Gulikers (2023). CELEX2 [Dataset]. http://doi.org/10.5683/SP2/XGW4WY
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Borealis
    Authors
    R H. Baayen; R Piepenbrock; L Gulikers
    License

    Custom license: https://borealisdata.ca/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.5683/SP2/XGW4WY

    Description

    Introduction This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.0). CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Pre-mastering and production was done by the LDC. For each language, this data set contains detailed information on: orthography (variations in spelling, hyphenation) phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress) morphology (derivational and compositional structure, inflectional paradigms) syntax (word class, word class-specific subcategorizations, argument structures) word frequency (summed word and lemma counts, based on recent and representative text corpora) The databases have not been tailored to fit any particular database management program. Instead, the information is in ASCII files in a UNIX directory tree that can be queried with tools, such as AWK or ICON. Unique identity numbers allow the linking of information from different files. Some kinds of information have to be computed online; wherever necessary, AWK functions have been provided to recover this information. README files specify the details of their use. A detailed User Guide describing the various kinds of lexical information available is supplied. All sections of this guide are POSTSCRIPT files, except for some additional notes on the German lexicon in plain ASCII. CELEX-2 The second release of CELEX contains an enhanced, expanded version of the German lexical database (2.5), featuring approximately 1,000 new lemma entries, revised morphological parses, verb argument structures, inflectional paradigm codes and a corpus type lexicon. 
    A complete PostScript version of the Germanic Linguistic Guide is also included, in both European A4 format and American Letter format. For German, the total number of lemmas included is now 51,728, while all their inflected forms number 365,530. Moreover, phonetic syllable frequencies have been added for (British) English and Dutch. Apart from this, and the provision of frequency information alongside every lexical feature, no changes have been made to the Dutch and English lexicons. Complete AWK scripts are now provided to compute representations not found in the (plain ASCII) lexical data files, corresponding to the features described in the CELEX User Guide, which is included as well. For each language, i.e. English, German and Dutch, the data contains detailed information on the orthography (variations in spelling, hyphenation), the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), the morphology (derivational and compositional structure, inflectional paradigms), the syntax (word class, word-class-specific subcategorisation, argument structures) and word frequency (summed word and lemma counts, based on recent and representative text corpora) of both wordforms and lemmas. Unique identity numbers allow the linking of information from different files with the aid of an efficient, index-based C program. Like its predecessor, this release is mastered using the ISO 9660 data format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh and UNIX environments. As the new release does not omit any data from the first edition, the current release will replace the old one. Updates: Petra Steiner has developed a number of scripts to modify and update CELEX2 to a modern format. They are available on her GitHub page. LREC papers related to these updates are accessible at the following URLs: http://aclweb.org/anthology/W17-7619 & http://www.lrec-conf.org/proceedings/lrec2016/summaries/761.html.

  17. tvarchive Dataset

    • explore.openaire.eu
    Updated Jul 3, 2021
    Cite
    WhatEvery1Says (WE1S) Project (2021). tvarchive Dataset [Dataset]. http://doi.org/10.5281/zenodo.5068196
    Explore at:
    Dataset updated
    Jul 3, 2021
    Authors
    WhatEvery1Says (WE1S) Project
    Description

    The tvarchive dataset contains word-frequency and other non-consumptive-use data about 1,205,844 English-language transcriptions of U.S. television news broadcasts. The documents were scraped from the Internet Archive's TV News Archive, which includes automatic captions of select U.S. news broadcasts since 2009. While the complete TV News Archive contains over 2.2 million transcripts, WE1S researchers were only able to collect about 1.2 million documents containing complete transcripts. The full TV News Archive includes transcripts from 33 networks and hundreds of shows. Unlike other WE1S datasets, the tvarchive dataset was not collected using keyword searches for specific terms (i.e., documents containing the word "humanities"). (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.") WE1S makes word-frequency data available for "non-consumptive use" only; this dataset cannot be used to access, read, or reconstruct the original texts. The data has been archived in JSONL format (each JSON document is delimited by a line break).
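A JSONL archive of this kind is read one JSON document per line. A minimal sketch, assuming a hypothetical "features" field mapping words to counts (not the actual WE1S schema):

```python
# Read line-delimited JSON and aggregate word frequencies across documents.
# The "features" field name is an assumption for illustration.
import json

jsonl = "\n".join([
    json.dumps({"id": "doc1", "features": {"news": 3, "senate": 1}}),
    json.dumps({"id": "doc2", "features": {"news": 2, "weather": 5}}),
])

totals = {}
for line in jsonl.splitlines():
    doc = json.loads(line)  # one complete JSON document per line
    for word, count in doc["features"].items():
        totals[word] = totals.get(word, 0) + count
```

In practice the loop would iterate over an open file object rather than an in-memory string; the line-per-document format makes this streamable without loading the whole archive.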

  18. Correlation matrix over all languages for mean frequency of word groups...

    • plos.figshare.com
    bin
    Updated Jun 21, 2023
    Gisbert Wilhelm Teepe; Edda Magareta Glase; Ulf-Dietrich Reips (2023). Correlation matrix over all languages for mean frequency of word groups (z-transformed) from 1970 until 2019. [Dataset]. http://doi.org/10.1371/journal.pone.0284091.t001
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Gisbert Wilhelm Teepe; Edda Magareta Glase; Ulf-Dietrich Reips
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Correlation matrix over all languages for mean frequency of word groups (z-transformed) from 1970 until 2019.

  19. Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...

    • datarade.ai
    Updated Jul 11, 2025
    Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
    Explore at:
    .csv, .json, .mp3, .txt, .wav, .xls, .xmlAvailable download formats
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Oxford Languages (https://www.lexico.com/)
    Area covered
    Chile, Ecuador, Costa Rica, Nicaragua, Bolivia (Plurinational State of), Cuba, Paraguay, Panama, Honduras, Colombia
    Description

    Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

    1. Spanish Monolingual Dictionary Data
    2. Spanish Bilingual Dictionary Data
    3. Spanish Sentences Data
    4. Synonyms and Antonyms Data
    5. Audio Data
    6. Word list Data

    Key Features (approximate numbers):

    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual dictionary reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Headwords: 73,000
    • Senses: 123,000
    • Sentence examples: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually
    2. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts, and offers significant coverage of the language, providing a large volume of high-quality translated words.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually
    3. Spanish Sentences Data

    Spanish sentences retrieved from the corpus are ideal for NLP model training, comprising approximately 20 million words. The sentences provide broad coverage of Spanish-speaking countries and are tagged to a particular country or dialect accordingly.

    • Sentences volume: 1,840,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    4. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Update frequency: annually
    5. Spanish Audio Data (word-level)

    Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)
    6. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

  20. Replication data for: Chunking or predicting – frequency information and...

    • search.dataone.org
    • dataverse.no
    • +1more
    Updated Jan 5, 2024
    Lorenz, David; Tizón-Couto, David (2024). Replication data for: Chunking or predicting – frequency information and reduction in the perception of multi-word sequences [Dataset]. http://doi.org/10.18710/7TSABU
    Explore at:
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    DataverseNO
    Authors
    Lorenz, David; Tizón-Couto, David
    Time period covered
    May 9, 2016 - Nov 24, 2016
    Description

    This is the data and code from a word-monitoring task, in which participants responded to the word 'to' in verb + to-infinitive structures (V-to-Vinf) in English, where 'to' could occur in a full or reduced pronunciation. Accuracy and response times were analysed with mixed-effects generalized additive models (GAMM); the code also includes visualisations of these models. The paper has been accepted for publication in Cognitive Linguistics. The experiment was run with OpenSesame (version 3.0.7 for Mac, cf. Mathôt et al. 2012). The data include information on frequencies of occurrence of words and bigrams, extracted from the Corpus of Contemporary American English (COCA, Davies 2008–). We used R (R Core Team 2017) for all data analyses, hence the code can best be replicated in R. Abstract: Frequently used linguistic structures become entrenched in memory; this is often assumed to make their consecutive parts more predictable, as well as to fuse them into a single unit (chunking). High frequency moreover leads to a propensity for phonetic reduction. We present a word-recognition experiment which tests how frequency information (string frequency, transitional probability) interacts with reduction in speech perception. Detection of the element 'to' is tested in V-to-Vinf sequences in English (e.g. need to Vinf), where 'to' can undergo reduction ("needa"). Results show that reduction impedes recognition, but this can be mitigated by the predictability of the item. Recognition generally benefits from surface frequency, while a modest chunking effect is found in delayed responses to reduced forms of high-frequency items. Transitional probability shows a facilitating effect on reduced but not on full forms. Reduced forms also pose more difficulty when the phonological context obscures the onset of 'to'. We conclude that listeners draw on frequency information in a predictive manner to cope with reduction. High-frequency structures are not inevitably perceived as chunks, but depend on cues in the phonetic form: reduction leads to perceptual prominence of the whole over the parts and thus promotes holistic access.
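The two frequency measures contrasted in the abstract can be illustrated with toy corpus counts (the figures below are invented, not the actual COCA values): string frequency is the raw count of the V + 'to' sequence, while transitional probability conditions that count on the verb.

```python
# Toy illustration of string frequency vs. transitional probability.
# Counts are invented for the example, not taken from COCA.
verb_freq = {"need": 100_000, "cease": 2_000}         # unigram counts
bigram_freq = {"need to": 60_000, "cease to": 1_500}  # V + "to" counts

def transitional_probability(verb: str) -> float:
    """P(to | verb) = freq(verb + 'to') / freq(verb)."""
    return bigram_freq[f"{verb} to"] / verb_freq[verb]

# "cease to" is rarer overall (lower string frequency) but more
# predictable given its verb (higher transitional probability).
tp_need = transitional_probability("need")
tp_cease = transitional_probability("cease")
```

This is why the two measures can dissociate in the results above: an item can have low surface frequency yet high predictability, or vice versa.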
