100+ datasets found
  1. Word cloud for data science

    • scidb.cn
    Updated Apr 3, 2023
    Cite
    Lili Zhang (2023). Word cloud for data science [Dataset]. http://doi.org/10.57760/sciencedb.07847
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    251 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Apr 3, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Lili Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes a .txt file and an .ipynb file. The raw data were retrieved from Web of Science on 24 February 2023. After restricting the results to published articles titled "data science", 3,490 records with abstracts were selected. The Python code for the word cloud analysis is also shared. This package provides supporting details for the paper "Looking Back to the Future: A Glimpse at Twenty Years of Data Science", submitted to the Data Science Journal.
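    The shared notebook is not reproduced here; as a rough, hedged illustration of the word-cloud step described above, a minimal sketch using the third-party wordcloud package might look like the following (the input filename abstracts.txt is an assumption, not part of the dataset's documented contents):

```python
# Minimal word-cloud sketch (an illustration, not the author's shared .ipynb).
# Assumes the Web of Science abstracts have been exported to a plain-text file;
# the filename "abstracts.txt" is hypothetical.
from wordcloud import WordCloud, STOPWORDS

with open("abstracts.txt", encoding="utf-8") as f:
    text = f.read()

cloud = WordCloud(width=1200, height=600, background_color="white",
                  stopwords=STOPWORDS).generate(text)
cloud.to_file("wordcloud.png")  # write the rendered cloud to disk
```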

  2. Wake Word Hebrew Dataset

    • shaip.com
    • ta.shaip.com
    Updated Nov 8, 2023
    + more versions
    Cite
    Shaip (2023). Wake Word Hebrew Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-hebrew-dataset/
    Dataset updated
    Nov 8, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Hebrew Wake Word Dataset for AI & Speech Models. Overview: Title: Wake Word Hebrew Language Dataset; Dataset Type: Wake Word; Description: Wake Words / Voice Command / Trigger Word…

  3. Data from: Dataset for classifying English words into difficulty levels by undergraduate and postgraduate students

    • data.mendeley.com
    Updated Jun 12, 2023
    + more versions
    Cite
    Nisar Kangoo (2023). Dataset for classifying English words into difficulty levels by undergraduate and postgraduate students [Dataset]. http://doi.org/10.17632/p2wrs7hm4z.3
    Dataset updated
    Jun 12, 2023
    Authors
    Nisar Kangoo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains English words in column B. For each word, the other columns give its frequency (fre), length (len), part of speech (PS), and difficulty level (level). The dataset has a total of 5,372 unique words: 691 words are marked as difficult at level 1, 141 at level 2, and the remaining 4,541 are easy and therefore have difficulty level 0. Words are labelled "level 2" if they are difficult for postgraduate students, "level 1" if they are difficult for undergraduate students, and "level 0" if they are difficult for neither group. The data were collected from students in Jammu and Kashmir (a Union Territory of India), at latitude and longitude 32.2778° N, 75.3412° E. The attached files are as follows: the dataset_level CSV file is the original dataset containing the English words with their length, frequency, part of speech, and level (difficulty level).
    The dataset_numerical CSV file contains the original dataset with the string fields transformed into numerical values. The "English language difficulty level measurement - Questionnaire (1-6) & PG1,PG2,PG3,PG4" .docx files contain the questionnaires given to college and university students, asking them to underline difficult words in English text. The IGNOU English.zip file contains the Indira Gandhi National Open University (IGNOU) English textbooks for undergraduate and postgraduate students, from which the texts for the above questionnaires were taken.
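    As a hedged sketch of how the tabular file described above could be loaded (the filename and the exact header spellings are assumptions taken from the description, not confirmed by the dataset's documentation):

```python
# Load the difficulty-level dataset with pandas.
# "dataset_level.csv" and the column names ("word", "fre", "len", "PS",
# "level") are assumptions based on the description above.
import pandas as pd

df = pd.read_csv("dataset_level.csv")
print(df.shape)                    # the description reports 5372 unique words
print(df["level"].value_counts())  # expected roughly 4541 / 691 / 141 for levels 0 / 1 / 2

# Example: words considered difficult for undergraduates or postgraduates
difficult = df[df["level"] >= 1]
```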

  4. Wake Word Mandarin Dataset

    • shaip.com
    Updated Mar 22, 2024
    + more versions
    Cite
    Shaip (2024). Wake Word Mandarin Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-mandarin-dataset/
    Dataset updated
    Mar 22, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Mandarin Wake Word Dataset for AI & Speech Models. Overview: Title: Mandarin Language Dataset; Dataset Type: Wake Word; Description: Wake Words / Voice Command / Trigger Word / Keyphrase collection of…

  5. Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.1

    • live.european-language-grid.eu
    binary format
    Updated May 21, 2023
    + more versions
    Cite
    (2023). Parallel sense-annotated corpus ELEXIS-WSD 1.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22947
    Available download formats: binary format
    Dataset updated
    May 21, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

    The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfying semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation of each selected English sentence into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.

    The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.

    List of sense inventories:
    BG: Dictionary of Bulgarian
    DA: DanNet – The Danish WordNet
    EN: Open English WordNet
    ES: Spanish Wiktionary
    ET: The EKI Combined Dictionary of Estonian
    HU: The Explanatory Dictionary of the Hungarian Language
    IT: PSC + Italian WordNet
    NL: Open Dutch WordNet
    PT: Portuguese Academy Dictionary (DACL)
    SL: Digital Dictionary Database of Slovene

    The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).
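    A minimal parsing sketch for the column layout described above follows; the filename is hypothetical, and the MISC key names are deliberately left generic, since their exact spellings are documented in 00README.txt rather than here:

```python
# Read an ELEXIS-WSD CoNLL-U file into a list of sentences (lists of tokens).
# The MISC column is split on "|" into key=value pairs; the actual key names
# (whitespace flag, sense ID, MWE index) are specified in 00README.txt.
def read_conllu(path):
    sentences, tokens = [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.startswith("#"):      # sentence-level comments
                continue
            if not line:                  # blank line ends a sentence
                if tokens:
                    sentences.append(tokens)
                    tokens = []
                continue
            cols = line.split("\t")       # the 10 tab-separated CoNLL-U columns
            misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
            tokens.append({"id": cols[0], "form": cols[1], "lemma": cols[2],
                           "upos": cols[3], "misc": misc})
    if tokens:
        sentences.append(tokens)
    return sentences

sentences = read_conllu("elexis-wsd-en.conllu")   # hypothetical filename
```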

    Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

    For more information, please refer to 00README.txt.

    Differences from version 1.0:
    - Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs).
    - The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0).
    - An error that resulted in missing UPOS tags in version 1.0 was fixed.
    - The sentences in all corpora now follow the same order (from 1 to 2024).

  6. Wake Word US Spanish Dataset

    • shaip.com
    • tl.shaip.com
    Updated Oct 13, 2023
    + more versions
    Cite
    Shaip (2023). Wake Word US Spanish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-us-spanish-dataset/
    Dataset updated
    Oct 13, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality US Spanish Wake Word Dataset for AI & Speech Models. Overview: Title: US Spanish Language Dataset; Dataset Type: Wake Word; Description: Wake Words / Voice Command / Trigger Word /…

  7. Data from: CWID-hi: A Dataset for Complex Word Identification in Hindi Text

    • zenodo.org
    csv
    Updated Apr 29, 2022
    Cite
    Gayatri Venugopal; Dhanya Pramod (2022). CWID-hi: A Dataset for Complex Word Identification in Hindi Text [Dataset]. http://doi.org/10.5281/zenodo.5790833
    Available download formats: csv
    Dataset updated
    Apr 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gayatri Venugopal; Dhanya Pramod
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was created by conducting a human intelligence test in which native and non-native Hindi speakers annotated words they could not understand in Hindi text. They were then asked to rank the complexity of these words along with their synonyms. A word that received an average rank of <=3 (out of 5) is labeled 1, and a word that received an average rank of >3 is labeled 0, where 1 indicates complex and 0 indicates simple.

  8. Indian sign Language-Real-life Words

    • data.mendeley.com
    Updated Aug 10, 2022
    + more versions
    Cite
    Akansha Tyagi (2022). Indian sign Language-Real-life Words [Dataset]. http://doi.org/10.17632/s6kgb6r3ss.2
    Dataset updated
    Aug 10, 2022
    Authors
    Akansha Tyagi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The dataset contains RGB images of hand gestures for twenty ISL words, namely ‘afraid’, ‘agree’, ‘assistance’, ‘bad’, ‘become’, ‘college’, ‘doctor’, ‘from’, ‘pain’, ‘pray’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘stand’, ‘today’, ‘warn’, ‘which’, ‘work’ and ‘you’, which are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static. The images were captured from 8 individuals (6 males and 2 females) in the age group of 9 to 30 years. The dataset contains 18,000 images in jpg format. The images are labelled using the format ISLword_X_YYYY_Z, where:
    • ISLword corresponds to one of the twenty words listed above.
    • X is an image number in the range 1 to 900.
    • YYYY is an identifier of the participant and is in the range of 1 to 6.
    • Z is 01 or 02 and identifies the sample number for each subject.
    For example, the file named afraid_1_user1_1 is the image sequence of the first sample of the ISL gesture for the word ‘afraid’ presented by the 1st user.
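    A small sketch of grouping the images by gesture word from the file names follows; because the naming template and the example file name differ slightly, the sketch only assumes that the gesture word is the part before the first underscore, and the directory name is hypothetical:

```python
# Group the ISL gesture images by word using the file names.
# Only the part before the first underscore (e.g. "afraid" in
# "afraid_1_user1_1.jpg") is assumed to be the gesture word; the folder
# name "ISL_real_life_words" is hypothetical.
from collections import defaultdict
from pathlib import Path

by_word = defaultdict(list)
for img in Path("ISL_real_life_words").glob("*.jpg"):
    word = img.stem.split("_")[0]
    by_word[word].append(img)

for word, files in sorted(by_word.items()):
    print(word, len(files))   # the description implies 20 words, ~900 images each
```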

  9. One Billion Word Benchmark Dataset

    • paperswithcode.com
    Cite
    Ciprian Chelba; Tomas Mikolov; Mike Schuster; Qi Ge; Thorsten Brants; Phillipp Koehn; Tony Robinson, One Billion Word Benchmark Dataset [Dataset]. https://paperswithcode.com/dataset/one-billion-word-benchmark
    Authors
    Ciprian Chelba; Tomas Mikolov; Mike Schuster; Qi Ge; Thorsten Brants; Phillipp Koehn; Tony Robinson
    Description

    Text corpus with almost one billion words of training data for statistical language modeling benchmarking. The scale of approximately one billion words attempts to strike a balance between the relevance of the benchmark in a world of abundant data and the ease with which researchers can evaluate their modeling approaches. Monolingual English data was obtained from the WMT11 website and prepared using a variety of best practices for machine learning dataset preparation.

  10. Data from: Dataset of Slovene word formation trees ArboSloleks 1.0

    • live.european-language-grid.eu
    binary format
    Updated Nov 29, 2024
    Cite
    (2024). Dataset of Slovene word formation trees ArboSloleks 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/23752
    Available download formats: binary format
    Dataset updated
    Nov 29, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from Sloleks 2.0 (http://hdl.handle.net/11356/1230). Each word formation tree begins with a root lexeme from Sloleks (e.g. abolicionizem); morphologically related lexemes are then listed in pairs (original lexeme, related lexeme) along with the levels of word formation (e.g. abolicionizem – abolicionist (Level 1); abolicionist – abolicionistka (Level 2)).

    Version 1.0 includes 14,918 word formation trees constructed from 66,360 lexeme pairs. It is available in an ad-hoc .txt format – for information on the structure and how to parse the data, please consult 00README.txt.

  11. Data from: Ancient Greek language models

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Stopponi (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Nissim
    Peels-Matthey
    Stopponi
    McGillivray
    Pedrazzini
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

    Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

    [Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

    [ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

    Diachronica models

    Training data

    Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

    Classical subcorpus

    Hellenistic subcorpus

    Whole corpus

    Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic is appended to the names of models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus).

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
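    A small numerical illustration of the two count-based steps listed above (PPMI with context-distribution smoothing, optionally followed by truncated SVD) is given below. This is not the LSCDetection code itself, only a sketch of the same computation with the stated hyperparameter values:

```python
# Illustration (not the LSCDetection implementation) of the count-based steps
# described above: PPMI with context-distribution smoothing (alpha=0.75,
# shift k=1) and truncated SVD to 300 dimensions with gamma=0.
import numpy as np

def ppmi(counts, alpha=0.75, k=1.0):
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    ctx = counts.sum(axis=0) ** alpha
    p_c = (ctx / ctx.sum())[np.newaxis, :]          # smoothed context distribution
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)                     # keep positive values only

def svd_reduce(matrix, dim=300, gamma=0.0):
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * (s[:dim] ** gamma)          # gamma=0 discards singular values

# `counts` would be a (vocabulary x vocabulary) co-occurrence matrix built
# with a symmetric window of 5, as in the models above.
```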

    Word2Vec

    Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

    Syntactic word embeddings

    Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

    ALP models

    Training data

    Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

    Word2Vec

    Software used: Gensim library (Řehůřek and Sojka, 2010)

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
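    A minimal Gensim sketch matching the hyperparameter values listed above follows. In Gensim >= 4 the old size parameter is called vector_size; the toy sentences are placeholders for the lemmatized, stopword-filtered Diorisis corpus:

```python
# Word2Vec with the ALP hyperparameters (size=30, window=5, min_count=5,
# negative=20); `size` corresponds to `vector_size` in Gensim >= 4.
# The toy corpus below is a placeholder, not the Diorisis training data.
from gensim.models import Word2Vec

sentences = [["λόγος", "ἀγαθός", "ἀνήρ"]] * 10   # placeholder tokenized sentences

cbow = Word2Vec(sentences, vector_size=30, window=5,
                min_count=5, negative=20, sg=0, workers=4)
sgns = Word2Vec(sentences, vector_size=30, window=5,
                min_count=5, negative=20, sg=1, workers=4)

print(sgns.wv.most_similar("λόγος", topn=2))
```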

    References

    Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

    Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

    Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

    Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

    Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

    Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

    Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

    Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

    Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013

    Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.

  12. Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label indicating whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

    Each example contains the following data fields:
    - word: the target word with multiple meanings
    - sentence1: the first sentence containing the target word
    - sentence2: the second sentence containing the target word
    - idx: the index of the example in the dataset
    - label: label indicating whether the sentences use the same meaning of the target word
    - start1: start of the target word in the first sentence
    - start2: start of the target word in the second sentence
    - end1: end of the target word in the first sentence
    - end2: end of the target word in the second sentence
    - version: the version of the annotation
    - manual_annotation: Boolean indicating whether the label was manually annotated
    - group: the group of annotators that labelled the example
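    A loading sketch is given below. SuperGLUE's WiC files ship as JSON Lines (one object per line); whether SloWiC uses the same layout and how the files are named is not stated above, so both are assumptions:

```python
# Read SloWiC examples, assuming a SuperGLUE-style JSON Lines file.
# The filename "slowic.jsonl" and the one-object-per-line layout are
# assumptions; adjust if the release ships a single JSON array instead.
import json

examples = []
with open("slowic.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            examples.append(json.loads(line))

manual = [ex for ex in examples if ex.get("manual_annotation")]
print(len(examples), "examples,", len(manual), "manually annotated")

ex = examples[0]
# start1/end1 are character offsets of the target word in sentence1
print(ex["word"], "->", ex["sentence1"][ex["start1"]:ex["end1"]], "| label:", ex["label"])
```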

  13. Data for: Cognitive processes underlying spoken word recognition during soft speech

    • data.mendeley.com
    Updated Apr 26, 2021
    + more versions
    Cite
    Kristi Hendrickson (2021). Data for: Cognitive processes underlying spoken word recognition during soft speech [Dataset]. http://doi.org/10.17632/h5p3bgm6nb.1
    Dataset updated
    Apr 26, 2021
    Authors
    Kristi Hendrickson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Time series data for two Visual World Paradigm eye-tracking experiments. Stimuli consisted of words presented at a conversational level (65 dBA) and words at lower intensities (40 and 50 dBA). Data files are .edf outputs from an EyeLink system.

  14. United States Imports from Mexico of Typewriters and word processing machines

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Feb 10, 2020
    Cite
    TRADING ECONOMICS (2020). United States Imports from Mexico of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/united-states/imports/mexico/typewriters-word-processing-machines
    Available download formats: excel, csv, json, xml
    Dataset updated
    Feb 10, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2025
    Area covered
    United States
    Description

    United States Imports from Mexico of Typewriters and word processing machines was US$27.07 Thousand during 2012, according to the United Nations COMTRADE database on international trade. United States Imports from Mexico of Typewriters and word processing machines - data, historical chart and statistics - was last updated in June 2025.

  15. Data in Support of "Orthographic and Semantic Learning During Shared Reading: Investigating Relations to Early Word Reading"

    • borealisdata.ca
    Updated Oct 3, 2024
    Cite
    Savannah Heintzman; Deacon S. Hélène (2024). Data in Support of "Orthographic and Semantic Learning During Shared Reading: Investigating Relations to Early Word Reading" [Dataset]. http://doi.org/10.5683/SP3/WOKZGB
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    Borealis
    Authors
    Savannah Heintzman; Deacon S. Hélène
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    To assess their knowledge of meanings and spellings of non-words, eighty-four 4- to 6-year-old children listened to a researcher read a short story with non-words in it. After listening to the story, children were administered multiple-choice tasks to determine their understanding. The children also completed standardized measures of alphabet knowledge, word reading skill, and phonological awareness. This data was collected from April to June 2023, in Halifax, Nova Scotia, Canada. "Orthographic and Semantic Learning During Shared Reading Data.csv" is a record of the results and supports the manuscript “Orthographic and Semantic Learning During Shared Reading: Investigating Relations to Word Reading.” The file contains 15 variables regarding participants’ age, gender, and first language, and their scores on the orthographic and semantic learning tasks, and on the standardized measures of alphabet knowledge, word reading, and phonological awareness.

  16. France Imports from Vietnam of Typewriters and word processing machines

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Jul 14, 2024
    Cite
    TRADING ECONOMICS (2024). France Imports from Vietnam of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/france/imports/vietnam/typewriters-word-processing-machines
    Available download formats: json, excel, xml, csv
    Dataset updated
    Jul 14, 2024
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2025
    Area covered
    France
    Description

    France Imports from Vietnam of Typewriters and word processing machines was US$2.55 Thousand during 2016, according to the United Nations COMTRADE database on international trade. France Imports from Vietnam of Typewriters and word processing machines - data, historical chart and statistics - was last updated in July 2025.

  17. Data from: Frequency lists of words from the GOS 1.0 corpus

    • live.european-language-grid.eu
    binary format
    Updated Nov 17, 2019
    + more versions
    Cite
    (2019). Frequency lists of words from the GOS 1.0 corpus [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/8319
    Available download formats: binary format
    Dataset updated
    Nov 17, 2019
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.

    The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted:

    1) one containing lemmas and their text-type distribution,

    2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution.

    In addition, four lists were extracted from all words (regardless of their part-of-speech category):

    1) a list of all lemmas along with their part-of-speech category and text-type distribution;

    2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;

    3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution;

    4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).

  18. Data for: Adversarial Learning of Sentiment Word Representation for Sentiment Analysis

    • data.mendeley.com
    Updated Jul 4, 2020
    Cite
    Jin Wang (2020). Data for: Adversarial Learning of Sentiment Word Representation for Sentiment Analysis [Dataset]. http://doi.org/10.17632/4xtbjd7hfr.1
    Dataset updated
    Jul 4, 2020
    Authors
    Jin Wang
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    The code for our work "Adversarial Learning of Sentiment Word Representation for Sentiment Analysis". The training results will be uploaded later to my GitHub.

  19. Wake Word Brazilian Portuguese Dataset

    • shaip.com
    • ja.shaip.com
    Updated Apr 19, 2024
    + more versions
    Cite
    Shaip (2024). Wake Word Brazilian Portuguese Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-brazilian-portuguese-dataset/
    Dataset updated
    Apr 19, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Brazil
    Description

    High-Quality Brazilian Portuguese Wake Word Dataset for AI & Speech Models. Overview: Title: Brazilian Portuguese Language Dataset; Dataset Type: Wake Word; Description: Wake Words / Voice Command / Trigger Word /…

  20. Data from: MGS MARS MAG CALIBRATED MAPPING DETAIL WORD RESOLUTION V1.0

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). MGS MARS MAG CALIBRATED MAPPING DETAIL WORD RESOLUTION V1.0 [Dataset]. https://data.nasa.gov/dataset/mgs-mars-mag-calibrated-mapping-detail-word-resolution-v1-0
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    License

    U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Calibrated high time resolution (detail word) data from the MAG instrument on the MGS spacecraft, collected during the mapping and extended mission phases (1997-09-12 to 2006-11-06), expressed in payload and Sun-State coordinates.
