100+ datasets found

S
Word cloud for data science
scidb.cn
Updated Apr 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lili Zhang (2023). Word cloud for data science [Dataset]. http://doi.org/10.57760/sciencedb.07847
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.07847
Dataset updated
Apr 3, 2023
Dataset provided by
Science Data Bank
Authors
Lili Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes a .txt file and a .ipynb file. Raw data are captured through Web of Science as retrieval records on 24 February 2023. Refined by only published articles entitled "data science," 3490 pieces of records with abstracts are selected. Besides, the python code for word cloud analysis is also shared. This package provides supporting details for a paper, Looking Back to the Future: A Glimpse at Twenty Years of Data Science, submitted to the Data science Journal.
s
Wake Word Hebrew Dataset
shaip.com
ta.shaip.com
Updated Nov 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2023). Wake Word Hebrew Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-hebrew-dataset/
Explore at:
Dataset updated
Nov 8, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Home Wake Word Hebrew DatasetHigh-Quality Hebrew Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleWake Word Hebrew Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word…
m
Data from: Dataset for classifying English words into difficulty levels by...
data.mendeley.com
Updated Jun 12, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nisar Kangoo (2023). Dataset for classifying English words into difficulty levels by undergraduate and postgraduate students [Dataset]. http://doi.org/10.17632/p2wrs7hm4z.3
Explore at:
Unique identifier
https://doi.org/10.17632/p2wrs7hm4z.3
Dataset updated
Jun 12, 2023
Authors
Nisar Kangoo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains English words in column B. Corresponding to each word the other columns contain its frequency(fre), length(len), parts of speech(PS) and difficulty level (level).The dataset has a total of 5372 unique words. The words marked as difficult at level 1 are 691; at level 2, they are 141; and all remaining words, viz., 4541, are easy and hence have difficulty level 0. The words are labeled "level 2" if they are difficult for post-graduate students, and "level 1" if they are difficult for undergraduate students. The words are labeled "level 0" if they are neither difficult for undergraduate students nor postgraduate students. The data is collected from the students of Jammu and Kashmir (a Union Territory of India). Latitude and Longitude (32.2778° N, 75.3412° E) The description of files attached is as: The dataset_level CSV file is the original dataset containing English words, its length, frequency, Parts of speech and Level(difficulty level).
The dataset_numerical CSV file contains the original dataset along with string fields transformed into numerical. The English language difficulty level measurement -Questionnaire (1-6) & PG1,PG2,PG3,PG4 .docx files contains the questionnaire supplied to students of College and University to underline difficult words in the English text. IGNOU English.zip file contains the Indra Gandhi National Open University (IGNOU) English text books for graduation and post graduation students. The text for above questionnaires were taken from these IGNOU English text books.
s
Wake Word Mandarin Dataset
shaip.com
Updated Mar 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Wake Word Mandarin Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-mandarin-dataset/
Explore at:
Dataset updated
Mar 22, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Home Mandarin DatasetHigh-Quality Mandarin Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleMandarin Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word / Keyphrase collection of…
E
Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.1
live.european-language-grid.eu
binary format
Updated May 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Parallel sense-annotated corpus ELEXIS-WSD 1.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22947
Explore at:
binary formatAvailable download formats
Dataset updated
May 21, 2023
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain a satisfying semantic coverage, we filtered out sentences with less than 5 words and less than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus is comprised of 2,024 sentences for each language.

The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.

List of sense inventories BG: Dictionary of Bulgarian DA: DanNet – The Danish WordNet EN: Open English WordNet ES: Spanish Wiktionary ET: The EKI Combined Dictionary of Estonian HU: The Explanatory Dictionary of the Hungarian Language IT: PSC + Italian WordNet NL: Open Dutch WordNet PT: Portuguese Academy Dictionary (DACL) SL: Digital Dictionary Database of Slovene

The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).

Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

For more information, please refer to 00README.txt.

Differences to version 1.0: - Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs). - The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0). - An error was fixed that resulted in missing UPOS tags in version 1.0. - The sentences in all corpora now follow the same order (from 1 to 2024).
s
Wake Word US Spanish Dataset
shaip.com
tl.shaip.com
Updated Oct 13, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2023). Wake Word US Spanish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-us-spanish-dataset/
Explore at:
Dataset updated
Oct 13, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Home US Spanish DatasetHigh-Quality US Spanish Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleUS Spanish Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word /…
Data from: CWID-hi: A Dataset for Complex Word Identification in Hindi Text
zenodo.org
csv
Updated Apr 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gayatri Venugopal; Gayatri Venugopal; Dhanya Pramod; Dhanya Pramod (2022). CWID-hi: A Dataset for Complex Word Identification in Hindi Text [Dataset]. http://doi.org/10.5281/zenodo.5790833
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5790833
Dataset updated
Apr 29, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gayatri Venugopal; Gayatri Venugopal; Dhanya Pramod; Dhanya Pramod
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was created by conducting a human intelligence test, wherein native and non-native Hindi speakers annotated words they could not understand in Hindi text. They were then asked to rank the complexity of these words along with their synonyms. A word that received an average rank of <=3 (out of 5) is labeled 1 and the word that received an average rank of >3 is labeled 0. 1 indicates complex and 0 indicates simple.
m
Indian sign Language-Real-life Words
data.mendeley.com
Updated Aug 10, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akansha Tyagi (2022). Indian sign Language-Real-life Words [Dataset]. http://doi.org/10.17632/s6kgb6r3ss.2
Explore at:
Unique identifier
https://doi.org/10.17632/s6kgb6r3ss.2
Dataset updated
Aug 10, 2022
Authors
Akansha Tyagi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
The dataset contains the RGB images of hand gestures of twenty ISL words, namely, ‘afraid’,’agree’,’assistance’,’bad’,’become’,’college’,’doctor’,’from’,’pain’,’pray’, ’secondary’, ’skin’, ’small’, ‘specific’, ‘stand’, ’today’, ‘warn’, ‘which’, ‘work’, ‘you’’ which are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static. The images were captured from 8 individuals including 6 males and 2 females in the age group of 9 years to 30 years. The dataset contains a 18000 images in jpg format. The images are labelled using the format ISLword_X_YYYY_Z, where: • ISLword corresponds to the words ‘afraid’, ‘agree’, ‘assistance’, ‘bad’, ‘become’, ‘college’, ‘doctor’ ,‘from’, ’pray’, ‘pain’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘stand’, ‘today’, ‘warn’, ‘which’, ‘work’, ‘you’. • X is an image number in the range 1 to 900. • YYYY is an identifier of the participant and is in the range of 1 to 6. • Z corresponds to 01 or 02 that identifies the sample number for each subject. For example, the file named afraid_1_user1_1 is the image sequence of the first sample of the ISL gesture of the word ‘afraid’ presented by the 1st user.
P
One Billion Word Benchmark Dataset
paperswithcode.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ciprian Chelba; Tomas Mikolov; Mike Schuster; Qi Ge; Thorsten Brants; Phillipp Koehn; Tony Robinson, One Billion Word Benchmark Dataset [Dataset]. https://paperswithcode.com/dataset/one-billion-word-benchmark
Explore at:
Authors
Ciprian Chelba; Tomas Mikolov; Mike Schuster; Qi Ge; Thorsten Brants; Phillipp Koehn; Tony Robinson
Description
Text corpus with almost one billion words of training data for statistical language modeling benchmarking. The scale of approximately one billion words attempts to strike a balance between the relevance of the benchmark in a world of abundant data against the ease with which researchers can evaluate their modeling approaches. Monolingual english data was obtained from the WMT11 website and prepared using a variety of best-practices for machine learning dataset preparations.
E
Data from: Dataset of Slovene word formation trees ArboSloleks 1.0
live.european-language-grid.eu
binary format
Updated Nov 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Dataset of Slovene word formation trees ArboSloleks 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/23752
Explore at:
binary formatAvailable download formats
Dataset updated
Nov 29, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from Sloleks 2.0 (http://hdl.handle.net/11356/1230). Each word formation tree begins with a root lexeme from Sloleks (e.g. abolicionizem); morphologically related lexemes are then listed in pairs (original lexeme, related lexeme) along with the levels of word formation (e.g. abolicionizem – abolicionist (Level 1); abolicionist – abolicionistka (Level 2)).

Version 1.0 includes 14.918 word formation trees constructed from 66.360 lexeme pairs. It is available in an ad-hoc .txt format – for information on the structure and how to parse the data, please consult 00README.txt.
Z
Data from: Ancient Greek language models
data.niaid.nih.gov
Updated Apr 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stopponi (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
Explore at:
Dataset updated
Apr 29, 2024
Dataset provided by
Nissim
Peels-Matthey
Stopponi
McGillivray
Pedrazzini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

Diachronica models

Training data

Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

Classical subcorpus

Hellenistic subcorpus

Whole corpus

Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is appended to the name of the models trained on the Hellenestic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).

Models

Count-based

Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.

Word2Vec

Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

Syntactic word embeddings

Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

ALP models

Training data

Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

Models

Count-based

Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

Word2Vec

Software used: Gensim library (Řehůřek and Sojka, 2010)

a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

References

Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013

Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
c
Slovenian Word in Context dataset SloWiC 1.0
clarin.si
live.european-language-grid.eu
Updated Mar 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781
Explore at:
Dataset updated
Mar 23, 2023
Authors
Timotej Knez; Slavko Žitnik
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The SloWIC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows if both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

Each example contains the following data fields: - word: The target word with multiple meanings - sentence1: The first sentence containing the target word - sentence2: The second sentence containing the target word - idx: The index of the example in the dataset - label: Label showing if the sentences contain the same meaning of the target word - start1: Start of the target word in the first sentence - start2: Start of the target word in the second sentence - end1: End of the target word in the first sentence - end2: End of the target word in the second sentence - version: The version of the annotation - manual_annotation: Boolean showing if the label was manually annotated - group: The group of annotators that labelled the example
m
Data for: Cognitive processes underlying spoken word recognition during soft...
data.mendeley.com
Updated Apr 26, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kristi Hendrickson (2021). Data for: Cognitive processes underlying spoken word recognition during soft speech [Dataset]. http://doi.org/10.17632/h5p3bgm6nb.1
Explore at:
Unique identifier
https://doi.org/10.17632/h5p3bgm6nb.1
Dataset updated
Apr 26, 2021
Authors
Kristi Hendrickson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Time series data for two Visual World Paradigm eye-tracking experiments. Stimuli consistented of words presented at a conversational level (65 dBA) and words at lower intensities (40 and 50 dBA). Data files are .edf outputs from an EyeLink System,
T
United States Imports from Mexico of Typewriters and word processing...
tradingeconomics.com
csv, excel, json, xml
Updated Feb 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TRADING ECONOMICS (2020). United States Imports from Mexico of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/united-states/imports/mexico/typewriters-word-processing-machines
Explore at:
excel, csv, json, xmlAvailable download formats
Dataset updated
Feb 10, 2020
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1990 - Dec 31, 2025
Area covered
United States
Description
United States Imports from Mexico of Typewriters and word processing machines was US$27.07 Thousand during 2012, according to the United Nations COMTRADE database on international trade. United States Imports from Mexico of Typewriters and word processing machines - data, historical chart and statistics - was last updated on June of 2025.
B
Data in Support of "Orthographic and Semantic Learning During Shared...
borealisdata.ca
Updated Oct 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Savannah Heintzman; Deacon S. Hélène (2024). Data in Support of "Orthographic and Semantic Learning During Shared Reading: Investigating Relations to Early Word Reading" [Dataset]. http://doi.org/10.5683/SP3/WOKZGB
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/WOKZGB
Dataset updated
Oct 3, 2024
Dataset provided by
Borealis
Authors
Savannah Heintzman; Deacon S. Hélène
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
To assess their knowledge of meanings and spellings of non-words, eighty-four 4- to 6-year-old children listened to a researcher read a short story with non-words in it. After listening to the story, children were administered multiple-choice tasks to determine their understanding. The children also completed standardized measures of alphabet knowledge, word reading skill, and phonological awareness. This data was collected from April to June 2023, in Halifax, Nova Scotia, Canada. "Orthographic and Semantic Learning During Shared Reading Data.csv" is a record of the results and supports the manuscript “Orthographic and Semantic Learning During Shared Reading: Investigating Relations to Word Reading.” The file contains 15 variables regarding participants’ age, gender, and first language, and their scores on the orthographic and semantic learning tasks, and on the standardized measures of alphabet knowledge, word reading, and phonological awareness.
T
France Imports from Vietnam of Typewriters and word processing machines
tradingeconomics.com
csv, excel, json, xml
Updated Jul 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TRADING ECONOMICS (2024). France Imports from Vietnam of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/france/imports/vietnam/typewriters-word-processing-machines
Explore at:
json, excel, xml, csvAvailable download formats
Dataset updated
Jul 14, 2024
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1990 - Dec 31, 2025
Area covered
France
Description
France Imports from Vietnam of Typewriters and word processing machines was US$2.55 Thousand during 2016, according to the United Nations COMTRADE database on international trade. France Imports from Vietnam of Typewriters and word processing machines - data, historical chart and statistics - was last updated on July of 2025.
E
Data from: Frequency lists of words from the GOS 1.0 corpus
live.european-language-grid.eu
binary format
Updated Nov 17, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2019). Frequency lists of words from the GOS 1.0 corpus [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/8319
Explore at:
binary formatAvailable download formats
Dataset updated
Nov 17, 2019
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.

The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted:
1) one containing lemmas and their text-type distribution,
2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution.

In addition, four lists were extracted from all words (regardless of their part-of-speech category):
1) a list of all lemmas along with their part-of-speech category and text-type distribution;
2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;
3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution;
4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
m
Data for: Adversarial Learning of Sentiment Word Representation for...
data.mendeley.com
Updated Jul 4, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jin Wang (2020). Data for: Adversarial Learning of Sentiment Word Representation for Sentiment Analysis [Dataset]. http://doi.org/10.17632/4xtbjd7hfr.1
Explore at:
Unique identifier
https://doi.org/10.17632/4xtbjd7hfr.1
Dataset updated
Jul 4, 2020
Authors
Jin Wang
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
The codes of our work "Adversarial Learning of Sentiment Word Representation for Sentiment Analysis". The training results will upload latter on my Github.
s
Wake Word Brazilian Portuguese Dataset
shaip.com
ja.shaip.com
Updated Apr 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Wake Word Brazilian Portuguese Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-brazilian-portuguese-dataset/
Explore at:
Dataset updated
Apr 19, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Brazil
Description
Home Brazilian Portuguese DatasetHigh-Quality Brazilian Portuguese Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleBrazilian Portuguese Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word /…
Data from: MGS MARS MAG CALIBRATED MAPPING DETAIL WORD RESOLUTION V1.0
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
Updated Mar 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). MGS MARS MAG CALIBRATED MAPPING DETAIL WORD RESOLUTION V1.0 [Dataset]. https://data.nasa.gov/dataset/mgs-mars-mag-calibrated-mapping-detail-word-resolution-v1-0
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
Calibrated high time resolution (detail word) from the MAG instrument on the MGS spacecraft, collected during the mapping and extended mission phases (1997-09-12 to 2006-11-06) expressed in payload and Sun-State coordinates.

Facebook

Twitter

Click to copy link

Link copied

Cite

Lili Zhang (2023). Word cloud for data science [Dataset]. http://doi.org/10.57760/sciencedb.07847

Word cloud for data science

Explore at:

251 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.57760/sciencedb.07847

Dataset updated

Apr 3, 2023

Dataset provided by

Science Data Bank

Authors

Lili Zhang

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset includes a .txt file and a .ipynb file. Raw data are captured through Web of Science as retrieval records on 24 February 2023. Refined by only published articles entitled "data science," 3490 pieces of records with abstracts are selected. Besides, the python code for word cloud analysis is also shared. This package provides supporting details for a paper, Looking Back to the Future: A Glimpse at Twenty Years of Data Science, submitted to the Data science Journal.

Clear search

Close search

Google apps

Main menu

Word cloud for data science

Wake Word Hebrew Dataset

Data from: Dataset for classifying English words into difficulty levels by...

Wake Word Mandarin Dataset

Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.1

Wake Word US Spanish Dataset

Data from: CWID-hi: A Dataset for Complex Word Identification in Hindi Text

Indian sign Language-Real-life Words

One Billion Word Benchmark Dataset

Data from: Dataset of Slovene word formation trees ArboSloleks 1.0

Data from: Ancient Greek language models

Slovenian Word in Context dataset SloWiC 1.0

Data for: Cognitive processes underlying spoken word recognition during soft...

United States Imports from Mexico of Typewriters and word processing...

Data in Support of "Orthographic and Semantic Learning During Shared...

France Imports from Vietnam of Typewriters and word processing machines

Data from: Frequency lists of words from the GOS 1.0 corpus

Data for: Adversarial Learning of Sentiment Word Representation for...

Wake Word Brazilian Portuguese Dataset

Data from: MGS MARS MAG CALIBRATED MAPPING DETAIL WORD RESOLUTION V1.0

Word cloud for data science