These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of the input definitions, use this information to build triple structures of RDF graph data, load our lexicon into knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries effectively retrieved graph triples that displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs using spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, the input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized, and each token was characterized by its part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was used only to remove punctuation and symbols from the text; hyphenated words (e.g., bowl-shaped) were left intact. The tokens' lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated to tokenize noun chunks from the same glossary definitions.
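A minimal sketch of that token-level pass is given below, assuming spaCy and its pre-trained English transformer pipeline (en_core_web_trf) are installed; the sample definition and variable names are placeholders, not the study's actual input or code.

```python
from collections import Counter

import spacy

# The transformer pipeline must be downloaded separately,
# e.g. via `python -m spacy download en_core_web_trf`.
nlp = spacy.load("en_core_web_trf")

# Placeholder input: in the study, these were the rewritten glossary definitions.
definitions = [
    "A basin is a bowl-shaped depression in the land surface or ocean floor.",
]

token_records = []        # (text, POS, tag, dependency, head, lemma) per token
lemma_counts = Counter()  # recurrence of each lemma across the lexicon
chunk_counts = Counter()  # recurrence of each noun chunk

for doc in nlp.pipe(definitions):
    for token in doc:
        # Remove punctuation and symbols only; hyphenated words such as
        # "bowl-shaped" are kept as single tokens by the tokenizer.
        if token.is_punct or token.pos_ == "SYM":
            continue
        token_records.append(
            (token.text, token.pos_, token.tag_, token.dep_,
             token.head.text, token.lemma_)
        )
        lemma_counts[token.lemma_.lower()] += 1
    # The same aggregation, repeated for noun chunks over the same definitions.
    for chunk in doc.noun_chunks:
        chunk_counts[chunk.text.lower()] += 1

print(lemma_counts.most_common(10))
print(chunk_counts.most_common(10))
```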
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about book series. It has 1 row and is filtered where the book is Adverbs : a graphic guide to grammar. It features 4 columns, including authors, books, and publication dates.
The dataset contains the quantitative data used to create the tables and graphics in the article "The category of throw verbs as productive source of the Spanish inchoative construction." The data from the 21st century originate from the Spanish Web Corpus (esTenTen18), accessed via Sketch Engine. Only the subcorpus for European Spanish data was selected. After downloading, the samples were manually cleaned. In the dataset, at most 500 tokens were retained per auxiliary. For the earlier centuries, the data were extracted from the Corpus Diacrónico del Español (CORDE). See Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific corpus queries that were used. The data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), besides other criteria that are not taken into account for this study. Concretely, the variables 'Century', 'INF' (infinitive) and 'Class' were used as input for the analysis (see data-specific sections below for more information about the variables).

The empirical analysis is based on the downloaded data from the Spanish Web Corpus (esTenTen18) (Kilgarriff & Renau 2013). The Spanish Web Corpus contains 20.3 billion words, of which 3.5 billion belong to the European Spanish domain. This corpus contains internet data, with observations originating from fora, blogs, Wikipedia, etc. Only the subcorpus with European Spanish data was consulted. The search syntax used to detect the inchoative construction was the following: “[lemma="echar"] [tag="R.*"]{0,3}"a"[tag="V.*"] within ” (consult Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for all corpus queries). After downloading, all observations were manually cleaned. In total, after the removal of false positives, the dataset contains 5514 tokens, with a maximum of 500 tokens per auxiliary. False positives were, for example, tagging errors that wrongly coded nouns such as Superman, Pokémon, and Irán as infinitives, as well as observations in which the auxiliary in combination with the infinitive did not express the inchoative value but its original semantic meaning, such as "saltar a nadar", which means "to jump to swim" and not "to start to swim". For auxiliaries with fewer than 500 relevant tokens in the esTenTen corpus, all tokens were retained; for auxiliaries with more than 500 tokens, only the first 500 were selected. For this specific study on the throw verbs, only the following auxiliaries were retained: arrojar, disparar, echar, lanzar and tirar.

For the diachronic data, the Corpus Diacrónico del Español (CORDE) was consulted. See Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific queries that were used to retrieve the data in CORDE.
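As an illustration of the sampling described above (not the authors' own script), the per-auxiliary cap of 500 tokens and a cross-tabulation of the analysis variables could be reproduced along these lines; the file name and the Auxiliary column label are assumptions, while Century, INF and Class are the variables named above.

```python
import pandas as pd

# Assumed layout: one row per cleaned concordance hit, with columns
# Auxiliary, Century, INF and Class.
df = pd.read_csv("cleaned_concordances.csv")

# Keep only the five throw-verb auxiliaries used in this study.
throw_verbs = ["arrojar", "disparar", "echar", "lanzar", "tirar"]
df = df[df["Auxiliary"].isin(throw_verbs)]

# Retain at most 500 tokens per auxiliary, taking the first 500 hits
# when an auxiliary exceeds that limit.
df = df.groupby("Auxiliary").head(500)

# Cross-tabulate the semantic class of the infinitive per century,
# i.e. the kind of input behind the article's tables and graphics.
print(pd.crosstab(df["Century"], df["Class"]))
```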
This dataset consists of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG). The problems take the form of geometric figures: each problem has a set of geometric shapes as its context and embedded number symbols.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Purpose: This study examined the efficacy of the Vocabulary Acquisition and Usage for Late Talkers (VAULT) treatment in a version that manipulated the length of clinician utterance in which a target word was presented (dose length). The study also explored ways to characterize treatment responders versus nonresponders.

Method: Nineteen primarily English-speaking late-talking toddlers (aged 24–34 months at treatment onset) received VAULT and were quasirandomly assigned to have target words presented in grammatical utterances matching one of two lengths: brief (four words or fewer) or extended (five words or more). Children were measured on their pre- and posttreatment production of (a) target and control words specific to treatment and (b) words not specific to treatment. Classification and Regression Tree (CART) analysis was used to classify responders versus nonresponders.

Results: VAULT was successful as a whole (i.e., treatment effect sizes of greater than 0), with no difference between the brief and extended conditions. Despite the overall significant treatment effect, the treatment was not successful for all participants. CART results (using participants from the current study and a previous iteration of VAULT) provided a dual-node decision tree for classifying treatment responders versus nonresponders.

Conclusions: The input-based VAULT treatment protocol is efficacious and offers some flexibility in terms of utterance length. When VAULT works, it works well. The CART decision tree uses pretreatment vocabulary levels and performance in the first two treatment sessions to provide clinicians with promising guidelines for who is likely to be a nonresponder and thus might need a modified treatment plan.

Supplemental Material S1. Individual performance for participants identified as responders (effect size > 0) across baseline and treatment sessions for target and control words. The treatment condition for each participant (brief, i.e., 4 words or fewer; extended, i.e., 5 words or more) is indicated at the top of each graph.

Supplemental Material S2. Individual performance for participants identified as nonresponders across baseline and treatment sessions for target and control words. The treatment condition for each participant (brief, i.e., 4 words or fewer; extended, i.e., 5 words or more) is indicated at the top of each graph.

Alt, M., Figueroa, C. R., Mettler, H. M., Evans-Reitz, N., & Erikson, J. A. (2021). A vocabulary acquisition and usage for late talkers treatment efficacy study: The effect of input utterance length and identification of responder profiles. Journal of Speech, Language, and Hearing Research. Advance online publication. https://doi.org/10.1044/2020_JSLHR-20-00525
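For readers unfamiliar with CART, the following is a minimal sketch of such an analysis using scikit-learn's DecisionTreeClassifier; the feature names and toy data are illustrative assumptions and do not reproduce the study's measures or results.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: pretreatment vocabulary size and words produced
# in the first two treatment sessions (toy values, not study data).
X = np.array([
    [40, 1], [55, 0], [120, 3], [200, 5],
    [80, 2], [30, 0], [150, 4], [90, 1],
])
# Toy labels: 1 = responder (effect size > 0), 0 = nonresponder.
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# max_depth=2 yields a shallow decision tree comparable in shape to the
# dual-node tree reported in the study.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["pretx_vocab", "early_session_words"]))
```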