5 datasets found
  1. Grammar transformations of topographic feature type annotations of the U.S. to structured graph data

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Jul 20, 2024
    Cite
    U.S. Geological Survey (2024). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
    Explore at:
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of the input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to knowledge graph software, and perform SPARQL queries on the data. The study showed that SPARQL queries effectively retrieved graph triples that carried semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. The data were collected in 2024 by passing text through multiple Python programs built on spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before being processed by the Python programs, the input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized, and each token was characterized by its part-of-speech, tag, dependency relation, dependency head, and lemma. Every word within the lexicon was tokenized. A stop-words list was used only to remove punctuation and symbols from the text; hyphenated words (e.g., bowl-shaped) were kept intact. The tokens’ lemmas were then aggregated and totaled to find their recurrences within the lexicon. The procedure was repeated to tokenize noun chunks from the same glossary definitions.
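    The token-filtering and lemma-aggregation steps described above can be sketched in plain Python. The token rows below are hypothetical stand-ins for spaCy output (in the study, each row came from the transformer pipeline's part-of-speech, tag, dependency, and lemma attributes); only the counting logic is illustrated here.

```python
from collections import Counter

# Hypothetical token rows standing in for spaCy output:
# (text, part_of_speech, tag, dependency, head, lemma)
tokens = [
    ("A", "DET", "DT", "det", "depression", "a"),
    ("bowl-shaped", "ADJ", "JJ", "amod", "depression", "bowl-shaped"),
    ("depression", "NOUN", "NN", "ROOT", "depression", "depression"),
    (",", "PUNCT", ",", "punct", "depression", ","),
    ("or", "CCONJ", "CC", "cc", "depression", "or"),
    ("depressions", "NOUN", "NNS", "conj", "depression", "depression"),
]

# The stop-words step removes punctuation and symbols only;
# hyphenated words such as "bowl-shaped" are kept intact.
kept = [t for t in tokens if t[1] not in ("PUNCT", "SYM")]

# Aggregate lemmas and total their recurrences within the lexicon.
lemma_counts = Counter(t[5] for t in kept)
print(lemma_counts["depression"])  # prints 2: "depression" and "depressions" share a lemma
```

    Counting lemmas rather than surface forms is what lets inflectional variants collapse into a single lexicon entry.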

  2. Dataset of authors, books and publication dates of book series where books equals Adverbs : a graphic guide to grammar

    • workwithdata.com
    Updated Nov 25, 2024
    Cite
    Work With Data (2024). Dataset of authors, books and publication dates of book series where books equals Adverbs : a graphic guide to grammar [Dataset]. https://www.workwithdata.com/datasets/book-series?col=book_series%2Cj0-author%2Cj0-book%2Cj0-publication_date&f=1&fcol0=j0-book&fop0=%3D&fval0=Adverbs+%3A+a+graphic+guide+to+grammar&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered where the book is Adverbs : a graphic guide to grammar. It features 4 columns: book series, authors, books, and publication dates.

  3. Replication Data for: "The category of throw verbs as productive source of the Spanish inchoative construction"

    • search.dataone.org
    • dataverse.no
    • +1 more
    Updated Sep 25, 2024
    Cite
    Van Hulle, Sven; Enghels, Renata (2024). Replication Data for: "The category of throw verbs as productive source of the Spanish inchoative construction." [Dataset]. http://doi.org/10.18710/TR2PWJ
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    DataverseNO
    Authors
    Van Hulle, Sven; Enghels, Renata
    Time period covered
    Jan 1, 1200 - Jan 1, 2000
    Description

    The dataset contains the quantitative data used to create the tables and graphics in the article "The category of throw verbs as productive source of the Spanish inchoative construction." The data from the 21st century originate from the Spanish Web Corpus (esTenTen18), accessed via Sketch Engine; only the subcorpus for European Spanish data was selected. After downloading, the samples were manually cleaned. In the dataset, at most 500 tokens were retained per auxiliary. For the earlier centuries, the data were extracted from the Corpus Diacrónico del Español (CORDE). See Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific corpus queries that were used.

    The data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), besides other criteria that are not taken into account for this study. Concretely, the variables 'Century', 'INF' (infinitive), and 'Class' were used as input for the analysis (see the data-specific sections below for more information about the variables).

    The empirical analysis is based on the downloaded data from the Spanish Web Corpus (esTenTen18) (Kilgarriff & Renau 2013). The Spanish Web Corpus contains 20.3 billion words, of which 3.5 billion belong to the European Spanish domain. The corpus contains internet data, with observations originating from fora, blogs, Wikipedia, etc. Only the subcorpus with European Spanish data was consulted. The search syntax used to detect the inchoative construction was the following: “[lemma="echar"] [tag="R.*"]{0,3}"a"[tag="V.*"] within ” (consult Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for all corpus queries). After downloading, all the observations were manually cleaned. In total, after the removal of false positives, the dataset contains 5514 tokens, with a maximum of 500 tokens per auxiliary.

    False positives were, for example, tagging errors that wrongly coded nouns such as Superman, Pokémon, and Irán as infinitives, as well as observations in which the auxiliary combined with the infinitive did not express the inchoative value but its original semantic meaning, such as "saltar a nadar", which means “to jump to swim” and not “to start to swim”. For auxiliaries with fewer than 500 relevant tokens in the esTenTen corpus, all tokens were retained; for auxiliaries with more than 500 tokens, only the first 500 were selected. For this specific study on the throw verbs, only the following auxiliaries were retained: arrojar, disparar, echar, lanzar, and tirar. For the diachronic data, the Corpus Diacrónico del Español (CORDE) was consulted; see Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific queries used to retrieve the data from CORDE.
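    The per-auxiliary cap described above (keep all tokens when an auxiliary has fewer than 500 relevant hits, otherwise keep only the first 500) can be sketched as follows; the observation tuples are hypothetical placeholders, not rows from the actual dataset.

```python
from collections import defaultdict

# Hypothetical cleaned observations: (auxiliary, infinitive, century).
# 700 hits for "echar" and 300 for "tirar" stand in for real corpus output.
observations = [("echar", "correr", 20)] * 700 + [("tirar", "andar", 19)] * 300

MAX_PER_AUX = 500  # cap stated in the dataset documentation

kept = []
counts = defaultdict(int)
for aux, inf, century in observations:
    # Keep the first MAX_PER_AUX tokens of each auxiliary, in corpus order.
    if counts[aux] < MAX_PER_AUX:
        kept.append((aux, inf, century))
        counts[aux] += 1

print(len(kept))  # prints 800: 500 "echar" + all 300 "tirar"
```

    Auxiliaries under the cap contribute all of their tokens, which is why the total (5514) is not simply 500 times the number of auxiliaries.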

  4. Machine Number Sense Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 14, 2021
    Cite
    Wenhe Zhang; Chi Zhang; Yixin Zhu; Song-Chun Zhu (2021). Machine Number Sense Dataset [Dataset]. https://paperswithcode.com/dataset/machine-number-sense
    Explore at:
    Dataset updated
    Feb 14, 2021
    Authors
    Wenhe Zhang; Chi Zhang; Yixin Zhu; Song-Chun Zhu
    Description

    The dataset consists of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG). The problems take the form of geometric figures: each problem has a set of geometric shapes as its context, with number symbols embedded in them.

  5. Input utterance length (Alt et al., 2021)

    • asha.figshare.com
    Updated May 31, 2023
    Cite
    Mary Alt; Cecilia R. Figueroa; Heidi M. Mettler; Nora Evans-Reitz; Jessie A. Erikson (2023). Input utterance length (Alt et al., 2021) [Dataset]. http://doi.org/10.23641/asha.14226641.v1
    Explore at:
    Available download formats: PDF
    Dataset updated
    May 31, 2023
    Dataset provided by
    American Speech-Language-Hearing Association (http://www.asha.org/)
    Authors
    Mary Alt; Cecilia R. Figueroa; Heidi M. Mettler; Nora Evans-Reitz; Jessie A. Erikson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: This study examined the efficacy of the Vocabulary Acquisition and Usage for Late Talkers (VAULT) treatment in a version that manipulated the length of the clinician utterance in which a target word was presented (dose length). The study also explored ways to characterize treatment responders versus nonresponders.

    Method: Nineteen primarily English-speaking late-talking toddlers (aged 24–34 months at treatment onset) received VAULT and were quasi-randomly assigned to have target words presented in grammatical utterances matching one of two lengths: brief (four words or fewer) or extended (five words or more). Children were measured on their pre- and posttreatment production of (a) target and control words specific to treatment and (b) words not specific to treatment. Classification and Regression Tree (CART) analysis was used to classify responders versus nonresponders.

    Results: VAULT was successful as a whole (i.e., treatment effect sizes greater than 0), with no difference between the brief and extended conditions. Despite the overall significant treatment effect, the treatment was not successful for all participants. CART results (using participants from the current study and a previous iteration of VAULT) provided a dual-node decision tree for classifying treatment responders versus nonresponders.

    Conclusions: The input-based VAULT treatment protocol is efficacious and offers some flexibility in terms of utterance length. When VAULT works, it works well. The CART decision tree uses pretreatment vocabulary levels and performance in the first two treatment sessions to provide clinicians with promising guidelines for who is likely to be a nonresponder and thus might need a modified treatment plan.

    Supplemental Material S1: Individual performance for participants identified as responders (effect size > 0) across baseline and treatment sessions for target and control words. The treatment condition for each participant (brief, i.e., 4 words or fewer; extended, i.e., 5 words or more) is indicated at the top of each graph.

    Supplemental Material S2: Individual performance for participants identified as nonresponders across baseline and treatment sessions for target and control words. The treatment condition for each participant (brief, i.e., 4 words or fewer; extended, i.e., 5 words or more) is indicated at the top of each graph.

    Alt, M., Figueroa, C. R., Mettler, H. M., Evans-Reitz, N., & Erikson, J. A. (2021). A vocabulary acquisition and usage for late talkers treatment efficacy study: The effect of input utterance length and identification of responder profiles. Journal of Speech, Language, and Hearing Research. Advance online publication. https://doi.org/10.1044/2020_JSLHR-20-00525
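    The shape of a dual-node CART tree of the kind described in the study (split first on pretreatment vocabulary, then on early-session performance) can be sketched as a plain function. The thresholds below are invented for illustration and are not the values reported by Alt et al. (2021).

```python
def classify(pretreatment_vocab: int, early_session_score: float) -> str:
    """Hypothetical dual-node decision rule; thresholds are illustrative only."""
    VOCAB_CUTOFF = 50   # hypothetical node 1: pretreatment vocabulary size
    EARLY_CUTOFF = 0.2  # hypothetical node 2: performance in first two sessions
    if pretreatment_vocab >= VOCAB_CUTOFF:
        return "responder"
    # Children below the vocabulary cutoff are split again on early performance.
    return "responder" if early_session_score >= EARLY_CUTOFF else "nonresponder"

print(classify(10, 0.1))  # prints "nonresponder"
```

    A two-node tree like this is easy for clinicians to apply by hand, which is presumably why the study reports it as a screening guideline rather than a full model.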

