100+ datasets found
  1. Grammar transformations of topographic feature type annotations of the U.S. to structured graph data

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 29, 2025
    Cite
    U.S. Geological Survey (2025). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. The objectives of our study were to analyze the semantic structure of the input definitions, use this information to build triple structures of RDF graph data, load our lexicon into knowledge-graph software, and perform SPARQL queries on the data. Upon completion of the study, SPARQL queries were shown to effectively retrieve graph triples that displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples.

    These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was used only to remove punctuation and symbols from the text, excluding hyphenated words (e.g., bowl-shaped), which remained as such. The tokens' lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
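    The aggregation step described above can be sketched in plain Python. The token triples below are invented stand-ins for the (text, part-of-speech, lemma) attributes that spaCy's pipeline would supply; only the filter-and-count logic is illustrated here.

    ```python
    from collections import Counter

    # Hypothetical token records of the kind a spaCy pipeline emits:
    # (text, part-of-speech, lemma). Values are illustrative only.
    tokens = [
        ("bowl-shaped", "ADJ", "bowl-shaped"),  # hyphenated words remain intact
        ("depressions", "NOUN", "depression"),
        (",", "PUNCT", ","),
        ("depression", "NOUN", "depression"),
    ]

    # The stop-words list removes only punctuation and symbols.
    content = [t for t in tokens if t[1] not in {"PUNCT", "SYM"}]

    # Aggregate lemmas to find their recurrences within the lexicon.
    lemma_counts = Counter(lemma for _, _, lemma in content)
    print(lemma_counts["depression"])  # 2
    ```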

  2. Grammar Correction

    • kaggle.com
    zip
    Updated Dec 19, 2023
    Cite
    Satish Gunjal (2023). Grammar Correction [Dataset]. https://www.kaggle.com/datasets/satishgunjal/grammar-correction
    Explore at:
    zip (63861 bytes)
    Dataset updated
    Dec 19, 2023
    Authors
    Satish Gunjal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset encapsulates the nuances of English grammar through a collection of errors and corrections across various categories. It serves as an invaluable resource for language learners, educators, and NLP enthusiasts. By categorizing common grammatical mistakes, the dataset offers insights into patterns of errors, providing a foundation for developing more sophisticated language models and educational tools. The data has been meticulously compiled to cover a broad spectrum of common grammatical pitfalls, making it a robust tool for improving both written and spoken English proficiency.

    This dataset is designed for a wide range of applications, including but not limited to Generative AI use cases, LLM testing, linguistic research, NLP model training, and educational purposes for both native and non-native English speakers.

    Modeling Potential: Train different machine learning models on a subset of the data and evaluate their performance in correcting the errors. This can provide a benchmark for the dataset's utility in building AI-driven grammar correction tools.
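    A minimal sketch of that benchmark idea: score a corrector against (error, correction) pairs by exact match. The pairs below are invented examples in the dataset's style, and the identity "model" is a trivial do-nothing baseline, not anything shipped with the dataset.

    ```python
    # Invented (error, correction) pairs in the dataset's spirit.
    pairs = [
        ("She go to school every day.", "She goes to school every day."),
        ("I have two cat.", "I have two cats."),
    ]

    def correct_sentence(text: str) -> str:
        return text  # do-nothing baseline; a real model would fix the errors

    # Exact-match accuracy against the gold corrections.
    accuracy = sum(correct_sentence(e) == c for e, c in pairs) / len(pairs)
    print(accuracy)  # 0.0 for the do-nothing baseline
    ```

    Any trained model can be dropped in place of `correct_sentence` to compare against this floor.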

  3. C4_200M

    • kaggle.com
    zip
    Updated Nov 13, 2021
    Cite
    A0155991R_Li Liwei (2021). C4_200M [Dataset]. https://www.kaggle.com/datasets/a0155991rliwei/c4-200m
    Explore at:
    zip (20746659480 bytes)
    Dataset updated
    Nov 13, 2021
    Authors
    A0155991R_Li Liwei
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Grammar Error Correction dataset synthesized based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction

    Content

    This dataset contains roughly 185 million sentence pairs generated from the C4/en/3.0.1 dataset.

    The data is stored in the format: { "input": "This is an grammatically wrong sentences.", "output": "This is a grammatically correct sentence." }
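    A record in the format shown above parses with the standard library; the assumption that pairs are stored one JSON object per line (JSONL) is mine, based on the format shown, not stated by the dataset page.

    ```python
    import json

    # One record in the documented {"input": ..., "output": ...} format.
    record = '{"input": "This is an grammatically wrong sentences.", "output": "This is a grammatically correct sentence."}'

    pair = json.loads(record)
    print(pair["input"], "->", pair["output"])
    ```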

    Acknowledgements

    The C4 dataset was downloaded from allenai: https://github.com/allenai/allennlp/discussions/5056. The modified scripts used to generate the sentence pairs were adapted from https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction.

    Inspiration

    We hope that this dataset will help others by saving the trouble and time of generating this dataset.

  4. Using Grammar Patterns to Interpret Test Method Name Evolution

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1more
    Updated Mar 22, 2021
    Cite
    Anthony Peruma; Emily Hu; Jiajun Chen; Eman Abdullah Alomar; Mohamed Wiem Mkaouer; Christian D. Newman (2021). Using Grammar Patterns to Interpret Test Method Name Evolution [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4608142
    Explore at:
    Dataset updated
    Mar 22, 2021
    Dataset provided by
    Stony Brook University
    Tufts University
    Rochester Institute of Technology
    Authors
    Anthony Peruma; Emily Hu; Jiajun Chen; Eman Abdullah Alomar; Mohamed Wiem Mkaouer; Christian D. Newman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset that accompanies the study "Using Grammar Patterns to Interpret Test Method Name Evolution." This study has been accepted for publication at the 29th IEEE/ACM International Conference on Program Comprehension.

    Following is the abstract of the study: It is good practice to name test methods such that they are comprehensible to developers; they must be written in such a way that their purpose and functionality are clear to those who will maintain them. Unfortunately, there is little automated support for writing or maintaining the names of test methods. This can lead to inconsistent and low-quality test names and increase the maintenance cost of supporting these methods. Due to this risk, it is essential to help developers in maintaining their test method names over time. In this paper, we use grammar patterns, and how they relate to test method behavior, to understand test naming practices. This data will be used to support an automated tool for maintaining test names.

    Following are the contents of the dataset:

    ICPC2021-Public.sqlite -- A SQLite database containing the raw dataset used in this project

    ICPC2021-Public.xlsx -- Excel spreadsheet containing the complete listings for the tables in the paper

    Contents of ICPC2021-Public.sqlite

    Table Name ---- Table Description
    "gitCommit" ---- The commit log for all projects
    "refactoring" ---- Mined refactoring operations from RefactoringMiner
    "refactoring_renamedMethod" ---- Mined Rename Method refactoring operations
    "detected_testfiles" ---- Detected unit test files
    "detected_testfiles_refactored" ---- Refactored unit test files
    "detected_testfiles_refactored_renamemethod" ---- Renamed Methods in refactored unit test files
    "annotation_grammar" ---- The data that was provided to the annotators
    "annotation_grammar_results" ---- The finalized results of the annotation
    "annotation_grammar_results_prefix2" ---- The first two part-of-speech tags of the finalized annotation
    "annotation_grammar_results_prefix3" ---- The first three part-of-speech tags of the finalized annotation
    "annotation_grammar_results_prefix4" ---- The first four part-of-speech tags of the finalized annotation
    "annotation_grammar_results_prefix5" ---- The first five part-of-speech tags of the finalized annotation
    "annotation_grammar_results_semantic" ---- The semantic relationship between the old and new names of the annotation results
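    The SQLite database can be explored with Python's standard library; a minimal sketch, using only the table names from the listing above and assuming nothing about column layouts:

    ```python
    import sqlite3

    # Count the rows of one table in the released database. Table names are
    # taken from the listing above; no column layout is assumed (COUNT(*) only).
    def table_row_count(db_path: str, table: str) -> int:
        con = sqlite3.connect(db_path)
        try:
            # Table names cannot be bound as SQL parameters, so quote the identifier.
            (n,) = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
            return n
        finally:
            con.close()

    # e.g. table_row_count("ICPC2021-Public.sqlite", "refactoring_renamedMethod")
    ```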

  5. Rule-based Synthetic Data for Japanese GEC

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    tsv
    Updated Oct 28, 2023
    Cite
    (2023). Rule-based Synthetic Data for Japanese GEC [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7679
    Explore at:
    tsv
    Dataset updated
    Oct 28, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Title: Rule-based Synthetic Data for Japanese GEC.

    Dataset Contents: This dataset contains two parallel corpora intended for training and evaluating models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:

    Synthetic Corpus - synthesized_data.tsv. This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab: the first sentence is the erroneous one, while the second is the corresponding correction. These paired sentences are derived from data scraped from the keyword-lookup site
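    Given the tab-delimited layout described above, each pair can be recovered with a simple split. The Japanese sentence below is an invented toy pair, not taken from the corpus.

    ```python
    # Each line of synthesized_data.tsv holds an erroneous sentence and its
    # correction, separated by a tab (per the description above).
    line = "これわペンです\tこれはペンです"  # invented toy pair, not from the corpus

    erroneous, corrected = line.rstrip("\n").split("\t")
    print(erroneous)  # これわペンです
    print(corrected)  # これはペンです
    ```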

  6. Data from an Investigation of Music Analysis by the Application of Grammar-based Compressor

    • research-data.cardiff.ac.uk
    zip
    Updated Sep 18, 2024
    Cite
    David Humphreys; Kirill Sidorov; Andrew Marshall; Andrew Jones (2024). Data from an Investigation of Music Analysis by the Application of Grammar-based Compressor [Dataset]. http://doi.org/10.17035/d.2020.0098047203
    Explore at:
    zip
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Cardiff University
    Authors
    David Humphreys; Kirill Sidorov; Andrew Marshall; Andrew Jones
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is composed of output and result data from various experiments performed on a substantial collection of digital musical scores. It does not contain these scores, but at the time of writing they are publicly available from the following resources:

    1. The Acadia Early Music Archive (http://www.acadiau.ca/~gcallon/www/archive/)
    2. The Choral Public Domain Library (http://www2.cpdl.org/)
    3. Musopen (https://musopen.org/)
    4. Music21 (https://web.mit.edu/music21/)
    5. KernScores (http://kern.ccarh.org/)
    6. The 1850 edition of O'Neill's Music Of Ireland (http://trillian.mit.edu/~jc/music/book/oneills/1850)
    7. The Meertens Tune Collections (http://www.liederenbank.nl/mtc/)
    8. The Johannes Kepler University Patterns Development Database (http://tomcollinsresearch.net/research/data/mirex/JKUPDD-Aug2013.zip)

    Digital scores were transformed into specific representations, and compressive models were built using various compressors (ZZ, IRR, LZW, BWT, GZIP and COSIATEC). Performance on various tasks was evaluated from the model attributes (primarily model size, measured in symbols). Where possible, model metrics and computed performance figures are included in this dataset.

    3.1.1.1. An exhaustive test of sensitivity to an increasing number of errors (for Bach's Fugue No. 10 from Das Wohltemperierte Clavier Book I)
    Rows: 2-46. Compressor used: ZZ. Representations used: chromatic and diatonic pitch, chromatic and diatonic intervals, chromatic and diatonic contour, chromatic and diatonic pitch modulo 12, note duration.
    Attributes:
    • Common to all experiments: representation (the type of data taken from each score as input to the compressor).
    • Compressor output: the size of the unaltered, compressed model; the increase in size from initial_model_size as an error is introduced to each position in sequence.
    • Experiment result: the average increase in model size over all positions; the standard deviation within model_size_change.

    3.1.1. Sensitivity to point errors
    Rows: 47-260460. Compressors used: ZZ, IRR, LZW, BWT, GZIP, COSIATEC. Representation used: diatonic intervals.
    Attributes:
    • Common to all experiments: the changes made to a given position (the experiment is repeated in its entirety for each value); a list of indices into the input data at which an alteration (change_value), representing an error, is made.
    • Compressor output: the size of the unaltered, compressed model; the change in size from initial_model_size when an error is present at exactly one location (from an index in positions_tested). Only one error is present within the piece at each iteration; one set of size changes exists for each value in change_made.
    • Experiment result: the average increase in model size over all positions, for each change made; the standard deviation within each set of values in model_size_change; the time taken to perform all compression and measurement operations, in seconds (times are system-dependent and circumstantial).

    3.1.2. Sensitivity to increasing number of errors
    Rows: 260461-501987. Compressors used: ZZ, IRR, LZW, BWT, GZIP, COSIATEC. Representation used: diatonic intervals.
    Attributes:
    • Common to all experiments: the changes made to a given position (the experiment is repeated once for each value, but each change is chosen randomly from this list of values); a list of indices into the input data at which an alteration (change_value), representing an error, is made.
    • Compressor output: the size of the unaltered, compressed model; the change in size from initial_model_size when an error is added to a new location (from the indices in positions_tested). The number of errors within the piece increases at each iteration; one set of size changes exists for each value in change_made, but the actual change at each position is a random selection from change_made.
    • Experiment result: the average increase in model size over all positions, for each additional change made; the standard deviation within each set of values in model_size_change; the time taken to perform all compression and measurement operations, in seconds (times are system-dependent and circumstantial).

    3.1.3. Automatic selection of candidate Transcription Error Positions
    Rows: 501988-3079962. Compressors used: ZZ, IRR, LZW, BWT, GZIP, COSIATEC. Representation used: diatonic intervals.
    Attributes:
    • Common to all experiments: the changes made to a given position (the experiment is repeated in its entirety for each value); a list of indices into the input data at which an alteration (change_value), representing an error, is made; a list of indices into the input data at which an alteration (change_value), representing a possible correction, is made upon a model containing exactly one alteration (error) at one position from the error_positions list.
    • Compressor output: the size of a model containing exactly one alteration (representing an error), with one model per position in error_positions; the change in size from initial_model_size when a potential correction is added to a new location (from the indices in positions_tested); for each error, one attempt to correct at all positions_tested is made, and one output set exists for each value in change_made; the number of true and false positives, chosen by selecting all indices where the "corrected" model is smaller than the model containing only the error; the number of true positives (when 1, an attempt to correct the value at the index of the error resulted in a smaller model size); of all sorted unique model sizes occurring at the index of the error, the group to which the correct change belongs (when -1, no attempt to correct the error resulted in a smaller model).
    • Experiment result: the average F-measure, Precision and Recall from all attempts to select the correct position of the error; the average rank from all instances where the correct position of the error was found; the time taken to perform all compression and measurement operations, in seconds (times are system-dependent and circumstantial).

    3.2.1. Classification of the Meertens Tune Collections by Family
    Rows: 3079963-3136586. Compressors used: ZZ, IRR, LZW. Representations used: chromatic and diatonic pitch, chromatic and diatonic intervals, chromatic and diatonic contour, chromatic and diatonic pitch modulo 12, note duration.
    Attributes (results for the given representation):
    • the size of the unaltered, compressed model for the named piece;
    • the names of all pieces to which distance is calculated;
    • the size of the model resulting from the concatenation [piece_a piece_b], and from the concatenation [piece_b piece_a];
    • the normalised compression distance between piece_a and piece_b, calculated by the specified formula (one attribute per formula);
    • the simple distance, calculated as the sum of the sizes of the models for piece_a and piece_b.

    3.3.1. MIREX 2016 Discovery of Repeated Themes & Sections task
    Rows: 3136587-3136671. Compressor used: ZZ. Representation used: diatonic intervals.
    Attributes, as defined in the MIREX 2016 Discovery of Repeated Themes and Sections task (experiment results):
    • number of patterns sought, from ground truth, and number of patterns identified by the algorithm;
    • establishment precision, recall and F-measure;
    • occurrence precision, recall and F-measure, with detection thresholds of 0.75 and 0.5;
    • three-layer precision, recall and F-measure;
    • standard precision, recall and F-measure.

    3.3.2. Structural Analysis of Bach's Well-Tempered Clavier
    Rows: 3136672-3136719. Compressor used: ZZ. Representation used: diatonic intervals.
    Attributes (result of matching model segmentation to that specified by S. Bruhn):
    • semantic name of the defined segment;
    • number of the closest-matching rule;
    • Jaccard index (intersection over union) for the match;
    • voice numbers containing instances of the chosen rule;
    • positions within the first voice where the first instance of the chosen rule begins and ends.

    Research results based upon these data are published at https://doi.org/10.1080/09298215.2021.1978505
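    The distance computations used in the classification-by-family experiment can be illustrated with a generic normalised compression distance (NCD). The dataset's compressors (ZZ, IRR, COSIATEC, etc.) are music-specific; zlib stands in here purely to show the shape of the formula NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)).

    ```python
    import zlib

    def csize(data: bytes) -> int:
        """Compressed size of the data, standing in for a model size in symbols."""
        return len(zlib.compress(data))

    def ncd(a: bytes, b: bytes) -> float:
        """Normalised compression distance between two sequences."""
        ca, cb, cab = csize(a), csize(b), csize(a + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    melody = b"C D E F G A B C " * 20  # toy repetitive "piece"
    other = b"X Q Z W V " * 20
    print(ncd(melody, melody) < ncd(melody, other))  # similar pieces score lower
    ```

    The "simple distance" attribute in the table above corresponds to `csize(a) + csize(b)` rather than the normalised form.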

  7. Event conceptualisation and aspect in L2 English and Persian: An application of the Heidelberg-Paris model

    • researchdata.se
    • demo.researchdata.se
    Updated Nov 7, 2019
    Cite
    Somaje Abdollahian Barough (2019). Event conceptualisation and aspect in L2 English and Persian: An application of the Heidelberg-Paris model [Dataset]. http://doi.org/10.5878/wz3s-wt38
    Explore at:
    (10147845)
    Dataset updated
    Nov 7, 2019
    Dataset provided by
    Stockholm University
    Authors
    Somaje Abdollahian Barough
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Time period covered
    Aug 1, 2010 - Jul 31, 2013
    Area covered
    Sweden, Iran, Islamic Republic of, United Kingdom, United States
    Description

    The data have been used in an investigation for a PhD thesis in English Linguistics on similarities and differences in the use of the progressive aspect in two different language systems, English and Persian, both of which have the grammaticalised progressive. It is an application of the Heidelberg-Paris model of investigation into the impact of the progressive aspect on event conceptualisation. It builds on an analysis of single event descriptions at sentence level and re-narrations of a film clip at discourse level, as presented in von Stutterheim and Lambert (2005) DOI: 10.1515/9783110909593.203; Carroll and Lambert (2006: 54–73) http://libris.kb.se/bib/10266700; and von Stutterheim, Andermann, Carroll, Flecken & Schmiedtová (2012) DOI: 10.1515/ling-2012-0026. However, there are system-based typological differences between these two language systems due to the absence/presence of the imperfective-perfective categories, respectively. Thus, in addition to the description of the status of the progressive aspect in English and Persian and its impact on event conceptualisation, an important part of the investigation is the analysis of the L2 English speakers’ language production as the progressives in the first languages, L1s, exhibit differences in their principles of use due to the typological differences. The question of importance in the L2 context concerns the way they conceptualise ongoing events when the language systems are different, i.e. whether their language production is conceptually driven by their first language Persian.

    The data consist of two data sets, as the study includes two linguistic experiments, Experiment 1 and Experiment 2. The data for both experiments were collected by email. Separate instruction forms and language background questions were prepared for the six different informant groups (three speaker groups and two experimental tasks), and a Nelson English test (https://www.worldcat.org/isbn/9780175551972) on English proficiency was selected and modified for the L2 English speaker group in Experiment 2. Nelson English tests are published in Fowler, W.S. & Coe, N. (1976). Nelson English tests. Middlesex: Nelson and Sons. The test battery provides tests for all levels of proficiency. The graded tests are compiled in ten sets from elementary to very advanced level. Each set includes four graded tests, i.e. A, B, C, and D, resulting in 40 separate tests, each with 50 multiple-choice questions. The test entitled 250C was selected for this project; it occupies slot 19 of the 40 slots in the total battery. The multiple-choice questions were checked with a native English-speaking professional, and 5 inadequate questions concerning pronunciation were omitted. In addition, a few modifications of the grammar questions were made, aiming to include questions that involve a contrast for the Persian L2 English learner with respect to the grammars of the two languages. The omissions and modifications provide an appropriate grammar test for very advanced Iranian learners of L2 English who have learnt the language in a classroom setting. The data sets collected from the informants are characterised as follows: the data from Experiment 1 function as the basis for the description of the progressive aspect in English, Persian and L2 English, while the data from Experiment 2 are the basis for the analysis of its use in a long stretch of discourse/language production for the three speaker groups.
    The parameters selected for the investigation comprised, first, phasal decomposition, which involves the use of the progressive in unrelated single motion events and narratives, and uses of begin/start in narratives. Second, granularity in narratives, which relates to the overall amount of language production in narratives. Third, event boundedness (encoded in the use of 2-state verbs and 1-state verbs with an endpoint adjunct), partly in single motion events and partly in temporal shift in narratives. Temporal shift is defined as follows: events in the narrative which are bounded shift the time line via a right boundary; events with a left boundary also shift the time line, even if they are unbounded. Fourth, left boundary, comprising the use of begin/start and try in narratives. Finally, temporal structuring, which involves the use of bounded versus unbounded events preceding the temporal adverbial then in narratives (the tests are described in the documentation files aspectL2English_Persian_Exp2Chi-square-tests-in-SPSS.docx and aspectL2English_Persian_Exp2Chi-square-tests-in-SPSS.rtf). In both experiments the participants watched a video, one relevant for single event descriptions, the other relevant for re-narration of a series of events. Thus, two different videos with stimuli for the different kinds of experimental tasks were used. For Experiment 1, a video of 63 short film clips presenting unrelated single events was provided by Professor Christiane von Stutterheim, Heidelberg University Language & Cognition (HULC) Lab, Heidelberg University, Germany (https://www.hulclab.eu/). For Experiment 2, an animation called Quest, produced by Thomas Stellmach in 1996, was used. It is available online at http://www.youtube.com/watch?v=uTyev6OaThg. Both stimuli have been used in previous investigations on different languages by the research groups associated with the HULC Lab.
    The informants were asked to describe the events seen in the stimuli videos, to record their language production and to send it to the researcher. For Experiment 2, most of the L1 English data were provided by Prof. von Stutterheim, Heidelberg University, who made available 34 re-narrations of the film Quest in English; 24 of them were selected for the present investigation. The project used six different informant groups, i.e. fully separate groups for the two experiments. The data from single event descriptions in Experiment 1 were analysed quantitatively in Excel. The re-narrations of Experiment 2 were coded in NVivo 10 (2014), providing frequencies of various parametrical features (Ltd, Nv. (2014). NVivo QSR International Pty Ltd, Version 10. Doncaster, Australia: QSR International). The numbers from NVivo 10 were analysed statistically in Excel and SPSS (2017). The tools are appropriate for this research: Excel is well suited to the smaller data load in Experiment 1, while NVivo 10 is practical for the large amount of data and parameters in Experiment 2. Notably, NVivo 10 enabled the analysis of the three data sets to take place in the same manner once the categories of analysis and parameters had been defined under different nodes. As the results were to be extracted in the same fashion from each data set, the L1 English data received from Heidelberg for Experiment 2 were re-analysed according to the criteria employed in this project. Yet, the analysis in the project conforms to the criteria used earlier in the model.

  8. grammar

    • huggingface.co
    Updated Dec 22, 2023
    Cite
    CSY-ModelCloud (2023). grammar [Dataset]. https://huggingface.co/datasets/csy-modelcloud/grammar
    Explore at:
    Dataset updated
    Dec 22, 2023
    Authors
    CSY-ModelCloud
    Description

    from datasets import load_dataset

    grammar_plus_v1_gpt4 is the gdd v1 data; the train split has 95 examples and the validation split has 5.

    dataset_1 = load_dataset("LnL-AI/grammar", name="grammar_plus_v1_gpt4")

    grammar_plus_v2_calude is the gdd v2 data; the train split has 47 examples and the validation split has 3.

    dataset_v2 = load_dataset("LnL-AI/grammar", name="grammar_plus_v2_calude")

    grammar_plus_v3_gpt4 is the data generated on August 21, 2023; the train split has 141 examples and the validation split has 18.

    dataset_v3 =… See the full description on the dataset page: https://huggingface.co/datasets/csy-modelcloud/grammar.

  9. The temporality of the imperfect past tense of the subjunctive in relation to its point of reference: theoretical perspectives

    • scielo.figshare.com
    jpeg
    Updated Jun 15, 2023
    Cite
    Angela Cristina Di Palma Back; Márluce Coan (2023). The temporality of the imperfect past tense of the subjunctive in relation to its point of reference: theoretical perspectives [Dataset]. http://doi.org/10.6084/m9.figshare.20024909.v1
    Explore at:
    jpeg
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    SciELO journals
    Authors
    Angela Cristina Di Palma Back; Márluce Coan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract In this paper, the perspectives of Bello (1841), Reichenbach (1947), Comrie (1990) and Rojo and Veiga (1999) on the reference point are applied to 350 tokens of the imperfect subjunctive from 60 sociolinguistic interviews present in the Sociolinguistic Atlas of the AMREC region. The proposal is to: demonstrate to what extent these perspectives approach or distance themselves from one another; empirically prove the application of the proposals through qualitative-quantitative analysis; and attest that time is a discursive category. To that end, similarities are observed in terms of explanatory power, logical resources and the correlation between time and tense. The main difference resides in the vision of temporality: logical or discursive. Regarding the data, 116 samples are ambiguous under the proposals of Bello (1841), Reichenbach (1947) and Comrie (1990). Under Rojo and Veiga's (1999) proposal, because it is recursive and goes beyond the sentence, the ambiguity dissolves and these data are analysed discursively as previous, posterior or co-temporal to the reference point.

  10. Swedish Speech Acts

    • kaggle.com
    zip
    Updated Jun 3, 2024
    Cite
    Daniel Tufvesson (2024). Swedish Speech Acts [Dataset]. https://www.kaggle.com/datasets/danieltufvesson/swedics-speech-acts
    Explore at:
    zip (507722308 bytes)
    Dataset updated
    Jun 3, 2024
    Authors
    Daniel Tufvesson
    Description

    What are Speech Acts?

    What is done through speaking? In a sense, a spoken utterance is just a string of vocal sounds. But in another sense, it is also a social action that has real effects on the world. For example, "Can you pass the salt?" is an act of requesting the salt, which can then result in obtaining the salt. These spoken actions are referred to as speech acts. We humans unconsciously understand and categorize speech acts all the time. The meaning of a speech act depends both on the syntax and semantics of the sentence and the conversational context in which it occurs.

    What is in these Data Sets?

    The data sets consist of isolated Swedish sentences originating from online discussion forums (familjeliv.se and flashback.se). I have hand-labeled these with their respective speech acts.

    What Speech Acts are Annotated?

    The sentences are annotated with the following speech acts, which are taken from The Swedish Academy Grammar (Teleman et al., 1999):

    • Assertive: the speaker holds that the content of the sentence is true or at least true to a varying degree. For example: “They launched a car into space.”

    • Question: the speaker requests information regarding whether or not something is true, or under what conditions it is true. For example: “Are you busy?” or “How much does the car cost?”.

    • Directive: the speaker attempts to get the listener to carry out the action described by the sentence. For example: “Open the door!” or “Will you hold this for me?”

    • Expressive: the speaker expresses some feeling or emotional attitude about the content of the sentence. For example: “What an adorable dog!” or “The Avengers are awesome!”

    Why do these Exist?

    I created these data sets for training and evaluating two machine learning classifiers. These are available on GitHub.

    Data Files

    These are all CoNLL-U corpora. They all consist of sentences manually annotated with speech acts. The sentences were also automatically annotated with sentiment (positive, negative, or neutral), each with a probability score.

    • all-data.conllu.bz2 - All annotated sentences.

    • dev-set.conllu.bz2 - The dev (or validation) set. Split from all-data.conllu.bz2.

    • dev-test-set.conllu.bz2 - A test split of the dev set.

    • dev-test-set-upsampled.conllu.bz2 - An upsampled version of dev-test-set.

    • dev-train-set.conllu.bz2 - A train split of the dev set.

    • dev-train-set-upsampled.conllu.bz2 - An upsampled version of dev-train-set.

    • test-set.conllu.bz2 - The test set used for evaluation. Split from all-data.conllu.bz2.

    • test-set-upsampled.conllu.bz2 - An upsampled version of the test-set.

    • train-set.conllu.bz2 - A train set. This was automatically annotated by a rule-based classifier.

    CoNLL-U Format

    The corpora are formatted as CoNLL-U. In addition to the standard CoNLL-U annotations (Universal Dependencies, n.d.-a), I have added the following attributes as sentence comments to each sentence:

    • sent_id: a unique identifying integer. This is unique across all the data sets.

    • text: the full, unsegmented sentence.

    • date: the date and time at which the sentence was posted on the internet forum.

    • url: the URL of where the sentence was posted.

    • genre: the text genre of the sentence. This is technically superfluous since all the sentences are of the same genre, namely internet_forum.

    • x_sent_id: the ID of the sentence in the original corpus.

    • speech_act: the annotated speech act of the sentence, whether automatically or manually annotated. The possible values are assertion, question, directive, and expressive.

    • sentiment_label: the label denoting the sentiment of the sentence, automatically assigned by the sentiment tagger. The labels are positive, neutral, or negative.

    • sentiment_score: the estimated probability of the sentiment label, also produced by the sentiment tagger.

    # sent_id = 2200888
    # text = Känns hoppfull med så många exempel.
    # date = 2009-10-26 16:19:10
    # url = http://www.familjeliv.se/forum/thread/48269320-bara-solsken-och-hopp/1#anchor-m3
    # genre = internet_forum
    # x_sent_id = 053044fa6
    # speech_act = expressive
    # sentiment_label = positive
    # sentiment_score = 0.9705862402915955
    1  Känns    känna|kännas  VERB  VB  Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass  0  root  _  _
    2  hoppfull  hoppfull    ADJ   JJ  Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing  1  xcomp  _  _
    3  med     med       ADP   PP  _  6  case  _  _
    4  så     så       ADVERB AB  _  5  advmod  _  _
    5  många    _        ADJ   JJ  Case=Nom|Definite=Def,Ind|Degree=Pos|Gender=Com,Neut|Num...
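The sentence-level comments shown above are plain `# key = value` lines, so they can be consumed without a dedicated CoNLL-U library. A minimal sketch; the `.bz2` filename in the commented usage is one of the files listed above, and the helper name is my own:

```python
import bz2

def read_sentence_comments(lines):
    """Collect the '# key = value' sentence comments from CoNLL-U input.

    Yields one dict per sentence; sentences are separated by blank lines.
    """
    meta = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#") and "=" in line:
            # Split on the first '=' so values may themselves contain '='.
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
        elif not line.strip():  # blank line ends the current sentence
            if meta:
                yield meta
            meta = {}
    if meta:
        yield meta

# Hypothetical usage with one of the compressed files listed above:
# with bz2.open("dev-set.conllu.bz2", "rt", encoding="utf-8") as f:
#     acts = [m["speech_act"] for m in read_sentence_comments(f)]
```

Token lines are simply skipped here; a real pipeline would parse the ten tab-separated CoNLL-U columns as well.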
    
  11. Replication Data for: A grammar of Pnar

    • researchdata.ntu.edu.sg
    bin +4
    Updated Apr 30, 2018
    Cite
    Hiram Ring; Hiram Ring (2018). Replication Data for: A grammar of Pnar [Dataset]. http://doi.org/10.21979/N9/KVFGBZ
    Explore at:
    bin(9264897), bin(41702878), bin(34214407), bin(19514436), bin(19908255), bin(25168775), bin(62878709), bin(31763907), bin(35224946), tsv(5723), bin(12762029), bin(43559663), bin(25299500), zip(170606), bin(10262043), bin(37264028), bin(1699896), bin(77861570), bin(16117238), text/plain; charset=utf-8(184185), text/plain; charset=us-ascii(3440), bin(27746834), bin(11602432), bin(33464854), bin(24491746), bin(161643597), bin(281781090), bin(48424119), bin(13946831), text/plain; charset=utf-8(4002573), bin(24195718), bin(20741544), bin(13642955), bin(12518803), bin(6731660), bin(31676066), bin(15714163), bin(83799384), bin(45955540), bin(3997308), text/plain; charset=utf-8(4002773), bin(12793807), text/plain; charset=us-ascii(1692), bin(76616630)Available download formats
    Dataset updated
    Apr 30, 2018
    Dataset provided by
    DR-NTU (Data)
    Authors
    Hiram Ring; Hiram Ring
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Jun 2011 - Jul 2013
    Area covered
    Meghalaya, India
    Description

    This dataset consists of linguistic data (recordings and transcriptions/translations) on the Pnar language gathered by Dr. Hiram Ring. Pnar (ISO 639-3: pbv) is a language spoken in Meghalaya state of northeast India by around 400,000 people. The recordings and transcriptions in this dataset were carried out in and around Jowai between June 2011 and July 2013. Two files describe the contents of this dataset in more detail: README_Dataset.txt and README_Toolbox.txt; these should be consulted for instructions on how to access the data once downloaded. The Dictionary.txt and Texts.txt files contain lexical data and interlinearized transcriptions/translations, respectively. The majority of the dataset consists of sound files encoded as *.flc or FLAC (Free Lossless Audio Codec) files, which have the advantage of being lossless while taking up significantly less space than uncompressed WAV files.

  12. Spanish Extraction Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Spanish Extraction Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/spanish-extraction-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Welcome to the Spanish Extraction Type Prompt-Response Dataset, a meticulously curated collection of 1500 prompt and response pairs. This dataset is a valuable resource for enhancing the data extraction abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content

    This extraction dataset comprises a diverse set of prompts and responses where the prompt contains input text, an extraction instruction, constraints, and restrictions, while the completion contains the most accurate extracted data for the given prompt. Both the prompts and completions are in Spanish.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompts and responses were manually curated by native Spanish speakers, and references were taken from diverse sources such as books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity

    To ensure diversity, this extraction dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The extraction dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats

    To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, single sentence, and paragraph type of response. These responses encompass text strings, numerical values, and date and time, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Spanish Extraction Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
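Since the JSON layout is described only field by field, a record presumably looks something like the following. All key names and values here are hypothetical illustrations of the listed annotation fields, not the dataset's actual schema:

```python
import json

# Hypothetical record mirroring the annotation fields listed above
# (unique ID, prompt, prompt type/length/complexity, domain, response,
# response type, rich text presence); the real key names may differ.
record = {
    "id": "es-ext-00001",
    "prompt": "Extrae las fechas mencionadas en el siguiente texto: ...",
    "prompt_type": "instruction",
    "prompt_length": "short",
    "prompt_complexity": "easy",
    "domain": "history",
    "response": "1492, 1810",
    "response_type": "short_phrase",
    "rich_text": False,
}

# Round-trip through JSON, as one line of a hypothetical export file:
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```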

    Quality and Accuracy

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Spanish version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom extraction prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Spanish Extraction Prompt-Completion Dataset to enhance the data extraction abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  13. Replication Data for: "The category of throw verbs as productive source of...

    • search.dataone.org
    • dataverse.no
    • +1more
    Updated Sep 25, 2024
    Cite
    Van Hulle, Sven; Enghels, Renata (2024). Replication Data for: "The category of throw verbs as productive source of the Spanish inchoative construction." [Dataset]. http://doi.org/10.18710/TR2PWJ
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    DataverseNO
    Authors
    Van Hulle, Sven; Enghels, Renata
    Time period covered
    Jan 1, 1200 - Jan 1, 2000
    Description

    The dataset contains the quantitative data used to create the tables and graphics in the article "The category of throw verbs as productive source of the Spanish inchoative construction." The data from the 21st century originates from the Spanish Web Corpus (esTenTen18), accessed via Sketch Engine; only the subcorpus for European Spanish data was selected. After downloading, the samples were manually cleaned. In the dataset, at most 500 tokens were retained per auxiliary. For the earlier centuries, the data was extracted from the Corpus Diacrónico del Español (CORDE). See Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific corpus queries that were used. The data were annotated for the infinitive observed after the preposition 'a' and for the semantic class to which this infinitive belongs, following the existing ADESSE classification (see below), besides other criteria that are not taken into account for this study. Concretely, the variables 'Century', 'INF' (infinitive) and 'Class' were used as input for the analysis (see the data-specific sections below for more information about the variables). The empirical analysis is based on the downloaded data from the Spanish Web Corpus (esTenTen18) (Kilgarriff & Renau 2013), which contains 20.3 billion words, of which 3.5 billion belong to the European Spanish domain. This corpus contains internet data, with observations originating from fora, blogs, Wikipedia, etc. The search syntax used to detect the inchoative construction was the following: “[lemma="echar"] [tag="R.*"]{0,3}"a"[tag="V.*"] within ” (consult Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for all corpus queries). After downloading, all the observations were manually cleaned. In total, after the removal of false positives, the dataset contains 5514 tokens, with a maximum of 500 tokens per auxiliary.
    False positives were, for example, tagging errors wrongly coding nouns (such as Superman, Pokémon, or Irán) as infinitives, and also observations in which the auxiliary in combination with the infinitive did not express the inchoative value but its original semantic meaning, such as "saltar a nadar", which means “to jump (in order) to swim” and not “to start to swim”. For auxiliaries with fewer than 500 relevant tokens in the esTenTen corpus, all tokens were retained; for auxiliaries with more than 500 tokens, only the first 500 were selected. For this specific study on the throw verbs, only the following auxiliaries were retained: arrojar, disparar, echar, lanzar and tirar. For the diachronic data, the Corpus Diacrónico del Español (CORDE) was consulted; see Spanish_ThrowVerbs_Inchoatives_queries_20230413.txt for the specific queries used to retrieve the data.
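The CQL query quoted above can be emulated in plain code. A minimal sketch over (form, lemma, tag) triples, assuming a tagset in which adverb tags start with 'R' and verb tags with 'V' (as the query does); the function name and all example tokens are invented:

```python
def matches_inchoative(tokens, aux_lemma="echar"):
    """Approximate the CQL pattern [lemma=aux] [tag="R.*"]{0,3} "a" [tag="V.*"]:
    the auxiliary, up to three adverbs, the preposition 'a', then a verb form.

    `tokens` is a list of (form, lemma, tag) triples.
    """
    for i, (_, lemma, _) in enumerate(tokens):
        if lemma != aux_lemma:
            continue
        j = i + 1
        skipped = 0
        # Allow up to three intervening adverbs (tags starting with 'R').
        while j < len(tokens) and skipped < 3 and tokens[j][2].startswith("R"):
            j += 1
            skipped += 1
        if (j + 1 < len(tokens)
                and tokens[j][0] == "a"
                and tokens[j + 1][2].startswith("V")):
            return True
    return False

# Invented examples: "echó a correr" matches, "echó luego a andar" matches
# via the adverb slot, while "echó la pelota" does not.
```

A real replication would of course run the original query in Sketch Engine or CORDE; this only illustrates the shape of the pattern being searched.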

  14. Replication Data for: The Case for Case in Putin’s Speeches

    • dataverse.no
    • dataverse.azure.uit.no
    • +2more
    csv, txt
    Updated Sep 28, 2023
    Cite
    Anna Obukhova; Anna Obukhova (2023). Replication Data for: The Case for Case in Putin’s Speeches [Dataset]. http://doi.org/10.18710/APDMDZ
    Explore at:
    csv(67974), csv(46452), csv(198), txt(10483), csv(4856), txt(416), csv(25549)Available download formats
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Anna Obukhova; Anna Obukhova
    License

    https://dataverse.no/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.18710/APDMDZ

    Time period covered
    Feb 10, 2022 - Mar 2, 2022
    Area covered
    Russia
    Dataset funded by
    The Research Council of Norway
    Description

    This is the data from a study that applies Keymorph Analysis to the grammatical cases of nouns in the Russian president V. Putin's speeches. The dataset includes: 1) metadata of the texts – twenty-nine transcripts of Putin's direct speech, produced between February 10, 2022 and March 2, 2022, which are the raw data in our study; 2) the sentences with the nouns meaning 'Russia', 'Ukraine', and 'NATO', extracted from the texts and tagged according to the grammatical cases of these nouns as well as the semantic meanings of the cases; 3) the calculated difference index (DIN*) values for the grammatical cases of the nouns meaning 'Russia', 'Ukraine', and 'NATO'; the DIN* was used as the effect-size metric. The R code for creating the bar chart with DIN* values for these grammatical cases is also provided.

  15. Arabic dictionary of inflected words with recognition of agglutinated...

    • catalog.elra.info
    • live.european-language-grid.eu
    • +1more
    Updated Aug 31, 2017
    Cite
    ELRA (European Language Resources Association) (2017). Arabic dictionary of inflected words with recognition of agglutinated clitics and inflection system [Dataset]. https://catalog.elra.info/en-us/repository/browse/arabic-dictionary-of-inflected-words-with-recognition-of-agglutinated-clitics-and-inflection-system/0476139ea9d711e7a093ac9e1701ca023d3af0aa9c84473ba347de06536beb9f/
    Explore at:
    Dataset updated
    Aug 31, 2017
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This dictionary consists of 6 million inflected forms, fully vowelized, generated in compliance with the grammatical rules of Arabic and tagged with grammatical information that includes POS and grammatical features such as number, gender, case, definiteness, tense, mood, and compatibility with clitic agglutination. It is accompanied by a grammatical resource that recognizes hundreds of millions of valid agglutinated words, i.e. words consisting of one of the forms in the dictionary preceded and/or followed by clitics (conjunctions, prepositions, articles, pronouns) in compliance with the grammatical rules of Arabic. In order to be able to update the full-form dictionary, a dictionary of 65,000 lemmas and the data required to inflect them and regenerate the full-form dictionary are also provided. This allows adapting the dictionary to specific applications by deleting and/or adding entries. The resource as it stands covers more than 98% of the forms found in any sort of literature, newspaper articles, etc.; the remaining 2% include proper names, which can be relevant. The data is formatted in conformity with the data formats of Unitex/GramLab, an open-source corpus processing system for language processing. These data formats are publicly documented. The data can either be converted into user-specific formats or be used directly with Unitex/GramLab. This dictionary is also available without recognition of agglutinated clitics and without the inflection system in the ELRA Catalogue under reference ELRA-L0098. Authors: Alexis NEME and Eric LAPORTE.
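The agglutination scheme described here (clitics attached before and/or after a dictionary form) can be illustrated with a toy segmenter. The clitic inventories, the romanization, and the one-word lexicon below are invented for illustration, and the sketch ignores the compatibility constraints the actual resource encodes:

```python
# Toy clitic inventories in an invented romanization; the real resource
# encodes far richer inventories plus agglutination-compatibility rules.
PROCLITICS = {"", "w", "f", "b", "l", "Al", "wAl", "bAl"}
ENCLITICS = {"", "h", "hA", "hm", "k", "y"}

def segment(word, lexicon):
    """Yield (proclitic, stem, enclitic) splits whose stem is in the lexicon."""
    for pro in PROCLITICS:
        if not word.startswith(pro):
            continue
        rest = word[len(pro):]
        for enc in ENCLITICS:
            if enc and not rest.endswith(enc):
                continue
            stem = rest[:len(rest) - len(enc)] if enc else rest
            if stem in lexicon:
                yield (pro, stem, enc)

lexicon = {"kitAb"}  # a one-entry stand-in for the 6-million-form dictionary
splits = set(segment("wAlkitAb", lexicon))  # conjunction w- + article Al- + stem
```

Because every candidate split is checked against the full-form dictionary, only stems that are themselves valid inflected forms are accepted, which is the core idea behind recognizing "hundreds of millions" of agglutinated words from 6 million base forms.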

  16. Duhumbi Grammar - Sound Files, Toolbox and Transcriber File, PDFs of files...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Bodt, Timotheus Adrianus (2024). Duhumbi Grammar - Sound Files, Toolbox and Transcriber File, PDFs of files (Part 2) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3871819
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Bern University
    Authors
    Bodt, Timotheus Adrianus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains the .wav sound files, .trs Transcriber files, .txt Toolbox-compatible Notepad files and .pdf files with the completely transcribed, glossed, parsed and translated examples of the recordings that belong to the following publication:

    Bodt, Timotheus Adrianus. 2020. Grammar of Duhumbi. Leiden: Brill. ISBN 978-90-04-40947-7. https://brill.com/view/title/55767

    The explanation of all the grammatical features that occur in these sound files can be found in the Grammar of Duhumbi.

    The main Toolbox files can be found in the zip file “Settings”, which includes the IPA keys for Duhumbi, the entire setup of the Toolbox database, and the Duhumbi dictionary and Parsing dictionary.

    The .wav, .txt and .trs files, combined in the same folder, make it possible to open Toolbox and work with the recordings, e.g. play them sentence by sentence and see the transcriptions and translations.

    Transcriber version 1.5.1: http://trans.sourceforge.net/en/presentation.php or https://osdn.net/projects/sfnet_trans/downloads/transcriber/1.5.1/Transcriber-1.5.1-Windows.exe/

    Toolbox version 1.6.1: https://software.sil.org/toolbox/download/

    This data set contains the files belonging to the sound files as mentioned in the pdf file “Duhumbi Grammar All Files Upload 2”. The S/N code corresponds to the code used in the Grammar to identify the text from which an example was taken. The name of the file refers to the name of the .wav, .trs, .txt and .pdf files in this upload. The subject is a short description of the topic of the text. The duration is the duration of the recording.

    For the metadata of the sound files in this data set, I refer to Chapter 13 Texts in the Grammar of Duhumbi. This Chapter has a complete listing of the texts, their topics, the speakers and their background etc.

    This material is made freely available to everyone for informative or scientific purposes as long as the source (this DOI) / the collectors are properly credited. Please note that use of the material for commercial purposes of any kind, which includes conversion into commercial audio-visual media (documentaries etc.), storage and dissemination through sites that require registration & payment for access, or sites that rely on advertisement (including YouTube) is not permitted without specific written consent from the speakers and their community, obtained through the collector of the material. By downloading this material, you agree to these restrictions.

    This data set falls under the Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. This license lets you remix, tweak, and build upon this work non-commercially, as long as you credit us and license your new creations under the identical terms. License Deed on https://creativecommons.org/licenses/by-nc-sa/4.0/. Legal Code on https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

    Tim Bodt: monpasang (at) gmail (dot) com

  17. Replication Data for Grammatical Gender in Norwegian Dialects: Variation,...

    • search.dataone.org
    • dataverse.azure.uit.no
    • +1more
    Updated Sep 25, 2024
    Cite
    Lohndal, Terje; van Baal, Yvonne; Eik, Ragnhild; Solbakken, Hedda (2024). Replication Data for Grammatical Gender in Norwegian Dialects: Variation, Acquisition and Change (GenVAC) [Dataset]. http://doi.org/10.18710/TKNNRQ
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    DataverseNO
    Authors
    Lohndal, Terje; van Baal, Yvonne; Eik, Ragnhild; Solbakken, Hedda
    Time period covered
    Oct 1, 2021 - Oct 1, 2022
    Area covered
    Norway
    Description

    [Dataset abstract:] The dataset consists of data related to investigations of grammatical gender across multiple Norwegian dialects. The data has been collected as part of the GenVAC project, funded by the Research Council of Norway 2020-2025, grant number 301094. The goal of GenVAC is to study changes in grammatical gender through large-scale experimental studies. In particular, it scrutinizes to what extent feminine gender is disappearing from Norwegian dialects based on four production experiments and eye-tracking studies. The data will be made available towards the end of the project. In addition, this dataset also includes files that enable scholars to use our methodology. The PowerPoint slides are available as original files and as pdfs for all four experiments. The background questionnaire has also been included, alongside a brief description of how to conduct the four experiments. These files are made available immediately. Since this project focuses on Norwegian, the accompanying article alongside all the material in this dataset are in Norwegian. Since this kind of work is impossible to do without knowing Norwegian, the material has not been translated.

  18. Dataset of Mordvin GOAL-cases

    • data-staging.niaid.nih.gov
    Updated Feb 2, 2022
    Cite
    Riku Erkkilä (2022). Dataset of Mordvin GOAL-cases [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5940174
    Explore at:
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Ludwig-Maximilians-Universität München & University of Helsinki
    Authors
    Riku Erkkilä
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This open access dataset contains examples of the goal-oriented cases illative and lative in the Mordvin languages Erzya and Moksha. The dataset consists of two parts: senses of the goal-cases (files with the prefix sense) and type of the landmark noun (files with the prefix LM). There is currently one dataset that includes 200 analyzed examples of each case in each language (800 examples in total). All the data are analyzed according to sense and landmark type. The data is collected from the MokshEr corpus maintained by the University of Turku.

    Sense means the semantic content that is expressed by the goal-case in a clause. The following values are found in the data: direction, location, part, place, purpose, reason, result, staying, target, and temporal. In addition, contextual variants of the senses are shown in brackets.

    Landmark type means what kind of entity the referent of the landmark is. There are eight categories in the data: 1D object, 2D bounded landmark, 2D unbounded landmark, 3D bounded landmark, 3D unbounded landmark, abstract landmark, institution, and temporal landmark.

    The sense data are annotated with the following information:

    • The noun or relational noun phrase inflected in the goal-case.

    • The predicate as inflected in the data.

    • Translations of both (mainly in citation form, but for the predicate sometimes with some grammatical information).

    • The sense of the goal-case in the utterance.

    • The prototypicality of the example as a member of its sense, on a scale from 1 (non-prototypical) to 5 (prototypical). NB: this assessment is based on the author's language competence and on general semantic principles, and as such should be considered only as indicative.

    • The original sentence from the corpus.

    • A free translation. Some of the translations follow the lexical meanings and syntactic structures of the Mordvin languages, so the English is unidiomatic from time to time.

    • The file name with which the original sentence can be located in the corpus.

    The landmark data are annotated with the same information, except that there is additional information about the landmark type, and the prototypicality score is a value from 1 (non-prototypical) to 4 (prototypical) showing how prototypical the referent of the landmark is within its landmark type. A score of 0 means that the prototypicality scale is not applied to the landmark type in question. Unlike in the sense data, prototypicality in the landmark data is based on the prelinguistic spatial primitives of containment and support, as well as the human ability to recognize boundaries.
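A tally over the sense annotation described above might look like the following. The rows are invented stand-ins for the real annotated examples, and the column layout is my own simplification:

```python
from collections import Counter

# Invented stand-ins for annotated rows: (goal-case form, predicate,
# sense, prototypicality on the 1-5 scale described above).
rows = [
    ("kudos", "sovams", "direction", 5),
    ("ošos", "molems", "direction", 4),
    ("tevs", "kundams", "purpose", 3),
]

# Frequency of each sense across the examples:
sense_counts = Counter(sense for _, _, sense, _ in rows)

# Mean prototypicality per sense, a rough profile of how central the
# attested examples are within each category:
scores = {}
for _, _, sense, score in rows:
    scores.setdefault(sense, []).append(score)
mean_proto = {s: sum(v) / len(v) for s, v in scores.items()}
```

The same two summaries (frequency and mean prototypicality) would apply unchanged to the landmark files, with the 1-4 landmark scale substituted.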

    The author of this dataset is Riku Erkkilä, and it is published under a CC BY-NC-ND licence. The data was originally collected in the framework of the Descriptive Grammar of Mordvin project (University of Helsinki), funded by the Kone Foundation. If used in a publication, please refer to this publication as well as mention the original source:

    MokshEr V.3 (2010). Corpus of Mordvin languages. University of Turku.

    This dataset has been used in the following publications:

  19. Cases of Complements of Finnish Verbs

    • kaggle.com
    zip
    Updated Jun 17, 2020
    Cite
    Mika Hämäläinen (2020). Cases of Complements of Finnish Verbs [Dataset]. https://www.kaggle.com/mikahama/cases-of-complements-of-finnish-verbs
    Explore at:
    zip(217452 bytes)Available download formats
    Dataset updated
    Jun 17, 2020
    Authors
    Mika Hämäläinen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Cases of the complements of Finnish verbs. The data is useful for natural language generation (NLG). The data is described in the following paper, which should also be cited if this data is used:

    Hämäläinen, Mika and Rueter, Jack 2018. Development of an Open Source Natural Language Generation Tool for Finnish. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, 51–58.

    Content

    The file contains a list of Finnish verbs from the Finnish Internet Parsebank; these have been lemmatized and filtered by part-of-speech with Omorfi. For each verb, there is a list of grammatical cases together with the number of times each case has occurred in a syntactic relation to the verb in the verb's right context.

    Essentially, this data can be used to see the most typical direct-object case (partitive, genitive, elative, ...) for each Finnish verb. The data can also indicate whether the verb can take an indirect object as well.
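The lookup this description implies can be sketched directly. The counts below are invented and the dataset's actual file format may differ; this only shows the "most typical case per verb" query:

```python
# Invented per-verb counts of complement cases; real figures and the
# dataset's file format will differ.
case_counts = {
    "nähdä": {"partitive": 950, "genitive": 2100, "elative": 40},
    "rakastaa": {"partitive": 1800, "genitive": 15},
}

def typical_case(verb, counts=case_counts):
    """Return the complement case occurring most often with the verb."""
    return max(counts[verb], key=counts[verb].get)
```

For NLG, a generator would consult this table when realizing an object: e.g. pick the genitive for a hypothetical total-object verb and the partitive for a verb that strongly prefers partial objects.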

    Inspiration

    This data is important for NLG tasks. One could learn to predict if a verb can take a direct object or also an indirect object.

    This data has been used to generate poems in Finnish:

    Hämäläinen, M. (2018). Harnessing NLG to Create Finnish Poetry Automatically. In F. Pachet, A. Jordanous, & C. León (Eds.), Proceedings of the Ninth International Conference on Computational Creativity (pp. 9-15). Salamanca: Association for Computational Creativity (ACC).

  20. Automatic derivation of compact abstract syntax data types from concrete...

    • resodate.org
    Updated Jun 15, 2020
    Cite
    Florian Lorenzen (2020). Automatic derivation of compact abstract syntax data types from concrete syntax descriptions [Dataset]. http://doi.org/10.14279/depositonce-10233
    Explore at:
    Dataset updated
    Jun 15, 2020
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Florian Lorenzen
    Description

    We describe an algorithm (written in Haskell) that automatically derives compact abstract syntax data types from concrete grammar descriptions. The input to the algorithm is essentially the grammar language of the parser generator PaGe, which constructs not only a parser but also the data types necessary to represent an abstract syntax tree. The algorithm in this report minimizes the data type used to represent the parsing result, improving both the handling of abstract syntax trees and their space requirements.
