86 datasets found
  1. Webis-Sentences-17

    • webis.de
    • anthology.aicmu.ac.cn
    Updated 2017
    Cite
    Johannes Kiesel; Benno Stein; Stefan Lucks (2017). Webis-Sentences-17 [Dataset]. http://doi.org/10.5281/zenodo.205950
    Dataset updated
    2017
    Dataset provided by
    Bauhaus-Universität Weimar
    The Web Technology & Information Systems Network
    Authors
    Johannes Kiesel; Benno Stein; Stefan Lucks
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Sentences-17 corpus is a collection of 3,369,618,811 sentences extracted from the ClueWeb12 web crawl. It is designed to allow for statistical analyses of human-written sentences. More details on the sentence extraction can be found in the associated publication.

  2. Webis-Simple-Sentences-17 Corpus

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    txt
    Updated Apr 17, 2024
    Cite
    (2024). Webis-Simple-Sentences-17 Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7442
    Available download formats: txt
    Dataset updated
    Apr 17, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A corpus of 471,085,690 English sentences extracted from the ClueWeb12 web crawl. The sentences were sampled from a larger corpus to achieve a level of sentence complexity similar to that of sentences that humans make up as a memory aid for remembering passwords. Sentence complexity was determined by syllables per word. The corpus is split into a training and a test set as used in the associated publication. The test set is extracted from part 00 of the ClueWeb12, while the training set is extracted from the other parts. More information on the corpus can be found on the corpus web page at our university (listed under “documented by”).
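    The complexity scorer itself is not distributed with the corpus; the following is a minimal sketch of a syllables-per-word measure using a naive vowel-group heuristic (an illustration of the idea, not the authors' exact method):

        import re

        def count_syllables(word: str) -> int:
            """Rough syllable count: number of vowel groups, minimum 1."""
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def syllables_per_word(sentence: str) -> float:
            """Average syllables per word, a simple proxy for sentence complexity."""
            words = re.findall(r"[A-Za-z']+", sentence)
            if not words:
                return 0.0
            return sum(count_syllables(w) for w in words) / len(words)

        print(syllables_per_word("The cat sat on the mat."))  # low complexity
        print(syllables_per_word("Comprehensive documentation requires deliberation."))  # high complexity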

  3. Data from: COSTRA 1.0: A Dataset of Complex Sentence Transformations

    • live.european-language-grid.eu
    • explore.openaire.eu
    • +1 more
    binary format
    Updated Dec 2, 2019
    + more versions
    Cite
    (2019). COSTRA 1.0: A Dataset of Complex Sentence Transformations [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1288
    Available download formats: binary format
    Dataset updated
    Dec 2, 2019
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing.

    The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.

    The hope is that with this dataset we will be able to test semantic properties of sentence embeddings and perhaps even find some topologically interesting “skeleton” in the sentence embedding space.

  4. COSTRA 1.1: A Dataset of Complex Sentence Transformations and Comparisons

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Jun 14, 2020
    + more versions
    Cite
    (2020). COSTRA 1.1: A Dataset of Complex Sentence Transformations and Comparisons [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1304
    Available download formats: binary format
    Dataset updated
    Jun 14, 2020
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense, or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.

  5. Data from: Serbian-English parallel corpus MaCoCu-sr-en 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Apr 25, 2023
    + more versions
    Cite
    (2023). Serbian-English parallel corpus MaCoCu-sr-en 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21562
    Available download formats: binary format
    Dataset updated
    Apr 25, 2023
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Serbian-English parallel corpus MaCoCu-sr-en 1.0 was built by crawling the “.rs” and “.срб” internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as well.

    The crawling was carried out with the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed with the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus: boilerplate, near-duplicated paragraphs, and documents not in one of the targeted languages were removed. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and BicleanerAI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.

    The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. In each format, the texts are separated by script into two files: a Latin and a Cyrillic subcorpus. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) plus additional metadata. The following metadata is included in both sentence-level formats:
    - source and target document URL;
    - paragraph ID, encoding the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3” means “paragraph 35 out of 77, sentence 1 out of 3”);
    - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, a score between 0 and 1);
    - similarity score as provided by the sentence alignment tool Bleualign (a value between 0 and 1);
    - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use them;
    - translation direction and machine translation identification (“translation-direction”): the source segment of each pair was identified with a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines whether the translation was produced by a machine-translation system;
    - DSI class (“dsi”): whether the segment relates to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal) defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
    - English language variant: the variant of English (British or American), identified at document and domain level with a lexicon-based English variety classifier (https://pypi.org/project/abclf/).

    Furthermore, the sentence-level TXT format provides additional metadata:
    - web domain of the text;
    - source and target document title;
    - the date when the original file was retrieved;
    - the original type of the file (e.g., “html”) from which the sentence was extracted;
    - paragraph quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool, https://corpus.tools/wiki/Justext);
    - whether the sentence is a heading in the original document.

    The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American).
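    Given the paragraph-ID convention documented above, unpacking it takes one regular expression; this minimal sketch simply mirrors the documented “p35:77s1/3” pattern (surrounding file handling is left out):

        import re

        # "p35:77s1/3" = paragraph 35 out of 77, sentence 1 out of 3
        PARA_ID = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")

        def parse_paragraph_id(pid: str) -> dict:
            m = PARA_ID.fullmatch(pid)
            if m is None:
                raise ValueError(f"unrecognized paragraph ID: {pid!r}")
            para, para_total, sent, sent_total = map(int, m.groups())
            return {"paragraph": para, "paragraphs_in_doc": para_total,
                    "sentence": sent, "sentences_in_paragraph": sent_total}

        print(parse_paragraph_id("p35:77s1/3"))
        # {'paragraph': 35, 'paragraphs_in_doc': 77, 'sentence': 1, 'sentences_in_paragraph': 3}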

    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

  6. Webis Abstractive Snippet Corpus 2020

    • live.european-language-grid.eu
    json
    Updated Aug 19, 2023
    + more versions
    Cite
    (2023). Webis Abstractive Snippet Corpus 2020 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7817
    Available download formats: json
    Dataset updated
    Aug 19, 2023
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis Abstractive Snippet Corpus 2020 (Webis-Snippets-20) comprises four abstractive snippet datasets built from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million

  7. Data from: Slovenian Definition Extraction training dataset DF_NDF_wiki_slo 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated May 18, 2023
    + more versions
    Cite
    (2023). Slovenian Definition Extraction training dataset DF_NDF_wiki_slo 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21587
    Available download formats: binary format
    Dataset updated
    May 18, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Slovenian definition extraction training dataset DF_NDF_wiki_slo contains 38,613 sentences extracted from the Slovenian Wikipedia. The first sentence of a term's description on Wikipedia is considered a definition, and all other sentences are considered non-definitions.

    The corpus consists of the following files, each containing one definition or non-definition sentence per line (a minimal loading sketch follows the list):

    1. Definitions: df_ndf_wiki_slo_Y.txt with 3,251 definition sentences.
    2. Non-definitions: df_ndf_wiki_slo_N.txt with 14,678 non-definition sentences which do not contain the term at the beginning of the sentence.
    3. Non-definitions: df_ndf_wiki_slo_N1.txt with 20,684 non-definition sentences which may also contain the term at the beginning of the sentence.
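    A minimal sketch for loading the three files into labeled (sentence, label) pairs; the file names are those documented above, while the directory layout is hypothetical:

        from pathlib import Path

        FILES = {
            "df_ndf_wiki_slo_Y.txt": 1,   # definitions -> label 1
            "df_ndf_wiki_slo_N.txt": 0,   # non-definitions -> label 0
            "df_ndf_wiki_slo_N1.txt": 0,  # non-definitions (term may start the sentence)
        }

        def load_dataset(root: str = ".") -> list[tuple[str, int]]:
            """Return (sentence, label) pairs; one sentence per line, as documented."""
            pairs = []
            for name, label in FILES.items():
                for line in Path(root, name).read_text(encoding="utf-8").splitlines():
                    if line.strip():
                        pairs.append((line.strip(), label))
            return pairs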

    The dataset is described in more detail in Fišer et al. 2010. If you use this resource, please cite:

    Fišer, D., Pollak, S., Vintar, Š. (2010). Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). https://aclanthology.org/L10-1089/

    Reference for training Transformer-based definition extraction models with this dataset: Tran, T. H. H., Podpečan, V., Jemec Tomazin, M., Pollak, S. (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of eLex 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary”.

    Related resources: Jemec Tomazin, M. et al. (2023). Slovenian Definition Extraction evaluation datasets RSDO-def 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1841

  8. WMT16 APE Shared Task Data - Reference sentences

    • live.european-language-grid.eu
    • lindat.cz
    binary format
    Updated Jul 11, 2017
    Cite
    (2017). WMT16 APE Shared Task Data - Reference sentences [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1190
    Available download formats: binary format
    Dataset updated
    Jul 11, 2017
    License

    TAUS QT21 licence, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21

    Description

    The training, development and test data consist of German sentences belonging to the IT domain, already tokenized. These sentences are the references of the data released for the 2016 edition of the WMT APE shared task. Unlike the previously released data, these sentences were obtained by manually translating the source sentences without leveraging the raw MT outputs. Training and development respectively contain 12,000 and 1,000 segments, while the test set contains 2,000 items. All data is provided by the EU project QT21 (http://www.qt21.eu/).

  9. Data from: Greek-English parallel corpus MaCoCu-el-en 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Jul 6, 2023
    + more versions
    Cite
    (2023). Greek-English parallel corpus MaCoCu-el-en 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22970
    Available download formats: binary format
    Dataset updated
    Jul 6, 2023
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Greek-English parallel corpus MaCoCu-el-en 1.0 was built by crawling the “.gr”, “.ελ”, “.cy” and “.eu” internet top-level domains in 2023, extending the crawl dynamically to other domains as well.

    The crawling was carried out with the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed with the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus: boilerplate, near-duplicated paragraphs, and documents not in one of the targeted languages were removed. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and BicleanerAI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.

    The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) plus additional metadata. The following metadata is included in both sentence-level formats:
    - source and target document URL;
    - paragraph ID, encoding the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3” means “paragraph 35 out of 77, sentence 1 out of 3”);
    - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, a score between 0 and 1);
    - similarity score as provided by the sentence alignment tool Bleualign (a value between 0 and 1);
    - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use them;
    - translation direction and machine translation identification (“translation-direction”): the source segment of each pair was identified with a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines whether the translation was produced by a machine-translation system;
    - DSI class (“dsi”): whether the segment relates to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal) defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
    - English language variant: the variant of English (British or American), identified at document and domain level with a lexicon-based English variety classifier (https://pypi.org/project/abclf/).

    Furthermore, the sentence-level TXT format provides additional metadata:
    - web domain of the text;
    - source and target document title;
    - the date when the original file was retrieved;
    - the original type of the file (e.g., “html”) from which the sentence was extracted;
    - paragraph quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool, https://corpus.tools/wiki/Justext);
    - whether the sentence is a heading in the original document.

    The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American).

    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

  10. Parallel sense-annotated corpus ELEXIS-WSD 1.0

    • live.european-language-grid.eu
    binary format
    Updated Jul 27, 2022
    + more versions
    Cite
    (2022). Parallel sense-annotated corpus ELEXIS-WSD 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20269
    Available download formats: binary format
    Dataset updated
    Jul 27, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

    The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain a satisfying semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.

    The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.

    List of sense inventories:
    - BG: Dictionary of Bulgarian
    - DA: DanNet – The Danish WordNet
    - EN: Open English WordNet
    - ES: Spanish Wiktionary
    - ET: The EKI Combined Dictionary of Estonian
    - HU: The Explanatory Dictionary of the Hungarian Language
    - IT: PSC + Italian WordNet
    - NL: Open Dutch WordNet
    - PT: Portuguese Academy Dictionary (DACL)
    - SL: Digital Dictionary Database of Slovene

    The corpus is available in a CONLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).
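    A minimal reader for the documented column layout; the assumption that sentences are separated by blank lines follows the usual CONLL convention and is not stated explicitly above:

        import csv
        from typing import Iterator

        # Columns as documented: token ID, form, lemma, UPOS tag,
        # whitespace flag, sense ID, multiword-expression index.
        COLUMNS = ["id", "form", "lemma", "upos", "space_after", "sense_id", "mwe"]

        def read_sentences(path: str) -> Iterator[list[dict]]:
            """Yield one sentence (a list of token dicts) at a time."""
            sentence = []
            with open(path, encoding="utf-8") as f:
                for row in csv.reader(f, delimiter="\t"):
                    if not any(row):  # blank line ends a sentence
                        if sentence:
                            yield sentence
                            sentence = []
                        continue
                    sentence.append(dict(zip(COLUMNS, row)))
            if sentence:
                yield sentence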

    Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

    For more information, please refer to 00README.txt.

  11. Simple Italian sentences ranked by readability

    • live.european-language-grid.eu
    Updated Jan 26, 2019
    Cite
    (2019). Simple Italian sentences ranked by readability [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/960
    Dataset updated
    Jan 26, 2019
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 500,000 sentences extracted from the Paisà corpus (https://www.corpusitaliano.it/) which have been selected for being easy to read according to four parameters: token number, average word length, depth of the parse tree and verb "arity". The sentences are ranked by readability.

  12. Albanian-English parallel corpus MaCoCu-sq-en 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Apr 25, 2023
    + more versions
    Cite
    (2023). Albanian-English parallel corpus MaCoCu-sq-en 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21561
    Available download formats: binary format
    Dataset updated
    Apr 25, 2023
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Albanian-English parallel corpus MaCoCu-sq-en 1.0 was built by crawling the “.al” internet top-level domain in 2022, extending the crawl dynamically to other domains as well.

    The crawling was carried out with the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed with the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus: boilerplate, near-duplicated paragraphs, and documents not in one of the targeted languages were removed. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and BicleanerAI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.

    The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) plus additional metadata. The following metadata is included in both sentence-level formats:
    - source and target document URL;
    - paragraph ID, encoding the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3” means “paragraph 35 out of 77, sentence 1 out of 3”);
    - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, a score between 0 and 1);
    - similarity score as provided by the sentence alignment tool Bleualign (a value between 0 and 1);
    - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use them;
    - translation direction and machine translation identification (“translation-direction”): the source segment of each pair was identified with a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines whether the translation was produced by a machine-translation system;
    - DSI class (“dsi”): whether the segment relates to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal) defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
    - English language variant: the variant of English (British or American), identified at document and domain level with a lexicon-based English variety classifier (https://pypi.org/project/abclf/).

    Furthermore, the sentence-level TXT format provides additional metadata:
    - web domain of the text;
    - source and target document title;
    - the date when the original file was retrieved;
    - the original type of the file (e.g., “html”) from which the sentence was extracted;
    - paragraph quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool, https://corpus.tools/wiki/Justext);
    - whether the sentence is a heading in the original document.

    The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American).

    Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

    This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

  13. Webis EditorialSum Corpus 2020

    • live.european-language-grid.eu
    csv
    Updated Oct 19, 2020
    Cite
    (2020). Webis EditorialSum Corpus 2020 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7658
    Available download formats: csv
    Dataset updated
    Oct 19, 2020
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis EditorialSum Corpus consists of 1,330 manually curated extractive summaries for 266 news editorials spanning three diverse portals: Al-Jazeera, Guardian, and Fox News. Each editorial has 5 summaries, each labeled for overall quality and fine-grained properties such as thesis-relevance, persuasiveness, reasonableness, and self-containedness. The files are organized as follows (X = [1,5] for the five summaries).

    corpus.csv contains all the editorials and their acquired summaries:
    - article_id: article ID in the corpus
    - title: title of the editorial
    - article_text: plain text of the editorial
    - summary_{X}_text: plain text of the corresponding summary
    - thesis_{X}_text: plain text of the thesis from the corresponding summary
    - lead: top 15% of the editorial's segments
    - body: segments between the lead and conclusion sections
    - conclusion: bottom 15% of the editorial's segments
    - article_segments: collection of paragraphs, each further divided into a collection of segments of the form { "number": segment order in the editorial, "text": segment text, "label": ADU type }
    - summary_{X}_segments: collection of summary segments of the form { "number": segment order in the editorial, "text": segment text, "adu_label": ADU type from the editorial, "summary_label": 'thesis' or 'justification' }

    quality-groups.csv contains the IDs of the high- and low-quality summaries for each quality dimension per editorial. For example, article_id 2 has four high-quality summaries (summary_1, summary_2, summary_3, summary_4) and one low-quality summary (summary_5) in terms of overall quality. The summary texts can be obtained from corpus.csv.
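    A minimal loading sketch with pandas; the column names are those documented above, but the assumption that the segment columns are serialized as JSON strings is ours and should be checked against the actual file:

        import json
        import pandas as pd

        corpus = pd.read_csv("corpus.csv")

        # Decode a segment column, assuming JSON-encoded cells (unverified).
        corpus["article_segments"] = corpus["article_segments"].apply(json.loads)

        # Collect the five summaries of the first editorial (X = 1..5).
        row = corpus.iloc[0]
        summaries = [row[f"summary_{x}_text"] for x in range(1, 6)]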

  14. Bangor University's Corpus of Welsh Speech Recognition Sentences

    • live.european-language-grid.eu
    txt
    Updated Aug 17, 2021
    Cite
    Language Technologies Unit (2021). Bangor University's Corpus of Welsh Speech Recognition Sentences [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8046
    Available download formats: txt
    Dataset updated
    Aug 17, 2021
    Dataset authored and provided by
    Language Technologies Unit
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Bangor
    Description

    This is a collection of Welsh language sentences released under a CC0 license and collected by members of the Language Technologies Unit, Bangor University, expressly to serve as prompts for recording audio to train Welsh speech recognition models. Sentences were collected from various sources and include:

    * Original sentences
    * Sentences from novels, essays and other out-of-copyright materials
    * Sentences from the Welsh Wicipedia where authors gave us permission to release them under a CC0 licence
    * Tweets, emails, and other electronic material gifted to the project to be used as prompts
    * The CoVoST 2 corpus

  15. Data from: SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain

    • live.european-language-grid.eu
    Updated May 7, 2018
    Cite
    (2018). SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7451
    Dataset updated
    May 7, 2018
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process. Firstly, annotators were asked to simplify only words and phrases (lexical simplification). Each sentence was simplified by three annotators. Secondly, one lexically simplified version of each original sentence was further simplified at the syntactic level. In its current version there are 3,300 lexically simplified sentences plus 1,100 syntactically simplified sentences. The corpus will be used for evaluation of text simplification approaches in the scope of the EU H2020 SIMPATICO project - which focuses on accessibility of e-services in the PA domain - and beyond. The main advantage of this corpus is that lexical and syntactic simplifications can be analysed and used in isolation. The lexically simplified corpus is also multi-reference (three different simplifications per original sentence). This is an ongoing effort and our final aim is to collect manual simplifications for the entire set of original sentences, with over 10K sentences.

  16. Data from: WUT Relations Between Sentences Corpus

    • live.european-language-grid.eu
    binary format
    Updated Apr 24, 2016
    + more versions
    Cite
    (2016). WUT Relations Between Sentences Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8517
    Available download formats: binary format
    Dataset updated
    Apr 24, 2016
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The WUT Relations Between Sentences Corpus contains 2,827 pairs of related sentences. The relationships are derived from Cross-document Structure Theory (CST), which enables multi-document summarization through identification of cross-document rhetorical relationships within a cluster of related documents. Every relation was marked by at least 3 annotators.

  17. Data from: A Computational Theory for the Emergence of Grammatical Categories in Cortical Dynamics

    • live.european-language-grid.eu
    Updated Jun 9, 2021
    Cite
    (2021). A Computational Theory for the Emergence of Grammatical Categories in Cortical Dynamics [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/939
    Dataset updated
    Jun 9, 2021
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The file Corpora.txt holds the corpus used to train the model and the different instances of the classifier. It is a text file with one sentence per line, taken from the original corpus file test.tsv available at https://github.com/google-research-datasets/wiki-split.git. We eliminated punctuation marks and special characters from the original file and put each sentence on its own line.

    Enju_Output.txt holds the outputs generated by Enju in -so mode (output in stand-off format) using Corpora.txt as input. This file contains a per-sentence parse of the natural-language English input, produced with a wide-coverage probabilistic HPSG grammar.

    The file Supervision.txt keeps the grammatical tags of the corpus, one tag per word and one tag per line. Sentences are separated by an empty line, while the tags of words from the same sentence are located on adjacent lines.

    The file Word_Category.txt carries the coarse-grained word-category information needed by the model and introduced into it by apical dendrites. Each word in the corpus has a word-category tag, which provides additional constraints to those provided by lateral dendrites. The file layout is the same: one tag per word, one tag per line, with sentences separated by an empty line.
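    Since Supervision.txt and Word_Category.txt share this blank-line-separated layout, one small reader covers both; a minimal sketch (alignment with the sentences in Corpora.txt is by order):

        def read_tag_sequences(path: str) -> list[list[str]]:
            """One tag per line; an empty line ends a sentence, as documented."""
            sequences, current = [], []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    tag = line.strip()
                    if tag:
                        current.append(tag)
                    elif current:
                        sequences.append(current)
                        current = []
            if current:
                sequences.append(current)
            return sequences

        grammar_tags = read_tag_sequences("Supervision.txt")
        word_categories = read_tag_sequences("Word_Category.txt")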

    The file SynSemTests.xlsx keeps all the grammar classification results as well as the statistical analysis in the classification tests.

  18. Data from: Hindi Web Texts

    • live.european-language-grid.eu
    • lindat.cz
    • +1 more
    binary format
    Updated Nov 22, 2011
    Cite
    (2011). Hindi Web Texts [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1021
    Available download formats: binary format
    Dataset updated
    Nov 22, 2011
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    A Hindi corpus of texts downloaded mostly from news sites. It contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling: 18M sentences, 308M tokens.

  19. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Mar 22, 2023
    + more versions
    Cite
    (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21522
    Available download formats: binary format
    Dataset updated
    Mar 22, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

    Each example contains the following data fields:
    - word: the target word with multiple meanings
    - sentence1: the first sentence containing the target word
    - sentence2: the second sentence containing the target word
    - idx: the index of the example in the dataset
    - label: whether the sentences use the same meaning of the target word
    - start1: start of the target word in the first sentence
    - start2: start of the target word in the second sentence
    - end1: end of the target word in the first sentence
    - end2: end of the target word in the second sentence
    - version: the version of the annotation
    - manual_annotation: Boolean showing whether the label was manually annotated
    - group: the group of annotators that labelled the example
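    A minimal loading sketch; the field names are those documented above, while the one-object-per-line layout (as in the SuperGLUE WiC distribution) and the reading of start/end as character offsets are our assumptions:

        import json

        def load_slowic(path: str) -> list[dict]:
            with open(path, encoding="utf-8") as f:
                return [json.loads(line) for line in f if line.strip()]

        examples = load_slowic("slowic.jsonl")  # hypothetical file name
        ex = examples[0]
        print(ex["word"], ex["label"], ex["manual_annotation"])
        print(ex["sentence1"][ex["start1"]:ex["end1"]])  # target word, if offsets are character-based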

  20. Spanish text corpus for NLP/linguistics research

    • live.european-language-grid.eu
    txt
    Updated Oct 13, 2021
    Cite
    (2021). Spanish text corpus for NLP/linguistics research [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7694
    Available download formats: txt
    Dataset updated
    Oct 13, 2021
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spanish text corpus extracted from Wikipedia, using the platform described in Cadavid Rengifo, Héctor Fabio, and Jonatan Gómez Perdomo, "Web text corpus extraction system for linguistic tasks," Ingeniería e Investigación 29.3 (2009): 54-60, and the related master's thesis available on ResearchGate.
    - rawdata.dat: raw outcome of the extraction process from Wikipedia.
    - sentences.txt: sentences extracted from the raw data after cleaning/filtering.
