14 datasets found
  1. Webis-TLDR-17

    • webis.de
    1043504
    Updated 2017
    Cite
    Shahbaz Syed; Michael Völske; Martin Potthast; Benno Stein (2017). Webis-TLDR-17 [Dataset]. http://doi.org/10.5281/zenodo.1043504
    Explore at: http://doi.org/10.5281/zenodo.1043504
    Cited by: 40 scholarly articles (view in Google Scholar)
    Dataset updated: 2017
    Dataset provided by: Bauhaus-Universität Weimar; NEC Laboratories Europe; The Web Technology & Information Systems Network; University of Kassel, hessian.AI, and ScaDS.AI
    Authors: Shahbaz Syed; Michael Völske; Martin Potthast; Benno Stein
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    The Webis TLDR Corpus (2017) consists of approximately 4 million content-summary pairs extracted from Reddit posts from the years 2006-2016 for abstractive summarization. This corpus is the first of its kind from the social media domain in English and was created to compensate for the lack of variety in the datasets used for abstractive summarization research with deep learning models.

  2. tldr-17

    • opendatalab.com
    zip
    Updated Dec 16, 2023
    + more versions
    Cite
    Bauhaus University, Weimar (2023). tldr-17 [Dataset]. https://opendatalab.com/OpenDataLab/tldr-17
    Explore at: https://opendatalab.com/OpenDataLab/tldr-17
    Available download formats: zip
    Dataset updated: Dec 16, 2023
    Dataset provided by: Bauhaus University, Weimar
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    This corpus contains preprocessed posts from the Reddit dataset (Webis-TLDR-17). The dataset consists of 3,848,330 posts with an average length of 270 words for content and 28 words for the summary. Features include the strings author, body, normalizedBody, content, summary, subreddit, and subreddit-id. The content field is used as the document and the summary field as the summary.
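
    A minimal loading sketch for this corpus, assuming the Hugging Face mirror under the dataset ID webis/tldr-17 (linked from entry 5 below) and an installed datasets library; the field names follow the list above, but treat the ID and split name as assumptions:

        from datasets import load_dataset

        # Stream the corpus so the ~3.8M posts are not materialized in memory.
        ds = load_dataset("webis/tldr-17", split="train", streaming=True)

        post = next(iter(ds))
        print(post["subreddit"])
        print(len(post["content"].split()), "content words")
        print(len(post["summary"].split()), "summary words")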

  3. Webis-TLDR-17 Corpus Dataset

    • paperswithcode.com
    Updated Aug 31, 2017
    Cite
    Michael Völske; Martin Potthast; Shahbaz Syed; Benno Stein (2017). Webis-TLDR-17 Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/webis-tldr-17-corpus
    Explore at: https://paperswithcode.com/dataset/webis-tldr-17-corpus
    Dataset updated: Aug 31, 2017
    Authors: Michael Völske; Martin Potthast; Shahbaz Syed; Benno Stein
    Description

    This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The format is a JSON file where each line is a JSON object representing a post. The schema of each post is shown below:

    - author: string (nullable = true)
    - body: string (nullable = true)
    - normalizedBody: string (nullable = true)
    - content: string (nullable = true)
    - content_len: long (nullable = true)
    - summary: string (nullable = true)
    - summary_len: long (nullable = true)
    - id: string (nullable = true)
    - subreddit: string (nullable = true)
    - subreddit_id: string (nullable = true)
    - title: string (nullable = true)

    Specifically, the content and summary fields can be used directly as inputs to a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,848,330 posts with an average length of 270 words for content and 28 words for the summary. The dataset is a combination of both the Submissions and Comments merged on the common schema. As a result, most of the comments, which do not belong to any submission, have null as their title.

    Note: this corpus does not contain a separate test set. Thus it is up to users to divide the corpus into appropriate training, validation, and test sets.
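
    Since no official split ships with the corpus, here is a minimal sketch of one reasonable 90/5/5 division; the filename corpus-webis-tldr-17.json is an assumption based on the Zenodo release, so adjust it to your local copy:

        import json
        import random

        # Read one post per line; keep only the (content, summary) pair.
        pairs = []
        with open("corpus-webis-tldr-17.json", encoding="utf-8") as f:
            for line in f:
                post = json.loads(line)
                pairs.append((post["content"], post["summary"]))

        # Fixed seed so the split is reproducible across runs.
        random.seed(42)
        random.shuffle(pairs)
        n = len(pairs)
        train = pairs[: int(0.90 * n)]
        valid = pairs[int(0.90 * n) : int(0.95 * n)]
        test = pairs[int(0.95 * n) :]
        print(len(train), len(valid), len(test))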

  4. Data from: Summarization datasets from the KAS corpus KAS-Sum 1.0

    • live.european-language-grid.eu
    • clarin.si
    binary format
    Updated Feb 3, 2022
    + more versions
    Cite
    (2022). Summarization datasets from the KAS corpus KAS-Sum 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20154
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/20154
    Available download formats: binary format
    Dataset updated: Feb 3, 2022
    License: https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0

    Description

    Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts and is suitable for building cross-lingual summarization models. The total number of words is the sum of the words in the bodies, Slovene abstracts, and English abstracts.

    The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.

    The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
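
    A minimal reading sketch under the description above. Only the 1,000-directory layout and the "abstract" key are documented; the assumption that every other top-level key holds chapter text is mine, so inspect one file and adjust before relying on it:

        import json
        from pathlib import Path

        def iter_kas_pairs(root):
            """Yield (body_text, abstract_text) pairs from KAS-Sum JSON files."""
            for path in sorted(Path(root).rglob("*.json")):
                with path.open(encoding="utf-8") as f:
                    doc = json.load(f)
                # The abstract text is the first element of the "abstract" list.
                abstract = doc["abstract"][0]
                # Assumption: the remaining keys carry the chapter-segmented body.
                body = "\n".join(str(doc[k]) for k in doc if k != "abstract")
                yield body, abstract

        for body, abstract in iter_kas_pairs("kas-sum/slo2slo"):  # hypothetical path
            print(len(body.split()), "body words |", abstract[:60])
            break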

    References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228

  5. openai-summarize-tldr

    • cdk.bar
    • hf-mirror.llyke.com
    Updated Jul 2, 2024
    Cite
    Martim Santos (2024). openai-summarize-tldr [Dataset]. https://cdk.bar/datasets/martimfasantos/openai-summarize-tldr
    Explore at: https://cdk.bar/datasets/martimfasantos/openai-summarize-tldr
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Jul 2, 2024
    Authors: Martim Santos
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Summarize TL;DR Filtered Dataset

    This is the version of the dataset used in https://arxiv.org/abs/2009.01325. If starting a new project, we recommend using https://huggingface.co/datasets/openai/summarize_from_feedback instead. For more information, see https://github.com/openai/summarize-from-feedback; for the original TL;DR dataset, see https://huggingface.co/datasets/webis/tldr-17.

  6. Webis EditorialSum Corpus 2020

    • live.european-language-grid.eu
    csv
    Updated Oct 19, 2020
    Cite
    (2020). Webis EditorialSum Corpus 2020 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7658
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/7658
    Available download formats: csv
    Dataset updated: Oct 19, 2020
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    The Webis EditorialSum Corpus consists of 1,330 manually curated extractive summaries for 266 news editorials spanning three diverse portals: Al Jazeera, The Guardian, and Fox News. Each editorial has 5 summaries, each labeled for overall quality and fine-grained properties such as thesis-relevance, persuasiveness, reasonableness, and self-containedness. The files are organized as follows:

    corpus.csv - Contains all the editorials and their acquired summaries (X = [1,5] for the five summaries):
    - article_id: Article ID in the corpus
    - title: Title of the editorial
    - article_text: Plain text of the editorial
    - summary_{X}_text: Plain text of the corresponding summary
    - thesis_{X}_text: Plain text of the thesis from the corresponding summary
    - lead: top 15% of the editorial's segments
    - body: segments between the lead and conclusion sections
    - conclusion: bottom 15% of the editorial's segments
    - article_segments: Collection of paragraphs, each further divided into a collection of segments: { "number": segment order in the editorial, "text": segment text, "label": ADU type }
    - summary_{X}_segments: Collection of summary segments: { "number": segment order in the editorial, "text": segment text, "adu_label": ADU type from the editorial, "summary_label": 'thesis' or 'justification' }

    quality-groups.csv - Contains the IDs of the high- and low-quality summaries for each quality dimension per editorial. For example, article_id 2 has four high-quality summaries (summary_1, summary_2, summary_3, summary_4) and one low-quality summary (summary_5) in terms of overall quality. The corresponding summary texts can be looked up in corpus.csv.
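
    A minimal sketch for pairing the two files. The corpus.csv column names are documented above; the layout of quality-groups.csv (an article_id column plus a high_quality column listing summary slots) is an assumption, so check the header row first:

        import pandas as pd

        corpus = pd.read_csv("corpus.csv")
        quality = pd.read_csv("quality-groups.csv")

        # Keep, per editorial, only the summaries listed as high quality.
        merged = quality.merge(corpus, on="article_id")
        for _, row in merged.iterrows():
            # Assumed encoding, e.g. "summary_1,summary_2,summary_3,summary_4".
            for slot in str(row["high_quality"]).split(","):
                text = row[f"{slot.strip()}_text"]
                print(row["article_id"], slot.strip(), str(text)[:80])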

  7. Webis Abstractive Snippet Corpus 2020

    • live.european-language-grid.eu
    json
    Updated Aug 19, 2023
    + more versions
    Cite
    (2023). Webis Abstractive Snippet Corpus 2020 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7817
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/7817
    Available download formats: json
    Dataset updated: Aug 19, 2023
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    The Webis Abstractive Snippet Corpus 2020 (Webis-Snippet-20) comprises four abstractive snippet datasets derived from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million

  8. Central Statistical Office Dataset

    • live.european-language-grid.eu
    • data.europa.eu
    xml
    Updated Sep 9, 2022
    + more versions
    Cite
    (2022). Central Statistical Office Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18867
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/18867
    Available download formats: xml
    Dataset updated: Sep 9, 2022
    License: https://elrc-share.eu/terms/publicDomain.html

    Description

    Two Polish-English publications of the Polish Central Statistical Office in the XLIFF format:
    1. "Statistical Yearbook of the Republic of Poland 2015" is the main summary publication of the Central Statistical Office, including a comprehensive set of statistical data describing the condition of the natural environment, the socio-economic and demographic situation of Poland, and its position in Europe and in the world.
    2. "Women in Poland" contains statistical information regarding women's place and participation in the socio-economic life of the country, including international comparisons.
    The texts were aligned at the level of translation segments (mostly sentences and short paragraphs) and manually verified.
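
    A minimal sketch for pulling the aligned segments out of one of the files, assuming the common XLIFF 1.2 layout (trans-unit elements wrapping source and target) and a hypothetical filename:

        import xml.etree.ElementTree as ET

        NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}  # assumes XLIFF 1.2

        tree = ET.parse("statistical_yearbook_2015.xlf")  # hypothetical filename
        for unit in tree.getroot().iterfind(".//x:trans-unit", NS):
            source = unit.findtext("x:source", default="", namespaces=NS)
            target = unit.findtext("x:target", default="", namespaces=NS)
            print(source, "|", target)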

  9. ELITR Minuting Corpus

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Mar 30, 2022
    Cite
    (2022). ELITR Minuting Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18691
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/18691
    Available download formats: binary format
    Dataset updated: Mar 30, 2022
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/ (license information was derived automatically)

    Description

    ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.

    Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.

    Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.

    This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71,097 and 87,322 dialogue turns respectively. For the Czech meetings, we provide 147 minutes in total, 55 of them aligned; for the English meetings, 256 minutes in total, 111 of them aligned.

    Please find a more detailed description of the data in the included README and stats.tsv files.

    If you use this corpus, please cite: Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O. (2022). ELITR Minuting Corpus: A novel dataset for automatic minuting from multi-party meetings in English and Czech. In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC-2022), Marseille, France, June. European Language Resources Association (ELRA). In print.

    @inproceedings{elitr-minuting-corpus:2022,
      author    = {Anna Nedoluzhko and Muskaan Singh and Marie Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
      title     = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
      booktitle = {Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC-2022)},
      year      = 2022,
      month     = {June},
      address   = {Marseille, France},
      publisher = {European Language Resources Association (ELRA)},
      note      = {In print.}
    }

  10. SumeCzech

    • live.european-language-grid.eu
    binary format
    Updated Feb 12, 2018
    + more versions
    Cite
    (2018). SumeCzech [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1229
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/1229
    Available download formats: binary format
    Dataset updated: Feb 12, 2018
    License: MPL 2.0, http://www.mozilla.org/MPL/2.0/

    Description

    This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.

    The dataset is distributed as a set of Python scripts that download the raw HTML pages from CommonCrawl and then process them into the required format.

    The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.

  11. ArguAna Counterargs

    • live.european-language-grid.eu
    • webis.de
    • +1more
    txt
    Updated May 7, 2018
    + more versions
    Cite
    (2018). ArguAna Counterargs [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7836
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/7836
    Available download formats: txt
    Dataset updated: May 7, 2018
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    An English corpus for studying the retrieval of the best counterargument to an argument. It contains 6,753 pairs of argument and best counterargument from the online debate portal idebate.org, along with different experiment files with up to millions of candidate pairs (arguana-counterargs-corpus.zip). In case you publish any results related to the ArguAna Counterargs corpus, please cite our upcoming ACL 2018 paper on counterarguments.

  12. Polish-English parallel corpus from the website "Science in Poland" (Processed)

    • live.european-language-grid.eu
    • catalogue.elra.info
    • +1more
    tmx
    Updated Nov 14, 2018
    Cite
    (2018). Polish-English parallel corpus from the website "Science in Poland" (Processed) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/3171
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/3171
    Available download formats: tmx
    Dataset updated: Nov 14, 2018
    License: https://elrc-share.eu/terms/openUnderPSI.html
    Area covered: Poland
    Description

    Polish-English parallel corpus from the website "Science in Poland" (https://scienceinpoland.pap.pl/en and https://naukawpolsce.pap.pl/)

  13. "Le Monde Diplomatique" Text corpus in Arabic

    • live.european-language-grid.eu
    • catalog.elra.info
    Cite
    "Le Monde Diplomatique" Text corpus in Arabic [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2524
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/2524
    License: ELRA End User License, http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    Electronic archive of "Le Monde Diplomatique" articles in Arabic from 2000 onward. The corpus is available in HTML; each HTML file contains one article.

    Number of articles available per year:
    • 2000: 61 articles (November and December only) (75,305 words)
    • 2001: 346 articles (479,435 words)
    • 2002: 369 articles (461,803 words)
    • 2003: 343 articles (344,376 words)
    • 2004: 291 articles (178,046 words)

    Note: prices are indicated for one year of data only. If you would like to obtain several years, indicate the number of copies (= years) in your cart and specify in the comments which years you would like to receive.
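
    Since each HTML file holds exactly one article, a minimal tag-stripping sketch with the Python standard library is enough to recover the plain text; the filename and the UTF-8 encoding are assumptions (pages of this era sometimes use cp1256 for Arabic):

        from html.parser import HTMLParser

        class TextExtractor(HTMLParser):
            """Collect text nodes and ignore all markup."""
            def __init__(self):
                super().__init__()
                self.chunks = []

            def handle_data(self, data):
                if data.strip():
                    self.chunks.append(data.strip())

        extractor = TextExtractor()
        with open("article_2001_001.html", encoding="utf-8") as f:  # hypothetical name
            extractor.feed(f.read())
        print("\n".join(extractor.chunks))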

  14. BMVI Website (Processed)

    • live.european-language-grid.eu
    • data.europa.eu
    tmx
    Updated Mar 1, 2018
    Cite
    (2018). BMVI Website (Processed) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/3028
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/3028
    Available download formats: tmx
    Dataset updated: Mar 1, 2018
    License: https://elrc-share.eu/terms/openUnderPSI.html

    Description

    A TMX file with 2,718 translation units (TUs), bilingual German/English, containing texts from the website of the Federal Ministry of Transport and Digital Infrastructure (BMVI) on transport issues. The original TMX file was corrected and stripped.
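
    A minimal sketch for reading the translation units with the Python standard library; the filename is hypothetical, and the sketch assumes the usual TMX layout where each tu holds one tuv per language, identified by its xml:lang attribute:

        import xml.etree.ElementTree as ET

        XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # xml:lang attribute

        tree = ET.parse("bmvi_website.tmx")  # hypothetical filename
        for tu in tree.getroot().iterfind("body/tu"):
            # Map language code -> segment text for this translation unit.
            segs = {tuv.get(XML_LANG): tuv.findtext("seg", default="")
                    for tuv in tu.iterfind("tuv")}
            print(segs)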

