92 datasets found
  1. rag-mini-wikipedia

    • huggingface.co
    Updated May 5, 2025
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this Hugging Face discussion you can share what you used the dataset for. It derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.

  2. Wizard of Wikipedia Dataset

    • paperswithcode.com
    Updated Jun 30, 2021
    Cite
    Emily Dinan; Stephen Roller; Kurt Shuster; Angela Fan; Michael Auli; Jason Weston (2021). Wizard of Wikipedia Dataset [Dataset]. https://paperswithcode.com/dataset/wizard-of-wikipedia
    Explore at:
    Dataset updated
    Jun 30, 2021
    Authors
    Emily Dinan; Stephen Roller; Kurt Shuster; Angela Fan; Michael Auli; Jason Weston
    Description

    Wizard of Wikipedia is a large dataset of conversations directly grounded in knowledge retrieved from Wikipedia. It is used to train and evaluate dialogue systems for knowledgeable open dialogue with clear grounding.

  3. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +1more
    pdf, tsv
    Updated Jul 17, 2024
    + more versions
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one that acts as the core of the dataset is the page file, after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables" making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files).

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  4. wikipedia-small-3000-embedded

    • huggingface.co
    Updated Apr 6, 2024
    Cite
    Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2024
    Authors
    Hafedh Hichri
    License

    https://choosealicense.com/licenses/gfdl/

    Description

    This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

    from datasets import load_dataset, Dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    # load dataset in streaming mode (no download and it's fast)
    dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

    # select 3000 samples
    from tqdm import tqdm
    data = Dataset.from_dict({})
    for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
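
    The card's code is cut off above. Purely as an illustration, here is a minimal sketch of how such a 3000-article embedded subset might be assembled; the loop bounds, column names, and use of Dataset.from_list are assumptions, not the author's generate.py.

    from datasets import Dataset, load_dataset
    from sentence_transformers import SentenceTransformer
    from tqdm import tqdm

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    # Stream the dump so nothing has to be downloaded up front.
    stream = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

    # Take the first 3000 articles from the stream (assumed sampling strategy).
    samples = []
    for i, entry in enumerate(tqdm(stream, total=3000)):
        if i >= 3000:
            break
        samples.append({"id": entry["id"], "title": entry["title"], "text": entry["text"]})

    data = Dataset.from_list(samples)

    # Attach an embedding column computed from each article's text (assumed column name).
    data = data.map(lambda row: {"embedding": model.encode(row["text"])})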

  5. Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • narcis.nl
    • data.mendeley.com
    Updated Apr 19, 2022
    Cite
    Suzuki, T (via Mendeley Data) (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.2
    Explore at:
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Suzuki, T (via Mendeley Data)
    Description

    Brief Explanation

    This dataset was created to develop and evaluate a cataloging system that assigns appropriate metadata to an image record for database management in digital libraries. It is designed for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags. A main characteristic of the dataset is that it includes ambiguous tags, so the visual contents of images are not unique to their tags. For example, it includes the tag 'mouse', which can mean either the mammal or the computer pointing device. The annotations are the Wikipedia articles judged by humans to be the correct entities for the tags.

    The dataset offers both data and programs that reproduce experiments for the above task. The data consist of image sources and annotations: the image sources are URLs of 420 images uploaded to Flickr, and the annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to run a series of baseline methods for the designated task and to evaluate the results.

    Structure of the Dataset

    1. data directory
    1.1. image_URL.txt: lists the URLs of the image files.
    1.2. rels.txt: lists the correct Wikipedia pages for each topic in topics.txt.
    1.3. topics.txt: lists the target pairs (called topics in this dataset) of an image and a tag to be disambiguated.
    1.4. enwiki_20171001.xml: texts extracted from the title and body of English Wikipedia articles as of 1 October 2017; a modified version of the Wikipedia dump data (https://archive.org/download/enwiki-20171001).
    2. img directory: a placeholder directory into which image files are downloaded.
    3. results directory: a placeholder directory for storing result files for evaluation. It contains three baseline results in sub-directories, each holding one JSON file per topic, ready to be evaluated with the evaluation scripts in scripts.ipynb as a reference for both usage and performance.
    4. scripts.ipynb: the Jupyter notebook with the scripts for running the baseline methods and the evaluation.
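
    As a small illustration of how the data directory might be used, the sketch below downloads the listed images into the img placeholder directory. The one-URL-per-line format and the derived file names are assumptions based on the description above; the dataset's own scripts.ipynb is the authoritative reference.

    import os
    import urllib.request

    # Assumed layout: data/image_URL.txt holds one Flickr URL per line,
    # and img/ is the placeholder directory mentioned in the description.
    os.makedirs("img", exist_ok=True)

    with open("data/image_URL.txt", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        filename = os.path.join("img", os.path.basename(url))
        if not os.path.exists(filename):
            urllib.request.urlretrieve(url, filename)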

  6. French Wikipedia Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 18, 2021
    Cite
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot (2021). French Wikipedia Dataset [Dataset]. https://paperswithcode.com/dataset/french-wikipedia
    Explore at:
    Dataset updated
    Feb 18, 2021
    Authors
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot
    Area covered
    French
    Description

    French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.

  7. Wiki-zh Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 25, 2019
    Cite
    Yumo Xu; Mirella Lapata (2019). Wiki-zh Dataset [Dataset]. https://paperswithcode.com/dataset/wiki-zh
    Explore at:
    Dataset updated
    Jul 25, 2019
    Authors
    Yumo Xu; Mirella Lapata
    Description

    Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation, and test sets.

  8. WikiGraphs Dataset

    • paperswithcode.com
    Updated Jul 21, 2021
    Cite
    Luyu Wang; Yujia Li; Ozlem Aslan; Oriol Vinyals (2021). WikiGraphs Dataset [Dataset]. https://paperswithcode.com/dataset/wikigraphs
    Explore at:
    Dataset updated
    Jul 21, 2021
    Authors
    Luyu Wang; Yujia Li; Ozlem Aslan; Oriol Vinyals
    Description

    WikiGraphs is a dataset of Wikipedia articles each paired with a knowledge graph, to facilitate the research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (1 or few sentences), thus limiting the capabilities of the models that can be learned on the data.

    WikiGraphs is collected by pairing each Wikipedia article from the established WikiText-103 benchmark with a subgraph from the Freebase knowledge graph. This makes it easy to benchmark against other state-of-the-art text generative models that are capable of generating long paragraphs of coherent text. Both the graphs and the text data are of significantly larger scale compared to prior graph-text paired datasets.

  9. wiki40b

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Jun 3, 2024
    + more versions
    Cite
    Google (2024). wiki40b [Dataset]. https://huggingface.co/datasets/google/wiki40b
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    Google (http://google.com/)
    Description

    Dataset Card for "wiki40b"

      Dataset Summary
    

    Cleaned-up text for 40+ Wikipedia language editions, covering pages that correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.
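
    As a quick illustration, a minimal loading sketch with the Hugging Face datasets library; the "en" configuration name and the field names are assumptions based on the summary above, not verified against the card.

    from datasets import load_dataset

    # Load one language edition; per the summary each language has
    # train/dev(validation)/test splits ("en" config name assumed).
    wiki40b_en = load_dataset("google/wiki40b", "en", split="validation")

    example = wiki40b_en[0]
    print(example["wikidata_id"])  # Wikidata id of the entity (field name assumed)
    print(example["text"][:200])   # cleaned article text (field name assumed)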

  10. WikiTableQuestions (Semi-structured Tables Q&A)

    • kaggle.com
    Updated Nov 27, 2022
    Cite
    The Devastator (2022). WikiTableQuestions (Semi-structured Tables Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/investigation-of-semi-structured-tables-wikitabl
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Investigation of Semi-Structured Tables: WikiTableQuestions

    A Dataset of Complex Questions on Semi-Structured Wikipedia Tables

    About this dataset

    The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.

    How to use the dataset

    In order to use the WikiTableQuestions dataset, you will need to first understand the structure of the dataset. The dataset is comprised of two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.

    To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.
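
    As a starting point, here is a minimal sketch of loading one of the table files with pandas; 0.csv is taken from the file list further down, while the questions path is left as a placeholder assumption about the Kaggle download layout.

    import pandas as pd

    # Load one of the semi-structured tables listed under "Columns" below.
    table = pd.read_csv("0.csv")
    print(table.shape)
    print(table.columns.tolist())
    print(table.head())

    # The natural-language questions ship in a separate file; the exact file
    # name depends on the Kaggle download, so it is left as a placeholder here.
    # questions = pd.read_csv("<questions-file>.csv")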

    Happy Kaggling!

    Research Ideas

    • The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.

    • The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.

    • The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    Files: 0.csv, 1.csv, 10.csv, 11.csv, 12.csv, 14.csv, 15.csv, 17.csv, 18.csv

  11. Wikipedia Category Tree

    • kaggle.com
    Updated Sep 23, 2020
    Cite
    Kevin Lu (2020). Wikipedia Category Tree [Dataset]. https://www.kaggle.com/kevinlu1248/wikipedia-category-tree/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kevin Lu
    Description

    Context

    Given a Wikipedia term, I wanted a way to determine its ancestor category, so I downloaded the entirety of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz. Unfortunately, this file also contains the relationships between categories and pages, and there was a lot of overhead from unneeded information such as prefix keys. All of this was cleaned, and categories with identical terms were grouped into one row, reducing the 20 GB SQL file to a 227 MB CSV file.

    Although this is called a category "tree" by the official Media Wiki documentation, it is actually a DAG, a directed acyclic graph, which can be interpreted as a generalization of trees.

    Content

    Every line of children_cats.csv and page_children_cats.csv is of the form category,children, where category is the name of the category with spaces replaced by underscores, and children is a space-separated list of subcategories of the category.
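
    A minimal parsing sketch under the layout just described (category,children, with children given as a space-separated list); the example category name is illustrative.

    # Build a mapping from each category to its list of subcategories.
    tree = {}
    with open("children_cats.csv", encoding="utf-8") as f:
        for line in f:
            category, _, children = line.rstrip("\n").partition(",")
            tree[category] = children.split()

    # Example lookup: direct subcategories of one category (name is illustrative).
    print(tree.get("Computer_science", [])[:10])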

    Acknowledgements

    Special thanks to Wikipedia dumps for this dataset. This was downloaded and cleaned from https://dumps.wikimedia.org/enwiki/latest/.

    Inspiration

    This dataset is published so that anyone else who would like to download the category tree won't have to go through the trouble of cleaning a 20 GB SQL file.

  12. Plaintext Wikipedia dump 2018

    • live.european-language-grid.eu
    binary format
    Updated Feb 24, 2018
    Cite
    (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1242
    Explore at:
    Available download formats: binary format
    Dataset updated
    Feb 24, 2018
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

    The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).

    For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

    The script that can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when fetching many dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly).

    Also, the format of the dumps changes from time to time, so the script will probably stop working one day.

    The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].

  13. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML...

    • explore.openaire.eu
    • zenodo.org
    Updated Jan 12, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Robert West (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605387
    Explore at:
    Dataset updated
    Jan 12, 2020
    Authors
    Blagoj Mitrevski; Tiziano Piccardi; Robert West
    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper: Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. https://arxiv.org/abs/2001.10256 When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    • English Wikipedia’s full revision history parsed to HTML,
    • a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    • a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB large -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:

    • use a Torrent-based solution as described at https://github.com/epfl-dlab/WikiHist.html - Option 1 (recommended approach for the full download)
    • use our download scripts by following the instructions at https://github.com/epfl-dlab/WikiHist.html - Option 2 (the download scripts allow you to bulk-download all data as well as to download revisions for specific articles only)
    • download it manually from the Internet Archive at https://archive.org/details/WikiHist_html

    Dataset details

    Part 1: HTML revision history

    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id.

    We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns:...
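
    As a small illustration of Part 1's layout, the sketch below reads one gzip-compressed JSON file and prints a few fields of the first revision. The file path is illustrative, and treating each row as a JSON line is an assumption based on the description above.

    import gzip
    import json

    # Illustrative path: one gzipped JSON file inside one of the 558 directories.
    path = "enwiki-20190301-pages-meta-history1.xml-p1p1234/some_chunk.json.gz"

    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            revision = json.loads(line)  # one article revision per row (assumed JSON lines)
            print(revision["page_id"], revision["id"], revision["timestamp"], revision["title"])
            print(revision["html"][:200])
            break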

  14. wikitext2

    • huggingface.co
    • paperswithcode.com
    • +1more
    Updated Oct 21, 2023
    + more versions
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Explore at:
    Dataset updated
    Oct 21, 2023
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.

  15. Wiki-CS Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Cite
    Péter Mernyei; Cătălina Cangea, Wiki-CS Dataset [Dataset]. https://paperswithcode.com/dataset/wiki-cs
    Explore at:
    Authors
    Péter Mernyei; Cătălina Cangea
    Description

    Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.

    The dataset has 11,701 nodes and 216,123 edges.

  16. Wikipedia Training Data for Megatron-LM

    • academictorrents.com
    bittorrent
    Updated Aug 28, 2021
    Cite
    None (2021). Wikipedia Training Data for Megatron-LM [Dataset]. https://academictorrents.com/details/b6215a898a2a08b6061d23f2e4e1094121fb7082
    Explore at:
    Available download formats: bittorrent (7840268306)
    Dataset updated
    Aug 28, 2021
    Authors
    None
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Description

    A preprocessed dataset for training. Please see the accompanying instructions for how to use it. Note: the author does not hold any copyright over the data.

  17. Wikipedia pagecounts sorted by page (year 2014)

    • figshare.com
    txt
    Updated Feb 15, 2016
    Cite
    Alessio Bogon; Cristian Consonni; Alberto Montresor (2016). Wikipedia pagecounts sorted by page (year 2014) [Dataset]. http://doi.org/10.6084/m9.figshare.2085643.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 15, 2016
    Dataset provided by
    figshare
    Authors
    Alessio Bogon; Cristian Consonni; Alberto Montresor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the page view statistics for all the WikiMedia projects in the year 2014, ordered by (project, page, timestamp). It has been generated starting from WikiMedia's pagecounts-raw[1] dataset.

    The CSV uses spaces as the delimiter, without any form of escaping because it is not needed. It has 5 columns:

    • project: the project name
    • page: the page requested, url-escaped
    • timestamp: the timestamp of the hour (format: "%Y%m%d-%H%M%S")
    • count: the number of times the page has been requested (in that hour)
    • bytes: the number of bytes transferred (in that hour)

    You can download the full dataset via torrent[2]. Further information about this dataset is available at: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/

    [1] https://dumps.wikimedia.org/other/pagecounts-raw/
    [2] http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/#download
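
    A minimal sketch of streaming one of the sorted files and summing yearly counts per page for a single project; the file name is illustrative, and the five space-delimited columns follow the description above.

    from collections import Counter

    totals = Counter()
    # File name is illustrative; each line has the five space-delimited
    # columns described above: project page timestamp count bytes.
    with open("pagecounts-2014-sorted.txt", encoding="utf-8") as f:
        for line in f:
            project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
            if project == "en":
                totals[page] += int(count)

    # Ten most requested pages of the "en" project in 2014 (hourly hits summed).
    print(totals.most_common(10))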

  18. Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Apr 1, 2021
    + more versions
    Cite
    Webb, Geoff (2021). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892918
    Explore at:
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    Webb, Geoff
    Hyndman, Rob
    Godahewa, Rakshitha
    Bergmeir, Christoph
    Montero-Manso, Pablo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145,063 daily time series representing the number of hits (web traffic) for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.

    The original dataset contains missing values; they have simply been replaced by zeros.

  19. Data from: WikiMuTe: A web-sourced dataset of semantic descriptions for...

    • zenodo.org
    csv
    Updated Apr 17, 2024
    Cite
    Benno Weck; Holger Kirchhoff; Peter Grosche; Xavier Serra (2024). WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [Dataset]. http://doi.org/10.5281/zenodo.10223363
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benno Weck; Holger Kirchhoff; Peter Grosche; Xavier Serra
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This upload contains the supplementary material for our paper presented at the MMM2024 conference.

    Dataset

    The dataset contains rich text descriptions for music audio files collected from Wikipedia articles.

    The audio files are freely accessible and available for download through the URLs provided in the dataset.

    Example

    A few hand-picked, simplified examples of the dataset.

    • file: 🔈 Bongo sound.wav
      aspects: ['bongoes', 'percussion instrument', 'cumbia', 'drums']
      sentences: ['a loop of bongoes playing a cumbia beat at 99 bpm']

    • file: 🔈 Example of double tracking in a pop-rock song (3 guitar tracks).ogg
      aspects: ['bass', 'rock', 'guitar music', 'guitar', 'pop', 'drums']
      sentences: ['a pop-rock song']

    • file: 🔈 OriginalDixielandJassBand-JazzMeBlues.ogg
      aspects: ['jazz standard', 'instrumental', 'jazz music', 'jazz']
      sentences: ['Considered to be a jazz standard', 'is an jazz composition']

    • file: 🔈 Colin Ross - Etherea.ogg
      aspects: ['chirping birds', 'ambient percussion', 'new-age', 'flute', 'recorder', 'single instrument', 'woodwind']
      sentences: ['features a single instrument with delayed echo, as well as ambient percussion and chirping birds', 'a new-age composition for recorder']

    • file: 🔈 Belau rekid (instrumental).oga
      aspects: ['instrumental', 'brass band']
      sentences: ['an instrumental brass band performance']

    • ...

    Dataset structure

    We provide three variants of the dataset in the data folder.

    All are described in the paper.

    1. all.csv contains all the data we collected, without any filtering.
    2. filtered_sf.csv contains the data obtained using the self-filtering method.
    3. filtered_mc.csv contains the data obtained using the MusicCaps dataset method.

    File structure

    Each CSV file contains the following columns:

    • file: the name of the audio file
    • pageid: the ID of the Wikipedia article where the text was collected from
    • aspects: the short-form (tag) description texts collected from the Wikipedia articles
    • sentences: the long-form (caption) description texts collected from the Wikipedia articles
    • audio_url: the URL of the audio file
    • url: the URL of the Wikipedia article where the text was collected from
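
    A minimal loading sketch under two assumptions not confirmed by the description above: pandas can read the CSV directly, and the list-valued columns are stored as Python-literal strings (parsed with ast.literal_eval).

    import ast

    import pandas as pd

    # Load the unfiltered variant and inspect one entry.
    df = pd.read_csv("data/all.csv")
    row = df.iloc[0]
    print(row["file"], row["url"])
    print(ast.literal_eval(row["aspects"]))    # short-form (tag) descriptions
    print(ast.literal_eval(row["sentences"]))  # long-form (caption) descriptions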

    Citation

    If you use this dataset in your research, please cite the following paper:

    @inproceedings{wikimute,
    title = {WikiMuTe: {A} Web-Sourced Dataset of Semantic Descriptions for Music Audio},
    author = {Weck, Benno and Kirchhoff, Holger and Grosche, Peter and Serra, Xavier},
    booktitle = "MultiMedia Modeling",
    year = "2024",
    publisher = "Springer Nature Switzerland",
    address = "Cham",
    pages = "42--56",
    doi = {10.1007/978-3-031-56435-2_4},
    url = {https://doi.org/10.1007/978-3-031-56435-2_4},
    }

    License

    The data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

    Each entry in the dataset contains a URL linking to the article, where the text data was collected from.

  20. wiki_bio

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). wiki_bio [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki_bio
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    WikiBio is constructed from Wikipedia biography pages; it contains the first paragraph and the infobox, both tokenized. The dataset follows a standardized table format.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print a few examples.
    ds = tfds.load('wiki_bio', split='train')
    for ex in ds.take(4):
        print(ex)


    See the guide for more information on tensorflow_datasets.
