Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
Wizard of Wikipedia is a large dataset of conversations directly grounded in knowledge retrieved from Wikipedia. It is used to train and evaluate dialogue systems for knowledgeable open dialogue with clear grounding.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting Wikipedia data and to expand its analytical potential, we collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, limited in this case to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists, and data scientists.
There are a total of 9 files, all in TSV format, built under a relational structure. The page file acts as the core of the dataset; 4 files contain entities related to the Wikipedia pages (the category, url, pub and page_property files), and 4 further files act as intermediate tables, connecting the pages both with those entities and with other pages (the page_category, page_url, page_pub and page_link files).
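For illustration, here is a minimal sketch of how the relational structure could be queried with pandas. The column names (page_id, category_id) are assumptions made for this sketch only; consult the Dataset_summary document for the actual schema.

import pandas as pd

# Hypothetical column names; see Dataset_summary for the real schema.
page = pd.read_csv("page.tsv", sep="\t")
category = pd.read_csv("category.tsv", sep="\t")
page_category = pd.read_csv("page_category.tsv", sep="\t")

# Connect pages to their categories through the intermediate table.
pages_with_categories = page.merge(page_category, on="page_id").merge(category, on="category_id")
print(pages_with_categories.head())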
The document Dataset_summary includes a detailed description of the dataset.
Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
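For readers who want to reproduce something similar, here is a minimal sketch of how the truncated loop could continue. It is not the original generate.py; the subset size of 3000 and the text/embedding field names are assumptions made for this sketch.

from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
stream = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

rows = []
for i, entry in enumerate(tqdm(stream, total=3000)):
    if i >= 3000:  # assumed subset size, matching the dataset name
        break
    # Attach the sentence embedding of the article text as a plain list.
    entry["embedding"] = model.encode(entry["text"]).tolist()
    rows.append(entry)

data = Dataset.from_list(rows)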
French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation, and test sets.
WikiGraphs is a dataset of Wikipedia articles each paired with a knowledge graph, to facilitate the research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (1 or few sentences), thus limiting the capabilities of the models that can be learned on the data.
WikiGraphs is collected by pairing each Wikipedia article from the established WikiText-103 benchmark with a subgraph from the Freebase knowledge graph. This makes it easy to benchmark against other state-of-the-art text generative models that are capable of generating long paragraphs of coherent text. Both the graphs and the text data are of significantly larger scale compared to prior graph-text paired datasets.
Dataset Card for "wiki40b"
Dataset Summary
Cleaned-up text for 40+ Wikipedia language editions of pages corresponding to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.… See the full description on the dataset page: https://huggingface.co/datasets/google/wiki40b.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.
To use the WikiTableQuestions dataset, you first need to understand its structure. The dataset comprises two types of files: questions and answers. The questions are in natural language and are designed to test a model's ability to understand the table structure, understand the natural-language question, and reason about the answer. The answers are in a list format and provide additional information about each table that can be used to answer the questions.
To start working with the WikiTableQuestions dataset, download both the questions and answers files. Once you have both files, you can load them into a pandas DataFrame and begin exploring the data and developing your own models for answering the questions (see the sketch below).
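A minimal sketch of that workflow, assuming the files have already been downloaded; questions.csv is a placeholder name, and 0.csv is one of the table files listed further down.

import pandas as pd

# Placeholder file names; substitute the files you actually downloaded.
questions = pd.read_csv("questions.csv")
table = pd.read_csv("0.csv")  # one of the semi-structured Wikipedia tables

print(questions.head())
print(table.head())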
Happy Kaggling!
The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.
The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers.
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 0.csv
File: 1.csv
File: 10.csv
File: 11.csv
File: 12.csv
File: 14.csv
File: 15.csv
File: 17.csv
File: 18.csv
When given a Wikipedia term, I wanted a way to determine its ancestor category, so I downloaded the entirety of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz. Unfortunately, this file also contains the relationships between categories and pages, and there was a lot of overhead from unneeded information such as prefix keys. All of this was cleaned, and categories with identical terms were grouped into one row, reducing the 20 GB SQL file to a 227 MB CSV file.
Although this is called a category "tree" by the official Media Wiki documentation, it is actually a DAG, a directed acyclic graph, which can be interpreted as a generalization of trees.
Every line of children_cats.csv and page_children_cats.csv is of the form category,children, where category is the name of the category with spaces replaced by underscores, and children is a space-separated list of subcategories of the category.
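As an illustration, here is a minimal loading sketch, assuming plain UTF-8 text files with one category,children record per line, exactly as described above.

# Build a mapping from each category to its list of subcategories.
children = {}
with open("children_cats.csv", encoding="utf-8") as f:
    for line in f:
        category, _, kids = line.rstrip("\n").partition(",")
        children[category] = kids.split()  # space-separated subcategory names

print(len(children), "categories loaded")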
Special thanks to Wikipedia dumps for this dataset. This was downloaded and cleaned from https://dumps.wikimedia.org/enwiki/latest/.
This dataset is published so that in the case that anyone else would like to download the category tree, they won't have to go through the trouble of cleaning a 20Gb SQL file.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
The script that can be used to get a new version of the data is included, but note that Wikimedia limits the download speed for bulk downloads of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly).
Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Introduction
Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML. For more details, please refer to the description below and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. https://arxiv.org/abs/2001.10256
When using the dataset, please cite the above paper.
Dataset summary
The dataset consists of three parts:
English Wikipedia’s full revision history parsed to HTML,
a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
a table that allows for resolving redirects for any point in time (redirect_history.json.gz).
Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.
Getting the data
Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB large -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:
use a Torrent-based solution as described at https://github.com/epfl-dlab/WikiHist.html - Option 1 (recommended approach for the full download)
use our download scripts by following the instructions at https://github.com/epfl-dlab/WikiHist.html - Option 2 (the download scripts allow you to bulk-download all data as well as to download revisions for specific articles only)
download it manually from the Internet Archive at https://archive.org/details/WikiHist_html
Dataset details
Part 1: HTML revision history
The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id.
We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):
id: id of this revision
parentid: id of revision modified by this revision
timestamp: time when revision was made
cont_username: username of contributor
cont_id: id of contributor
cont_ip: IP address of contributor
comment: comment made by contributor
model: content model (usually "wikitext")
format: content format (usually "text/x-wiki")
sha1: SHA-1 hash
title: page title
ns: namespace (always 0)
page_id: page id
redirect_title: if page is redirect, title of target page
html: revision content in HTML format
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:
page_id: page id
title: page title
ns: namespace (0 for articles)
timestamp: time when page was created
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:
page_id: page id of redirect source
title: page title of redirect source
ns:...
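For illustration, here is a minimal sketch for reading Part 1, under the assumption that each gzip-compressed file stores one JSON revision object per line (matching the "each row represents one article revision" description above); the file path is a placeholder.

import gzip
import json

path = "enwiki-20190301-pages-meta-history1.xml-p10p2062/revisions-part.json.gz"  # placeholder path
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        rev = json.loads(line)
        # Fields as listed above: page_id, id, timestamp, title, html, ...
        print(rev["page_id"], rev["id"], rev["timestamp"], rev["title"])
        break  # inspect only the first revision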
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.
The dataset has 11,701 nodes and 216,123 edges.
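As an illustration of that feature construction, here is a sketch (not the authors' code); how articles are tokenized and how the GloVe vectors are loaded are assumptions left open here.

import numpy as np

def node_feature(tokens, glove, dim=300):
    # Average the 300-dimensional GloVe vectors of an article's tokens;
    # tokens without a pretrained vector are skipped.
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# glove is assumed to be a dict mapping word -> 300-dimensional np.ndarray,
# e.g. parsed from a publicly available GloVe text file.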
No license specified: https://academictorrents.com/nolicensespecified
A preprocessed dataset for training. Please see the instructions for how to use it. Note: the author does not own any copyrights of the data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the page view statistics for all the WikiMedia projects in the year 2014, ordered by (project, page, timestamp). It has been generated starting from WikiMedia's pagecounts-raw [1] dataset. The CSV uses spaces as delimiter, without any form of escaping, because it is not needed. It has 5 columns:
* project: the project name
* page: the page requested, url-escaped
* timestamp: the timestamp of the hour (format: "%Y%m%d-%H%M%S")
* count: the number of times the page has been requested (in that hour)
* bytes: the number of bytes transferred (in that hour)
You can download the full dataset via torrent [2]. Further information about this dataset is available at: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/
[1] https://dumps.wikimedia.org/other/pagecounts-raw/
[2] http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-sorted-by-page-year-2014/#download
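A minimal parsing sketch, assuming one space-delimited record per line in the column order given above; the local file name is a placeholder.

# Read the space-delimited page view records described above.
with open("pagecounts-2014-sorted.csv", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
        count, nbytes = int(count), int(nbytes)
        # e.g. ("en", "Main_Page", "20140101-000000", 1234, 567890)
        break  # inspect only the first record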
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145063 daily time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.
The original dataset contains missing values. They have been simply replaced by zeros.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This upload contains the supplementary material for our paper presented at the MMM2024 conference.
The dataset contains rich text descriptions for music audio files collected from Wikipedia articles.
The audio files are freely accessible and available for download through the URLs provided in the dataset.
A few hand-picked, simplified examples of the dataset.
file | aspects | sentences
 | ['bongoes', 'percussion instrument', 'cumbia', 'drums'] | ['a loop of bongoes playing a cumbia beat at 99 bpm']
🔈 Example of double tracking in a pop-rock song (3 guitar tracks).ogg | ['bass', 'rock', 'guitar music', 'guitar', 'pop', 'drums'] | ['a pop-rock song']
 | ['jazz standard', 'instrumental', 'jazz music', 'jazz'] | ['Considered to be a jazz standard', 'is an jazz composition']
 | ['chirping birds', 'ambient percussion', 'new-age', 'flute', 'recorder', 'single instrument', 'woodwind'] | ['features a single instrument with delayed echo, as well as ambient percussion and chirping birds', 'a new-age composition for recorder']
 | ['instrumental', 'brass band'] | ['an instrumental brass band performance']
... | ... | ...
We provide three variants of the dataset in the data folder. All are described in the paper.
all.csv contains all the data we collected, without any filtering.
filtered_sf.csv contains the data obtained using the self-filtering method.
filtered_mc.csv contains the data obtained using the MusicCaps dataset method.
Each CSV file contains the following columns:
file: the name of the audio file
pageid: the ID of the Wikipedia article where the text was collected from
aspects: the short-form (tag) description texts collected from the Wikipedia articles
sentences: the long-form (caption) description texts collected from the Wikipedia articles
audio_url: the URL of the audio file
url: the URL of the Wikipedia article where the text was collected from
If you use this dataset in your research, please cite the following paper:
@inproceedings{wikimute,
  title     = {WikiMuTe: {A} Web-Sourced Dataset of Semantic Descriptions for Music Audio},
  author    = {Weck, Benno and Kirchhoff, Holger and Grosche, Peter and Serra, Xavier},
  booktitle = {MultiMedia Modeling},
  year      = {2024},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {42--56},
  doi       = {10.1007/978-3-031-56435-2_4},
  url       = {https://doi.org/10.1007/978-3-031-56435-2_4},
}
The data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.
Each entry in the dataset contains a URL linking to the article from which the text data was collected.
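For illustration, here is a minimal sketch of loading one variant with pandas; it assumes the aspects and sentences columns are stored as Python-style list literals (as in the examples above), so ast.literal_eval can turn them back into lists.

import ast
import pandas as pd

df = pd.read_csv("data/all.csv")
for col in ("aspects", "sentences"):
    df[col] = df[col].apply(ast.literal_eval)  # assumed list-literal encoding

print(df[["file", "aspects", "sentences"]].head())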
WikiBio is constructed from Wikipedia biography pages; it contains the tokenized first paragraph and infobox of each page. The dataset follows a standardized table format.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki_bio', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.