100+ datasets found

f
Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...
figshare.com
txt
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14113799.v4
Dataset updated
Mar 14, 2021
Dataset provided by
figshare
Authors
KayYen Wong; Diego Saez-Trumper; Miriam Redi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Wiki-Reliability: Machine Learning datasets for measuring content reliability on WikipediaConsists of metadata features and content text datasets, with the formats:- {template_name}_features.csv - {template_name}_difftxt.csv.gz - {template_name}_fulltxt.csv.gz For more details on the project, dataset schema, and links to data usage and benchmarking:https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
h
simple-wiki
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
Embedding Training Data
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for "simple-wiki"

Dataset Summary

This dataset contains pairs of equivalent sentences obtained from Wikipedia.

Supported Tasks

Sentence Transformers training; useful for semantic search and sentence similarity.

Languages

English.

Dataset Structure

Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" and a list with the sentences as "value". {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
Wikipedia Knowledge Graph dataset
zenodo.org
produccioncientifica.ugr.es
+1more
pdf, tsv
Updated Jul 17, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
Explore at:
tsv, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6346900
Dataset updated
Jul 17, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Wenceslao Arroyo-Machado; Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Daniel Torres-Salinas; Rodrigo Costas; Rodrigo Costas
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one that acts as the core of the dataset is the page file, after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables" making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files).

The document Dataset_summary includes a detailed description of the dataset.

Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
Data from: English Wikipedia - Species Pages
gbif.org
Updated Aug 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Markus Döring; Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
Explore at:
Unique identifier
https://doi.org/10.15468/c3kkgh
Dataset updated
Aug 23, 2022
Dataset provided by
Wikimedia Foundationhttp://www.wikimedia.org/
Global Biodiversity Information Facilityhttps://www.gbif.org/
Authors
Markus Döring; Markus Döring
Description
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

See https://github.com/mdoering/wikipedia-dwca for details.
Wikipedia Change Metadata
redivis.com
application/jsonl +7
Updated Sep 22, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stanford Graduate School of Business Library (2021). Wikipedia Change Metadata [Dataset]. https://redivis.com/datasets/1ky2-8b1pvrv76
Explore at:
stata, application/jsonl, arrow, spss, parquet, csv, sas, avroAvailable download formats
Dataset updated
Sep 22, 2021
Dataset provided by
Redivis Inc.
Authors
Stanford Graduate School of Business Library
Time period covered
Jan 16, 2001 - Mar 1, 2019
Description
Abstract

The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.

Documentation

**Source for details below: **https://zenodo.org/record/3605388#.YWitsdnML0o

Dataset details

Part 1: HTML revision history The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and *p$2p$3 *indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

id: id of this revision

parentid: id of revision modified by this revision

timestamp: time when revision was made

cont_username: username of contributor

cont_id: id of contributor

cont_ip: IP address of contributor

comment: comment made by contributor

model: content model (usually "wikitext")

format: content format (usually "text/x-wiki")

sha1: SHA-1 hash

title: page title

ns: namespace (always 0)

page_id: page id

redirect_title: if page is redirect, title of target page

html: revision content in HTML format

%3C!-- --%3E

Part 2: Page creation times (page_creation_times.json.gz)

This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

page_id: page id

title: page title

ns: namespace (0 for articles)

timestamp: time when page was created

%3C!-- --%3E

Part 3: Redirect history (redirect_history.json.gz)

This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

page_id: page id of redirect source

title: page title of redirect source

ns: namespace (0 for articles)

revision_id: revision id of redirect source

timestamp: time at which redirect became active

redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)

%3C!-- --%3E
Wikipedia Article Topics for All Languages (based on article outlinks)
figshare.com
bz2
Updated Jul 20, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Johnson (2021). Wikipedia Article Topics for All Languages (based on article outlinks) [Dataset]. http://doi.org/10.6084/m9.figshare.12619766.v3
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12619766.v3
Dataset updated
Jul 20, 2021
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Isaac Johnson
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021) but earlier/future versions may be for other snapshots as indicated by the filename.The data is bzip-compressed and each row is tab-delimited and contains the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of these topics applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy* wiki_db: which Wikipedia language edition the article belongs too -- e.g., enwiki == English Wikipedia* qid: if the article has a Wikidata item, what ID is it -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace 0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.For more information, see this model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performanceAdditional, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID so if e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
P
Wiki-en Dataset
paperswithcode.com
opendatalab.com
Updated Jul 25, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yumo Xu; Mirella Lapata (2019). Wiki-en Dataset [Dataset]. https://paperswithcode.com/dataset/wiki-en
Explore at:
Dataset updated
Jul 25, 2019
Authors
Yumo Xu; Mirella Lapata
Description
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
Wikipedia Talk Corpus
figshare.com
application/x-gzip
Updated Jan 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.4264973.v3
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4264973.v3
Dataset updated
Jan 23, 2017
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ellery Wulczyn; Nithum Thain; Lucas Dixon
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
Z
A Wikipedia dataset of 5 categories
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maitre, Julien (2020). A Wikipedia dataset of 5 categories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3260045
Explore at:
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Maitre, Julien
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A subset of articles extracted from the French Wikipedia XML dump. Data published here include 5 different categories : Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is a xml file extracted from the dump and save as UTF8 plain text. The characteristics of dataset is :

Economy : 44'876 articles

History : 92'041 articles

Informatics : 25'408 articles

Health : 22'143 articles

Law : 9'964 articles
o
Wikipedia Articles Dataset
opendatabay.com
.undefined
Updated May 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2025). Wikipedia Articles Dataset [Dataset]. https://www.opendatabay.com/data/premium/b6292674-e94d-4a7e-93c0-00cf1474ffdd
Explore at:
.undefinedAvailable download formats
Dataset updated
May 25, 2025
Dataset authored and provided by
Bright Data
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
Access a wealth of information, including article titles, raw text, images, and structured references. Popular use cases include knowledge extraction, trend analysis, and content development.

Use our Wikipedia Articles dataset to access a vast collection of articles across a wide range of topics, from history and science to culture and current events. This dataset offers structured data on articles, categories, and revision histories, enabling deep analysis into trends, knowledge gaps, and content development.

Tailored for researchers, data scientists, and content strategists, this dataset allows for in-depth exploration of article evolution, topic popularity, and interlinking patterns. Whether you are studying public knowledge trends, performing sentiment analysis, or developing content strategies, the Wikipedia Articles dataset provides a rich resource to understand how information is shared and consumed globally.

Dataset Features - url: Direct URL to the original Wikipedia article.
- title: The title or name of the Wikipedia article.
- table_of_contents: A list or structure outlining the article's sections and hierarchy.
- raw_text: Unprocessed full text content of the article.
- cataloged_text: Cleaned and structured version of the article’s content, optimized for analysis.
- images: Links or data on images embedded in the article.
- see_also: Related articles linked under the “See Also” section.
- references: Sources cited in the article for credibility.
- external_links: Links to external websites or resources mentioned in the article.
- categories: Tags or groupings classifying the article by topic or domain.
- timestamp: Last edit date or revision time of the article snapshot.

Distribution - Data Volume: 11 Columns and 2.19 M Rows
- Format: CSV

Usage This dataset supports a wide range of applications: - Knowledge Extraction: Identify key entities, relationships, or events from Wikipedia content.
- Content Strategy & SEO: Discover trending topics and content gaps.
- Machine Learning: Train NLP models (e.g., summarisation, classification, QA systems).
- Historical Trend Analysis: Study how public interest in topics changes over time.
- Link Graph Modeling: Understand how information is interconnected.

Coverage - Geographic Coverage: Global (multi-language Wikipedia versions also available)
- Time Range: Continuous updates; snapshots available from early 2000s to present.

License

CUSTOM

Please review the respective licenses below:

Data Provider's License

Bright Data Master Service Agreement

Who Can Use It - Data Scientists: For training or testing NLP and information retrieval systems.
- Researchers: For computational linguistics, social science, or digital humanities.
- Businesses: To enhance AI-powered content tools or customer insight platforms.
- Educators/Students: For building projects, conducting research, or studying knowledge systems.

Suggested Dataset Names 1. Wikipedia Corpus+
2. Wikipedia Stream Dataset
3. Wikipedia Knowledge Bank
4. Open Wikipedia Dataset

Pricing

Based on Delivery frequency

~Up to $0.0025 per record. Min order $250

Approximately 283 new records are added each month. Approximately 1.12M records are updated each month. Get the complete dataset each delivery, including all records. Retrieve only the data you need with the flexibility to set Smart Updates.

Monthly

New snapshot each month, 12 snapshots/year Paid monthly

Quarterly

New snapshot each quarter, 4 snapshots/year Paid quarterly

Bi-annual

New snapshot every 6 months, 2 snapshots/year Paid twice-a-year

One-time purchase

New snapshot one-time delivery Paid once
Title and subtitles of Wikipedia articles
data.4tu.nl
figshare.com
zip
Updated Jun 6, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Sanchez-Charles (2017). Title and subtitles of Wikipedia articles [Dataset]. http://doi.org/10.4121/uuid:61fb9665-40ab-4b70-8214-767c521cc950
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/uuid:61fb9665-40ab-4b70-8214-767c521cc950
Dataset updated
Jun 6, 2017
Dataset provided by
4TUhttps://www.4tu.nl/
Authors
David Sanchez-Charles
License
https://doi.org/10.4121/resource:terms_of_usehttps://doi.org/10.4121/resource:terms_of_use
Description
This dataset contains 871 articles from Wikipedia (retrieved on 8th August 2016), selected from the list of featured articles ({https://en.wikipedia.org/wiki/Wikipedia:Featured_articles}) of the 'Media', 'Literature and Theater', 'Music biographies', 'Media biographies', 'History biographies' and 'Video gaming' categories. From the list of articles, the structure of the document, i.e. sections and subsections of the text, is extracted.

The dataset also contains a proposed clusterization of the event names to increase comparability of Wikipedia articles.
h
roots_ar_wikipedia
huggingface.co
Updated Apr 10, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BigScience Data (2023). roots_ar_wikipedia [Dataset]. https://huggingface.co/datasets/bigscience-data/roots_ar_wikipedia
Explore at:
Dataset updated
Apr 10, 2023
Dataset authored and provided by
BigScience Data
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
ROOTS Subset: roots_ar_wikipedia

wikipedia

Dataset uid: wikipedia

Description Homepage Licensing Speaker Locations Sizes

3.2299 % of total 4.2071 % of en 5.6773 % of ar 3.3416 % of fr 5.2815 % of es 12.4852 % of ca 0.4288 % of zh 0.4286 % of zh 5.4743 % of indic-bn 8.9062 % of indic-ta 21.3313 % of indic-te 4.4845 % of pt 4.0493 % of indic-hi 11.3163 % of indic-ml 22.5300 % of indic-ur 4.4902 % of vi 16.9916 % of indic-kn… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_wikipedia.
a
Wikipedia Training Data for Megatron-LM
academictorrents.com
bittorrent
Updated Aug 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
None (2021). Wikipedia Training Data for Megatron-LM [Dataset]. https://academictorrents.com/details/b6215a898a2a08b6061d23f2e4e1094121fb7082
Explore at:
bittorrent(7840268306)Available download formats
Dataset updated
Aug 28, 2021
Authors
None
License
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Description
A preprocessed dataset for training. Please see instructions in for how to use it. Note: the author does not own any copyrights of the data.
Wikipedia Talk Labels: Personal Attacks
figshare.com
txt
Updated Feb 22, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Personal Attacks [Dataset]. http://doi.org/10.6084/m9.figshare.4054689.v6
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4054689.v6
Dataset updated
Feb 22, 2017
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ellery Wulczyn; Nithum Thain; Lucas Dixon
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
h
wikipedia-data-en-2023-11
huggingface.co
Updated Dec 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mixedbread (2023). wikipedia-data-en-2023-11 [Dataset]. https://huggingface.co/datasets/mixedbread-ai/wikipedia-data-en-2023-11
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 1, 2023
Dataset authored and provided by
Mixedbread
Description
mixedbread-ai/wikipedia-data-en-2023-11 dataset hosted on Hugging Face and contributed by the HF Datasets community
Z
Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Explore at:
Dataset updated
Oct 29, 2021
Dataset provided by
Alyssa Lees
Yu Su
Huan Sun
Cong Yu
Xiang Deng
You Wu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }
Google Trends and Wikipedia Page Views
zenodo.org
explore.openaire.eu
application/gzip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mitsuo Yoshida; Mitsuo Yoshida (2020). Google Trends and Wikipedia Page Views [Dataset]. http://doi.org/10.5281/zenodo.14539
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14539
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mitsuo Yoshida; Mitsuo Yoshida
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Abstract (our paper)

The frequency of a web search keyword generally reflects the degree of public interest in a particular subject matter. Search logs are therefore useful resources for trend analysis. However, access to search logs is typically restricted to search engine providers. In this paper, we investigate whether search frequency can be estimated from a different resource such as Wikipedia page views of open data. We found frequently searched keywords to have remarkably high correlations with Wikipedia page views. This suggests that Wikipedia page views can be an effective tool for determining popular global web search trends.

Data

personal-name.txt.gz:
The first column is the Wikipedia article id, the second column is the search keyword, the third column is the Wikipedia article title, and the fourth column is the total of page views from 2008 to 2014.

personal-name_data_google-trends.txt.gz, personal-name_data_wikipedia.txt.gz:
The first column is the period to be collected, the second column is the source (Google or Wikipedia), the third column is the Wikipedia article id, the fourth column is the search keyword, the fifth column is the date, and the sixth column is the value of search trend or page view.

Publication

This data set was created for our study. If you make use of this data set, please cite:
Mitsuo Yoshida, Yuki Arase, Takaaki Tsunoda, Mikio Yamamoto. Wikipedia Page View Reflects Web Search Trend. Proceedings of the 2015 ACM Web Science Conference (WebSci '15). no.65, pp.1-2, 2015.
http://dx.doi.org/10.1145/2786451.2786495
http://arxiv.org/abs/1509.02218 (author-created version)

Note

The raw data of Wikipedia page views is available in the following page.
http://dumps.wikimedia.org/other/pagecounts-raw/
P
Wiki Dataset
paperswithcode.com
Updated Jan 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). Wiki Dataset [Dataset]. https://paperswithcode.com/dataset/wiki
Explore at:
Dataset updated
Jan 20, 2023
Description
Context There's a story behind every dataset and here's your opportunity to share yours.

Content What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

Acknowledgements We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration Your data will be in front of the world's largest data science community. What questions do you want to see answered?
E
Data from: English Wikipedia
live.european-language-grid.eu
explore.openaire.eu
+1more
txt
Updated Jun 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). English Wikipedia [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7443
Explore at:
txtAvailable download formats
Dataset updated
Jun 24, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This text corpus is composed of texts of English Wikipedia extracted from the Wikipedia dump of 26th September 2015 using the WikiExtractor tool (https://github.com/attardi/wikiextractor).
Data from: Wikipedia on the CompTox Chemicals Dashboard: Connecting...
datasets.ai
catalog.data.gov
0, 33, 53
Updated Aug 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Environmental Protection Agency (2024). Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data [Dataset]. https://datasets.ai/datasets/wikipedia-on-the-comptox-chemicals-dashboard-connecting-resources-to-enrich-public-chemica
Explore at:
53, 33, 0Available download formats
Dataset updated
Aug 26, 2024
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Authors
U.S. Environmental Protection Agency
Description
Spreadsheet summaries of identifier availability and correctness in Wikipedia

Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes

Investigation of John W. Huffman cannabinoid dataset

Summary of Wikipedia pages linked to DSSTox records

Complete identifier data scraped from Wikipedia Chembox and Drugbox pages.

This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).

Facebook

Twitter

Click to copy link

Link copied

Cite

KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4

Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.14113799.v4

Dataset updated

Mar 14, 2021

Dataset provided by

figshare

Authors

KayYen Wong; Diego Saez-Trumper; Miriam Redi

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Wiki-Reliability: Machine Learning datasets for measuring content reliability on WikipediaConsists of metadata features and content text datasets, with the formats:- {template_name}_features.csv - {template_name}_difftxt.csv.gz - {template_name}_fulltxt.csv.gz For more details on the project, dataset schema, and links to data usage and benchmarking:https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia

Clear search

Close search

Google apps

Main menu

Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...

simple-wiki

Wikipedia Knowledge Graph dataset

Data from: English Wikipedia - Species Pages

Wikipedia Change Metadata

Abstract

Documentation

Wikipedia Article Topics for All Languages (based on article outlinks)

Wiki-en Dataset

Wikipedia Talk Corpus

A Wikipedia dataset of 5 categories

Wikipedia Articles Dataset

Pricing

Based on Delivery frequency

Title and subtitles of Wikipedia articles

roots_ar_wikipedia

Wikipedia Training Data for Megatron-LM

Wikipedia Talk Labels: Personal Attacks

wikipedia-data-en-2023-11

Sentence/Table Pair Data from Wikipedia for Pre-training with...

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Google Trends and Wikipedia Page Views

Wiki Dataset

Data from: English Wikipedia

Data from: Wikipedia on the CompTox Chemicals Dashboard: Connecting...

Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia