3 datasets found
  1. A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023)

    • zenodo.org
    zip
    Updated Jul 5, 2023
    Cite
    Natallia Kokash; Giovanni Colavizza (2023). A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023) [Dataset]. http://doi.org/10.5281/zenodo.8107239
    Available download formats: zip
    Dataset updated: Jul 5, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Natallia Kokash; Giovanni Colavizza
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    This is a dataset of 40,664,485 citations extracted from the February 2023 dump of English Wikipedia (https://dumps.wikimedia.org/enwiki/20230220/).

    Version 1: en_citations.zip contains the extracted citations.

    Version 2: en_final.zip contains the same dataset, with classified citations augmented with identifiers.

    The fields are as follows:

    • type_of_citation - Wikipedia template type used to define the citation, e.g., 'cite journal', 'cite news', etc.
    • page_title - title of the Wikipedia article from which the citation was extracted.
    • Title - source title, e.g., title of the book, newspaper article, etc.
    • URL - link to the source, e.g., webpage where news article was published, description of the book at the publisher's website, online library webpage, etc.
    • tld - top link domain extracted from the URL, e.g., 'bbc' for https://www.bbc.co.uk/...
    • Authors - list of article or book authors, if available.
    • ID_list - list of publication identifiers mentioned in the citation, e.g., DOI, ISBN, etc.
    • citations - citation text as it appears in the Wikipedia source markup.
    • actual_label - 'book', 'journal', 'news', or 'other' label assigned based on the analysis of citation identifiers or top link domain.
    • acquired_ID_list - identifiers located via Google Books and Crossref APIs for citations which are likely to refer to books or journals, i.e., defined using 'cite book', 'cite journal', 'cite encyclopedia', and 'cite proceedings' templates.
    1. Total number of news citations: 9,926,598
    2. Total number of book citations: 2,994,601
    3. Total number of journal citations: 2,052,172
    4. Citations augmented with IDs via lookup: 929,601 (out of 2,445,913 book, journal, encyclopedia, and proceedings template citations not classified as books or journals via given identifiers).
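
To make the field list concrete, here is a minimal inspection sketch in Python/pandas. It assumes en_final.zip has been unpacked into Parquet files carrying the column names listed above; the path pattern is illustrative, and the reader call should be adjusted (e.g. to pd.read_csv) if the actual on-disk format differs.

# Minimal sketch: load the citation dataset and inspect the label counts.
# Assumptions: the unpacked archive contains Parquet files with the columns
# listed above; "en_final/*.parquet" is a hypothetical path pattern.
import glob

import pandas as pd

frames = [pd.read_parquet(path) for path in glob.glob("en_final/*.parquet")]
citations = pd.concat(frames, ignore_index=True)

# Distribution of 'book', 'journal', 'news', and 'other' labels.
print(citations["actual_label"].value_counts())

# Citations that gained identifiers via the Google Books / Crossref lookup.
augmented = citations[citations["acquired_ID_list"].notna()]
print("Augmented with IDs via lookup:", len(augmented))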

    The source code to extract citations can be found here: https://github.com/albatros13/wikicite.

    The code is a fork of the earlier project on Wikipedia citation extraction: https://github.com/Harshdeep1996/cite-classifications-wiki.

  2. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML Format

    • zenodo.org
    • data-staging.niaid.nih.gov
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Robert West (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Available download formats: application/gzip, zip
    Dataset updated: Jun 8, 2020
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Blagoj Mitrevski; Tiziano Piccardi; Robert West
    License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/ (license information was derived automatically)

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.
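
For illustration, the ad-hoc REST API approach mentioned above looks roughly like the sketch below, which posts wikitext to the Wikimedia REST API's transform endpoint. The endpoint path and payload shape are assumptions based on the publicly documented API and may differ from what individual researchers actually used.

# Illustrative sketch of ad-hoc wikitext-to-HTML parsing via Wikipedia's REST API.
# Assumption: the /transform/wikitext/to/html endpoint accepts a JSON body with a
# 'wikitext' field; consult the current API documentation before relying on this.
import requests

def wikitext_to_html(wikitext: str) -> str:
    resp = requests.post(
        "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html",
        json={"wikitext": wikitext},
    )
    resp.raise_for_status()
    return resp.text

print(wikitext_to_html("'''Hello''' [[World]]"))

As the paragraph notes, issuing one such request per revision neither scales to hundreds of millions of revisions nor expands templates as they existed at the time of a historical revision.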

    We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity for correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia's full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history, spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB in size -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options, including the download script mentioned below.

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
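
A minimal reading sketch for one of these files is shown below. It assumes the gzip-compressed JSON files store one revision object per line (JSON Lines) with the fields listed above; the file path is purely illustrative.

# Sketch: iterate over the HTML revisions stored in one gzipped JSON file.
# Assumptions: one JSON object per line with the fields listed above;
# the path below is a hypothetical example following the naming scheme.
import gzip
import json

path = "enwiki-20190301-pages-meta-history1.xml-p1p1000/revisions-0001.json.gz"

with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        rev = json.loads(line)
        print(rev["page_id"], rev["id"], rev["timestamp"], rev["title"])
        html = rev["html"]  # the revision content as readers saw it
        # ... analyze the HTML here ...
        break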

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created
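
A minimal sketch of the blue-or-red-link check mentioned above, assuming the file stores one JSON object per line with the fields listed above and that timestamps are ISO-8601 strings (so lexicographic comparison matches chronological order):

# Sketch: was a link to `title` blue (target page existed) at a given time?
# Assumptions: JSON Lines format, ISO-8601 timestamps.
import gzip
import json

creation_times = {}
with gzip.open("page_creation_times.json.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        page = json.loads(line)
        if page["ns"] == 0:  # articles only
            creation_times[page["title"]] = page["timestamp"]

def link_was_blue(title: str, at_timestamp: str) -> bool:
    created = creation_times.get(title)
    return created is not None and created <= at_timestamp

print(link_was_blue("Albert Einstein", "2005-01-01T00:00:00Z"))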

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)
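
A minimal sketch for resolving a redirect at a given point in time, under the same one-JSON-object-per-line assumption; note that this simplified lookup does not handle pages that later stopped being redirects:

# Sketch: find the target a redirect page pointed to at a given time.
# Assumptions: JSON Lines format; the first element of the 'redirect'
# array is the target title, as described above.
import gzip
import json
from collections import defaultdict

redirects = defaultdict(list)  # source title -> list of (timestamp, target)
with gzip.open("redirect_history.json.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        rev = json.loads(line)
        redirects[rev["title"]].append((rev["timestamp"], rev["redirect"][0]))

def redirect_target(title: str, at_timestamp: str):
    """Return the redirect target active at at_timestamp, or None."""
    target = None
    for ts, tgt in sorted(redirects[title]):
        if ts <= at_timestamp:
            target = tgt
        else:
            break
    return target

print(redirect_target("UK", "2010-06-01T00:00:00Z"))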

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump (https://dumps.wikimedia.org/enwiki/20190301) from wikitext to HTML. That old dump is no longer available on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab.

  3. WikiPathways

    • rrid.site
    • neuinfo.org
    • +1 more
    Updated Jan 29, 2022
    Cite
    (2022). WikiPathways [Dataset]. http://identifiers.org/RRID:SCR_002134
    Dataset updated: Jan 29, 2022
    Description

    An open and collaborative platform dedicated to the curation of biological pathways. Each pathway has a dedicated wiki page displaying the current diagram, description, references, download options, version history, and component gene and protein lists. WikiPathways is a database of biological pathways maintained by and for the scientific community.

