MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains a pair of equivalent sentences and is formatted as a dictionary with the key "set" whose value is the list of sentences. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
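As a rough illustration of that structure, here is a minimal sketch of loading the pairs, assuming the Hugging Face datasets library and the dataset ID shown above; the exact field layout should be checked against the dataset card.

```python
# Minimal sketch: load the sentence pairs and inspect the "set" field.
# Assumes the Hugging Face `datasets` package and the dataset ID shown above.
from datasets import load_dataset

dataset = load_dataset("embedding-data/simple-wiki", split="train")

example = dataset[0]
# Each example is expected to look like {"set": ["sentence A", "an equivalent sentence B"]}.
print(example["set"])

# Build (anchor, positive) pairs for Sentence Transformers-style training.
pairs = [(row["set"][0], row["set"][1]) for row in dataset if len(row["set"]) >= 2]
print(f"{len(pairs)} sentence pairs")
```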
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, with the formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
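For orientation, a minimal pandas sketch of reading these files; "pov" is only a hypothetical template name here, and the actual column schemas are documented on the project page linked above.

```python
# Sketch: read one template's feature table and gzipped full-text table.
# "pov" is an illustrative template name; substitute the template you need.
import pandas as pd

features = pd.read_csv("pov_features.csv")
fulltxt = pd.read_csv("pov_fulltxt.csv.gz", compression="gzip")

print(features.shape, list(features.columns)[:10])
print(fulltxt.shape)
```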
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
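A minimal sketch of dot-product semantic search over the precomputed passage embeddings; the split name and the embedding column name ("emb") are assumptions that should be checked against the dataset card, and the query must be embedded with the same multilingual-22-12 model (e.g. via the Cohere API) before searching.

```python
# Sketch: rank passages by dot-product similarity against a query embedding.
# Assumes the embedding column is named "emb"; verify field names on the dataset card.
import numpy as np
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
doc_embeddings = np.asarray(docs["emb"], dtype=np.float32)

def search(query_embedding: np.ndarray, top_k: int = 5):
    # query_embedding must come from the same multilingual-22-12 model.
    scores = doc_embeddings @ query_embedding
    best = np.argsort(-scores)[:top_k]
    return [(docs[int(i)]["title"], docs[int(i)]["text"], float(scores[i])) for i in best]
```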
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is the complete dataset of contents of all the Wikibooks in 12 languages. The content contains books in the following languages: English, French, German, Spanish, Portuguese, Italian, Russian, Japanese, Dutch, Polish, Hungarian, and Hebrew; each language has its own directory. Wikibooks are divided into chapters, and each chapter has its own webpage. This dataset can be used for tasks like Machine Translation, Text Generation, Text Parsing, and Semantic Understanding of Natural Language. Body contents are provided both as newline-delimited text, as it would be visible on the page, and as HTML for better semantic parsing.
Refer to the starter notebook: Starter: Wikibooks dataset
Data as of October 22, 2021.
Image Credits: Unsplash - itfeelslikefilm
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be for other snapshots as indicated by the filename.
The data is bzip-compressed, and each row is tab-delimited and contains the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of the topics in the taxonomy (https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy) applies to the article:
* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
* qid: if the article has a Wikidata item, what ID it is -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace-0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.
For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance
Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
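A minimal sketch of streaming one of the bzip-compressed TSV files; the file name is a placeholder, and the assumption that the columns after num_outlinks are per-topic probabilities should be verified against the taxonomy page linked above.

```python
# Sketch: stream the bzip2-compressed, tab-delimited predictions file.
# "topic_predictions.tsv.bz2" is a placeholder file name for one of the dump files.
import bz2
import csv

with bz2.open("topic_predictions.tsv.bz2", mode="rt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)  # assumes a header row; drop this line if there is none
    for row in reader:
        wiki_db, qid, pid, num_outlinks = row[:4]
        topic_scores = [float(x) for x in row[4:]]  # assumed: one probability per topic
        print(wiki_db, qid, pid, num_outlinks, topic_scores[:3])
        break
```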
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 3 datasets are derived from the Italian (ItWiki-100), French (FrWiki-100) and English (EnWiki-100) Wikipedia dumps, with articles tagged with related portals (the 100 most common per language).
If you use this data you may cite these works:
Gasparetto A, Marcuzzo M, Zangari A, Albarelli A. (2022) A Survey on Text Classification Algorithms: From Text to Predictions. Information 13, no. 2: 83. https://doi.org/10.3390/info13020083
Gasparetto A, Zangari A, Marcuzzo M, Albarelli A. (2022) A survey on text classification: Practical perspectives on the Italian language. PLOS ONE 17(7): e0270904. https://doi.org/10.1371/journal.pone.0270904
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting data from various sources and processing it, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful for a wide range of researchers, such as informetricians, sociologists, or data scientists.
There are a total of 9 files, all in TSV format, built under a relational structure. The main one, which acts as the core of the dataset, is the page file; it is followed by 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with those entities and with each other (the page_category, page_url, page_pub and page_link files).
The document Dataset_summary includes a detailed description of the dataset.
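A minimal pandas sketch of joining the relational TSV files; the file and column names used here (page.tsv, page_id, category_id) are illustrative assumptions, and the actual schema is given in the Dataset_summary document.

```python
# Sketch: join pages to their categories via the page_category intermediate table.
# File and column names ("page.tsv", "page_id", "category_id") are assumptions.
import pandas as pd

page = pd.read_csv("page.tsv", sep="\t")
category = pd.read_csv("category.tsv", sep="\t")
page_category = pd.read_csv("page_category.tsv", sep="\t")

pages_with_categories = (
    page.merge(page_category, on="page_id")
        .merge(category, on="category_id")
)
print(pages_with_categories.head())
```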
Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
mixedbread-ai/wikipedia-data-en-2023-11 dataset hosted on Hugging Face and contributed by the HF Datasets community
The task of measurement extraction is typically approached in a pipeline manner, where 1) quantities are identified before 2) their individual measurement context is extracted (see our review paper). To support the development and evaluation of systems for measurement extraction, we present two large datasets that correspond to the two tasks:
The datasets are heuristically generated from Wikipedia articles and Wikidata facts. For a detailed description of the datasets, please refer to the upcoming corresponding paper:
Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia. 2025. Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, and Detlef Stolten.
The datasets are released in different versions:
- pre-processed and raw: The pre-processed versions can be used directly for training and evaluating models, while the raw versions can be used to create custom pre-processed versions or for other purposes. Wiki-Quantities is pre-processed for IOB sequence labeling, while Wiki-Measurements is pre-processed for SQuAD-style generative question answering.
- raw, large, small, and tiny versions: The raw version is the original version, which includes all the samples originally obtained. In the large version, all duplicates and near duplicates present in the raw version are removed. The small and tiny versions are subsets of the large version which are additionally filtered to balance the data with respect to units, properties, and topics.
- large, small, large_strict, small_strict, small_context, and large_strict_context versions: The large version contains all examples minus a few duplicates. The small version is a subset of the large version with very similar examples removed. In the context versions, additional sentences are added around the annotated sentence. In the strict versions, the quantitative facts are more strictly aligned with the text.
- silver and gold: Unlike the silver data, the gold data has been manually curated.
The datasets are stored in JSON format. The pre-processed versions are formatted for direct use for IOB sequence labeling or SQuAD-style generative question answering in NLP frameworks such as Huggingface Transformers. In the versions that are not pre-processed, annotations are visualized using emojis to facilitate curation. For example:
"In a 🍏100-gram🍏 reference amount, almonds supply 🍏579 kilocalories🍏 of food energy."
"Extreme heat waves can raise readings to around and slightly above 🍏38 °C🍏, and arctic blasts can drop lows to 🍏−23 °C to 0 °F🍏."
"This sail added another 🍏0.5 kn🍏."
"The 🔭French national census🔭 of 📆2018📆 estimated the 🍊population🍊 of 🌶️Metz🌶️ to be 🍐116,581🍐, while the population of Metz metropolitan area was about 368,000."
"The 🍊surface temperature🍊 of 🌶️Triton🌶️ was 🔭recorded by Voyager 2🔭 as 🍐-235🍐 🍓°C🍓 (-391 °F)."
"🙋The Babylonians🙋 were able to find that the 🍊value🍊 of 🌶️pi🌶️ was ☎️slightly greater than☎️ 🍐3🍐, by simply 🔭making a big circle and then sticking a piece of rope onto the circumference and the diameter, taking note of their distances, and then dividing the circumference by the diameter🔭."
The mapping of annotation types to emojis is as follows:
Note that for each version of Wiki-Measurements, sample IDs are randomly assigned. Therefore, they are not consistent, e.g., between silver small and silver large. The proportions of train, dev, and test sets are unusual because Wiki-Quantities and Wiki-Measurements are intended to be used in conjunction with other non-heuristically generated data.
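Building on the pre-processed format described above, here is a minimal sketch of loading one split; the file path and the field names ("tokens", "labels") are assumptions, and the released files and the paper define the actual directory layout and schema.

```python
# Sketch: load a pre-processed Wiki-Quantities split and inspect one sample.
# The path and field names below are hypothetical; check the released files.
import json

with open("Wiki-Quantities/preprocessed/large/train.json", encoding="utf-8") as f:
    quantities_train = json.load(f)

sample = quantities_train[0]
print(sample.get("tokens"))  # expected: token sequence for IOB sequence labeling
print(sample.get("labels"))  # expected: the corresponding IOB tags
```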
The evaluation directories contain the manually validated random samples used for evaluation. The evaluation is based on the large versions of the datasets. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements showed that 100% of the Wiki-Quantities samples and 94% (or 84% if strictly scored) of the Wiki-Measurements samples were correct.
In accordance with Wikipedia's and Wikidata's licensing terms, the datasets are released under the CC BY-SA 4.0 license, except for Wikidata facts in ./Wiki-Measurements/raw/additional_data.json, which are released under the CC0 1.0 license (the texts are still CC BY-SA 4.0).
We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis, part of Forschungszentrum Jülich. Our interdisciplinary department's research focuses on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate performance, emissions, and costs of energy systems. The results are used for comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government’s greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.
The authors would like to thank the German Federal Government, the German State Governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.
Source for details below: https://zenodo.org/record/3605388#.YWitsdnML0o
Dataset details
Part 1: HTML revision history
The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML).
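A minimal sketch of streaming one of these gzip-compressed revision files, assuming one JSON object per line (one revision per row, as described above); the file name and the "html" field name are assumptions, and the actual keys mirror the original wikitext dump schema.

```python
# Sketch: stream one gzip-compressed revision file and inspect its keys.
# The path below is a hypothetical example of one file inside one of the 558 directories.
import gzip
import json

path = "enwiki-20190301-pages-meta-history1.xml-p10p2101/part-000001.json.gz"
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    for line in f:                      # assumes one JSON object (revision) per line
        revision = json.loads(line)
        print(sorted(revision.keys()))  # revision metadata plus the parsed HTML content
        break
```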
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past.
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
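Since each comment carries labels from several annotators, one common first step is a majority-vote aggregation; the sketch below follows the file and column names documented on the project wiki (attack_annotated_comments.tsv, attack_annotations.tsv, rev_id, attack), but treat them as assumptions and verify against the schema documentation.

```python
# Sketch: aggregate per-worker labels into one per-comment label by majority vote.
# File and column names are taken as assumptions from the project wiki schema.
import pandas as pd

comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t", index_col="rev_id")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

# A comment counts as an attack if more than half of its annotators labeled it as one.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5
comments["attack"] = labels
print(comments["attack"].value_counts())
```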
https://doi.org/10.4121/resource:terms_of_use
This dataset contains 871 articles from Wikipedia (retrieved on 8 August 2016), selected from the list of featured articles (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) in the 'Media', 'Literature and Theater', 'Music biographies', 'Media biographies', 'History biographies' and 'Video gaming' categories. For each article, the structure of the document, i.e. the sections and subsections of the text, is extracted.
The dataset also contains a proposed clustering of the event names to increase the comparability of Wikipedia articles.
https://choosealicense.com/licenses/other/
Dataset Card for "BrightData/Wikipedia-Articles"
Dataset Summary
Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions. Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by the cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings, where articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.
There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.
An additional benefit of not relying on text or links is that we can learn representations for Wikidata items by simply mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other, potentially larger ones. Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.
For detailed documentation, see the wiki page.
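As a rough illustration of the general approach (not the project's actual training pipeline or hyperparameters), each reading session can be treated as a "sentence" of Wikidata item IDs and fed to a standard Word2vec implementation; the sessions below are toy data.

```python
# Sketch: train Word2vec over reading sessions represented as sequences of Wikidata QIDs.
# Toy sessions only; the real corpus and settings are described on the project wiki page.
from gensim.models import Word2Vec

sessions = [
    ["Q42", "Q3107329", "Q11660"],   # one reader's consecutive article views (toy data)
    ["Q42", "Q5", "Q36180"],
]

model = Word2Vec(sentences=sessions, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("Q42", topn=3))
```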
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain statements about demographics and outliers of Wiki-based Communities of Interest.
Group-centric dataset (sample):
{
"title": "winners of Priestley Medal",
"recorded_members": 83,
"topics": ["STEM.Chemistry"],
"demographics": [
"occupation-chemist",
"gender-male",
"citizen-U.S."
],
"outliers": [
{
"reason": "NOT(chemist) unlike 82 recorded members",
"members": [
"Francis Garvan (lawyer, art collector)"
]
},
{
"reason": "NOT(male) unlike 80 recorded members",
"members": [
"Mary L. Good (female)",
"Darleane Hoffman (female)",
"Jacqueline Barton (female)"
]
}
]
}
Subject-centric dataset (sample):
{
"subject": "Serena Williams",
"statements": [
{
"statement": "NOT(sport-basketball) but (tennis) unlike 4 recorded winners of Best Female Athlete ESPY Award.",
"score": 0.36
},
{
"statement": "NOT(occupation-politician) but (tennis player, businessperson, autobiographer) unlike 20 recorded winners of Michigan Women's Hall of Fame.",
"score": 0.17
}
]
}
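A minimal sketch of working with a subject-centric record like the sample above, keeping only the higher-scoring statements; the file name is a placeholder and the score threshold is arbitrary.

```python
# Sketch: load one subject-centric record and filter its statements by score.
# "subject_centric_sample.json" is a placeholder file name.
import json

with open("subject_centric_sample.json", encoding="utf-8") as f:
    record = json.load(f)

strong = [s for s in record["statements"] if s["score"] >= 0.3]
for s in strong:
    print(f'{record["subject"]}: {s["statement"]} (score={s["score"]})')
```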
This data can also be browsed at: https://wikiknowledge.onrender.com/demographics/
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.
This data set includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not the DOIs are biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813
This dataset also includes a listing from Crossref of DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data sets is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636
An article based on this data was published in PLOS One:
Maggio LA, Willinsky JM, Steinberg RM, Mietchen D, Wass JL, Dong T. Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PloS one. 2017 Dec 21;12(12):e0190046.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046
- Spreadsheet summaries of identifier availability and correctness in Wikipedia
- Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes
- Investigation of John W. Huffman cannabinoid dataset
- Summary of Wikipedia pages linked to DSSTox records
- Complete identifier data scraped from Wikipedia Chembox and Drugbox pages
This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).