MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains a pair of equivalent sentences and is formatted as a dictionary with the key "set" whose value is the list of sentences. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
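As a rough illustration of that structure, here is a minimal sketch of loading the pairs, assuming the Hugging Face datasets library and the dataset ID shown above; the exact field layout should be checked against the dataset card.

```python
# Minimal sketch: load the sentence pairs and inspect the "set" field.
# Assumes the Hugging Face `datasets` package and the dataset ID shown above.
from datasets import load_dataset

dataset = load_dataset("embedding-data/simple-wiki", split="train")

example = dataset[0]
# Each example is expected to look like {"set": ["sentence A", "an equivalent sentence B"]}.
print(example["set"])

# Build (anchor, positive) pairs for Sentence Transformers-style training.
pairs = [(row["set"][0], row["set"][1]) for row in dataset if len(row["set"]) >= 2]
print(f"{len(pairs)} sentence pairs")
```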
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, with the formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
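For orientation, a minimal pandas sketch of reading these files; "pov" is only a hypothetical template name here, and the actual column schemas are documented on the project page linked above.

```python
# Sketch: read one template's feature table and gzipped full-text table.
# "pov" is an illustrative template name; substitute the template you need.
import pandas as pd

features = pd.read_csv("pov_features.csv")
fulltxt = pd.read_csv("pov_fulltxt.csv.gz", compression="gzip")

print(features.shape, list(features.columns)[:10])
print(fulltxt.shape)
```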
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
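A minimal sketch of dot-product semantic search over the precomputed passage embeddings; the split name and the embedding column name ("emb") are assumptions that should be checked against the dataset card, and the query must be embedded with the same multilingual-22-12 model (e.g. via the Cohere API) before searching.

```python
# Sketch: rank passages by dot-product similarity against a query embedding.
# Assumes the embedding column is named "emb"; verify field names on the dataset card.
import numpy as np
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
doc_embeddings = np.asarray(docs["emb"], dtype=np.float32)

def search(query_embedding: np.ndarray, top_k: int = 5):
    # query_embedding must come from the same multilingual-22-12 model.
    scores = doc_embeddings @ query_embedding
    best = np.argsort(-scores)[:top_k]
    return [(docs[int(i)]["title"], docs[int(i)]["text"], float(scores[i])) for i in best]
```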
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is the complete dataset of contents of all the Wikibooks in 12 languages. The content contains books in the following languages: English, French, German, Spanish, Portuguese, Italian, Russian, Japanese, Dutch, Polish, Hungarian, and Hebrew; each language has its own directory. Wikibooks are divided into chapters, and each chapter has its own webpage. This dataset can be used for tasks like Machine Translation, Text Generation, Text Parsing, and Semantic Understanding of Natural Language. Body contents are provided both as newline-delimited text, as it would be visible on the page, and as HTML for better semantic parsing.
Refer to the starter notebook: Starter: Wikibooks dataset
Data as of October 22, 2021.
Image Credits: Unsplash - itfeelslikefilm
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be for other snapshots as indicated by the filename.
The data is bzip-compressed, and each row is tab-delimited and contains the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of the topics in the taxonomy (https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy) applies to the article:
* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
* qid: if the article has a Wikidata item, what ID it is -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace-0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.
For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance
Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. It includes 201,196 Wikidata IDs which led to 340,290 articles.
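A minimal sketch of streaming one of the bzip-compressed TSV files; the file name is a placeholder, and the assumption that the columns after num_outlinks are per-topic probabilities should be verified against the taxonomy page linked above.

```python
# Sketch: stream the bzip2-compressed, tab-delimited predictions file.
# "topic_predictions.tsv.bz2" is a placeholder file name for one of the dump files.
import bz2
import csv

with bz2.open("topic_predictions.tsv.bz2", mode="rt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)  # assumes a header row; drop this line if there is none
    for row in reader:
        wiki_db, qid, pid, num_outlinks = row[:4]
        topic_scores = [float(x) for x in row[4:]]  # assumed: one probability per topic
        print(wiki_db, qid, pid, num_outlinks, topic_scores[:3])
        break
```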
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 3 datasets are derived from the Italian (ItWiki-100), French (FrWiki-100) and English (EnWiki-100) Wikipedia dumps, with articles tagged with related portals (the 100 most common per language).
If you use this data you may cite these works:
Gasparetto A, Marcuzzo M, Zangari A, Albarelli A. (2022) A Survey on Text Classification Algorithms: From Text to Predictions. Information 13, no. 2: 83. https://doi.org/10.3390/info13020083
Gasparetto A, Zangari A, Marcuzzo M, Albarelli A. (2022) A survey on text classification: Practical perspectives on the Italian language. PLOS ONE 17(7): e0270904. https://doi.org/10.1371/journal.pone.0270904
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikipedia is the largest and most widely read free online encyclopedia in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting data from various sources and processing it, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful for a wide range of researchers, such as informetricians, sociologists, or data scientists.
There are a total of 9 files, all in TSV format, built under a relational structure. The main one, which acts as the core of the dataset, is the page file; it is followed by 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with those entities and with each other (the page_category, page_url, page_pub and page_link files).
The document Dataset_summary includes a detailed description of the dataset.
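A minimal pandas sketch of joining the relational TSV files; the file and column names used here (page.tsv, page_id, category_id) are illustrative assumptions, and the actual schema is given in the Dataset_summary document.

```python
# Sketch: join pages to their categories via the page_category intermediate table.
# File and column names ("page.tsv", "page_id", "category_id") are assumptions.
import pandas as pd

page = pd.read_csv("page.tsv", sep="\t")
category = pd.read_csv("category.tsv", sep="\t")
page_category = pd.read_csv("page_category.tsv", sep="\t")

pages_with_categories = (
    page.merge(page_category, on="page_id")
        .merge(category, on="category_id")
)
print(pages_with_categories.head())
```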
Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
mixedbread-ai/wikipedia-data-en-2023-11 dataset hosted on Hugging Face and contributed by the HF Datasets community
The task of measurement extraction is typically approached in a pipeline manner, where 1) quantities are identified before 2) their individual measurement context is extracted (see our review paper). To support the development and evaluation of systems for measurement extraction, we present two large datasets that correspond to the two tasks:
The datasets are heuristically generated from Wikipedia articles and Wikidata facts. For a detailed description of the datasets, please refer to the upcoming corresponding paper:
Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia. 2025. Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, and Detlef Stolten.
The datasets are released in different versions:
- pre-processed and raw: The pre-processed versions can be used directly for training and evaluating models, while the raw versions can be used to create custom pre-processed versions or for other purposes. Wiki-Quantities is pre-processed for IOB sequence labeling, while Wiki-Measurements is pre-processed for SQuAD-style generative question answering.
- raw, large, small, and tiny versions: The raw version is the original version, which includes all the samples originally obtained. In the large version, all duplicates and near duplicates present in the raw version are removed. The small and tiny versions are subsets of the large version which are additionally filtered to balance the data with respect to units, properties, and topics.
- large, small, large_strict, small_strict, small_context, and large_strict_context versions: The large version contains all examples minus a few duplicates. The small version is a subset of the large version with very similar examples removed. In the context versions, additional sentences are added around the annotated sentence. In the strict versions, the quantitative facts are more strictly aligned with the text.
- silver and gold: Unlike the silver data, the gold data has been manually curated.
The datasets are stored in JSON format. The pre-processed versions are formatted for direct use for IOB sequence labeling or SQuAD-style generative question answering in NLP frameworks such as Huggingface Transformers. In the versions that are not pre-processed, annotations are visualized using emojis to facilitate curation. For example:
"In a 🍏100-gram🍏 reference amount, almonds supply 🍏579 kilocalories🍏 of food energy."
"Extreme heat waves can raise readings to around and slightly above 🍏38 °C🍏, and arctic blasts can drop lows to 🍏−23 °C to 0 °F🍏."
"This sail added another 🍏0.5 kn🍏."
"The 🔭French national census🔭 of 📆2018📆 estimated the 🍊population🍊 of 🌶️Metz🌶️ to be 🍐116,581🍐, while the population of Metz metropolitan area was about 368,000."
"The 🍊surface temperature🍊 of 🌶️Triton🌶️ was 🔭recorded by Voyager 2🔭 as 🍐-235🍐 🍓°C🍓 (-391 °F)."
"🙋The Babylonians🙋 were able to find that the 🍊value🍊 of 🌶️pi🌶️ was ☎️slightly greater than☎️ 🍐3🍐, by simply 🔭making a big circle and then sticking a piece of rope onto the circumference and the diameter, taking note of their distances, and then dividing the circumference by the diameter🔭."
The mapping of annotation types to emojis is as follows:
Note that for each version of Wiki-Measurements, sample IDs are randomly assigned. Therefore, they are not consistent, e.g., between silver small and silver large. The proportions of train, dev, and test sets are unusual because Wiki-Quantities and Wiki-Measurements are intended to be used in conjunction with other non-heuristically generated data.
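Building on the pre-processed format described above, here is a minimal sketch of loading one split; the file path and the field names ("tokens", "labels") are assumptions, and the released files and the paper define the actual directory layout and schema.

```python
# Sketch: load a pre-processed Wiki-Quantities split and inspect one sample.
# The path and field names below are hypothetical; check the released files.
import json

with open("Wiki-Quantities/preprocessed/large/train.json", encoding="utf-8") as f:
    quantities_train = json.load(f)

sample = quantities_train[0]
print(sample.get("tokens"))  # expected: token sequence for IOB sequence labeling
print(sample.get("labels"))  # expected: the corresponding IOB tags
```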
The evaluation directories contain the manually validated random samples used for evaluation. The evaluation is based on the large versions of the datasets. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements showed that 100% of the Wiki-Quantities samples and 94% (or 84% if strictly scored) of the Wiki-Measurements samples were correct.
In accordance with Wikipedia's and Wikidata's licensing terms, the datasets are released under the CC BY-SA 4.0 license, except for Wikidata facts in ./Wiki-Measurements/raw/additional_data.json, which are released under the CC0 1.0 license (the texts are still CC BY-SA 4.0).
We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis, part of Forschungszentrum Jülich. Our interdisciplinary department's research focuses on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate performance, emissions, and costs of energy systems. The results are used for comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government’s greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.
The authors would like to thank the German Federal Government, the German State Governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.
Source for details below: https://zenodo.org/record/3605388#.YWitsdnML0o
Dataset details
Part 1: HTML revision history
The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML).
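A minimal sketch of streaming one of these gzip-compressed revision files, assuming one JSON object per line (one revision per row, as described above); the file name and the "html" field name are assumptions, and the actual keys mirror the original wikitext dump schema.

```python
# Sketch: stream one gzip-compressed revision file and inspect its keys.
# The path below is a hypothetical example of one file inside one of the 558 directories.
import gzip
import json

path = "enwiki-20190301-pages-meta-history1.xml-p10p2101/part-000001.json.gz"
with gzip.open(path, mode="rt", encoding="utf-8") as f:
    for line in f:                      # assumes one JSON object (revision) per line
        revision = json.loads(line)
        print(sorted(revision.keys()))  # revision metadata plus the parsed HTML content
        break
```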
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past.
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
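Since each comment carries labels from several annotators, one common first step is a majority-vote aggregation; the sketch below follows the file and column names documented on the project wiki (attack_annotated_comments.tsv, attack_annotations.tsv, rev_id, attack), but treat them as assumptions and verify against the schema documentation.

```python
# Sketch: aggregate per-worker labels into one per-comment label by majority vote.
# File and column names are taken as assumptions from the project wiki schema.
import pandas as pd

comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t", index_col="rev_id")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

# A comment counts as an attack if more than half of its annotators labeled it as one.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5
comments["attack"] = labels
print(comments["attack"].value_counts())
```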
https://doi.org/10.4121/resource:terms_of_use
This dataset contains 871 articles from Wikipedia (retrieved on 8 August 2016), selected from the list of featured articles (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) in the 'Media', 'Literature and Theater', 'Music biographies', 'Media biographies', 'History biographies' and 'Video gaming' categories. For each article, the structure of the document, i.e. the sections and subsections of the text, is extracted.
The dataset also contains a proposed clustering of the event names to increase the comparability of Wikipedia articles.
https://choosealicense.com/licenses/other/
Dataset Card for "BrightData/Wikipedia-Articles"
Dataset Summary
Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions. Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by the cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings, where articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.
There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.
An additional benefit of not relying on text or links is that we can learn representations for Wikidata items by simply mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other, potentially larger ones. Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.
For detailed documentation, see the wiki page.
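As a rough illustration of the general approach (not the project's actual training pipeline or hyperparameters), each reading session can be treated as a "sentence" of Wikidata item IDs and fed to a standard Word2vec implementation; the sessions below are toy data.

```python
# Sketch: train Word2vec over reading sessions represented as sequences of Wikidata QIDs.
# Toy sessions only; the real corpus and settings are described on the project wiki page.
from gensim.models import Word2Vec

sessions = [
    ["Q42", "Q3107329", "Q11660"],   # one reader's consecutive article views (toy data)
    ["Q42", "Q5", "Q36180"],
]

model = Word2Vec(sentences=sessions, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("Q42", topn=3))
```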
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain statements about demographics and outliers of Wiki-based Communities of Interest.
Group-centric dataset (sample):
{
"title": "winners of Priestley Medal",
"recorded_members": 83,
"topics": ["STEM.Chemistry"],
"demographics": [
"occupation-chemist",
"gender-male",
"citizen-U.S."
],
"outliers": [
{
"reason": "NOT(chemist) unlike 82 recorded members",
"members": [
"Francis Garvan (lawyer, art collector)"
]
},
{
"reason": "NOT(male) unlike 80 recorded members",
"members": [
"Mary L. Good (female)",
"Darleane Hoffman (female)",
"Jacqueline Barton (female)"
]
}
]
}
Subject-centric dataset (sample):
{
"subject": "Serena Williams",
"statements": [
{
"statement": "NOT(sport-basketball) but (tennis) unlike 4 recorded winners of Best Female Athlete ESPY Award.",
"score": 0.36
},
{
"statement": "NOT(occupation-politician) but (tennis player, businessperson, autobiographer) unlike 20 recorded winners of Michigan Women's Hall of Fame.",
"score": 0.17
}
]
}
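A minimal sketch of working with a subject-centric record like the sample above, keeping only the higher-scoring statements; the file name is a placeholder and the score threshold is arbitrary.

```python
# Sketch: load one subject-centric record and filter its statements by score.
# "subject_centric_sample.json" is a placeholder file name.
import json

with open("subject_centric_sample.json", encoding="utf-8") as f:
    record = json.load(f)

strong = [s for s in record["statements"] if s["score"] >= 0.3]
for s in strong:
    print(f'{record["subject"]}: {s["statement"]} (score={s["score"]})')
```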
This data can also be browsed at: https://wikiknowledge.onrender.com/demographics/
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.
This data set includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not the DOIs are biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813
This dataset also includes a listing from Crossref of DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data sets is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636
An article based on this data was published in PLOS One:
Maggio LA, Willinsky JM, Steinberg RM, Mietchen D, Wass JL, Dong T. Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PloS one. 2017 Dec 21;12(12):e0190046.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046
- Spreadsheet summaries of identifier availability and correctness in Wikipedia
- Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes
- Investigation of John W. Huffman cannabinoid dataset
- Summary of Wikipedia pages linked to DSSTox records
- Complete identifier data scraped from Wikipedia Chembox and Drugbox pages
This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).