100+ datasets found
  1. simple-wiki

    • huggingface.co
    Cite
    Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "simple-wiki"

      Dataset Summary
    

    This dataset contains pairs of equivalent sentences obtained from Wikipedia.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" and a list with the sentences as "value". {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
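
    A minimal sketch of reading those pairs with the Hugging Face datasets library, assuming the "set" field described above (verify against the dataset card):

    from datasets import load_dataset

    # Each example is a dict like {"set": [sentence_a, sentence_b, ...]}.
    ds = load_dataset("embedding-data/simple-wiki", split="train")

    for example in ds.select(range(3)):
        sentences = example["set"]          # list of equivalent sentences
        print(sentences[0], "<->", sentences[1])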

  2. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    Cite
    Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Explore at:
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors
    Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  3. Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...

    • figshare.com
    txt
    Updated Mar 14, 2021
    Cite
    KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    figshare
    Authors
    KayYen Wong; Diego Saez-Trumper; Miriam Redi
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. The release consists of metadata feature and content text datasets, in the following formats:

    • {template_name}_features.csv
    • {template_name}_difftxt.csv.gz
    • {template_name}_fulltxt.csv.gz

    For more details on the project, the dataset schema, and links to data usage and benchmarking, see https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
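
    As an illustration only, a metadata features file could be inspected with pandas; the template name "pov" below is a hypothetical placeholder for whichever {template_name}_features.csv file you download, and the column schema is documented on the project page:

    import pandas as pd

    # Hypothetical file name; substitute the {template_name}_features.csv you downloaded.
    features = pd.read_csv("pov_features.csv")

    print(features.shape)
    print(features.columns.tolist())   # inspect the metadata feature schema
    print(features.head())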

  4. wikipedia-22-12-simple-embeddings

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    + more versions
    Cite
    Cohere (2023). wikipedia-22-12-simple-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2023
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings
    

    We compute the embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
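
    A minimal sketch of brute-force semantic search over a sample of the stored vectors with numpy; the field names "text" and "emb" are assumptions to check against the dataset card, and a real query vector would come from the same multilingual-22-12 model:

    import numpy as np
    from datasets import load_dataset

    docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

    # Collect a small sample of passages and their precomputed embeddings.
    texts, vectors = [], []
    for i, row in enumerate(docs):
        texts.append(row["text"])
        vectors.append(row["emb"])        # assumed embedding field name
        if i == 999:
            break

    matrix = np.asarray(vectors, dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)   # L2-normalise once

    query = matrix[0]                     # stand-in for a real query embedding
    scores = matrix @ query               # cosine similarity on normalised vectors
    for idx in np.argsort(-scores)[:5]:
        print(round(float(scores[idx]), 3), texts[idx][:80])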

  5. Wikibooks Dataset

    • kaggle.com
    • huggingface.co
    zip
    Updated Oct 22, 2021
    Cite
    Dhruvil Dave (2021). Wikibooks Dataset [Dataset]. https://www.kaggle.com/datasets/dhruvildave/wikibooks-dataset
    Explore at:
    Available download formats: zip (1958715136 bytes)
    Dataset updated
    Oct 22, 2021
    Authors
    Dhruvil Dave
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is the complete dataset of the contents of all the Wikibooks in 12 languages: English, French, German, Spanish, Portuguese, Italian, Russian, Japanese, Dutch, Polish, Hungarian, and Hebrew, each in its own directory. Wikibooks are divided into chapters and each chapter has its own webpage. This dataset can be used for tasks like Machine Translation, Text Generation, Text Parsing, and Semantic Understanding of Natural Language. Body contents are provided both as newline-delimited plain text, as it would be visible on the page, and as HTML for better semantic parsing.

    Refer to the starter notebook: Starter: Wikibooks dataset

    Data as of October 22, 2021.

    Image Credits: Unsplash - itfeelslikefilm

  6. Wikipedia Article Topics for All Languages (based on article outlinks)

    • figshare.com
    bz2
    Updated Jul 20, 2021
    Cite
    Isaac Johnson (2021). Wikipedia Article Topics for All Languages (based on article outlinks) [Dataset]. http://doi.org/10.6084/m9.figshare.12619766.v3
    Explore at:
    Available download formats: bz2
    Dataset updated
    Jul 20, 2021
    Dataset provided by
    figshare
    Authors
    Isaac Johnson
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks, i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be for other snapshots, as indicated by the filename.

    The data is bzip-compressed. Each row is tab-delimited and contains the following metadata, followed by the predicted probability (rounded to three decimal places to reduce filesize) that each topic in this taxonomy applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

    • wiki_db: which Wikipedia language edition the article belongs to, e.g., enwiki == English Wikipedia
    • qid: the article's Wikidata item ID, if it has one, e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
    • pid: the page ID of the article, e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
    • num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction. This count is taken after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links, i.e. only retaining links to namespace-0 articles in the same wiki that have associated Wikidata IDs. It is mainly provided to give a sense of how much data the prediction is based upon.

    For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

    Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. The sample includes 201,196 Wikidata IDs, which led to 340,290 articles.
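
    A minimal sketch for scanning the bzip-compressed TSV in Python; the file name is a hypothetical placeholder, and whether a header row is present should be checked against the downloaded snapshot:

    import bz2
    import csv

    # Hypothetical file name; use the snapshot you downloaded.
    with bz2.open("article_topics_outlinks.tsv.bz2", "rt", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)            # assumed: wiki_db, qid, pid, num_outlinks, then one column per topic
        topic_names = header[4:]
        for row in reader:
            wiki_db, qid, pid, num_outlinks = row[:4]
            scores = dict(zip(topic_names, map(float, row[4:])))
            top_topic = max(scores, key=scores.get)
            print(wiki_db, qid, pid, top_topic, scores[top_topic])
            break                        # remove to scan the whole file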

  7. It/Fr/En-Wiki-100 datasets

    • data.niaid.nih.gov
    Updated Oct 24, 2022
    Cite
    Andrea Albarelli (2022). It/Fr/En-Wiki-100 datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7244892
    Explore at:
    Dataset updated
    Oct 24, 2022
    Dataset provided by
    Andrea Albarelli
    Alessandro Zangari
    Andrea Gasparetto
    Matteo Marcuzzo
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These 3 datasets are derived from the Italian (ItWiki-100), French (FrWiki-100) and English (EnWiki-100) Wikipedia dumps, with articles tagged with their related portals (the 100 most common portals per language).

    If you use this data you may cite these works:

    Gasparetto A, Marcuzzo M, Zangari A, Albarelli A. (2022) A Survey on Text Classification Algorithms: From Text to Predictions. Information 13, no. 2: 83. https://doi.org/10.3390/info13020083

    Gasparetto A, Zangari A, Marcuzzo M, Albarelli A. (2022) A survey on text classification: Practical perspectives on the Italian language. PLOS ONE 17(7): e0270904. https://doi.org/10.1371/journal.pone.0270904

  8. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +1more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
    Available download formats: tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Wikipedia is the largest and most read free online encyclopedia currently in existence. As such, Wikipedia offers a large amount of data on all of its own contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists or data scientists.

    There are a total of 9 files, all in tsv format, built under a relational structure. The main file, which acts as the core of the dataset, is the page file; around it there are 4 files with different entities related to the Wikipedia pages (the category, url, pub and page_property files) and 4 further files that act as "intermediate tables", making it possible to connect the pages both with those entities and with other pages (the page_category, page_url, page_pub and page_link files).
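
    A minimal sketch of joining the relational tsv files with pandas; the key column names ("page_id", "category_id") are assumptions, so check the Dataset_summary document mentioned below for the real schema:

    import pandas as pd

    pages = pd.read_csv("page.tsv", sep="\t")
    categories = pd.read_csv("category.tsv", sep="\t")
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    # Join pages to their categories through the intermediate table
    # (assumed key columns: page_id and category_id).
    page_with_cats = (page_category
                      .merge(pages, on="page_id")
                      .merge(categories, on="category_id"))
    print(page_with_cats.head())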

    The document Dataset_summary includes a detailed description of the dataset.

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  9. wikipedia-data-en-2023-11

    • huggingface.co
    Updated Dec 1, 2023
    + more versions
    Cite
    Mixedbread (2023). wikipedia-data-en-2023-11 [Dataset]. https://huggingface.co/datasets/mixedbread-ai/wikipedia-data-en-2023-11
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2023
    Dataset authored and provided by
    Mixedbread
    Description

    The mixedbread-ai/wikipedia-data-en-2023-11 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  10. Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their...

    • zenodo.org
    bin, zip
    Updated Feb 12, 2025
    Cite
    Jan Göpfert; Patrick Kuckertz; Jann M. Weinand; Detlef Stolten (2025). Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia [Dataset]. http://doi.org/10.5281/zenodo.14858280
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Göpfert; Patrick Kuckertz; Jann M. Weinand; Detlef Stolten
    Description

    The task of measurement extraction is typically approached in a pipeline manner, where 1) quantities are identified before 2) their individual measurement context is extracted (see our review paper). To support the development and evaluation of systems for measurement extraction, we present two large datasets that correspond to the two tasks:

    • Wiki-Quantities, a dataset for identifying quantities, and
    • Wiki-Measurements, a dataset for extracting measurement context for given quantities.

    The datasets are heuristically generated from Wikipedia articles and Wikidata facts. For a detailed description of the datasets, please refer to the upcoming corresponding paper:

    Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia. 2025. Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, and Detlef Stolten.

    Versions

    The datasets are released in different versions:

    • Processing level: the pre-processed versions can be used directly for training and evaluating models, while the raw versions can be used to create custom pre-processed versions or for other purposes. Wiki-Quantities is pre-processed for IOB sequence labeling, while Wiki-Measurements is pre-processed for SQuAD-style generative question answering.
    • Filtering level:
      • Wiki-Quantities is available in a raw, large, small, and tiny version: The raw version is the original version, which includes all the samples originally obtained. In the large version, all duplicates and near duplicates present in the raw version are removed. The small and tiny versions are subsets of the large version which are additionally filtered to balance the data with respect to units, properties, and topics.
      • Wiki-Measurements is available in a large, small, large_strict, small_strict, small_context, and large_strict_context version: The large version contains all examples minus a few duplicates. The small version is a subset of the large version with very similar examples removed. In the context versions, additional sentences are added around the annotated sentence. In the strict versions, the quantitative facts are more strictly aligned with the text.
    • Quality: all data has been automatically annotated using heuristics. In contrast to the silver data, the gold data has been manually curated.

    Format

    The datasets are stored in JSON format. The pre-processed versions are formatted for direct use in IOB sequence labeling or SQuAD-style generative question answering with NLP frameworks such as Huggingface Transformers. In the raw (not pre-processed) versions of the datasets, annotations are visualized using emojis to facilitate curation. For example:

    • Wiki-Quantities (only quantities annotated):
      • "In a 🍏100-gram🍏 reference amount, almonds supply 🍏579 kilocalories🍏 of food energy."
      • "Extreme heat waves can raise readings to around and slightly above 🍏38 °C🍏, and arctic blasts can drop lows to 🍏−23 °C to 0 °F🍏."
      • "This sail added another 🍏0.5 kn🍏."
    • Wiki-Measurements (measurement context for a single quantity; qualifiers and quantity modifiers are only sparsely annotated):
      • "The 🔭French national census🔭 of 📆2018📆 estimated the 🍊population🍊 of 🌶️Metz🌶️ to be 🍐116,581🍐, while the population of Metz metropolitan area was about 368,000."
      • "The 🍊surface temperature🍊 of 🌶️Triton🌶️ was 🔭recorded by Voyager 2🔭 as 🍐-235🍐 🍓°C🍓 (-391 °F)."
      • "🙋The Babylonians🙋 were able to find that the 🍊value🍊 of 🌶️pi🌶️ was ☎️slightly greater than☎️ 🍐3🍐, by simply 🔭making a big circle and then sticking a piece of rope onto the circumference and the diameter, taking note of their distances, and then dividing the circumference by the diameter🔭."

    The mapping of annotation types to emojis is as follows:

    • Basic quantitative statement:
      • Entity: 🌶️
      • Property: 🍊
      • Quantity: 🍏
      • Value: 🍐
      • Unit: 🍓
      • Quantity modifier: ☎️
    • Qualifier:
      • Temporal scope: 📆
      • Start time: ⏱️
      • End time: ⏰️
      • Location: 📍
      • Reference: 🙋
      • Determination method: 🔭
      • Criterion used: 📏
      • Applies to part: 🦵
      • Scope: 🔎
      • Some qualifier: 🛁

    Note that for each version of Wiki-Measurements sample IDs are randomly assigned. Therefore, they are not consistent, e.g., between silver small and silver large. The proportions of train, dev, and test sets are unusual because Wiki-Quantities and Wiki-Measurements are intended to be used in conjunction with other non-heuristically generated data.
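
    A minimal sketch of pulling the 🍏-delimited quantity spans out of the emoji-annotated raw text; for model training, the pre-processed IOB and SQuAD-style versions would normally be used directly:

    import re

    QUANTITY = re.compile(r"🍏(.+?)🍏")

    sentence = ("In a 🍏100-gram🍏 reference amount, almonds supply "
                "🍏579 kilocalories🍏 of food energy.")

    print(QUANTITY.findall(sentence))     # ['100-gram', '579 kilocalories']
    print(sentence.replace("🍏", ""))      # plain sentence with the markers stripped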

    Evaluation

    The evaluation directories contain the manually validated random samples used for evaluation. The evaluation is based on the large versions of the datasets. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements showed that 100% of the Wiki-Quantities samples and 94% (or 84% if strictly scored) of the Wiki-Measurements samples were correct.

    License

    In accordance with Wikipedia's and Wikidata's licensing terms, the datasets are released under the CC BY-SA 4.0 license, except for Wikidata facts in ./Wiki-Measurements/raw/additional_data.json, which are released under the CC0 1.0 license (the texts are still CC BY-SA 4.0).

    About Us

    We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis, belonging to the Forschungszentrum Jülich. Our interdisciplinary department's research focuses on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate performance, emissions and costs of energy systems. The results are used for performing comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government's greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.

    Acknowledgements

    The authors would like to thank the German Federal Government, the German State Governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".

  11. Plaintext Wikipedia dump 2018 - Dataset - B2FIND

    • b2find.eudat.eu
    Cite
    Plaintext Wikipedia dump 2018 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3074cb26-6a0d-5803-8520-d0050a22c66e
    Explore at:
    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at https://dumps.wikimedia.org/. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see https://meta.wikimedia.org/wiki/List_of_Wikipedias.

    The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (though one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.

    The WikiExtractor tool (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs (https://github.com/ptakopysk/wikiextractor).

  12. Wikipedia Change Metadata

    • redivis.com
    application/jsonl +7
    Updated Sep 22, 2021
    Cite
    Stanford Graduate School of Business Library (2021). Wikipedia Change Metadata [Dataset]. https://redivis.com/datasets/1ky2-8b1pvrv76
    Explore at:
    Available download formats: application/jsonl, avro, parquet, arrow, spss, stata, csv, sas
    Dataset updated
    Sep 22, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Graduate School of Business Library
    Time period covered
    Jan 16, 2001 - Mar 1, 2019
    Description

    Abstract

    The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.

    Documentation

    Source for details below: https://zenodo.org/record/3605388#.YWitsdnML0o

    Dataset details

    Part 1: HTML revision history

    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision's wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format

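    A minimal sketch of reading one of the gzip-compressed JSON files described above, assuming one JSON object per line (the file name is a hypothetical placeholder; check a downloaded file for the exact layout):

    import gzip
    import json

    path = "enwiki-20190301-pages-meta-history1.xml-p1p2062/revisions-0001.json.gz"   # hypothetical

    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            revision = json.loads(line)
            print(revision["page_id"], revision["id"], revision["timestamp"])
            # revision["html"] holds the parsed HTML content of this revision.
            break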

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created


    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)


  13. Wikipedia Talk Labels: Personal Attacks

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Personal Attacks [Dataset]. http://doi.org/10.6084/m9.figshare.4054689.v6
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
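
    A minimal sketch of combining the comments with their crowd labels in pandas; the file names and the "rev_id"/"attack" columns are assumptions based on the figshare listing, so check the schema documentation on the project wiki:

    import pandas as pd

    # Assumed file names from the figshare record.
    comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t", index_col="rev_id")
    annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

    # Majority vote across annotators: a comment counts as a personal attack
    # if more than half of its labels flag it.
    is_attack = annotations.groupby("rev_id")["attack"].mean() > 0.5
    comments["attack"] = is_attack
    print(comments["attack"].value_counts())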

  14. Title and subtitles of Wikipedia articles

    • data.4tu.nl
    • figshare.com
    zip
    Updated Jun 6, 2017
    Cite
    David Sanchez-Charles (2017). Title and subtitles of Wikipedia articles [Dataset]. http://doi.org/10.4121/uuid:61fb9665-40ab-4b70-8214-767c521cc950
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2017
    Dataset provided by
    4TU.Centre for Research Data
    Authors
    David Sanchez-Charles
    License

    https://doi.org/10.4121/resource:terms_of_use

    Description

    This dataset contains 871 articles from Wikipedia (retrieved on 8th August 2016), selected from the list of featured articles (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) of the 'Media', 'Literature and Theater', 'Music biographies', 'Media biographies', 'History biographies' and 'Video gaming' categories. From the list of articles, the structure of the document, i.e. sections and subsections of the text, is extracted.

    The dataset also contains a proposed clusterization of the event names to increase comparability of Wikipedia articles.

  15. Wikipedia-Articles

    • huggingface.co
    Cite
    Bright Data, Wikipedia-Articles [Dataset]. https://huggingface.co/datasets/BrightData/Wikipedia-Articles
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Bright Data
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "BrightData/Wikipedia-Articles"

      Dataset Summary
    

    Explore a collection of millions of Wikipedia articles with the Wikipedia dataset, comprising over 1.23M structured records and 10 data fields updated and refreshed regularly. Each entry includes all major data points such as timestamp, URLs, article titles, raw and cataloged text, images, "see also" references, external links, and a structured table of contents. For a complete list of data points, please… See the full description on the dataset page: https://huggingface.co/datasets/BrightData/Wikipedia-Articles.

  16. Wikipedia Navigation Vectors

    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Ellery Wulczyn (2023). Wikipedia Navigation Vectors [Dataset]. http://doi.org/10.6084/m9.figshare.3146878.v6
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions.

    Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings, where articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.

    There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.

    An additional benefit of not relying on text or links is that we can learn representations for Wikidata items by simply mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other, potentially larger ones. Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.

    For detailed documentation, see the wiki page.
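
    A minimal sketch, assuming the downloaded vectors are stored in the standard word2vec text format that gensim can read (check the files inside the zip for their actual names and layout):

    from gensim.models import KeyedVectors

    # Hypothetical file name; use the vector file extracted from the figshare archive.
    vectors = KeyedVectors.load_word2vec_format("wikidata_item_vectors.txt", binary=False)

    # Nearest neighbours by cosine similarity, e.g. for the Wikidata item of Douglas Adams (Q42).
    for item, score in vectors.most_similar("Q42", topn=5):
        print(item, round(score, 3))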

  17. Data from: Wiki-based Communities of Interest: Demographics and Outliers

    • zenodo.org
    bin
    Updated Jan 15, 2023
    Cite
    Hiba Arnaout; Simon Razniewski; Jeff Z. Pan (2023). Wiki-based Communities of Interest: Demographics and Outliers [Dataset]. http://doi.org/10.5281/zenodo.7537200
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hiba Arnaout; Simon Razniewski; Jeff Z. Pan
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets contain statements about the demographics and outliers of Wiki-based Communities of Interest.

    Group-centric dataset (sample):

    {
      "title": "winners of Priestley Medal", 
      "recorded_members": 83, 
      "topics": ["STEM.Chemistry"], 
      "demographics": [
          "occupation-chemist",
          "gender-male", 
          "citizen-U.S."
      ], 
      "outliers": [
        {
          "reason": "NOT(chemist) unlike 82 recorded members", 
          "members": [
          "Francis Garvan (lawyer, art collector)"
          ]
        }, 
        {
          "reason": "NOT(male) unlike 80 recorded members", 
          "members": [
          "Mary L. Good (female)",
          "Darleane Hoffman (female)", 
          "Jacqueline Barton (female)"
          ]
        }
      ]
    }

    Subject-centric dataset (sample):

    {
      "subject": "Serena Williams", 
      "statements": [
        {
          "statement": "NOT(sport-basketball) but (tennis) unlike 4 recorded winners of Best Female Athlete ESPY Award.", 
          "score": 0.36
        },
      {
          "statement": "NOT(occupation-politician) but (tennis player, businessperson, autobiographer) unlike 20 recorded winners of Michigan Women's Hall of Fame.",
          "score": 0.17
        }
      ]
    }

    This data can also be browsed at: https://wikiknowledge.onrender.com/demographics/
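
    A minimal sketch of scanning the group-centric file for outlier members, assuming it is a JSON array of records shaped like the sample above (the actual file names are listed on the Zenodo record):

    import json

    # Hypothetical file name; use the group-centric file from the Zenodo record.
    with open("group_centric.json", encoding="utf-8") as f:
        groups = json.load(f)             # assumed: a list of group records

    for group in groups:
        for outlier in group.get("outliers", []):
            for member in outlier["members"]:
                print(group["title"], "|", outlier["reason"], "|", member)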

  18. Wikipedia Talk Labels: Toxicity

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Nithum Thain; Lucas Dixon; Ellery Wulczyn (2017). Wikipedia Talk Labels: Toxicity [Dataset]. http://doi.org/10.6084/m9.figshare.4563973.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nithum Thain; Lucas Dixon; Ellery Wulczyn
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.

  19. Data for: Wikipedia as a gateway to biomedical research

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 24, 2020
    Cite
    Wass, Joe (2020). Data for: Wikipedia as a gateway to biomedical research [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_825221
    Explore at:
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    Wass, Joe
    Maggio, Lauren
    Steinberg, Ryan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.

    This data set includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not each DOI is biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813

    This dataset also includes a listing of Crossref DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data sets is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636

    An article based on this data was published in PLOS One:

    Maggio LA, Willinsky JM, Steinberg RM, Mietchen D, Wass JL, Dong T. Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PloS one. 2017 Dec 21;12(12):e0190046.

    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190046

  20. Data from: Wikipedia on the CompTox Chemicals Dashboard: Connecting...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 3, 2022
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data [Dataset]. https://catalog.data.gov/dataset/wikipedia-on-the-comptox-chemicals-dashboard-connecting-resources-to-enrich-public-chemica
    Explore at:
    Dataset updated
    Nov 3, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    • Spreadsheet summaries of identifier availability and correctness in Wikipedia
    • Tabular summaries of identifier availability and correctness in Wikipedia; summary statistics of drugboxes and chemboxes
    • Investigation of John W. Huffman cannabinoid dataset
    • Summary of Wikipedia pages linked to DSSTox records
    • Complete identifier data scraped from Wikipedia Chembox and Drugbox pages

    This dataset is associated with the following publication: Sinclair, G., I. Thillainadarajah, B. Meyer, V. Samano, S. Sivasupramaniam, L. Adams, E. Willighagen, A. Richard, M. Walker, and A. Williams. Wikipedia on the CompTox Chemicals Dashboard: Connecting Resources to Enrich Public Chemical Data. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 62(20): 4888-4905, (2022).
