100+ datasets found
  1. speech-wikimedia

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    MLCommons (2023). speech-wikimedia [Dataset]. https://huggingface.co/datasets/MLCommons/speech-wikimedia
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    MLCommons
    Description

    Dataset Card for Speech Wikimedia

      Dataset Summary
    

    The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons, licensed for academic and commercial use under CC and public-domain terms. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file should have one or more transcriptions in different languages.

      Transcription languages
    

    English German… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/speech-wikimedia.

  2. wikipedia-small-3000-embedded

    • huggingface.co
    Cite
    Not Lain, wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Authors
    Not Lain
    License

    https://choosealicense.com/licenses/gfdl/

    Description

    This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

        from datasets import load_dataset, Dataset
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

        # load dataset in streaming mode (no download and it's fast)
        dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

        # select 3000 samples
        from tqdm import tqdm
        data = Dataset.from_dict({})
        for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
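The listing truncates the sampling loop, so the following is only a minimal sketch of the general pattern (take the first N entries of a stream, then attach an embedding to each), not the author's actual code. The helper names `take_samples` and `embed_entries`, the `"text"` and `"embedding"` keys, and the stand-in stream and encoder are all assumptions made for illustration. In real use, `stream` could be the streaming dataset returned by `load_dataset(..., streaming=True)` and `encode` could be `model.encode` from the snippet above.

```python
from itertools import islice


def take_samples(stream, n):
    """Return the first n entries of an iterable without consuming the rest."""
    return list(islice(stream, n))


def embed_entries(entries, encode, text_key="text"):
    """Attach an 'embedding' field to each entry.

    `encode` is any callable mapping a list of strings to a list of vectors.
    """
    texts = [entry[text_key] for entry in entries]
    vectors = encode(texts)
    return [{**entry, "embedding": list(vec)} for entry, vec in zip(entries, vectors)]


# Stand-ins so the sketch runs offline (hypothetical, not part of the dataset code).
def fake_stream():
    i = 0
    while True:  # endless, like a streamed corpus
        yield {"id": i, "text": f"article {i}"}
        i += 1


def fake_encode(texts):
    # One-dimensional "embedding": the text length.
    return [[float(len(t))] for t in texts]


samples = take_samples(fake_stream(), 3000)
embedded = embed_entries(samples[:2], fake_encode)
```

Because `islice` stops after N items, the rest of the stream is never read, which is the point of streaming mode.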

  3. Wikipedia People Page Views Data

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    Netanel M (2024). Wikipedia People Page Views Data [Dataset]. https://www.kaggle.com/datasets/netanelmad/wikipedia-people-page-views-data
    Explore at:
    Available download formats: zip (4438894858 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    Netanel M
    Description

    The data in this dataset is extracted from the BigQuery Wikipedia dataset. It includes:

    1. Monthly page views for all people on Wikipedia (P31 (instance of) = Q5 (human)) for the years 2015-2023.
    2. Wikidata Properties of these pages (NOTE: this information is found to be very messy, missing, and sometimes incorrect. If you wish to have a clean and validated dataset, I recommend checking out the verified people dataset by Laouenan et al. at this link).

    Date accessed: March 2024.

    Below are the queries used to get the dataset:

    ```sql
    -- 1. Get list of people on Wikipedia
    SELECT DISTINCT en_wiki -- page title name in English Wikipedia
    FROM `project.wikipedia_pageviews.wikidata`, UNNEST(instance_of) AS instance_of_struct
    WHERE instance_of_struct.numeric_id = 5 -- instance_of = 5 => person

    -- 2. Get pageview data for those people
    SELECT title, DATETIME_TRUNC(datehour, MONTH) AS month, SUM(views) AS monthly_views
    FROM `project.wikipedia_pageviews.pageviews_20xx` a -- replace xx with desired year
    JOIN `project.data_for_project.distinct_people` b ON a.title = b.en_wiki
    WHERE datehour IS NOT NULL AND wiki = "en"
    GROUP BY title, DATETIME_TRUNC(datehour, MONTH)

    -- 3. Get wikidata for those people
    SELECT *
    FROM `project.wikipedia_pageviews.wikidata`, UNNEST(instance_of) AS instance_of_struct
    WHERE instance_of_struct.numeric_id = 5
    ```
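For readers without BigQuery access, the aggregation in query 2 can be approximated in plain Python. This is a rough sketch under assumptions: the function name `monthly_views`, the `(title, datehour, views)` row shape, and the toy rows are invented for illustration, and the `wiki = "en"` filter is assumed to have been applied upstream.

```python
from collections import defaultdict
from datetime import datetime


def monthly_views(rows, people):
    """Sum hourly views per (title, month), keeping only titles in `people`.

    Mirrors the JOIN + DATETIME_TRUNC(..., MONTH) + SUM(views) in query 2.
    """
    totals = defaultdict(int)
    for title, datehour, views in rows:
        if datehour is None or title not in people:
            continue  # WHERE datehour IS NOT NULL, and the join on people
        month = datehour.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        totals[(title, month)] += views
    return dict(totals)


# Toy rows (made up for illustration only).
rows = [
    ("Ada_Lovelace", datetime(2023, 1, 5, 10), 120),
    ("Ada_Lovelace", datetime(2023, 1, 20, 3), 80),
    ("Ada_Lovelace", datetime(2023, 2, 1, 0), 50),
    ("Paris", datetime(2023, 1, 5, 10), 999),  # not a person: dropped by the join
]
result = monthly_views(rows, people={"Ada_Lovelace"})
```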

  4. Raw Wikipedia

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    Ismael (2024). Raw Wikipedia [Dataset]. https://www.kaggle.com/datasets/ismaeldwikat/wikipedia
    Explore at:
    Available download formats: zip (8575597 bytes)
    Dataset updated
    May 21, 2024
    Authors
    Ismael
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset comprises raw data extracted from Wikipedia, encompassing various types of content including articles, metadata, and user interactions. The dataset is in its unprocessed form, providing an excellent opportunity for data enthusiasts and professionals to engage in data cleaning and preprocessing tasks. It is ideal for those looking to practice and enhance their data cleaning skills, as well as for researchers and developers who require a rich and diverse corpus for natural language processing (NLP) projects.

  5. Wikipedia Talk Corpus

    • figshare.com
    • kaggle.com
    application/x-gzip
    Updated Jan 23, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.4264973.v3
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jan 23, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
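The description says comments come from diffing successive revisions and keeping the added content. A minimal sketch of that idea with Python's standard `difflib` follows; it is an illustration only, not the authors' pipeline (which is documented in their paper), and the function names and toy revisions are assumptions.

```python
import difflib


def added_lines(old_text, new_text):
    """Return the lines that appear in new_text as insertions relative to old_text."""
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


def additions_by_revision(revisions):
    """For each revision text (oldest first), list the content it added."""
    previous = ""
    result = []
    for text in revisions:
        result.append(added_lines(previous, text))
        previous = text
    return result


# Toy talk-page history (made up for illustration).
history = [
    "Hello.",
    "Hello.\nI disagree with this edit.",
    "Hello.\nI disagree with this edit.\nFair point, reverting.",
]
additions = additions_by_revision(history)
```

Each inner list holds the text a single revision contributed, which is roughly what one "comment" would be in this simplified view.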

  6. Wikipedia Link Graph Dataset - 100K Pages

    • kaggle.com
    zip
    Updated Dec 4, 2025
    Cite
    Kutay Şahin (2025). Wikipedia Link Graph Dataset - 100K Pages [Dataset]. https://www.kaggle.com/datasets/kutayahin/wikipedia-link-graph-100k
    Explore at:
    Available download formats: zip (908552367 bytes)
    Dataset updated
    Dec 4, 2025
    Authors
    Kutay Şahin
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    A comprehensive Wikipedia dataset containing 100,000 pages with 28.9 million links, collected using a breadth-first search crawling algorithm. This dataset includes complete page metadata, link relationships, and a network graph representation suitable for network analysis, graph algorithms, NLP research, and machine learning applications.

    Dataset Overview

    • Total Pages: 100,000
    • Total Links: 28,855,738 (directed edges)
    • Average Words per Page: 3,531
    • Language: English (en.wikipedia.org)
    • Collection Method: BFS (Breadth-First Search) crawling, depth 5
    • Data Quality Score: 99.76/100

    Files Description

    1. pages_export.csv

    Complete page metadata including:

    • id: Unique page ID
    • title: Page title
    • language: Language code (en)
    • content_length: Content length in characters
    • word_count: Word count
    • categories: JSON array of categories
    • infobox: JSON object of infobox data
    • created_at: Timestamp
    • url: Full Wikipedia URL

    Size: ~70 MB | Rows: 100,000

    2. links_export.csv

    Complete link graph with URLs:

    • id: Unique link ID
    • source_title: Source page title
    • target_title: Target page title
    • language: Language code
    • position: Link position on page
    • depth: Crawl depth where link was discovered
    • created_at: Timestamp
    • source_url: Full source page URL
    • target_url: Full target page URL

    Size: ~4.5 GB | Rows: 28,855,738

    3. graph.json

    Network graph in JSON format:

    • nodes: Array of node objects with id field
    • edges: Array of edge objects with source and target fields

    Size: ~2.1 GB | Edges: 28,855,738
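As an illustration of how the graph.json layout described above (nodes with an id field, edges with source and target fields) could feed a graph algorithm, here is a small power-iteration PageRank sketch. The function name, the damping factor of 0.85, the iteration count, and the toy graphs are assumptions and not part of the dataset. Edges pointing to ids missing from `nodes` are skipped, and rank held by pages without outgoing links is spread evenly over all nodes.

```python
import json
from collections import defaultdict


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {'nodes': [{'id': ...}], 'edges': [{'source', 'target'}]}."""
    nodes = [node["id"] for node in graph["nodes"]]
    known = set(nodes)
    out_links = defaultdict(list)
    for edge in graph["edges"]:
        # Ignore edges that leave the node set.
        if edge["source"] in known and edge["target"] in known:
            out_links[edge["source"]].append(edge["target"])

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v, targets in out_links.items():
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Toy graph in the same shape as graph.json (made up for illustration).
toy = json.loads(
    '{"nodes": [{"id": "A"}, {"id": "B"}, {"id": "C"}],'
    ' "edges": [{"source": "A", "target": "B"},'
    '           {"source": "B", "target": "C"},'
    '           {"source": "C", "target": "A"},'
    '           {"source": "C", "target": "Z"}]}'
)
ranks = pagerank(toy)
```

Loading the real ~2.1 GB file with `json.load` would hold the whole graph in memory, so a streaming parser or the CSV edge list may be more practical at full scale.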

    Data Quality

    • Content Coverage: 99.99% (99,992 pages have quality content)
    • Link Quality: 99.22%
    • Uniqueness: 100% (all links are unique)
    • Content Quality: 100% (average 3,531 words per page)
    • Duplicate Pages: Minimal (cleaned)
    • Self-Links: 4,326 (removed)
    • Data Validation: ✅ All entries validated and cleaned

    Use Cases

    1. Network Analysis: Study Wikipedia link structure and page connectivity
    2. Graph Algorithms: Test shortest path, centrality, community detection algorithms
    3. NLP Research: Analyze Wikipedia content, categories, and relationships
    4. Machine Learning: Train models on Wikipedia link prediction
    5. Knowledge Graph: Build knowledge graphs from Wikipedia structure
    6. PageRank: Implement and test PageRank algorithms
    7. Recommendation Systems: Build content recommendation systems

    Collection Methodology

    1. Seed Selection: Started with 5 Wikipedia pages
    2. Crawling: BFS algorithm, depth 5
    3. Rate Limiting: Balanced (0.82 pages/second)
    4. Parallel Processing: Optimized concurrent workers
    5. Caching: HTML content cached for efficiency
    6. Validation: All data validated and deduplicated
    7. Quality Control: Automated quality checks and cleaning

    Technical Details

    • Database: SQLite with WAL mode
    • Crawl Duration: ~29 hours
    • Crawl Rate: 0.82 pages/second
    • Checkpoint System: Resume-capable crawling
    • Data Cleaning: Automated duplicate removal and quality checks
  7. ner-wikipedia-dataset

    • huggingface.co
    Cite
    Stockmark Inc., ner-wikipedia-dataset [Dataset]. https://huggingface.co/datasets/stockmark/ner-wikipedia-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset authored and provided by
    Stockmark Inc.
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    A Japanese named-entity recognition dataset built from Wikipedia

    GitHub: https://github.com/stockmarkteam/ner-wikipedia-dataset/ LICENSE: CC-BY-SA 3.0

    Developed by Stockmark Inc.

  8. A Wikipedia dataset of 5 categories

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jan 24, 2020
    Cite
    Julien Maitre (2020). A Wikipedia dataset of 5 categories [Dataset]. http://doi.org/10.5281/zenodo.3260046
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien Maitre
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A subset of articles extracted from the French Wikipedia XML dump. The data published here cover 5 different categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The characteristics of the dataset are:

    • Economy : 44'876 articles
    • History : 92'041 articles
    • Informatics : 25'408 articles
    • Health : 22'143 articles
    • Law : 9'964 articles
  9. Wikipedia data.tsv

    • figshare.com
    txt
    Updated Oct 10, 2023
    Cite
    Mengyi Wei (2023). Wikipedia data.tsv [Dataset]. http://doi.org/10.6084/m9.figshare.24278299.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mengyi Wei
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Using Wikipedia data to study AI ethics.

  10. Long document similarity datasets, Wikipedia excerptions for movies, video...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2021
    + more versions
    Cite
    anonymous (2021). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4468782
    Explore at:
    Dataset updated
    Jan 27, 2021
    Dataset authored and provided by
    anonymous
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Three corpora in different domains extracted from Wikipedia.

    For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, particularly the subtitles and paragraphs, is kept in these datasets.

    Wines

    The Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations for each sample. Examples of ground-truth expert-based recommendations are:

    Dom Pérignon - Moët & Chandon

    Pinot Meunier - Chardonnay

    Movies

    The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are:

    Schindler's List - The Pianist

    Lion King - The Jungle Book

    Video games

    The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:

    Grand Theft Auto - Mafia

    Burnout Paradise - Forza Horizon 3

  11. Freebase/Wikidata Mappings

    • kaggle.com
    zip
    Updated Mar 13, 2026
    Cite
    Dhruv Bansal (2026). Freebase/Wikidata Mappings [Dataset]. https://www.kaggle.com/datasets/dhruvb2028/freebasewikidata-mappings
    Explore at:
    Available download formats: zip (21894706 bytes)
    Dataset updated
    Mar 13, 2026
    Authors
    Dhruv Bansal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides entity mappings between Freebase and Wikidata, enabling seamless integration between two large-scale knowledge graphs. It is based on the Wikidata data dump from October 28, 2013, and was originally published by Google under the CC0 (Public Domain) license.

    The mappings are carefully filtered to ensure high reliability:

    • Each mapping includes at least two shared Wikipedia links
    • There are no conflicting Wikipedia links

    This strict filtering results in high-confidence entity alignments, making the dataset useful for research and real-world applications in knowledge graph systems.

  12. 1000 Wikipedia Samples

    • zenodo.org
    zip
    Updated Feb 3, 2020
    Cite
    Andreas Waldis (2020). 1000 Wikipedia Samples [Dataset]. http://doi.org/10.5281/zenodo.3634383
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 3, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andreas Waldis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    1000 Wikipedia articles, used to evaluate a concept recognition algorithm

  13. wizard_of_wikipedia

    • huggingface.co
    Updated Sep 20, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chujie Zheng (2020). wizard_of_wikipedia [Dataset]. https://huggingface.co/datasets/chujiezheng/wizard_of_wikipedia
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 20, 2020
    Authors
    Chujie Zheng
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation". GitHub repo. Original paper.

    @inproceedings{zheng-etal-2020-diffks,
      title="{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
      author="Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
      booktitle="Findings of EMNLP",
      year="2020"
    }

  14. Long document similarity dataset, Wikipedia excerptions for movies...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 20, 2023
    Cite
    anonymous (2023). Long document similarity dataset, Wikipedia excerptions for movies collections [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7019172
    Explore at:
    Dataset updated
    Jan 20, 2023
    Dataset authored and provided by
    anonymous
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Movies-related articles extracted from Wikipedia.

    For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, particularly the subtitles and paragraphs, is kept in these datasets.

    Movies

    The Wikipedia Movies dataset consists of 100,371 articles describing various movies. Each article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.

  15. 270K Wikipedia STEM articles

    • kaggle.com
    zip
    Updated Aug 24, 2023
    Cite
    Mohammadreza Banaei (2023). 270K Wikipedia STEM articles [Dataset]. https://www.kaggle.com/datasets/mbanaei/all-paraphs-parsed-expanded
    Explore at:
    Available download formats: zip (549099946 bytes)
    Dataset updated
    Aug 24, 2023
    Authors
    Mohammadreza Banaei
    Description

    Dataset

    This dataset was created by Mohammadreza Banaei

    Contents

  16. Wiki-talk Datasets

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    gz
    Updated Jan 24, 2020
    + more versions
    Cite
    Jun Sun; Jérôme Kunegis (2020). Wiki-talk Datasets [Dataset]. http://doi.org/10.5281/zenodo.49561
    Explore at:
    Available download formats: gz
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jun Sun; Jérôme Kunegis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent users of Wikipedia, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.

    More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
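The network described above is a directed, timestamped edge list. The sketch below shows one way such edges could be represented and summarised. It assumes a whitespace-separated `source target timestamp` line format with integer fields, which the listing does not confirm; the names `TalkEdge`, `parse_edges`, and `messages_received` and the sample lines are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class TalkEdge(NamedTuple):
    source: int      # user who wrote the message
    target: int      # user whose talk page received it
    timestamp: int   # assumed to be an integer timestamp


def parse_edges(lines):
    """Parse 'source target timestamp' lines, skipping blanks and '#' comments."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst, ts = line.split()[:3]
        edges.append(TalkEdge(int(src), int(dst), int(ts)))
    return edges


def messages_received(edges):
    """Count incoming messages per user (in-degree with multiplicity)."""
    return Counter(edge.target for edge in edges)


# Made-up sample lines for illustration.
sample_lines = ["# src dst ts", "1 2 1000", "3 2 1005", "2 1 1010", ""]
edges = parse_edges(sample_lines)
received = messages_received(edges)
```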

  17. Wikipedia Dataset

    • universe.roboflow.com
    zip
    Updated Jul 10, 2025
    Cite
    yolov8ui (2025). Wikipedia Dataset [Dataset]. https://universe.roboflow.com/yolov8ui/wikipedia/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    yolov8ui
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    UI Elements Bounding Boxes
    Description

    Wikipedia

    Overview

    Wikipedia is a dataset for object detection tasks. It contains UI Elements annotations for 5,522 images.

    Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    License

    This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
    
  18. wikipedia-new-user-registrations

    • huggingface.co
    • kaggle.com
    Updated May 9, 2012
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Unofficial Wikimedia Community (2012). wikipedia-new-user-registrations [Dataset]. https://huggingface.co/datasets/wikimedia-community/wikipedia-new-user-registrations
    Explore at:
    Dataset updated
    May 9, 2012
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Unofficial Wikimedia Community
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Historical data on new user account registrations to the English Wikipedia and other large Wikipedias. Hourly new user registrations to the English Wikipedia (2008-2011); timestamps are aligned to 2008 (as opposed to 2011 for the original dataset) for easy year-to-year comparison.

  19. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Explore at:
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors
    Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  20. Data from: Citations with identifiers in Wikipedia

    • figshare.com
    gz
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aaron Halfaker; Bahodir Mansurov; Miriam Redi; Dario Taraborelli (2023). Citations with identifiers in Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.1299540.v1
    Explore at:
    Available download formats: gz
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Halfaker; Bahodir Mansurov; Miriam Redi; Dario Taraborelli
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset includes a list of citations with identifiers extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.

    License

    All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

    Projects

    Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supports as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.

    Identifiers

    • PubMed IDs (pmid) and PubMedCentral IDs (pmcid)
    • Digital Object Identifiers (doi)
    • International Standard Book Number (isbn)
    • ArXiv IDs (arxiv)

    Format

    Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair. Metadata about when the citation was first added is included.

    • page_id -- The identifier of the Wikipedia article (int), e.g. 1325125
    • page_title -- The title of the Wikipedia article (utf-8), e.g. Club cell
    • rev_id -- The Wikipedia revision where the citation was first added (int), e.g. 282470030
    • timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
    • type -- The type of identifier, e.g. pmid
    • id -- The id of the cited source (utf-8), e.g. 18179694

    Source code

    https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT Licensed)

    A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/

    Notes

    Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.
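The six documented fields map naturally onto a typed record. The sketch below parses rows into such records, assuming tab-separated files with a header row naming the six fields; the listing does not state the delimiter or whether a header exists, so both are assumptions, as are the `Citation` class and `parse_citations` function. The sample row reuses the example values given in the field descriptions.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Citation:
    page_id: int
    page_title: str
    rev_id: int
    timestamp: datetime
    type: str
    id: str


def parse_citations(handle):
    """Yield Citation records from a tab-separated file object with a header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Citation(
            page_id=int(row["page_id"]),
            page_title=row["page_title"],
            rev_id=int(row["rev_id"]),
            # ISO 8601 with a trailing 'Z', i.e. UTC.
            timestamp=datetime.strptime(
                row["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc),
            type=row["type"],
            id=row["id"],  # kept as text: DOIs and ISBNs are not integers
        )


sample = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694\n"
)
citations = list(parse_citations(io.StringIO(sample)))
```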

speech-wikimedia (MLCommons/speech-wikimedia)
9 scholarly articles cite this dataset (View in Google Scholar)