Dataset Card for Speech Wikimedia
Dataset Summary
The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons, licensed for academic and commercial use under CC and public domain licenses. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file should have one or more transcriptions in different languages.
Transcription languages
English German… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/speech-wikimedia.
https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
from tqdm import tqdm
data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
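The loop in the listing above is truncated, so the following is only a hedged sketch of the general pattern it appears to follow: take the first N entries from a streamed iterable and attach an embedding to each. `fake_embed`, `take_and_embed`, and the sample records are hypothetical stand-ins, not the author's code; in a real run the encoder would presumably be the SentenceTransformer model named in the listing.

```python
from itertools import islice
from typing import Callable, Iterable


def fake_embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real encoder call; NOT a real embedding.
    return [float(len(text)), float(text.count(" "))]


def take_and_embed(entries: Iterable[dict], limit: int,
                   embed: Callable[[str], list[float]]) -> list[dict]:
    """Keep the first `limit` streamed entries and add an 'embedding' field."""
    rows = []
    for entry in islice(entries, limit):
        row = dict(entry)  # copy so the source record is not mutated
        row["embedding"] = embed(entry["text"])
        rows.append(row)
    return rows


# Hypothetical records; the 'title' and 'text' field names are assumptions.
stream = ({"title": f"Page {i}", "text": f"body of page {i}"} for i in range(10))
subset = take_and_embed(stream, limit=3, embed=fake_embed)
print(len(subset), subset[0]["embedding"])
```

Using `islice` keeps the stream lazy, so only the entries that are kept are ever pulled from the source.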
The data in this dataset is extracted from the BigQuery Wikipedia dataset. It includes:
Entities with P31 (instance of) = Q5 (human), for the years 2015-2023.
Date accessed: March 2024.
Below are the queries used to get the dataset:
```sql
-- 1. Get list of people on Wikipedia
SELECT DISTINCT en_wiki -- page title name in English Wikipedia
FROM `project.wikipedia_pageviews.wikidata`,
  UNNEST(instance_of) AS instance_of_struct
WHERE instance_of_struct.numeric_id = 5 -- instance_of = 5 => person

-- 2. Get pageview data for those people
SELECT title, DATETIME_TRUNC(datehour, MONTH) AS month, SUM(views) AS monthly_views
FROM `project.wikipedia_pageviews.pageviews_20xx` a -- replace xx with desired year
JOIN `project.data_for_project.distinct_people` b
  ON a.title = b.en_wiki
WHERE datehour IS NOT NULL AND wiki = "en"
GROUP BY title, DATETIME_TRUNC(datehour, MONTH)

-- 3. Get wikidata for those people
SELECT *
FROM `project.wikipedia_pageviews.wikidata`,
  UNNEST(instance_of) AS instance_of_struct
WHERE instance_of_struct.numeric_id = 5
```
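For readers who want to check the monthly roll-up of query 2 on a small sample outside BigQuery, the aggregation can be sketched in plain Python. This is a minimal illustration under assumptions: the function name `monthly_views` and the sample rows are hypothetical, and only the truncate-to-month and sum steps are mirrored, not the join.

```python
from collections import defaultdict
from datetime import datetime


def monthly_views(rows):
    """Sum hourly views per (title, month), skipping rows without a datehour."""
    totals = defaultdict(int)
    for title, datehour, views in rows:
        if datehour is None:  # mirrors WHERE datehour IS NOT NULL
            continue
        # Truncate the timestamp to the first instant of its month.
        month = datehour.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        totals[(title, month)] += views
    return dict(totals)


# Hypothetical hourly rows: (title, datehour, views)
rows = [
    ("Ada_Lovelace", datetime(2020, 1, 5, 13), 10),
    ("Ada_Lovelace", datetime(2020, 1, 20, 2), 5),
    ("Ada_Lovelace", datetime(2020, 2, 1, 0), 7),
    ("Alan_Turing", None, 99),
]
result = monthly_views(rows)
for (title, month), views in sorted(result.items()):
    print(title, month.strftime("%Y-%m"), views)
```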
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset comprises raw data extracted from Wikipedia, encompassing various types of content including articles, metadata, and user interactions. The dataset is in its unprocessed form, providing an excellent opportunity for data enthusiasts and professionals to engage in data cleaning and preprocessing tasks. It is ideal for those looking to practice and enhance their data cleaning skills, as well as for researchers and developers who require a rich and diverse corpus for natural language processing (NLP) projects.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
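The diff-based extraction described above can be illustrated with a small sketch. It assumes a simple line-level diff from Python's standard `difflib`; the corpus's actual diff algorithm and schema are documented in the referenced wiki and paper and are not reproduced here. `added_lines`, `comments_from_history`, and the sample history are hypothetical.

```python
import difflib


def added_lines(old: str, new: str) -> list[str]:
    """Return lines present in `new` but not in `old`, per a line-level diff."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def comments_from_history(revisions: list[str]) -> list[list[str]]:
    """For each revision, collect the text it added relative to its predecessor."""
    previous = ""
    out = []
    for text in revisions:
        out.append(added_lines(previous, text))
        previous = text
    return out


# Hypothetical revision history of a talk page, oldest first.
history = [
    "Hello, welcome to the talk page.",
    "Hello, welcome to the talk page.\nI think the intro needs a source.",
    "Hello, welcome to the talk page.\nI think the intro needs a source.\nAgreed, adding one now.",
]
added = comments_from_history(history)
for i, lines in enumerate(added):
    print(i, lines)
```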
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
A comprehensive Wikipedia dataset containing 100,000 pages with 28.9 million links, collected using a breadth-first search crawling algorithm. This dataset includes complete page metadata, link relationships, and a network graph representation suitable for network analysis, graph algorithms, NLP research, and machine learning applications.
pages_export.csv: Complete page metadata including:
- id: Unique page ID
- title: Page title
- language: Language code (en)
- content_length: Content length in characters
- word_count: Word count
- categories: JSON array of categories
- infobox: JSON object of infobox data
- created_at: Timestamp
- url: Full Wikipedia URL
Size: ~70 MB | Rows: 100,000
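Since `categories` and `infobox` are described as JSON stored inside CSV columns, reading the file takes two decoding steps. The sketch below shows one way to do that with the standard library; the sample row, its values, and the timestamp format are hypothetical, and only the column names come from the listing above.

```python
import csv
import io
import json

# Hypothetical one-row sample using the documented column names.
SAMPLE = io.StringIO(
    "id,title,language,content_length,word_count,categories,infobox,created_at,url\n"
    '1,Graph theory,en,1200,210,"[""Mathematics""]","{""field"": ""math""}",'
    "2024-01-01T00:00:00,https://en.wikipedia.org/wiki/Graph_theory\n"
)


def read_pages(handle):
    """Parse rows, converting numeric columns and decoding the JSON columns."""
    pages = []
    for row in csv.DictReader(handle):
        row["id"] = int(row["id"])
        row["content_length"] = int(row["content_length"])
        row["word_count"] = int(row["word_count"])
        row["categories"] = json.loads(row["categories"])  # JSON array
        row["infobox"] = json.loads(row["infobox"])        # JSON object
        pages.append(row)
    return pages


pages = read_pages(SAMPLE)
print(pages[0]["title"], pages[0]["categories"])
```

For the real file, the handle would come from `open("pages_export.csv", newline="", encoding="utf-8")`; the encoding is an assumption.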
links_export.csv: Complete link graph with URLs:
- id: Unique link ID
- source_title: Source page title
- target_title: Target page title
- language: Language code
- position: Link position on page
- depth: Crawl depth where link was discovered
- created_at: Timestamp
- source_url: Full source page URL
- target_url: Full target page URL
Size: ~4.5 GB | Rows: 28,855,738
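To show how a breadth-first crawl could produce link rows with `position` and `depth` fields like those listed above, here is a minimal sketch over an in-memory link table. The `LINKS` table and `bfs_crawl` are hypothetical, and treating `depth` as the depth of the source page is an assumption; the dataset's actual crawler is not described in the listing.

```python
from collections import deque

# Hypothetical link structure standing in for live Wikipedia pages.
LINKS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}


def bfs_crawl(start, get_links, max_pages):
    """Visit pages breadth-first; return link rows with position and depth."""
    visited = {start}
    queue = deque([(start, 0)])
    rows = []
    pages_crawled = 0
    while queue and pages_crawled < max_pages:
        title, depth = queue.popleft()
        pages_crawled += 1
        for position, target in enumerate(get_links(title)):
            rows.append({"source_title": title, "target_title": target,
                         "position": position, "depth": depth})
            if target not in visited:  # enqueue each page at most once
                visited.add(target)
                queue.append((target, depth + 1))
    return rows


link_rows = bfs_crawl("A", lambda t: LINKS.get(t, []), max_pages=3)
for row in link_rows:
    print(row)
```

The FIFO queue is what makes the traversal breadth-first: every page at depth d is crawled before any page at depth d + 1.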
graph.json: Network graph in JSON format:
- nodes: Array of node objects with id field
- edges: Array of edge objects with source and target fields
Size: ~2.1 GB | Edges: 28,855,738
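Given the `nodes` and `edges` layout described above, per-node degrees can be computed as in the following sketch. The tiny `raw` document and `degree_counts` are hypothetical and only mirror the documented field names.

```python
import json
from collections import Counter

# Hypothetical miniature graph in the documented nodes/edges layout.
raw = json.dumps({
    "nodes": [{"id": "A"}, {"id": "B"}, {"id": "C"}],
    "edges": [
        {"source": "A", "target": "B"},
        {"source": "A", "target": "C"},
        {"source": "B", "target": "C"},
    ],
})


def degree_counts(graph_json):
    """Return {node_id: (out_degree, in_degree)} for every listed node."""
    graph = json.loads(graph_json)
    out_deg = Counter(edge["source"] for edge in graph["edges"])
    in_deg = Counter(edge["target"] for edge in graph["edges"])
    return {n["id"]: (out_deg[n["id"]], in_deg[n["id"]]) for n in graph["nodes"]}


degrees = degree_counts(raw)
print(degrees)
```

`json.loads` reads the whole document into memory, which may be impractical for a file of about 2.1 GB; a streaming parser would likely be a better fit at that size.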
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
A Japanese named entity recognition dataset built from Wikipedia
GitHub: https://github.com/stockmarkteam/ner-wikipedia-dataset/ LICENSE: CC-BY-SA 3.0
Developed by Stockmark Inc.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A subset of articles extracted from the French Wikipedia XML dump. The data published here cover 5 categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The characteristics of the dataset are:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using Wikipedia data to study AI ethics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three corpora in different domains extracted from Wikipedia.
For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.
The article structure, particularly the subtitles and paragraphs, is kept in these datasets.
Wines
The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations for each sample. Examples of ground-truth expert-based recommendations are:
Dom Pérignon - Moët & Chandon
Pinot Meunier - Chardonnay
Movies
The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are:
Schindler's List - The Pianist
Lion King - The Jungle Book
Video games
The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:
Grand Theft Auto - Mafia
Burnout Paradise - Forza Horizon 3
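Ground-truth pairs like the ones above lend themselves to a simple retrieval-style evaluation. The listing does not say which metric the authors used, so the sketch below picks recall at k purely as an illustration. The source and ground-truth titles are taken from the wine examples; the `predictions` lists and `recall_at_k` are hypothetical.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of ground-truth items that appear in the top-k recommendations."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k = set(recommended[:k])
    return len(top_k & relevant_set) / len(relevant_set)


# Ground truth from the wine examples above (one item per source here).
ground_truth = {
    "Dom Pérignon": ["Moët & Chandon"],
    "Pinot Meunier": ["Chardonnay"],
}
# Hypothetical model output, invented for illustration only.
predictions = {
    "Dom Pérignon": ["Krug", "Moët & Chandon"],
    "Pinot Meunier": ["Merlot", "Syrah"],
}

scores = {src: recall_at_k(predictions[src], rel, k=2)
          for src, rel in ground_truth.items()}
mean_recall = sum(scores.values()) / len(scores)
print(scores, mean_recall)
```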
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides entity mappings between Freebase and Wikidata, enabling seamless integration between two large-scale knowledge graphs. It is based on the Wikidata data dump from October 28, 2013, and was originally published by Google under the CC0 (Public Domain) license.
The mappings are carefully filtered to ensure high reliability:
This strict filtering results in high-confidence entity alignments, making the dataset useful for research and real-world applications in knowledge graph systems.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1000 Wikipedia articles, used to evaluate a concept recognition algorithm
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation". GitHub repo. Original paper.
@inproceedings{zheng-etal-2020-diffks,
  title="{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
  author="Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
  booktitle="Findings of EMNLP",
  year="2020"
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Movies-related articles extracted from Wikipedia.
For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.
The article structure, particularly the subtitles and paragraphs, is kept in these datasets.
Movies
The Wikipedia Movies dataset consists of 100,371 articles describing various movies. Each article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
This dataset was created by Mohammadreza Banaei
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent Wikipedia users, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.
More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
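A directed, timestamped edge list of this kind can be explored with very little code. The sketch below assumes one edge per line as whitespace-separated `source target timestamp` integers; that file layout is an assumption, not something stated in the listing, and `parse_edges` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical sample; the "src dst unix_timestamp" layout is an assumption.
SAMPLE_EDGES = """\
1 2 1200000000
1 3 1200000500
2 1 1200001000
"""


def parse_edges(text):
    """Parse lines of 'src dst timestamp' into (int, int, int) tuples."""
    edges = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        src, dst, ts = line.split()
        edges.append((int(src), int(dst), int(ts)))
    return edges


edges = parse_edges(SAMPLE_EDGES)
# Out-degree here counts messages a user wrote on other users' talk pages.
messages_sent = Counter(src for src, _, _ in edges)
print(messages_sent.most_common())
```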
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Wikipedia is a dataset for object detection tasks - it contains UI Elements annotations for 5,522 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://choosealicense.com/licenses/cc0-1.0/
Historical data on new user account registrations to the English Wikipedia and other large Wikipedias. It covers hourly new user registrations to the English Wikipedia (2008-2011); timestamps are aligned to 2008 (as opposed to 2011 for the original dataset) for easy year-to-year comparison.
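One way to picture the alignment described above is to rewrite the year component of each timestamp so that all years share one axis. The sketch below is a guess at the idea, not the dataset author's procedure; `align_to_2008` and the sample timestamps are hypothetical. Because 2008 is a leap year, every calendar date from 2008-2011 remains valid after the change.

```python
from datetime import datetime


def align_to_2008(ts: datetime) -> datetime:
    """Move a timestamp onto the 2008 calendar, keeping month, day and hour."""
    return ts.replace(year=2008)


# Hypothetical hourly observations from different years.
samples = [datetime(2009, 3, 15, 14), datetime(2011, 3, 15, 14)]
aligned = [align_to_2008(ts) for ts in samples]
print(aligned)
```

Note that this aligns calendar dates, not weekdays, so weekly patterns will be shifted between years.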
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes a list of citations with identifiers extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Projects
Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supports as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.
Identifiers
• PubMed IDs (pmid) and PubMedCentral IDs (pmcid)
• Digital Object Identifiers (doi)
• International Standard Book Numbers (isbn)
• ArXiv IDs (arxiv)
Format
Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair. Metadata about when the citation was first added is included.
• page_id -- The identifier of the Wikipedia article (int), e.g. 1325125
• page_title -- The title of the Wikipedia article (utf-8), e.g. Club cell
• rev_id -- The Wikipedia revision where the citation was first added (int), e.g. 282470030
• timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
• type -- The type of identifier, e.g. pmid
• id -- The id of the cited source (utf-8), e.g. 18179694
Source code
https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT Licensed)
A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
Notes
Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.
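As a rough illustration of working with the row format described above, the sketch below counts citations per identifier type and parses the ISO 8601 timestamp. It assumes a tab-separated file with a header row, which the description does not confirm. The first sample row reuses the example values from the field list; the second row and the helper `load_citations` are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Assumed layout: tab-separated with a header row (not confirmed by the source).
SAMPLE_TSV = io.StringIO(
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694\n"
    # Hypothetical second row, for illustration only.
    "1325125\tClub cell\t282470031\t2009-04-09T10:00:00Z\tdoi\t10.1000/example\n"
)


def load_citations(handle):
    """Read citation rows, converting ids to int and timestamps to datetime."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["page_id"] = int(row["page_id"])
        row["rev_id"] = int(row["rev_id"])
        row["timestamp"] = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        rows.append(row)
    return rows


citations = load_citations(SAMPLE_TSV)
by_type = Counter(row["type"] for row in citations)
print(by_type)
```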