100+ datasets found

WikiMIA
huggingface.co
Updated Oct 9, 2023
Cite
Weijia Shi (2023). WikiMIA [Dataset]. https://huggingface.co/datasets/swj0419/WikiMIA
Dataset updated
Oct 9, 2023
Authors
Weijia Shi
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
📘 WikiMIA Datasets

The WikiMIA datasets are a benchmark for evaluating membership inference attack (MIA) methods, specifically methods that detect pretraining data of large language models.

📌 Applicability

The datasets can be applied to various models released between 2017 and 2023:

LLaMA1/2, GPT-Neo, OPT, Pythia, text-davinci-001, text-davinci-002, ... and more.

Loading the datasets

To load the dataset: from datasets import load_dataset

LENGTH =… See the full description on the dataset page: https://huggingface.co/datasets/swj0419/WikiMIA.
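Because the loading snippet above is truncated in this listing, here is a minimal sketch that avoids guessing the elided part. It uses only the dataset URL given in the citation and loads every split at once. The helper names `repo_id_from_url` and `load_all_splits` are hypothetical, and the split names the Hub returns should be checked on the dataset page.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<name>' from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_all_splits(url: str):
    """Load every split of the dataset (needs `pip install datasets` and network access)."""
    # Imported lazily so the URL helper works without the third-party package.
    from datasets import load_dataset

    # With no `split` argument, load_dataset returns a mapping of split name to data.
    return load_dataset(repo_id_from_url(url))


WIKIMIA_URL = "https://huggingface.co/datasets/swj0419/WikiMIA"
WIKIMIA_REPO_ID = repo_id_from_url(WIKIMIA_URL)
```

Calling `load_all_splits(WIKIMIA_URL)` and listing the keys of the result is one way to discover the available splits before choosing one.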
WikiMIA-24
huggingface.co
Updated Dec 24, 2024
Cite
WenjieFu (2024). WikiMIA-24 [Dataset]. https://huggingface.co/datasets/wjfu99/WikiMIA-24
Dataset updated
Dec 24, 2024
Authors
WenjieFu
Description
📘 WikiMIA-24 Datasets

The WikiMIA-24 datasets are a more up-to-date benchmark for evaluating pre-training data detection algorithms for large language models. The prior version of this benchmark can be found in WikiMIA.

📌 Applicability

The datasets can be applied to various models released between 2017 and 2024:

Mistral, Gemma, LLaMA1/2, Falcon, Vicuna, Pythia, GPT-Neo, OPT, ... and more.

Loading the datasets

To load the dataset: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/wjfu99/WikiMIA-24.
wikimia
huggingface.co
Updated Apr 15, 2023
Cite
full (2023). wikimia [Dataset]. https://huggingface.co/datasets/wwml/wikimia
Dataset updated
Apr 15, 2023
Authors
full
Description
wwml/wikimia dataset hosted on Hugging Face and contributed by the HF Datasets community
wikiMIA-2024-hard
huggingface.co
Updated Sep 11, 2024
Cite
Skyler Hallinan (2024). wikiMIA-2024-hard [Dataset]. https://huggingface.co/datasets/hallisky/wikiMIA-2024-hard
Dataset updated
Sep 11, 2024
Authors
Skyler Hallinan
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
WikiMIA-2024 Hard Dataset

Dataset Description

WikiMIA_2024 Hard is a challenging dataset for membership inference attacks, introduced in the paper "The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage". It contains temporal Wikipedia articles with different versions based on date cutoffs. This dataset is designed to evaluate the robustness of privacy-preserving machine learning models against sophisticated membership inference techniques. It… See the full description on the dataset page: https://huggingface.co/datasets/hallisky/wikiMIA-2024-hard.
wikimia.net - Historical whois Lookup
whoisdatacenter.com
csv
+ more versions
Cite
AllHeart Web Inc, wikimia.net - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/wikimia.net/
Available download formats: csv
Dataset authored and provided by
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Aug 16, 2025
Description
Explore the historical Whois records related to wikimia.net (Domain). Get insights into ownership history and changes over time.
wikimia.email - Historical whois Lookup
whoisdatacenter.com
csv
Cite
AllHeart Web Inc, wikimia.email - Historical whois Lookup [Dataset]. https://whoisdatacenter.com/domain/wikimia.email/
Available download formats: csv
Dataset provided by
Authors
AllHeart Web Inc
License
https://whoisdatacenter.com/terms-of-use/
Time period covered
Mar 15, 1985 - Jul 22, 2025
Description
Explore the historical Whois records related to wikimia.email (Domain). Get insights into ownership history and changes over time.
Wizard of Wikipedia - Dataset - LDM
service.tib.eu
Updated Nov 25, 2024
Cite
(2024). Wizard of Wikipedia - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/wizard-of-wikipedia
Dataset updated
Nov 25, 2024
Description
Wizard of Wikipedia is a recent, large-scale dataset of multi-turn knowledge-grounded dialogues between an “apprentice” and a “wizard”, who has access to information from Wikipedia documents.
Wikipedia Talk Corpus
figshare.com
application/x-gzip
Updated Jan 23, 2017
Cite
Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.4264973.v3
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.4264973.v3
Dataset updated
Jan 23, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ellery Wulczyn; Nithum Thain; Lucas Dixon
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
Data from: English Wikipedia - Species Pages
gbif.org
demo.gbif.org
Updated Aug 23, 2022
+ more versions
Cite
Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
Unique identifier
https://doi.org/10.15468/c3kkgh
Dataset updated
Aug 23, 2022
Dataset provided by
Wikimedia Foundation (http://www.wikimedia.org/)
Global Biodiversity Information Facility (https://www.gbif.org/)
Authors
Markus Döring
Description
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

See https://github.com/mdoering/wikipedia-dwca for details.
Wikipedia STEM 1k
kaggle.com
Updated Jul 14, 2023
Cite
Leonid Kulyk (2023). Wikipedia STEM 1k [Dataset]. https://www.kaggle.com/datasets/leonidkulyk/wikipedia-stem-1k
Dataset updated
Jul 14, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Leonid Kulyk
Description
Dataset

This dataset was created by Leonid Kulyk

Contents
Wikipedia Talk Labels: Toxicity
figshare.com
txt
Updated Feb 22, 2017
Cite
Nithum Thain; Lucas Dixon; Ellery Wulczyn (2017). Wikipedia Talk Labels: Toxicity [Dataset]. http://doi.org/10.6084/m9.figshare.4563973.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.4563973.v2
Dataset updated
Feb 22, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nithum Thain; Lucas Dixon; Ellery Wulczyn
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
Wikipedia Knowledge Graph dataset
zenodo.org
produccioncientifica.ugr.es
+1more
pdf, tsv
Updated Jul 17, 2024
Cite
Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
Available download formats: tsv, pdf
Unique identifier
https://doi.org/10.5281/zenodo.6346900
Dataset updated
Jul 17, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and expanding its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis, contextualization of the activity and relations of Wikipedia pages, in this case limited to its English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists.

There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one that acts as the core of the dataset is the page file, after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables" making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files).
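As a rough illustration of how an "intermediate table" connects the core page file to an entity file, the sketch below joins three tiny in-memory TSV tables. All column names and rows are hypothetical, invented for this example; the real schema is described in the Dataset_summary document.

```python
import csv
import io

# Hypothetical miniature stand-ins for the page, category and page_category files.
# Column names and values are assumptions, not the dataset's actual schema.
PAGE_TSV = "page_id\tpage_title\n1\tAlan Turing\n2\tEnigma machine\n"
CATEGORY_TSV = "category_id\tcategory_title\n10\tCryptography\n11\tComputer scientists\n"
PAGE_CATEGORY_TSV = "page_id\tcategory_id\n1\t10\n1\t11\n2\t10\n"


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def categories_by_page(pages, categories, page_category):
    """Resolve the many-to-many page/category relation through the link table."""
    titles = {row["page_id"]: row["page_title"] for row in pages}
    names = {row["category_id"]: row["category_title"] for row in categories}
    result = {title: [] for title in titles.values()}
    for link in page_category:
        result[titles[link["page_id"]]].append(names[link["category_id"]])
    return result


joined = categories_by_page(
    read_tsv(PAGE_TSV), read_tsv(CATEGORY_TSV), read_tsv(PAGE_CATEGORY_TSV)
)
```

The same pattern would apply to the other link files (page_url, page_pub, page_link), with page_link relating pages to other pages rather than to a separate entity file.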

The document Dataset_summary includes a detailed description of the dataset.

Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
English Wikipedia Quality Asssessment Dataset
figshare.com
application/bzip2
Updated May 31, 2023
Cite
Morten Warncke-Wang (2023). English Wikipedia Quality Asssessment Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1375406.v2
Available download formats: application/bzip2
Unique identifier
https://doi.org/10.6084/m9.figshare.1375406.v2
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Morten Warncke-Wang
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy.

The 2017 dataset is the preferred dataset to use, contains 32,460 articles, and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference, and contains 30,272 articles gathered on 2015/02/05.

The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
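To make the 90%/10% stratified split concrete, here is a minimal sketch of stratified random sampling: items are grouped by class and each group is split separately, so both sets keep the class proportions. This is not the author's code. The function name, the seed, and the toy article list are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, test_fraction=0.1, seed=0):
    """Split items into (train, test) while preserving per-label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        # Each class contributes its own share to the test set.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 20 placeholder articles in each of six quality classes (A-class excluded).
CLASSES = ["FA", "GA", "B", "C", "Start", "Stub"]
articles = [(f"{cls}-{i}", cls) for cls in CLASSES for i in range(20)]
train_set, test_set = stratified_split(articles, label_of=lambda a: a[1])
```

With equal-sized classes, every class contributes the same number of articles to the test set; with unequal classes, the per-class counts scale with class size.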
Long document similarity datasets, Wikipedia excerptions for movies, video...
live.european-language-grid.eu
data.niaid.nih.gov
+1more
csv
Updated Apr 6, 2024
+ more versions
Cite
(2024). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7843
Available download formats: csv
Dataset updated
Apr 6, 2024
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, and particularly the sub-titles and paragraphs, are kept in these datasets.
Wines: The Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations for each sample. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon, Pinot Meunier - Chardonnay.
Movies: The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of the ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist, Lion King - The Jungle Book.
Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are: Grand Theft Auto - Mafia, Burnout Paradise - Forza Horizon 3.
WikiReaD (Wikipedia Readability Dataset)
zenodo.org
bz2
Updated May 22, 2025
Cite
Mykola Trokhymovych; Indira Sen; Martin Gerlach (2025). WikiReaD (Wikipedia Readability Dataset) [Dataset]. http://doi.org/10.5281/zenodo.11371932
Available download formats: bz2
Unique identifier
https://doi.org/10.5281/zenodo.11371932
Dataset updated
May 22, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mykola Trokhymovych; Indira Sen; Martin Gerlach
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Dataset Description:

The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).

Dataset Details:

Number of Languages: 14

Number of files: 19

Use Case: Training and evaluating readability scoring models for articles within and outside Wikipedia.

Processing details: Text pairs are created by matching articles from Wikipedia with the corresponding article in the simplified/children encyclopedia either via the Wikidata item ID or their page titles. The text of each article is extracted directly from their parsed HTML version.
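The matching step described above could look roughly like the following sketch. It tries the Wikidata item ID first and falls back to a case-insensitive title match. The field names `qid` and `title`, the fallback order, and the sample records are assumptions for illustration, not the authors' pipeline.

```python
def match_pairs(hard_articles, easy_articles):
    """Pair each Wikipedia (hard) article with its simplified (easy) counterpart."""
    # Index the easy articles by Wikidata item ID (when present) and by title.
    by_qid = {a["qid"]: a for a in easy_articles if a.get("qid")}
    by_title = {a["title"].casefold(): a for a in easy_articles}

    pairs = []
    for hard in hard_articles:
        # Prefer the Wikidata ID; fall back to a case-insensitive title match.
        easy = by_qid.get(hard.get("qid")) or by_title.get(hard["title"].casefold())
        if easy is not None:
            pairs.append((hard, easy))
    return pairs


# Hypothetical records for illustration only.
hard = [
    {"qid": "Q1", "title": "Universe"},
    {"qid": None, "title": "Moon"},
    {"qid": "Q9", "title": "Nowhere"},
]
easy = [
    {"qid": "Q1", "title": "The Universe"},
    {"qid": None, "title": "moon"},
]
pairs = match_pairs(hard, easy)
```

Articles with no counterpart under either key are simply dropped, which is one plausible way to end up with a dataset of complete pairs only.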

Files: The dataset consists of independent files for each type of children/simplified encyclopedia and each language (e.g., `

Attribution:

The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.

Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page (

Wikipedia

Source: https://

License: CC BY-SA 4.0, GFDL

Simple English Wikipedia

Source: https://simple.wikipedia.org/wiki/

License: CC BY-SA 4.0, GFDL

Vikidia

Source: https://

License: CC BY-SA 3.0, GFDL

Klexikon

Source: https://klexikon.zum.de/wiki/

License: CC BY-SA 4.0

Txikipedia

Source: https://eu.wikipedia.org/wiki/Txikipedia:

License: CC BY-SA 4.0, GFDL

Wikikids

Source: https://wikikids.nl/

License: CC BY-SA 3.0

Related paper citation:

@inproceedings{trokhymovych-etal-2024-open,
    title = "An Open Multilingual System for Scoring Readability of {W}ikipedia",
    author = "Trokhymovych, Mykola and Sen, Indira and Gerlach, Martin",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.342/",
    doi = "10.18653/v1/2024.acl-long.342",
    pages = "6296--6311"
}
English/Turkish Wikipedia Named-Entity Recognition and Text Categorization...
data.mendeley.com
Updated Feb 9, 2017
+ more versions
Cite
H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
Unique identifier
https://doi.org/10.17632/cdcztymf4k.1
Dataset updated
Feb 9, 2017
Authors
H. Bahadir Sahin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1000 fine-grained entity types for both languages. The Turkish gazetteers contain approximately 300K named-entities and the English gazetteers have approximately 23M named-entities.

By leveraging large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result of this process, we introduce 3 versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (the count varies between versions), while English collections contain more than 7M sentences.

We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce fine-grained types into "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained version. Note that this process also eliminated many domains and fine-grained annotations due to lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is lower than in the "Fine-Grained NER" versions.

All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.
Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...
figshare.com
txt
Updated Mar 14, 2021
Cite
KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.14113799.v4
Dataset updated
Mar 14, 2021
Dataset provided by
figshare
Authors
KayYen Wong; Diego Saez-Trumper; Miriam Redi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia.

Consists of metadata features and content text datasets, with the formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz

For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
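As a small sketch of working with the three file-name formats listed in this description, the helpers below expand a template name into its file names and read either a plain or gzip-compressed CSV. The helper names and the placeholder template name are hypothetical, and the files are assumed to be comma-separated with a header row.

```python
import csv
import gzip

# The three file-name patterns from the description.
FILE_PATTERNS = (
    "{template_name}_features.csv",
    "{template_name}_difftxt.csv.gz",
    "{template_name}_fulltxt.csv.gz",
)


def files_for_template(template_name: str) -> list[str]:
    """Return the three expected file names for one template."""
    return [pattern.format(template_name=template_name) for pattern in FILE_PATTERNS]


def read_rows(path: str) -> list[dict]:
    """Read a CSV file with a header row, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# "example_template" is a placeholder, not a real template name from the dataset.
expected_files = files_for_template("example_template")
```

Real template names and column layouts are documented on the project page linked in the description.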
Long document similarity dataset, Wikipedia excerptions for video games...
zenodo.org
txt
Updated Jul 29, 2021
Cite
Dvir Ginzburg (2021). Long document similarity dataset, Wikipedia excerptions for video games collections [Dataset]. http://doi.org/10.5281/zenodo.4812962
Available download formats: txt
Unique identifier
https://doi.org/10.5281/zenodo.4812962
Dataset updated
Jul 29, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Dvir Ginzburg
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Video games-related articles extracted from Wikipedia.

For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.

The article structure, and particularly the sub-titles and paragraphs, are kept in this dataset.

Video games

The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:

Grand Theft Auto - Mafia

Burnout Paradise - Forza Horizon 3
Wikipedia information quality assessment - Dataset - CKAN
rdm.inesctec.pt
Updated Jul 29, 2021
+ more versions
Cite
(2021). Wikipedia information quality assessment - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2021-005
Dataset updated
Jul 29, 2021
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Dataset from the second part of the Master Dissertation - "Avaliação da qualidade da Wikipédia enquanto fonte de informação em saúde" (Wikipedia quality assessment as health information source), at FEUP, in 2021. It contains the data collected to assess Wikipedia health-related articles for the 1000 most viewed articles listed by WikiProject Medicine, in English. The MediaWiki API was used to collect the current state of the article’s contents and its metadata, revision history, language links, internal wiki links, and external links. Data not available through the API was obtained from the article’s markup. Besides the 7 metrics defined by Stvilia et al., four other proposed metrics and their respective features were assessed. This dataset can be used to analyze quality, but also other quantitative aspects of health-related articles from English Wikipedia.
Citations with identifiers in Wikipedia
figshare.com
application/gzip
Updated May 30, 2023
Cite
Aaron Halfaker; Bahodir Mansurov; Miriam Redi; Dario Taraborelli (2023). Citations with identifiers in Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.1299540.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.1299540.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Aaron Halfaker; Bahodir Mansurov; Miriam Redi; Dario Taraborelli
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset includes a list of citations with identifiers extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.

License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

Projects
Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supports as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.

Identifiers
• PubMed IDs (pmid) and PubMedCentral IDs (pmcid)
• Digital Object Identifiers (doi)
• International Standard Book Number (isbn)
• ArXiv Ids (arxiv)

Format
Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair. Metadata about when the citation was first added is included.
• page_id -- The identifier of the Wikipedia article (int), e.g. 1325125
• page_title -- The title of the Wikipedia article (utf-8), e.g. Club cell
• rev_id -- The Wikipedia revision where the citation was first added (int), e.g. 282470030
• timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
• type -- The type of identifier, e.g. pmid
• id -- The id of the cited source (utf-8), e.g. 18179694

Source code
https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT Licensed)

A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/

Notes
Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.
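The six fields in the row format described above map naturally onto a typed record. The sketch below parses one row built from the example values in the description. It assumes tab-separated columns in the listed order, which the description does not state, and the `Citation` class and `parse_citation` helper are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Citation:
    page_id: int         # identifier of the Wikipedia article
    page_title: str      # title of the Wikipedia article
    rev_id: int          # revision where the citation was first added
    timestamp: datetime  # time of that revision (UTC)
    type: str            # identifier type, e.g. pmid, pmcid, doi, isbn, arxiv
    id: str              # identifier of the cited source


def parse_citation(line: str, delimiter: str = "\t") -> Citation:
    """Parse one row; the delimiter and column order are assumptions."""
    page_id, page_title, rev_id, timestamp, id_type, source_id = (
        line.rstrip("\n").split(delimiter)
    )
    return Citation(
        page_id=int(page_id),
        page_title=page_title,
        rev_id=int(rev_id),
        # The example timestamp ends in "Z", i.e. UTC.
        timestamp=datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        ),
        type=id_type,
        id=source_id,
    )


# Row assembled from the example values given in the field list.
example = parse_citation(
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694"
)
```

The cited-source id is kept as a string because identifiers such as DOIs and ISBNs are not numeric.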

105 scholarly articles cite the WikiMIA dataset (swj0419/WikiMIA).
