Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, in the following formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking, see: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
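For a quick look at one of these files, a minimal sketch using pandas follows; the template name citation_needed is only an example, and no column names are assumed, since the full schema is documented on the project page linked above.

```python
# Minimal sketch: inspect one Wiki-Reliability features file with pandas.
# "citation_needed" is an example template name; see the project page for
# the actual list of templates and the full column schema.
import pandas as pd

features = pd.read_csv("citation_needed_features.csv")

# Show the available feature columns and a few labeled examples.
print(features.columns.tolist())
print(features.head())

# The compressed text datasets can be read the same way; pandas
# decompresses .gz files transparently.
diff_text = pd.read_csv("citation_needed_difftxt.csv.gz")
print(diff_text.head())
```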
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook.
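For readers who cannot reach the linked notebook, a minimal sketch of loading the data with pandas and reducing the per-worker labels to a majority vote follows; the file and column names (attack_annotated_comments.tsv, attack_annotations.tsv, rev_id, attack) are assumptions here and should be checked against the schema documented on the wiki.

```python
# Minimal sketch: aggregate per-worker labels into a majority-vote label.
# File and column names are assumptions based on the schema described
# on the project wiki.
import pandas as pd

comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t", index_col="rev_id")
annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

# Each comment was labeled by several crowd-workers; treat a comment as a
# personal attack if more than half of its annotators flagged it.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5
comments["attack"] = labels

print(comments["attack"].value_counts())
```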
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has an aggressive tone. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset described in the paper Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables.
It is a dataset containing tables extracted from Wikipedia pages and annotated with DBpedia entity types. The file Wiki_TabNER_final_labeled.json contains the annotated tables. It can be used for solving NER within tables and for the entity linking task. The file dataset_entities_labeled_linked.csv contains all the linked entities that are mentioned in the tables and their corresponding Wikipedia IDs. More information on the creation of the dataset and instructions on how to use it is available in the GitHub repository for the paper.
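As a quick orientation, a minimal sketch of opening the two files named above follows; the internal structure of each table record is not assumed here and is documented in the GitHub repository for the paper.

```python
# Minimal sketch: open the Wiki-TabNER files and inspect the first records.
# The exact structure of each table record (keys, annotation format) is
# documented in the paper's GitHub repository; only the file names from the
# dataset description are used here.
import json
import pandas as pd

with open("Wiki_TabNER_final_labeled.json", encoding="utf-8") as f:
    tables = json.load(f)  # may instead be JSON Lines; adjust if so

print(type(tables), len(tables))

linked_entities = pd.read_csv("dataset_entities_labeled_linked.csv")
print(linked_entities.head())
```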
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset shows values taken from biography articles that have appeared in the "From today's featured article", "Did you know..." and "On this day" sections of the front page of the English edition of Wikipedia between 2013 and 2024. The values contained in this dataset were obtained by cross-referencing Wikidata properties with the unique identifiers of the articles. These data provide information about the people described in the articles, such as gender, ethnicity, sexual orientation and native language, among other properties, so that the representation of diversity on Wikipedia can be analyzed from an intersectional perspective. The document Joint-data contains all the joint data without distinguishing by the gender of the biography's subject, while the other documents split the information by the gender of the people in the articles: "Women" covers the data of cisgender women, "Men" the data of cisgender men, and "Dissident" the data of people whose gender differs from the one they were assigned at birth. You can therefore find four documents: Joint-data; Dissident_Gender-categorized-data; Men_Gender-categorized-data; Women_Gender-categorized-data. In each document, odd columns state the Wikidata properties analyzed and even columns specify the number of results for each value of the property, that is, the number of occurrences of each value.
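As an illustration of the odd/even column layout described above, a minimal sketch in Python follows; the file name Joint-data.csv is an assumption, since the description does not state the file extension or delimiter.

```python
# Minimal sketch: pair each property column (odd positions) with its
# adjacent count column (even positions), following the layout described
# in the dataset description. "Joint-data.csv" is an assumed file name.
import pandas as pd

df = pd.read_csv("Joint-data.csv")

# Columns come in (property values, occurrence counts) pairs.
for value_col, count_col in zip(df.columns[0::2], df.columns[1::2]):
    pairs = df[[value_col, count_col]].dropna()
    print(f"{value_col}:")
    print(pairs.head().to_string(index=False))
```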
This dataset provides information on the percentage of claimants at OHO hearings who were represented either by an attorney or by a non-attorney representative. This data is at the national level by fiscal year for the period of 1979 through 2015.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge bases structured around IF-THEN rules, defined for the inference of computational trust in the Wikipedia context.
Repository that contains alerts that will be sent to SSA employees when certain conditions exist, to inform them of work that needs to be done, is being reviewed, or has been completed.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains a set of 93,449 observations providing WikiProject mid-level category labels associated with talk pages for the respective Wikipedia articles. Each observation includes a talk page title, talk page id, the latest revision id when the extraction was done, the associated WikiProject templates, and the mid-level WikiProject categories the corresponding article page belongs to. The dataset was generated using a Python script that ran MySQL queries on Wikimedia PAWS. To ensure a balanced set, the script extracts a random set of 2,000 page ids per mid-level category, totaling about 93,449 observations. This dataset opens up immense possibilities for topic-oriented research around Wikipedia, as it exposes high-level topic data associated with Wikipedia pages.
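To check the per-category balance described above, a minimal sketch in Python follows; the file name and the column name midlevel_category are assumptions, as the exact schema ships with the dataset.

```python
# Minimal sketch: verify the per-category balance of the extracted dataset.
# "wikiproject_midlevel_labels.csv" and "midlevel_category" are assumed
# names; consult the dataset's own schema for the real ones.
import pandas as pd

df = pd.read_csv("wikiproject_midlevel_labels.csv")

# The extraction script sampled up to 2,000 page ids per mid-level category,
# so the counts below should be roughly uniform.
print(df["midlevel_category"].value_counts())
print("total observations:", len(df))
```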
This dataset is pertinent to ROW data programmatically extracted from LR2000, joined to PLSS Polygons, and then dissolved by case serial numbers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains results related to the analysis of a corpus of news reports covering the topic of crowd accidents. To facilitate online visualization and offline analysis, the files are organized by assigning a number to each. The numbering system and the details of each set of files are described as follows:
Class 0 – This contains the same files provided in this repository, but organized into folders to make analysis easier. If you intend to analyze the data from our lexical analysis, we suggest using this file, since it is better organized and can be downloaded directly. Please note that, due to a mistake when creating the newest version, the Wikipedia files were not included in this file, so they need to be downloaded separately. This will be fixed in the next version.
Class 1 – This contains the sources and relevant information for people who are interested in replicating our dataset or accessing the news reports used in our analysis. Please note that, due to copyright regulations, the texts cannot be shared. However, you can refer to the links provided in these files to access the news articles and Wikipedia pages. Some links stopped working while we were working on this study, and others may become unreachable in the future.
Class 2 – This contains the results from a lexical analysis of the corpus. The HTML page allows you to visualize each result interactively through the online VOSviewer app (you need to download the file and open it using a browser, since Zenodo does not recognize this as a link). It is possible that this service (the VOSviewer app) may be discontinued at some point in the future. PNG images of the lexical maps are therefore available for download through the ZIP archive, although they do not allow interactive access. If you plan to read our results using the offline VOSviewer software or perform a more systematic analysis, JSON files are available for each category (time period, geographical area of the reporting institution, and purpose of gathering). The same files can also be found in the ZIP archive in Class 0.
Class 3 – These are the results of the sentiment analysis. For each report, a single result is generated for the title. However, for the body, the text is divided into parts, which are analyzed independently.
Class 4 – These two files contain the Wikipedia corpus relative to 68 crowd accidents which occurred between 1990 and 2019. The texts for all accidents were scraped on October 15th, 2022 (before the tragedy in Itaewon) and on May 25th, 2023 (after the tragedy). Sources relative to the content in Wikipedia are listed in the file contained in Class 1 ("1_list_wiki_report.csv"). More generally, accidents listed on the dedicated Wikipedia page at https://en.wikipedia.org/wiki/List_of_fatal_crowd_crushes are reported in the corpus provided here (the period 1900-2019 is considered here).
The format of CSV and JSON files should be self-explanatory after reading our publication. For specific questions or queries, please contact one of the authors, and we will try to assist you.
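As a starting point for the offline analysis mentioned above, a minimal sketch of opening one CSV and one JSON file from the repository follows; 1_list_wiki_report.csv is named in the Class 4 description, while 2_example_category.json is a placeholder to be replaced with the JSON file for the category of interest.

```python
# Minimal sketch: load the list of Wikipedia sources (named in Class 1)
# and one of the per-category JSON result files (placeholder name below).
import json
import pandas as pd

wiki_sources = pd.read_csv("1_list_wiki_report.csv")
print(wiki_sources.head())

# Replace the placeholder with the JSON file for the category of interest
# (time period, geographical area, or purpose of gathering).
with open("2_example_category.json", encoding="utf-8") as f:
    lexical_map = json.load(f)
print(type(lexical_map))
```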
Listing of Social Security Administration LEGAL REFERRALS.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the figure and dataset inventory for the article "The detection of emerging trends using Wikipedia traffic data and context networks".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf. This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).
The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl.
nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008 (http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf).
popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper based on different strategies for sampling articles from Wikipedia in order to account for Wikipedia's biased distribution (see that paper). scheme.tsv maps these fine-grained labels to coarser annotations, including CoNLL 2003-style.
wikigold.conll.txt is a manual NER annotation of some Wikipedia text as presented in Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).
See also corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
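system2conll.pl is the converter shipped with the data; purely as an illustration of the pipe-delimited format, a rough Python sketch follows, assuming each line of a .bz2 corpus holds one sentence of space-separated token|POS|NER triples and that the example file name matches one of the shipped corpora. The Perl script remains the authoritative conversion tool.

```python
# Rough sketch of the pipe-delimited -> CoNLL-style conversion performed by
# system2conll.pl. Assumption: each input line is one sentence of
# space-separated token|POS|NER triples. The input file name is an example.
import bz2

def wikiner_to_conll(path_in, path_out):
    with bz2.open(path_in, "rt", encoding="utf-8") as fin, \
         open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            for triple in line.split(" "):
                parts = triple.split("|")
                if len(parts) != 3:   # skip anything not matching the assumed format
                    continue
                token, pos, ner = parts
                fout.write(f"{token} {pos} {ner}\n")
            fout.write("\n")          # blank line separates sentences

wikiner_to_conll("aij-wikiner-en-wp2.bz2", "wikiner-en-wp2.conll")
```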