15 datasets found
  1. Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

    • figshare.com
    txt
    Updated Mar 14, 2021
    Cite
    KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
    Available download formats: txt
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    figshare
    Authors
    KayYen Wong; Diego Saez-Trumper; Miriam Redi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiki-Reliability: machine learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, in the formats:
    - {template_name}_features.csv
    - {template_name}_difftxt.csv.gz
    - {template_name}_fulltxt.csv.gz
    For more details on the project, dataset schema, and links to data usage and benchmarking, see https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
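    The per-template naming scheme can be wrapped in a small helper. A minimal sketch, assuming the file-name pattern stated in the description; the template name "pov" is purely illustrative:

```python
# Hypothetical helper for locating one template's Wiki-Reliability files;
# the naming pattern follows the dataset description, while the template
# name used below is an illustrative assumption.
def wiki_reliability_files(template_name):
    """Return the three expected file names for one template."""
    return {
        "features": f"{template_name}_features.csv",
        "diff_text": f"{template_name}_difftxt.csv.gz",
        "full_text": f"{template_name}_fulltxt.csv.gz",
    }

files = wiki_reliability_files("pov")
print(files["full_text"])  # pov_fulltxt.csv.gz
```

    The .csv.gz files can typically be read directly by tools such as pandas, which infer gzip compression from the file extension.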

  2. Wikipedia Talk Labels: Personal Attacks

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Personal Attacks [Dataset]. http://doi.org/10.6084/m9.figshare.4054689.v6
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
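    Because each comment is labeled by several annotators, a typical first step is aggregating the per-worker judgments, for example by majority vote. A minimal sketch over illustrative data; the real file schema is documented on the project wiki, and the (comment id, binary label) layout here is an assumption:

```python
from collections import defaultdict

# Majority-vote aggregation over per-annotator binary labels.
# Input rows are (comment_id, label) pairs; the ids and labels below
# are invented for illustration, not taken from the dataset.
def majority_vote(rows):
    votes = defaultdict(list)
    for comment_id, label in rows:
        votes[comment_id].append(label)
    # Flag a comment when at least half of its annotators labeled it an attack.
    return {cid: sum(ls) / len(ls) >= 0.5 for cid, ls in votes.items()}

sample = [(101, 1), (101, 1), (101, 0), (202, 0), (202, 0)]
print(majority_vote(sample))  # {101: True, 202: False}
```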

  3. Wikipedia Talk Labels: Aggression

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Aggression [Dataset]. http://doi.org/10.6084/m9.figshare.4267550.v5
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has aggressive tone. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.

  4. Wiki-TabNER dataset

    • data.niaid.nih.gov
    Updated Jun 14, 2024
    Cite
    Koleva, Aneta (2024). Wiki-TabNER dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10794525
    Dataset updated
    Jun 14, 2024
    Dataset authored and provided by
    Koleva, Aneta
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset described in the paper Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables.

    It is a dataset of tables extracted from Wikipedia pages and annotated with DBpedia entity types. The file Wiki_TabNER_final_labeled.json contains the annotated tables. It can be used for solving NER within tables and for the entity linking task. The file dataset_entities_labeled_linked.csv contains all the linked entities that are mentioned in the tables and their corresponding Wikipedia IDs. More information on the creation of the dataset and instructions on how to use it are available in the GitHub repository for the paper.
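    Once downloaded, the labeled tables can be iterated cell by cell. A minimal sketch; the field names below ("tables", "cells", "text", "entity_type") are placeholders, since the actual JSON schema of Wiki_TabNER_final_labeled.json is documented in the paper's repository:

```python
import json

# Illustrative only: this inline sample mimics, but does not reproduce,
# the real Wiki-TabNER schema.
sample = json.loads(
    '{"tables": [{"cells": ['
    '{"text": "Berlin", "entity_type": "Place"},'
    '{"text": "3,600,000", "entity_type": null}]}]}'
)

def labeled_cells(data):
    """Yield (text, type) for every cell that carries an entity annotation."""
    for table in data["tables"]:
        for cell in table["cells"]:
            if cell["entity_type"] is not None:
                yield cell["text"], cell["entity_type"]

print(list(labeled_cells(sample)))  # [('Berlin', 'Place')]
```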

  5. Wikipedia Talk Labels: Toxicity

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Nithum Thain; Lucas Dixon; Ellery Wulczyn (2017). Wikipedia Talk Labels: Toxicity [Dataset]. http://doi.org/10.6084/m9.figshare.4563973.v2
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Nithum Thain; Lucas Dixon; Ellery Wulczyn
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.

  6. Representation in Wikipedia: Intersectional Insights on Gender and Diversity in Main Page Featured Biographies (2013–2024)

    • dataverse.csuc.cat
    tsv, txt
    Updated Dec 3, 2024
    Cite
    Aisa Serra Gil; Núria Ferran Ferrer; Miquel Centelles Velilla (2024). Representation in Wikipedia: Intersectional Insights on Gender and Diversity in Main Page Featured Biographies (2013–2024) [Dataset]. http://doi.org/10.34810/data1634
    Available download formats: tsv (1694772), tsv (689348), tsv (1121944), txt (12646), tsv (296993)
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Aisa Serra Gil; Núria Ferran Ferrer; Miquel Centelles Velilla
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset shows values taken from biography articles that appeared in the "From today's featured article", "Did you know..." and "On this day" sections of the Main Page of the English edition of Wikipedia between 2013 and 2024. The values contained in this dataset were obtained by crossing Wikidata properties with the unique identifiers of the articles. These data provide information about the people described in the articles, such as gender, ethnicity, sexual orientation and native language, among other properties, so that the representation of diversity in Wikipedia can be analyzed from an intersectional perspective. The document Joint-data contains all the joint data without distinguishing by the gender of the person, while the other documents split the information by the gender of the people in the articles: "Women" for the data of cisgender women, "Men" for the data of cisgender men, and "Dissident" for the data of people whose gender differs from the one they were assigned at birth. There are therefore four documents: Joint-data; Dissident_Gender-categorized-data; Men_Gender-categorized-data; Women_Gender-categorized-data. In each document, odd columns state the Wikidata properties analyzed and even columns give the number of results for each value of the property, that is, the occurrences of each value.
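    The odd/even column layout can be parsed into per-property counts. A minimal sketch over an invented two-property header; the real column names come from the TSV files themselves:

```python
import csv
import io

# Illustrative TSV mimicking the layout described above: an odd column holds a
# property's values, the even column next to it holds the occurrence counts.
tsv = "gender\tcount\tlanguage\tcount\nmale\t120\tEnglish\t90\n"
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
header, data = rows[0], rows[1:]

counts = {}
for i in range(0, len(header), 2):       # property column, then its count
    prop = counts.setdefault(header[i], {})
    for row in data:
        if len(row) > i and row[i]:
            prop[row[i]] = int(row[i + 1])

print(counts)  # {'gender': {'male': 120}, 'language': {'English': 90}}
```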

  7. Representation at Social Security Hearings

    • catalog.data.gov
    • datasets.ai
    Updated Jan 24, 2025
    + more versions
    Cite
    Social Security Administration (2025). Representation at Social Security Hearings [Dataset]. https://catalog.data.gov/dataset/representation-at-social-security-hearings
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    This dataset provides information on the percentage of claimants at OHO hearings who were represented either by an attorney or by a non-attorney representative. This data is at the national level by fiscal year for the period of 1979 through 2015.

  8. Structured knowledge bases for the inference of computational trust of Wikipedia editors

    • figshare.com
    pdf
    Updated May 5, 2020
    Cite
    Lucas Rizzo; Luca Longo (2020). Structured knowledge bases for the inference of computational trust of Wikipedia editors [Dataset]. http://doi.org/10.6084/m9.figshare.12249770.v4
    Available download formats: pdf
    Dataset updated
    May 5, 2020
    Dataset provided by
    figshare
    Authors
    Lucas Rizzo; Luca Longo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge bases structured around IF-THEN rules and defined for the inference of computational trust in the Wikipedia context.

  9. Alert Display Distribution (ADD)

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Jul 4, 2025
    + more versions
    Cite
    Social Security Administration (2025). Alert Display Distribution (ADD) [Dataset]. https://catalog.data.gov/dataset/alert-display-distribution-add
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    A repository of alerts sent to SSA employees when certain conditions exist, informing them of work that needs to be done, is being reviewed, or has been completed.

  10. English Wikipedia labeled mid-level wikiprojects set

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Sumit Asthana; Aaron Halfaker (2023). English Wikipedia labeled mid-level wikiprojects set [Dataset]. http://doi.org/10.6084/m9.figshare.5640526.v1
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Authors
    Sumit Asthana; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains a set of 93,449 observations providing wikiproject mid-level category labels associated with talk pages for respective Wikipedia articles. Each observation includes a talk page title, talk page id, latest revision id when the extraction was done, the associated wikiproject templates, and the mid-level wikiproject categories the corresponding article page belongs to. The dataset was generated using a Python script that ran MySQL queries on Wikimedia PAWS. To ensure a balanced set, the script extracts a random set of 2000 page-ids per mid-level category, totaling about 93,449 observations. This dataset opens up immense possibilities for topic-oriented research around Wikipedia, as it exposes high-level topic data associated with Wikipedia pages.
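    The balanced-sampling step described above can be sketched as follows; the 2000-per-category figure comes from the description, while the helper and the category names are illustrative:

```python
import random

def balanced_sample(pages_by_category, n_per_category, seed=0):
    """Take up to n_per_category random page ids from each category."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {
        category: rng.sample(page_ids, min(n_per_category, len(page_ids)))
        for category, page_ids in pages_by_category.items()
    }

# Invented category names; the real mid-level categories come from the dataset.
pages = {"STEM.Biology": list(range(10)), "Culture.Media": list(range(3))}
out = balanced_sample(pages, n_per_category=5)
print({c: len(ids) for c, ids in out.items()})  # {'STEM.Biology': 5, 'Culture.Media': 3}
```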

  11. BLM National Rights-of-Way Public Display Polygons

    • data.doi.gov
    Updated Oct 21, 2021
    Cite
    Bureau of Land Management (Point of Contact) (2021). BLM National Rights-of-Way Public Display Polygons [Dataset]. https://data.doi.gov/dataset/blm-national-rights-of-way-public-display-polygons1
    Dataset updated
    Oct 21, 2021
    Dataset provided by
    Bureau of Land Management (http://www.blm.gov/)
    Description

    This dataset contains ROW data programmatically extracted from LR2000, joined to PLSS polygons, and then dissolved by case serial numbers.

  12. Data from: Representation of crowd accidents in popular media

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 26, 2023
    Cite
    Corbetta, Alessandro (2023). Representation of crowd accidents in popular media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8347228
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Feliciani, Claudio
    Corbetta, Alessandro
    Haghani, Milad
    Nishinari, Katsuhiro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains results related to the analysis of a corpus of news reports covering the topic of crowd accidents. To facilitate online visualization and offline analysis, the files are organized by assigning a number to each. The numbering system and the details of each set of files are described as follows:

    Class 0 – This contains the same files provided in this repository, but organized into folders to make analysis easier. If you intend to analyze the data from our lexical analysis, we suggest using this file, since it is better organized and can be directly downloaded. Please note that, due to a mistake when creating the newest version, the Wikipedia files were not included in this file, so they need to be downloaded separately. This will be fixed in the next version.

    Class 1 – This contains the sources and relevant information for people interested in replicating our dataset or accessing the news reports used in our analysis. Please note that, due to copyright regulations, the texts cannot be shared. However, you can refer to the links provided in these files to access the news articles and Wikipedia pages. Some links stopped working while we were working on this study, and others may become unreachable in the future.

    Class 2 – This contains the results of a lexical analysis of the corpus. The HTML page allows you to visualize each result interactively through the online VOSviewer app (you need to download the file and open it in a browser, since Zenodo does not recognize this as a link). This service (the VOSviewer app) may be discontinued at some point in the future. PNG images of the lexical maps are therefore available for download through the ZIP archive, although they do not allow interactive access. If you plan to read our results using the offline VOSviewer software or perform a more systematic analysis, JSON files are available for each category (time period, geographical area of the reporting institution, and purpose of gathering). The same files can also be found in the ZIP archive in Class 0.

    Class 3 – These are the results of the sentiment analysis. For each report, a single result is generated for the title. However, for the body, the text is divided into parts, which are analyzed independently.

    Class 4 – These two files contain the Wikipedia corpus relative to 68 crowd accidents which occurred between 1990 and 2019. The texts for all accidents were scraped on October 15th, 2022 (before the tragedy in Itaewon) and on May 25th, 2023 (after the tragedy). Sources relative to the content in Wikipedia are listed in the file contained in Class 1 ("1_list_wiki_report.csv"). More generally, accidents listed on the dedicated Wikipedia page https://en.wikipedia.org/wiki/List_of_fatal_crowd_crushes are reported in the corpus provided here (the period 1990-2019 is considered).

    The format of CSV and JSON files should be self-explanatory after reading our publication. For specific questions or queries, please contact one of the authors, and we will try to assist you.

  13. Representation Listing

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Mar 8, 2025
    Cite
    Social Security Administration (2025). Representation Listing [Dataset]. https://catalog.data.gov/dataset/representation-listing
    Dataset updated
    Mar 8, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    Listing of Social Security Administration legal referrals.

  14. Dataset and Image Inventory

    • figshare.com
    pdf
    Updated Jan 20, 2016
    Cite
    Mirko Kämpf (2016). Dataset and Image Inventory [Dataset]. http://doi.org/10.6084/m9.figshare.1619639.v6
    Available download formats: pdf
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Mirko Kämpf
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the figure and dataset inventory for the article: The detection of emerging trends using Wikipedia traffic data and context networks

  15. Data from: Learning multilingual named entity recognition from Wikipedia

    • figshare.com
    bz2
    Updated May 30, 2023
    Cite
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran (2023). Learning multilingual named entity recognition from Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.5462500.v1
    Available download formats: bz2
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf. This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).

    The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl.

    nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008 (http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf). popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper based on different strategies for sampling articles from Wikipedia, in order to account for Wikipedia's biased distribution (see that paper). scheme.tsv maps these fine-grained labels to coarser annotations, including CoNLL 2003-style.

    wikigold.conll.txt is a manual NER annotation of some Wikipedia text as presented in Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).

    See also corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
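    The bundled system2conll.pl is the authoritative converter for the pipe-delimited corpora; as a rough Python equivalent, assuming each corpus line is one sentence of space-separated token|POS|tag triples (verify this against the actual files):

```python
# Hedged sketch of a pipe-delimited -> CoNLL-style conversion; the dataset's
# own system2conll.pl remains the reference implementation.
def pipes_to_conll(line):
    """Turn 'token|POS|tag token|POS|tag ...' into one CoNLL row per token."""
    rows = []
    for triple in line.split():
        token, pos, tag = triple.split("|")
        rows.append(f"{token} {pos} {tag}")
    return "\n".join(rows)

print(pipes_to_conll("Paris|NNP|I-LOC is|VBZ|O nice|JJ|O"))
```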

