27 datasets found
  1. Wikipedia Talk Labels: Toxicity

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Nithum Thain; Lucas Dixon; Ellery Wulczyn (2017). Wikipedia Talk Labels: Toxicity [Dataset]. http://doi.org/10.6084/m9.figshare.4563973.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nithum Thain; Lucas Dixon; Ellery Wulczyn
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file, and our research paper for documentation of the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, see the accompanying IPython notebook.
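    The per-annotator labels described above can be collapsed into one label per comment by majority vote. A minimal sketch with pandas, using a made-up mini-sample (the real file and column names are documented in the project wiki):

```python
import pandas as pd

# Hypothetical mini-sample mirroring the annotation layout described above:
# one row per (comment, annotator) pair with a binary toxicity judgment.
annotations = pd.DataFrame({
    "rev_id":    [101, 101, 101, 202, 202, 202],
    "worker_id": [1,   2,   3,   1,   2,   4],
    "toxicity":  [1,   1,   0,   0,   0,   1],
})

# Majority vote across annotators gives one label per comment.
labels = (
    annotations.groupby("rev_id")["toxicity"]
    .mean()                      # fraction of annotators saying "toxic"
    .gt(0.5)                     # majority threshold
    .astype(int)
    .rename("toxic_majority")
)

print(labels.to_dict())  # {101: 1, 202: 0}
```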

  2. Wikipedia Pageviews

    • kaggle.com
    zip
    Updated Nov 28, 2025
    Cite
    Vlad (2025). Wikipedia Pageviews [Dataset]. https://www.kaggle.com/datasets/vladtasca/wikipedia-pageviews/code
    Explore at:
    Available download formats: zip (3888245 bytes)
    Dataset updated
    Nov 28, 2025
    Authors
    Vlad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset aggregates the 100 most popular Wikipedia articles by pageviews, enabling the tracking of trending topics on Wikipedia.

    The data begins in 2016, and the textual data is presented as it appears on Wikipedia.

    Column description

    • rank - Rank of the article (out of 100).
    • article - Title of the article.
    • views - Number of pageviews (across all platforms).
    • date - Date of the pageviews.

    Update schedule

    This dataset is updated on a daily basis with new data sourced from the WikiMedia API.
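    The daily update described above maps onto the Wikimedia Pageviews REST API. The sketch below builds the request URL for a given day and reshapes a response of the documented form into the dataset's rank/article/views/date columns; the sample values are invented:

```python
from datetime import date

def top_articles_url(day: date, project: str = "en.wikipedia") -> str:
    # Wikimedia REST API endpoint for the most-viewed articles on a given day.
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"{project}/all-access/{day:%Y/%m/%d}"
    )

# A trimmed response in the API's documented shape (values are illustrative).
sample = {"items": [{"articles": [
    {"article": "Main_Page", "views": 5000000, "rank": 1},
    {"article": "Python_(programming_language)", "views": 80000, "rank": 2},
]}]}

# Flatten into the dataset's four columns.
rows = [
    {"rank": a["rank"], "article": a["article"],
     "views": a["views"], "date": date(2016, 1, 1)}
    for a in sample["items"][0]["articles"]
]

print(top_articles_url(date(2016, 1, 1)))
```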

  3. Event representation on Wikidata and Wikipedia with, and without the...

    • data.europa.eu
    unknown
    Cite
    Zenodo, Event representation on Wikidata and Wikipedia with, and without the analysis of vernacular languages [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4733507?locale=el
    Explore at:
    Available download formats: unknown (1594252)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project aims to show, with data, that it is necessary to analyze vernacular languages when dealing with events that are described using public sources like Wikidata and Wikipedia. To retrieve and analyze events, it uses the wikivents Python package. The project directory includes the Jupyter Notebook that processed (and/or generated) the contents of the dataset directory. Statistics from this analysis are located in the stats directory; the main statistics are reported in the associated paper.

  4. Structured knowledge bases for the inference of computational trust of...

    • figshare.com
    pdf
    Updated May 5, 2020
    Cite
    Lucas Rizzo; luca longo (2020). Structured knowledge bases for the inference of computational trust of Wikipedia editors [Dataset]. http://doi.org/10.6084/m9.figshare.12249770.v4
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 5, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Rizzo; luca longo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge bases structured around IF-THEN rules and defined for the inference of computational trust in the Wikipedia context.
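    As a sketch of what inference over IF-THEN rules of this kind might look like, here is a single forward-chaining pass; the rule names and editor features below are invented for illustration, not taken from the dataset:

```python
# Hypothetical IF-THEN rules: (set of conditions, conclusion). The real
# rules and trust features are defined in the dataset's documents.
rules = [
    ({"edits_reverted_rate_low", "account_age_high"}, "trustworthy"),
    ({"edits_reverted_rate_high"}, "untrustworthy"),
]

def infer(facts: set) -> set:
    # One pass of forward chaining: fire every rule whose conditions
    # are all present in the known facts.
    return {concl for conds, concl in rules if conds <= facts}

print(infer({"edits_reverted_rate_low", "account_age_high"}))  # {'trustworthy'}
```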

  5. wikipedia-pageviews

    • huggingface.co
    Updated Feb 28, 2025
    Cite
    Vlad Tasca (2025). wikipedia-pageviews [Dataset]. https://huggingface.co/datasets/vtasca/wikipedia-pageviews
    Explore at:
    Dataset updated
    Feb 28, 2025
    Authors
    Vlad Tasca
    License

    https://choosealicense.com/licenses/cc/

    Description

    Wikipedia Article Pageviews

    This repository automatically fetches and aggregates the 100 most popular Wikipedia articles by pageviews, creating a dataset that enables tracking trending topics on Wikipedia. It works by polling the WikiMedia API on a daily basis and fetching the top 100 most popular articles from two days ago. The fetcher runs in a scheduled GitHub Actions workflow, which is available here. The dataset begins in the year 2016 and the textual data is presented as it… See the full description on the dataset page: https://huggingface.co/datasets/vtasca/wikipedia-pageviews.

  6. Teahouse corpus

    • data.wu.ac.at
    .txt, csv
    Updated Apr 12, 2015
    Cite
    Wikimedia (2015). Teahouse corpus [Dataset]. https://data.wu.ac.at/schema/datahub_io/MmZiZjJmNWEtM2E2OS00NGZmLTgyMjUtMDk1MmVhNTQ0NGU1
    Explore at:
    Available download formats: .txt, csv
    Dataset updated
    Apr 12, 2015
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    License

    http://www.opendefinition.org/licenses/cc-by-sa

    Description

    The Teahouse corpus is a set of questions asked at the Wikipedia Teahouse, a peer support forum for new Wikipedia editors. This corpus contains data from its first two years of operation.

    The Teahouse started as an editor engagement initiative and Fellowship project. It was launched in February 2012 by a small team working with the Wikimedia Foundation. Our intention was to pilot a new, scalable model for teaching Wikipedia newcomers the ropes of editing in a friendly and engaging environment.

    The ultimate goal of the pilot project was to increase the retention of new Wikipedia editors (most of whom give up and leave within their first 24 hours post-registration) through early proactive outreach. The project was particularly focused on retaining female newcomers, who are woefully underrepresented among the regular contributors to the encyclopedia.

    The Teahouse lives on as a vibrant, self-sustaining, community-driven project. All Teahouse participants are volunteers: no one is told when, how, or how much they must contribute.

    See the README files associated with each datafile for a schema of the data fields in that file.

    Read on for more info on potential applications, the provenance of these data, and links to related resources.

    Potential Applications

    or, what is it good for?

    The Teahouse corpus consists of good quality data and rich metadata around social Q&A interactions in a particular setting: new user help requests in a large, collaborative online community.

    More generally, this corpus is a valuable resource for research on conversational dynamics in online, asynchronous discussions.

    Qualitative textual analysis could yield insights into the kinds of issues faced by newcomers in established online collaborations.

    Linguistic analysis could examine the impact of syntactic and semantic features related to politeness, sentiment, question framing, or other rhetorical strategies on discussion outcomes.

    Response patterns (questioner replies and answers) within each thread could be used to map network relationships, or to investigate the effect of participation by the thread's initiator, or of the number of participants, on thread length or interactivity (the interval of time between posts).

    The corpus is large and rich enough to provide both training and test data for machine learning applications.

    Finally, the data provided here can be extended and compared with other publicly available Wikipedia datasets, allowing researchers to examine relationships between editors' participation within the Teahouse Q&A forum and their previous, concurrent, and subsequent editing activity across millions of other articles, meta-content, and discussion spaces on Wikipedia.
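    For instance, the interactivity measure mentioned above (the interval of time between posts) can be computed from per-thread timestamps. The thread ids and timestamps below are hypothetical; the corpus's real field names are documented in its per-file READMEs:

```python
from datetime import datetime, timedelta

# Hypothetical post timestamps grouped by thread id.
threads = {
    "t1": ["2012-03-01 10:00", "2012-03-01 10:45", "2012-03-01 12:00"],
    "t2": ["2012-03-02 09:00", "2012-03-03 09:00"],
}

def mean_interval(stamps):
    # Sort posts chronologically and average the gaps between them.
    ts = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return sum(gaps, timedelta()) / len(gaps)

interactivity = {tid: mean_interval(s) for tid, s in threads.items()}
print(interactivity["t1"])  # 1:00:00  (mean of the 45-min and 75-min gaps)
```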

    Data hygiene

    or, how the research sausage was made

    Parsing wikitext presents many challenges: the MediaWiki editing interface is deliberately underspecified in order to maximize flexibility for contributors. This can make it difficult to tell the difference between different types of contribution, say, fixing a typo versus answering a question.

    The Teahouse Q&A board was designed to provide a more structured workflow than normal wiki talk pages, and instrumented to identify certain kinds of contributions (questions and answers) and isolate them from the 'noisy' background datastream of incidental edits to the Q&A page. The post-processing of the data presented here favored precision over recall: to provide a good quality set of questions, rather than a complete one.

    In cases where it was not easy to identify whether an edit contained a question or answer, the edit has not been included. However, it is hard to account for all ambiguous or invalid cases: caveat quaesitor!

    Our approach to data inclusion was conservative. The number of questioner replies and answers to any given question may be under-counted, but is unlikely to be over-counted. However, our spot checks and analysis of the data suggest that the majority of responses are accounted for, and that "missed" responses are randomly distributed.

    The Teahouse corpus only contains questions and answers by registered users of Wikipedia who were logged in when they participated. IP addresses can be linked to an individual's physical location, and on Wikipedia, edits by logged-out and unregistered users are identified by the user's current IP address. Although all edits to Wikipedia are legally public and freely licensed, we have redacted IP edits from this dataset in deference to user privacy. Researchers interested in those data can find them in other public Wikipedia datasets.

    Possible future additions

    Additional data about these Q&A interactions has been collected, and other data are retrievable. Examples of data that could be included in future revisions of the corpus at low cost include:

    • more metadata about the people asking questions:
      • how many edits had they made before asking their (first) question?
      • when did they join Wikipedia?
      • were they explicitly invited to participate in the Teahouse, or did they locate the forum by other means?
      • did the questioner also create a guest profile on the Teahouse introductions page?
    • more metadata about the people answering the questions:
      • were they a Teahouse host at the time they answered a question?

    Examples of data that could be included in future revisions of the corpus at reasonable cost:

    • full text of answers to questions, including replies by original questioner
    • full text of profiles created by Teahouse guests and hosts (some privacy considerations here; contact corpus maintainer directly if interested in these data)


  7. Wikipedia Talk Labels: Aggression

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Aggression [Dataset]. http://doi.org/10.6084/m9.figshare.4267550.v5
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has an aggressive tone. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file, and our research paper for documentation of the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, see the accompanying IPython notebook.

  8. History From Wikipedia

    • kaggle.com
    zip
    Updated Apr 23, 2024
    Cite
    Budianto (2024). History From Wikipedia [Dataset]. https://www.kaggle.com/datasets/budibudi/history-from-wikipedia
    Explore at:
    Available download formats: zip (25827951 bytes)
    Dataset updated
    Apr 23, 2024
    Authors
    Budianto
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Sample Historical Data from Wikipedia (PDF format)

    This dataset provides some historical information for various countries, including Indonesia, Greece, Rome, France, Vietnam, Korea, Peru, England, Germany, Mexico, Iran, India, China, Egypt, and Japan. The data is sourced from Wikipedia and presented in a single PDF file for each country.

    • Source: Wikipedia
    • Content: Historical data for various countries
    • Format: Individual PDF files per country

  9. English Wikipedia labeled mid-level wikiprojects set

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Sumit Asthana; Aaron Halfaker (2023). English Wikipedia labeled mid-level wikiprojects set [Dataset]. http://doi.org/10.6084/m9.figshare.5640526.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sumit Asthana; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains a set of 93,449 observations providing wikiproject mid-level category labels associated with the talk pages of the corresponding Wikipedia articles. Each observation includes a talk page title, talk page id, the latest revision id at the time of extraction, the associated wikiproject templates, and the mid-level wikiproject categories the corresponding article page belongs to. The dataset was generated by a Python script that ran MySQL queries on Wikimedia PAWS. To ensure a balanced set, the script extracts a random set of 2,000 page-ids per mid-level category, totaling about 93,449 observations. This dataset opens up immense possibilities for topic-oriented research around Wikipedia, as it exposes high-level topic data associated with Wikipedia pages.
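    The balanced-sampling step (a fixed number of random page-ids per mid-level category) can be sketched with pandas. The table below is a toy stand-in for the real one, drawing 2 rows per category instead of 2,000:

```python
import pandas as pd

# Toy stand-in for the (page_id, mid_level_category) table.
pages = pd.DataFrame({
    "page_id":  range(10),
    "category": ["History"] * 6 + ["Science"] * 4,
})

# Draw a fixed-size random sample within each category.
balanced = pages.groupby("category").sample(n=2, random_state=0)

print(balanced["category"].value_counts().to_dict())  # {'History': 2, 'Science': 2}
```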

  10. Representation in Wikipedia: Intersectional Insights on Gender and Diversity...

    • dataverse.csuc.cat
    tsv, txt
    Updated Dec 3, 2024
    Cite
    Aisa Serra Gil; Núria Ferran Ferrer; Miquel Centelles Velilla (2024). Representation in Wikipedia: Intersectional Insights on Gender and Diversity in Main Page Featured Biographies (2013–2024) [Dataset]. http://doi.org/10.34810/data1634
    Explore at:
    Available download formats: tsv (1694772), tsv (689348), tsv (1121944), txt (12646), tsv (296993)
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Aisa Serra Gil; Núria Ferran Ferrer; Miquel Centelles Velilla
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset shows values taken from biography articles that appeared in the "From today's featured article", "Did you know...", and "On this day" sections of the Main Page of the English edition of Wikipedia between 2013 and 2024. The values in this dataset were obtained by crossing Wikidata properties with the unique identifiers of the articles. These data provide information about the people described in the articles, such as gender, ethnicity, sexual orientation, and native language, among other properties, so that the representation of diversity in Wikipedia can be analyzed from an intersectional perspective. The document Joint-data contains all the data together, without distinction by the gender of the biography's subject, while the other documents split the information by the gender of the people in the articles: "Women" for the data of cisgender women, "Men" for the data of cisgender men, and "Dissident" for the data of people whose gender is dissident from the one they were assigned at birth. There are therefore four documents: Joint-data; Dissident_Gender-categorized-data; Men_Gender-categorized-data; Women_Gender-categorized-data. In each document, odd columns state the Wikidata properties analyzed and even columns give the number of results for each value of the property, that is, the occurrences of each value.
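    The odd-column/even-column layout described above can be read back into per-property tallies. The header and rows below are a hypothetical two-property slice, not the dataset's real values:

```python
# Odd columns hold the values of a Wikidata property; the even column to
# the right of each holds the occurrence count for that value.
header = ["gender", "count", "native_language", "count"]
rows = [
    ["male",   "70", "English", "55"],
    ["female", "30", "French",  "12"],
]

# Pair each property column with the count column to its right.
tallies = {}
for i in range(0, len(header), 2):
    prop = header[i]
    tallies[prop] = {r[i]: int(r[i + 1]) for r in rows if r[i]}

print(tallies["gender"])  # {'male': 70, 'female': 30}
```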

  11. ruWiki Elections

    • kaggle.com
    zip
    Updated Feb 13, 2022
    Cite
    Caerno (2022). ruWiki Elections [Dataset]. https://www.kaggle.com/datasets/caerno/ruwiki-elections
    Explore at:
    Available download formats: zip (134502 bytes)
    Dataset updated
    Feb 13, 2022
    Authors
    Caerno
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Out of pure interest, I analyzed the clustering of voters by their votes and drew dendrograms, which many participants found interesting.

    Content

    In the source, votes are given without a date, so users who changed their votes during the election of arbitrators (both votes are preserved in the source) and thus appear in both the "for" and "against" sections will be counted as voting against.

    Acknowledgements

    Many thanks to MBH, who collected the data. You can visit his tool (in Russian).

    Inspiration

    It is not uncommon on Wikipedia for participants to create virtual accounts and, in violation of the rules, vote in elections with several accounts simultaneously. Some such cases have been identified; others you can help identify.

    In 2021 the Russian Wikipedia elections were attacked by a group of conspirators. Can you spot them directly from the data presented?

    For deeper analysis, each election is described: the year it took place, who the candidate was, and the type of vote (for an arbitrator, administrator, or bureaucrat). Based on these data, it is possible to identify candidates whose views are likely to coincide.

  12. Data from: Learning multilingual named entity recognition from Wikipedia

    • figshare.com
    bz2
    Updated May 30, 2023
    Cite
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran (2023). Learning multilingual named entity recognition from Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.5462500.v1
    Explore at:
    Available download formats: bz2
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf. This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).

    The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl.

    nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008 (http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf).

    popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper based on different strategies for sampling articles from Wikipedia, in order to account for Wikipedia's biased distribution (see that paper). scheme.tsv maps these fine-grained labels to coarser annotations, including CoNLL 2003-style.

    wikigold.conll.txt is a manual NER annotation of some Wikipedia text as presented in Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).

    See also the corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
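    Assuming each wp2/wp3 line holds space-separated token|POS|NER triples (the canonical conversion is the bundled system2conll.pl), a minimal Python equivalent of the format conversion might look like this:

```python
def wikiner_to_conll(line: str) -> str:
    # One WikiNER line = one sentence of space-separated token|POS|NER
    # triples; emit one token per line, CoNLL-style.
    out = []
    for triple in line.split():
        token, pos, ner = triple.split("|")
        out.append(f"{token} {pos} {ner}")
    return "\n".join(out) + "\n"

sentence = "John|NNP|I-PER lives|VBZ|O in|IN|O Sydney|NNP|I-LOC .|.|O"
print(wikiner_to_conll(sentence))
```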

  13. Gender and Intersectional Disparities in Biographies on English and Spanish...

    • dataverse.csuc.cat
    csv +3
    Updated Nov 25, 2024
    Cite
    Andrés Bejarano Randazzo; Miquel Centelles Velilla; Núria Ferran-Ferrer; Laura Fernández Aguilera (2024). Gender and Intersectional Disparities in Biographies on English and Spanish Wikipedia Front Pages (2013-2023) [Dataset]. http://doi.org/10.34810/data1427
    Explore at:
    Available download formats: tsv (28325), text/comma-separated-values (1133), csv (277034), tsv (26112), tsv (60), txt (3485), text/comma-separated-values (29816)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Andrés Bejarano Randazzo; Miquel Centelles Velilla; Núria Ferran-Ferrer; Laura Fernández Aguilera
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    https://ror.org/003x0zc53
    Description

    The dataset contains two folders with different data. The folder named "Gender" provides the gender distribution of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. For the Spanish edition, data has been collected from the "Artículos buenos" and "Artículos destacados" sections and is displayed in aggregated form. The folder named "Intersectionality" provides the distribution, by various sociodemographic attributes, of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. It is structured into four CSVs. Three correspond to the English Wikipedia edition: the English 3C CSV, containing data from the sections "Did you know...", "In the news", and "On this day..."; a CSV dedicated to "English Featured Article"; and another to "English Featured Picture". The fourth CSV contains data from the Spanish edition of Wikipedia, extracted from the sections "Artículo Destacado" and "Artículo Bueno". Within each CSV, the data is presented in columns, each dedicated to a sociodemographic attribute.

  14. Dataset and Image Inventory

    • figshare.com
    pdf
    Updated Jan 20, 2016
    Cite
    Mirko Kämpf (2016). Dataset and Image Inventory [Dataset]. http://doi.org/10.6084/m9.figshare.1619639.v6
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mirko Kämpf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the figure and dataset inventory for the article "The detection of emerging trends using Wikipedia traffic data and context networks".

  15. Wiki-TabNER dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 14, 2024
    Cite
    Koleva, Aneta (2024). Wiki-TabNER dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10794525
    Explore at:
    Dataset updated
    Jun 14, 2024
    Authors
    Koleva, Aneta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset described in the paper Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables.

    It is a dataset containing tables extracted from Wikipedia pages and annotated with DBpedia entity types. The file Wiki_TabNER_final_labeled.json contains the annotated tables; it can be used for NER within tables and for the entity linking task. The file dataset_entities_labeled_linked.csv contains all the linked entities mentioned in the tables and their corresponding Wikipedia IDs. More information on the creation of the dataset, and instructions on how to use it, are available in the GitHub repository for the paper.
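    A sketch of how the two files' roles fit together: the JSON supplies NER-tagged table cells and the CSV supplies the entity-to-Wikipedia-ID links. The field names, tags, and ID below are illustrative only; the real schema is documented in the paper's GitHub repository:

```python
import json
import csv
import io

# Stand-in for Wiki_TabNER_final_labeled.json: tables with tagged cells.
table_json = '{"tables": [{"page": "Example page", "cells": [["Brazil", "B-Country"]]}]}'
tables = json.loads(table_json)["tables"]

# Stand-in for dataset_entities_labeled_linked.csv: entity -> Wikipedia ID.
linked_csv = "entity,wikipedia_id\nBrazil,3383\n"
links = {row["entity"]: int(row["wikipedia_id"])
         for row in csv.DictReader(io.StringIO(linked_csv))}

# Entity-linking lookup for a tagged cell.
cell_text, cell_tag = tables[0]["cells"][0]
print(cell_text, cell_tag, links[cell_text])  # Brazil B-Country 3383
```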

  16. SSHOC - National Gallery - Grounds Database CIDOC CRM Mapped Dataset

    • zenodo.org
    • dataverse.nl
    • +2 more
    xml
    Updated Jul 16, 2024
    + more versions
    Cite
    Orla Delaney; Orla Delaney; Joseph Padfield; Joseph Padfield (2024). SSHOC - National Gallery - Grounds Database CIDOC CRM Mapped Dataset [Dataset]. http://doi.org/10.5281/zenodo.6478780
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Orla Delaney; Orla Delaney; Joseph Padfield; Joseph Padfield
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    In 2018 the IPERION-CH Grounds Database was presented, examining how the data produced through the scientific examination of historic painting preparation (ground) samples from multiple institutions could be combined in a flexible digital form, and exploring the presentation of interrelated high-resolution images, text, complex metadata, and procedural documentation. The original main user interface is live, though password-protected at this time. Work within the SSHOC project aimed to reformat the data to create a more FAIR dataset: in addition to mapping it to a standard ontology to increase interoperability, it has also been made available as open linkable data combined with a SPARQL endpoint. A draft version of this live data presentation can be found here.

    This is a draft dataset, and further work is planned to debug and improve its semantic structure. This deposit contains the CIDOC-CRM mapped data formatted in XML and an example model diagram representing some of the key relationships covered in the dataset.

  17. Presentation Of The City Bikes Program

    • hub.tumidata.org
    url
    Updated Nov 6, 2025
    Cite
    TUMI (2025). Presentation Of The City Bikes Program [Dataset]. https://hub.tumidata.org/dataset/presentation_of_the_city_bikes_program_salvador_de_bahia
    Explore at:
    Available download formats: url
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    Tumi Inc. (http://www.tumi.com/)
    Description

    Presentation Of The City Bikes Program
    This dataset falls under the category Individual Transport Other.
    It contains the following data: Presentation of the bicycle city project
    This dataset was scouted on 2022-02-14 as part of a data sourcing project conducted by TUMI. License information might be outdated; check the original source for current licensing. The data, along with license information, can be accessed using the following URL / API endpoint: http://www.planmob.salvador.ba.gov.br/index.php/13-estudos-projetos-e-programas?ml=1 Please note: this link leads to an external resource. If you experience any issues with its availability, please try again later.

  18. Bahia State Government Mobility Program

    • hub.tumidata.org
    url
    Updated Nov 6, 2025
    Cite
    TUMI (2025). Bahia State Government Mobility Program [Dataset]. https://hub.tumidata.org/dataset/bahia_state_government_mobility_program_salvador_de_bahia
    Explore at:
    urlAvailable download formats
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    Tumi Inc.http://www.tumi.com/
    Area covered
    State of Bahia
    Description

    Bahia State Government Mobility Program
    This dataset falls under the category Public Transport Other.
    It contains the following data: Presentation of the Mobility Program of the State of Bahia
    This dataset was scouted on 2022-02-14 as part of a data sourcing project conducted by TUMI. License information might be outdated: check the original source for current licensing. The data can be accessed via the following URL / API endpoint, which also provides license information: http://www.planmob.salvador.ba.gov.br/index.php/13-estudos-projetos-e-programas?ml=1 Please note: this link leads to an external resource. If you experience any issues with its availability, please try again later.

  19. LGBTQ representation in animated shows in the us

    • kaggle.com
    zip
    Updated Jun 15, 2022
    Cite
    A_N_Wilson (2022). LGBTQ representation in animated shows in the us [Dataset]. https://www.kaggle.com/datasets/anwilson/lgbtq-representation-in-animated-shows-in-the-us
    Explore at:
    zip(31986 bytes)Available download formats
    Dataset updated
    Jun 15, 2022
    Authors
    A_N_Wilson
    Description

    Context Corner

    While looking for a Capstone Project for the Google Data Analytics Program, I came across a dataset compiled by Bradd Carey (LGBTQ Characters in Youth Cartoons). That dataset was specific to data parsed from an Insider.com article published in June 2021. I decided to expand it to include characters from any animated show, regardless of target audience.

    Modifications to the Dataset

    I initially scraped information regarding LGBTQ characters from the following Wikipedia pages:
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters#1990s
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_1990%E2%80%931994
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_1995%E2%80%931999
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2000%E2%80%932004
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2005%E2%80%932009
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2010%E2%80%932014
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2015%E2%80%932019#2018
    https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2020%E2%80%93present
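    Such list pages hold their data in HTML tables, and a scrape like the one described above boils down to turning table rows into records. The sketch below shows the idea using only the Python standard library on an embedded snippet; the markup and column names are illustrative assumptions, not the actual structure of the Wikipedia pages (in practice, a helper like pandas.read_html does this in one call).

    ```python
    from html.parser import HTMLParser

    class TableRows(HTMLParser):
        """Collect the text of each <td>/<th> cell, grouped by table row."""
        def __init__(self):
            super().__init__()
            self.rows, self._row, self._in_cell = [], None, False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._in_cell = True
                self._row.append("")

        def handle_endtag(self, tag):
            if tag == "tr" and self._row is not None:
                self.rows.append(self._row)
                self._row = None
            elif tag in ("td", "th"):
                self._in_cell = False

        def handle_data(self, data):
            if self._in_cell and self._row:
                self._row[-1] += data.strip()

    # Illustrative markup only -- the real list pages have richer tables.
    SNIPPET = """
    <table>
      <tr><th>Show</th><th>Character</th><th>Years</th></tr>
      <tr><td>Example Show</td><td>Example Character</td><td>1995-1999</td></tr>
    </table>
    """

    parser = TableRows()
    parser.feed(SNIPPET)
    header, *body_rows = parser.rows
    # Zip the header row with each data row to get one dict per character.
    records = [dict(zip(header, r)) for r in body_rows]
    print(records)
    ```

    The same row-to-record step is what makes the later split into a "show" table and a "character" table straightforward: each record can be keyed on the show title.
    
    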

    I removed data on disability representation (for now) to narrow my project, as that information was not included on the Wikipedia pages and the Insider dataset was specific to youth cartoons.

    I removed studio information.

    If there was a difference between IMDb and Wikipedia for seasons, number of episodes, or start and end dates, I went with what was on IMDb.

    Removed shows that did not have an IMDb or Wikipedia page.

    Removed characters that appeared in spin-off shows and only included them in the first show in which they appeared.

    Split the data into two separate datasets for ease of querying: general show information and specific character information.

    Acknowledgements

    Bradd Carey for the dataset he created from the Insider database

    Original Insider article:
    Abbey White and Kalai Chik -- Reporting
    Joi-Marie McKenzie, Brea Cubit, Emma LeGault, and Megan Willett-Wei -- Editing
    Sawyer Click, Skye Gould, Taylor Tyson, and Joanna Lin Su -- Design and Development
    Chris Snyder, Jess Chou, A.C. Fowler, Kyle Desiderio, and Kuwilileni Hauwanga -- Video

    Inspiration

    I wanted to complete a Capstone that was personal to me.

  20. Data from: Representation of crowd accidents in popular media

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 26, 2023
    Cite
    Feliciani, Claudio; Corbetta, Alessandro; Haghani, Milad; Nishinari, Katsuhiro (2023). Representation of crowd accidents in popular media [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8347228
    Explore at:
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    The University of Tokyo
    Eindhoven University of Technology
    The University of New South Wales
    Authors
    Feliciani, Claudio; Corbetta, Alessandro; Haghani, Milad; Nishinari, Katsuhiro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains results related to the analysis of a corpus of news reports covering the topic of crowd accidents. To facilitate online visualization and offline analysis, the files are organized by assigning a number to each. The number system and the details of each set of files are described as follows:

    Class 0 – This contains the same files provided in this repository, but organized into folders to make analysis easier. If you intend to analyze the data from our lexical analysis, we suggest using this file, since it is better organized and can be downloaded directly. Please note that, due to a mistake when creating the newest version, the Wikipedia files were not included in this file, so they need to be downloaded separately. This will be fixed in the next version.

    Class 1 – This contains the sources and relevant information for people who are interested in replicating our dataset or accessing the news reports used in our analysis. Please note that, due to copyright regulations, the texts cannot be shared. However, you can refer to the links provided in these files to access the news articles and Wikipedia pages. Some links stopped working while we were conducting this study, and others may become unreachable in the future.

    Class 2 – This contains the results from a lexical analysis of the corpus. The HTML page allows you to visualize each result interactively through the online VOSviewer app (you need to download the file and open it in a browser, since Zenodo does not recognize it as a link). This service (the VOSviewer app) may be discontinued at some point in the future; PNG images of the lexical maps are therefore available for download through the ZIP archive, although they do not allow interactive access. If you plan to read our results using the offline VOSviewer software or perform a more systematic analysis, JSON files are available for each category (time period, geographical area of the reporting institution, and purpose of gathering). The same files can also be found in the ZIP archive in Class 0.
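    The per-category JSON files mentioned above follow the VOSviewer map layout, which, to our understanding, nests term nodes and co-occurrence links under a "network" key (check the VOSviewer manual for the authoritative schema). A minimal sketch of reading such a file, assuming that layout and an "Occurrences" weight, with the document embedded for self-containment:

    ```python
    import json

    # A minimal map in the assumed VOSviewer JSON layout; real files from
    # the dataset would be loaded with json.load(open(path)) instead.
    DOC = """
    {
      "network": {
        "items": [
          {"id": 1, "label": "crowd",    "weights": {"Occurrences": 42}},
          {"id": 2, "label": "accident", "weights": {"Occurrences": 17}}
        ],
        "links": [
          {"source_id": 1, "target_id": 2, "strength": 9}
        ]
      }
    }
    """

    data = json.loads(DOC)
    items = data["network"]["items"]
    links = data["network"]["links"]
    # Rank terms by occurrence count, most frequent first.
    ranked = sorted(items, key=lambda it: it["weights"]["Occurrences"], reverse=True)
    print([(it["label"], it["weights"]["Occurrences"]) for it in ranked])
    ```

    Ranking items by weight like this reproduces offline the node sizing that the interactive VOSviewer maps show visually.
    
    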

    Class 3 – These are the results of the sentiment analysis. For each report, a single result is generated for the title. However, for the body, the text is divided into parts, which are analyzed independently.
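    The split-and-score approach described for the body text can be sketched as follows. This is illustrative only: the chunking rule (fixed groups of sentences) and the toy lexicon scorer are assumptions for the example, not the sentiment model documented in the publication.

    ```python
    import re

    # Toy sentiment lexicon, purely for illustration.
    POSITIVE = {"calm", "safe", "orderly"}
    NEGATIVE = {"panic", "crush", "tragedy", "chaos"}

    def chunk_sentences(text, per_chunk=2):
        """Split text into sentences, then group them into fixed-size parts."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return [" ".join(sentences[i:i + per_chunk])
                for i in range(0, len(sentences), per_chunk)]

    def score(part):
        """Toy lexicon score: positive minus negative word hits."""
        words = re.findall(r"[a-z']+", part.lower())
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    body = ("The crowd stayed calm at first. Then panic spread near the gate. "
            "Rescuers described chaos and tragedy. The area is now safe.")
    # Each part is scored independently, as in the Class 3 files.
    parts = chunk_sentences(body)
    print([(part, score(part)) for part in parts])
    ```

    Scoring parts independently, rather than the whole body at once, is what lets sentiment shifts within a single report show up in the results.
    
    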

    Class 4 – These two files contain the Wikipedia corpus relative to 68 crowd accidents that occurred between 1990 and 2019. The texts for all accidents were scraped on October 15th, 2022 (before the tragedy in Itaewon) and on May 25th, 2023 (after the tragedy). Sources for the Wikipedia content are listed in the file contained in Class 1 ("1_list_wiki_report.csv"). More generally, accidents listed on the dedicated Wikipedia page https://en.wikipedia.org/wiki/List_of_fatal_crowd_crushes are reported in the corpus provided here (the period 1900-2019 is considered here).

    The format of the CSV and JSON files should be self-explanatory after reading our publication. For specific questions or queries, please contact one of the authors, and we will try to assist you.
