CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
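A minimal sketch of how the comment and annotation tables might be combined for model building; the file and column names are assumptions (they are not given in this description), so check the project wiki for the actual schema.

    import pandas as pd

    # Hypothetical file and column names; see the project wiki for the real schema.
    comments = pd.read_csv("toxicity_annotated_comments.tsv", sep="\t")
    annotations = pd.read_csv("toxicity_annotations.tsv", sep="\t")

    # Majority vote across the multiple annotators for each comment.
    labels = (
        annotations.groupby("rev_id")["toxicity"]
        .mean()
        .gt(0.5)
        .rename("is_toxic")
    )
    data = comments.join(labels, on="rev_id")
    print(data[["comment", "is_toxic"]].head())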
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset aggregates the 100 most popular Wikipedia articles by pageviews, enabling the tracking of trending topics on Wikipedia.
The data begins in 2016, and the textual data is presented as it appears on Wikipedia.
rank - Rank of the article (out of 100).
article - Title of the article.
views - Number of pageviews (across all platforms).
date - Date of the pageviews.
This dataset is updated daily with new data sourced from the WikiMedia API.
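For readers who want to reproduce a daily snapshot themselves, a small sketch of querying the Wikimedia pageviews API is shown below; the endpoint path and response fields reflect the public REST API as generally documented, so verify them against the current Wikimedia documentation before relying on them.

    import requests

    # Top English Wikipedia articles by pageviews for one day (yyyy/mm/dd).
    day = "2024/01/15"
    url = f"https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/{day}"
    resp = requests.get(url, headers={"User-Agent": "pageviews-example/0.1"}, timeout=30)
    resp.raise_for_status()

    for item in resp.json()["items"][0]["articles"][:10]:
        print(item["rank"], item["article"], item["views"])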
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project aims to show, with data, that it is necessary to analyze vernacular languages when dealing with events described using public sources such as Wikidata and Wikipedia. To retrieve and analyze events, it uses the wikivents Python package. The project directory includes the Jupyter notebook that processed (and/or generated) the contents of the dataset directory. Statistics from this analysis are located in the stats directory. The main statistics are reported in the associated paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Knowledge bases structured around IF-THEN rules, defined for inferring computational trust in the Wikipedia context.
License: https://choosealicense.com/licenses/cc/
Wikipedia Article Pageviews
This repository automatically fetches and aggregates the 100 most popular Wikipedia articles by pageviews, creating a dataset that enables tracking trending topics on Wikipedia. It works by polling the WikiMedia API on a daily basis and fetching the top 100 most popular articles from two days ago. The fetcher runs in a scheduled GitHub Actions workflow, which is available here. The dataset begins in the year 2016 and the textual data is presented as it… See the full description on the dataset page: https://huggingface.co/datasets/vtasca/wikipedia-pageviews.
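A minimal sketch of loading this dataset from the Hugging Face Hub with the datasets library; the split name and columns are assumptions, so inspect the dataset page for the actual schema.

    from datasets import load_dataset

    # Repo ID taken from the dataset page; the "train" split is an assumption.
    ds = load_dataset("vtasca/wikipedia-pageviews", split="train")
    print(ds.column_names)
    print(ds[0])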
Attribution-ShareAlike (CC BY-SA): http://www.opendefinition.org/licenses/cc-by-sa
The Teahouse corpus is a set of questions asked at the Wikipedia Teahouse, a peer support forum for new Wikipedia editors. This corpus contains data from its first two years of operation.
The Teahouse started as an editor engagement initiative and Fellowship project. It was launched in February 2012 by a small team working with the Wikimedia Foundation. Our intention was to pilot a new, scalable model for teaching Wikipedia newcomers the ropes of editing in a friendly and engaging environment.
The ultimate goal of the pilot project was to increase the retention of new Wikipedia editors (most of whom give up and leave within their first 24 hours post-registration) through early proactive outreach. The project was particularly focused on retaining female newcomers, who are woefully underrepresented among the regular contributors to the encyclopedia.
The Teahouse lives on as a vibrant, self-sustaining, community-driven project. All Teahouse participants are volunteers: no one is told when, how, or how much they must contribute.
See the README files associated with each datafile for a schema of the data fields in that file.
Read on for more info on potential applications, the provenance of these data, and links to related resources.
or, what is it good for?
The Teahouse corpus consists of good quality data and rich metadata around social Q&A interactions in a particular setting: new user help requests in a large, collaborative online community.
More generally, this corpus is a valuable resource for research on conversational dynamics in online, asynchronous discussions.
Qualitative textual analysis could yield insights into the kinds of issues faced by newcomers in established online collaborations.
Linguistic analysis could examine the impact of syntactic and semantic features related to politeness, sentiment, question framing, or other rhetorical strategies on discussion outcomes.
Response patterns (questioner replies and answers) within each thread could be used to map network relationships, or to investigate how participation by the thread's initiator, or the number of participants, relates to thread length or interactivity (the interval of time between posts).
The corpus is large and rich enough to provide both training and test data for machine learning applications.
Finally, the data provided here can be extended and compared with other publicly available Wikipedia datasets, allowing researchers to examine relationships between editors' participation within the Teahouse Q&A forum and their previous, concurrent, and subsequent editing activities within millions of other articles, meta-content, and discussion spaces on Wikipedia.
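As one concrete illustration of the response-pattern analyses sketched above, the following computes per-thread post counts, participant counts, and median inter-post gaps; the file and column names are hypothetical and would need to be mapped onto the schema documented in the per-file READMEs.

    import pandas as pd

    # Hypothetical file and column names; see the README for each data file.
    posts = pd.read_csv("teahouse_posts.csv", parse_dates=["timestamp"])

    per_thread = (
        posts.sort_values("timestamp")
        .groupby("thread_id")
        .agg(
            n_posts=("post_id", "count"),
            n_participants=("user_id", "nunique"),
            median_gap=("timestamp", lambda t: t.diff().median()),
        )
    )
    print(per_thread.head())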
or, how the research sausage was made
Parsing wikitext presents many challenges: the MediaWiki editing interface is deliberately underspecified in order to maximize flexibility for contributors. This can make it difficult to tell the difference between different types of contribution, say, fixing a typo versus answering a question.
The Teahouse Q&A board was designed to provide a more structured workflow than normal wiki talk pages, and instrumented to identify certain kinds of contributions (questions and answers) and isolate them from the 'noisy' background datastream of incidental edits to the Q&A page. The post-processing of the data presented here favored precision over recall: to provide a good quality set of questions, rather than a complete one.
In cases where it wasn't easy to identify whether an edit contained a question or answer, these data have not been included. However, it is hard to account for all ambiguous or invalid cases: caveat quaesitor!
Our approach to data inclusion was conservative. The number of questioner replies and answers to any given question may be under-counted, but is unlikely to be over-counted. However, our spot checks and analysis of the data suggest that the majority of responses are accounted for, and that "missed" responses are randomly distributed.
The Teahouse corpus only contains questions and answers by registered users of Wikipedia who were logged in when they participated. IP addresses can be linked to an individual's physical location. On Wikipedia, edits by logged-out and unregistered users are identified by the user's current IP address. Although all edits to Wikipedia are legally public and freely licensed, we have redacted IP edits from this dataset in deference to user privacy. Researchers interested in those data can find them in other public Wikipedia datasets.
Additional data about these Q&A interactions has been collected, and other data are retrievable. Examples of data that could be included in future revisions of the corpus at low cost include:
Examples of data that could be included in future revisions of the corpus at reasonable cost:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has aggressive tone. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Sample Historical Data from Wikipedia (PDF format)
This dataset provides some historical information for various countries, including Indonesia, Greece, Rome, France, Vietnam, Korea, Peru, England, Germany, Mexico, Iran, India, China, Egypt, and Japan. The data is sourced from Wikipedia and presented in a single PDF file for each country.
Source: Wikipedia
Content: Historical data for various countries
Format: Individual PDF files per country
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains 93,449 observations providing WikiProject mid-level category labels associated with the talk pages of the corresponding Wikipedia articles. Each observation includes a talk page title, talk page ID, the latest revision ID at the time of extraction, the associated WikiProject templates, and the mid-level WikiProject categories the corresponding article page belongs to. The dataset was generated by a Python script that ran MySQL queries on Wikimedia PAWS. To ensure a balanced set, the script extracts a random sample of 2,000 page IDs per mid-level category, totaling about 93,449 observations. This dataset opens up many possibilities for topic-oriented research around Wikipedia, as it exposes high-level topic data associated with Wikipedia pages.
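A quick sketch of verifying the per-category balance described above; the file and column names are assumptions and should be adjusted to the actual export.

    import pandas as pd

    # Hypothetical file and column names.
    obs = pd.read_csv("wikiproject_midlevel_labels.csv")

    # Check the roughly 2,000-observations-per-mid-level-category balance.
    counts = obs["mid_level_category"].value_counts()
    print(counts.head())
    print("min:", counts.min(), "max:", counts.max())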
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset shows values taken from biography articles that appeared in the "From today's featured article", "Did you know..." and "On this day" sections of the front page of the English edition of Wikipedia between 2013 and 2024. The values contained in this dataset were obtained by crossing Wikidata properties with the unique identifiers of the articles. These data provide information about the people described in the articles, such as gender, ethnicity, sexual orientation, and native language, among other properties, so that an analysis can be made, from an intersectional perspective, of the representation of diversity in Wikipedia. The document Joint-data contains all the joint data without distinguishing by the gender of the person whose biography is featured, while the other documents divide the information by the gender of the people in the articles: "Women" for the data of cisgender women, "Men" for the data of cisgender men, and "Dissident" for the data of people whose gender differs from the one they were assigned at birth. There are therefore four documents: Joint-data; Dissident_Gender-categorized-data; Men_Gender-categorized-data; Women_Gender-categorized-data. In each document, odd columns state the Wikidata properties analyzed and even columns give the number of results for each value of the property, that is, the occurrences of each value.
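A small sketch of reading the odd/even column layout described above, pairing each property column with its adjacent count column; the file name and extension are assumptions.

    import pandas as pd

    # Hypothetical file name; the odd/even layout follows the description above.
    df = pd.read_csv("Joint-data.csv")

    # Pair each property column (odd positions) with its count column (even positions).
    for prop_col, count_col in zip(df.columns[0::2], df.columns[1::2]):
        print(prop_col)
        print(df[[prop_col, count_col]].dropna().head())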
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Out of pure interest, I analyzed the clustering of voters by their votes and drew dendrograms, which many participants found interesting.
In the source, votes are given without a date, so users who changed their vote during an arbitrator election (both votes are kept in the source) and thus appear in both the "for" and "against" sections are counted as voting against.
Many thanks to MBH, who collected the data. You can visit his tool (in Russian).
It is not uncommon on Wikipedia for participants to create virtual accounts and, in violation of the rules, participate in elections with several accounts simultaneously. Some such cases have been identified; you can help identify others.
In 2021 the Russian Wikipedia elections were attacked by a group of conspirators. Can you spot them directly from the data presented?
For a deeper analysis, each election is described: the year it took place, who the candidate was, and what type of vote it was (for an arbitrator, administrator, or bureaucrat). Based on these data, it is possible to identify candidates whose views are likely to coincide.
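A minimal sketch of the kind of hierarchical clustering used to draw the dendrograms; the vote-matrix file and its encoding are assumptions, not part of the published data description.

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Hypothetical layout: one row per voter, one column per election,
    # values +1 (for), -1 (against), 0 (did not vote).
    votes = pd.read_csv("votes_matrix.csv", index_col="voter")

    # Hierarchical clustering of voters by the similarity of their voting records.
    Z = linkage(votes.values, method="ward", metric="euclidean")
    dendrogram(Z, labels=votes.index.tolist(), leaf_rotation=90)
    plt.tight_layout()
    plt.show()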
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf. This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).
The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl.
nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008 (http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf).
popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper based on different strategies for sampling articles from Wikipedia in order to account for Wikipedia's biased distribution (see that paper). scheme.tsv maps these fine-grained labels to coarser annotations, including CoNLL 2003-style labels.
wikigold.conll.txt is a manual NER annotation of some Wikipedia text, as presented in Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).
See also corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The following dataset contains two folders with different data, which include: The data set in the folder named "Gender" provides the gender distribution of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. For the Spanish edition, data has been collected from the "Artículos buenos" and "Artículos destacados" sections and is displayed in an aggregated format. The data set in the folder named "Intersectionality" provides the distribution, based on various sociodemographic attributes, of individuals who have been featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. It is structured into four CSV files. Three of these correspond to the English Wikipedia edition: the English 3C CSV, containing data from the sections "Did you know...", "In the news", and "On this day..."; a CSV dedicated to "English Featured Article"; and another to "English Featured Picture". The fourth CSV contains data from the Spanish edition of Wikipedia, extracted from the sections "Artículo Destacado" and "Artículo Bueno". Within each CSV, the data is presented in columns, each dedicated to a sociodemographic attribute.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the figure and dataset inventory for the article: The detection of emerging trends using Wikipedia traffic data and context networks
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset described in the paper "Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables".
It is a dataset of tables extracted from Wikipedia pages and annotated with DBpedia entity types. The file Wiki_TabNER_final_labeled.json contains the annotated tables. It can be used for NER within tables and for the entity linking task. The file dataset_entities_labeled_linked.csv contains all the linked entities that are mentioned in the tables and their corresponding Wikipedia IDs. More information on the creation of the dataset and instructions on how to use it are available in the GitHub repository for the paper.
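A brief sketch of opening the two files named above; their internal structure is not documented here, so inspect the loaded objects before relying on specific keys or columns.

    import json
    import pandas as pd

    # File names come from the dataset description; everything else is an assumption.
    with open("Wiki_TabNER_final_labeled.json") as f:
        tables = json.load(f)
    print(type(tables), len(tables))

    entities = pd.read_csv("dataset_entities_labeled_linked.csv")
    print(entities.columns.tolist())
    print(entities.head())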
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
In 2018 the IPERION-CH Grounds Database was presented to examine how the data produced through the scientific examination of historic painting preparation or grounds samples from multiple institutions could be combined in a flexible digital form, exploring the presentation of interrelated high-resolution images, text, complex metadata, and procedural documentation. The original main user interface is live, though password-protected at this time. Work within the SSHOC project aimed to reformat the data to create a more FAIR dataset: in addition to mapping it to a standard ontology to increase interoperability, it has also been made available as open linkable data with a SPARQL endpoint. A draft version of this live data presentation can be found here.
This is a draft dataset and further work is planned to debug and improve its semantic structure. This deposit contains the CIDOC-CRM-mapped data formatted in XML and an example model diagram representing some of the key relationships covered in the dataset.
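A rough sketch of querying a SPARQL endpoint of this kind over HTTP; the endpoint URL is hypothetical (the deposit only states that an endpoint exists), and the query simply lists the most common classes rather than assuming a particular CIDOC-CRM modeling pattern.

    import requests

    # Hypothetical endpoint URL; replace with the actual SPARQL service.
    ENDPOINT = "https://example.org/sparql"
    QUERY = """
    SELECT ?type (COUNT(?s) AS ?n)
    WHERE { ?s a ?type }
    GROUP BY ?type
    ORDER BY DESC(?n)
    LIMIT 20
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["n"]["value"], row["type"]["value"])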
Presentation Of The City Bikes Program
This dataset falls under the category Individual Transport Other.
It contains the following data: Presentation of the bicycle city project
This dataset was scouted on 2022-02-14 as part of a data sourcing project conducted by TUMI. License information might be outdated: Check original source for current licensing.
The data can be accessed using the following URL / API endpoint (for data access and license information): http://www.planmob.salvador.ba.gov.br/index.php/13-estudos-projetos-e-programas?ml=1 Please note: this link leads to an external resource. If you experience any issues with its availability, please try again later.
Bahia State Government Mobility Program
This dataset falls under the category Public Transport Other.
It contains the following data: Presentation of the Mobility Program of the State of Bahia
This dataset was scouted on 2022-02-14 as part of a data sourcing project conducted by TUMI. License information might be outdated: Check original source for current licensing.
The data can be accessed using the following URL / API endpoint (for data access and license information): http://www.planmob.salvador.ba.gov.br/index.php/13-estudos-projetos-e-programas?ml=1 Please note: this link leads to an external resource. If you experience any issues with its availability, please try again later.
While looking for a Capstone Project for the Google Data Analytics Program, I came across a dataset compiled by Bradd Carey (LGBTQ Characters in Youth Cartoons). That dataset was specific to data parsed from an Insider.com article published in June 2021. I decided I wanted to expand it to include characters from any animated show, regardless of target audience.
I initially scraped information regarding LGBTQ characters from the following Wikipedia pages:
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters#1990s
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_1990%E2%80%931994
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_1995%E2%80%931999
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2000%E2%80%932004
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2005%E2%80%932009
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2010%E2%80%932014
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2015%E2%80%932019#2018
https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2020%E2%80%93present
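These list pages generally store their entries in wikitables, so a small scraping sketch with pandas is shown below; pandas.read_html needs an HTML parser such as lxml installed, and table layouts vary by page, so the results should be inspected before concatenating.

    import pandas as pd

    # One of the list pages cited above; layouts differ between pages.
    url = "https://en.wikipedia.org/wiki/List_of_animated_series_with_LGBT_characters:_2020%E2%80%93present"
    tables = pd.read_html(url)
    print(len(tables))
    print(tables[0].head())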
I removed data on disability representation (for now) to narrow my project, as that information was not included on the Wikipedia pages and the Insider dataset was specific to youth cartoons.
I removed studio information.
If there was a difference between IMDB and Wikipedia for seasons, number of episodes, or start and end dates, I went with what was on IMDB.
Removed shows that did not have an IMDB or Wikipedia page.
Removed characters that appeared in spin-off shows and only included them under the first show in which they appeared.
Split data into two separate datasets for ease of queries: general show information and specific character information.
Bradd Carey for the dataset he created from the Insider database
Original Insider Article: Abbey White and Kalai Chik (reporting); Joi-Marie McKenzie, Brea Cubit, Emma LeGault, and Megan Willett-Wei (editing); Sawyer Click, Skye Gould, Taylor Tyson, and Joanna Lin Su (design and development); Chris Snyder, Jess Chou, A.C. Fowler, Kyle Desiderio, and Kuwilileni Hauwanga (video)
I wanted to complete a Capstone that was personal to me
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains results related to the analysis of a corpus of news reports covering the topic of crowd accidents. To facilitate online visualization and offline analysis, the files are organized by assigning a number to each. The number system and the details of each set of files are described as follows:
Class 0 – This contains the same files provided in this repository, but organized into folders to make analysis easier. If you intend to analyze the data from our lexical analysis, we suggest using this file, since it is better organized and can be downloaded directly. Please note that, due to a mistake when creating the newest version, the Wikipedia files were not included in this file, so they need to be downloaded separately. This will be fixed in the next version.
Class 1 – This contains the sources and relevant information for people who are interested in replicating our dataset or accessing the news reports used in our analysis. Please note that due to copyright regulations, the texts cannot be shared. However, you can refer to the links provided in these files to access the news articles and Wikipedia pages. Some links have stopped working during the time we were working on this study, and others may be unreachable in the future.
Class 2 – This contains the results from a lexical analysis of the corpus. The HTML page allows you to visualize each result interactively through the online VOSviewer app (you need to download the file and open it in a browser, since Zenodo does not recognize it as a link). This service (the VOSviewer app) may be discontinued at some point in the future. PNG images of the lexical maps are therefore available for download through the ZIP archive, although they do not allow interactive access. If you plan to read our results using the offline VOSviewer software or perform a more systematic analysis, JSON files are available for each category (time period, geographical area of the reporting institution, and purpose of gathering). The same files can also be found in the ZIP archive in Class 0.
Class 3 – These are the results of the sentiment analysis. For each report, a single result is generated for the title. However, for the body, the text is divided into parts, which are analyzed independently.
Class 4 – These two files contain the Wikipedia corpus relating to 68 crowd accidents which occurred between 1990 and 2019. The text for all accidents was scraped on October 15th, 2022 (before the tragedy in Itaewon) and on May 25th, 2023 (after the tragedy). Sources for the Wikipedia content are listed in the file contained in Class 1 ("1_list_wiki_report.csv"). More generally, accidents listed on the dedicated Wikipedia page https://en.wikipedia.org/wiki/List_of_fatal_crowd_crushes are reported in the corpus provided here (the period 1900-2019 is considered here).
The format of CSV and JSON files should be self-explanatory after reading our publication. For specific questions or queries, please contact one of the authors, and we will try to assist you.
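As a rough, hedged illustration of the Class 3 approach (one score for the title, independent scores for parts of the body), the sketch below uses NLTK's VADER analyzer on made-up text; the actual sentiment model and text segmentation used in the publication may differ.

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    # Illustration only: the model and segmentation used for Class 3 may differ.
    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    title = "Dozens injured in crowd crush at festival"  # hypothetical report title
    body_parts = [
        "Witnesses described a sudden surge near the main gate.",
        "Organizers said safety procedures would be reviewed.",
    ]

    print("title:", sia.polarity_scores(title))
    for i, part in enumerate(body_parts):
        print(f"body part {i}:", sia.polarity_scores(part))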