This dataset contains the data and code necessary to replicate work in the following paper: Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. "The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users." In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307

The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These are the data used in the field experiment described in Study 2.

Description of Files

This dataset contains the following files beyond this README:

- twa.RData — An RData file that includes all variables used in Study 2.
- twa_analysis.R — A GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.

The RData file contains one variable (d), an R dataframe (i.e., table) that includes the following columns:

- userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and refer to public accounts on Wikipedia.
- sample.date (date string): The day the user was recruited to the study, formatted as "YYYY-MM-DD". For invitees, it is the date their invitation was sent. For users in the control group, it is the date they would have been invited to the study.
- edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page, and subpages are ignored.
- edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
- edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page, and subpages are ignored.
- treat (logical): TRUE if the user was invited; FALSE if the user was in the control group.
- play (logical): TRUE if the user played the game; FALSE if the user did not. All users in control are listed as FALSE because any user who played without having been invited to the game was removed.
- twa.level (integer): Takes a value of 0 if the user did not play the game. Ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
- quality.score (float): The average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revision per word) is a measure of edit quality developed by Halfaker et al. that tracks how long the words in an edit persist after subsequent revisions are made to the wiki page.

For more information on how word persistence is calculated, see the following paper: Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. "A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia." In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence

How we created twa.RData

The file twa.RData combines datasets drawn from three places:

A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invitation to play The Wikipedia Adventure.

Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk, and quality.score. We first extracted all edits made by users in our sample during the six-month period after they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python-based wikiq and MediaWikiUtilities software, available online at:
http://projects.mako.cc/source/?p=mediawiki_dump_tools
https://github.com/mediawiki-utilities/python-mediawiki-utilities

We o...

Visit https://dataone.org/datasets/sha256%3Ab1240bda398e8fa311ac15dbcc04880333d5f3fbe67a7a951786da2d44e33018 for complete metadata about this dataset.
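To get a quick look at d outside of R, here is a minimal sketch in Python; it assumes the pyreadr package is available and simply compares mean outcomes between invited and control users (the paper's own analysis lives in the R script twa_analysis.R):

```python
import pyreadr

# Load the RData file; pyreadr returns a dict of pandas DataFrames keyed by variable name.
d = pyreadr.read_r("twa.RData")["d"]

# Mean outcomes for invited (treat == TRUE) vs. control (treat == FALSE) users.
outcomes = ["edits.all", "edits.ns0", "edits.talk", "quality.score"]
print(d.groupby("treat")[outcomes].mean())
```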
A team's mean season statistics can be used as predictors for their performance in future games. However, these statistics gain additional meaning when placed in the context of their opponents' (and opponents' opponents') performance. This dataset provides that context for each team. Furthermore, predicting games based on post-season stats causes data leakage, which in my experience can be significant in this context (a 15-20% loss in accuracy). This dataset therefore provides each of these statistics prior to each game of the regular season, preventing this source of data leakage.
All data is derived from the March Madness competition data. Each original column was renamed from "W" and "L" to "A" and "B", and the rows were mirrored to represent both orderings of opponents. Each team's mean stats are computed (both their own stats, and the mean "allowed" or "forced" statistics of their opponents). To compute the mean opponents' stats, we analyze the games played by each opponent (excluding games played against the team in question) and compute the mean statistics for those games. We then compute the mean of these mean statistics, weighted by the number of times the team in question played each opponent. The opponents' opponents' stats are computed as a weighted average of the opponents' averages. This results in statistics similar to those used to compute strength of schedule or RPI, except that they go beyond win percentages (see: https://en.wikipedia.org/wiki/Rating_percentage_index).
The per-game statistics are computed as if none of the data on or after the day in question were available.
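To make the procedure above concrete, here is a minimal pandas sketch of the opponent-weighted average for one team and one cutoff date. The column names (team, opponent, date, plus the numeric stat columns) are placeholders, not the dataset's actual schema:

```python
import pandas as pd

def opponent_avg(games: pd.DataFrame, team: str, cutoff) -> pd.Series:
    """Weighted mean of opponents' per-game averages, excluding games the
    opponents played against `team`, using only games before `cutoff`.
    Column names (team, opponent, date, numeric stats) are assumptions."""
    past = games[games["date"] < cutoff]                 # no leakage from future games
    played = past[past["team"] == team]                  # games the team in question played
    stat_cols = [c for c in past.columns if c not in ("team", "opponent", "date")]

    # How often each opponent was faced becomes the weight.
    weights = played["opponent"].value_counts()

    rows = []
    for opp, w in weights.items():
        # The opponent's games, excluding those against the team in question.
        opp_games = past[(past["team"] == opp) & (past["opponent"] != team)]
        if len(opp_games):
            rows.append((opp_games[stat_cols].mean(), w))

    if not rows:
        return pd.Series(0.0, index=stat_cols)
    total = sum(w for _, w in rows)
    return sum(avg * w for avg, w in rows) / total
```

The opponents' opponents' averages are then just another weighted pass of the same idea over each opponent's own schedule.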
Currently, the data isn't computed particularly efficiently. Computing the per-game averages for every day of the season is necessary to compute fully accurate opponents' opponents' averages, but takes about 90 minutes. It is probably possible to parallelize this, and the per-game averages involve a lot of repeated computation (essentially recomputing the final averages from scratch for each day). Speeding this up would make it more convenient to make changes to the dataset.
I would like to transform these statistics to be per-possession, and to add shooting percentages, pace, and the number of games played (to give an idea of the amount of uncertainty in the per-game averages). Some of these can be approximated from the given data (though the results won't be exact), while others will need to be computed from scratch.
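As an illustration of the per-possession idea, a common estimate in basketball analytics is possessions ≈ FGA − OREB + TOV + 0.475 × FTA (the free-throw coefficient varies between roughly 0.44 and 0.475 depending on the convention). A rough sketch of the conversion follows; the column names FGA, OR, TO, and FTA are assumptions about this dataset's naming:

```python
import pandas as pd

def per_possession(df: pd.DataFrame, stat_cols) -> pd.DataFrame:
    """Convert per-game averages to per-possession averages.
    The possessions formula and the column names FGA, OR, TO, FTA are
    conventional assumptions, not guaranteed by this dataset."""
    poss = df["FGA"] - df["OR"] + df["TO"] + 0.475 * df["FTA"]
    out = df.copy()
    out[stat_cols] = df[stat_cols].div(poss, axis=0)
    out["poss_est"] = poss
    return out
```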
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘List of Top Data Breaches (2004 - 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hishaamarmghan/list-of-top-data-breaches-2004-2021 on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a dataset containing all the major data breaches in the world from 2004 to 2021
As we know, there is a big issue surrounding the privacy of our data. Many major companies around the world still face this issue every single day, and even with great teams working on their security, many still suffer breaches. To tackle this situation, we must study the issue in depth, so I pulled this data from Wikipedia to conduct data analysis. I would encourage others to take a look at it as well and find as many insights as possible.
This data contains 5 columns:
1. Entity: The name of the company, organization or institute
2. Year: The year in which the data breach took place
3. Records: How many records were compromised (can include information like emails, passwords etc.)
4. Organization type: Which sector the organization belongs to
5. Method: Was it hacked? Were the files lost? Was it an inside job?
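As a starting point for the kind of analysis suggested above, a minimal pandas sketch could count breaches per year and per method (the CSV file name below is a placeholder; use the file shipped with the dataset):

```python
import pandas as pd

# "data_breaches.csv" is a placeholder file name for the dataset's CSV.
df = pd.read_csv("data_breaches.csv")

# Breaches per year and the most common breach methods,
# using the documented Year and Method columns.
per_year = df.groupby("Year").size().sort_index()
by_method = df["Method"].value_counts()

print(per_year.tail(10))
print(by_method.head(10))
```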
Here is the source for the dataset: https://en.wikipedia.org/wiki/List_of_data_breaches
Here is the GitHub link for a guide on how it was scraped: https://github.com/hishaamarmghan/Data-Breaches-Scraping-Cleaning
--- Original source retains full ownership of the source dataset ---
This is an overview of the soundscape recording datasets that have been contributed to the Global Soundscapes Project, as well as associated meta-data. The audio recording criteria justifying inclusion into the current meta-dataset are:

- Stationary (no towed sensors or microphones mounted on cars)
- Passive (no human disturbance by the recordist)
- Ambient (no focus on a particular species or direction)
- Recorded over multiple sites of a region and/or days

The individual columns are described as follows.

General:
- ID: primary key
- name: name of the dataset
- subset: incremental integer that can be used to distinguish sub-datasets
- collaborators: full names of people deemed responsible for the dataset, separated by commas
- date_added: when the dataset was added

Space:
- realm_IUCN: realm from IUCN Global Ecosystem Typology (v2.0) (https://global-ecosystems.org/)
- medium: the physical medium the microphone is situated in
- GADM0: for terrestrial locations, Database of Global Administrative Areas level 0 unit as per https://gadm.org/
- GADM1: for terrestrial locations, Database of Global Administrative Areas level 1 unit as per https://gadm.org/
- GADM2: for terrestrial locations, Database of Global Administrative Areas level 2 unit as per https://gadm.org/
- IHO: International Hydrographic Organisation sea area as per https://iho.int/
- latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees
- longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees
- topography_min_m: minimum elevation of sites from sea level
- topography_max_m: maximum elevation of sites from sea level
- ground_distance_m: vertical distance of microphone from land ground or ocean floor
- freshwater_depth_m: vertical distance from water surface for freshwater datasets
- sites_number: number of sites sampled

Time:
- days_number_per_site: typical number of days sampled per site (or minimum if too variable)
- day: whether the sites were sampled during daytime
- night: whether the sites were sampled during nighttime
- twilight: whether the sites were sampled during twilight
- warm_season: whether the warm season was sampled. Only outside tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
- cold_season: whether the cold season was sampled. Only outside tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
- dry_season: whether the dry season was sampled. Only for tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
- wet_season: whether the wet season was sampled. Only for tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
- year_start: starting year of the sampling
- year_end: ending year of the sampling
- schedule: description of the sampling schedule, free text
- recording_selection: criteria used to temporally select recordings (e.g., discarded rainy days)

Audio:
- high_pass_filter_Hz: lower frequency of the high-pass filter
- sampling_frequency_kHz: frequency the microphone was sampled at
- audio_bit_depth: bit depth used for encoding audio
- recorder_model: recorder model used
- microphone: microphone used
- recordist_position: position of the recordist relative to the microphone during sampling

Others:
- comments: free-text field
- URL_project: internet link for further information
- URL_publication: internet link of the corresponding publication

More information on the project can be found here: https://ecosound-web.uni-goettingen.de/ecosound_web/project/gsp

adding IHO data
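As a small usage example, assuming the meta-dataset is distributed as a CSV (the file name below is a placeholder), the documented columns can be filtered directly with pandas:

```python
import pandas as pd

# The file name is a placeholder; the meta-dataset's distribution format is not stated above.
meta = pd.read_csv("gsp_metadata.csv")

# Example: datasets whose sampling overlaps 2015-2020, sorted by number of sites,
# using the documented columns year_start, year_end, and sites_number.
recent = meta[(meta["year_end"] >= 2015) & (meta["year_start"] <= 2020)]
print(recent.sort_values("sites_number", ascending=False)[
    ["ID", "name", "sites_number", "year_start", "year_end"]
])
```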
Unfortunately, the API this dataset used to pull the stock data isn't free anymore. Instead of having this auto-update, I dropped the last version of the data files in here, so at least the historic data is still usable.
This dataset provides free end of day data for all stocks currently in the Dow Jones Industrial Average. For each of the 30 components of the index, there is one CSV file named by the stock's symbol (e.g. AAPL for Apple). Each file provides historically adjusted market-wide data (daily, max. 5 years back). See here for description of the columns: https://iextrading.com/developer/docs/#chart
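As a quick usage sketch, each file can be loaded by ticker symbol; the date and close column names follow the IEX chart documentation linked above, but treat them as assumptions if the files differ:

```python
import pandas as pd

# Load one component's file; each CSV is named after its ticker symbol (e.g. AAPL.csv).
aapl = pd.read_csv("AAPL.csv", parse_dates=["date"]).set_index("date")

# Daily returns from the closing price; "close" is assumed to be present
# per the IEX chart endpoint documentation linked above.
aapl["return"] = aapl["close"].pct_change()
print(aapl["return"].describe())
```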
Since this dataset uses remote URLs as files, it is automatically updated daily by the Kaggle platform and automatically represents the latest data.
List of stocks and symbols as per https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average
Thanks to https://iextrading.com for providing this data for free!
Data provided for free by IEX. View IEX’s Terms of Use.
https://creativecommons.org/publicdomain/zero/1.0/
Kaggle has fixed the issue with gzip files and Version 510 should now reflect properly working files
Please use version 508 of the dataset, as 509 is broken. See the link below for the version of the dataset that is working properly: https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows/versions/508
The context and history of the current ongoing conflict can be found https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine.
[Jun 16] (🌇Sunset) Twitter has finally pulled the plug on all of my remaining TWITTER API accounts as part of their push for developers to migrate to the new API. The last tweets that I pulled were dated Jun 14, and there is no more data from Jun 15 onwards. It was fun while it lasted, and I hope that this dataset has helped and will continue to help a lot. I'll just leave the dataset here for future download and reference. Thank you all!
[Apr 19] Two additional developer accounts have been permanently suspended; expect a lower throughput in the next few weeks. I will pull data till they ban my last account.
[Apr 08] I woke up this morning and saw that Twitter has banned/permanently suspended 4 of my developer accounts. I have a few more, but it is just a matter of time till all my accounts most likely get banned as well. This has been a fun project that I have maintained for as long as I can. I will pull data till my last account gets banned.
[Feb 26] I've started to pull in RETWEETS again, so I am expecting a significant increase in tweet throughput again on top of the dedicated processes that I have that get NON-RETWEETS. If you don't want RETWEETS, just filter them out.
[Feb 24] It's been a year since I started getting tweets of this conflict, and I had no idea that a year later this would still be ongoing. Almost everyone assumed that Ukraine would crumble in a matter of days, but that is not the case. To those who have been using my dataset, I hope that I am helping all of you in one way or another. I'll do my best to keep updating this dataset for as long as I can.
[Feb 02] I seem to be getting fewer tweets as my crawlers are getting throttled; I used to get 2500 tweets per 15 mins, but around 2-3 of my crawlers are getting throttling limit errors. There may have been some kind of update that Twitter has made to rate limits or something similar. I will try to find ways to increase the throughput again.
[Jan 02] All new dataset files will now be prefixed by the year, so for Jan 01, 2023, it will be 20230101_XXXX.
[Dec 28] For those looking for a cleaned version of my dataset, with the retweets from before Aug 08 removed, here is a dataset by @vbmokin: https://www.kaggle.com/datasets/vbmokin/russian-invasion-ukraine-without-retweets
[Nov 19] I noticed that one of my developer accounts, which ISN'T TWEETING ANYTHING and is just pulling data out of Twitter, has been permanently banned by Twitter.com, hence the decrease in unique tweets. I will try to come up with a solution to increase my throughput and sign up for a new developer account.
[Oct 19] I just noticed that this dataset is finally "GOLD", after roughly seven months since I first uploaded my gzipped csv files.
[Oct 11] Sudden spike in the number of tweets revolving around the most recent developments: the Kerch Bridge explosion and the response from Russia.
[Aug 19 - IMPORTANT] I raised the missing dataset issue with the Kaggle team and they confirmed it was a bug introduced by a ReactJS upgrade; the conversation and details can be seen here: https://www.kaggle.com/discussions/product-feedback/345915 . It has already been fixed, and I've re-uploaded all the gzipped files that were lost PLUS the new files that were generated AFTER the issue was identified.
[Aug 17] It seems the latest version of my dataset lost around 100+ files. Good thing this dataset is versioned, so one can just go back to the previous version(s) and download them. Version 188 HAS ALL THE LOST FILES. I won't be re-uploading all the files, as it would be tedious; I've already deleted them locally and I only store the latest 2-3 days.
[Aug 10] 3/5 of my Python processes errored out, resulting in around 10-12 hours of NO data gathering for those processes, hence the sharp decrease in tweets for the Aug 09 dataset. I've added exception/error handling to prevent this from happening again.
[Aug 09] Significant drop in tweets extracted, but I am now getting ORIGINAL/ NON-RETWEETS.
[Aug 08] I've noticed that I had a spike of Tweets extracted, but they are literally thousands of retweets of a single original tweet. I also noticed that my crawlers seem to deviate because of this tactic being used by some Twitter users where they flood Twitter w...
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The geography of India is extremely diverse: from snowy mountains in the north to coastal plains in the south, it contains dense rain forests and the Thar Desert. In addition, India is the second most populated country in the world (about 1.3 billion people). Such diversity brings a lot of different natural disasters, from floods and earthquakes to hurricanes and cyclones. Not to mention outbreaks of disease that spread quickly due to the dense population.
This dataset contains all the disasters that happened in India from 1990 to 2021 with their information.
The dataset has been acquired from Wikipedia. The text was extracted from the Wikipedia articles, then cleaned, processed, and sorted by date.
The dataset contains the following columns:
* Title: Title of the disaster
* Duration: Includes day and month, as well as intervals for some disasters that lasted more than a day
* Year: Year of the disaster
* Disaster_Info: Information about the disaster (contains the long and short text describing the disaster)
* Date: Date in a specific format (for disasters that lasted more than a day, we have used the first day of the disaster)
More specific information can be extracted from the text using natural language processing techniques.
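For example, here is a minimal sketch that pulls simple keyword signals out of the Disaster_Info column (the CSV file name is a placeholder; a real analysis would use proper NLP, as suggested above):

```python
import pandas as pd

# "india_disasters.csv" is a placeholder file name for the dataset's CSV.
df = pd.read_csv("india_disasters.csv")

# Crude keyword tagging of each disaster description; tokenization, NER, etc.
# would be the next step for more specific extraction.
keywords = ["flood", "earthquake", "cyclone", "drought", "landslide"]
for kw in keywords:
    df[kw] = df["Disaster_Info"].str.contains(kw, case=False, na=False)

print(df.groupby("Year")[keywords].sum().tail())
```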
This dataset has been created using Wikipedia articles: Ref
Study and understand the disasters in India using Natural language processing (NLP) and Natural language understanding (NLU) techniques
The Eurovision Song Contest is an annual music competition that began in 1956. It is one of the longest-running television programmes in the world and is watched by millions of people every year. The contest's winner is determined using numerous voting techniques, including points awarded by juries or televoters.
Since 2004, the contest has included a televised semi-final:
- In 2004, held on the Wednesday before the final
- Between 2005 and 2007, held on the Thursday of Eurovision Week
- Since 2008, the contest has included two semi-finals, held on the Tuesday and Thursday before the final.
The Eurovision Song Contest is a truly global event, with countries from all over Europe (and beyond) competing for the coveted prize. Over the years, some truly amazing performers have taken to the stage, entertaining audiences with their catchy songs and stunning stage performances.
So who will be crowned this year's winner? Tune in to find out!
This dataset contains information on all of the winners of the Eurovision Song Contest from 1956 to the present day. The data includes the year that the contest was held, the city that hosted it, the winning song and performer, the margin of points between the winning song and runner-up, and the runner-up country.
This dataset can be used to study patterns in Eurovision voting over time, or to compare different winning songs and performers. It could also be used to study how hosting the contest affects a country's chances of winning
- In order to study Eurovision Song Contest winners, one could use this dataset to train a machine learning model to predict the winner of the contest given a set of features about the song and the performers.
- This dataset could be used to study how different voting methods (e.g. jury vs televoters) impact the outcome of the Eurovision Song Contest.
- This dataset could be used to study trends in music over time by looking at how the style of winning songs has changed since the contest began in 1956.
Data from eurovision_winners.csv was scraped from Wikipedia on April 4, 2020.
The dataset eurovision_winners.csv contains a list of all the winners of the Eurovision Song Contest from 1956 to the present day
License
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - Provide a link to the license, and indicate if changes were made.
  - ShareAlike - You must distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.
File: eurovision_winners.csv

| Column name | Description |
|:------------|:------------|
| Year | The year in which the contest was held. (Integer) |
| Date | The date on which the contest was held. (String) |
| Host City | The city in which the contest was held. (String) |
| Winner | The country that won the contest. (String) |
| Song | The song that won the contest. (String) |
| Performer | The performer of the winning song. (String) |
| Points | The number of points that the winning song received. (Integer) |
| Margin | The margin of victory (in points) between the winning song and the runner-up song. (Integer) |
| Runner-up | The country that placed second in the contest. (String) |
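As a quick start for the analyses suggested above, a minimal pandas sketch using the columns documented in the table:

```python
import pandas as pd

winners = pd.read_csv("eurovision_winners.csv")

# Countries with the most wins, and how victory margins have evolved by decade.
wins_by_country = winners["Winner"].value_counts()
winners["decade"] = (winners["Year"] // 10) * 10
margin_by_decade = winners.groupby("decade")["Margin"].mean()

print(wins_by_country.head(10))
print(margin_by_decade)
```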
https://creativecommons.org/publicdomain/zero/1.0/
BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.
gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.
Fork this kernel to get started.
Data Source: https://cloud.google.com/bigquery/sample-tables
Banner photo by Mervyn Chan from Unsplash.
How many babies were born in New York City on Christmas Day?
How many words are in the play Hamlet?
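For example, the Hamlet question can be answered with a short query against the shakespeare sample table. Here is a sketch using the Python BigQuery client; the corpus label 'hamlet' is an assumption about how the table names that work:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Total number of words in Hamlet, per the shakespeare sample table.
# The corpus value "hamlet" is an assumption about the table's labeling.
query = """
    SELECT SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    WHERE corpus = 'hamlet'
"""
for row in client.query(query).result():
    print(row.total_words)
```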
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is in celebration of Juneteenth, the latest US federal holiday!
[Juneteenth National Independence Day] Commemorates the emancipation of enslaved people in the United States on the anniversary of the 1865 date when emancipation was announced in Galveston, Texas. Celebratory traditions often include readings of the Emancipation Proclamation, singing traditional songs, rodeos, street fairs, family reunions, cookouts, park parties, historical reenactments, and Miss Juneteenth contests.
Juneteenth became a federal holiday in the United States on June 17, 2021. To commemorate this newest U.S. Federal Holiday, we're exploring the Wikipedia page about Federal holidays in the United States.
Which days of the week do federal holidays fall on this year? What is the longest gap between holidays this year? Is it different in other years?
federal_holidays.csv
variable | class | description |
---|---|---|
date | character | The month and day or days when the holiday is celebrated. |
date_definition | character | Whether the date is a "fixed date" or follows some other pattern. |
official_name | character | The official name of the holiday. |
year_established | numeric | The year in which the holiday was officially established as a federal holiday. |
date_established | Date | The date on which the holiday was officially established as a federal holiday, if known. |
details | character | Additional details about the holiday, from the Wikipedia article. |
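As a sketch of the first question above (which weekdays the holidays fall on in a given year), using the documented date and date_definition columns; this only handles fixed-date holidays and assumes those dates look like "January 1", which is an assumption about the file's formatting:

```python
import pandas as pd

holidays = pd.read_csv("federal_holidays.csv")

# Only fixed-date holidays map to a weekday directly; floating holidays
# ("third Monday in ...", etc.) need their own rules.
fixed = holidays[holidays["date_definition"].str.contains("fixed", case=False, na=False)].copy()

YEAR = 2024
# Assumes fixed dates are formatted like "January 1"; adjust if the file differs.
fixed["weekday"] = pd.to_datetime(fixed["date"] + f", {YEAR}", errors="coerce").dt.day_name()
print(fixed[["official_name", "date", "weekday"]])
```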
proposed_federal_holidays.csv
variable | class | description |
---|---|---|
date | character | The month and day or days when the holiday would be celebrated. |
date_definition | character | Whether the date is a "fixed date" or follows some other pattern. |
official_name | character | The proposed official name of the holiday. |
details | character | Additional details about the holiday, from the Wikipedia article. |
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset contains data about the numbers of tests, cases, deaths, serious/critical cases, active cases, and recovered cases in each country for every day since April 18. It also contains the population of each country, so the per-capita penetration of the virus can be calculated.
I've removed data from the "Diamond Princess" and "MS Zaandam" since they are not countries
Additionally, an auxiliary table with information about the fraction of the general population in different age groups for every country has been added (taken from Wikipedia). This is specifically relevant since the COVID-19 death rate is very much age-dependent.
The people at "www.worldometers.info" collecting and maintaining this site really are doing very important work "https://www.worldometers.info/coronavirus/#countries">https://www.worldometers.info/coronavirus/#countries
Data about the age structure of every country comes from Wikipedia.
It's possible to use this dataset for various purposes and analyses. My goal will be to use the additional data about the number of tests performed in each country to estimate the true death and infection rates of COVID-19.
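For instance, here is a minimal sketch of the per-capita calculation mentioned above; the file and column names (worldometer_daily.csv, Date, Country, Cases, Population) are placeholders, since the exact schema isn't listed in this description:

```python
import pandas as pd

# All file and column names below are placeholders; adapt them to the dataset's actual schema.
df = pd.read_csv("worldometer_daily.csv")

# Per-capita penetration: cases per million inhabitants on the latest available date.
latest = df[df["Date"] == df["Date"].max()].copy()
latest["cases_per_million"] = latest["Cases"] / latest["Population"] * 1_000_000
print(latest.sort_values("cases_per_million", ascending=False)[["Country", "cases_per_million"]].head(10))
```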
http://www.gnu.org/licenses/lgpl-3.0.html
This dataset allows you to try your hand at detecting fake images from real images. I trained a model on images that I collected from the Minecraft video game. From the provided link, you have access to my trained model, and can generate more fake data, if you like. However, if you would like additional real data, you will need to capture it from Minecraft yourself.
The following is a real image from Minecraft:
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-34.jpg?raw=true
This Minecraft image is obviously fake:
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-202.jpg?raw=true
Some images are not as easily guessed, such as this fake image:
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-493.jpg?raw=true
You will also have to contend with multiple times of the day. Darker images will be more difficult for your model.
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-477.jpg?raw=true
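As a starting point for the real-vs-fake detection task, here is a minimal transfer-learning sketch in Python with PyTorch. The data/train/real and data/train/fake folder layout is an assumption about how you arrange the downloaded images, not how the dataset ships:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: data/train/real/*.jpg and data/train/fake/*.jpg
# (this folder structure is an assumption, not the dataset's own layout).
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("data/train", transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

# Small transfer-learning baseline: frozen ResNet-18 features, new 2-class head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for x, y in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

A darker or more ambiguous image set (like the examples above) is exactly where a baseline like this will struggle, so augmentation and a validation split would be natural next steps.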