In March 2025, video streaming platform Twitch had approximately *** million active streamers, down from a peak of **** million in January 2021. The platform experienced a boom during the COVID-19 pandemic, when many new users used the platform to connect with friends or try their hand at livestreaming. However, this trend normalized again towards the end of the year, and the streaming space has also grown more competitive as platforms other than Twitch have evolved to attract streamers and viewers.
Popular content categories on Twitch
In 2024, most of the leading content categories on Twitch were gaming-related, except for the top spot: Just Chatting. The general conversation category accumulated *** billion hours of viewing time in the measured period. In March 2025, global Twitch audiences spent around *** million hours watching Just Chatting content, with the average viewer count of such content reaching *** thousand. HasanAbi was the most popular Just Chatting streamer on Twitch in the most recently measured month.
Game streamers
Twitch is very popular with gamers and gaming audiences, and the ranking of the most popular Twitch streamers reflects this. Ninja (real name: Richard Tyler Blevins), the top-ranked streamer on Twitch, had **** million followers in April 2025. Ninja saw a meteoric rise to fame when he was one of the first top-ranked players to stream the then-newly released Fortnite Battle Royale at the end of 2017. ibai (real name: Ibai Llanos Garatea) ranked second with ***** million followers on Twitch. With more than **** million followers, Imane Anys, better known as Pokimane, was the only woman among the most-followed Twitch streamers worldwide. Overall, women accounted for only **** percent of the top-ranked Twitch channels.
These datasets, used for node classification and transfer learning, are Twitch user-user networks of gamers who stream in a certain language. Nodes are the users themselves and the links are mutual friendships between them. Vertex features are extracted based on the games played and liked, location, and streaming habits. The datasets share the same set of node features, which makes transfer learning across networks possible. These social networks were collected in May 2018. The supervised task related to these networks is binary node classification: one has to predict whether a streamer uses explicit language.
| | DE | EN | ES | FR | PT | RU |
|---|---|---|---|---|---|---|
| Nodes | 9,498 | 7,126 | 4,648 | 6,549 | 1,912 | 4,385 |
| Edges | 153,138 | 35,324 | 59,382 | 112,666 | 31,299 | 37,304 |
| Density | 0.003 | 0.002 | 0.006 | 0.005 | 0.017 | 0.004 |
| Transitivity | 0.047 | 0.042 | 0.084 | 0.054 | 0.131 | 0.049 |
Paper: Multi-scale Attributed Node Embedding. Benedek Rozemberczki, Carl Allen, and Rik Sarkar. arXiv, 2019. https://arxiv.org/abs/1909.13021
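
The density figures for these networks can be checked approximately from the node and edge counts alone. The sketch below assumes the standard definition for a simple undirected graph, 2E / (N(N - 1)); the source does not say how its values were computed or rounded, so the last digit may differ for some languages. The function name `graph_density` is illustrative only.

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# Node and edge counts copied from the dataset statistics.
networks = {
    "DE": (9498, 153138),
    "EN": (7126, 35324),
    "ES": (4648, 59382),
    "FR": (6549, 112666),
    "PT": (1912, 31299),
    "RU": (4385, 37304),
}

densities = {lang: graph_density(n, e) for lang, (n, e) in networks.items()}
for lang, value in densities.items():
    print(f"{lang}: {value:.4f}")
```

Transitivity cannot be recovered this way, since it depends on the triangle structure of the graph rather than on the counts.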
Twitch.tv boasts over 2 million unique user views per day and more than 100 thousand channels that entertain those users. How should new streamers stand out from more established names and gather a larger audience?
Using viewership data from Twitch.tv, I develop a model to help streamers make informed choices about streaming time, game, and target language audience. I specifically consider the interaction between these choices, answering questions such as "When is the best time to stream League of Legends for a given language?" or "I am a Russian language streamer, what game attracts the largest audience?"
Additionally, I examine whether streamers should avoid time slots in which more channels are already live. This involves studying whether streamers have synergy with each other: although they act as competitors by choosing to stream similar content, together they might attract more viewers than when they stream different types of content.
The final target of the project is an application that is trained on historical Twitch data and powered by live data from the Twitch API. The application offers the best selection of streaming choices under the current Twitch environment, answering the question "I want to gather the most viewership. What game, in what language, and when should I stream?"
twitch_panel_fixedeffect.py : Panel regression model. The data source (250 MB) exceeds the 25 MB limit and is not included. Creates the regression results file 'twitch_small_panel_results.txt'.
twitch_plot.py : Plots graphs using 'twitch_small_panel_results.txt'.
twitch_small_panel_results.txt : Contains the regression results generated by twitch_panel_fixedeffect.py.
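
The script name suggests a fixed-effects panel regression, but the model itself is not shown here. As a rough illustration of the idea only, the sketch below computes a within (fixed-effects) slope for a single regressor by subtracting each channel's own mean before fitting. The function `within_estimator`, the channel names, and the numbers are all hypothetical and are not taken from the project's data.

```python
from collections import defaultdict


def within_estimator(rows):
    """Slope of y on x after removing each entity's own mean (fixed effects).

    rows: iterable of (entity, x, y) tuples.
    """
    by_entity = defaultdict(list)
    for entity, x, y in rows:
        by_entity[entity].append((x, y))

    sxy = 0.0
    sxx = 0.0
    for obs in by_entity.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            # Demeaning removes anything constant within the entity.
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("x does not vary within any entity")
    return sxy / sxx


# Hypothetical panel: (channel, competing channels in the slot, viewers).
panel = [
    ("chan_a", 10, 130.0), ("chan_a", 20, 110.0), ("chan_a", 30, 90.0),
    ("chan_b", 10, 530.0), ("chan_b", 20, 510.0), ("chan_b", 30, 490.0),
]
slope = within_estimator(panel)
print(slope)
```

In this toy panel the two channels differ greatly in baseline audience, yet the within estimator depends only on how each channel's viewers move with competition inside that channel, which is the point of including fixed effects.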
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset, titled the Twitch Plays Pokemon Dataset, contains 37.8 million IRC chat messages. It contains IRC chat log data for messages made between February 2, 2014 and April 23, 2014 (68 days). Each line denotes a single IRC chat message.
Sample of the dataset:
2014-02-14 08:17:32 medicbluea
2014-02-14 08:17:32 murderousburgerrare candy, RARE CANDY
2014-02-14 08:17:32 milk2978B
2014-02-14 08:17:32 mrtiktalikb
2014-02-14 08:17:32 dualhammersb
2014-02-14 08:17:32 shares5YES
2014-02-14 08:17:32 orangeruststart
2014-02-14 08:17:32 snowieea
2014-02-14 08:17:33 duroatedown
2014-02-14 08:17:33 crypticcraigup
2014-02-14 08:17:33 doug2725LOL HELIX FOSSIL WENT BACK THAT FAR
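
Since each record begins with a date and a time, a flattened sample like the one above can be split back into one record per message by looking for timestamps. The following is a minimal sketch that assumes the `YYYY-MM-DD` and `HH:MM:SS` layout seen in the sample; `split_records` is an illustrative helper, not part of the dataset's tooling. The boundary between username and message is not recoverable from the flattened text, so the remainder is kept as a single field.

```python
import re

# A date immediately followed (with optional whitespace) by a time.
RECORD_START = re.compile(r"(\d{4}-\d{2}-\d{2})\s*(\d{2}:\d{2}:\d{2})")


def split_records(blob):
    """Split flattened chat text into (date, time, rest) tuples."""
    matches = list(RECORD_START.finditer(blob))
    records = []
    for i, match in enumerate(matches):
        # A record runs until the next timestamp or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(blob)
        rest = blob[match.end():end].strip()
        records.append((match.group(1), match.group(2), rest))
    return records


sample = (
    "2014-02-1408:17:32milk2978B "
    "2014-02-1408:17:32shares5YES "
    "2014-02-1408:17:33duroatedown"
)
records = split_records(sample)
for record in records:
    print(record)
```

A message whose text itself contains a timestamp-like string would be split incorrectly; the original files, with one message per line, avoid that problem.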
Abstract
With the increasing importance of online communities, discussion forums, and customer reviews, Internet “trolls” have proliferated thereby making it difficult for information seekers to find relevant and correct information. In this paper, we consider the problem of detecting and identifying Internet trolls, almost all of which are human agents. Identifying a human agent among a human population presents significant challenges compared to detecting automated spam or computerized robots. To learn a troll’s behavior, we use contextual anomaly detection to profile each chat user. Using clustering and distance-based methods, we use contextual data such as the group’s current goal, the current time, and the username to classify each point as an anomaly. A user whose features significantly differ from the norm will be classified as a troll. We collected 38 million data points from the viral Internet fad, Twitch Plays Pokemon. Using clustering and distance-based methods, we develop heuristics for identifying trolls. Using MapReduce techniques for preprocessing and user profiling, we are able to classify trolls based on 10 features extracted from a user’s lifetime history.
You can view the full technical paper here: https://arxiv.org/abs/1902.06208
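
To give a sense of what a distance-based check on user profiles can look like, the sketch below standardizes each feature and flags users whose distance from the population mean exceeds a threshold. This is a generic illustration, not the paper's method: the two features, the threshold of 2.0, and the function `flag_outliers` are assumptions made for the example.

```python
import math


def flag_outliers(profiles, threshold=2.0):
    """Return users whose standardized distance from the mean exceeds threshold.

    profiles: dict mapping user name to a tuple of numeric features.
    """
    names = list(profiles)
    dims = len(profiles[names[0]])

    means, stds = [], []
    for j in range(dims):
        column = [profiles[name][j] for name in names]
        mean = sum(column) / len(column)
        variance = sum((v - mean) ** 2 for v in column) / len(column)
        means.append(mean)
        # Fall back to 1.0 so a constant feature does not divide by zero.
        stds.append(math.sqrt(variance) or 1.0)

    flagged = []
    for name in names:
        distance = math.sqrt(sum(
            ((profiles[name][j] - means[j]) / stds[j]) ** 2
            for j in range(dims)
        ))
        if distance > threshold:
            flagged.append(name)
    return flagged


# Hypothetical two-feature profiles; the values are made up.
profiles = {
    "user_a": (1.0, 1.0),
    "user_b": (1.0, 1.0),
    "user_c": (1.0, 1.0),
    "user_d": (1.0, 1.0),
    "user_e": (10.0, 10.0),
}
flagged = flag_outliers(profiles)
print(flagged)
```

A contextual variant would compute the means and deviations separately per context (for example, per time window or per group goal) rather than over the whole population.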
Source Code
Code related to this dataset can be found at: https://github.com/ahaque/twitch-troll-detection
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the data and resources used for a Twitch Emote recommendation system using a Word2Vec model. The nature and exploration of the data is described in Emotes-2-Vec: A Large Scale Embedding of Twitch Chat Data. To protect the privacy of the users whose messages were scraped to build this corpus, names and timestamps have been removed and only the message bodies are included. However, a tutorial for this project is included on the project GitHub: https://github.com/KoroshM/Emote-Recommender.
embeddings.tsv and labeled_metadata.tsv may be used in TensorFlow's embedding projector to visualize the embedding space.
Note: Model files are the following:
embeddings.tsv
labeled_metadata.tsv
model
model.model**
model.wv.vectors.npy
**Located here: https://drive.google.com/drive/folders/1RZC4JA4CpAcwoo6dOwq_jobTd6dNi_n2?usp=sharing
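
Outside the embedding projector, the two TSV files can also be used directly for a simple nearest-neighbour lookup. The sketch below assumes one tab-separated vector per line in the vectors file and one label per line, in the same order and without a header row, in the metadata file; the real files may be laid out differently. The emote vectors shown are made up for illustration, and the helper names are hypothetical.

```python
import math


def load_embeddings(vectors_tsv, metadata_tsv):
    """Pair each label with its vector, assuming matching row order."""
    vectors = [
        [float(value) for value in line.split("\t")]
        for line in vectors_tsv.strip().splitlines()
    ]
    # Only the first metadata column is used as the label.
    labels = [line.split("\t")[0] for line in metadata_tsv.strip().splitlines()]
    if len(vectors) != len(labels):
        raise ValueError("vector and metadata row counts differ")
    return dict(zip(labels, vectors))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(embeddings, query, k=3):
    """Labels of the k vectors most similar to the query label."""
    scores = [
        (cosine(embeddings[query], vector), label)
        for label, vector in embeddings.items()
        if label != query
    ]
    return [label for _, label in sorted(scores, reverse=True)[:k]]


# Toy stand-ins for the contents of embeddings.tsv and labeled_metadata.tsv.
vectors_text = "1.0\t0.0\n0.9\t0.1\n0.0\t1.0\n0.5\t0.5\n"
metadata_text = "Kappa\nKeepo\nPogChamp\nLUL\n"

embeddings = load_embeddings(vectors_text, metadata_text)
neighbours = nearest(embeddings, "Kappa", k=2)
print(neighbours)
```

With the real files, the two strings would be replaced by the file contents read from disk.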
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding to a shelf to rating and reading.
Metadata includes
reviews
add-to-shelf, read, review actions
book attributes: title, isbn
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Metadata includes
reviews
purchases, plays, recommends (likes)
product bundles
pricing information
Basic Statistics:
Reviews: 7,793,069
Users: 2,567,538
Items: 15,474
Bundles: 615
These datasets contain 1.48 million question and answer pairs about products from Amazon.
Metadata includes
question and answer text
is the question binary (yes/no), and if so does it have a yes/no answer?
timestamps
product ID (to reference the review dataset)
Basic Statistics:
Questions: 1.48 million
Answers: 4,019,744
Labeled yes/no questions: 309,419
Number of unique products with questions: 191,185
This is a multi-modal dataset for restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as metadata for each restaurant.
Likes and image data from the community art website Behance. This is a small, anonymized, version of a larger proprietary dataset.
Metadata includes
appreciates (likes)
timestamps
extracted image features
Basic Statistics:
Users: 63,497
Items: 178,788
Appreciates (likes): 1,000,000
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
Metadata includes
ratings
product images
user identities
item sizes, user genders
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and epinions (general consumer reviews).
Metadata includes
reviews
price paid (epinions)
helpfulness votes (librarything)
flags (librarything)
These datasets contain peer-to-peer trades from various recommendation platforms.
Metadata includes
peer-to-peer trades
have and want lists
image data (tradesy)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As a gamer, Statistics student, and curious person, I am always looking for ways to apply my knowledge in practical projects. More than that, I am passionate about sharing my findings and learnings with the community through my live streams on Twitch and my videos on YouTube.
To revisit a project I developed during my undergraduate studies, we decided to collect data on professional Dota 2 matches from the Open Dota API live on stream. The data were stored in a NoSQL database (MongoDB) and processed in several data layers, following the Data Lake concept, with the Apache Spark processing engine.
You can check out the project in its repository on GitHub.
This dataset is far from raw data, since it went through several stages of transformations, joins, and aggregations. The information present is each team's statistics one day before the match in question starts. These statistics are calculated from each player's match information in the 6 months preceding the match in question.
Thus, each row of this dataset contains information on which team won the match, as well as summarized and 'non-normalized' statistics for each team.
Many thanks to everyone who followed the development of this project on our live streams and supported us with subscriptions on Twitch. Your support enables us to take Data Science forward, for example by sharing this dataset with more people who are interested in developing in the area.
Our desire as a community is to bring teaching closer to people every day, and I understand that this starts in Brazil. That is why the original description is in pt-br, giving greater focus to our national audience.
If you are interested in learning more about our work, follow us on Twitch: Téo Me Why.
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
Metadata includes
product IDs
bounding boxes
Basic Statistics:
Scenes: 47,739
Products: 38,111
Scene-Product Pairs: 93,274
This is a collection of recipes paired with variants, e.g. a recipe matched with a vegan version of the same recipe.