57 datasets found

Common languages used for web content 2025, by share of websites
statista.com
ai-chatbox.pro
Updated Feb 11, 2025
Cite
Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
Dataset updated
Feb 11, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 2025
Area covered
Worldwide
Description
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.

English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The combined internet user base of both countries, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
English Word Frequency
kaggle.com
Updated Sep 6, 2017
Cite
Rachael Tatman (2017). English Word Frequency [Dataset]. https://www.kaggle.com/rtatman/english-word-frequency/code
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 6, 2017
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rachael Tatman
Description
Context:

How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency: how often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.

Content:

This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
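
As a rough illustration of the preprocessing step mentioned under Context, the sketch below treats the most frequent words in a count table as stop words and removes them from a token list. The "word,count" CSV layout, the sample rows, and the helper names are assumptions made for this example; they are not taken from the dataset files.

```python
import csv
import io

# Tiny made-up sample in an assumed "word,count" CSV layout (not real data).
SAMPLE_CSV = """word,count
the,1000
of,600
and,500
corpus,12
treebank,3
"""


def load_counts(text):
    """Parse 'word,count' rows into a dict mapping word -> integer count."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["word"]: int(row["count"]) for row in reader}


def top_n_words(counts, n):
    """Return the set of the n most frequent words."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


def remove_frequent(tokens, stop_words):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


counts = load_counts(SAMPLE_CSV)
stop_words = top_n_words(counts, 3)
filtered = remove_frequent("The corpus of the treebank".split(), stop_words)
print(filtered)
```

The cutoff of three words is arbitrary here; with a real frequency list the threshold would normally be tuned to the task.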

Acknowledgements:

Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.

The code used to generate this dataset is distributed under the MIT License.

Inspiration:

Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like Japanese?

What differences are there between the very frequent words in this dataset and the frequent words in other corpora, such as the Brown Corpus or the TIMIT corpus? What might these differences tell us about how language is used?
Credibility Corpus with several datasets (Twitter, Web database) in French...
live.european-language-grid.eu
txt
Updated Apr 10, 2024
Cite
(2024). Credibility Corpus with several datasets (Twitter, Web database) in French and English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7468
Available download formats: txt
Dataset updated
Apr 10, 2024
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
French
Description
These datasets were built to analyze information credibility in general (rumor and disinformation in English and French documents) as it occurs on the social web. Target databases about rumors, hoaxes and disinformation helped to collect clear cases of misinformation. Some topics (with keywords) helped us build corpora from the microblogging platform Twitter, a major source of rumors and disinformation.

1 corpus describes texts from the web database about rumors and disinformation.
4 corpora from social media (Twitter) about specific rumors (2 in English, 2 in French).
4 corpora from social media (Twitter) randomly built (2 in English, 2 in French).
4 corpora from social media (Twitter) about specific rumors (2 in English, 2 in French).

Size of the different corpora:
Social Web Rumorous corpus: 1,612
French Hollande Rumorous corpus (Twitter): 371
French Lemon Rumorous corpus (Twitter): 270
English Pin Rumorous corpus (Twitter): 679
English Swine Rumorous corpus (Twitter): 1024
French 1st Random corpus (Twitter): 1000
French 2nd Random corpus (Twitter): 1000
English 3rd Random corpus (Twitter): 1000
English 4th Random corpus (Twitter): 1000
French Rihanna Event corpus (Twitter): 543
English Rihanna Event corpus (Twitter): 1000
French Euro2016 Event corpus (Twitter): 1000
English Euro2016 Event corpus (Twitter): 1000

A matrix links tweets with the 50 most frequent words.

Text data:
_id: message id
body text: string text data

Matrix data:
52 columns (first column is id, second column is rumor indicator 1 or -1, other columns are words; the value is 1 if the message contains the word and 0 if it does not)
11,102 lines (each line is a message)
Hidalgo corpus: lines range 1:75
Lemon corpus: lines range 76:467
Pin rumor: lines range 468:656
swine: lines range 657:1311
random messages: lines range 1312:11103

Sample contains:
French Pin Rumorous corpus (Twitter): 679
Matrix data: 52 columns (first column is id, second column is rumor indicator 1 or -1, other columns are words; the value is 1 if the message contains the word and 0 if it does not)
189 lines (each line is a message)
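
The 52-column matrix layout described above could be read along the following lines. This is only a sketch: the comma delimiter, the reading of 1 as "rumor" and -1 as "not rumor", the sample row, and the function name are assumptions, since the description does not specify them.

```python
def parse_matrix_row(line, delimiter=","):
    """Split one matrix line into (message_id, is_rumor, word_flags).

    Assumed layout: column 1 is the message id, column 2 is the rumor
    indicator (1 or -1), and the remaining 50 columns are 0/1 word flags.
    """
    fields = line.strip().split(delimiter)
    if len(fields) != 52:
        raise ValueError(f"expected 52 columns, got {len(fields)}")

    message_id = fields[0]
    label = int(fields[1])
    if label not in (1, -1):
        raise ValueError(f"unexpected rumor indicator: {label}")

    word_flags = [int(value) for value in fields[2:]]
    if any(flag not in (0, 1) for flag in word_flags):
        raise ValueError("word columns must be 0 or 1")

    # Assumption: 1 marks a rumor message, -1 a non-rumor message.
    return message_id, label == 1, word_flags


# Made-up row: an id, a rumor indicator, then 50 alternating word flags.
sample = ",".join(["msg-001", "1"] + ["1", "0"] * 25)
msg_id, is_rumor, flags = parse_matrix_row(sample)
print(msg_id, is_rumor, sum(flags))
```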
English Web Treebank Dataset
paperswithcode.com
Updated Oct 20, 2023
Cite
Bies (2023). English Web Treebank Dataset [Dataset]. https://paperswithcode.com/dataset/english-web-treebank
Dataset updated
Oct 20, 2023
Authors
Bies
Description
English Web Treebank is a dataset containing 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers were collected and annotated.
fineweb
huggingface.co
Cite
FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
Unique identifier
https://doi.org/10.57967/hf/2493
Dataset authored and provided by
FineData
License
https://choosealicense.com/licenses/odc-by/
Description
🍷 FineWeb

15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
#PraCegoVer dataset
data.niaid.nih.gov
zenodo.org
Updated Jan 19, 2023
Cite
Gabriel Oliveira dos Santos (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Dataset updated
Jan 19, 2023
Dataset provided by
Esther Luna Colombini
Sandra Avila
Gabriel Oliveira dos Santos
Description
Automatically describing images using natural sentences is essential for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

The PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
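
To make the structure concrete, the sketch below parses a list of JSON objects with the five attributes listed above, indexes the records by filename, and computes the average caption length in words. The two sample records, their values, and the date format are invented for illustration; only the attribute names come from the description.

```python
import json
from statistics import mean

# Two made-up records using the attribute names from the description.
# All values, including the date format, are hypothetical.
SAMPLE_JSON = """[
  {"user": "user_1", "filename": "img_1.jpg",
   "raw_caption": "#PraCegoVer Foto de um gato.",
   "caption": "Foto de um gato.", "date": "2020-01-01"},
  {"user": "user_2", "filename": "img_2.jpg",
   "raw_caption": "#PraCegoVer Duas pessoas sorrindo na praia.",
   "caption": "Duas pessoas sorrindo na praia.", "date": "2020-01-02"}
]"""

REQUIRED_KEYS = {"user", "filename", "raw_caption", "caption", "date"}

records = json.loads(SAMPLE_JSON)

# Check that every record carries the documented attributes.
for record in records:
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record is missing attributes: {sorted(missing)}")

# Each record points at exactly one image through its filename.
by_filename = {record["filename"]: record for record in records}

# Average caption length in whitespace-separated words.
avg_words = mean(len(record["caption"].split()) for record in records)
print(len(by_filename), avg_words)
```

With the real dataset.json, the same loop would run over the file contents loaded with json.load instead of the inline sample.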

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
WIPNZ2013: World Internet Project New Zealand - Dataset - data.govt.nz -...
catalogue.data.govt.nz
Cite
WIPNZ2013: World Internet Project New Zealand - Dataset - data.govt.nz - discover and use data [Dataset]. https://catalogue.data.govt.nz/dataset/oai-figshare-com-article-2003307
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
New Zealand
Description
Since 2007, the Institute of Culture, Discourse and Communication (ICDC) at AUT University has been conducting a long-term survey to track trends in Internet use, and to document the role and impact of the Internet in New Zealand society. The Internet has changed how business and trade deals are made, and how schools and other academic institutions, councils, media and advertisers operate. The Internet also affects family interaction, the ways in which people form new friendships, and the communities to which people belong.

The World Internet Project New Zealand is an extensive research project that aims to provide important information about the social, cultural, political and economic influence of the Internet and related digital technologies. As part of the World Internet Project, an international collaborative research effort, WIP NZ enables valid and rigorous comparison between New Zealand and 30 other countries around the world. Each partner country in WIP shares a set of 30 common questions.

ICDC's longitudinal survey includes a cross-section of participants aged 12 and up across New Zealand. A quota ensures that people of Māori, Pasifika and Asian descent, and the range of age groups, are not underrepresented. The survey investigates Internet access and targets Internet users as well as non-users: who uses this technology and what they do online. It also considers offline activities such as how much time is spent with friends and family. Other questions address issues such as the effects of the Internet on language use and cultural development; the role of the Internet in accessing information or purchasing products; and how the Internet affects the educational and social development of New Zealand children. In addition to studying the impact of the Internet, the survey tracks the effectiveness of strategies to address issues such as the digital divide between rich and poor, and between urban and rural.

Universe: People 12 years and over with a landline phone.

Data Collection: Phoenix Research Ltd; Buzz Channel.

Sampling: The sample design involved the following strata:
Recontact of those in the 2011 (and earlier) samples who had indicated that they were prepared to consider answering a further wave of the WIP study. Of these, those who had provided an email address in a previous sample were invited to complete the survey online; the remainder were contacted using CATI telephone interviewing.
A fresh CATI telephone sample drawn to provide adequate coverage (in conjunction with the recontact and online components) of the New Zealand population.
Fresh simple random sample of phone numbers.
Three further simple random targeted booster samples of phone numbers within mesh blocks known to have: >30% Māori people; >30% Pasifika people; >30% Asian people.
An online panel sample drawn to provide adequate coverage (in conjunction with the recontact and fresh telephone components) of the New Zealand population.
An online sample of people without landlines, also members of the same panel.

The sampling frames for the CATI telephone fresh simple random sample and the three targeted booster samples were calculated by using 2006 census data on the number of households with access to a telephone (using a database of phone numbers purchased from Yellow Ltd). This sampling strategy incorporates over-sampling of Māori, Pasifika and Asian people (often under-represented populations) to ensure adequate numbers of respondents in these cells. Representative coverage of geographic areas and gender was ensured by the setting of quotas based on census data.

Exclusions: non-users of the internet without landlines; non-English speakers; those refusing.

Mode: Telephone interview.
Countries with the highest number of internet users 2025
statista.com
Updated Feb 10, 2025
Cite
Statista (2025). Countries with the highest number of internet users 2025 [Dataset]. https://www.statista.com/statistics/262966/number-of-internet-users-in-selected-countries/
Dataset updated
Feb 10, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 2025
Area covered
World
Description
As of February 2025, China ranked first among the countries with the most internet users worldwide. The world's most populated country had 1.11 billion internet users, more than triple the third-ranked United States, with around 322 million internet users. Overall, the BRIC markets together had over two billion internet users, accounting for four of the ten countries with more than 100 million internet users.

Worldwide internet usage
As of October 2024, there were more than five billion internet users worldwide. There are, however, stark differences in user distribution according to region. Eastern Asia is home to 1.34 billion internet users, while African and Middle Eastern regions had lower user figures. Moreover, urban areas showed a higher percentage of internet access than rural areas.

Internet use in China
China ranks first in the list of countries with the most internet users. Due to its ongoing and fast-paced economic development and a cultural inclination towards technology, more than a billion of China's estimated 1.4 billion people are online. As of the third quarter of 2023, around 87 percent of Chinese internet users reported using WeChat, the most popular social network in the country. On average, Chinese internet users spent five hours and 33 minutes online daily.
FineWeb Dataset
paperswithcode.com
Updated May 27, 2025
Cite
(2025). FineWeb Dataset [Dataset]. https://paperswithcode.com/dataset/fineweb
Dataset updated
May 27, 2025
Description
The FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and runs on the datatrove library, our large-scale data processing library.

FineWeb was originally meant to be a fully open replication of RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding additional filtering steps, we managed to push the performance of FineWeb well above that of the original RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of benchmark tasks.
General domain Human-Human conversation chats in English
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). General domain Human-Human conversation chats in English [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-general-domain-conversation-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
This training dataset comprises more than 10,000 text conversations between two native English speakers in the general domain. The chats cover a variety of topics, services and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, and movies, which makes the dataset diverse.
These chats contain language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Apart from being specific to its topic, each chat contains various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people that are a part of the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country information, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
This training dataset's licence belongs to FutureBeeAI!
English language Web pages dataset
figshare.com
txt
Updated Jan 29, 2017
Cite
Lulwah Alkwai (2017). English language Web pages dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4588729.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.4588729.v2
Dataset updated
Jan 29, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Lulwah Alkwai
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This dataset contains 8,576 URIs with content determined to be in the English language. The URIs were collected from DMOZ. All 8,576 URIs were available on the live Web as of December 2015.

This data is used and further described in the journal article:

Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2017. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems (TOIS).

This work was an extension of the paper:

Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2015. How Well Are Arabic Websites Archived?. In Proceedings of the 15th IEEE/ACM Joint Conference on Digital Libraries (JCDL). ACM
Cost of Living | +144k Tweets - ENG | Aug/Sep 2022
kaggle.com
Updated Sep 9, 2022
Cite
Tleonel (2022). Cost of Living | +144k Tweets - ENG | Aug/Sep 2022 [Dataset]. http://doi.org/10.34740/kaggle/ds/2438280
Unique identifier
https://doi.org/10.34740/kaggle/ds/2438280
Dataset updated
Sep 9, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Tleonel
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
💸💸💸 Cost of Living - 144k Tweets in English | Aug - Sept 2022 💸💸💸

UPDATED Sept 9th

The cost of living is a scorching topic. This dataset is composed of tweets sent from August 20 to Sept 9 2022, with over 144k tweets. All tweets are in English and are from different countries. Below is a breakdown of columns and the data in them.

Columns Description

[x] date_time - Date and Time tweet was sent

[x] username - Username that sent the tweet

[x] user_location - Location entered in the account location info on Twitter

[x] user_description - Text added to "about" in account

[x] verified - If the user has the "verified by Twitter" blue tick

[x] followers_count - Number of Followers

[x] following_count - Number of accounts followed by the person who sent the tweet

[x] tweet_like_count - How many people liked the tweet

[x] tweet_retweet_count - How many people retweeted the tweet

[x] tweet_reply_count - How many people replied to that tweet

[x] source - Where was the tweet sent from. The link has info if using iPhone, Android and others

[x] tweet_text - Text sent in the tweet
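
As one way to work with the columns listed above, the sketch below reads rows with the standard csv module and ranks tweets by a simple engagement total (likes plus retweets plus replies). The sample rows, the timestamp format, the subset of columns shown, and the engagement definition are assumptions for illustration; only the column names come from the description.

```python
import csv
import io

# Made-up rows using a subset of the documented column names.
SAMPLE_CSV = """date_time,username,verified,tweet_like_count,tweet_retweet_count,tweet_reply_count,tweet_text
2022-08-20 10:00:00,alice,False,3,1,0,Energy bills are up again
2022-08-21 12:30:00,bob,True,10,4,2,Cost of living update
"""

COUNT_COLUMNS = ("tweet_like_count", "tweet_retweet_count", "tweet_reply_count")


def engagement(row):
    """Sum likes, retweets and replies for one tweet row."""
    return sum(int(row[column]) for column in COUNT_COLUMNS)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Tweet with the highest combined engagement in the sample.
top = max(rows, key=engagement)
print(top["username"], engagement(top))
```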
homeo-dataset
huggingface.co
hf-proxy-cf.effarig.site
Updated Oct 15, 2021
Cite
Akhil Singh (2021). homeo-dataset [Dataset]. https://huggingface.co/datasets/akhilhsingh/homeo-dataset
Dataset updated
Oct 15, 2021
Authors
Akhil Singh
License
https://choosealicense.com/licenses/odc-by/
Description
🍷 FineWeb

15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full… See the full description on the dataset page: https://huggingface.co/datasets/akhilhsingh/homeo-dataset.
MassiveText Dataset
paperswithcode.com
library.toponeai.link
Updated May 23, 2025
Cite
Jack W. Rae; Sebastian Borgeaud; Trevor Cai; Katie Millican; Jordan Hoffmann; Francis Song; John Aslanides; Sarah Henderson; Roman Ring; Susannah Young; Eliza Rutherford; Tom Hennigan; Jacob Menick; Albin Cassirer; Richard Powell; George van den Driessche; Lisa Anne Hendricks; Maribeth Rauh; Po-Sen Huang; Amelia Glaese; Johannes Welbl; Sumanth Dathathri; Saffron Huang; Jonathan Uesato; John Mellor; Irina Higgins; Antonia Creswell; Nat McAleese; Amy Wu; Erich Elsen; Siddhant Jayakumar; Elena Buchatskaya; David Budden; Esme Sutherland; Karen Simonyan; Michela Paganini; Laurent SIfre; Lena Martens; Xiang Lorraine Li; Adhiguna Kuncoro; Aida Nematzadeh; Elena Gribovskaya; Domenic Donato; Angeliki Lazaridou; Arthur Mensch; Jean-Baptiste Lespiau; Maria Tsimpoukelli; Nikolai Grigorev; Doug Fritz; Thibault Sottiaux; Mantas Pajarskas; Toby Pohlen; Zhitao Gong; Daniel Toyama; Cyprien de Masson d'Autume; Yujia Li; Tayfun Terzi; Vladimir Mikulik; Igor Babuschkin; Aidan Clark; Diego de Las Casas; Aurelia Guy; Chris Jones; James Bradbury; Matthew Johnson; Blake Hechtman; Laura Weidinger; Iason Gabriel; William Isaac; Ed Lockhart; Simon Osindero; Laura Rimell; Chris Dyer; Oriol Vinyals; Kareem Ayoub; Jeff Stanway; Lorrayne Bennett; Demis Hassabis; Koray Kavukcuoglu; Geoffrey Irving (2025). MassiveText Dataset [Dataset]. https://paperswithcode.com/dataset/massivetext
Dataset updated
May 23, 2025
Authors
Jack W. Rae; Sebastian Borgeaud; Trevor Cai; Katie Millican; Jordan Hoffmann; Francis Song; John Aslanides; Sarah Henderson; Roman Ring; Susannah Young; Eliza Rutherford; Tom Hennigan; Jacob Menick; Albin Cassirer; Richard Powell; George van den Driessche; Lisa Anne Hendricks; Maribeth Rauh; Po-Sen Huang; Amelia Glaese; Johannes Welbl; Sumanth Dathathri; Saffron Huang; Jonathan Uesato; John Mellor; Irina Higgins; Antonia Creswell; Nat McAleese; Amy Wu; Erich Elsen; Siddhant Jayakumar; Elena Buchatskaya; David Budden; Esme Sutherland; Karen Simonyan; Michela Paganini; Laurent SIfre; Lena Martens; Xiang Lorraine Li; Adhiguna Kuncoro; Aida Nematzadeh; Elena Gribovskaya; Domenic Donato; Angeliki Lazaridou; Arthur Mensch; Jean-Baptiste Lespiau; Maria Tsimpoukelli; Nikolai Grigorev; Doug Fritz; Thibault Sottiaux; Mantas Pajarskas; Toby Pohlen; Zhitao Gong; Daniel Toyama; Cyprien de Masson d'Autume; Yujia Li; Tayfun Terzi; Vladimir Mikulik; Igor Babuschkin; Aidan Clark; Diego de Las Casas; Aurelia Guy; Chris Jones; James Bradbury; Matthew Johnson; Blake Hechtman; Laura Weidinger; Iason Gabriel; William Isaac; Ed Lockhart; Simon Osindero; Laura Rimell; Chris Dyer; Oriol Vinyals; Kareem Ayoub; Jeff Stanway; Lorrayne Bennett; Demis Hassabis; Koray Kavukcuoglu; Geoffrey Irving
Description
MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.

Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).
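
The figures in the usage note imply a rough dataset size: if 300B tokens are 12.8% of the total, the full dataset holds about 300B / 0.128, or roughly 2.34T tokens. The sketch below works through that arithmetic and then shows one generic way to draw documents from subsets according to fixed proportions. The proportions are hypothetical placeholders, not the values used for Gopher, and the subset labels simply follow the source types named in the description.

```python
import random

# Figures stated in the usage note.
training_tokens = 300e9      # tokens Gopher is trained on
fraction_of_dataset = 0.128  # share of the dataset those tokens represent

implied_total_tokens = training_tokens / fraction_of_dataset
print(f"implied total: {implied_total_tokens / 1e12:.2f}T tokens")

# Hypothetical sampling proportions per subset (placeholders only).
HYPOTHETICAL_PROPORTIONS = {"web": 0.5, "books": 0.3, "news": 0.1, "code": 0.1}


def sample_subsets(k, seed=0):
    """Pick the source subset for each of k training documents."""
    rng = random.Random(seed)
    names = list(HYPOTHETICAL_PROPORTIONS)
    weights = [HYPOTHETICAL_PROPORTIONS[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


picks = sample_subsets(1000)
print({name: picks.count(name) for name in HYPOTHETICAL_PROPORTIONS})
```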

Find Datasheets in the Gopher paper.
Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...
data.niaid.nih.gov
zenodo.org
Updated Sep 20, 2024
Cite
Thakur, Nirmalya (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13738597
Dataset updated
Sep 20, 2024
Dataset authored and provided by
Thakur, Nirmalya
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Please cite the following paper when using this dataset:

N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

Abstract

The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset.

After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into

one of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral,

hate or not hate

anxiety/stress detected or no anxiety/stress detected.

These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.

The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

The following table describes the attributes in this dataset:

Attribute Name | Attribute Description
Post ID | Unique ID of each Instagram post
Post Description | Complete description of each post in the language in which it was originally published
Date | Date of publication in MM/DD/YYYY format
Language | Language of the post as detected using the Google Translate API
Translated Post Description | Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
Sentiment | Results of sentiment analysis (using translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
Hate | Results of hate speech detection (using translated Post Description) where each post was classified as hate or not hate
Anxiety or Stress | Results of anxiety or stress detection (using translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected
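
As a minimal sketch of how the attributes described above might be checked after loading, the following Python parses the MM/DD/YYYY date and validates it against the stated collection window and sentiment classes. The function names are hypothetical, and the exact label spellings stored in the released files (for example capitalization) are an assumption, not something the description confirms.

```python
from datetime import date, datetime

# Sentiment classes as listed in the dataset description.
SENTIMENTS = {"fear", "surprise", "joy", "sadness", "anger", "disgust", "neutral"}

# Publication window stated in the abstract (inclusive on both ends).
WINDOW_START = date(2022, 7, 23)
WINDOW_END = date(2024, 9, 5)


def parse_post_date(text: str) -> date:
    """Parse a Date attribute given in MM/DD/YYYY format."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def in_collection_window(d: date) -> bool:
    """Return True if the date falls inside the stated collection period."""
    return WINDOW_START <= d <= WINDOW_END


def is_known_sentiment(label: str) -> bool:
    """Case-insensitive check against the seven sentiment classes (assumed spelling)."""
    return label.strip().lower() in SENTIMENTS
```

A row whose date falls outside the window or whose sentiment label is not one of the seven classes would be worth inspecting before training on the data.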
Democracy and English Indicators
scidb.cn
Updated Apr 12, 2024
Cite
Abdullah AlKhuraibet (2024). Democracy and English Indicators [Dataset]. http://doi.org/10.57760/sciencedb.16236
Unique identifier
https://doi.org/10.57760/sciencedb.16236
Dataset updated
Apr 12, 2024
Dataset provided by
Science Data Bank
Authors
Abdullah AlKhuraibet
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The data were collected to test whether English proficiency levels in a country are positively associated with higher democratic values in that country. English proficiency is sourced from Education First’s "EF English Proficiency Index" (EPI), which covers countries' scores for the calendar years 2022 and 2021. The EF English Proficiency Index ranks 111 countries in five categories based on English proficiency scores calculated from the test results of 2.1 million adults. Democratic values are operationalized through the liberal democracy index (LDI) from the V-Dem Institute annual report for 2022 and 2021. Additionally, the data are used to test whether English-language media consumption acts as a mediating variable between English proficiency and democracy levels in a country, while also looking at other possible regression variables. The linear regression analyses were conducted in Microsoft Excel. The raw data set covers 90 nation states over two years, 2021 and 2022, and is used for two separate data sets. The first, democracy indicators, has the regression variables EPI, HDI, and GDP, with a total of 360 data entries. HDI scores are a statistical summary measure developed by the United Nations Development Programme (UNDP), which measures the levels of human development in 190 countries. The data for nominal gross domestic product (GDP) are sourced from the World Bank. Including regression variables such as GDP and HDI, which have been shown to have a positive link with democracy, allows the regression analysis to identify whether there is a true relationship between English proficiency and democracy levels in a country.
The second data set has a total of 720 data entries and aims to identify English proficiency indicators. It has seven regression variables: LDI scores, Years of Mandatory English Education, Heads of State Publicly Speaking English, GDP PPP (2021 USD), Commonwealth, BBC web traffic, and CNN web traffic. The data for years of mandatory English education are sourced from research at the University of Winnipeg and are coded in the data set based on the number of years a country has English as a mandatory subject. The range of this data is from 0 to 13 years of English being mandatory. It is important to note that this data only concerns public schools and does not extend to the private school systems in each country. The data for heads of state publicly speaking English were collected through a video analysis of all heads of state. To ensure the accuracy of the data collected, data were only used for heads of state who had been in their position for at least a year; for heads of state who had not been in their position for a year, data were taken from the previous head of state. This data only takes into account speeches and interviews conducted during their incumbency. The data for each country’s GDP PPP scores are sourced from the World Bank, were last updated for a majority of the countries in 2021, and are tied to the US dollar. Data for the Commonwealth only include members of the Commonwealth that were historically colonized by the United Kingdom. Any country that falls under that category is coded as 1 and any country that does not is coded as 0. BBC and CNN web traffic data are sourced using tools in Semrush, which provide a rough estimate of how much web traffic each news site generates in each country. These estimates are used to identify the average web traffic for BBC News and CNN World News for both the 2021 and 2022 calendar years.
The traffic for each country will also be measured per capita, per 10 thousand people to ensure that the population density of a country does not influence the results. The population of each country for both 2021 and 2022 is sourced from the United Nations revision of World Population Prospects of both 2021 and 2022 respectively.
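
The per-capita normalization described above can be sketched as follows. This is an illustrative calculation only: the function name and the example figures are hypothetical and are not taken from the dataset, which was prepared in Microsoft Excel.

```python
def visits_per_10k(visits: float, population: float) -> float:
    """Scale a country's web traffic to visits per 10,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return visits * 10_000 / population


# Hypothetical example: 50,000 visits in a country of 5,000,000 people.
example = visits_per_10k(50_000, 5_000_000)
print(example)  # 100.0
```

Dividing by population before comparing countries keeps a large country from appearing to consume more English-language news simply because it has more people.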
Data from: Macedonian-English parallel corpus MaCoCu-mk-en 2.0
live.european-language-grid.eu
binary format
Updated Apr 25, 2023
+ more versions
Cite
(2023). Macedonian-English parallel corpus MaCoCu-mk-en 2.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21560
Available download formats: binary format
Dataset updated
Apr 25, 2023
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other domains as well.

The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicated paragraphs, and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and BicleanerAI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus.

The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. They both consist of pairs of source and target segments (one or several sentences) and additional metadata. The following metadata is included in both sentence-level formats:
- source and target document URL;
- paragraph ID which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3” which means “paragraph 35 out of 77, sentence 1 out of 3”);
- quality score as provided by the tool Bicleaner AI (a likelihood of a pair of sentences being mutual translations, provided with a score between 0 and 1);
- similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
- personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
- translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified by using a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines if the translation has been produced by a machine-translation system;
- a DSI class (“dsi”): information whether the segment is connected to any of Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal), defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
- English language variant: the language variant of English (British or American, using a lexicon-based English variety classifier - https://pypi.org/project/abclf/) was identified on document and domain level.
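
The paragraph ID format quoted above (“p35:77s1/3”) can be unpacked with a small parser. The following Python is a sketch based only on that single example; the class and function names are hypothetical, and whether every ID in the corpus follows exactly this pattern is an assumption.

```python
import re
from typing import NamedTuple


class ParagraphPosition(NamedTuple):
    paragraph: int         # index of the paragraph in the document
    paragraphs_total: int  # number of paragraphs in the document
    sentence: int          # index of the sentence in the paragraph
    sentences_total: int   # number of sentences in the paragraph


# Pattern inferred from the documented example "p35:77s1/3".
_PARAGRAPH_ID = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")


def parse_paragraph_id(value: str) -> ParagraphPosition:
    """Split a paragraph ID into its four numeric components."""
    match = _PARAGRAPH_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognized paragraph ID: {value!r}")
    return ParagraphPosition(*(int(group) for group in match.groups()))


print(parse_paragraph_id("p35:77s1/3"))
```

Raising an error on unmatched input, rather than guessing, makes it obvious if the real files contain IDs in a shape this sketch does not anticipate.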

Furthermore, the sentence-level TXT format provides additional metadata:
- web domain of the text;
- source and target document title;
- the date when the original file was retrieved;
- the original type of the file (e.g., “html”), from which the sentence was extracted;
- paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext);
- information whether the sentence is a heading or not in the original document.

The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American).
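
Since the parallel documents are stored in base64, a reader needs to decode them before use. The sketch below shows one way to do that in Python; the function name is hypothetical, and treating the decoded bytes as UTF-8 text is an assumption that the description does not state.

```python
import base64


def decode_document(encoded: str, encoding: str = "utf-8") -> str:
    """Decode one base64-encoded document field into text.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    return base64.b64decode(encoded).decode(encoding)


# Round-trip demonstration with a made-up Macedonian string.
sample = base64.b64encode("Добредојдовте".encode("utf-8")).decode("ascii")
print(decode_document(sample))
```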

As opposed to the previous version, this version has more accurate metadata on languages of the texts, which was achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools, used for web corpora creation and curation, have been updated as well, resulting in an even cleaner corpus. The new version also provides additional metadata, such as the position of the sentence in the paragraph and document, and information whether the sentence is related to a DSI. Moreover, the corpus is now also provided in a document-level format.

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
ccaligned_multilingual
huggingface.co
opendatalab.com
Updated Mar 27, 2021
+ more versions
Cite
Ahmed El-Kishky (2021). ccaligned_multilingual [Dataset]. https://huggingface.co/datasets/ahelk/ccaligned_multilingual
Dataset updated
Mar 27, 2021
Authors
Ahmed El-Kishky
License
https://choosealicense.com/licenses/unknown/
Description
CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These web-document pairs were constructed by performing language identification on raw web documents and ensuring that the corresponding language codes appeared in the URLs of the web documents. This pattern-matching approach yielded more than 100 million aligned documents paired with English. Because each English document was often aligned to multiple documents in different target languages, we can join on English documents to obtain aligned documents that directly pair two non-English documents (e.g., Arabic-French).
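
The join on English documents described above can be illustrated with a small Python sketch. The input layout (tuples of English URL, language code, and target URL), the function name, and the example URLs are all hypothetical; this is not the CCAligned authors' code, only one way the pivoting idea could be expressed.

```python
from collections import defaultdict
from itertools import combinations


def pivot_through_english(pairs):
    """Derive non-English document pairs that share an English document.

    `pairs` is an iterable of (english_url, lang, target_url) tuples.
    If one English document has several targets in the same language,
    this sketch keeps only the last one seen.
    """
    by_english = defaultdict(dict)
    for english_url, lang, target_url in pairs:
        by_english[english_url][lang] = target_url

    derived = []
    for targets in by_english.values():
        # Every unordered pair of languages aligned to the same English page.
        for (lang_a, url_a), (lang_b, url_b) in combinations(sorted(targets.items()), 2):
            derived.append((lang_a, url_a, lang_b, url_b))
    return derived


example_pairs = [
    ("https://example.com/en/a", "ar", "https://example.com/ar/a"),
    ("https://example.com/en/a", "fr", "https://example.com/fr/a"),
    ("https://example.com/en/b", "fr", "https://example.com/fr/b"),
]
print(pivot_through_english(example_pairs))
```

In this example only the first English page has two aligned targets, so it produces a single Arabic-French pair; the second page has one target and contributes nothing.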
WageIndicator Survey
datasets.iza.org
dataverse.iza.org
zip
Updated Jan 29, 2024
Cite
Research Data Center of IZA (IDSC) (2024). WageIndicator Survey [Dataset]. http://doi.org/10.15185/wif.1
Available download formats: zip(4109134), zip(1429922892)
Unique identifier
https://doi.org/10.15185/wif.1
Dataset updated
Jan 29, 2024
Dataset provided by
Research Data Center of IZA (IDSC)
License
https://www.iza.org/wc/dataverse/IIL-1.0.pdf
Time period covered
2000 - 2021
Area covered
India, Angola, Indonesia, Azerbaijan, Zambia, Sweden, Italy, France, Zimbabwe, Kazakhstan
Description
The WageIndicator Survey is a continuous, multilingual, multi-country web survey, conducted across 65 countries since 2000. The web survey generates cross-sectional and longitudinal data, especially about wages, benefits, working hours, working conditions and industrial relations. The survey has detailed questions about earnings, benefits, working conditions, employment contracts and training, as well as questions about education, occupation, industry and household characteristics. The WageIndicator Survey is a multilingual questionnaire and aims to collect information on wages and working conditions. As labour markets and wage setting processes vary across countries, country-specific translations have been favoured over literal translations. The WageIndicator Survey regularly includes extra survey questions for projects targeting specific countries, specific groups or specific events. These projects usually address a specific audience (employees of a company, employees in an industry, readers of a magazine, members of a trade union or an occupational association, and the like). The data from the project questions are included in the dataset.
Bias: Non-probability web-based surveys are problematic because not every individual has the same probability of being selected into the survey. The probability of being selected depends on national or regional internet access rates and on the number of visitors accessing the website. Data of such surveys form a convenience rather than a probability sample. Due to the non-probability nature of the survey and its selectivity, the obtained results cannot be generalized to the population of interest, i.e. the labour force. Comparisons with representative studies found an underrepresentation of the male labour force, part-timers, older age groups, and low educated persons. Besides other strategies to reduce the bias, WageIndicator provides different weighting schemes in order to correct for selection bias.
Data Characteristics: The data is organised in annual releases. The data of the period 2000-2005 is released as one dataset. Each data release consists of a dataset with continuous variables and one with project variables. The continuous variables can be merged across years. All variable and value labels are in English. The data does not include the text variables and verbatims from open-ended survey questions; these are available in Excel format upon request.
Spatial Coverage: The survey started in 2000 in the Netherlands. Since 2004, websites have been launched in many European countries, in North and South America and in countries in Asia. From 2008 on, websites have been launched in more African countries, as well as in Indonesia and in a number of post-Soviet countries. For each country, the questions have been translated. Multilingual countries employ multilingual questionnaires. Country-specific translations and locally accepted terminology have been favoured over literal translations.


