The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
This parent dataset (collection of datasets) describes the general organization of data in the datasets for each growing season (two-year period) when winter wheat (Triticum aestivum L.) was grown for grain at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU), Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). Winter wheat was grown on two large, precision weighing lysimeters, calibrated to NIST standards (Howell et al., 1995). Each lysimeter was in the center of a 4.44 ha square field on which wheat was also grown (Evett et al., 2000). The two fields were contiguous and arranged with one directly north of the other. See the resource titled "Geographic Coordinates, USDA, ARS, Bushland, Texas" for UTM geographic coordinates for field and lysimeter locations. Wheat was planted in Autumn and grown over the winter in 1989-1990, 1991-1992, and 1992-1993. Agronomic calendar for the each of the three growing seasons list by date the agronomic practices applied, severe weather, and activities (e.g., planting, thinning, fertilization, pesticide application, lysimeter maintenance, harvest) in and on lysimeters that could influence crop growth, water use, and lysimeter data. These include fertilizer and pesticide applications. Irrigation was by linear move sprinkler system equipped with pressure regulated low pressure sprays (mid-elevation spray application, MESA). Irrigations were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings made with a field-calibrated (Evett and Steiner, 1995) neutron probe from 0.10- to 2.4-m depth in the field. The lysimeters and fields were planted to the same plant density, row spacing, tillage depth (by hand on the lysimeters and by machine in the fields), and fertilizer and pesticide applications. The weighing lysimeters were used to measure relative soil water storage to 0.05 mm accuracy at 5-min intervals, and the 5-min change in soil water storage was used along with precipitation, dew and frost accumulation, and irrigation amounts to calculate crop evapotranspiration (ET), which is reported at 15-min intervals. Each lysimeter was equipped with a suite of instruments to sense wind speed, air temperature and humidity, radiant energy (incoming and reflected, typically both shortwave and longwave), surface temperature, soil heat flux, and soil temperature, all of which are reported at 15-min intervals. Instruments used changed from season to season, which is another reason that subsidiary datasets and data dictionaries for each season are required. The Bushland weighing lysimeter research program was described by Evett et al. (2016), and lysimeter design is described by Marek et al. (1988). Important conventions concerning the data-time correspondence, sign conventions, and terminology specific to the USDA ARS, Bushland, TX, field operations are given in the resource titled "Conventions for Bushland, TX, Weighing Lysimeter Datasets". There are six datasets in this collection. Common symbols and abbreviations used in the datasets are defined in the resource titled, "Symbols and Abbreviations for Bushland, TX, Weighing Lysimeter Datasets". Datasets consist of Excel (xlsx) files. Each xlsx file contains an Introductory tab that explains the other tabs, lists the authors, describes conventions and symbols used and lists any instruments used. The remaining tabs in a file consist of dictionary and data tabs. The six datasets are as follows: Agronomic Calendars for the Bushland, Texas Winter Wheat Datasets Growth and Yield Data for the Bushland, Texas Winter Wheat Datasets Weighing Lysimeter Data for The Bushland, Texas Winter Wheat Datasets Soil Water Content Data for The Bushland, Texas, Large Weighing Lysimeter Experiments Evapotranspiration, Irrigation, Dew/frost - Water Balance Data for The Bushland, Texas Winter Wheat Datasets Standard Quality Controlled Research Weather Data – USDA-ARS, Bushland, Texas See the README for descriptions of each dataset. The soil is a Pullman series fine, mixed, superactive, thermic Torrertic Paleustoll. Soil properties are given in the resource titled "Soil Properties for the Bushland, TX, Weighing Lysimeter Datasets". The land slope in the lysimeter fields is <0.3% and topography is flat. The mean annual precipitation is ~470 mm, the 20-year pan evaporation record indicates ~2,600 mm Class A pan evaporation per year, and winds are typically from the South and Southwest. The climate is semi-arid with ~70% (350 mm) of the annual precipitation occurring from May to September, during which period the pan evaporation averages ~1520 mm. These datasets originate from research aimed at determining crop water use (ET), crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have described the facilities and research methods (Evett et al., 2016), and have focused on winter wheat ET (Howell et al., 1995, 1997, 1998), and crop coefficients (Howell et al., 2006; Schneider and Howell, 1997, 2001) that have been used by ET networks for irrigation management. The data have utility for developing, calibrating, and testing simulation models of crop ET, growth, and yield (Evett et al., 1994; Kang et al., 2009), and have been used by several universities and for testing, and calibrating models of ET that use satellite and/or weather data. Resources in this dataset: Resource Title: Geographic Coordinates of Experimental Assets, Weighing Lysimeter Experiments, USDA, ARS, Bushland, Texas. File Name: Geographic Coordinates, USDA, ARS, Bushland, Texas.xlsx. Resource Description: The file gives the UTM latitude and longitude of important experimental assets of the Bushland, Texas, USDA, ARS, Conservation & Production Research Laboratory (CPRL). Locations include weather stations [Soil and Water Management Research Unit (SWMRU) and CPRL], large weighing lysimeters, and corners of fields within which each lysimeter was centered. There were four fields designated NE, SE, NW, and SW, and a weighing lysimeter was centered in each field. The SWMRU weather station was adjacent to and immediately east of the NE and SE lysimeter fields. Resource Title: Conventions for Bushland, TX, Weighing Lysimeter Datasets. File Name: Conventions for Bushland, TX, Weighing Lysimeter Datasets.xlsx. Resource Description: Descriptions of conventions and terminology used in the Bushland, TX, weighing lysimeter research program. Resource Title: Symbols and Abbreviations for Bushland, TX, Weighing Lysimeter Datasets. File Name: Symbols and Abbreviations for Bushland, TX, Weighing Lysimeter Datasets.xlsx. Resource Description: Definitions of symbols and abbreviations used in the Bushland, TX, weighing lysimeter research datasets. Resource Title: Soil Properties for the Bushland, TX, Weighing Lysimeter Datasets. File Name: Bushland_TX_soil_properties.xlsx. Resource Description: Soil properties useful for simulation modeling and for describing the soil are given for the Pullman soil series at the USDA, ARS, Conservation & Production Research Laboratory, Bushland, TX, USA. For each soil layer, soil horizon designation and texture according to USDA Soil Taxonomy, bulk density, porosity, water content at field capacity (33 kPa) and permanent wilting point (1500 kPa), percent sand, percent silt, percent clay, percent organic matter, pH, and van Genuchten-Mualem characteristic curve parameters describing the soil hydraulic properties are given. A separate table describes the soil horizon thicknesses, designations, and textures according to USDA Soil Taxonomy. Another table describes important aspects of the soil hydrologic and rooting behavior. Resource Title: README - Bushland Texas Winter Wheat collection. File Name: README_Bushland_winter_wheat_collection.pdf. Resource Description: Descriptions of the datasets in the Bushland Texas Winter Wheat collection
This parent dataset (collection of datasets) describes the general organization of data in the datasets for each growing season (year) when alfalfa (Medicago sativa L.) was grown as a reference evapotranspiration (ETr) crop at the USDA-ARS Conservation and Production Laboratory (CPRL), Soil and Water Management Research Unit (SWMRU), Bushland, Texas (Lat. 35.186714°, Long. -102.094189°, elevation 1170 m above MSL). Alfalfa was grown on two large, precision weighing lysimeters, calibrated to NIST standards (Howell et al., 1995). Each lysimeter was in the center of a 4.44 ha square field on which alfalfa was also grown (Evett et al., 2000). The two fields were contiguous and arranged with one (labeled northeast, NE) directly north of the other (labeled southeast, SE). See the resource "Geographic Coordinates, USDA, ARS, Bushland, Texas" for UTM geographic coordinates for field and lysimeter locations. Alfalfa was planted in Autumn 1995 and grown for hay in 1996, 1997, 1998, and 1999. The resource "Agronomic Calendar for the Bushland, Texas Alfalfa Datasets", gives a calendar listing by date the agronomic practices applied, severe weather, and activities (e.g. planting, thinning, fertilization, pesticide application, lysimeter maintenance, harvest) in and on lysimeters that could influence crop growth, water use, and lysimeter data. These include fertilizer and pesticide applications. There is one calendar, from before planting in autumn 1995 to after final harvest in 1999, for the NE and SE lysimeters and fields. There were 4 harvests each year except 1998 when 5 harvests were taken. Irrigation was by linear move sprinkler system equipped with pressure regulated low pressure sprays (mid-elevation spray application, MESA). Irrigations were managed to replenish soil water used by the crop on a weekly or more frequent basis as determined by soil profile water content readings via field-calibrated (Evett and Steiner, 1995) neutron probe from 0.10- to 2.4-m depth in the field. Lysimeters and fields were planted to the same plant density, row spacing, tillage depth (by hand on the lysimeters and by machine in the fields), and fertilizer and pesticide applications. Weighing lysimeters measured relative soil water storage to 0.05 mm accuracy at 5-min intervals, and the 5-min change in soil water storage was used along with precipitation, dew and frost accumulation, and irrigation amounts to calculate crop evapotranspiration (ET), reported at 15-min intervals. Each lysimeter was instrumented to sense wind speed, air temperature and humidity, radiant energy (incoming and reflected, typically both shortwave and longwave), surface temperature, soil heat flux, and soil temperature, all at 15-min intervals. Instruments used changed from season to season, thus subsidiary datasets and data dictionaries for each season are required. The Bushland weighing lysimeter research program is described by Evett et al. (2016), and lysimeter design is described by Marek et al. (1988). Important conventions concerning the data-time correspondence, sign conventions, and terminology specific to the USDA ARS, Bushland, TX, field operations are given in the resource "Conventions for Bushland, TX, Weighing Lysimeter Datasets". There are 5 datasets in this collection. Common symbols and abbreviations used are defined in the resource "Symbols and Abbreviations for Bushland, TX, Weighing Lysimeter Datasets". Datasets consist of Excel (xlsx) files. Each xlsx file contains an Introductory tab that explains the other tabs, lists the authors, describes conventions and symbols used, and lists instruments used. The remaining tabs in a file consist of dictionary and data tabs. The 5 datasets are: Growth and Yield Data for the Bushland, Texas Alfalfa Datasets Weighing Lysimeter Data for The Bushland, Texas Alfalfa Datasets Soil Water Content Data for The Bushland, Texas, Large Weighing Lysimeter Experiments Evapotranspiration, Irrigation, Dew/frost - Water Balance Data for The Bushland, Texas Alfalfa Datasets Standard Quality Controlled Research Weather Data – USDA-ARS, Bushland, Texas See README for descriptions of each dataset. The soil is a Pullman series fine, mixed, superactive, thermic Torrertic Paleustoll. Soil properties are given in the resource titled "Soil Properties for the Bushland, TX, Weighing Lysimeter Datasets". Land slope in the lysimeter fields is <0.3% and topography is flat. Mean annual precipitation is ~470 mm, the 20-year pan evaporation record indicates ~2,600 mm Class A pan evaporation per year, and winds are typically from the South and Southwest. Climate is semi-arid with ~70% (350 mm) of the annual precipitation occurring from May to Sep, during which period the pan evaporation averages ~1520 mm. These datasets originate from research on crop water use (ET), reference ET methods for irrigation scheduling, crop coefficients for use in ET-based irrigation scheduling based on a reference ET, crop growth, yield, harvest index, and crop water productivity as affected by irrigation method, timing, amount (full or some degree of deficit), agronomic practices, cultivar, and weather. Prior publications have described the facilities and research methods (Evett et al., 2016), and have focused on alfalfa ET (Todd et al., 1998), the Bowen ratio method applied to alfalfa ET (Todd et al., 2000), comparison of alfalfa ET to ETr calculated from weather data (Evett et al., 1998, 2000), new methods of calculating ETr (Evett et al., 2010, 2012; Lascano and Evett, 2007, 2010; Lascano et al, 2010), and crop water productivity. Crop coefficients (Howell et al., 2006) have been used by ET networks for irrigation management. The data have utility for testing simulation models of crop ET (Colaizzi et al., 2005; Thorp et al., 2019), growth, and yield by several universities and for testing, and calibrating models of ET that use satellite and/or weather data. Resources in this dataset: Resource Title: Geographic Coordinates of Experimental Assets, Weighing Lysimeter Experiments, USDA, ARS, Bushland, Texas. File Name: Geographic Coordinates, USDA, ARS, Bushland, Texas.xlsx Resource Description: The file gives the UTM latitude and longitude of important experimental assets of the CPRL. Locations include weather stations (SWMRU and CPRL), large weighing lysimeters, and corners of fields within which each lysimeter was centered. There were four fields designated NE, SE, NW, and SW, and a weighing lysimeter was centered in each field. The SWMRU weather station was adjacent to and immediately east of the NE and SE lysimeter fields. Resource Title: Conventions for Bushland, TX, Weighing Lysimeter Datasets. File Name: Conventions for Bushland, TX, Weighing Lysimeter Datasets.xlsx Resource Description: Descriptions of conventions and terminology used in the Bushland, TX, weighing lysimeter research program. Resource Title: Symbols and Abbreviations for Bushland, TX, Weighing Lysimeter Datasets. File Name: Symbols and Abbreviations for Bushland, TX, Weighing Lysimeter Datasets.xlsx Resource Description: Definitions of symbols and abbreviations used in the Bushland, TX, weighing lysimeter research datasets. Resource Title: Agronomic Calendar for the Bushland, Texas Alfalfa Datasets. File Name: 1995-1999 Alfalfa Calendar.xlsx Resource Description: The calendar lists by date the agronomic practices applied, severe weather, and activities (e.g., planting, thinning, fertilization, pesticide application, lysimeter maintenance, harvest) in and on lysimeters that could influence crop growth, water use, and lysimeter data. These include fertilizer and pesticide applications. There is one calendar, from before planting in autumn 1995 to after final harvest in 1999, for the NE and SE lysimeters and fields. Resource Title: Soil Properties for the Bushland, TX, Weighing Lysimeter Datasets. File Name: Bushland_TX_soil_properties.xlsx Resource Description: Soil properties useful for simulation modeling and describing the soil are given for the Pullman soil series at the CPRL. For each soil layer, soil horizon designation and texture according to USDA Soil Taxonomy, bulk density, porosity, water content at field capacity (33 kPa) and permanent wilting point (1500 kPa), percent sand, percent silt, percent clay, percent organic matter, pH, and van Genuchten-Mualem characteristic curve parameters describing the soil hydraulic properties are given. A separate table describes the soil horizon thicknesses, designations, and textures according to USDA Soil Taxonomy. Another table describes important aspects of the soil hydrology and rooting behavior. Resource Title: README - Bushland Texas Alfalfa collection. File Name: README_Bushland_alfalfa_collection.pdf Resource Description: Descriptions of the datasets in the Bushland Texas Alfalfa collection.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom. The target words are listed in the associated report and are mostly words that denote prejudice or are often associated with social justice discourse. A few additional words not denoting prejudice are also available since they are used in the report for illustration purposes of the method.
The textual content of news and opinion articles from the outlets is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.
Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet. This threshold was chosen to maximize inclusion in our analysis of outlets with sparse amounts of articles text per year.
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
In a small percentage of articles, outlet specific XPath expressions might fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles.
Most of the incorrect frequency counts are often minor deviations from the actual counts such as for instance counting the word "Facebook" in an article footnote encouraging article readers to follow the journalist’s Facebook profile and that the XPath expression mistakenly included as the content of the article main text.To conclude, in a data analysis of over 16 million articles, we cannot manually check the correctness of frequency counts for every single article and hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 2 of main manuscript for supporting evidence of the temporal precision of the method).
The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-Percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing. These topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands. The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server. There are over 8000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed data base file named TABLES.ZIP. The uncompressed is in FoxPro data base file (dbf) format and may be imported to ACCESS, EXCEL, and other software formats. While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans. The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level. The first two characters of the file name are the state abbreviation. The next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located so that the data can be summarized to higher-level geography. The last part of the file name describes the contents . The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code. The only data included in this table is total population and total housing units. POP1 and POP2 contain selected population variables and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES, and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains automated sentiment and emotionality annotations of 23 million headlines from 47 popular news media outlets popular in the United States.
The set of 47 news media outlets analysed (listed in Figure 1 of the main manuscript) was derived from the AllSides organization 2019 Media Bias Chart v1.1. The human ratings of outlets’ ideological leanings were also taken from this chart and are listed in Figure 2 of the main manuscript.
News articles headlines from the set of outlets analyzed in the manuscript are available in the outlets’ online domains and/or public cache repositories such as The Internet Wayback Machine, Google cache and Common Crawl. Articles headlines were located in articles’ HTML raw data using outlet-specific XPath expressions.
The temporal coverage of headlines across news outlets is not uniform. For some media organizations, news articles availability in online domains or Internet cache repositories becomes sparse for earlier years. Furthermore, some news outlets popular in 2019, such as The Huffington Post or Breitbart, did not exist in the early 2000’s. Hence, our data set is sparser in headlines sample size and representativeness for earlier years in the 2000-2019 timeline. Nevertheless, 20 outlets in our data set have chronologically continuous partial or full headline data availability since the year 2000. Figure S 1 in the SI reports the number of headlines per outlet and per year in our analysis.
In a small percentage of articles, outlet specific XPath expressions might fail to properly capture the content of the headline due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. After manual testing, we determined that the percentage of headlines following in this category is very small. Additionally, our method might miss detecting some articles in the online domains of news outlets. To conclude, in a data analysis of over 23 million headlines, we cannot manually check the correctness of every single data instance and hundred percent accuracy at capturing headlines’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our headlines set is representative of headlines in print news media content for the studied time period and outlets analyzed.
The list of compressed files in this data set is listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript as well as aggregated data of sentiment and emotionality automated annotations of the headlines and human annotations of a subset of headlines sentiment and emotionality used as ground truth.
-models.rar contains the Transformer sentiment and emotion annotation models used in the analysis. Namely:
Siebert/sentiment-roberta-large-english from https://huggingface.co/siebert/sentiment-roberta-large-english. This model is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). See more information from the original authors at https://huggingface.co/siebert/sentiment-roberta-large-english
DistilbertSST2.rar is the default sentiment classification model of the HuggingFace Transformer library https://huggingface.co/ This model is only used to replicate the results of the sentiment analysis with sentiment-roberta-large-english
DistilRoberta j-hartmann/emotion-english-distilroberta-base from https://huggingface.co/j-hartmann/emotion-english-distilroberta-base. The model is a fine-tuned checkpoint of DistilRoBERTa-base. The model allows annotation of English text with Ekman's 6 basic emotions, plus a neutral class. The model was trained on 6 diverse datasets. Please refer to the original author at https://huggingface.co/j-hartmann/emotion-english-distilroberta-base for an overview of the data sets used for fine tuning. https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
-headlinesDataWithSentimentLabelsAnnotationsFromSentimentRobertaLargeModel.rar URLs of headlines analyzed and the sentiment annotations of the siebert/sentiment-roberta-large-english Transformer model. https://huggingface.co/siebert/sentiment-roberta-large-english
-headlinesDataWithSentimentLabelsAnnotationsFromDistilbertSST2.rar URLs of headlines analyzed and the sentiment annotations of the default HuggingFace sentiment analysis model fine-tuned on the SST-2 dataset. https://huggingface.co/
-headlinesDataWithEmotionLabelsAnnotationsFromDistilRoberta.rar URLs of headlines analyzed and the emotion categories annotations of the j-hartmann/emotion-english-distilroberta-base Transformer model. https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 175 million academic abstracts published in all fields of knowledge. We quantify the prevalence of words denoting prejudice against ethnicity, gender, sexual orientation, gender identity, minority religious sentiment, age, body weight and disability in SSORC abstracts over the period 1970-2020. We then examine the relationship between the prevalence of such terms in the academic literature and their concomitant prevalence in news media content. We also analyze the temporal dynamics of an additional set of terms associated with social justice discourse in both the scholarly literature and in news media content. A few additional words not denoting prejudice are also available since they are used in the manuscript for illustration purposes.
The list of academic abstracts analyzed in this work was taken from the Semantic Scholar Open Research Corpus (SSORC). The corpus contains, as of 2020, over 175 million academic abstracts, and associated metadata, published in all fields of knowledge. The raw data is provided by Semantic Scholar in accessible JSON format.
Textual content included in our analysis is circumscribed to the scholarly articles’ titles and abstracts and does not include other article elements such as main body of text or references section. Thus, we use frequency counts derived from academic articles’ titles and abstracts as a proxy for word prevalence in those articles. This proxy was used because the SSORC corpus does not provide the entire text body of the indexed articles. Targeted textual content was located in JSON data and sorted by year to facilitate chronological analysis. Tokens were lowercased prior to estimating frequency counts.
Yearly relative frequencies of a target word or n-gram in the SSORC corpus were estimated by dividing the number of occurrences of the target word/n-gram in all scholarly articles within a given year by the total number of all words in all articles of that year. This method of estimating word frequencies accounts for variable volume of total scientific output over time. This approach has been shown before to accurately capture the temporal dynamics of historical events and social trends in news media corpora.
It is possible that a small percentage of scholarly articles in the SSORC corpus contain incorrect or missing data. For earlier years in the SSORC corpus, abstract information is sometimes missing and only article’s title information is available. As a result, the total and target word count metrics for a small subset of academic abstracts might not be precise. In a data analysis of 175 million scientific abstracts, manually checking the accuracy of frequency counts for every single academic abstract is unfeasible and hundred percent accuracy at capturing abstracts’ content might be elusive due to a small number of erroneous outlier cases in the raw data. Overall, however, we are confident that our frequency metrics are representative of word prevalence in academic content as illustrated by Figure 2 in the main manuscript, which shows the chronological prevalence in the SSORC corpus of several terms associated with different disciplines of scientific/academic knowledge.
Factor analysis of frequency counts time series was carried out only after Bartlett’s test of sphericity and Kaiser-Meyer-Olkin (KMO) test confirmed the suitability of the data for factor analysis. A single factor derived from the frequency counts time series of prejudice-denoting terms was extracted from each corpus (academic abstracts and news media content). The same procedure was applied for the terms denoting social justice discourse. A factor loading cutoff of 0.5 was used to ascribe terms to a factor. Chronbach alphas to determine if the resulting factors appeared coherent were extremely high (>0.95).
The textual content of news and opinion articles from the outlets listed in Figure 5 of the main manuscript is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from original sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.
Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet.
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
The list of compressed files in this data set is listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript and raw data metrics
-scholarlyArticlesContainingTargetWords.rar contains the IDs of each analyzed abstract in the SSORC corpus and the counts of target words and total words for each scholarly article
-targetWordsInMediaArticlesCounts.rar contains counts of target words in news outlets articles as well as total counts of words in articles
In a small percentage of news articles, outlet specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles might not be precise.
In a data analysis of millions of news articles, we cannot manually check the correctness of frequency counts for every single article and hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Rozado, Al-Gharbi, and Halberstadt, “Prevalence of Prejudice-Denoting Words in News Media Discourse" for supporting evidence).
31/08/2022 Update: There is a new way to download the Semantic Scholar Open Research Corpus (see https://github.com/allenai/s2orc). This updated version states that the corpus contains 136M+ paper nodes. However, when I downloaded a previous version of the corpus in 2021 from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I counted 175M unique identifiers. The URL of the previous version of the corpus is no longer active, but it has been cached by the Internet Archive at https://web.archive.org/web/20201030131959/http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/ I haven't had the time to look at the specific reason for the mismatch but perhaps the newer version of the corpus has cleaned a lot of noisy entries in the previous version which often contained entries with missing abstracts. Filtering out entries in low prevalence languages other than English might be another reason. In any case, Figure 2 of the main manuscript of this work (at https://www.nas.org/academic-questions/35/2/themes-in-academic-literature-prejudice-and-social-justice) should provide support for the validity of the frequency counts.
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo sharing app platform with 628 million followers. Instagram's own account was ranked first with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young – recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Covid-19 pandemic has been one of the most disruptive and painful phenomena of the last few decades. As of July 2021, the origins of the SARS-CoV-2 virus that caused the outbreak remain a mystery. This work analyzes the prevalence in news media articles of two popular hypotheses about SARS-CoV-2 virus origins: the natural emergence and the lab-leak hypotheses.
This data set contains frequency counts of target words in news and opinion articles from 12 popular news media outlets. The target words are listed in the associated manuscript and are mostly words associated with the Covid-19 pandemic.
The list of compressed files in this data set is listed next:
targetWordsInArticlesCounts.rar contains counts of target words in outlets articles as well as total counts of words in articles
targetWordsFrequencies.rar daily, weekly, monthly word frequencies
wordEmbeddingModels.rar monthly embedding models of news outlets content
analysisScripts.rar analysis notebooks
The textual content of news and opinion articles from the outlets is available in the outlet's online domains and/or public cache repositories such as Google cache, The Internet Wayback Machine, and Common Crawl. We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.
Targeted textual content was located in HTML raw data using outlet specific XPath expressions. Tokens were lowercased prior to estimating frequency counts.
Yearly frequency usage of a target word in an outlet in any given temporal interval ( daily, weekly, monthly) was estimated by dividing the total number of occurrences of the target word in all articles of a given temporal interval by the number of all words in all articles of that temporal interval. This method of estimating frequency accounts for variable volume of total article output over time.
In a small percentage of articles, outlet specific XPath expressions might fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles. Most of the incorrect frequency counts are minor deviations from the actual counts such as for instance counting a word in an article footnote encouraging article readers to find related articles and that the XPath expression might mistakenly include as the content of the article main text. Some additional outlet-specific inaccuracies that we could identify occurred in the WSJ where in less than 5% of the articles XPath expressions failed to capture the article's main text content. Other outlets articles samples sizes might not be comprehensive but, to the best of our knowledge, they are representative and include tens of thousands of articles per outlet/year. To conclude, in a data analysis of over 1.5 million articles, we cannot manually check the correctness of frequency counts for every single article and hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of main manuscript for supporting evidence).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have developed an application and solution approach (using this dataset) for automatically generating and suggesting short email responses to support queries in a university environment. Our proposed solution can be used as one tap or one click solution for responding to various types of queries raised by faculty members and students in a university. Office of Academic Affairs (OAA), Office of Student Life (OSL) and Information Technology Helpdesk (ITD) are support functions within a university which receives hundreds of email messages on the daily basis. Email communication is still the most frequently used mode of communication by these departments. A large percentage of emails received by these departments are frequent and commonly used queries or request for information. Responding to every query by manually typing is a tedious and time consuming task. Furthermore a large percentage of emails and their responses are consists of short messages. For example, an IT support department in our university receives several emails on Wi-Fi not working or someone needing help with a projector or requires an HDMI cable or remote slide changer. Another example is emails from students requesting the office of academic affairs to add and drop courses which they cannot do it directly. The dataset consists of emails messages which are generally received by ITD, OAA and OSL in Ashoka University. The dataset also contains intermediate results while conducting machine learning experiments.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The survey was conducted by the Bureau of Statistics (BOS) and the Ministry of Health (MOH) of Guyana. ICF Macro of Calverton, Maryland, provided technical assistance to the project through its contract with the U.S. Agency for International Development (USAID). Funding to cover technical assistance by ICF Macro and local costs was provided in its entirety by the USAID Mission in Georgetown, Guyana. The primary objective of the 2009 GDHS was to collect information on characteristics of the households and their members, including exposure to malaria and tuberculosis; infant and child mortality; fertility and family planning; pregnancy and postnatal care; childhood immunization, health, and nutrition; marriage and sexual activity; and HIV/AIDS indicators. Other objectives of the 2009 GDHS included (1) supporting the dissemination and utilization of the results in planning, managing, and improving family planning and health services in the country and (2) enhancing the survey capabilities of the institutions involved to facilitate surveys of this type in the future. The 2009 GDHS sampled 5,632 households and completed interviews with 4,996 women age 15-49 and 3,522 men age 15-49. Three questionnaires were used for the 2009 GDHS: the Household Questionnaire, the Women's Questionnaire, and the Men's Questionnaire. The content of these questionnaires was based on the model questionnaires developed by the MEASURE DHS program of ICF Macro. The primary objective of the 2009 GDHS was to collect information on the following topics: Characteristics of households and household members Fertility and reproductive preferences, infant and child mortality, and family planning Health-related matters, such as breastfeeding, antenatal care, children's immunizations, and childhood diseases Marriage, sexual activity, and awareness and behavior regarding HIV and other sexually transmitted infections (STIs) The nutritional status of mothers and children, including anthropometry measurements and anemia testing Other complementary objectives of the 2009 GDHS were: To support dissemination and utilization of the results in planning, managing, and improving family planning and health services in the country To enhance the survey capabilities of the institutions involved to facilitate their use of surveys of this type in the future MAIN RESULTS FERTILITY Fertility Levels and Differentials If fertility were to remain constant in Guyana, women would bear, on average, 2.8 children by the end of their reproductive lifespan. The total fertility rate (TFR) is close to replacement level in urban areas (2.1 children per woman), and higher in the rural areas (3.0 children per woman). The TFR in the Interior area (6.0 children) is more than twice as high as the TFR in the Coastal area (2.4 children per woman) and is three times the fertility in the Georgetown (urban) area (2.0 children). The TFRs for women in the Interior area are significantly higher for all age groups. Fertility Preferences Fifty-six percent of currently married women reported that they don't want to have a/another child, and five percent are already sterilized. The figures for men are 51 and 1 percent, respectively. The desire to stop childbearing increases rapidly as the number of children increases. Among respondents with one child, around one in five wants no more children. Among those with three children, about eight in ten women and seven in ten men want no more children. FAMILY PLANNING Use of Contraception Forty-three percent of women who are currently married or in union are currently using a contraceptive method, mainly a modern method (40 percent). The methods most commonly used by currently married women are the male condom (13 percent), the pill (9 percent), and the IUD (7 percent). Female sterilization and injectables are each used by 5 percent of women. The 2009 GDHS prevalence rate of 43 percent represents an increase of 8 percentage points since the 2005 GAIS (35 percent). Most of the increase was in condom use, injectables, and female sterilization. Unmet Need for Family Planning Twenty-nine percent of currently married women have an unmet need for family planning, mostly for limiting births (19 percent) compared with spacing (10 percent). Because 43 percent of married women are currently using a contraceptive method (met need), the total demand for family planning is estimated at 71 percent of married women (22 percent for spacing, 49 percent for limiting). As a result, only 60 percent of the total demand for family planning is met. MATERNAL HEALTH Antenatal Care Among women who had a birth in the five years preceding the survey, 92 percent received antenatal care (ANC) from a skilled health provider for their most recent birth (51 percent from a nurse/midwife and 35 percent from a doctor). Older mothers (35-49 years) are less likely to receive antenatal care by a skilled health provider than younger mothers. Eighty-six percent of women with no education received ANC from a skilled health provider compared with 95 percent of women with more than secondary education. Delivery Care Overall, 92 percent of births in the five years preceding the survey were assisted by a skilled birth provider, mainly by a nurse or midwife (56 percent), followed by a doctor (31 percent). Births to mothers under age 35 and lower order births are more likely to have assistance at delivery by a skilled provider than births to older mothers and higher order births. By residence, births in Urban areas are more likely than those in Rural areas, and births in the Coastal area are more likely than births in the Interior area, to be assisted by a skilled health provider. The percentage of births assisted by a skilled provider ranges from a low of 57 percent in Region 9 to a high of 98 percent in Region 4. Births to mothers who have more education and births in the higher wealth quintiles are more likely to be assisted by a skilled provider than other births. Almost all births to mothers with more than secondary education (98 percent) are assisted by a skilled provider compared with 71 percent of births to mothers with no education. Caesarean section One in eight births (13 percent) in the five years preceding the survey was delivered by caesarean section. The prevalence of C-section delivery increases steadily with mother's age and decreases with birth order. Regions 1, 6, 7, and 9 have the lowest levels of deliveries by C-section (2-5 percent) and Region 3 has the highest level (23 percent). The percentage of births delivered by C-section increases with a mother's education and generally increases with her wealth. CHILD HEALTH Infant and Child Mortality Childhood mortality rates in Guyana are relatively low. For every 1,000 live births, 38 children die during the first year of life (infant mortality), and 40 children die during the first five years (under-age 5 mortality). Almost two-thirds of deaths in the first five years (25 deaths per 1,000 live births) take place during the neonatal period (the first month of life). The mortality rate after the first year of life up to age 5 (child mortality) is also very low at 3 deaths per 1,000 live births. The 2009 GDHS mortality data do not show any clear trends over time. However, mortality data have to be interpreted with caution because sampling errors associated with mortality estimates are large. Vaccination Coverage Overall, 63 percent of Guyanese children age 18-29 months are fully immunized, and only 5 percent of the children received no vaccinations at all. Looking at coverage for specific vaccines, 94 percent of children received the BCG vaccination, 92 percent received the first dose of pentavalent vaccine, and 78 percent received the first polio dose (Polio 1). Coverage for the pentavalent and polio vaccinations declines with subsequent doses; 85 percent of children received the recommended three doses of pentavalent vaccine, and 70 percent received three doses of polio. These figures reflect dropout rates of 8 percent for the pentavalent vaccine and 11 percent for polio; the dropout rate represents the proportion of children who received the first dose of a vaccine but who did not get the third dose. Eighty-two percent of children are vaccinated against measles, and 79 percent of children have been vaccinated against yellow fever. Illnesses and Treatment Acute Respiratory Infections (ARI) Five percent of children under age 5 had symptoms of acute respiratory infection (ARI) in the two weeks preceding the survey. Among children with symptoms of ARI, advice or treatment was sought from a health facility or provider for 65 percent, and antibiotics were prescribed as treatment for 18 percent (data not shown). Fever Fever was found to be moderately frequent in children under age 5 in Guyana (20 percent), ranging from 17 percent in children under 6 months to about 26 percent in children 12-17 months.. Most of the children under age 5 with fever (59 percent) were taken to a health facility or a health provider for their most recent episode of fever. Overall, about one in five children with fever (21 percent) received antibiotics, and 6 percent received antimalarial drugs. Diarrhea Overall, about 10 percent of children were reported to have diarrhea in the two weeks immediately before the survey, with just 1 percent reporting bloody diarrhea. Overall, about six in ten children under age 5 with diarrhea (59 percent) were taken to a health facility or health provider for advice or treatment. Male children (55 percent) are less likely than female children (63 percent) to be taken for treatment or advice to a health facility or provider. Additionally, children living in the Coastal area are much less likely to be taken for treatment or advice (50 percent) than children in the Interior area (79 percent). NUTRITION OF CHILDREN Height and Weight Almost one in five children (18 percent) under age 5 is short for age or stunted, and one in twenty (5
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set belongs to an academic manuscript examining longitudinally (2000-2019) the prevalence of terms denoting far-right and far-left political extremism in a large corpus of more than 32 million written news and opinion articles from 54 news media outlets popular in the United States and the United Kingdom.
The textual content of news and opinion articles from the 54 outlets listed in the main manuscript is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.
Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet. This threshold was chosen to maximize inclusion in our analysis of outlets with sparse amounts of articles text per year.
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
The list of compressed files in this data set is listed next:
-analysisScripts.rar contains the analysis scripts used in the main manuscript
-articlesContainingTargetWords.rar contains counts of target words in outlets articles as well as total counts of words in articles
Usage Notes
In a small percentage of articles, outlet specific XPath expressions failed to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles.
Most of the incorrect frequency counts were minor deviations from the actual counts such as for instance counting the word "Facebook" in an article footnote encouraging article readers to follow the journalist’s Facebook profile and that the XPath expression mistakenly included as the content of the article main text. Some additional outlet-specific inaccuracies that we could identify occurred in "The Hill" and "Newsmax" news outlets where XPath expressions had some shortfalls at precisely capturing articles’ content. For "The Hill", in years 2007-2009, XPath expressions failed to capture the complete text of the article in about 40% of the articles. This does not necessarily result in incorrect frequency counts for that outlet but in a sample of articles’ words that is about 40% smaller than the total population of articles words for those three years. In the case of "NewsMax", the issue was that for some articles, XPath expressions captured the entire text of the article twice. Notice that this does not result in incorrect frequency counts. If a word appears x times in an article with a total of y words, the same frequency count will still be derived when our scripts count the word 2x times in the version of the article with a total of 2y words.
To conclude, in a data analysis of 32 million articles, we cannot manually check the correctness of frequency counts for every single article and hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 in the main manuscript for illustration of the accuracy of the frequency counts).
As of April 2024, almost 32 percent of global Instagram audiences were aged between 18 and 24 years, and 30.6 percent of users were aged between 25 and 34 years. Overall, 16 percent of users belonged to the 35 to 44 year age group.
Instagram users
With roughly one billion monthly active users, Instagram belongs to the most popular social networks worldwide. The social photo sharing app is especially popular in India and in the United States, which have respectively 362.9 million and 169.7 million Instagram users each.
Instagram features
One of the most popular features of Instagram is Stories. Users can post photos and videos to their Stories stream and the content is live for others to view for 24 hours before it disappears. In January 2019, the company reported that there were 500 million daily active Instagram Stories users. Instagram Stories directly competes with Snapchat, another photo sharing app that initially became famous due to it’s “vanishing photos” feature.
As of the second quarter of 2021, Snapchat had 293 million daily active users.
As of February 2025, 5.56 billion individuals worldwide were internet users, which amounted to 67.9 percent of the global population. Of this total, 5.24 billion, or 63.9 percent of the world's population, were social media users. Global internet usage Connecting billions of people worldwide, the internet is a core pillar of the modern information society. Northern Europe ranked first among worldwide regions by the share of the population using the internet in 20254. In The Netherlands, Norway and Saudi Arabia, 99 percent of the population used the internet as of February 2025. North Korea was at the opposite end of the spectrum, with virtually no internet usage penetration among the general population, ranking last worldwide. Eastern Asia was home to the largest number of online users worldwide – over 1.34 billion at the latest count. Southern Asia ranked second, with around 1.2 billion internet users. China, India, and the United States rank ahead of other countries worldwide by the number of internet users. Worldwide internet user demographics As of 2024, the share of female internet users worldwide was 65 percent, five percent less than that of men. Gender disparity in internet usage was bigger in African countries, with around a ten percent difference. Worldwide regions, like the Commonwealth of Independent States and Europe, showed a smaller usage gap between these two genders. As of 2024, global internet usage was higher among individuals between 15 and 24 years old across all regions, with young people in Europe representing the most significant usage penetration, 98 percent. In comparison, the worldwide average for the age group 15–24 years was 79 percent. The income level of the countries was also an essential factor for internet access, as 93 percent of the population of the countries with high income reportedly used the internet, as opposed to only 27 percent of the low-income markets.
As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.
Instagram’s Global Audience
As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.
As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.
Who is winning over the generations?
Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.
As of April 2024, Bahrain was the country with the highest Instagram audience reach with 95.6 percent. Kazakhstan also had a high Instagram audience penetration rate, with 90.8 percent of the population using the social network. In the United Arab Emirates, Turkey, and Brunei, the photo-sharing platform was used by more than 85 percent of each country's population.
As of April 2024, Facebook had an addressable ad audience reach 131.1 percent in Libya, followed by the United Arab Emirates with 120.5 percent and Mongolia with 116 percent. Additionally, the Philippines and Qatar had addressable ad audiences of 114.5 percent and 111.7 percent.
As of January 2024, #love was the most used hashtag on Instagram, being included in over two billion posts on the social media platform. #Instagood and #instagram were used over one billion times as of early 2024.
As of April 2024, it was found that men between the ages of 25 and 34 years made up Facebook largest audience, accounting for 18.4 percent of global users. Additionally, Facebook's second largest audience base could be found with men aged 18 to 24 years.
Facebook connects the world
Founded in 2004 and going public in 2012, Facebook is one of the biggest internet companies in the world with influence that goes beyond social media. It is widely considered as one of the Big Four tech companies, along with Google, Apple, and Amazon (all together known under the acronym GAFA). Facebook is the most popular social network worldwide and the company also owns three other billion-user properties: mobile messaging apps WhatsApp and Facebook Messenger,
as well as photo-sharing app Instagram. Facebook usersThe vast majority of Facebook users connect to the social network via mobile devices. This is unsurprising, as Facebook has many users in mobile-first online markets. Currently, India ranks first in terms of Facebook audience size with 378 million users. The United States, Brazil, and Indonesia also all have more than 100 million Facebook users each.
Instagram’s most popular post
As of April 2024, the most popular post on Instagram was Lionel Messi and his teammates after winning the 2022 FIFA World Cup with Argentina, posted by the account @leomessi. Messi's post, which racked up over 61 million likes within a day, knocked off the reigning post, which was 'Photo of an Egg'. Originally posted in January 2021, 'Photo of an Egg' surpassed the world’s most popular Instagram post at that time, which was a photo by Kylie Jenner’s daughter totaling 18 million likes.
After several cryptic posts published by the account, World Record Egg revealed itself to be a part of a mental health campaign aimed at the pressures of social media use.
Instagram’s most popular accounts
As of April 2024, the official Instagram account @instagram had the most followers of any account on the platform, with 672 million followers. Portuguese footballer Cristiano Ronaldo (@cristiano) was the most followed individual with 628 million followers, while Selena Gomez (@selenagomez) was the most followed woman on the platform with 429 million. Additionally, Inter Miami CF striker Lionel Messi (@leomessi) had a total of 502 million. Celebrities such as The Rock, Kylie Jenner, and Ariana Grande all had over 380 million followers each.
Instagram influencers
In the United States, the leading content category of Instagram influencers was lifestyle, with 15.25 percent of influencers creating lifestyle content in 2021. Music ranked in second place with 10.96 percent, followed by family with 8.24 percent. Having a large audience can be very lucrative: Instagram influencers in the United States, Canada and the United Kingdom with over 90,000 followers made around 1,221 US dollars per post.
Instagram around the globe
Instagram’s worldwide popularity continues to grow, and India is the leading country in terms of number of users, with over 362.9 million users as of January 2024. The United States had 169.65 million Instagram users and Brazil had 134.6 million users. The social media platform was also very popular in Indonesia and Turkey, with 100.9 and 57.1, respectively. As of January 2024, Instagram was the fourth most popular social network in the world, behind Facebook, YouTube and WhatsApp.
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.