Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109.
Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations, centered around information seeking and sharing, related to online learning. A dataset mined from such conversations (Tweets) can serve as a data resource for interdisciplinary research analysing interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description
The dataset comprises a total of 52,984 Tweet IDs (each corresponding to one Tweet) about online learning that were posted on Twitter from 9th November 2021 to 13th July 2022. The earliest date was selected as 9th November 2021 because the Omicron variant was first detected in a sample collected on that date. 13th July 2022 was the most recent date at the time of data collection and publication of this dataset.
The dataset consists of 9 .txt files. An overview of these files, along with the number of Tweet IDs and the date range of the associated Tweets in each, follows. Table 1 shows the list of all the synonyms and terms that were used for the dataset development.
Filename: TweetIDs_November_2021.txt (No. of Tweet IDs: 1283, Date Range of the associated Tweet IDs: November 1, 2021 to November 30, 2021)
Filename: TweetIDs_December_2021.txt (No. of Tweet IDs: 10545, Date Range of the associated Tweet IDs: December 1, 2021 to December 31, 2021)
Filename: TweetIDs_January_2022.txt (No. of Tweet IDs: 23078, Date Range of the associated Tweet IDs: January 1, 2022 to January 31, 2022)
Filename: TweetIDs_February_2022.txt (No. of Tweet IDs: 4751, Date Range of the associated Tweet IDs: February 1, 2022 to February 28, 2022)
Filename: TweetIDs_March_2022.txt (No. of Tweet IDs: 3434, Date Range of the associated Tweet IDs: March 1, 2022 to March 31, 2022)
Filename: TweetIDs_April_2022.txt (No. of Tweet IDs: 3355, Date Range of the associated Tweet IDs: April 1, 2022 to April 30, 2022)
Filename: TweetIDs_May_2022.txt (No. of Tweet IDs: 3120, Date Range of the associated Tweet IDs: May 1, 2022 to May 31, 2022)
Filename: TweetIDs_June_2022.txt (No. of Tweet IDs: 2361, Date Range of the associated Tweet IDs: June 1, 2022 to June 30, 2022)
Filename: TweetIDs_July_2022.txt (No. of Tweet IDs: 1057, Date Range of the associated Tweet IDs: July 1, 2022 to July 13, 2022)
The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated before use. The Hydrator application (link to download and a step-by-step tutorial on how to use Hydrator) may be used for this purpose.
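For a programmatic alternative to the Hydrator GUI, a minimal sketch of the first step is shown below: reading Tweet IDs from one of the monthly .txt files and grouping them into batches. The batch size of 100 assumes the limit of the Twitter API v2 tweets-lookup endpoint; the `load_tweet_ids`/`batches` helper names are our own, not part of the dataset.

```python
# Sketch: read Tweet IDs from a dataset file and group them into
# request-sized batches (the v2 tweets lookup accepts up to 100 IDs).
from pathlib import Path


def load_tweet_ids(path):
    """Return one Tweet ID per non-empty line of a dataset .txt file."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def batches(ids, size=100):
    """Split a list of IDs into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Demo with placeholder IDs; with the real data you would call
# load_tweet_ids("TweetIDs_November_2021.txt") instead.
demo_ids = [str(n) for n in range(250)]
print(len(batches(demo_ids)))  # 3 batches: 100 + 100 + 50
```

Each batch can then be passed to the hydration tool or API of your choice.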
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology: COVID-19
Synonyms and terms: Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus
Terminology: online learning
Synonyms and terms: online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures
The geosciences are a data-rich domain where Earth materials and processes are analysed from local to global scales. However, often we only have discrete measurements at specific locations, and a limited understanding of how these features vary across the landscape. Earth system processes are inherently complex, and trans-disciplinary science will likely become increasingly important in finding solutions to future challenges associated with the environment, mineral/petroleum resources and food security. Machine learning is an important approach to synthesise the increasing complexity and sheer volume of Earth science data, and is now widely used in prediction across many scientific disciplines. In this context, we have built a machine learning pipeline, called Uncover-ML, for both supervised and unsupervised learning, prediction and classification. The Uncover-ML pipeline was developed from a partnership between CSIRO and Geoscience Australia, and is largely built around the Python scikit-learn machine learning libraries. In this paper, we briefly describe the architecture and components of Uncover-ML for feature extraction, data scaling, sample selection, predictive mapping, estimating model performance, model optimisation and estimating model uncertainties. Links to download the source code and information on how to implement the algorithms are also provided. Citation: Wilford, J., Basak, S., Hassan, R., Moushall, B., McCalman, L., Steinberg, D. and Zhang, F., 2020. Uncover-ML: a machine learning pipeline for geoscience data analysis. In: Czarnota, K., Roach, I., Abbott, S., Haynes, M., Kositcin, N., Ray, A. and Slatter, E. (eds.) Exploring for the Future: Extended Abstracts, Geoscience Australia, Canberra, 1–4.
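Because Uncover-ML is built around scikit-learn, the pieces named above (data scaling, a supervised learner, model-performance estimation) can be sketched in a few lines. This is an illustrative stand-in, not Uncover-ML itself, and the data are synthetic:

```python
# Illustrative sketch of a scikit-learn workflow of the kind Uncover-ML
# wraps: feature scaling, a supervised regressor, and performance
# estimated by cross-validation. Data are synthetic placeholders for
# gridded covariates and point observations.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=0)

model = make_pipeline(StandardScaler(),
                      RandomForestRegressor(n_estimators=50, random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # 5-fold R^2
print(round(scores.mean(), 2))
```

The same pipeline object could be refit on the full data and applied grid cell by grid cell to produce a predictive map.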
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains all the input and output data (including maps) related to Van Dijk et al. (2022), Occupations on the map: Using a super learner algorithm to downscale labor statistics. It does not contain several large (> 4GB) intermediate files, which summarize the results of the large number of machine learning models that were trained and tuned as part of the super learner algorithm. These files can be created by running the scripts in the supplementary GitHub repository: https://github.com/michielvandijk/occupations_on_the_map. All input and output maps produced as part of this study can also be accessed by means of an interactive web application: https://shiny.wur.nl/occupation-map-vnm.
In this paper, we demonstrated an approach to create fine-scale gridded occupation maps by means of downscaling district-level labor statistics informed by remote sensing and other spatial information. We applied a super-learner algorithm that combined the results of different machine learning models to predict the shares of six major occupation categories and the labor force participation rate at a resolution of 30 arc seconds (~1x1 km) in Vietnam. The results were subsequently combined with gridded information on the working-age population to produce maps of the number of workers per occupation. The proposed approach can also be applied to produce maps of other (labor) statistics, which are only available at aggregated levels.
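The super-learner idea described above (combining the predictions of several machine learning models through a meta-learner) can be sketched with scikit-learn's stacking ensemble. The base learners and meta-learner below are illustrative choices on synthetic data, not the paper's actual model set or tuning:

```python
# Hedged sketch of a super learner in the stacking sense: base regressors
# are combined by a meta-learner (here a ridge regression), as in
# scikit-learn's StackingRegressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.3, random_state=0)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("knn", KNeighborsRegressor())],
    final_estimator=Ridge(),
)
stack.fit(X, y)
print(round(stack.score(X, y), 2))  # training-set R^2
```

In the downscaling setting, the targets would be district-level occupation shares and the features gridded remote-sensing covariates.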
The following dataset has been used for the paper entitled "Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements".
Abstract
This research was triggered by the identified need in literature for large-scale studies about the kind of designs that teachers create for Mobile Learning (m-learning). These studies require analyses of large datasets of learning designs. The common approach followed by researchers when analysing designs has been to manually classify them following high-level pedagogically-guided coding strategies, which demands extensive work. Therefore, the first goal of this paper is to explore the use of Supervised Machine Learning (SML) to automatically classify the textual content of m-learning designs, through pedagogically-relevant classifications, such as the cognitive level demanded by students to carry out specific designed tasks, the phases of inquiry learning represented in the designs, or the role that the situated environment has in them. As not all the SML models are transparent, while often researchers need to understand the behaviour behind them, the second goal of this paper considers the trade-off between models’ performance and interpretability in the context of design analytics for m-learning. To achieve these goals we compiled a dataset of designs deployed through two tools, Avastusrada and Smartzoos. With it, we trained and compared different models and feature extraction techniques. We further optimized and compared the best-performing and most interpretable algorithms (EstBERT and Logistic Regression) to consider the second goal through an illustrative case. We found that SML can reliably classify designs, with accuracy > 0.86 and Cohen’s kappa > 0.69.
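As a minimal sketch of the interpretable end of the comparison above, the snippet below classifies short task descriptions with TF-IDF features and Logistic Regression. The texts and Bloom-style labels are invented placeholders, not the Avastusrada/Smartzoos data:

```python
# Illustrative sketch (not the paper's setup): TF-IDF + Logistic Regression
# for classifying the textual content of designed tasks by cognitive level.
# All texts and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["list the landmarks you pass",
         "explain why the river bends here",
         "take a photo of a leaf",
         "compare two rock formations"]
labels = ["remember", "understand", "remember", "analyze"]  # placeholder levels

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["describe why erosion happens"])[0])
```

Unlike a transformer model, the fitted coefficients of the logistic regression can be inspected per term, which is the interpretability advantage the paper weighs against performance.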
The MNIST Large Scale dataset is based on the classic MNIST dataset, but contains large scale variations up to a factor of 16. The motivation behind creating this dataset was to enable testing the ability of different algorithms to learn in the presence of large scale variability and specifically the ability to generalise to new scales not present in the training set over wide scale ranges.
The dataset contains training data for each one of the relative size factors 1, 2 and 4 relative to the original MNIST dataset and testing data for relative scaling factors between 1/2 and 8, with a ratio of $\sqrt[4]{2}$ between adjacent scales.
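The testing-scale ladder implied by that description (from 1/2 to 8 with a ratio of the fourth root of 2 between adjacent scales) can be enumerated directly:

```python
# The test-set scale ladder: start at 1/2, multiply by 2**(1/4) each step,
# ending at 8. That spans 16 quarter-octave steps, i.e. 17 scales.
ratio = 2 ** 0.25
scales = [0.5 * ratio ** k for k in range(17)]  # 0.5 * (2**(1/4))**16 == 8
print(round(scales[0], 3), round(scales[-1], 3))
```

This makes explicit that the test set probes four octaves of scale variation in quarter-octave steps, while training covers only the factors 1, 2, and 4.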
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial intelligence (AI) is undergoing a revolution thanks to the breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent works on publicly available pharmaceutical data showed that AI methods are highly promising for Drug Target prediction. However, the quality of public data might be different than that of industry data due to different labs reporting measurements, different measurement techniques, fewer samples and less diverse and specialized assays. As part of a European funded project (ExCAPE), that brought together expertise from pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning derived machine learning models outperformed comparable models, which were trained by other machine learning algorithms, when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study evaluating the potential of machine learning and especially deep learning directly at the level of industry-scale settings and moreover investigating the transferability of publicly learned target prediction models towards industrial bioactivity prediction pipelines.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Developing competence in a second language, distinct from the mother tongue, has become an essential skill in today's globalized world, and it is therefore highly valued and demanded by the main educational institutions and models. However, second language acquisition is a complex process influenced by numerous and, in many cases, unknown factors that are not usually taken into consideration when designing second language learning processes; this tends to lead to inadequate teaching and to school failure that could have been avoided.
In this dataset, 216 in-service teachers from all non-university educational stages of the Community of Madrid, Spain, evaluated on a five-point Likert scale the significance of 44 factors traditionally associated with second language learning. The factors are grouped into four general categories: factors linked to students, factors linked to teachers, learning structure and organisation, and learning environment.
The data were collected using a Google Forms questionnaire through the research described in Arigita-García et al. (2021). The sample is heterogeneous concerning different attribute variables such as age, teaching experience, gender, school ownership, and the language in which classes are taught. The sample was obtained through social networks and teacher forums.
The data collection offers essential information for better understanding the process of second language learning, as it gathers the experience accumulated by the participating teachers, providing direct information from the educational reality they aim to improve.
Methods: Computerized questionnaire administered through the Google Forms private server.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accuracy (%) of different algorithms for overall nutritional status.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aim: Malnutrition in pregnant women significantly affects both mother and child health. This research aims to identify the best machine learning (ML) techniques for predicting the nutritional status of pregnant women in Bangladesh and to detect the most essential features based on the best-performing algorithm.
Methods: This study used retrospective cross-sectional data from the Bangladesh Demographic and Health Survey 2017–18. Different feature transformations and machine learning classifiers were applied to find the best transformation and classification model.
Results: This investigation found that robust scaling outperformed all feature transformation methods. The results show that the Random Forest algorithm with robust scaling outperforms all other machine learning algorithms with 74.75% accuracy, 57.91% kappa statistics, 73.36% precision, 73.08% recall, and 73.09% F1 score. In addition, the Random Forest algorithm had the highest precision (76.76%) and F1 score (71.71%) for predicting the underweight class, as well as an expected precision of 82.01% and F1 score of 83.78% for the overweight/obese class when compared to other algorithms with a robust scaling method. The respondent’s age, wealth index, region, husband’s education level, husband’s age, and occupation were crucial features for predicting the nutritional status of pregnant women in Bangladesh.
Conclusion: The proposed classifier could help predict the expected outcome and reduce the burden of malnutrition among pregnant women in Bangladesh.
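The reported winning combination, robust scaling followed by a Random Forest classifier, can be sketched in scikit-learn. The data below are synthetic stand-ins; the DHS survey data itself is not redistributable here:

```python
# Minimal sketch of the reported approach: RobustScaler (less sensitive to
# outliers than standard scaling) feeding a Random Forest classifier, with
# accuracy estimated by cross-validation on synthetic three-class data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = make_classification(n_samples=300, n_features=6, n_classes=3,
                           n_informative=4, random_state=0)

model = make_pipeline(RobustScaler(),
                      RandomForestClassifier(n_estimators=100, random_state=0))
acc = cross_val_score(model, X, y, cv=5).mean()
print(round(acc, 2))
```

With the real survey features (age, wealth index, region, and so on), feature importances from the fitted forest would give the ranking of crucial features described above.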
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Contemporary theories guiding the search for neural mechanisms of learning and memory assume that associative learning results from the temporal pairing of cues and reinforcers resulting in coincident activation of associated neurons, strengthening their synaptic connection. While enduring, this framework has limitations: temporal-pairing-based models of learning do not fit with many experimental observations and cannot be used to make quantitative predictions about behavior. Here we present behavioral data that support an alternative, information-theoretic conception: the amount of information that cues provide about the timing of reward delivery predicts behavior. Furthermore, this approach accounts for the rate and depth of both inhibitory and excitatory learning across paradigms and species. We also show that dopamine release in the ventral striatum reflects cue-predicted changes in reinforcement rates, consistent with subjects understanding temporal relationships between task events. Our results reshape the conceptual and biological framework for understanding associative learning.
Methods: Complete methods are provided in the article "Learning Depends on the Information Conveyed by Temporal Relationships Between Events and is Reflected in the Dopamine Response to Cues". Briefly: male Sprague-Dawley rats were housed in groups of two in a colony room on a 12:12 hour light:dark cycle. Water was available ad libitum in the home cages. The rats were fed in their home cages for one hour after experimental sessions, which occurred 5 days per week. On weekends, they had ad libitum access to food until approximately 22 hours before the first weekday session. They were approximately two months old at the start of training and had been handled for one week before that. They were trained in eight identical experimental chambers (30.5 cm x 24.1 cm x 21.0 cm) located in ventilated and soundproof boxes.
Each chamber was equipped with a speaker, a house light, and a pellet dispenser (Model ENV-203, Med Associates), which delivered 20mg pellets into a head-entry-detecting trough (Models ENV-200R7 and ENV-254-CB, Med Associates). They initially received 2 sessions of magazine training, during which 40 pellets were delivered at random times during a 20-minute session (random time 30s schedule), followed by daily sessions with one of the experimental protocols. The time of occurrence of each head entry and the time of onset and termination of all stimulus events were recorded with 0.1s resolution.
The most successful and popular machine learning models of atomic-scale properties derive their transferability from a locality ansatz. The properties of a large molecule or a bulk material are written as a sum over contributions that depend on the configurations within finite atom-centered environments. The obvious downside of this approach is that it cannot capture nonlocal, nonadditive effects such as those arising due to long-range electrostatics or quantum interference. We propose a solution to this problem by introducing nonlocal representations of the system, which are remapped as feature vectors that are defined locally and are equivariant in O(3). We consider, in particular, one form that has the same asymptotic behavior as the electrostatic potential. We demonstrate that this framework can capture nonlocal, long-range physics by building a model for the electrostatic energy of randomly distributed point-charges, for the unrelaxed binding curves of charged organic molecular dimers, and for the electronic dielectric response of liquid water. By combining a representation of the system that is sensitive to long-range correlations with the transferability of an atom-centered additive model, this method outperforms current state-of-the-art machine-learning schemes and provides a conceptual framework to incorporate nonlocal physics into atomistic machine learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Data and Results used in the publication entitled "Impact of Interval Censoring on Data Accuracy and Machine Learning Performance in Biological High-Throughput Screening"
Data
This folder contains the raw data used during this work.
EvoEF.csv: contains information on the library used (sequences, number of mutations, etc.) and the fitness (energy) used as continuous mean values.
mut.csv: contains information about the combinatorial scaling (N vs N_norm), the number of mutations (m), and the probability of each variant under different distributions (uniform and binomial) at different $p_{WT}$.
For further details on how the fitness values were calculated and how the combinatorial scale works, please refer to our previous paper.
Results
This folder contains the results (outputs) of all scripts used. The results are included as .npy and .npz files. To load these files with NumPy, include the option allow_pickle=True.
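For example (the filename below is a placeholder, not one of the repository's files; the demo writes and re-reads a small object array to show the round trip):

```python
# Loading .npy files that contain Python objects requires allow_pickle=True,
# since NumPy disables pickle-based loading by default for safety.
import numpy as np

arr = np.array([{"score": 0.9}, {"score": 0.7}], dtype=object)
np.save("results_demo.npy", arr)  # placeholder file, not part of the dataset

loaded = np.load("results_demo.npy", allow_pickle=True)
print(loaded[0]["score"])  # 0.9
```

The same flag applies to `np.load` on the .npz archives. Only use it on files from a trusted source, as unpickling can execute arbitrary code.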
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Precision (%) of different algorithms for the underweight and overweight/obese.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books and is filtered where the book is Stochastic optimization for large-scale machine learning, featuring 7 columns including author, BNB id, book, book publisher, and ISBN. The preview is ordered by publication date (descending).
We present a Bayesian nonlinear mixed-effects location scale model (NL-MELSM). The NL-MELSM allows for fitting nonlinear functions to the location, or individual means, and the scale, or within-person variance. Specifically, in the context of learning, this model allows the within-person variance to follow a nonlinear trajectory, where it can be determined whether variability reduces over the course of learning. It incorporates a sub-model that can predict nonlinear parameters for the location and/or scale. This specification estimates random effects for all nonlinear location and scale parameters that are drawn from a common multivariate distribution. This allows estimation of covariances among the random effects, within and across the location and the scale. These covariances offer new insights into the interplay between individual mean structures and intra-individual variability in nonlinear parameters. We take a fully Bayesian approach, not only for ease of estimation but also because it provides the necessary and consistent information for use in psychological applications, such as model selection and hypothesis testing. To illustrate the model, we use data from 333 individuals, consisting of three age groups, who participated in five learning trials that assessed verbal memory. In an exploratory context, we demonstrate that fitting a nonlinear function to the within-person variance, and allowing for individual variation therein, improves predictive accuracy compared to customary modeling techniques (e.g., assuming constant variance). We conclude by discussing the usefulness, limitations, and future directions of the NL-MELSM.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BuildingsBench datasets consist of: Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock. 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF. Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB). BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. 
The evaluation datasets altogether total less than 1 GB, and they are listed below:
ElectricityLoadDiagrams20112014
Building Data Genome Project-2
Individual household electric power consumption (Sceaux)
The Q-Herilearn scale is a probabilistic scale of summative estimates that measures different aspects of the learning process in Heritage Education. It consists of seven factors (Knowing, Understanding, Respecting, Valuing, Caring, Enjoying and Transmitting). Each dimension is measured by means of seven indicators scored on a 4-point frequency response scale (1 = Never or almost never; 2 = Sometimes; 3 = Quite often; 4 = Always or almost always). Sufficient evidence of content validity has been obtained through a concordance analysis —which employed multi-facet logistic models (Many Facet Rasch Model MFRM)— of the scores of 40 judges, who estimated the relevance, adequacy, and clarity of each item. The metric properties of the scores were determined using ESEM —Exploratory Structural Equation Modeling—, EGA Exploratory Graph Analysis and Network Analysis. The scale was calibrated using Item Response Theory models: the Nominal Response Model and the Graded Response Model.
Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
To introduce this collection of studies, a logical first question to ask is why produce a “lessons learned” publication? The initial impetus for this initiative was an observation by an external evaluation of CARPE that there was relatively little sharing of information within the programme between numerous actors and sites concerning the conservation strategies undertaken and the results achieved (Weidemann Consortium, 2006). The evaluation concluded that this lack of information exchange was a threat to the success of CARPE as a large-scale regional programme, a view that was confirmed by programme partners during the CARPE Phase IIB Inception workshop that was held in Yaoundé in February 2007.
Call Number: 333.7 YAN [EL]
ISBN/ISSN: 978-2-8317-1288-8
Physical Description: xvi, 262 p.; 29 cm
The training samples of the entire year (from yr-2 of simulation) are compressed in SPCAM_ML_Han_et_al_0.tar.gz, and testing samples of the entire year (from yr-3 of simulation) are compressed in SPCAM_ML_Han_et_al_1.tar.gz. In each dataset, there are a data documentation file and 365 netCDF data files (one file for each day) that are marked by its date. The variable fields contain temperature and moisture tendencies and cloud water and cloud ice from the CRM, and vertical profiles of temperature and moisture and large-scale temperature and moisture tendencies from the dynamic core of SPCAM’s host model CAM5 and PBL diffusion. In addition, we include surface sensible and latent heat fluxes. For more details, please read the data documentation inside the tar.gz files.
https://www.datainsightsmarket.com/privacy-policy
The Large-Scale Model Training Machine market is experiencing explosive growth, fueled by the increasing demand for advanced artificial intelligence (AI) applications across diverse sectors. The market, estimated at $15 billion in 2025, is projected to witness a robust Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $75 billion by 2033. This surge is driven by several factors, including the proliferation of big data, advancements in deep learning algorithms, and the growing need for efficient model training in applications such as natural language processing (NLP), computer vision, and recommendation systems. Key market segments include the Internet, telecommunications, and government sectors, which are heavily investing in AI infrastructure to enhance their services and operational efficiency. The CPU+GPU segment dominates the market due to its superior performance in handling complex computations required for large-scale model training. Leading companies like Google, Amazon, Microsoft, and NVIDIA are at the forefront of innovation, constantly developing more powerful hardware and software solutions to address the evolving needs of this rapidly expanding market. The market's growth trajectory is shaped by several trends. The increasing adoption of cloud-based solutions for model training is significantly lowering the barrier to entry for smaller companies. Simultaneously, the development of specialized hardware like Tensor Processing Units (TPUs) and Field-Programmable Gate Arrays (FPGAs) is further optimizing performance and reducing costs. Despite this positive outlook, challenges remain. High infrastructure costs, the complexity of managing large datasets, and the shortage of skilled AI professionals are significant restraints on the market's expansion. 
However, ongoing technological advancements and increased investment in AI research are expected to mitigate these challenges, paving the way for sustained growth in the Large-Scale Model Training Machine market. Regional analysis indicates North America and Asia Pacific (particularly China) as the leading markets, with strong growth anticipated in other regions as AI adoption accelerates globally.