Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Journal of Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109
Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations, centered around information seeking and sharing, related to online learning. A dataset developed by mining such conversations, for example Tweets, can serve as a data resource for interdisciplinary research related to the analysis of interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter and with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.
Data Description
The dataset comprises a total of 52,984 Tweet IDs (that correspond to the same number of Tweets) about online learning that were posted on Twitter from 9th November 2021 to 13th July 2022. The earliest date was selected as 9th November 2021, as the Omicron variant was detected for the first time in a sample that was collected on this date. 13th July 2022 was the most recent date as per the time of data collection and publication of this dataset.
The dataset consists of 9 .txt files. An overview of these dataset files along with the number of Tweet IDs and the date range of the associated tweets is as follows. Table 1 shows the list of all the synonyms or terms that were used for the dataset development.
Filename: TweetIDs_November_2021.txt (No. of Tweet IDs: 1283, Date Range of the associated Tweet IDs: November 1, 2021 to November 30, 2021)
Filename: TweetIDs_December_2021.txt (No. of Tweet IDs: 10545, Date Range of the associated Tweet IDs: December 1, 2021 to December 31, 2021)
Filename: TweetIDs_January_2022.txt (No. of Tweet IDs: 23078, Date Range of the associated Tweet IDs: January 1, 2022 to January 31, 2022)
Filename: TweetIDs_February_2022.txt (No. of Tweet IDs: 4751, Date Range of the associated Tweet IDs: February 1, 2022 to February 28, 2022)
Filename: TweetIDs_March_2022.txt (No. of Tweet IDs: 3434, Date Range of the associated Tweet IDs: March 1, 2022 to March 31, 2022)
Filename: TweetIDs_April_2022.txt (No. of Tweet IDs: 3355, Date Range of the associated Tweet IDs: April 1, 2022 to April 30, 2022)
Filename: TweetIDs_May_2022.txt (No. of Tweet IDs: 3120, Date Range of the associated Tweet IDs: May 1, 2022 to May 31, 2022)
Filename: TweetIDs_June_2022.txt (No. of Tweet IDs: 2361, Date Range of the associated Tweet IDs: June 1, 2022 to June 30, 2022)
Filename: TweetIDs_July_2022.txt (No. of Tweet IDs: 1057, Date Range of the associated Tweet IDs: July 1, 2022 to July 13, 2022)
The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated before they can be used. To hydrate this dataset, the Hydrator application (link to download and a step-by-step tutorial on how to use Hydrator) may be used.
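Before hydration, the per-file counts listed above can be checked against the downloaded files. The following is a minimal sketch, assuming each .txt file holds one Tweet ID per line (the description does not state the file layout) and that the files sit in one local directory. The function names count_tweet_ids and check_counts are hypothetical.

```python
from pathlib import Path

# Expected counts copied from the file list in this description.
EXPECTED_COUNTS = {
    "TweetIDs_November_2021.txt": 1283,
    "TweetIDs_December_2021.txt": 10545,
    "TweetIDs_January_2022.txt": 23078,
    "TweetIDs_February_2022.txt": 4751,
    "TweetIDs_March_2022.txt": 3434,
    "TweetIDs_April_2022.txt": 3355,
    "TweetIDs_May_2022.txt": 3120,
    "TweetIDs_June_2022.txt": 2361,
    "TweetIDs_July_2022.txt": 1057,
}


def count_tweet_ids(path):
    """Count non-empty lines, assuming one Tweet ID per line (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_counts(directory):
    """Return {filename: (found, expected)} for files that are missing or differ.

    A missing file is reported with found=None.
    """
    mismatches = {}
    for name, expected in EXPECTED_COUNTS.items():
        file_path = Path(directory) / name
        found = count_tweet_ids(file_path) if file_path.exists() else None
        if found != expected:
            mismatches[name] = (found, expected)
    return mismatches


# The per-file counts add up to the stated dataset size of 52,984 Tweet IDs.
assert sum(EXPECTED_COUNTS.values()) == 52984
```

Calling check_counts on the download directory returns an empty dictionary when every file is present with the listed number of IDs.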
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology | List of synonyms and terms
COVID-19 | Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus
online learning | online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures
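The description does not say how the two term lists were combined when the Tweets were collected. As an illustration only, the sketch below assumes a Tweet qualifies when it contains at least one COVID-19 term and at least one online learning term, matched case-insensitively as substrings. The function name mentions_both is hypothetical, and the term lists are shortened subsets of Table 1.

```python
# Shortened subsets of the Table 1 term lists, lowercased for matching.
COVID_TERMS = ["omicron", "covid", "coronavirus", "corona", "sars cov-2"]
LEARNING_TERMS = [
    "online learning",
    "remote learning",
    "e-learning",
    "distance education",
    "virtual class",
]


def mentions_both(text, covid_terms=COVID_TERMS, learning_terms=LEARNING_TERMS):
    """Return True if text contains a term from each list.

    The AND combination of the two lists is an assumption, not a rule
    stated in the dataset description.
    """
    lowered = text.lower()
    has_covid = any(term in lowered for term in covid_terms)
    has_learning = any(term in lowered for term in learning_terms)
    return has_covid and has_learning
```

Substring matching is deliberately simple here; it would also match terms embedded in hashtags or longer words, which may or may not be desirable.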
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Summary
One ultimate goal of visual neuroscience is to understand how the brain processes visual stimuli encountered in the natural environment. Achieving this goal requires records of brain responses under massive amounts of naturalistic stimuli. Although the scientific community has put in a lot of effort to collect large-scale functional magnetic resonance imaging (fMRI) data under naturalistic stimuli, more naturalistic fMRI datasets are still urgently needed. We present here the Natural Object Dataset (NOD), a large-scale fMRI dataset containing responses to 57,120 naturalistic images from 30 participants. NOD strives for a balance between sampling variation between individuals and sampling variation between stimuli. This enables NOD to be utilized not only for determining whether an observation is generalizable across many individuals, but also for testing whether a response pattern is generalized to a variety of naturalistic stimuli. We anticipate that the NOD together with existing naturalistic neuroimaging datasets will serve as a new impetus for our understanding of the visual processing of naturalistic stimuli.
Data record
The data were organized according to the Brain-Imaging-Data-Structure (BIDS) Specification version 1.7.0 and can be accessed from the OpenNeuro public repository (accession number: ds004496). In short, raw data of each subject were stored in “sub-
Stimulus images The stimulus images for different fMRI experiments are deposited in separate folders: “stimuli/imagenet”, “stimuli/coco”, “stimuli/prf”, and “stimuli/floc”. Each experiment folder contains corresponding stimulus images, and the auxiliary files can be found within the “info” subfolder.
Raw MRI data Each participant folder consists of several session folders: anat, coco, imagenet, prf, floc. Each session folder in turn includes “anat”, “func”, or “fmap” folders for corresponding modality data. The scan information for each session is provided in a TSV file.
Preprocessed volume data from fMRIprep The preprocessed volume-based fMRI data are in subject's native space, saved as “sub-
Preprocessed surface-based data from ciftify The preprocessed surface-based data are in standard fsLR space, saved as “sub-
Brain activation data from surface-based GLM analyses The brain activation data are derived from GLM analyses on the standard fsLR space, saved as “sub-
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Product recognition is a task that receives continuous attention from the computer vision and deep learning community, mainly with the aim of providing robust solutions for automatic checkout in supermarkets. One of the main challenges is the lack of images that show a large number of products under realistic conditions. Here the product recognition task is framed slightly differently from the automatic checkout paradigm, but the challenges encountered are the same. The dataset was captured with the aim of helping individuals with visual impairment do their daily grocery shopping, in order to increase their autonomy. In particular, we propose a large-scale dataset for tackling the product recognition problem in a supermarket environment. The dataset is characterized by (a) large scale in terms of unique products, each associated with one or more photos from different viewpoints, (b) rich textual descriptions linked to different levels of annotation, and (c) images acquired both in laboratory conditions and in a realistic supermarket scenario with varying clutter and lighting conditions. A direct comparison with existing datasets of this category demonstrates the significantly higher number of available unique products, as well as the richness of its annotation, which enables different recognition scenarios. Finally, the dataset is also benchmarked using various approaches based on both visual and textual descriptors.
Overview
The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.4 (published 24 February 2025). The 11.4 release contains updated boundary lines and data refinements designed to extend the functionality of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control.
National Geospatial Data Asset
This dataset is a National Geospatial Data Asset (NGDAID 194) managed by the Department of State. It is a part of the International Boundaries Theme created by the Federal Geographic Data Committee.
Dataset Source Details
Sources for these data include treaties, relevant maps, and data from boundary commissions, as well as national mapping agencies. Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process includes analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.
Cartographic Visualization
The LSIB is a geospatial dataset that, when used for cartographic purposes, requires additional styling. The LSIB download package contains example style files for commonly used software applications. The attribute table also contains embedded information to guide the cartographic representation. Additional discussion of these considerations can be found in the Use of Core Attributes in Cartographic Visualization section below.
Additional cartographic information pertaining to the depiction and description of international boundaries or areas of special sovereignty can be found in Guidance Bulletins published by the Office of the Geographer and Global Issues: https://hiu.state.gov/data/cartographic_guidance_bulletins/
Contact
Direct inquiries to internationalboundaries@state.gov.
Direct download: https://data.geodata.state.gov/LSIB.zip
Attribute Structure
The dataset uses the following attributes divided into two categories:
ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | Core
CC1_GENC3 | Extension
CC1_WPID | Extension
COUNTRY1 | Core
CC2 | Core
CC2_GENC3 | Extension
CC2_WPID | Extension
COUNTRY2 | Core
RANK | Core
LABEL | Core
STATUS | Core
NOTES | Core
LSIB_ID | Extension
ANTECIDS | Extension
PREVIDS | Extension
PARENTID | Extension
PARENTSEG | Extension
These attributes have external data sources that update separately from the LSIB:
ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | GENC
CC1_GENC3 | GENC
CC1_WPID | World Polygons
COUNTRY1 | DoS Lists
CC2 | GENC
CC2_GENC3 | GENC
CC2_WPID | World Polygons
COUNTRY2 | DoS Lists
LSIB_ID | BASE
ANTECIDS | BASE
PREVIDS | BASE
PARENTID | BASE
PARENTSEG | BASE
The core attributes listed above describe the boundary lines contained within the LSIB dataset. Removal of core attributes from the dataset will change the meaning of the lines. An attribute status of “Extension” represents a field containing data interoperability information. Other attributes not listed above include “FID”, “Shape_length” and “Shape.” These are components of the shapefile format and do not form an intrinsic part of the LSIB.
Core Attributes
The eight core attributes listed above contain unique information which, when combined with the line geometry, comprise the LSIB dataset. These Core Attributes are further divided into Country Code and Name Fields and Descriptive Fields.
Country Code and Country Name Fields
“CC1” and “CC2” fields are machine readable fields that contain political entity codes. These are two-character codes derived from the Geopolitical Entities, Names, and Codes Standard (GENC), Edition 3 Update 18. “CC1_GENC3” and “CC2_GENC3” fields contain the corresponding three-character GENC codes and are extension attributes discussed below. The codes “Q2” or “QX2” denote a line in the LSIB representing a boundary associated with areas not contained within the GENC standard. The “COUNTRY1” and “COUNTRY2” fields contain the names of corresponding political entities. These fields contain names approved by the U.S. Board on Geographic Names (BGN) as incorporated in the "Independent States in the World" and "Dependencies and Areas of Special Sovereignty" lists maintained by the Department of State. To ensure maximum compatibility, names are presented without diacritics and certain names are rendered using common cartographic abbreviations. Names for lines associated with the code "Q2" are descriptive and not necessarily BGN-approved. Names rendered in all CAPITAL LETTERS denote independent states. Names rendered in normal text represent dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user.
Descriptive Fields
The following text fields are a part of the core attributes of the LSIB dataset and do not update from external sources. They provide additional information about each of the lines and are as follows:
ATTRIBUTE NAME | CONTAINS NULLS
RANK | No
STATUS | No
LABEL | Yes
NOTES | Yes
Neither the "RANK" nor "STATUS" fields contain null values; the "LABEL" and "NOTES" fields do. The "RANK" field is a numeric expression of the "STATUS" field. Combined with the line geometry, these fields encode the views of the United States Government on the political status of the boundary line. A value of “1” in the “RANK” field corresponds to an "International Boundary" value in the “STATUS” field.
Values of “2” and “3” correspond to “Other Line of International Separation” and “Special Line,” respectively. The “LABEL” field contains required text to describe the line segment on all finished cartographic products, including but not limited to print and interactive maps. The “NOTES” field contains an explanation of special circumstances modifying the lines. This information can pertain to the origins of the boundary lines, limitations regarding the purpose of the lines, or the original source of the line.
Use of Core Attributes in Cartographic Visualization
Several of the Core Attributes provide information required for the proper cartographic representation of the LSIB dataset. The cartographic usage of the LSIB requires a visual differentiation between the three categories of boundary lines. Specifically, this differentiation must be between:
- International Boundaries (Rank 1);
- Other Lines of International Separation (Rank 2); and
- Special Lines (Rank 3).
Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the “Label” field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Please consult the style files in the download package for examples of this depiction. The requirement to incorporate the contents of the "LABEL" field on cartographic products is scale dependent. If a label is legible at the scale of a given static product, a proper use of this dataset would encourage the application of that label. Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field contains the preferred description for the three LSIB line types when they are incorporated into a map legend but is otherwise not to be used for labeling.
Use of the “CC1,” “CC1_GENC3,” “CC2,” “CC2_GENC3,” “RANK,” or “NOTES” fields for cartographic labeling purposes is prohibited.
Extension Attributes
Certain elements of the attributes within the LSIB dataset extend data functionality to make the data more interoperable or to provide clearer linkages to other datasets. The fields “CC1_GENC3” and “CC2_GENC3” contain the corresponding three-character GENC code to the “CC1” and “CC2” attributes. The code “QX2” is the three-character counterpart of the code “Q2,” which denotes a line in the LSIB representing a boundary associated with a geographic area not contained within the GENC standard. To allow for linkage between individual lines in the LSIB and World Polygons dataset, the “CC1_WPID” and “CC2_WPID” fields contain a Universally Unique Identifier (UUID), version 4, which provides a stable description of each geographic entity in a boundary pair relationship. Each UUID corresponds to a geographic entity listed in the World Polygons dataset. These fields allow for linkage between individual lines in the LSIB and the overall World Polygons dataset. Five additional fields in the LSIB expand on the UUID concept and either describe features that have changed across space and time or indicate relationships between previous versions of the feature. The “LSIB_ID” attribute is a UUID value that defines a specific instance of a feature. Any change to the feature in a lineset requires a new “LSIB_ID.” The “ANTECIDS,” or antecedent ID, is a UUID that references line geometries from which a given line is descended in time. It is used when there is a feature that is entirely new, not when there is a new version of a previous feature. This is generally used to reference countries that have dissolved. The “PREVIDS,” or Previous ID, is a UUID field that contains old versions of a line. This is an additive field that houses all Previous IDs.
A new version of a feature is defined by any change to the feature—either line geometry or attribute—but it is still conceptually the same feature. The “PARENTID” field
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset aims to quantify the contribution of small-scale and large-scale agriculture to food-related virtual water flows. Small-scale and large-scale agriculture were explicitly disaggregated in Environmentally-Extended Multi-Regional Input-Output analysis (EE-MRIO), which was later used to calculate virtual water flows. The EE-MRIO consists of three tables to increase the resolution of food-related sectors, considering the importance of these sectors in water consumption (FABIO, GLORIA, and the linking table). Gridded crop production and water consumption data are used, and their production allocation to trade and non-food uses was estimated based on the farming system. Different crop water footprints were used for virtual water flows, non-food uses and for domestic purposes.
This dataset contains country-level virtual water flows broken down by green or blue water, type of final use, year, and origin in water-scarce or water-abundant regions, together with gridded data describing where the virtual water flows come from at the 30-arcmin grid cell level per crop.
A detailed method description and analysis are under preparation and review. We will update it here as soon as possible. All the code and other data will be available upon request.
WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).
Version 11.1 Release Date: August 22, 2022
The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. These data and their derivatives are the only international boundary lines approved for U.S. Government use. They reflect U.S. Government policy, and not necessarily de facto limits of control. This dataset is a National Geospatial Data Asset.
Sources for these data include treaties, relevant maps, and data from boundary commissions and national mapping agencies. Where available, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery of the data involves analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.
The dataset uses the following attributes:
Attribute Name | Explanation
Country Code | Country-level codes are from the Geopolitical Entities, Names, and Codes Standard (GENC). The Q2 code denotes a line representing a boundary associated with an area not in GENC.
Country Names | Names approved by the U.S. Board on Geographic Names (BGN). Names for lines associated with a Q2 code are descriptive and are not necessarily BGN-approved.
Label | Required text label for the line segment where scale permits
Rank/Status | Rank 1: International Boundary; Rank 2: Other Line of International Separation; Rank 3: Special Line
Notes | Explanation of any applicable special circumstances
Cartographic Usage
Depiction of the LSIB requires a visual differentiation between the three categories of boundaries: International Boundaries (Rank 1), Other Lines of International Separation (Rank 2), and Special Lines (Rank 3). Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the “Label” field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Additional cartographic information can be found in Guidance Bulletins (https://hiu.state.gov/data/cartographic_guidance_bulletins/) published by the Office of the Geographer and Global Issues. Please direct inquiries to internationalboundaries@state.gov.
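The rank categories and prominence rules described above can be sketched as a small lookup. This is a minimal illustration, not the style shipped with the LSIB: the line widths are hypothetical values chosen only so that Rank 1 is the most prominent and Rank 3 the least, and the function name style_for is an assumption.

```python
# Rank-to-status mapping as stated in the dataset description.
RANK_TO_STATUS = {
    1: "International Boundary",
    2: "Other Line of International Separation",
    3: "Special Line",
}

# Hypothetical line widths; only their ordering (1 > 2 > 3) reflects the text.
LINE_WIDTH_BY_RANK = {1: 1.5, 2: 1.0, 3: 0.5}


def style_for(rank, label, scale_permits_labels):
    """Return a simple style dictionary for one boundary line.

    Any non-empty label is shown when the map scale permits, which covers
    the requirement that Rank 2 and Rank 3 lines be labeled where possible.
    """
    if rank not in RANK_TO_STATUS:
        raise ValueError(f"unexpected RANK value: {rank!r}")
    show_label = bool(label) and scale_permits_labels
    return {
        "status": RANK_TO_STATUS[rank],
        "width": LINE_WIDTH_BY_RANK[rank],
        "label": label if show_label else None,
    }
```

Keeping the widths in a separate table makes it easy to swap in values from the official style files without touching the rank logic.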
The lines in the LSIB dataset are the product of decades of collaboration between geographers at the Department of State and the National Geospatial-Intelligence Agency with contributions from the Central Intelligence Agency and the UK Defence Geographic Centre. Attribution is welcome: U.S. Department of State, Office of the Geographer and Global Issues.
This version of the LSIB contains changes and accuracy refinements for the following line segments. These changes reflect improvements in spatial accuracy derived from newly available source materials, an ongoing review process, or the publication of new treaties or agreements. Changes to lines include: • Akrotiri (UK) / Cyprus • Albania / Montenegro • Albania / Greece • Albania / North Macedonia • Armenia / Turkey • Austria / Czechia • Austria / Slovakia • Austria / Hungary • Austria / Slovenia • Austria / Germany • Austria / Italy • Austria / Switzerland • Azerbaijan / Turkey • Azerbaijan / Iran • Belarus / Latvia • Belarus / Russia • Belarus / Ukraine • Belarus / Poland • Bhutan / India • Bhutan / China • Bulgaria / Turkey • Bulgaria / Romania • Bulgaria / Serbia • Bulgaria / Romania • China / Tajikistan • China / India • Croatia / Slovenia • Croatia / Hungary • Croatia / Serbia • Croatia / Montenegro • Czechia / Slovakia • Czechia / Poland • Czechia / Germany • Finland / Russia • Finland / Norway • Finland / Sweden • France / Italy • Georgia / Turkey • Germany / Poland • Germany / Switzerland • Greece / North Macedonia • Guyana / Suriname • Hungary / Slovenia • Hungary / Serbia • Hungary / Romania • Hungary / Ukraine • Iran / Turkey • Iraq / Turkey • Italy / Slovenia • Italy / Switzerland • Italy / Vatican City • Italy / San Marino • Kazakhstan / Russia • Kazakhstan / Uzbekistan • Kosovo / North Macedonia • Kosovo / Serbia • Kyrgyzstan / Tajikistan • Kyrgyzstan / Uzbekistan • Latvia / Russia • Latvia / Lithuania • Lithuania / Poland • Lithuania / Russia • Moldova / Ukraine • Moldova / Romania • Norway / Russia • Norway / Sweden • Poland / Russia • Poland / Ukraine • Poland / Slovakia • Romania / Ukraine • Romania / Serbia • Russia / Ukraine • Syria / Turkey • Tajikistan / Uzbekistan
This release also contains topology fixes, land boundary terminus refinements, and tripoint adjustments.
While U.S. Government works prepared by employees of the U.S. Government as part of their official duties are not subject to Federal copyright protection (see 17 U.S.C. § 105), copyrighted material incorporated in U.S. Government works retains its copyright protection. The works on or made available through download from the U.S. Department of State’s website may not be used in any manner that infringes any intellectual property rights or other proprietary rights held by any third party. Use of any copyrighted material beyond what is allowed by fair use or other exemptions may require appropriate permission from the relevant rightsholder. With respect to works on or made available through download from the U.S. Department of State’s website, neither the U.S. Government nor any of its agencies, employees, agents, or contractors make any representations or warranties—express, implied, or statutory—as to the validity, accuracy, completeness, or fitness for a particular purpose; nor represent that use of such works would not infringe privately owned rights; nor assume any liability resulting from use of such works; and shall in no way be liable for any costs, expenses, claims, or demands arising out of use of such works.
Climate change has been shown to influence lake temperatures globally. To better understand the diversity of lake responses to climate change and give managers tools to manage individual lakes, we modelled daily water temperature profiles for 10,774 lakes in Michigan, Minnesota and Wisconsin for contemporary (1979-2015) and future (2020-2040 and 2080-2100) time periods with climate models based on the Representative Concentration Pathway 8.5, the worst-case emission scenario. From simulated temperatures, we derived commonly used, ecologically relevant annual metrics of thermal conditions for each lake. We included all available supporting metadata including satellite and in-situ observations of water clarity, maximum observed lake depth, land-cover based estimates of surrounding canopy height and observed water temperature profiles (used here for validation). This unique dataset offers landscape-level insight into the future impact of climate change on lakes. This data set contains the following parameters: site_id, Prmnn_I, GNIS_ID, GNIS_Nm, ReachCd, FType, FCode, which are defined below.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, "Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets from 2017–2022 and 100 Research Questions", Journal of Analytics, Volume 1, Issue 2, 2022, pp. 72-97, DOI: https://doi.org/10.3390/analytics1020007
Abstract
The exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and diverse use cases in assisted living, military, healthcare, firefighting, and industry 4.0. The exoskeleton market is projected to increase by multiple times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction, towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today’s living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset by the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, where the topics found in the conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 Tweets about exoskeletons that were posted in a 5-year period from 21 May 2017 to 21 May 2022.
Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The proposed AIS dataset encompasses a substantial temporal span of 20 months, spanning from April 2021 to December 2022. This extensive coverage period empowers analysts to examine long-term trends and variations in vessel activities. Moreover, it facilitates researchers in comprehending the potential influence of external factors, including weather patterns, seasonal variations, and economic conditions, on vessel traffic and behavior within the Finnish waters.
This dataset encompasses an extensive array of data pertaining to vessel movements and activities encompassing seas, rivers, and lakes. Anticipated to be comprehensive in nature, the dataset encompasses a diverse range of ship types, such as cargo ships, tankers, fishing vessels, passenger ships, and various other categories.
The AIS dataset exhibits a prominent attribute in the form of its exceptional granularity, with a total of 2 293 129 345 data points. Such granular information can help analysts comprehend vessel dynamics and operations within the Finnish waters. It enables the identification of patterns and anomalies in vessel behavior and facilitates an assessment of the potential environmental implications associated with maritime activities.
Please cite the following publication when using the dataset:
TBD
The publication is available at: TBD
A preprint version of the publication is available at TBD
csv file structure
YYYY-MM-DD-location.csv
This file contains the received AIS position reports. The structure of the logged parameters is the following: [timestamp, timestampExternal, mmsi, lon, lat, sog, cog, navStat, rot, posAcc, raim, heading]
timestamp I believe this is the UTC second when the report was generated by the electronic position fixing system (EPFS) (0-59; 60 if the time stamp is not available, which should also be the default value; 61 if the positioning system is in manual input mode; 62 if the electronic position fixing system operates in estimated (dead reckoning) mode; 63 if the positioning system is inoperative).
timestampExternal The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.
mmsi MMSI number, Maritime Mobile Service Identity (MMSI) is a unique 9 digit number that is assigned to a (Digital Selective Calling) DSC radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
lon Longitude in 1/10 000 min (+/-180 deg; East = positive, West = negative, as per 2's complement). 181 deg (6791AC0h) = not available = default
lat Latitude in 1/10 000 min (+/-90 deg; North = positive, South = negative, as per 2's complement). 91 deg (3412140h) = not available = default
sog Speed over ground in 1/10 knot steps (0-102.2 knots). 1 023 = not available; 1 022 = 102.2 knots or higher
cog Course over ground in 1/10 deg (0-3599). 3600 (E10h) = not available = default. 3 601-4 095 should not be used
navStat Navigational status
0 = under way using engine
1 = at anchor
2 = not under command
3 = restricted maneuverability
4 = constrained by her draught
5 = moored
6 = aground
7 = engaged in fishing
8 = under way sailing
9 = reserved for future amendment of navigational status for ships carrying DG, HS, or MP, or IMO hazard or pollutant category C, high speed craft (HSC)
10 = reserved for future amendment of navigational status for ships carrying dangerous goods (DG), harmful substances (HS) or marine pollutants (MP), or IMO hazard or pollutant category A, wing in ground (WIG)
11 = power-driven vessel towing astern (regional use)
12 = power-driven vessel pushing ahead or towing alongside (regional use)
13 = reserved for future use
14 = AIS-SART (active), MOB-AIS, EPIRB-AIS
15 = undefined = default (also used by AIS-SART, MOB-AIS and EPIRB-AIS under test)
rot ROTAIS Rate of turn
0 to +126 = turning right at up to 708 deg per min or higher
0 to -126 = turning left at up to 708 deg per min or higher
Values between 0 and 708 deg per min coded by ROTAIS = 4.733 SQRT(ROTsensor) degrees per min where ROTsensor is the Rate of Turn as input by an external Rate of Turn Indicator (TI). ROTAIS is rounded to the nearest integer value.
+127 = turning right at more than 5 deg per 30 s (No TI available)
-127 = turning left at more than 5 deg per 30 s (No TI available)
-128 (80 hex) indicates no turn information available (default).
ROT data should not be derived from COG information.
posAcc Position accuracy, The position accuracy (PA) flag should be determined in accordance with the table below:
1 = high (<= 10 m)
0 = low (> 10 m)
0 = default
See https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM
raim RAIM-flag Receiver autonomous integrity monitoring (RAIM) flag of electronic position fixing device; 0 = RAIM not in use = default; 1 = RAIM in use. See Table https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM
Check https://en.wikipedia.org/wiki/Receiver_autonomous_integrity_monitoring
heading True heading, Degrees (0-359) (511 indicates not available = default)
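The field encodings above can be sketched as a few small decoding helpers. This is a minimal sketch, assuming the CSV columns hold the raw integer encodings exactly as described in this section; if the files already store decimal degrees or knots, the scaling steps should be skipped. The function names are hypothetical and are not part of the dataset.

```python
import math
from typing import Optional

# Sentinel values taken from the field descriptions above.
LON_NOT_AVAILABLE = 0x6791AC0  # 181 deg expressed in 1/10 000 min
LAT_NOT_AVAILABLE = 0x3412140  # 91 deg expressed in 1/10 000 min


def decode_lon(raw: int) -> Optional[float]:
    """Raw longitude (1/10 000 min) to decimal degrees, None if not available."""
    return None if raw == LON_NOT_AVAILABLE else raw / 600_000.0


def decode_lat(raw: int) -> Optional[float]:
    """Raw latitude (1/10 000 min) to decimal degrees, None if not available."""
    return None if raw == LAT_NOT_AVAILABLE else raw / 600_000.0


def decode_sog(raw: int) -> Optional[float]:
    """Raw speed over ground (1/10 knot) to knots; 1023 means not available."""
    return None if raw == 1023 else raw / 10.0


def decode_cog(raw: int) -> Optional[float]:
    """Raw course over ground (1/10 deg) to degrees; 3600 and above is unusable."""
    return None if raw >= 3600 else raw / 10.0


def decode_heading(raw: int) -> Optional[int]:
    """True heading in degrees; 511 means not available."""
    return None if raw == 511 else raw


def decode_rot(raw: int) -> Optional[float]:
    """Invert ROTAIS = 4.733 * sqrt(ROTsensor) to get signed deg per min.

    Returns None for -128 (no turn information) and for +/-127, which only
    signal a turn direction without a rate (no turn indicator available).
    """
    if raw in (-128, -127, 127):
        return None
    magnitude = (abs(raw) / 4.733) ** 2
    return math.copysign(magnitude, raw)
```

Returning None for every "not available" code keeps sentinel values such as 1023 or 511 from leaking into averages or plots; positive rates of turn denote a turn to the right, negative to the left.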
YYYY-MM-DD-metadata.csv
This file contains the received AIS metadata: the ship static and voyage related data. The structure of the logged parameters is the following: [timestamp, destination, mmsi, callSign, imo, shipType, draught, eta, posType, pointA, pointB, pointC, pointD, name]
timestamp The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.
destination Maximum 20 characters using 6-bit ASCII; @@@@@@@@@@@@@@@@@@@@ = not available For SAR aircraft, the use of this field may be decided by the responsible administration
mmsi MMSI number, Maritime Mobile Service Identity (MMSI) is a unique 9 digit number that is assigned to a (Digital Selective Calling) DSC radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
callSign 7 x 6 bit ASCII characters, @@@@@@@ = not available = default. Craft associated with a parent vessel should use “A” followed by the last 6 digits of the MMSI of the parent vessel. Examples of these craft include towed vessels, rescue boats, tenders, lifeboats and liferafts.
imo 0 = not available = default – Not applicable to SAR aircraft
0000000001-0000999999 not used
0001000000-0009999999 = valid IMO number;
0010000000-1073741823 = official flag state number.
Check: https://en.wikipedia.org/wiki/IMO_number
shipType
0 = not available or no ship = default
1-99 = as defined below
100-199 = reserved, for regional use
200-255 = reserved, for future use
Not applicable to SAR aircraft
Check https://www.navcen.uscg.gov/pdf/AIS/AISGuide.pdf and https://www.navcen.uscg.gov/?pageName=AISMessagesAStatic
draught In 1/10 m; 255 = draught 25.5 m or greater; 0 = not available = default; in accordance with IMO Resolution A.851. Not applicable to SAR aircraft, for which it should be set to 0
eta Estimated time of arrival; MMDDHHMM UTC
Bits 19-16: month; 1-12; 0 = not available = default
Bits 15-11: day; 1-31; 0 = not available = default
Bits 10-6: hour; 0-23; 24 = not available = default
Bits 5-0: minute; 0-59; 60 = not available = default
For SAR aircraft, the use of this field may be decided by the responsible administration
posType Type of electronic position fixing device
0 = undefined (default)
1 = GPS
2 = GLONASS
3 = combined GPS/GLONASS
4 = Loran-C
5 = Chayka
6 = integrated navigation system
7 = surveyed
8 = Galileo
9-14 = not used
15 = internal GNSS
pointA Reference point for reported position.
Also indicates the dimensions of the ship (m). For SAR aircraft, the use of this field may be decided by the responsible administration. If used, it should indicate the maximum dimensions of the craft. By default, A = B = C = D should be set to “0”.
Check: https://www.navcen.uscg.gov/?pageName=AISMessagesAStatic#_Reference_point_for
pointB See above
pointC See above
pointD See above
name Maximum 20 characters 6 bit ASCII "@@@@@@@@@@@@@@@@@@@@" = not available = default The Name should be as shown on the station radio license. For SAR aircraft, it should be set to “SAR AIRCRAFT NNNNNNN” where NNNNNNN equals the aircraft registration number.
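The eta bit layout and the pointA to pointD fields described above can be unpacked as follows. This is a minimal sketch, assuming eta is stored in the CSV as the packed 20-bit integer and that pointA to pointD are the distances in metres from the reference point to bow, stern, port and starboard (the usual reading of the AIS static data message, not something this description states explicitly). The names Eta, decode_eta and ship_dimensions are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Eta(NamedTuple):
    """Estimated time of arrival (UTC); a field is None when not available."""
    month: Optional[int]
    day: Optional[int]
    hour: Optional[int]
    minute: Optional[int]


def decode_eta(packed: int) -> Eta:
    """Split the packed MMDDHHMM value into its four bit fields."""
    month = (packed >> 16) & 0xF    # bits 19-16
    day = (packed >> 11) & 0x1F     # bits 15-11
    hour = (packed >> 6) & 0x1F     # bits 10-6
    minute = packed & 0x3F          # bits 5-0
    return Eta(
        month if 1 <= month <= 12 else None,   # 0 = not available
        day if 1 <= day <= 31 else None,       # 0 = not available
        hour if hour <= 23 else None,          # 24 = not available
        minute if minute <= 59 else None,      # 60 = not available
    )


def ship_dimensions(a: int, b: int, c: int, d: int) -> Optional[Tuple[int, int]]:
    """Return (length, beam) in metres, or None when all four are the default 0.

    Assumes A/B are the distances to bow/stern and C/D to port/starboard.
    """
    if a == b == c == d == 0:
        return None
    return (a + b, c + d)
```

For example, a value packed from month 5, day 17, hour 14 and minute 30 decodes back to those four numbers, while the all-default value (hour 24, minute 60, month and day 0) decodes to four None fields.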
This project leveraged existing datasets to ground policy for children in the digital age for the first time. The project provided evidence to policy-makers, parents, teachers, and GPs on the impact of digital technologies in the lives of British children, highlighting key risk and resilience factors for future interventions. Using existing data, advanced statistical techniques, and robust open science methodologies, we addressed three main research questions: 1. What risk and resilience factors influence the effect of digital technology on adolescents' psychological well-being? 2. How does digital technology use relate to psychological well-being, and do identified risk factors mediate this relationship? 3. What are the causal pathways between risk factors, digital technology use, and psychological well-being that could inform future interventions? This helped develop profiles to explore long-term technology use and effects, distinguishing between over-hyped concerns, like social isolation, and those warranting further scrutiny, such as poor sleep. While the data cannot be shared, the underlying code is made available open access under Related Resources.
This project aims, for the first time, to use existing ESRC datasets to generate the science required to ground policy in this area. We aim to provide policy-makers, parents, teachers, and GPs with the evidence required to understand the role digital technologies play in the lives of British children, and to highlight potential risk and resilience factors that could be the focus of future interventions. We will use ESRC data assets, advanced statistical approaches, and robust open science methodologies to answer three pressing research questions:
Answers to these questions are currently elusive, due to the poor data quality and methodological shortcomings that restrain research on technology effects. We will leverage our extensive experience working with large-scale social datasets to examine the general effects of digital technologies and more technology-specific effects (e.g. social media and gaming). We will use machine learning, network modelling, and advanced longitudinal approaches to pinpoint potential risk and resilience factors (e.g. social support, economic deprivation) that alter children's reactions to digital technologies, and which could help guide future technology policy. This will create different profiles of children that we can use to investigate the uses and effects of digital technologies over the longer term, determining which possible technology effects (e.g. social isolation) are currently unevidenced and over-hyped, and which (e.g. poor sleep) deserve a closer look.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sullivan J.A., Samii, C., Brown, D., Moyo, F., Agrawal, A. 2023. Large-scale land acquisitions exacerbate local farmland inequalities in Tanzania. Proceedings of the National Academy of Sciences 120, e2207398120. https://doi.org/10.1073/pnas.2207398120
Land inequality stalls economic development, entrenches poverty, and is associated with environmental degradation. Yet, rigorous assessments of land-use interventions attend to inequality only rarely. A land inequality lens is especially important to understand how recent large-scale land acquisitions (LSLAs) affect smallholder and indigenous communities across as much as 100 million hectares around the world. This paper studies inequalities in land assets, specifically landholdings and farm size, to derive insights into the distributional outcomes of LSLAs. Using a household survey covering four pairs of land acquisition and control sites in Tanzania, we use a quasi-experimental design to characterize changes in land inequality and subsequent impacts on well-being. We find convincing evidence that LSLAs in Tanzania lead to both reduced landholdings and greater farmland inequality among smallholders. Households in proximity to LSLAs are associated with 21.1% (P = 0.02) smaller landholdings while evidence, although insignificant, is suggestive that farm sizes are also declining. Aggregate estimates, however, hide that households in the bottom quartiles of farm size suffer the brunt of landlessness and land loss induced by LSLAs that combine to generate greater farmland inequality. Additional analyses find that land inequality is not offset by improvements in other livelihood dimensions, rather farm size decreases among households near LSLAs are associated with no income improvements, lower wealth, increased poverty, and higher food insecurity. The results demonstrate that without explicit consideration of distributional outcomes, land-use policies can systematically reinforce existing inequalities.
We include anonymized household survey data from our analysis to support open and reproducible science. In particular, we provide i) an anonymized household dataset collected in 2018 (n=994) for households near (treatment) and far away from (control) LSLAs and ii) a household dataset collected in 2019 (n=165) within the same sites. For the 2018 surveys, several anonymized extracts are provided, including an imputed (n=10) dataset to fill in missing data that was used for the main analysis. This data can be found in the hh_data folder and includes:
Our analysis also incorporates data from the Living Standards Measurement Survey (LSMS) collected by the World Bank (found in lsms_data folder). We provide sub-modules from the LSMS dataset relevant to our analysis, but the full datasets can be accessed through the World Bank's Microdata Library (https://microdata.worldbank.org/index.php/home).
Across several analyses we use the LSLA boundaries for our four selected sites. We provide a shapefile for the LSLA boundaries in the gis_data folder.
Finally, our data replication includes several model outputs (found in mod_outputs), particularly those that are lengthy to run in R. These datasets can optionally be loaded into R rather than re-running analysis using our main_analysis.Rmd script.
We provide replication code in the form of R Markdown (.Rmd) or R (.R) files. Alongside the replication data, this can be used to reproduce the main figures, tables, supplementary materials, and results reported in our article. Scripts include:
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
OGB-LSC consists of three datasets: MAG240M-LSC, WikiKG90M-LSC, and PCQM4M-LSC. Each dataset offers an independent task.
MAG240M-LSC is a heterogeneous academic graph, and the task is to predict the subject areas of papers situated in the heterogeneous graph (node classification). WikiKG90M-LSC is a knowledge graph, and the task is to impute missing triplets (link prediction). PCQM4M-LSC is a quantum chemistry dataset, and the task is to predict an important molecular property, the HOMO-LUMO gap, of a given molecule (graph regression).
The Large-scale Renewable Projects Reported by NYSERDA Beginning 2004 dataset includes information for projects completed, operational, cancelled, and under development.
Projects reported by NYSERDA represent projects which NYSERDA has awarded, approved, or are pending approval of NYSERDA Agreements. Details on awards resulting from solicitations in 2024 and later will be published once Agreements associated with these awards are executed.
This dataset does not represent all renewable projects in New York State. For information pertaining to projects located on Long Island, please visit LIPA’s 2024 Report.
Projects listed under development are subject to change. For additional information on dataset, please review the data dictionary.
Operating Renewable Energy Resources in NYS is reported through the New York Generation Attribute Tracking System and reported annually.
Note that all Large Scale Renewable energy projects awarded under solicitations ORECRFP18-1, ORECRFP20-1, ORECRFP22-1, ORECRFP23-1, ORECRFP24-1, RESRFP18-1, RESRFP19-1, RESRFP20-1, RESRFP21-1, RESRFP22-1, RESRFP23-1, RESRFP24-1, T2RFP21-1, and T4RFP21-1 require that all laborers, workpersons, and mechanics, within the meaning of NYS Labor Law Article 8, performing construction activities with respect to the Bid Facility and, if awarded, Energy Storage, must be paid at least the applicable Prevailing Wage applicable in the area where the Bid Facility will be situated, erected and used, as published by the NYS Department of Labor (DOL) or, if located outside of New York State, at least the equivalent Prevailing Wage of the jurisdiction where the Bid Facility is located.
For more information on Clean Energy Standard Results, please visit https://www.nyserda.ny.gov/All-Programs/Programs/Clean-Energy-Standard/Renewable-Generators-and-Developers/RES-Tier-One-Eligibility/Solicitations-for-Long-term-Contracts
For more information on the Offshore Wind Results, please visit https://www.nyserda.ny.gov/All-Programs/Offshore-Wind/Focus-Areas/NY-Offshore-Wind-Projects
For more information on the Competitive Tier Two Results, please visit https://www.nyserda.ny.gov/All-Programs/Programs/Clean-Energy-Standard/Renewable-Generators-and-Developers/Tier-Two-Competitive-Program
For more information on the Tier Four Results, please visit https://www.nyserda.ny.gov/All-Programs/Clean-Energy-Standard/Renewable-Generators-and-Developers/Tier-Four
For More Information on Long Island projects, please visit https://www.flipsnack.com/lipower/2024-budget-report/full-view.html
The New York State Energy Research and Development Authority (NYSERDA) offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, and reduce reliance on fossil fuels. To learn more about NYSERDA’s programs, visit https://nyserda.ny.gov or follow us on Twitter, Facebook, YouTube, or Instagram.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This article introduces a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. It provides theoretical guarantees that the algorithm obtains high accuracy with high probability under certain assumptions. Moreover, it can also be used as an initialization strategy for k-means clustering. Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm when clustering a large number of clusters, and a k-means algorithm initialized by the algorithm outperforms many of the classic clustering methods in both speed and accuracy, while scaling well to large datasets such as ImageNet. Supplementary materials for this article are available online.
Abstract copyright UK Data Service and data collection copyright owner. This project involved two economics experiments conducted using an existing household panel survey. The experiments aimed at understanding possible sources of heterogeneity in important economic outcomes such as saving and wealth accumulation. The two experiments use the CentERpanel, an online, weekly, stratified survey of a sample of over 2,000 households and 5,000 individual members, conducted in the Netherlands (not currently held at the UK Data Archive). The CentERdata DNB Household Survey, 2008 and 2009 provided a unique opportunity to combine experimental data with sociodemographic and economic variables from the survey. The subjects in the experiment were randomly recruited from the CentERpanel sample. The first experiment, concerned with decision-making under uncertainty, was conducted in May 2009 with 1,182 CentERpanel adult members. The second experiment, concerning intertemporal decision making, was conducted in June and September 2011 with 1,425 panel members. Subjects were presented with a sequence of decision problems: under uncertainty in the first experiment and over time in the second experiment. In the first experiment decision problems could be interpreted as the allocation of an endowment between two risky assets, while they are the allocation of an endowment over two payment dates in the second experiment. These decision problems were presented using and adapting a graphical interface introduced by Choi et al. (2007). Because the design was user-friendly, it was possible to present each subject with many choices, allowing analysis of the data at the level of the individual subject. Rich, individual-level information of experimental choices allows the separate measurement of quality and preferences of decision-making under uncertainty and over time. 
This analysis was then related to socio-demographic information and economic outcomes, such as saving and wealth accumulation in the panel data. Further information may be found on the ESRC Measuring Risk and Time Preferences: Large-Scale Field Experiments award webpage. The DNB Household Survey data are publicly available from the CentERdata DNB Household Survey website. Main Topics: Topics covered include: measurements of risk and time preferences; measurements of consistency with utility maximization as decision-making quality; relationships between experimental measures and socio-demographic information; relationships between experimental measures and economic variables - wealth and portfolio allocation. Simple random sample Self-completion
Over 4,400 large scale commercial solar facilities are in operation in the United States as of December 2021, representing over 60 gigawatts of electric power capacity; of these, over 3,900 are ground-mounted with capacities of 1 MW or more, specified as large scale solar photovoltaic (LSPV) facilities. LSPV ground-mounted installations continue to grow, with over 400 projects coming online in 2021 alone. Currently, a comprehensive, publicly available georectified dataset describing the locations and spatial footprints of these facilities does not exist. Analysts from the US Geological Survey and Lawrence Berkeley National Laboratory collaborated to develop and release the United States Large Scale Solar Photovoltaic Database (USPVDB). This effort built from the expertise gained while developing the regularly updated United States Wind Turbine Database (USWTDB). Starting from Energy Information Administration (EIA) data, locations of LSPV facilities were visually verified using high-resolution aerial imagery; a polygon was drawn around the extent of facility panel arrays, and facility attributes were appended. Quality assurance and control were achieved via team peer review and by comparing the USPVDB to other datasets of US PV. The data are available in several formats, including an interactive web application, comma-separated value spreadsheet (CSV), application programming interface (API), and a shapefile. The data are available for use by academic researchers, engineers and developers from PV companies, government agencies, planners, educators, and the general public.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains the Papyrus dataset, an aggregated dataset of small molecule bioactivities, as described in the manuscript "Papyrus - A large scale curated dataset aimed at bioactivity predictions" (Work in Progress).
With the recent rapid growth of publicly available ligand-protein bioactivity data, there is a trove of viable data that can be used to train machine learning algorithms. However, not all data is equal in terms of size and quality, and a significant portion of researcher’s time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own. As an answer to that, we have constructed the Papyrus dataset, comprised of around 60 million datapoints. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with smaller datasets containing high quality data. This aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways, and also perform some rudimentary quantitative structure-activity relationship and proteochemometrics modeling. Our ambition is to create a benchmark set that can be used for constructing predictive models, while also providing a solid baseline for related research.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Japan Large Scale Retail Stores: Existing Stores Adj Sales Ratio (ESASR) data was reported at 100.400 % in Sep 2018. This records an increase from the previous number of 99.900 % for Aug 2018. Japan Large Scale Retail Stores: Existing Stores Adj Sales Ratio (ESASR) data is updated monthly, averaging 98.400 % from Jan 1988 (Median) to Sep 2018, with 369 observations. The data reached an all-time high of 123.800 % in Mar 1989 and a record low of 85.100 % in Mar 1998. Japan Large Scale Retail Stores: Existing Stores Adj Sales Ratio (ESASR) data remains active status in CEIC and is reported by Ministry of Economy, Trade and Industry. The data is categorized under Global Database’s Japan – Table JP.H005: Large Scale Retail Stores: Sales and Commodity Stock Value.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Journal of Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109
Abstract
The COVID-19 Omicron variant, reported to be the most immune evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations, centered around information seeking and sharing, related to online learning. Mining such conversations, such as Tweets, to develop a dataset can serve as a data resource for interdisciplinary research related to the analysis of interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter and the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description
The dataset comprises a total of 52,984 Tweet IDs (that correspond to the same number of Tweets) about online learning that were posted on Twitter from 9th November 2021 to 13th July 2022. The earliest date was selected as 9th November 2021, as the Omicron variant was detected for the first time in a sample that was collected on this date. 13th July 2022 was the most recent date as per the time of data collection and publication of this dataset.
The dataset consists of 9 .txt files. An overview of these dataset files along with the number of Tweet IDs and the date range of the associated tweets is as follows. Table 1 shows the list of all the synonyms or terms that were used for the dataset development.
Filename: TweetIDs_November_2021.txt (No. of Tweet IDs: 1283, Date Range of the associated Tweet IDs: November 1, 2021 to November 30, 2021)
Filename: TweetIDs_December_2021.txt (No. of Tweet IDs: 10545, Date Range of the associated Tweet IDs: December 1, 2021 to December 31, 2021)
Filename: TweetIDs_January_2022.txt (No. of Tweet IDs: 23078, Date Range of the associated Tweet IDs: January 1, 2022 to January 31, 2022)
Filename: TweetIDs_February_2022.txt (No. of Tweet IDs: 4751, Date Range of the associated Tweet IDs: February 1, 2022 to February 28, 2022)
Filename: TweetIDs_March_2022.txt (No. of Tweet IDs: 3434, Date Range of the associated Tweet IDs: March 1, 2022 to March 31, 2022)
Filename: TweetIDs_April_2022.txt (No. of Tweet IDs: 3355, Date Range of the associated Tweet IDs: April 1, 2022 to April 30, 2022)
Filename: TweetIDs_May_2022.txt (No. of Tweet IDs: 3120, Date Range of the associated Tweet IDs: May 1, 2022 to May 31, 2022)
Filename: TweetIDs_June_2022.txt (No. of Tweet IDs: 2361, Date Range of the associated Tweet IDs: June 1, 2022 to June 30, 2022)
Filename: TweetIDs_July_2022.txt (No. of Tweet IDs: 1057, Date Range of the associated Tweet IDs: July 1, 2022 to July 13, 2022)
The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset the Hydrator application (link to download and a step-by-step tutorial on how to use Hydrator) may be used.
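As a quick consistency check before hydration, the per-file counts listed above can be compared against the stated total of 52,984 Tweet IDs. This is a minimal sketch, assuming each .txt file stores one Tweet ID per line (the description does not state the file layout); read_tweet_ids is a hypothetical helper, not part of the dataset.

```python
from pathlib import Path
from typing import Dict, List

# Number of Tweet IDs per file, copied from the file overview above.
FILE_COUNTS: Dict[str, int] = {
    "TweetIDs_November_2021.txt": 1283,
    "TweetIDs_December_2021.txt": 10545,
    "TweetIDs_January_2022.txt": 23078,
    "TweetIDs_February_2022.txt": 4751,
    "TweetIDs_March_2022.txt": 3434,
    "TweetIDs_April_2022.txt": 3355,
    "TweetIDs_May_2022.txt": 3120,
    "TweetIDs_June_2022.txt": 2361,
    "TweetIDs_July_2022.txt": 1057,
}


def read_tweet_ids(path: str) -> List[str]:
    """Read Tweet IDs from a file, assuming one ID per non-empty line.

    IDs are kept as strings so that no digits are lost to numeric conversion.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# The nine per-file counts add up to the total stated in the description.
total = sum(FILE_COUNTS.values())
busiest_file = max(FILE_COUNTS, key=FILE_COUNTS.get)
```

Here total evaluates to 52984, matching the Data Description, and busiest_file is the January 2022 file. After downloading, len(read_tweet_ids(name)) can be compared with FILE_COUNTS[name] for each file to confirm that nothing was truncated.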
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology | List of synonyms and terms
COVID-19 | Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus
online learning | online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures