79 datasets found
  1. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Silicon Orchard Lab, Bangladesh
    Independent University, Bangladesh
    University of Memphis, USA
    Authors
    Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, United States
    Description

    Introduction

    There are several works based on Natural Language Processing on newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.

    However, not much NLP research has been invested in studying COVID-19. Most applications involve the classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1,000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named that collection “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All of these newspapers are among the top providers and most widely read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferable to automation to ensure that the collected news was highly relevant to the subject: the newspapers' online sites had dynamic content with advertisements in no particular order, so automated scrapers had a high chance of collecting inaccurate news reports. One challenge while collecting the data was the requirement of a subscription; each newspaper required $1 per subscription. The criteria provided as guidelines to the human data collectors for collecting the news reports were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.

    Avoid taking duplicate reports.

    Maintain a time frame for the above mentioned newspapers.

    To collect these data, we used a Google Form for both the USA and BD collections. Two human editors went through each entry to check for spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check each entry against the above-mentioned criteria.
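
    A minimal Python sketch of these pre-processing steps (an illustration, not the authors' original script; it assumes NLTK with its stop-word and WordNet resources installed):

    ```python
    # Sketch of the pre-processing pipeline described above: remove hyperlinks,
    # strip non-English alphanumeric characters, drop stop words, lemmatize.
    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(text: str) -> str:
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # remove hyperlinks
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)               # keep English alphanumerics only
        tokens = [t.lower() for t in text.split()]
        tokens = [t for t in tokens if t not in STOP_WORDS]       # remove stop words
        return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)  # lemmatize

    print(preprocess("COVID-19 cases are rising; see https://example.com for details."))
    ```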

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics

    No. of words per headline: 7 to 20
    No. of words per body content: 150 to 2100

    Table 2: Covid-News-BD-NNK data statistics

    No. of words per headline: 10 to 20
    No. of words per body content: 100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing. The two most prominent are BERT [14], which uses the bidirectional encoder of the Transformer architecture and can perform near-perfect classification and next-word prediction, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are typically used as pre-trained models, since training them carries a huge computation cost.

    Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of an image; information extraction from speech could mean retrieving information about names, places, etc. [16]. Information extraction from text could mean identifying named entities, locations, or other essential data.

    Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].

    Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses a graph to calculate a weight for each word and picks the words with the highest weights.
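
    As a simplified, hedged illustration of this graph-based idea (not the exact TextRank algorithm from [18]), words can be linked by co-occurrence within a sliding window and ranked with PageRank:

    ```python
    # Toy TextRank-style keyword extraction: build a word co-occurrence graph
    # and rank nodes with PageRank; the highest-ranked words are the keywords.
    import networkx as nx

    def extract_keywords(text: str, window: int = 3, top_k: int = 5) -> list[str]:
        words = [w.lower() for w in text.split() if w.isalpha() and len(w) > 3]
        graph = nx.Graph()
        for i in range(len(words)):
            for j in range(i + 1, min(i + window, len(words))):
                graph.add_edge(words[i], words[j])   # co-occurrence edge
        scores = nx.pagerank(graph)                  # weight of each word
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    print(extract_keywords(
        "Coronavirus outbreak forces government response as economy slows and markets fall"
    ))
    ```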

    Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

    4 Our Experiments and Result Analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month, from February to May. From Figures 1, 2, and 3, we can point out a few observations (a word-cloud generation sketch follows this list):

    In February, both newspapers talked about China and the source of the outbreak.

    StarTribune emphasized Minnesota as the most concerned state; in April, this concern appeared to intensify.

    Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.

    The Washington Post discussed global issues more than StarTribune.

    In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both newspapers mentioned the outbreak in China in February, the spread within the United States is more heavily highlighted throughout March to May, displaying the critical impact caused by the virus.
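
    A hedged sketch of the word-cloud generation with the wordcloud library mentioned above (the file name and the per-month grouping are illustrative assumptions, not the authors' actual pipeline):

    ```python
    # Generate one word cloud per month from the collected report texts.
    from wordcloud import STOPWORDS, WordCloud

    def monthly_wordcloud(reports: list[str], out_png: str) -> None:
        text = " ".join(reports)                      # all reports for one month
        wc = WordCloud(width=800, height=400,
                       background_color="white",
                       stopwords=STOPWORDS).generate(text)
        wc.to_file(out_png)                           # output image file (hypothetical name)

    monthly_wordcloud(
        ["Minnesota reports new coronavirus cases as masks become mandatory"],
        "usa_april.png",
    )
    ```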

    We used a script to extract all numbers related to certain keywords like 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', 'Diagnosed', etc. from the news reports and created a count of cases for both newspapers. Figure 4 shows the statistics of this series. From this extraction, we can observe that April was the peak month for COVID-19 cases, as the counts gradually rose from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the outbreak.

    We used VADER sentiment analysis to extract the sentiment of the headlines and the body text. On average, the sentiments ranged from -0.5 to -0.9; the VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and the body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us in sorting the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.

    We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
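
    A minimal sketch of the sentiment scoring described above, using the vaderSentiment package (illustrative only; the example texts are invented):

    ```python
    # VADER's "compound" score ranges from -1 (highly negative) to 1 (highly positive).
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    headline = "Coronavirus deaths surge as hospitals run out of beds"
    body = "Officials say new measures are slowing the spread and more patients are recovering."

    print(analyzer.polarity_scores(headline)["compound"])  # compound score of the headline
    print(analyzer.polarity_scores(body)["compound"])      # compound score of the body
    ```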

  2. Complete Rxivist dataset of scraped biology preprint data

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin
    Updated Mar 2, 2023
    + more versions
    Cite
    Richard J. Abdill; Ran Blekhman (2023). Complete Rxivist dataset of scraped biology preprint data [Dataset]. http://doi.org/10.5281/zenodo.7688682
    Available download formats: bin
    Dataset updated
    Mar 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard J. Abdill; Ran Blekhman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import data into a PostgreSQL database server.
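
    As a hedged sketch (not part of the Rxivist README), once the backup has been restored one might query it from Python with psycopg2; the "articles" table and its "repo" column are referenced in the version notes below, and the connection details are placeholders:

    ```python
    # Count indexed preprints per source repository (bioRxiv vs. medRxiv).
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="rxivist",
                            user="postgres", password="postgres")
    with conn, conn.cursor() as cur:
        # If the dump restores tables under a schema, prefix it accordingly.
        cur.execute("SELECT repo, COUNT(*) FROM articles GROUP BY repo;")
        for repo, n in cur.fetchall():
            print(repo, n)
    conn.close()
    ```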

    Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.

    Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.

    Version notes:

    • 2023-03-01
      • The final Rxivist data upload, more than four years after the first and encompassing 223,541 preprints posted to bioRxiv and medRxiv through the end of February 2023.
    • 2020-12-07
      • In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.
        • The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".
      • We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.
        • The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.
      • Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.
        • The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.
        • Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.
        • There are still several gaps that are more than a week long due to missing data from Crossref.
        • We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101," which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.
      • The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.
    • 2020-11-18
      • Publication checks should be back on schedule.
    • 2020-10-26
      • This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.
    • 2020-09-15
      • A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.
    • 2020-09-08
      • This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.
    • 2019-12-27
      • Several dozen papers did not have dates associated with them; that has been fixed.
      • Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.
    • 2019-11-29
      • The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.
    • 2019-10-31
      • The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.
      • The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.
    • 2019-10-01
      • The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.
      • About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.
    • 2019-08-30
      • The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.
    • 2019-07-01
      • A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.
        • We began collecting this data in the middle of May, but it has not been applied to older papers yet.
    • 2019-05-11
      • The README was updated to correct a link to the Docker repository used for the pre-built images.
    • 2019-03-21
      • The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.
      • A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)
      • Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.
      • The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.
      • The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.
      • The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.
    • 2019-02-13.1
      • After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.
      • The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.
    • 2019-02-13
      • The redundant "paper" schema has been removed.
      • BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any digression will be noted here.
      • This is the first version that has a corresponding Docker image.
  3. Papers by Subject

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Arman Mohammadi (2023). Papers by Subject [Dataset]. https://www.kaggle.com/datasets/arplusman/papers-by-subject
    Available download formats: zip (18101984 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    Arman Mohammadi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    This extensive dataset comprises approximately 50,000 academic papers along with their corresponding metadata, designed to facilitate various natural language processing (NLP) tasks such as classification and retrieval. The dataset covers a diverse range of research domains, including but not limited to computer science, biology, social sciences, engineering, and more. The list of all categories can be found here. With its comprehensive collection of academic papers and enriched metadata, this dataset serves as a valuable resource for researchers and data enthusiasts interested in advancing NLP applications in the academic domain.

    Key Features

    Metadata: The dataset includes essential metadata for each paper, such as the publish date, title, summary/abstract, author(s), and category. The metadata is meticulously curated to ensure accuracy and consistency, enabling researchers to swiftly extract valuable insights and conduct exploratory data analysis.

    Vast Paper Collection: With nearly 50,000 academic papers, this dataset encompasses a broad spectrum of research topics and domains, making it suitable for a wide range of NLP tasks, including but not limited to document classification, topic modeling, and document retrieval.

    Application Flexibility: The dataset is meticulously preprocessed and annotated, making it adaptable for various NLP applications. Researchers and practitioners can use it for tasks like sentiment analysis, keyword extraction, and more.

    Potential Use Cases

    Document Classification: Leverage this dataset to build powerful classifiers capable of categorizing academic papers into relevant research domains or topics. This can aid in automated content organization and information retrieval.

    Document Retrieval: Develop efficient retrieval models that can quickly identify and retrieve relevant papers based on user queries or specific keywords. Such models can streamline the research process and assist researchers in finding relevant literature faster.

    Topic Modeling: Use this dataset to perform topic modeling and extract meaningful topics or themes present within the academic papers. This can provide valuable insights into the prevailing research trends and interests within different disciplines.

    Recommendation Systems: Employ the dataset to build personalized recommendation systems that suggest relevant papers to researchers based on their previous interests or research focus.
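
    As a minimal, hedged illustration of the document-classification use case above (the file name and the "title", "summary", and "category" column names are assumptions; adjust them to the actual CSV layout):

    ```python
    # TF-IDF + logistic regression baseline for predicting a paper's category.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("papers.csv")                        # hypothetical file name
    X = (df["title"] + " " + df["summary"]).fillna("")
    y = df["category"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = make_pipeline(TfidfVectorizer(max_features=50_000, stop_words="english"),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    ```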

    Acknowledgment

    We would like to express our gratitude to the authors and publishers of the academic papers included in this dataset for their valuable contributions to the research community. By making this dataset publicly available, we hope to foster advancements in natural language processing and support data-driven research across diverse domains.

    Disclaimer

    As the curators of this dataset, we have made every effort to ensure the accuracy and quality of the data. However, we cannot guarantee the absolute correctness of the information or the suitability of the dataset for any specific purpose. Users are encouraged to exercise their judgment and discretion while utilizing the dataset for their research projects.

    We sincerely hope that this dataset proves to be a valuable resource for the NLP community and contributes to the development of innovative solutions in academic research and beyond. Happy analyzing and modeling!

  4. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Available download formats: utf-8, csv
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
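
    A hedged sketch of combining the two approaches (the short country list is illustrative; the project used the full ISO 3166 name set, and this is not the authors' code):

    ```python
    # Approach 1: regular-expression search over known country names.
    # Approach 2: spaCy named entity recognition, keeping GPE entities.
    import re

    import spacy

    COUNTRY_NAMES = ["Kenya", "Brazil", "Indonesia", "United States"]   # illustrative subset
    COUNTRY_RE = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRY_NAMES)) + r")\b")

    nlp = spacy.load("en_core_web_sm")   # assumes this small English model is installed

    def countries_of_study(title: str, abstract: str) -> set[str]:
        text = f"{title} {abstract}"
        found = set(COUNTRY_RE.findall(text))                   # regex matches
        found |= {ent.text for ent in nlp(text).ents
                  if ent.label_ == "GPE" and ent.text in COUNTRY_NAMES}  # NER matches
        return found

    print(countries_of_study("Household survey evidence from Kenya",
                             "We analyze census data collected in Kenya and Brazil."))
    ```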


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. Identifying whether an academic article is using data from any country.

    2. Identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that indicate a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to MTurk workers).


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 were fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
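
    A hedged sketch of this kind of classifier using the Hugging Face transformers implementation of DistilBERT (this illustrates the approach, not the authors' training code; in practice the classification head would first be fine-tuned on the labelled articles):

    ```python
    # DistilBERT encoder + classification head; apply the 90% confidence threshold.
    import torch
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)   # head is untrained until fine-tuned
    model.eval()

    def uses_data(abstract: str, threshold: float = 0.90) -> bool:
        inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        return probs[0, 1].item() >= threshold     # label 1 = "uses data" (assumed ordering)

    print(uses_data("We estimate a regression model using household survey data from 2015."))
    ```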


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We treat the human raters as providing the ground truth. This may underestimate the model performance if the raters at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both the humans and the model make the same kinds of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  5. Data from: Turfgrass Soil Carbon Change Through Time: Raw Data and Code

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Turfgrass Soil Carbon Change Through Time: Raw Data and Code [Dataset]. https://catalog.data.gov/dataset/turfgrass-soil-carbon-change-through-time-raw-data-and-code-1dd62
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Data Description

    Managed turfgrass is a common component of urban landscapes that is expanding under current land use trends. Previous studies have reported high rates of soil carbon sequestration in turfgrass, but no systematic review has summarized these rates nor evaluated how they change as turfgrass ages. We conducted a meta-analysis of soil carbon sequestration rates from 63 studies. Those data, as well as the code used to analyze them and create figures, are shared here.

    Dataset Development

    We conducted a systematic review from Nov 2020 to Jan 2021 using Google Scholar, Web of Science, and the Michigan Turfgrass Information File Database. The search terms targeted were "soil carbon", "carbon sequestration", "carbon storage", or "carbon stock", with "turf", "turfgrass", "lawn", "urban ecosystem", "residential", "Fescue", "Zoysia", "Poa", "Cynodon", "Bouteloua", "Lolium", or "Agrostis". We included only peer-reviewed studies written in English that measured SOC change over one year or longer, and where grass was managed as turf (mowed or clipped regularly). We included studies that sampled to any soil depth and included several methodologies: small-plot research conducted over a few years (22 datasets from 4 articles), chronosequences of golf courses or residential lawns (39 datasets from 16 articles), and one study that was a variation on a chronosequence method and compiled long-term soil test data provided by golf courses of various ages (3 datasets from Qian & Follett, 2002). In total, 63 datasets from 21 articles met the search criteria. We excluded 1) duplicate reports of the same data, 2) small-plot studies that did not report baseline SOC stocks, and 3) pure modeling studies. We included five papers that only measured changes in SOC concentrations, but not areal stocks (i.e., SOC in Mg ha-1). For these papers, we converted from concentrations to stocks using several approaches. For two papers (Law & Patton, 2017; Y. Qian & Follett, 2002) we used estimated bulk densities provided by the authors. For the chronosequences reported in Selhorst & Lal (2011), we used the average bulk density reported by the author. For the 13 chronosequences reported in Selhorst & Lal (2013), we estimated bulk density from the average relationship between percent C and bulk density reported by Selhorst (2011). For Wang et al. (2014), we used bulk density values from official soil survey descriptions.

    Data Provenance

    In most cases we contacted authors of the studies to obtain the original data. If authors did not reply after two inquiries, or no longer had access to the data, we captured data from published figures using WebPlotDigitizer (Rohatgi, 2021). For three manuscripts the data was already available, or partially available, in public data repositories. Data provenance information is provided in the document "Dataset summaries and citations.docx".

    Recommended Uses

    We recommend the following to data users:

    Consult and cite the original manuscripts for each dataset, which often provide additional information about turfgrass management, experimental methods, and environmental context. Original citations are provided in the document "Dataset summaries and citations.docx".

    For datasets that were previously published in public repositories, consult and cite the original datasets, which may provide additional data on turfgrass management practices, soil nitrogen, and natural reference sites. Links to repositories are in the document "Dataset summaries and citations.docx".

    Consider contacting the dataset authors to notify them of your plans to use the data, and to offer co-authorship as appropriate.

  6. Data from: Where do engineering students really get their information? : using reference list analysis to improve information literacy programs

    • opal.latrobe.edu.au
    • researchdata.edu.au
    pdf
    Updated Mar 13, 2025
    Cite
    Clayton Bolitho (2025). Where do engineering students really get their information? : using reference list analysis to improve information literacy programs [Dataset]. http://doi.org/10.4225/22/59d45f4b696e4
    Available download formats: pdf
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    La Trobe
    Authors
    Clayton Bolitho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    An understanding of the resources which engineering students use to write their academic papers provides information about student behaviour as well as the effectiveness of information literacy programs designed for engineering students. One of the most informative sources of information which can be used to determine the nature of the material that students use is the bibliography at the end of the students' papers. While reference list analysis has been utilised in other disciplines, few studies have focussed on engineering students or used the results to improve the effectiveness of information literacy programs. Gadd, Baldwin and Norris (2010) found that civil engineering students undertaking a final-year research project cited journal articles more than other types of material, followed by books and reports, with web sites ranked fourth. Several studies, however, have shown that in their first year at least, most students prefer to use Internet search engines (Ellis & Salisbury, 2004; Wilkes & Gurney, 2009).

    PURPOSE

    The aim of this study was to find out exactly what resources undergraduate students studying civil engineering at La Trobe University were using, and in particular, the extent to which students were utilising the scholarly resources paid for by the library. A secondary purpose of the research was to ascertain whether information literacy sessions delivered to those students had any influence on the resources used, and to investigate ways in which the information literacy component of the unit can be improved to encourage students to make better use of the resources purchased by the Library to support their research.

    DESIGN/METHOD

    The study examined student bibliographies for three civil engineering group projects at the Bendigo Campus of La Trobe University over a two-year period, including two first-year units (CIV1EP – Engineering Practice) and one second-year unit (CIV2GR – Engineering Group Research). All units included a mandatory library session at the start of the project where student groups were required to meet with the relevant faculty librarian for guidance. In each case, the Faculty Librarian highlighted specific resources relevant to the topic, including books, e-books, video recordings, websites and internet documents. The students were also shown tips for searching the Library catalogue, Google Scholar, LibSearch (the LTU Library's research and discovery tool) and ProQuest Central. Subject-specific databases for civil engineering and science were also referred to. After the final reports for each project had been submitted and assessed, the Faculty Librarian contacted the lecturer responsible for the unit, requesting copies of the student bibliographies for each group. References for each bibliography were then entered into EndNote. The Faculty Librarian grouped them according to various facets, including the name of the unit and the group within the unit; the material type of the item being referenced; and whether the item required a Library subscription to access it. A total of 58 references were collated for the 2010 CIV1EP unit; 237 references for the 2010 CIV2GR unit; and 225 references for the 2011 CIV1EP unit.

    INTERIM FINDINGS

    The initial findings showed that student bibliographies for the three group projects were primarily made up of freely available internet resources which required no library subscription. For the 2010 CIV1EP unit, all 58 resources used were freely available on the Internet. For the 2011 CIV1EP unit, 28 of the 225 resources used (12.44%) required a Library subscription or purchase for access, while the second-year students (CIV2GR) used a greater variety of resources, with 71 of the 237 resources used (29.96%) requiring a Library subscription or purchase for access. The results suggest that the library sessions had little or no influence on the 2010 CIV1EP group, but the sessions may have assisted students in the 2011 CIV1EP and 2010 CIV2GR groups to find books, journal articles and conference papers, which were all represented in their bibliographies.

    FURTHER RESEARCH

    The next step in the research is to investigate ways to increase the representation of scholarly references (found by resources other than Google) in student bibliographies. It is anticipated that such a change would lead to an overall improvement in the quality of the student papers. One way of achieving this would be to make it mandatory for students to include a specified number of journal articles, conference papers, or scholarly books in their bibliographies. It is also anticipated that embedding La Trobe University's Inquiry/Research Quiz (IRQ) using a constructively aligned approach will further enhance the students' research skills and increase their ability to find suitable scholarly material which relates to their topic. This has already been done successfully (Salisbury, Yager, & Kirkman, 2012).

    CONCLUSIONS & CHALLENGES

    The study shows that most students rely heavily on the free Internet for information. Students don't naturally use Library databases or scholarly resources such as Google Scholar to find information without encouragement from their teachers, tutors and/or librarians. It is acknowledged that the use of scholarly resources doesn't automatically lead to a high quality paper. Resources must be used appropriately, and students also need to have the skills to identify and synthesise key findings in the existing literature and relate these to their own paper. Ideally, students should be able to see the benefit of using scholarly resources in their papers, and continue to seek these out even when it's not a specific assessment requirement, though it can't be assumed that this will be the outcome.

    REFERENCES

    Ellis, J., & Salisbury, F. (2004). Information literacy milestones: building upon the prior knowledge of first-year students. Australian Library Journal, 53(4), 383-396.

    Gadd, E., Baldwin, A., & Norris, M. (2010). The citation behaviour of civil engineering students. Journal of Information Literacy, 4(2), 37-49.

    Salisbury, F., Yager, Z., & Kirkman, L. (2012). Embedding Inquiry/Research: Moving from a minimalist model to constructive alignment. Paper presented at the 15th International First Year in Higher Education Conference, Brisbane. Retrieved from http://www.fyhe.com.au/past_papers/papers12/Papers/11A.pdf

    Wilkes, J., & Gurney, L. J. (2009). Perceptions and applications of information literacy by first year applied science students. Australian Academic & Research Libraries, 40(3), 159-171.

  7. Forecasting Studies: Comprehensive Review

    • kaggle.com
    zip
    Updated Feb 13, 2023
    Cite
    The Devastator (2023). Forecasting Studies: Comprehensive Review [Dataset]. https://www.kaggle.com/datasets/thedevastator/forecasting-studies-comprehensive-review
    Available download formats: zip (9107 bytes)
    Dataset updated
    Feb 13, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Forecasting Studies: Comprehensive Review

    A Survey of 100 Papers from the Last 40 Years

    By [source]

    About this dataset

    This dataset contains insightful data related to forecasting studies, highlighting the work of 100 influential papers from the last 40 years, hand-selected for their high citation levels and enduring importance. Through this dataset, you can analyze more than just the publication year or average citations per year; data points like the number of time series used, the methods employed, and the measures performed are also tracked. With this comprehensive review of forecasting studies across different domains, you can appreciate the lasting impact these papers have had on the field.


    How to use the dataset

    This dataset offers an in-depth study on forecasting studies conducted over the past forty years. With it, you can easily survey the extensive literature surrounding forecasting and get an understanding of how issues within this field have evolved and developed over time.

    To begin using this dataset, take a look at the columns included and gain a better understanding of what they mean:
    • Name: The name of each paper in the dataset.
    • Url: The URL of each paper.
    • Cites Per Year: The average number of citations for each paper per year. This is an indication that these papers are important and influential within the field.
    • Year: The year each paper was published. This can help in understanding how topics have changed over time with regard to forecasting practices and trends.
    • # Time Series: The number of time series used in each paper – useful if specific datasets are being surveyed across many publications, or if research topics need to be studied against particular datasets used over a period of time.
    • # Methods & # Measures: These indicate which methods or measures were used by different researchers at different points in time to investigate various aspects of forecasting studies – again helping with tracking changes across topics and areas related to forecasting.

    Once familiar with what information is contained within this dataset, users will be able to filter through the data more efficiently depending on their research needs, allowing them to access relevant results quickly without having to trawl through hundreds of volumes' worth of material.

    Research Ideas

    • Analyzing the trends in time series forecasting methods over the years.
    • Identifying commonly-used time series and methodologies for forecasting different types of events, such as stock prices or industry trends.
    • Identifying new methods or measures which have yielded higher accuracy levels than existing ones for specific types of forecast studies

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: reviewed articles.csv

    | Column name | Description |
    |:----------------|:------------------------------------------------------------------|
    | Name | The name of the paper. (String) |
    | Url | The URL of the paper. (String) |
    | Cites per year | The average number of citations per year for the paper. (Integer) |
    | Year | The year the paper was published. (Integer) |
    | # time series | The number of time series used in the paper. (Integer) |
    | # methods | The number of methods used in the paper. (Integer) |
    | # measures | The number of measures used in the paper. (Integer) |
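
    A small illustrative pandas sketch using the documented columns (it assumes the CSV sits in the working directory with exactly these column names):

    ```python
    # Summarize how forecasting papers have evolved using the documented columns.
    import pandas as pd

    df = pd.read_csv("reviewed articles.csv")

    # Average number of methods and measures per paper, by decade of publication.
    df["decade"] = (df["Year"] // 10) * 10
    print(df.groupby("decade")[["# methods", "# measures"]].mean())

    # The most-cited papers in the collection, by citations per year.
    print(df.sort_values("Cites per year", ascending=False)[["Name", "Cites per year"]].head())
    ```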


  8. Consumer Price Index 2022 - West Bank and Gaza

    • pcbs.gov.ps
    Updated May 18, 2023
    + more versions
    Cite
    Palestinian Central Bureau of Statistics (2023). Consumer Price Index 2022 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/717
    Dataset updated
    May 18, 2023
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (https://pcbs.gov/)
    Time period covered
    2022
    Area covered
    Gaza Strip, Gaza, West Bank
    Description

    Abstract

    The Consumer price surveys primarily provide the following:
    • Data on the CPI in Palestine covering the West Bank, Gaza Strip and Jerusalem J1 for major and sub-groups of expenditure.
    • Statistics needed for decision-makers, planners and those who are interested in the national economy.
    • Contribution to the preparation of quarterly and annual national accounts data.

    Consumer prices and indices are used for a wide range of purposes, the most important of which are as follows:
    • Adjustment of wages, government subsidies and social security benefits to compensate in part or in full for changes in living costs.
    • To provide an index to measure the price inflation of the entire household sector, which is used to eliminate the inflation impact of the components of the final consumption expenditure of households in national accounts and to remove the impact of price changes from income and national groups.
    • Price index numbers are widely used to measure inflation rates and economic recession.
    • Price indices are used by the public as a guide for the family with regard to its budget and its constituent items.
    • Price indices are used to monitor changes in the prices of the goods traded in the market and the consequent position of price trends, market conditions and living costs. However, the price index does not reflect other factors affecting the cost of living, e.g. the quality and quantity of purchased goods. Therefore, it is only one of many indicators used to assess living costs.
    • The index is used as a direct method to identify the purchasing power of money, where the purchasing power of money is inversely proportional to the price index.
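
    For illustration only (this is the standard fixed-base index-number formula, not text quoted from PCBS documentation), a Laspeyres-type CPI of the kind described above can be written as:

    ```latex
    % CPI for period t relative to base period 0, where p_{i,t} is the price of item i
    % in period t and q_{i,0} is its base-period quantity (equivalently, weight w_i).
    \mathrm{CPI}_t
      = \frac{\sum_i p_{i,t}\, q_{i,0}}{\sum_i p_{i,0}\, q_{i,0}} \times 100
      = \left( \sum_i w_i \, \frac{p_{i,t}}{p_{i,0}} \right) \times 100,
    \qquad
    w_i = \frac{p_{i,0}\, q_{i,0}}{\sum_j p_{j,0}\, q_{j,0}} .
    ```

    The inverse relationship with purchasing power noted above follows directly: if the index doubles, a fixed amount of money buys half as much of the base-period basket.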

    Geographic coverage

    Palestine West Bank Gaza Strip Jerusalem

    Analysis unit

    The target population for the CPI survey is the shops and retail markets such as grocery stores, supermarkets, clothing shops, restaurants, public service institutions, private schools and doctors.

    Universe

    The target population for the CPI survey is the shops and retail markets such as grocery stores, supermarkets, clothing shops, restaurants, public service institutions, private schools and doctors.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A non-probability purposive sample of sources from which the prices of different goods and services are collected was updated based on the establishment census 2017, in a manner that achieves full coverage of all goods and services that fall within the Palestinian consumer system. These sources were selected based on the availability of the goods within them. It is worth mentioning that the sample of sources was selected from the main cities inside Palestine: Jenin, Tulkarm, Nablus, Qalqiliya, Ramallah, Al-Bireh, Jericho, Jerusalem, Bethlehem, Hebron, Gaza, Jabalia, Dier Al-Balah, Nusseirat, Khan Yunis and Rafah. The selection of these sources was considered to be representative of the variation that can occur in the prices collected from the various sources. The number of goods and services included in the CPI is approximately 730 commodities, whose prices were collected from 3,200 sources. (COICOP) classification is used for consumer data as recommended by the United Nations System of National Accounts (SNA-2008).

    Sampling deviation

    Not applicable

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    A tablet-supported electronic form was designed for price surveys to be used by the field teams in collecting data from different governorates, with the exception of Jerusalem J1. The electronic form is supported with GIS and GPS mapping techniques that allow the field workers to locate the outlets exactly on the map and allow the administrative staff to manage the field remotely. The electronic questionnaire is divided into a number of screens, namely:
    • First screen: shows the metadata for the data source: governorate name, governorate code, source code, source name, full source address, and phone number.
    • Second screen: shows the source interview result, which is either completed, temporarily paused or permanently closed. It also shows the change activity as incomplete or rejected, with an explanation of the reason for rejection.
    • Third screen: shows the item code, item name, item unit, item price, product availability, and reason for unavailability.
    • Fourth screen: checks the price data of the related source and verifies their validity through the auditing rules designed specifically for the price programs.
    • Fifth screen: saves and sends data through a VPN connection and Wi-Fi technology.

    In case of the Jerusalem J1 Governorate, a paper form has been designed to collect the price data so that the form in the top part contains the metadata of the data source and in the lower section contains the price data for the source collected. After that, the data are entered into the price program database.

    Cleaning operations

    The price survey forms were already encoded by the project management depending on the specific international statistical classification of each survey. After the researcher collected the price data and sent them electronically, the data was reviewed and audited by the project management. Achievement reports were reviewed on a daily and weekly basis. Also, the detailed price reports at data source levels were checked and reviewed on a daily basis by the project management. If there were any notes, the researcher was consulted in order to verify the data and call the owner in order to correct or confirm the information.

    At the end of the data collection process in all governorates, the data will be edited using the following process:
    • Logical revision of prices by comparing the prices of goods and services with others from different sources and other governorates. Whenever a mistake is detected, it should be returned to the field for correction.
    • Mathematical revision of the average prices for items in governorates and the general average in all governorates.
    • Field revision of prices through selecting a sample of the prices collected from the items.

    Response rate

    Not applicable

    Sampling error estimates

    The findings of the survey may be affected by sampling errors due to the use of samples in conducting the survey rather than total enumeration of the units of the target population, which increases the chances of variances between the actual values we expect to obtain from the data if we had conducted the survey using total enumeration. The computation of differences between the most important key goods showed that the variation of these goods differs due to the specialty of each survey. The variance of the key goods in the computed and disseminated CPI survey that was carried out on the Palestine level was for reasons related to sample design and variance calculation of different indicators since there was a difficulty in the dissemination of results by governorates due to lack of weights.

    Non-sampling errors are probable at all stages of data collection or data entry. Non-sampling errors include: Non-response errors: the selected sources demonstrated a significant cooperation with interviewers; so, there wasn't any case of non-response reported during 2019. Response errors (respondent), interviewing errors (interviewer), and data entry errors: to avoid these types of errors and reduce their effect to a minimum, project managers adopted a number of procedures, including the following: More than one visit was made to every source to explain the objectives of the survey and emphasize the confidentiality of the data. The visits to data sources contributed to empowering relations, cooperation, and the verification of data accuracy.

    Interviewer errors: a number of procedures were taken to ensure data accuracy throughout the process of field data compilation: Interviewers were selected based on educational qualification, competence, and assessment. Interviewers were trained theoretically and practically on the questionnaire. Meetings were held to remind interviewers of instructions. In addition, explanatory notes were supplied with the surveys.

    A number of procedures were taken to verify data quality and consistency and ensure data accuracy for the data collected by a questioner throughout processing and data entry (knowing that data collected through paper questionnaires did not exceed 5%): Data entry staff was selected from among specialists in computer programming and were fully trained on the entry programs. Data verification was carried out for 10% of the entered questionnaires to ensure that data entry staff had entered data correctly and in accordance with the provisions of the questionnaire. The result of the verification was consistent with the original data to a degree of 100%. The files of the entered data were received, examined, and reviewed by project managers before findings were extracted. Project managers carried out many checks on data logic and coherence, such as comparing the data of the current month with that of the previous month, and comparing the data of sources and between governorates. Data collected by tablet devices were checked for consistency and accuracy by applying rules at item level.

    Data appraisal

    Other technical procedures to improve data quality: Seasonal adjustment processes and estimations of non-available items' prices: Under each category, a number of common items are used in Palestine to calculate the price levels and to represent the commodity within the commodity group. Of course, it is

  9. Business Activity Survey 2009 - Samoa

    • microdata.pacificdata.org
    Updated Jul 2, 2019
    Cite
    Samoa Bureau of Statistics (2019). Business Activity Survey 2009 - Samoa [Dataset]. https://microdata.pacificdata.org/index.php/catalog/253
    Explore at:
    Dataset updated
    Jul 2, 2019
    Dataset authored and provided by
    Samoa Bureau of Statistics
    Time period covered
    2009
    Area covered
    Samoa
    Description

    Abstract

    The intention is to collect data for the calendar year 2009 (or the nearest year for which each business keeps its accounts). The survey is considered a one-off survey, although for accurate NAs, such a survey should be conducted at least every five years to enable regular updating of the ratios, etc., needed to adjust the ongoing indicator data (mainly VAGST) to NA concepts. The questionnaire will be drafted by FSD, largely following the previous BAS, updated to current accounting terminology where necessary. The questionnaire will be pilot tested, using some accountants who are likely to complete a number of the forms on behalf of their business clients, and a small sample of businesses. Consultations will also include Ministry of Finance, Ministry of Commerce, Industry and Labour, Central Bank of Samoa (CBS), Samoa Tourism Authority, Chamber of Commerce, and other business associations (hotels, retail, etc.).

    The questionnaire will collect a number of items of information about the business ownership, locations at which it operates and each establishment for which detailed data can be provided (in the case of complex businesses), contact information, and other general information needed to clearly identify each unique business. The main body of the questionnaire will collect data on income and expenses, to enable value added to be derived accurately. The questionnaire will also collect data on capital formation, and will contain supplementary pages for relevant industries to collect volume of production data for selected commodities and to collect information to enable an estimate of value added generated by key tourism activities.

    The principal user of the data will be FSD which will incorporate the survey data into benchmarks for the NA, mainly on the current published production measure of GDP. The information on capital formation and other relevant data will also be incorporated into the experimental estimates of expenditure on GDP. The supplementary data on volumes of production will be used by FSD to redevelop the industrial production index which has recently been transferred under the SBS from the CBS. The general information about the business ownership, etc., will be used to update the Business Register.

    Outputs will be produced in a number of formats, including a printed report containing descriptive information of the survey design, data tables, and analysis of the results. The report will also be made available on the SBS website in “.pdf” format, and the tables will be available on the SBS website in excel tables. Data by region may also be produced, although at a higher level of aggregation than the national data. All data will be fully confidentialised, to protect the anonymity of all respondents. Consideration may also be made to provide, for selected analytical users, confidentialised unit record files (CURFs).

    A high level of accuracy is needed because the principal purpose of the survey is to develop revised benchmarks for the NA. The initial plan was that the survey would be conducted as a stratified sample survey, with full enumeration of large establishments and a sample of the remainder.

    Geographic coverage

    National Coverage

    Analysis unit

    The main statistical unit to be used for the survey is the establishment. For simple businesses that undertake a single activity at a single location there is a one-to-one relationship between the establishment and the enterprise. For large and complex enterprises, however, it is desirable to separate each activity of an enterprise into establishments to provide the most detailed information possible for industrial analysis. The business register will need to be developed in such a way that it records the links between establishments and their parent enterprises. The business register will be created from administrative records and may not have enough information to recognize all establishments of complex enterprises. Large businesses will be contacted prior to the survey post-out to determine if they have separate establishments. If so, the extended structure of the enterprise will be recorded on the business register and a questionnaire will be sent to the enterprise to be completed for each establishment.

    SBS has decided to follow the New Zealand simplified version of its statistical units model for the 2009 BAS. Future surveys may consider location units and enterprise groups if they are found to be useful for statistical collections.

    It should be noted that while establishment data may enable the derivation of detailed benchmark accounts, it may be necessary to aggregate up to enterprise level data for the benchmarks if the ongoing data used to extrapolate the benchmark forward (mainly VAGST) are only available at the enterprise level.

    Universe

    The BAS covered all employing units and excluded small non-employing units such as the market sellers. The survey also excluded central government agencies engaged in public administration (ministries, public education and health, etc.). It only covered businesses that pay VAGST (threshold SAT$75,000 and upwards).

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    • Total sample size was 1,240 units.
    • Of the 1,240, 902 successfully completed the questionnaire.
    • The remaining 338 either never responded or were omitted (some businesses were omitted from the sample as they did not meet the requirements to be surveyed).
    • Selection covered all employing units paying VAGST (threshold SAT$75,000 upwards).
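    For reference, the implied response rate follows directly from the counts above; the short Python sketch below simply restates that arithmetic (the rate itself is not quoted in the original documentation).

        # Implied response rate for the 2009 BAS, using the counts quoted above.
        total_sample = 1240                       # selected units
        completed = 902                           # questionnaires successfully completed
        not_completed = total_sample - completed  # 338 non-responding or omitted units, as stated

        response_rate = completed / total_sample
        print(f"Implied response rate: {response_rate:.1%}")  # roughly 72.7%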


    Mode of data collection

    Mail Questionnaire [mail]

    Research instrument

    1. General instructions, authority for the survey, etc;
    2. Business demography information on ownership, contact details, structure, etc.;
    3. Employment;
    4. Income;
    5. Expenses;
    6. Inventories;
    7. Profit or loss and reconciliation to business accounts' profit and loss;
    8. Fixed assets - purchases, disposals, book values
    9. Thank you and signature of respondent.

    Supplementary Pages

    Additional pages have been prepared to collect data for a limited range of industries.

    1. Production data. To rebase and redevelop the Industrial Production Index (IPI), it is intended to collect volume of production information from a selection of large manufacturing businesses. The selection of businesses and products is critical to the usefulness of the IPI. The products must be homogeneous, and be of enough importance to the economy to justify collecting the data. Significance criteria should be established for the selection of products to include in the IPI, and the 2009 BAS provides an opportunity to collect benchmark data for a range of products known to be significant (based on information in the existing IPI, CPI weights, export data, etc.) as well as open questions for respondents to provide information on other significant products.

    2. Tourism. There is a strong demand for estimates of tourism value added. To estimate tourism value added using the international standard Tourism Satellite Account methodology requires the use of an input-output table, which is beyond the capacity of SBS at present. However, some indicative estimates of the main parts of the economy influenced by tourism can be derived if the necessary data are collected. Tourism is a demand concept, based on defining tourists (the international standard includes both international and domestic tourists), what products are characteristically purchased by tourists, and which industries supply those products. Some questions targeted at those industries that have significant involvement with tourists (hotels, restaurants, transport and tour operators, vehicle hire, etc.), on how much of their income is sourced from tourism, would provide valuable indicators of the size of the direct impact of tourism.

    Cleaning operations

    Partial imputation was done at the time of receipt of questionnaires, after follow-up procedures to obtain fully completed questionnaires had been followed. Imputation applied ratios from responding units in the imputation cell to the partial data that was supplied. Procedures were established during the editing stage (a) to preserve the integrity of the questionnaires as supplied by respondents, and (b) to record all changes made to the questionnaires during editing. If SBS staff write on a form, for example, this should only be done in red pen, to distinguish the alterations from the original information.

    Additional edit checks were developed, including checking against external data at enterprise/establishment level. External data to be checked against include VAGST and SNPF for turnover and purchases, and salaries and wages and employment data respectively. Editing and imputation processes were undertaken by FSD using Excel.
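    The ratio imputation described above can be illustrated with a minimal sketch. The actual editing and imputation were done by FSD in Excel, so the column names, the choice of industry as the imputation cell, and the toy figures below are assumptions for illustration only.

        import pandas as pd

        # Toy example of ratio imputation within an imputation cell: missing expenses
        # are estimated from the expenses-to-income ratio of responding units in the
        # same cell (here, the industry code).
        df = pd.DataFrame({
            "industry": ["retail", "retail", "retail", "hotels", "hotels"],
            "income":   [100.0, 250.0, 180.0, 400.0, 320.0],
            "expenses": [60.0, 140.0, None, 260.0, None],   # None = item non-response
        })

        # Ratio of totals among responding units in each cell.
        resp = df.dropna(subset=["expenses"])
        ratios = resp.groupby("industry")["expenses"].sum() / resp.groupby("industry")["income"].sum()

        # Apply the cell ratio to the partial data that was supplied.
        missing = df["expenses"].isna()
        df.loc[missing, "expenses"] = df.loc[missing, "industry"].map(ratios) * df.loc[missing, "income"]
        print(df)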

    Sampling error estimates

    Not applicable

  10. August 2024 data-update for "Updated science-wide author databases of...

    • elsevier.digitalcommonsdata.com
    Updated Sep 16, 2024
    Cite
    John P.A. Ioannidis (2024). August 2024 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.7
    Explore at:
    Dataset updated
    Sep 16, 2024
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Separate data are shown for career-long impact and, separately, for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database), as well as citations to/from retracted papers, have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023, and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to the end of citation year 2023.

    This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list; it does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a
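    As a rough illustration of the co-authorship-adjusted hm-index mentioned above, the sketch below computes an h-index and a Schreiber-style hm-index from per-paper citation and author counts. This is not the database's own computation (which follows Ioannidis et al.); the toy numbers and the exact variant shown are assumptions for illustration only.

        def h_index(citations):
            # Largest h such that h papers have at least h citations each.
            cites = sorted(citations, reverse=True)
            return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

        def hm_index(citations, n_authors):
            # Co-authorship adjustment: each paper contributes 1/n_authors to the
            # effective rank; hm is the largest effective rank r such that the
            # paper at that rank has at least r citations.
            papers = sorted(zip(citations, n_authors), reverse=True)
            r_eff, hm = 0.0, 0.0
            for c, a in papers:
                r_eff += 1.0 / a
                if c >= r_eff:
                    hm = r_eff
            return hm

        # Toy example: five papers with citation counts and author counts.
        cites = [100, 50, 20, 8, 3]
        authors = [2, 5, 1, 3, 10]
        print(h_index(cites))            # 4
        print(round(hm_index(cites, authors), 2))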

  11. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow us to derive arguments that inform the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships in the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
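    The transformation and clustering steps described above were carried out in RapidMiner; purely as an illustrative Python equivalent (the file name, the one-hot encoding of nominal features, and the range of cluster counts are assumptions), the sketch below reduces the features to two principal components and inspects the drop in within-cluster variance for different numbers of clusters.

        import pandas as pd
        from sklearn.decomposition import PCA
        from sklearn.cluster import KMeans

        # Hypothetical export of the 35 manually engineered (nominal) features.
        papers = pd.read_csv("dl4se_features.csv")   # assumed file name
        X = pd.get_dummies(papers)                   # one-hot encode nominal attributes

        # Reduce to two components for visualization, as described above.
        coords = PCA(n_components=2).fit_transform(X)

        # Inspect the reduction in within-cluster variance (inertia) to choose k.
        for k in range(2, 11):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
            print(k, round(km.inertia_, 2))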

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = number of occurrences in which the statement is true, divided by the total number of statements.
    Confidence = the support of the statement divided by the number of occurrences of the premise.
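    The two definitions can be made concrete with a short sketch; the boolean columns below (supervised_learning, irreproducible) are hypothetical stand-ins for the extracted paper attributes, not the repository's actual column names.

        import pandas as pd

        # One row per surveyed paper; columns are hypothetical boolean attributes.
        papers = pd.DataFrame({
            "supervised_learning": [True, True, True, False, True, False],
            "irreproducible":      [True, True, False, False, True, True],
        })

        premise = papers["supervised_learning"]
        conclusion = papers["irreproducible"]
        both = premise & conclusion

        support = both.sum() / len(papers)       # occurrences where the rule holds / all statements
        confidence = both.sum() / premise.sum()  # support of the rule / occurrences of the premise
        print(f"support={support:.2f}, confidence={confidence:.2f}")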

  12. Student Messy Handwritten Dataset (SMHD)

    • research-repository.rmit.edu.au
    • researchdata.edu.au
    application/x-rar
    Updated Oct 16, 2023
    Cite
    Hiqmat Nisa; James Thom; Vic ciesielski; Ruwan Tennakoon (2023). Student Messy Handwritten Dataset (SMHD) [Dataset]. http://doi.org/10.25439/rmt.24312715.v1
    Explore at:
    application/x-rarAvailable download formats
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    RMIT University
    Authors
    Hiqmat Nisa; James Thom; Vic ciesielski; Ruwan Tennakoon
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications. Moreover, there exists a document detailing the transcription rules utilized for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.

    Dataset Description: We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and asked them to choose one article, read it, and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.

    In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset was collected from 250 high school students. We gave them 30 minutes to think about the topic and write for this task.

    In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.

    Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with other classmates. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class-notes compared to the summary-based and academic-based collections.

    In all four exercises, we did not impose any rules on the participants, for example, spacing, usage of a pen, etc. We asked them to cross out text if it seemed inappropriate. Although writers usually made corrections in a second read, we also gave an extra 5 minutes for correction purposes.
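    Because crossed-out content is marked with '#' in the transcription files, a quick pass over the repository can separate transcriptions with and without such corrections. The folder path below is an assumption; the '#' convention is the one described above.

        from pathlib import Path

        root = Path("SMHD")   # assumed location of the central repository
        with_crossouts, without = [], []

        for txt in root.rglob("*.txt"):
            text = txt.read_text(encoding="utf-8", errors="ignore")
            (with_crossouts if "#" in text else without).append(txt.name)

        print(f"{len(with_crossouts)} transcriptions contain crossed-out content")
        print(f"{len(without)} transcriptions contain none")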

  13. iCite Database Snapshot 2022-10

    • nih.figshare.com
    bin
    Updated Jun 4, 2023
    Cite
    iCite; B. Ian Hutchins; George Santangelo; Ehsanul Haque (2023). iCite Database Snapshot 2022-10 [Dataset]. http://doi.org/10.35092/yhjc21502470.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    iCite; B. Ian Hutchins; George Santangelo; Ehsanul Haque
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year than the median NIH funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many citations per year. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 citations or more, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more, and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    x_coord: X coordinate of the article on the Triangle of Biomedicine

    y_coord: Y Coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.
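    To make the relationship between these fields concrete, the sketch below recomputes the RCR from its documented numerator and denominator and compares it with the published value. The file name is an assumption standing in for the unzipped metadata CSV; the column names are those documented above.

        import pandas as pd

        df = pd.read_csv("icite_metadata.csv", nrows=100_000)   # assumed file name; sample for speed

        # RCR = citations_per_year / expected_citations_per_year (see the field notes above).
        recomputed = df["citations_per_year"] / df["expected_citations_per_year"]
        close = (recomputed - df["relative_citation_ratio"]).abs() < 0.01
        print(f"{close.mean():.1%} of sampled rows match the published RCR within 0.01")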

    Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.

    Comments and questions can be addressed to iCite@mail.nih.gov

  14. BUSUM-BNLP Dataset (Multi-Document Bangla Summary)

    • kaggle.com
    zip
    Updated Oct 11, 2023
    Cite
    Marwa_Nurtaj (2023). BUSUM-BNLP Dataset (Multi-Document Bangla Summary) [Dataset]. https://www.kaggle.com/datasets/marwanurtaj/busum-bnlp-dataset-multi-document-bangla-summary
    Explore at:
    zip(1048000 bytes)Available download formats
    Dataset updated
    Oct 11, 2023
    Authors
    Marwa_Nurtaj
    Area covered
    Büsum
    Description

    =====================================================================================

    BUSUM-BNLP DATASET

    =====================================================================================

    A Public Dataset for Multi-Document Update Summarization Task: To Improve the artificial intelligence-centered Information Retrieval Mission

    =====================================================================================

    This is the README file for the BUSUM-BNLP dataset. Good performance in NLP projects depends on high-quality datasets. For the multi-document update summarization task, we have researched some NLP datasets and also tried to generate a new one in Bangla. After researching various literature, we found many English datasets, such as DUC2002, DUC2007, Daily Mail, and the TAC dataset, and some Bangla datasets, such as the Bangla Summarization Dataset (Prothom Alo), the Bangla-Extra Sum Dataset, and the BNLPC Dataset. In several papers, DUC2002 and DUC2007 were used to generate update summaries, while Daily Mail was tested for extractive and abstractive summaries. However, little work has been done so far on Bengali summarization.

    =====================================================================================

    For our dataset, we have collected some old and new-dated articles of the same news from the online websites of Prothom Alo, Kaler Kontho, BBC News, Jugantor, etc. Both old and new news can contain similar information. We have sorted the old and new news and developed our latest dataset for applying our multi-document update summarization task.

    =====================================================================================

    Our multi-document Bangla dataset can be used for keyword detection in multiple files related to a specific topic. You can develop various summarization models, including updated summarizers and generic summarizers, using machine learning and deep learning techniques. This dataset can serve as a valuable resource for inspiring and generating new datasets by scraping news articles and other sources. It can also aid in developing domain-specific title detection models and extracting relevant features to improve accuracy.

    =====================================================================================

    One can follow the given methods for pre-processing the data:

    • (POS) Tagging: groups or organises text phrases according to language types such as nouns, verbs, adverbs, adjectives, etc.
    • Cleaning stop words: eliminates common words from a document that give no useful information, such as they, there, this, were, etc.
    • Discarding words or numerals containing digits: terms such as wordscapes59 or game5ts7 are tough to handle, so it is best to eliminate them or replace them with an empty string using regular expressions.
    • Erasing extra white spaces: the regular expression library works well to remove unnecessary extra spaces during pre-processing.
    • Cutting off punctuation: there are 32 major punctuation marks to consider. The string module and a regular expression can be used to replace any punctuation in the text with an empty string.
    • Converting to the same case: if the text is in the same case throughout, a computer can process the words more easily, since machines perceive lowercase and uppercase letters differently.
    • Named entity recognition: keywords in the text should be identified as entity labels (i.e., individual, place, company title, etc.).
    • Lemmatization or stemming: reduction of words to their base form (e.g., run is the base word of runs, running, ran) is performed by lemmatization (which matches words against a linguistic dictionary) and stemming (which removes suffixes and prefixes from the word).
    • Expanding contractions: words like don't, which signifies "do not", and aren't, which means "are not", are examples of contracted words. Sentence-processing tasks are easier if contractions are expanded.
    • Tokenization: breaks text streams into tokens, which can be words, phrases, symbols, or other significant pieces of information.
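    A few of the steps listed above (discarding tokens that contain digits, cutting off punctuation, erasing extra white spaces, case folding, and simple tokenization) can be expressed with regular expressions, as in the minimal sketch below. It is only a partial illustration: POS tagging, named entity recognition, and lemmatization for Bangla require dedicated language-specific tools.

        import re
        import string

        def preprocess(text: str) -> list[str]:
            text = re.sub(r"\S*\d\S*", " ", text)                                     # discard words/numerals containing digits
            text = text.translate(str.maketrans("", "", string.punctuation + "।"))    # cut off punctuation (incl. the Bangla danda)
            text = re.sub(r"\s+", " ", text).strip()                                  # erase extra white spaces
            text = text.lower()                                                       # same case (no effect on Bangla script)
            return text.split()                                                       # whitespace tokenization

        print(preprocess("wordscapes59 is noisy;   এই বাক্যটি পরিষ্কার।"))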

    ====================================================================================

    During our project, we encountered a few limitations that affected our data collection and modeling efforts. Firstly, to collect news on a particular topic, we had to put in a considerable amount of effort, as newspapers do not publish news in a serial manner every day. This meant that we had to read through entire texts to select relevant news articles for our dataset. Additionally, generating human-generated summaries was not an easy task, as we had to read lengthy documents and condense them into shorter summaries. However, due to time constraints and difficulties in r...

  15. Replication Kit: "On the Defect-Detection Capabilities of Unit and...

    • zenodo.org
    application/gzip, bin +1
    Updated Jan 24, 2020
    Cite
    Fabian Trautsch; Fabian Trautsch; Steffen Herbold; Jens Grabowski; Steffen Herbold; Jens Grabowski (2020). Replication Kit: "On the Defect-Detection Capabilities of Unit and Integration Tests" [Dataset]. http://doi.org/10.5281/zenodo.1256624
    Explore at:
    bin, csv, application/gzipAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Fabian Trautsch; Fabian Trautsch; Steffen Herbold; Jens Grabowski; Steffen Herbold; Jens Grabowski
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Replication Kit for the Paper "On the Defect-Detection Capabilities of Unit and Integration Tests"
    This additional material shall provide other researchers with the ability to replicate our results. Furthermore, we want to facilitate further insights that might be generated based on our data sets.

    Structure
    The structure of the replication kit is as follows:

    • additional_visualizations: contains additional visualizations (Venn-Diagrams) for each projects for each of the data sets that we used
    • data_analysis: contains two python scripts that we used to analyze our raw data (one for each research question)
    • data_collection_tools: contains all source code used for the data collection, including the used versions of the COMFORT framework, the BugFixClassifier, the script that was used to filter and collect the issues with their commits, and the used tools of the SmartSHARK environment
    • mongodb_no_authors: Archived dump of our MongoDB that we created by executing our data collection tools. The "comfort" database can be restored via the mongorestore command.
    • project_defects.csv: Raw data of our manual data collection process. It includes all collected defects, including the name of the issue, its description, and the commit references in which these issues were fixed. Furthermore, it includes information about which test failed or errored after re-integrating the defect into the analyzed release version of the project.


    Additional Visualizations
    We provide three additional visualizations for each project:

    1. Disjoint-Mutation-Data (visualizations for the DISJ data set)
    2. Mutation-Data (visualizations for the ALL data set)
    3. Seeded-Data (visualizations for the SEEDED data set)

    For each of these data sets there exists one visualization per project, showing six Venn-Diagrams: five of them present the different defect types and one is an overall Venn-Diagram. These Venn-Diagrams show the number of defects that were detected by either unit or integration tests (or both).

    Analysis scripts
    Requirements:

    Both python files contain all code for the statistical analysis we performed. The scripts can be executed via:

    python rq1_statistical_tests.py
    python rq2_statistical_tests.py

    Data Collection Tools
    We provide all data collection tools that we have implemented and used throughout our paper. Overall it contains six different projects and one python script:

    • BugFixClassifier: Used to classify our defects.
    • comfort-core: Core of the comfort framework. Used to classify our tests into unit and integration tests and calculate different metrics for these tests.
    • comfort-jacoco-listner: Used to intercept the coverage collection process as we were executing the tests of our case study projects.
    • filter_issues.py: Used to filter and collect issues with their commits (need the vcsSHARK and issueSHARK executed beforehand)
    • issueSHARK: Used to collect data from the ITSs of the projects.
    • pycoSHARK: Library that contains the models for the ORM mapper that is used inside the SmartSHARK environment.
    • vcsSHARK: Used to collect data from the VCSs of the projects.
  16. Research Papers Dataset

    • kaggle.com
    zip
    Updated May 8, 2023
    Cite
    NECHBA MOHAMMED (2023). Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/nechbamohammed/research-papers-dataset
    Explore at:
    zip(619131172 bytes)Available download formats
    Dataset updated
    May 8, 2023
    Authors
    NECHBA MOHAMMED
    Description

    Description: This dataset (Version 10) contains a collection of research papers along with various attributes and metadata. It is a comprehensive and diverse dataset that can be used for a wide range of research and analysis tasks. The dataset encompasses papers from different fields of study, including computer science, mathematics, physics, and more.

    Fields in the Dataset:
    • id: A unique identifier for each paper.
    • title: The title of the research paper.
    • authors: The list of authors involved in the paper.
    • venue: The journal or venue where the paper was published.
    • year: The year when the paper was published.
    • n_citation: The number of citations received by the paper.
    • references: A list of paper IDs that are cited by the current paper.
    • abstract: The abstract of the paper.

    Example:
    • "id": "013ea675-bb58-42f8-a423-f5534546b2b1"
    • "title": "Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors"
    • "authors": ["Leon A. Sakkal", "Kyle Z. Rajkowski", "Roger S. Armen"]
    • "venue": "Journal of Computational Chemistry"
    • "year": 2017
    • "n_citation": 0
    • "references": ["4f4f200c-0764-4fef-9718-b8bccf303dba", "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"]
    • "abstract": "This paper studies ..."
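    As a sketch of how the fields above might be used, the snippet below builds a simple in-dataset citation count from the references field. The assumption that records are stored one JSON object per line in a file named papers.jsonl is illustrative only; adjust the loading step to the actual distribution format.

        import json
        from collections import Counter

        cited_by = Counter()
        with open("papers.jsonl", encoding="utf-8") as f:   # assumed layout: one JSON record per line
            for line in f:
                paper = json.loads(line)
                for ref_id in paper.get("references", []):
                    cited_by[ref_id] += 1                   # count in-dataset citations per paper id

        print(cited_by.most_common(5))                      # most-cited paper ids within the dataset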

    Cite: https://www.aminer.cn/citation

  17. Convergent evolution of life-history traits in Mycalesina butterflies

    • figshare.com
    txt
    Updated Sep 28, 2020
    Cite
    Sridhar Halali; Erik van Bergen; Casper J Breuker; Paul M Brakefield; Oskar Brattström (2020). Convergent evolution of life-history traits in Mycalesina butterflies [Dataset]. http://doi.org/10.6084/m9.figshare.13001756.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Sep 28, 2020
    Dataset provided by
    Figshare: http://figshare.com/
    figshare
    Authors
    Sridhar Halali; Erik van Bergen; Casper J Breuker; Paul M Brakefield; Oskar Brattström
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used two means of data collection. The first involved collecting and coding habitat data for several species of Mycalesina butterflies. The habitat coding was done by referring to several papers and books and through personal communications with local field experts (names are given in the Acknowledgements section of the paper and in the habitat classification file). The second part of the data collection involved measuring species life-history traits (egg area, development times, pupal weight, and so on) by rearing four species in a common garden. We also obtained life-history data for three species from the literature (please see details in the main text). All collected data were processed and analyzed in RStudio.

  18. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

    Getting Started

    This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018.

    Each document in the corpus contains the following parts:
    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
    5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
    6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824. All documents in LSC have a nonempty abstract, title, categories, research areas, and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.

    Data Processing

    This section describes all steps taken for the LSC to be collected, cleaned, and made available to researchers. Processing the data consists of six main steps.

    Step 1: Downloading the Data Online. This is the step of collecting the dataset online. It was done manually by exporting documents as tab-delimited files. All downloaded documents are available online.

    Step 2: Importing the Dataset to R. This is the process of converting the collection to RData format for processing. The LSC was collected as TXT files; all documents were extracted into R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category. Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, preliminary detection and removal of inaccurate documents were performed. All documents with empty abstracts and documents without categories were removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts. Traditionally, abstracts are written in the format of an executive summary with one paragraph of continuous writing, which is known as an 'unstructured abstract'. However, medicine-related publications in particular use 'structured abstracts'. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the words of section headings with the first word of the section. As a result, some structured abstracts in the LSC require an additional correction process to split such concatenated words. For instance, we observe words such as ConclusionHigher and ConclusionsRT in the corpus. The detection and identification of concatenated words cannot be totally automated; human intervention is needed to identify possible section headings. We note that we only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of the research areas. Identification of such words was done by sampling medicine-related publications. The section headings identified in such abstracts are listed in List 1.

    List 1: Headings of sections identified in structured abstracts
    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    All words matching the headings in List 1 are detected in the entire corpus, and the words are then split into two. For instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'.

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts. After the correction of concatenated words is completed, the lengths of the abstracts are calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words; however, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that 'Clinical and basic research studies must include a structured abstract of 400 words or less' [7]. In LSC, the length of abstracts varies from 1 to 3,805 words. We decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents containing fewer than 30 or more than 500 words in their abstracts were removed.

    Step 6: Saving the Dataset into CSV Format. The corrected and extracted documents are saved into 36 CSV files. The structure of the files is described in the following section.

    The Structure of Fields in CSV Files

    In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication manual. American Psychological Association, Washington, DC, 1983.
    [7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
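    Steps 4 and 5 above (splitting concatenated section headings and filtering abstracts by length) can be sketched as follows. The heading list is adapted from List 1; the function names, the simple whitespace-based word count, and the example string are illustrative assumptions rather than the exact code used to build the corpus.

        import re

        HEADINGS = [
            "Background", "Methods", "Method", "Design", "Theoretical", "Measurements",
            "Measurement", "Location", "Aims", "Aim", "Methodology", "Process", "Abstract",
            "Population", "Approach", "Objectives", "Objective", "Purposes", "Purpose",
            "Subjects", "Subject", "Introduction", "Implications", "Implication", "Patients",
            "Patient", "Procedures", "Procedure", "Hypothesis", "Measures", "Measure",
            "Settings", "Setting", "Limitations", "Limitation", "Discussion", "Conclusions",
            "Conclusion", "Results", "Result", "Findings", "Finding", "Materials", "Material",
            "Rationale",
        ]

        def split_concatenated_headings(abstract: str) -> str:
            # e.g. "ConclusionHigher" -> "Conclusion Higher"
            for h in HEADINGS:
                abstract = re.sub(rf"\b({h})(?=[A-Z])", r"\1 ", abstract)
            return abstract

        def within_length_limits(abstract: str, lo: int = 30, hi: int = 500) -> bool:
            # Keep abstracts of typical length, as in Step 5 (approximate word count).
            return lo <= len(abstract.split()) <= hi

        text = "ConclusionHigher doses were associated with better outcomes."
        print(split_concatenated_headings(text))
        print(within_length_limits(text))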

  19. Data archiving and sharing survey

    • zenodo.org
    • search.dataone.org
    • +1more
    bin, csv
    Updated Jun 5, 2022
    Cite
    Connie Mulligan; Connie Mulligan (2022). Data archiving and sharing survey [Dataset]. http://doi.org/10.5061/dryad.5x69p8d40
    Explore at:
    bin, csvAvailable download formats
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Connie Mulligan; Connie Mulligan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Open data sharing improves the quality and reproducibility of scholarly research. Open data are defined as data that can be freely used, reused, and redistributed by anyone. Open data sharing democratizes science by making data more equitably available throughout the world regardless of funding or access to other resources necessary for generating cutting-edge data. For an interdisciplinary field like biological anthropology, data sharing is critical since one person cannot easily collect data across multiple domains. The goal of this paper is to encourage broader data sharing by exploring the state of data sharing in the field of biological anthropology. Our paper is divided into four parts: the first section describes the benefits, challenges, and emerging solutions to open data sharing; the second section presents the results of our data archiving and sharing survey that was completed by over 700 researchers; the third section presents personal experiences of data sharing by the authors; and the fourth section discusses the strengths of different types of data repositories and provides a list of recommended data repositories.

    The data archiving and sharing survey and the raw data collected from the survey are deposited here.

  20. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    html, zip, pdfAvailable download formats
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4 (a generic sketch of this windowing scheme appears at the end of this file list).
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation; they compute the features from the segmented data.
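
    Because the gaze features depend on fixation detection, a textbook I-DT (dispersion-threshold) sketch is shown below for orientation; it is not necessarily identical to the updated implementation in gaze_features.py, and the thresholds and column names are assumptions.

        # gaze is expected to be a pandas DataFrame with x/y coordinates and timestamps (seconds).
        def idt_fixations(gaze, disp_thresh=1.0, min_dur=0.1,
                          x_col="x", y_col="y", t_col="timestamp"):
            """Generic I-DT: grow a window until its combined x/y dispersion exceeds disp_thresh;
            windows lasting at least min_dur seconds are reported as fixations."""
            fixations = []
            start, n = 0, len(gaze)
            while start < n - 1:
                end = start + 1
                while end < n:
                    window = gaze.iloc[start:end + 1]
                    dispersion = (window[x_col].max() - window[x_col].min()) + \
                                 (window[y_col].max() - window[y_col].min())
                    if dispersion > disp_thresh:
                        break
                    end += 1
                duration = gaze[t_col].iloc[end - 1] - gaze[t_col].iloc[start]
                if duration >= min_dur:
                    fixations.append((gaze[t_col].iloc[start], duration))  # (onset, duration)
                start = end
            return fixations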

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: training the model, evaluating it, and using it for inference on the “monitoring” part. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the call is commented out; uncomment it if you want to see visually what has changed compared to the original data.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:

    • MLP Network (commented out): This code was used for classification with the MLP network; the results shown in Table 3 come from this code. If you wish to use this model, uncomment it and comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
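
    As a rough, untuned orientation only (not the tuned setup used for the paper), the sketch below shows a minimal XGBoost run including the kind of probability-based confidence filtering described above; the file path, label column, and threshold value are placeholders.

        import pandas as pd
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        from xgboost import XGBClassifier

        # Placeholder path and column name -- not the actual files written to the results folder.
        data = pd.read_csv("results_example/train_split_example.csv")
        X = data.drop(columns=["label"])
        y = LabelEncoder().fit_transform(data["label"])

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

        model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
        model.fit(X_tr, y_tr)
        print(classification_report(y_te, model.predict(X_te)))

        # Confidence filtering: keep only predictions whose highest class probability
        # exceeds an empirically chosen threshold (cf. Section 5.2 of the paper).
        proba = model.predict_proba(X_te)
        threshold = 0.7  # placeholder value, not the threshold derived in the paper
        confident = proba.max(axis=1) >= threshold
        print(f"{confident.mean():.1%} of test segments pass the confidence threshold")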

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.
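
    For illustration, a minimal sketch of one typical sequence-analysis step is given below, collapsing the per-segment inferred labels into runs of identical labels; the input path and column name are placeholders, and the actual analysis in sequence_analysis.py is more extensive.

        import pandas as pd
        from itertools import groupby

        inferred = pd.read_csv("results_example/inferred_labels_example.csv")  # placeholder path
        labels = inferred["predicted_label"].tolist()                          # placeholder column

        # Collapse consecutive identical labels into (label, run_length) pairs.
        runs = [(label, sum(1 for _ in group)) for label, group in groupby(labels)]
        print(runs[:10])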

    Licenses

    The data are licensed under CC BY; the code is licensed under MIT.

Cite
Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647

INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Silicon Orchard Lab, Bangladesh
Independent University, Bangladesh
University of Memphis, USA
Authors
Nafiz Sadman; Nishat Anjum; Kishor Datta Gupta
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
Bangladesh, United States
Description

Introduction

There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

However, there are not many NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight and find a vaccine. This research is crucial in battling the pandemic. The few NLP based research publications are sentiment classification of online tweets by Samuel et el [ 8 ] to understand fear persisting in people due to the virus. Similar work has been done using the LSTM network to classify sentiments from online discussion forums by Jelodaret. al.[ 9 ]. NKK dataset is the first study on a comparatively larger dataset of a newspaper report on COVID-19, which contributed to the virus’s awareness to the best of our knowledge.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper report from United States of America (USA) on COVID-19. The newspaper includes The Washington Post (USA) and StarTribune (USA). We have named it as “Covid-News-USA-NNK”. We also accumulated 50 online newspaper report from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspaper includes The Daily Star (BD) and Prothom Alo (BD). All these newspapers are from the top provider and top read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was suitable compared to automation to ensure the news were highly relevant to the subject. The newspaper online sites had dynamic content with advertisements in no particular order. Therefore there were high chances of online scrappers to collect inaccurate news reports. One of the challenges while collecting the data is the requirement of subscription. Each newspaper required $1 per subscriptions. Some criteria in collecting the news reports provided as guideline to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic; political, social, and economic genres are to be prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above-mentioned newspapers.

To collect these data, we used a Google Form for both the USA and BD datasets. Two human editors went through each entry to check for spam or troll entries.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since altering sentence structures could result in the loss of valuable information. Although this was done with the help of a script, we also assigned the same human collectors to cross-check the output against the criteria mentioned above.
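
To make the listed steps concrete, a minimal sketch of such a pipeline using regular expressions and NLTK is shown below; it illustrates the steps above and is not the exact script used to build the dataset.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")
    nltk.download("wordnet")
    nltk.download("omw-1.4")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove hyperlinks
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # keep English alphanumerics only
        tokens = [t.lower() for t in text.split()]
        tokens = [t for t in tokens if t not in stop_words]  # remove stop words
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # lemmatize

    print(preprocess("COVID-19 cases are rising, see https://example.com"))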

The primary data statistics of the two datasets are shown in Tables 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100

Table 2: Covid-News-BD-NNK data statistics

No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.

3 Literature Review

Natural Language Processing (NLP) deals with text (i.e., categorical) data in computer science, using numerous methods such as one-hot encoding and word embeddings that transform text into numerical representations, which can then be fed to machine learning and deep learning algorithms.

Some well-known applications of NLP include fraud detection on online media sites[ 10 ], authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ], and the machine translation used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in algorithms for Natural Language Processing itself. The two most prominent are BERT[ 14 ], a bidirectional Transformer encoder that achieves strong performance on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI[ 15 ], which can generate almost human-like text. However, these are typically used as pre-trained models, since training them carries a huge computational cost. Information Extraction is a general concept of retrieving information from a dataset. Information extraction from an image could mean retrieving vital feature spaces or targeted portions of an image; information extraction from speech could mean retrieving information about names, places, etc.[ 16 ]. Information extraction from text could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction: it clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA[17].
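
As a generic illustration of LDA topic modeling (not code from the NKK pipeline), a small scikit-learn example on toy documents:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "virus outbreak spreads across the state",
        "government announces economic relief package",
        "hospitals report rising infection numbers",
    ]  # toy documents for illustration

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [terms[j] for j in topic.argsort()[-5:][::-1]]
        print(f"Topic {i}: {', '.join(top)}")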

Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses a graph to calculate a weight for each word and picks the words with the highest weights.
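
To illustrate the idea behind such graph-based keyword extraction (a simplified sketch, not the exact TextRank implementation of [ 18 ]), words can be scored with PageRank over a co-occurrence graph:

    import networkx as nx

    def textrank_keywords(text, window=2, top_k=5):
        """Score words by PageRank on a co-occurrence graph built with a sliding window."""
        words = [w.lower() for w in text.split() if w.isalpha()]
        graph = nx.Graph()
        for i, w in enumerate(words):
            for other in words[i + 1:i + window + 1]:
                if other != w:
                    graph.add_edge(w, other)
        scores = nx.pagerank(graph)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    print(textrank_keywords("the governor announced new mask rules as virus cases surged in the state"))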

Word clouds are a useful visualization technique for understanding the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.
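
A minimal sketch with the wordcloud library is shown below; the input text and output file name are invented for illustration.

    from wordcloud import WordCloud

    text = "virus vaccine lockdown economy masks hospital cases deaths recovery"  # toy input
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    wc.to_file("wordcloud_example.png")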

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can note the following:

In February, both newspapers talked about China and the source of the outbreak.

StarTribune emphasized Minnesota as the state of greatest concern; this concern appeared to intensify in April.

Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.

The Washington Post discussed global issues more than StarTribune.

In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both newspapers mentioned the outbreak in China in February, the spread within the United States was highlighted more heavily from March through May, reflecting the critical impact caused by the virus.

We used a script to extract all numbers associated with certain keywords such as ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, and ’Diagnosed’ from the news reports and built a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction, we can observe that April was the peak month for COVID-19 cases, after a gradual rise from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April, an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the outbreak.

We used VADER sentiment analysis to extract the sentiment of the headlines and the body text. On average, the sentiment scores ranged from -0.5 to -0.9; the VADER scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us separate the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and the serious impact it caused. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from headlines as well as body content. PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, and ’Response’. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
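
For reference, a minimal sketch of the VADER scoring described above, using the vaderSentiment package; the headlines are invented examples, not entries from the dataset.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    headlines = [
        "Deaths surge as hospitals run out of beds",
        "Volunteers deliver groceries to quarantined neighbours",
    ]  # invented examples
    for h in headlines:
        compound = analyzer.polarity_scores(h)["compound"]  # ranges from -1 (negative) to 1 (positive)
        print(f"{compound:+.2f}  {h}")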
