Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification results of classifiers using TF-IDF with SMOTE.
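A minimal sketch of the SMOTE idea named in this caption, assuming the minority-class documents are already available as dense feature vectors (for example TF-IDF rows) in plain Python lists. The function name `smote_oversample`, its parameters, and the toy vectors are illustrative assumptions and are not taken from the original work; in practice a library implementation would normally be used.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy TF-IDF-like vectors for a minority class (hypothetical values).
minority_vectors = [[0.0, 0.2, 0.9], [0.1, 0.3, 0.8], [0.0, 0.1, 1.0]]
new_points = smote_oversample(minority_vectors, n_new=4)
```

Each synthetic vector lies on a line segment between two existing minority vectors, so the new points stay inside the region already occupied by the minority class.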
Arabic news credibility on Twitter using sentiment analysis and ensemble learning.
WHAT IS IT?
-----------
An Arabic news credibility model for Twitter using sentiment analysis and ensemble learning.
Here we include the collected dataset and the source code of the proposed model, written in Python using the Keras library with a TensorFlow backend.
Required Packages
------------------
To Run the model
---------------
One data file is required to run the model:
The dataset
---------------
CONTACTS
--------
Example of different sentiments from the citation sentiment corpus.
Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).
The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance: if a headline is price-related news, the direction of price movement it describes; whether the headline is talking about the past or the future; whether the news item makes an asset comparison; etc.
Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.
https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)
We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).
Commodity prices are known to be quite volatile. Machine learning models that understand commodity news well will be able to provide an additional input to short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.
Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.
--- Original source retains full ownership of the source dataset ---
Hyperparameter details of all machine learning models.
Supervised machine learning is an increasingly popular tool for analysing large political text corpora. The main disadvantage of supervised machine learning is the need for thousands of manually annotated training data points. This issue is particularly important in the social sciences where most new research questions require the automation of a new task with new and imbalanced training data. This paper analyses how deep transfer learning can help address this challenge by accumulating ‘prior knowledge’ in algorithms. Pre-training algorithms like BERT creates representations of statistical language patterns (‘language knowledge’), and training on universal tasks like Natural Language Inference (NLI) reduces reliance on task-specific data (‘task knowledge’). We systematically show the benefits of transfer learning on a wide range of eight tasks. Across these eight tasks, BERT-NLI fine-tuned on 100 to 2500 data points performs on average 10.7 to 18.3 percentage points better than classical algorithms without transfer learning. Our study indicates that BERT-NLI trained on 500 data points achieves similar average performance as classical algorithms trained on around 5000 data points. Moreover, we show that transfer learning works particularly well on imbalanced data. We conclude by discussing limitations of transfer learning and by outlining new opportunities for political science research.
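One common way to recast a classification task as Natural Language Inference is to verbalise each class as a hypothesis and pair it with the text as the premise. The sketch below shows only that data-formatting step; the function name `build_nli_pairs`, the label strings, and the example hypotheses are assumptions made for illustration and are not taken from the paper.

```python
def build_nli_pairs(text, class_hypotheses, true_label=None):
    """Turn one classification example into premise-hypothesis pairs.

    Each class is verbalised as a hypothesis; the pair for the true class
    is labelled "entailment" and all other pairs "not_entailment".
    """
    pairs = []
    for label, hypothesis in class_hypotheses.items():
        pair = {"premise": text, "hypothesis": hypothesis, "class": label}
        if true_label is not None:
            pair["nli_label"] = (
                "entailment" if label == true_label else "not_entailment"
            )
        pairs.append(pair)
    return pairs


# Hypothetical class verbalisations for a political-text task.
hypotheses = {
    "economy": "The text is about the economy.",
    "military": "The text is about the military.",
    "other": "The text is about another topic.",
}
pairs = build_nli_pairs(
    "We will cut taxes for small businesses.", hypotheses, true_label="economy"
)
```

Because every task is reduced to the same entailment format, a model already trained on NLI can reuse that ability on a new task with only the hypotheses changed.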
Classification results of classifiers using fastText.
Classification results of machine learning models using CNN features with.
Strengths and weaknesses of feature representation techniques.
Social Media has given an exponential rise to an ever-connected world. Health data that was earlier viewed as hospital records or clinical records is now being shared as text over social media. Information and updates regarding the outbreak of a pandemic, clinical visit results, general health updates, etc., are being analyzed. The data is now shared more frequently in various formats such as images, text, documents, and videos. With fast streaming systems and no constraints on storage spaces, all this shared rich media data is quite voluminous and informative. For shared health data such as discussions on ailments, hospital visits, general health well-being updates, and drug research updates via official Twitter handles of various pharmaceutical companies and healthcare organizations, a unique level of challenge is posed for analysis of this data. The text indicating the ailment often varies from proper medical jargon to common names for the same, whereas the intent is the same in predicting the disease or ailment term. This paper focuses on how we can extract and analyze health-related data exchanged on social media and introduce an Augmented Ensemble Model (AEM), which identifies the frequently shared topics and discussions about health on social networks, to predict the emerging health trends. The analytical model works with chronological datasets to deduce text classification of topics related to health. This Hybrid Model uses text data augmentation to address class imbalance for health terms and further employs a clustering technique for location-based aggregation. An algorithm for health terms Word Vector Embedding model is formulated. This Word Vector model is further used in Text Data Augmentation to reduce the class imbalance. We evaluate the accuracy of the classifiers by constructing a Machine Learning pipeline. 
For our Augmented Ensemble Model, the Text classification accuracy is evaluated after the augmentation using a voting ensemble technique, and a greater accuracy has been observed. Emerging health trends are analyzed via temporal classification and location-wise aggregation of the health terms. This model demonstrates that a Text Augmented Ensemble Machine Learning approach for health topics is more efficient than the conventional Machine Learning classification technique(s).
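The voting step can be sketched as follows, assuming hard (majority) voting over the label predictions of several classifiers; the abstract does not say whether hard or soft voting was used. The function name `hard_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def hard_vote(predictions_per_classifier):
    """Combine label predictions by majority vote, one sample at a time.

    predictions_per_classifier: list of equally long label lists,
    one list per classifier. Ties go to the label seen first.
    """
    combined = []
    for sample_votes in zip(*predictions_per_classifier):
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three classifiers on four posts.
clf_a = ["flu", "flu", "asthma", "diabetes"]
clf_b = ["flu", "asthma", "asthma", "flu"]
clf_c = ["asthma", "flu", "asthma", "diabetes"]
ensemble = hard_vote([clf_a, clf_b, clf_c])
# ensemble == ["flu", "flu", "asthma", "diabetes"]
```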
Health Term Text Data Augmentation (HTTDA) transformation for reduction to n-Classes.
Accuracy of the classifiers using machine learning pipeline.
The F1 macro, weighted, and micro scores for various classifiers and a voting ensemble.
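For reference, the three averages differ only in how per-class results are pooled. The sketch below computes them for single-label predictions using the standard definitions; the function name `f1_scores` and the toy labels are illustrative and do not reproduce the reported results.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return macro, weighted, and micro F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    # Macro: unweighted mean of per-class F1.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: per-class F1 weighted by the number of true samples.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    # Micro: F1 over counts pooled across all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Toy labels (hypothetical): three "flu" posts and one "asthma" post.
y_true = ["flu", "flu", "flu", "asthma"]
y_pred = ["flu", "flu", "asthma", "asthma"]
scores = f1_scores(y_true, y_pred)
```

On imbalanced data the macro score treats rare classes the same as frequent ones, which is why it is often reported alongside the weighted and micro scores.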
The introduction of pre-trained language models in natural language processing (NLP) based on deep learning and the availability of electronic health records (EHRs) presents a great opportunity to transfer the “knowledge” learned from data in the general domain to enable the analysis of unstructured textual data in clinical domains. This study explored the feasibility of applying NLP to a small EHR dataset to investigate the power of transfer learning to facilitate the process of patient screening in psychiatry. A total of 500 patients were randomly selected from a medical center database. Three annotators with clinical experience reviewed the notes to make diagnoses for major/minor depression, bipolar disorder, schizophrenia, and dementia to form a small and highly imbalanced corpus. Several state-of-the-art NLP methods based on deep learning along with pre-trained models based on shallow or deep transfer learning were adapted to develop models to classify the aforementioned diseases. We hypothesized that the models that rely on transferred knowledge would be expected to outperform the models learned from scratch. The experimental results demonstrated that the models with the pre-trained techniques outperformed the models without transferred knowledge by micro-avg. and macro-avg. F-scores of 0.11 and 0.28, respectively. Our results also suggested that the use of the feature dependency strategy to build multi-labeling models instead of problem transformation is superior considering its higher performance and simplicity in the training process.