15 datasets found

Classification result classifiers using TF-IDF with SMOTE.
plos.figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t007
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification result classifiers using TF-IDF with SMOTE.
Data from: Arabic news credibility on Twitter using sentiment analysis and...
zenodo.org
data.niaid.nih.gov
csv, txt
Updated Jun 3, 2023
Cite
Duha Samdani; Mounira Taileb; Nada Almani (2023). Arabic news credibility on Twitter using sentiment analysis and ensemble learning [Dataset]. http://doi.org/10.5281/zenodo.8000717
Available download formats: csv, txt
Unique identifier
https://doi.org/10.5281/zenodo.8000717
Dataset updated
Jun 3, 2023
Dataset provided by
Zenodo http://zenodo.org/
Authors
Duha Samdani; Mounira Taileb; Nada Almani
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Arabic news credibility on Twitter using sentiment analysis and ensemble learning.

WHAT IS IT?

-----------

An Arabic news credibility model for Twitter using sentiment analysis and ensemble learning.

Included here are the collected dataset and the source code of the proposed model, written in Python using the Keras library with a TensorFlow backend.

Required Packages

------------------

Keras (https://keras.io/).

Scikit-learn (http://scikit-learn.org/)

Imblearn (imbalanced-learn, version 0.10.1)

To Run the model

---------------

One data file is required to run the model: the collected dataset. Set the path of this data file in the code.

The dataset

---------------

The dataset file contains all features; you can choose the features that you need and apply them to the model.

A description file describes each feature in the news credibility dataset.

The file Tweet_ID contains the list of tweet IDs in the dataset.

The annotated replies, labeled by credibility, are provided.
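
Choosing a subset of features from the dataset file could look roughly like the sketch below. It is only an illustration: the column names (tweet_id, followers, retweets, label) and the in-memory sample are hypothetical stand-ins, not the actual columns of the published dataset, and the real file path would need to be set as described above.

```python
import csv
import io


def select_features(rows, feature_names, label_name):
    """Split parsed CSV rows into numeric feature vectors and labels."""
    features = [[float(row[name]) for name in feature_names] for row in rows]
    labels = [row[label_name] for row in rows]
    return features, labels


# Tiny in-memory stand-in for the dataset file (hypothetical columns and values).
sample_file = io.StringIO(
    "tweet_id,followers,retweets,label\n"
    "1,120,3,credible\n"
    "2,15,0,not_credible\n"
)

rows = list(csv.DictReader(sample_file))

# Keep only the features of interest; tweet_id is deliberately left out.
X, y = select_features(rows, ["followers", "retweets"], "label")
```

With a real file, `sample_file` would be replaced by `open(path, newline="", encoding="utf-8")`, and the feature list would be taken from the description file that accompanies the dataset.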

CONTACTS

--------

If you want to report bugs or have general queries email to
Example of different sentiments from the citation sentiment corpus.
plos.figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Example of different sentiments from the citation sentiment corpus. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t001
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example of different sentiments from the citation sentiment corpus.
‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2
analyst-2.ai
Updated Sep 27, 2021
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-sentiment-analysis-of-commodity-news-gold-732f/e3232de2/?iid=002-045&v=presentation
Dataset updated
Sep 27, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.

--- Dataset description provided by original source is as follows ---

Context

This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).

Content

The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance - if a headline is a price related news then what is the direction of price movements it is talking about; whether the news headline is talking about the past or future; whether the news item is talking about asset comparison; etc.

Acknowledgements

Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.

https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)

We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).

Inspiration

Commodity prices are known to be quite volatile. Machine learning models that understand commodity news well will be able to provide an additional input to short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.

Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.

--- Original source retains full ownership of the source dataset ---
Hyperparameter details of all machine learning models.
figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Hyperparameter details of all machine learning models. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t002
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Hyperparameter details of all machine learning models.
Replication Data for: Less Annotating, More Classifying: Addressing the Data...
search.dataone.org
Updated Nov 8, 2023
Cite
Laurer, Moritz; van Atteveldt, Wouter; Casas, Andreu; Welbers, Kasper (2023). Replication Data for: Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI [Dataset]. http://doi.org/10.7910/DVN/8ACDTT
Unique identifier
https://doi.org/10.7910/DVN/8ACDTT
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Laurer, Moritz; van Atteveldt, Wouter; Casas, Andreu; Welbers, Kasper
Description
Supervised machine learning is an increasingly popular tool for analysing large political text corpora. The main disadvantage of supervised machine learning is the need for thousands of manually annotated training data points. This issue is particularly important in the social sciences where most new research questions require the automation of a new task with new and imbalanced training data. This paper analyses how deep transfer learning can help address this challenge by accumulating ‘prior knowledge’ in algorithms. Pre-training algorithms like BERT creates representations of statistical language patterns (‘language knowledge’), and training on universal tasks like Natural Language Inference (NLI) reduces reliance on task-specific data (‘task knowledge’). We systematically show the benefits of transfer learning on a wide range of eight tasks. Across these eight tasks, BERT-NLI fine-tuned on 100 to 2500 data points performs on average 10.7 to 18.3 percentage points better than classical algorithms without transfer learning. Our study indicates that BERT-NLI trained on 500 data points achieves similar average performance as classical algorithms trained on around 5000 data points. Moreover, we show that transfer learning works particularly well on imbalanced data. We conclude by discussing limitations of transfer learning and by outlining new opportunities for political science research.
Classification results of classifiers using fastText.
plos.figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Classification results of classifiers using fastText. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t011
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t011
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification results of classifiers using fastText.
Classification results of machine learning models using CNN features with.
figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Classification results of machine learning models using CNN features with. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t008
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t008
Dataset updated
May 28, 2024
Dataset provided by
PLOS ONE
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classification results of machine learning models using CNN features with.
Strength and weakness of feature representation technique.
plos.figshare.com
xls
Updated May 28, 2024
Cite
Khaled Alnowaiser (2024). Strength and weakness of feature representation technique. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302304.t003
Dataset updated
May 28, 2024
Dataset provided by
PLOS http://plos.org/
Authors
Khaled Alnowaiser
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Strength and weakness of feature representation technique.
The curated data from the Twitter Dataset.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma (2025). The curated data from the Twitter Dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0323449.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323449.t001
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Social Media has given an exponential rise to an ever-connected world. Health data that was earlier viewed as hospital records or clinical records is now being shared as text over social media. Information and updates regarding the outbreak of a pandemic, clinical visit results, general health updates, etc., are being analyzed. The data is now shared more frequently in various formats such as images, text, documents, and videos. With fast streaming systems and no constraints on storage spaces, all this shared rich media data is quite voluminous and informative. For shared health data such as discussions on ailments, hospital visits, general health well-being updates, and drug research updates via official Twitter handles of various pharmaceutical companies and healthcare organizations, a unique level of challenge is posed for analysis of this data. The text indicating the ailment often varies from proper medical jargon to common names for the same, whereas the intent is the same in predicting the disease or ailment term. This paper focuses on how we can extract and analyze health-related data exchanged on social media and introduce an Augmented Ensemble Model (AEM), which identifies the frequently shared topics and discussions about health on social networks, to predict the emerging health trends. The analytical model works with chronological datasets to deduce text classification of topics related to health. This Hybrid Model uses text data augmentation to address class imbalance for health terms and further employs a clustering technique for location-based aggregation. An algorithm for health terms Word Vector Embedding model is formulated. This Word Vector model is further used in Text Data Augmentation to reduce the class imbalance. We evaluate the accuracy of the classifiers by constructing a Machine Learning pipeline. 
For our Augmented Ensemble Model, the Text classification accuracy is evaluated after the augmentation using a voting ensemble technique, and a greater accuracy has been observed. Emerging health trends are analyzed via temporal classification and location-wise aggregation of the health terms. This model demonstrates that a Text Augmented Ensemble Machine Learning approach for health topics is more efficient than the conventional Machine Learning classification technique(s).
Comparison with existing models.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma (2025). Comparison with existing models. [Dataset]. http://doi.org/10.1371/journal.pone.0323449.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323449.t005
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Social Media has given an exponential rise to an ever-connected world. Health data that was earlier viewed as hospital records or clinical records is now being shared as text over social media. Information and updates regarding the outbreak of a pandemic, clinical visit results, general health updates, etc., are being analyzed. The data is now shared more frequently in various formats such as images, text, documents, and videos. With fast streaming systems and no constraints on storage spaces, all this shared rich media data is quite voluminous and informative. For shared health data such as discussions on ailments, hospital visits, general health well-being updates, and drug research updates via official Twitter handles of various pharmaceutical companies and healthcare organizations, a unique level of challenge is posed for analysis of this data. The text indicating the ailment often varies from proper medical jargon to common names for the same, whereas the intent is the same in predicting the disease or ailment term. This paper focuses on how we can extract and analyze health-related data exchanged on social media and introduce an Augmented Ensemble Model (AEM), which identifies the frequently shared topics and discussions about health on social networks, to predict the emerging health trends. The analytical model works with chronological datasets to deduce text classification of topics related to health. This Hybrid Model uses text data augmentation to address class imbalance for health terms and further employs a clustering technique for location-based aggregation. An algorithm for health terms Word Vector Embedding model is formulated. This Word Vector model is further used in Text Data Augmentation to reduce the class imbalance. We evaluate the accuracy of the classifiers by constructing a Machine Learning pipeline. 
For our Augmented Ensemble Model, the Text classification accuracy is evaluated after the augmentation using a voting ensemble technique, and a greater accuracy has been observed. Emerging health trends are analyzed via temporal classification and location-wise aggregation of the health terms. This model demonstrates that a Text Augmented Ensemble Machine Learning approach for health topics is more efficient than the conventional Machine Learning classification technique(s).
Health Term Text Data Augmentation (HTTDA) transformation for reduction to...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma (2025). Health Term Text Data Augmentation (HTTDA) transformation for reduction to n-Classes. [Dataset]. http://doi.org/10.1371/journal.pone.0323449.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323449.t002
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS http://plos.org/
Authors
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Health Term Text Data Augmentation (HTTDA) transformation for reduction to n-Classes.
Accuracy of the classifiers using machine learning pipeline.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma (2025). Accuracy of the classifiers using machine learning pipeline. [Dataset]. http://doi.org/10.1371/journal.pone.0323449.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323449.t003
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Accuracy of the classifiers using machine learning pipeline.
The F1 macro, weighted, and micro scores for various classifiers and a...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma (2025). The F1 macro, weighted, and micro scores for various classifiers and a voting ensemble. [Dataset]. http://doi.org/10.1371/journal.pone.0323449.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0323449.t004
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Sonia Saini; Ruchi Agarwal; S.P. Singh; Punit Gupta; Ankit Vidhyarthi; Rohit Verma
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The F1 macro, weighted, and micro scores for various classifiers and a voting ensemble.
Data_Sheet_1_Deep Learning-Based Natural Language Processing for Screening...
frontiersin.figshare.com
docx
Updated May 31, 2023
Cite
Hong-Jie Dai; Chu-Hsien Su; You-Qian Lee; You-Chen Zhang; Chen-Kai Wang; Chian-Jue Kuo; Chi-Shin Wu (2023). Data_Sheet_1_Deep Learning-Based Natural Language Processing for Screening Psychiatric Patients.docx [Dataset]. http://doi.org/10.3389/fpsyt.2020.533949.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyt.2020.533949.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Hong-Jie Dai; Chu-Hsien Su; You-Qian Lee; You-Chen Zhang; Chen-Kai Wang; Chian-Jue Kuo; Chi-Shin Wu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The introduction of pre-trained language models in natural language processing (NLP) based on deep learning and the availability of electronic health records (EHRs) presents a great opportunity to transfer the “knowledge” learned from data in the general domain to enable the analysis of unstructured textual data in clinical domains. This study explored the feasibility of applying NLP to a small EHR dataset to investigate the power of transfer learning to facilitate the process of patient screening in psychiatry. A total of 500 patients were randomly selected from a medical center database. Three annotators with clinical experience reviewed the notes to make diagnoses for major/minor depression, bipolar disorder, schizophrenia, and dementia to form a small and highly imbalanced corpus. Several state-of-the-art NLP methods based on deep learning along with pre-trained models based on shallow or deep transfer learning were adapted to develop models to classify the aforementioned diseases. We hypothesized that the models that rely on transferred knowledge would be expected to outperform the models learned from scratch. The experimental results demonstrated that the models with the pre-trained techniques outperformed the models without transferred knowledge by micro-avg. and macro-avg. F-scores of 0.11 and 0.28, respectively. Our results also suggested that the use of the feature dependency strategy to build multi-labeling models instead of problem transformation is superior considering its higher performance and simplicity in the training process.