License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Clustering texts is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is difficult because short texts are typically sparse, ambiguous, and noisy, and lack context. One of the challenges for STC is finding a representation of short text documents that yields cohesive clusters. Typically, however, STC relies on a single-view representation, which is insufficient because it cannot capture the different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), found by identifying the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by fixed-length concatenation via the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various MVRs on STC. Based on the experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of single-view representations.
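As a rough sketch of the core idea (not the authors' exact pipeline; the embedding dimensions and sizes below are illustrative), two single-view representation matrices can be concatenated and then projected to a fixed length with scikit-learn's PCA:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical single-view representations for 1,000 short texts.
view_a = rng.normal(size=(1000, 768))  # e.g. a BERT-style embedding view
view_b = rng.normal(size=(1000, 300))  # e.g. a GloVe-style embedding view

# Multi-view representation: concatenate the views, then reduce to a
# fixed length so every view combination yields vectors of the same size.
multi_view = np.hstack([view_a, view_b])                 # shape (1000, 1068)
fixed = PCA(n_components=128).fit_transform(multi_view)  # shape (1000, 128)
print(fixed.shape)

The fixed-length output can then be fed to any standard clustering algorithm, e.g. k-means.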
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Background: To combat the rising incidence of syphilis, the Brazilian Ministry of Health (MoH) created a National Rapid Response to Syphilis, with actions aimed at bolstering epidemiological surveillance of acquired syphilis, congenital syphilis, and syphilis during pregnancy. These were complemented by communication activities, running from November 2018 to March 2019 throughout Brazil and mainly in areas with high rates of syphilis, that targeted mass media outlets to raise population awareness and increase uptake of testing. This study analyzes the volume and quality of online news content on syphilis in Brazil between 2015 and 2019 and examines its effect on testing.
Methods: The collection and processing of online news were automated by means of a proprietary digital health ecosystem established for the study. We applied text data mining techniques to the online news to extract patterns from categories of text. The presence and combination of such categories in the collected texts determined the quality of each news story, which was then classified as high-, medium-, or low-quality. We examined the correlation between the quality of news and the volume of syphilis testing using Spearman's rank correlation coefficient.
Results: 1,049 web pages were collected using a Google Search API, of which 630 were categorized as earned media. We observed a steady increase in the number of news stories on syphilis in 2015 (n = 18), 2016 (n = 26), and 2017 (n = 42), with a substantial rise in 2018 (n = 107) and 2019 (n = 437), while the relative proportion of high-quality news remained consistently high (77.6% and 70.5%, respectively) and in line with earlier years. We found a correlation between news quality and syphilis testing performed in primary health care, with increases of 82.32%, 78.13%, and 73.20%, respectively, in the three types of treponemal tests used to confirm an infection.
Conclusion: Effective communication strategies that lead to the dissemination of high-quality information are important for increasing uptake of public health policy actions.
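For reference, the correlation analysis described above can be reproduced in outline with SciPy's Spearman implementation; the paired series below are placeholders, not the study's data:

from scipy.stats import spearmanr

# Placeholder paired series: a per-period news-quality measure and the
# number of treponemal tests performed in the same period (illustrative).
news_quality = [12, 18, 26, 42, 107, 437]
tests_performed = [900, 1100, 1400, 2100, 3800, 7200]

rho, p_value = spearmanr(news_quality, tests_performed)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")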
WSDM (pronounced "wisdom") is one of the premier conferences on web-inspired research involving search and data mining. The 12th ACM International WSDM Conference will take place in Melbourne, Australia during Feb. 11-15, 2019.
This task is organized by ByteDance, the Platinum Level Sponsor of the conference. ByteDance is a global Internet technology company that started in China. Our goal is to build a global content platform that enables people to enjoy content in various forms. We inform, entertain, and inspire people across languages, cultures, and geographies.
One of the challenges we face is combating different types of fake news. Fake news here refers to all forms of false, inaccurate, or misleading information, which now poses a serious threat to society.
At ByteDance, we have created a large-scale database of known fake news articles. Any new article must pass a check on the truthfulness of its content before being published: we match the new article against the articles in the database, and articles identified as containing fake news are withdrawn after human verification. The accuracy and efficiency of this process are therefore crucial for keeping the platform safe, reliable, and healthy.
This dataset is released as the competition dataset for the task Fake News Classification, defined as follows:
Given the title of a fake news article A and the title of an incoming news article B, participants are asked to classify B into one of three categories: agreed, disagreed, or unrelated.
The English titles are machine-translated from the corresponding Chinese titles, which may help participants from all backgrounds better understand the datasets. Participants are nevertheless strongly encouraged to use the Chinese-language titles for the task.
We use Weighted Categorization Accuracy to evaluate your performance. Weighted categorization accuracy can be generally defined as:
\[ \mathrm{WeightedAccuracy}(y, \hat{y}, \omega) = \frac{\displaystyle\sum_{i=1}^{n} \omega_i \, \mathbb{1}[y_i = \hat{y}_i]}{\displaystyle\sum_{i=1}^{n} \omega_i} \]
where \(y\) are the ground truths, \(\hat{y}\) are the predicted results, \(\omega_i\) is the weight associated with the \(i\)th item in the dataset, and \(\mathbb{1}[\cdot]\) is the indicator function, equal to 1 when the prediction matches the ground truth and 0 otherwise.
In our test set, we assign each testing item a weight according to its category. The weights of the three categories (agreed, disagreed, and unrelated) are \(\frac{1}{15}\), \(\frac{1}{5}\), and \(\frac{1}{16}\), respectively. We set the weights to account for the imbalanced data distribution and to minimize the bias toward the majority class (unrelated pairs account for approximately 70% of the dataset).
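For concreteness, here is a minimal Python sketch of this metric (the weight mapping mirrors the values above; the function and variable names are ours, not part of the official evaluation code):

import numpy as np

# Per-category weights from the task description.
CATEGORY_WEIGHTS = {"agreed": 1 / 15, "disagreed": 1 / 5, "unrelated": 1 / 16}

def weighted_accuracy(y_true, y_pred):
    # Each item counts with the weight of its ground-truth category,
    # normalized by the total weight over the whole test set.
    weights = np.array([CATEGORY_WEIGHTS[label] for label in y_true])
    correct = np.array([t == p for t, p in zip(y_true, y_pred)], dtype=float)
    return float((weights * correct).sum() / weights.sum())

# Example: two of three predictions are correct.
print(weighted_accuracy(["unrelated", "agreed", "disagreed"],
                        ["unrelated", "agreed", "unrelated"]))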
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This repository contains the enrichments for the dataset The New York Times Annotated Corpus developed for the paper:
“Marco Ponza, Diego Ceccarelli, Paolo Ferragina, Edgar Meij, Sambhav Kothari. Contextualizing Trending Entities in News Stories. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM 2021).”
It includes a total of 149 trends comprising 120K entities. The task is to retrieve a set of entities ranked by their usefulness in explaining why a given trending entity is trending.
Format
The repository contains the enrichments in JSON format.
The news stories of the New York Times from which these enrichments have been developed are available from LDC.
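As a minimal sketch of how the JSON enrichments might be consumed (the file name and the fields trend_entity and related_entities are hypothetical; consult the repository for the actual schema):

import json

# Load one enrichment file (path and field names are hypothetical).
with open("enrichments.json") as f:
    trends = json.load(f)

for trend in trends:
    # Each trend pairs a trending entity with candidate entities to be
    # ranked by how well they explain why the entity is trending.
    print(trend["trend_entity"], len(trend["related_entities"]))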
Data Splits
We perform two kinds of evaluation.
Use
Please cite the dataset and the accompanying paper if you find the resources in this repository useful:
@inproceedings{ponza2021,
  title     = {Contextualizing Trending Entities in News Stories},
  author    = {Ponza, Marco and Ceccarelli, Diego and Ferragina, Paolo and Meij, Edgar and Kothari, Sambhav},
  booktitle = {Proceedings of the 14th ACM International Conference on Web Search and Data Mining},
  year      = {2021},
}
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training and 7,600 testing samples.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the AG News subset.
ds = tfds.load('ag_news_subset', split='train')

# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Iron Ore rose to 98.27 USD/T on July 23, 2025, up 0.16% from the previous day. Over the past month, Iron Ore's price has risen 3.85%, but it is still 8.56% lower than a year ago, according to trading on a contract for difference (CFD) that tracks the benchmark market for this commodity. Iron Ore: values, historical data, forecasts, and news, updated as of July 2025.