License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Clustering texts is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is difficult because short texts are typically sparse, ambiguous, and noisy, and lack context. One of the challenges for STC is finding a representation of short text documents that yields cohesive clusters. Typically, however, STC relies on a single-view representation, which is insufficient because it cannot capture the different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), found by identifying the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by fixed-length concatenation via the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various MVRs on STC. Based on the experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of single-view representations.
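As a rough sketch of the core idea (not the authors' exact pipeline; the embedding dimensions and sizes below are illustrative), two single-view representation matrices can be concatenated and then projected to a fixed length with scikit-learn's PCA:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical single-view representations for 1,000 short texts.
view_a = rng.normal(size=(1000, 768))  # e.g. a BERT-style embedding view
view_b = rng.normal(size=(1000, 300))  # e.g. a GloVe-style embedding view

# Multi-view representation: concatenate the views, then reduce to a
# fixed length so every view combination yields vectors of the same size.
multi_view = np.hstack([view_a, view_b])                 # shape (1000, 1068)
fixed = PCA(n_components=128).fit_transform(multi_view)  # shape (1000, 128)
print(fixed.shape)

The fixed-length output can then be fed to any standard clustering algorithm, e.g. k-means.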
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Background: To combat the rising incidence of syphilis, the Brazilian Ministry of Health (MoH) created a National Rapid Response to Syphilis, with actions aimed at bolstering epidemiological surveillance of acquired syphilis, congenital syphilis, and syphilis during pregnancy. These were complemented by communication activities, running from November 2018 to March 2019 throughout Brazil and mainly in areas with high rates of syphilis, that targeted mass media outlets to raise population awareness and increase uptake of testing. This study analyzes the volume and quality of online news content on syphilis in Brazil between 2015 and 2019 and examines its effect on testing.
Methods: The collection and processing of online news were automated by means of a proprietary digital health ecosystem established for the study. We applied text data mining techniques to the online news to extract patterns from categories of text. The presence and combination of such categories in the collected texts determined the quality of each news story, which was then classified as high-, medium-, or low-quality. We examined the correlation between the quality of news and the volume of syphilis testing using Spearman's rank correlation coefficient.
Results: 1,049 web pages were collected using a Google Search API, of which 630 were categorized as earned media. We observed a steady increase in the number of news stories on syphilis in 2015 (n = 18), 2016 (n = 26), and 2017 (n = 42), with a substantial rise in 2018 (n = 107) and 2019 (n = 437), while the relative proportion of high-quality news remained consistently high (77.6% and 70.5%, respectively) and in line with earlier years. We found a correlation between news quality and syphilis testing performed in primary health care, with increases of 82.32%, 78.13%, and 73.20%, respectively, in the three types of treponemal tests used to confirm an infection.
Conclusion: Effective communication strategies that lead to the dissemination of high-quality information are important for increasing uptake of public health policy actions.
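For reference, the correlation analysis described above can be reproduced in outline with SciPy's Spearman implementation; the paired series below are placeholders, not the study's data:

from scipy.stats import spearmanr

# Placeholder paired series: a per-period news-quality measure and the
# number of treponemal tests performed in the same period (illustrative).
news_quality = [12, 18, 26, 42, 107, 437]
tests_performed = [900, 1100, 1400, 2100, 3800, 7200]

rho, p_value = spearmanr(news_quality, tests_performed)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")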
WSDM (pronounced "wisdom") is one of the premier conferences on web-inspired research involving search and data mining. The 12th ACM International WSDM Conference will take place in Melbourne, Australia during Feb. 11-15, 2019.
This task is organized by ByteDance, the Platinum Level Sponsor of the conference. ByteDance is a global Internet technology company that started in China. Our goal is to build a global content platform that enables people to enjoy content in various forms. We inform, entertain, and inspire people across languages, cultures, and geographies.
One of the challenges we face is combating different types of fake news. Fake news here refers to all forms of false, inaccurate, or misleading information, which now poses a serious threat to society.
At ByteDance, we have created a large-scale database of known fake news articles. Any new article must pass a check on the truthfulness of its content before being published: we match the new article against the articles in the database, and articles identified as containing fake news are withdrawn after human verification. The accuracy and efficiency of this process are therefore crucial for keeping the platform safe, reliable, and healthy.
This dataset is released as the competition dataset for the task Fake News Classification, defined as follows:
Given the title of a fake news article A and the title of an incoming news article B, participants are asked to classify B into one of three categories: agreed, disagreed, or unrelated.
The English titles are machine-translated from the corresponding Chinese titles, which may help participants from all backgrounds better understand the datasets. Participants are nevertheless strongly encouraged to use the Chinese-language titles for the task.
We use Weighted Categorization Accuracy to evaluate your performance. Weighted categorization accuracy can be generally defined as:
\[ \mathrm{WeightedAccuracy}(y, \hat{y}, \omega) = \frac{\displaystyle\sum_{i=1}^{n} \omega_i \, \mathbb{1}[y_i = \hat{y}_i]}{\displaystyle\sum_{i=1}^{n} \omega_i} \]
where \(y\) are the ground truths, \(\hat{y}\) are the predicted results, \(\omega_i\) is the weight associated with the \(i\)th item in the dataset, and \(\mathbb{1}[\cdot]\) is the indicator function, equal to 1 when the prediction matches the ground truth and 0 otherwise.
In our test set, we assign each testing item a weight according to its category. The weights of the three categories (agreed, disagreed, and unrelated) are \(\frac{1}{15}\), \(\frac{1}{5}\), and \(\frac{1}{16}\), respectively. We set the weights to account for the imbalanced data distribution and to minimize the bias toward the majority class (unrelated pairs account for approximately 70% of the dataset).
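For concreteness, here is a minimal Python sketch of this metric (the weight mapping mirrors the values above; the function and variable names are ours, not part of the official evaluation code):

import numpy as np

# Per-category weights from the task description.
CATEGORY_WEIGHTS = {"agreed": 1 / 15, "disagreed": 1 / 5, "unrelated": 1 / 16}

def weighted_accuracy(y_true, y_pred):
    # Each item counts with the weight of its ground-truth category,
    # normalized by the total weight over the whole test set.
    weights = np.array([CATEGORY_WEIGHTS[label] for label in y_true])
    correct = np.array([t == p for t, p in zip(y_true, y_pred)], dtype=float)
    return float((weights * correct).sum() / weights.sum())

# Example: two of three predictions are correct.
print(weighted_accuracy(["unrelated", "agreed", "disagreed"],
                        ["unrelated", "agreed", "unrelated"]))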
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This repository contains the enrichments for the dataset The New York Times Annotated Corpus developed for the paper:
“Marco Ponza, Diego Ceccarelli, Paolo Ferragina, Edgar Meij, Sambhav Kothari. Contextualizing Trending Entities in News Stories. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM 2021).”
It includes a total of 149 trends comprising 120K entities. The task is to retrieve a set of entities ranked by their usefulness in explaining why a given trending entity is trending.
Format
The repository contains the enrichments in JSON format.
The news stories of the New York Times from which these enrichments have been developed are available from LDC.
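As a minimal sketch of how the JSON enrichments might be consumed (the file name and the fields trend_entity and related_entities are hypothetical; consult the repository for the actual schema):

import json

# Load one enrichment file (path and field names are hypothetical).
with open("enrichments.json") as f:
    trends = json.load(f)

for trend in trends:
    # Each trend pairs a trending entity with candidate entities to be
    # ranked by how well they explain why the entity is trending.
    print(trend["trend_entity"], len(trend["related_entities"]))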
Data Splits
We perform two kinds of evaluation.
Use
Please cite the dataset and the accompanying paper if you find the resources in this repository useful:
@inproceedings{ponza2021,
  title     = {Contextualizing Trending Entities in News Stories},
  author    = {Ponza, Marco and Ceccarelli, Diego and Ferragina, Paolo and Meij, Edgar and Kothari, Sambhav},
  booktitle = {Proceedings of the 14th ACM International Conference on Web Search and Data Mining},
  year      = {2021},
}
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .
The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
The dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training and 7,600 testing samples.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the AG News subset.
ds = tfds.load('ag_news_subset', split='train')

# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Iron Ore rose to 98.27 USD/T on July 23, 2025, up 0.16% from the previous day. Over the past month, Iron Ore's price has risen 3.85%, but it is still 8.56% lower than a year ago, according to trading on a contract for difference (CFD) that tracks the benchmark market for this commodity. Iron Ore: values, historical data, forecasts, and news, updated as of July 2025.