Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
Original GitHub repo: https://github.com/cs-chan/Total-Text-Dataset
Forked repo: https://github.com/yunusserhat/Total-Text-Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Total Text Dataset is a dataset for object detection tasks - it contains Text annotations for 1,255 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This statistic shows mobile messaging volumes in the U.S. for selected years between 2004 and 2014. In 2010, approximately ***** billion messages were sent in total, up from ** billion in 2004.
U.S. mobile messaging volumes - additional information
A total of around *** trillion text messages were sent in the United States in 2012, marking an almost tenfold increase on the figure from 2006. A further ** million MMS messages were sent in the country in 2012, an increase from * million in 2006. In 2013, the United States was the country with the highest average number of text messages sent per month and per mobile connection. Over *** messages were sent monthly per mobile connection in the United States, in comparison to *** in the United Kingdom and *** in Germany.
The most active age group for sending and receiving text messages in the United States was those aged 18 to 29, as ** percent of respondents in that group said that they used mobile messaging in 2013. By comparison, only ** percent of those aged 65 and older said that they used their mobile phone for text messaging in 2013.
Rather than using a mobile phone’s integrated text messaging service, many users are opting for third-party apps to communicate. As of January 2015, mobile messaging service WhatsApp had around 700 million monthly active users, double the number of users it had in October 2013. Within the U.S. market, iOS and Android users spent a total of 680 million minutes on WhatsApp in February 2013, with those aged between 25 and 34 years most likely to use the service in 2014.
This data was used in the TREC 2015 and 2016 total recall track. The goal of the total recall track was to help develop retrieval systems tuned to retrieving ALL relevant information, as opposed to common web search engines where one good answer could be sufficient.
In 2021, mobile users in the United States sent roughly 2 trillion SMS or MMS messages. Following a sharp drop-off in 2012, the number of SMS and MMS messages sent in the U.S. has generally increased over the past several years to another peak in 2020, during the COVID pandemic, at 2.2 trillion SMS or MMS messages.
The total number of SMS and MMS messages sent in Turkey mostly declined, with some fluctuations, from the first quarter of 2019 to the first quarter of 2024. The number of SMS messages sent went down to nearly *** billion in the first quarter of 2024 from **** billion in the first quarter of 2019. However, the number of MMS messages sent increased in the first quarter of 2024, and amounted to nearly ** million.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The Ministry of Education's Basic Education Statistical Booklet captures national statistics for the education sector as a whole. This dataset details the number of English, Kiswahili, Maths, Biology, Chemistry and Physics textbooks across the 47 counties. Source: Ministry of Education, Basic Education Statistical Booklet, Table 84: Total Secondary Schools Text Book for Selected Subjects
The Total-Text dataset contains text of various shapes, including horizontal, multi-oriented, and curved.
The total equity of Open Text, headquartered in Canada, amounted to *** billion U.S. dollars in 2024. The reported fiscal year ends on June 30. Compared to the earliest depicted value from 2020, this is a total increase of approximately **** billion U.S. dollars. The trend from 2020 to 2024 shows, however, that this increase did not happen continuously.
green-luigi/total-text dataset hosted on Hugging Face and contributed by the HF Datasets community
The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which contains images of complex everyday scenes. The COCO-Text dataset contains non-text images, legible text images and illegible text images. In total there are 22184 training images and 7026 validation images with at least one instance of legible text.
https://opensource.org/licenses/BSD-3-Clause
This dataset consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
“…” indicates excluded text for brevity. Message types include: ‘(S):’–a ‘system’ message containing customizable content with automated sending; ‘(P):’–a ‘patient’ message containing patient remarks in free text form; and ‘(C):’–a ‘clinician’ message containing clinician remarks in free text form.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
TEL is the Threatening English Language corpus. It is a collection of 309 written texts compiled from the publicly-available portion of CTARC (the Communicated Threat Assessment Research Corpus, compiled by Tammy Gales), MFT (the Malicious Forensic Texts corpus, compiled by Andrea Nini), and the written portion of CoJO (the Corpus of Judicial Opinions, compiled by Julia Muschalik). Additional texts are from ForensicLing.com (the forensic linguistic data site hosted by Tammy Gales and Dakota Wing). Basic metadata is supplied for each text where known from the original case research. We wish to thank our graduate student fellows who helped compile the texts and metadata: Nicole Harris, Annina van Riper, Zara Rabinko, and Zachary Boudreaux.
Total texts: 309
Total estimated authors: 203
Total word count: 54,167
METADATA KEY
TG = Tammy Gales (public portion of CTARC)
AN = Andrea Nini (MFT)
JM = Julia Muschalik (written portion of CoJO)
FL = ForensicLing.com (Tammy Gales and Dakota Wing)
Name###_## = file name, case number, text number within case. The file name might be the threat recipient or the author; the remaining information is about the author, where known.
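The naming key above can be read as a simple pattern. The following is a minimal sketch, assuming the name part is purely alphabetic, the case number has exactly three digits, and the text number has exactly two; the corpus itself may deviate from this, and the sample stem `Smith012_03` is invented for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed reading of "Name###_##": letters, three digits, underscore, two digits.
# This regex is an illustration, not a specification published with the corpus.
FILE_NAME_RE = re.compile(r"^(?P<name>[A-Za-z]+)(?P<case>\d{3})_(?P<text>\d{2})$")


class TelFileName(NamedTuple):
    name: str          # threat recipient or author, per the metadata key
    case_number: int   # the ### part
    text_number: int   # the ## part (text number within the case)


def parse_file_name(stem: str) -> Optional[TelFileName]:
    """Split a file stem into its parts, or return None if it does not match."""
    match = FILE_NAME_RE.match(stem)
    if match is None:
        return None
    return TelFileName(match["name"], int(match["case"]), int(match["text"]))


# Hypothetical file stem, not taken from the corpus.
print(parse_file_name("Smith012_03"))
```

Returning `None` for non-matching stems keeps the sketch usable on a directory listing where some files follow a different convention.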
The total equity of SMS Co., Ltd., headquartered in Japan, amounted to 44.28 billion Japanese yen in 2023. The reported fiscal year ends on March 31. Compared to the earliest depicted value from 2020, this is a total increase of approximately 21.62 billion Japanese yen. The trend from 2020 to 2023 shows, furthermore, that this increase happened continuously.
The graph shows the monthly number of text messages in China from ********* to *********. In *********, about ***** billion text messages were sent in China.
Text messaging in China – additional information
There has been a significant decline in text messaging since the total number of text messages sent in China peaked in 2012 at *** billion. The decrease is even more noticeable in terms of text messages sent per person, given the increasing number of registered mobile users in China. The reason for the continuous decline is fairly clear: with the growing popularity of smartphones and mobile internet, Chinese mobile users prefer mobile messaging apps for sharing information. The use of mobile messaging apps is almost universal among Chinese smartphone users; around ** percent of iPhone users in China use WeChat, for example, the most popular Chinese messaging app, developed by Tencent. As of the second quarter of 2015, the number of monthly active WeChat users had reached approximately *** million.
Mobile messaging apps like WeChat are gaining rapid traction among Chinese users because they offer more than an alternative to texting. Voice messaging, also known as “push-to-talk”, was the most commonly used function of WeChat in 2014. One reason may be that the Chinese language is relatively hard to type, so voice messaging keeps the user's hands free and saves a considerable amount of time. In addition, mobile messaging apps in China are more appealing because they include social media features: as of ********, about ** percent of WeChat users had used “moment”, a sharing feature that lets people exchange stories, photos and short videos within their circle of friends. Moreover, China’s instant messaging apps like WeChat are expanding their services into sectors such as gaming, commercial promotion, online shopping, and even banking.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
This is a large multilingual toxicity dataset with 3M rows of text data from 55 natural languages, all of which are written/sent by humans, not machine translation models. The preprocessed training data alone consists of 2,880,667 rows of comments, tweets, and messages. Among these rows, 416,529 are classified as toxic, while the remaining 2,463,773 are considered neutral. Below is a table to illustrate the data composition:
Toxic Neutral Total
multilingual-train-deduplicated.csv… See the full description on the dataset page: https://huggingface.co/datasets/FredZhang7/toxi-text-3M.
The WOS Hierarchical Text Classification datasets are three variants created from Web of Science (WOS) title and abstract data, categorised into a hierarchical, multi-label class structure. The aim of the sampling and filtering methodology was to create well-balanced class distributions (at chosen hierarchical levels). Furthermore, the WOS_JTF variant was created to contain only publication data whose class assignments result in class instances that are semantically more similar.
The three dataset variants have the following properties:
1. WOS_JT comprises 43,366 total samples (train=30356, dev=6505, test=6505) and only uses the journal-based classifications as labels.
2. WOS_CT comprises 65,200 total samples (train=45640, dev=9780, test=9780) and only uses citation-based classifications as labels.
3. WOS_JTF comprises 42,926 total samples (train=30048, dev=6439, test=6439) and uses a filtered set of papers based on journal and citation classification.
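As a quick sanity check on the figures quoted above, the sketch below confirms that each variant's train, dev and test sizes add up to its stated total and reports the train fraction. The numbers are copied from the description; the dictionary layout and function name are only illustrative.

```python
# Split sizes exactly as quoted in the dataset description.
VARIANTS = {
    "WOS_JT": {"total": 43366, "train": 30356, "dev": 6505, "test": 6505},
    "WOS_CT": {"total": 65200, "train": 45640, "dev": 9780, "test": 9780},
    "WOS_JTF": {"total": 42926, "train": 30048, "dev": 6439, "test": 6439},
}


def split_report(sizes: dict) -> dict:
    """Check that the splits sum to the total and compute the train share."""
    split_sum = sizes["train"] + sizes["dev"] + sizes["test"]
    return {
        "sums_to_total": split_sum == sizes["total"],
        "train_fraction": sizes["train"] / sizes["total"],
    }


for name, sizes in VARIANTS.items():
    report = split_report(sizes)
    print(name, report["sums_to_total"], f"{report['train_fraction']:.3f}")
```

All three variants are internally consistent, and each works out to roughly a 70/15/15 split.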
The dataset is available at:
https://huggingface.co/datasets/marcelsun/wos_hierarchical_multi_label_text_classification
Dataset details:
- *.json: concatenated title and abstract mapped to a list of the associated class labels.
- depth2label.pt: dictionary where key = depth of classification hierarchy and value = list of classes associated with that depth.
- path_list.pt: list of tuples, one for every edge between classes in the hierarchical classification. This specifies the acyclic graph.
- slot.pt: dictionary where key = label_id of parent class and value = label_ids of children classes.
- value2slot.pt: dictionary where key = label_id and value = label_id of parent class.
- value_dict.pt: dictionary where key = label_id and value = string representation of class.
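The hierarchy files described above are different views of the same tree, so most of them can be derived from the child-to-parent mapping. The sketch below illustrates that relationship with a toy hierarchy in plain Python. The labels are invented, root classes are simply left out of `value2slot`, and depth is counted from 0; these are all assumptions, and the real `.pt` files (presumably loaded with `torch.load`) may encode roots and depths differently.

```python
from collections import defaultdict

# Toy stand-ins for value_dict.pt and value2slot.pt (hypothetical labels).
value_dict = {0: "Science", 1: "Physics", 2: "Biology", 3: "Optics"}
value2slot = {1: 0, 2: 0, 3: 1}  # child label_id -> parent label_id; root 0 omitted


def build_slot(child_to_parent: dict) -> dict:
    """Invert child->parent into parent->children (the slot.pt view)."""
    slot = defaultdict(list)
    for child, parent in sorted(child_to_parent.items()):
        slot[parent].append(child)
    return dict(slot)


def build_path_list(slot: dict) -> list:
    """List every (parent, child) edge (the path_list.pt view)."""
    return [(p, c) for p, children in sorted(slot.items()) for c in children]


def depth_of(label: int, child_to_parent: dict) -> int:
    """Count the number of steps from a label up to its root."""
    depth = 0
    while label in child_to_parent:
        label = child_to_parent[label]
        depth += 1
    return depth


def build_depth2label(labels, child_to_parent: dict) -> dict:
    """Group label ids by depth (the depth2label.pt view)."""
    depth2label = defaultdict(list)
    for label in sorted(labels):
        depth2label[depth_of(label, child_to_parent)].append(label)
    return dict(depth2label)


slot = build_slot(value2slot)
path_list = build_path_list(slot)
depth2label = build_depth2label(value_dict, value2slot)
print(slot, path_list, depth2label)
```

Storing the redundant views separately, as the dataset does, avoids recomputing them at training time; the derivation here is mainly useful for checking that the files agree with each other.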
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Currently, research on text classification tends to focus on applications in fields such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed in this paper using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. Subsequently, the proposed algorithm is combined with each of seven commonly used classifiers (DT, SVM, LR, NB, MLP, RF, and KNN), known for their good performance, to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing the overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: Acc, macro-F1, and micro-F1 values are respectively 2.29%, 5.55%, and 2.90% higher. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
In addition, the conclusions obtained from the predicted values and the true values are consistent, indicating that this algorithm is practical. The professional domain text dataset used in this paper poses greater challenges due to its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a useful exploration of applied research on complex Chinese text datasets in specific fields, and provides a useful reference for the vector representation and classification of text datasets with similar content.
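The general idea of weighting word embeddings by term importance can be sketched as follows. This is a simplified illustration that uses plain TF-IDF weights, not the paper's TF-IDF-CRF-POS scheme, whose exact formula is not given here. The tokenized documents and two-dimensional embeddings are invented toy values standing in for a real corpus and a trained Word2Vec model.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Uses idf = ln(N / df) + 1 so that terms found in every document keep a
    non-zero weight. This is ordinary TF-IDF, not TF-IDF-CRF-POS.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return weights


def weighted_doc_vector(doc, doc_weights, embeddings):
    """Average the embeddings of a document's terms, weighted by TF-IDF."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term in sorted(set(doc)):
        if term not in embeddings:
            continue  # skip out-of-vocabulary terms
        w = doc_weights[term]
        for i, x in enumerate(embeddings[term]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy data: hypothetical tokens and embeddings, for illustration only.
docs = [["lake", "mountain", "lake"], ["museum", "mountain"]]
embeddings = {"lake": [1.0, 0.0], "mountain": [0.0, 1.0], "museum": [1.0, 1.0]}

weights = tf_idf_weights(docs)
vectors = [weighted_doc_vector(d, w, embeddings) for d, w in zip(docs, weights)]
print(vectors[0])
```

In the first toy document, "lake" is both more frequent and rarer across the corpus than "mountain", so the document vector leans toward the "lake" embedding. The resulting vectors could then be passed to any of the classifiers named in the abstract.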
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The study aimed to assess the effectiveness of text messages as a replacement for routine postal reminders in a fecal immunochemical test (FIT) based colorectal cancer screening program in Catalonia. For that purpose, a randomized controlled trial was conducted. Study population: individuals aged 50 to 69 invited to screening who had not completed FIT within six weeks. The intervention group (n=12,167) received a text message reminder, and the control group (n=12,221) received the standard reminder letter. The primary outcome was the participation rate within 18 weeks of the invitation. The trial was discontinued, and a recovery strategy was implemented by sending a reminder letter to non-participants in the intervention group. We performed a final analysis to determine the impact of the recovery strategy. Results: Interim analysis (n=7095) showed a lower participation rate among non-participants within six weeks in the text message group compared to the control group (16.4% vs. 20.9%, OR 0.71, 95% CI 0.63–0.81). A total of 7591 non-participants in the text message group received a second reminder by letter, reaching a participation rate of 23%. Final analysis (n=24,388) showed that the intervention group, which received two reminders, had higher participation than the control group (29.3% vs. 26.5%, OR 1.16, 95% CI 1.09–1.23). Our attempt to replace reminder letters with text messages was unsuccessful, but receiving two reminders significantly increased participation rates among non-participants within six weeks compared to one postal reminder. Additional research is essential to determine the best timing and frequency of reminders to boost participation without being intrusive in individuals' choice of participation.
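For readers unfamiliar with odds ratios, the sketch below shows how a crude (unadjusted) odds ratio is computed from two participation rates. It uses the rounded percentages quoted above, so it only approximates the published values (0.71 and 1.16), which presumably come from the exact counts or an adjusted model; the function names are illustrative.

```python
def odds(p: float) -> float:
    """Convert a proportion into odds."""
    return p / (1.0 - p)


def odds_ratio(p_intervention: float, p_control: float) -> float:
    """Crude odds ratio of the intervention group relative to the control group."""
    return odds(p_intervention) / odds(p_control)


# Rounded participation rates quoted in the abstract.
interim = odds_ratio(0.164, 0.209)  # text message vs. letter, interim analysis
final = odds_ratio(0.293, 0.265)    # two reminders vs. one letter, final analysis

print(round(interim, 2), round(final, 2))
```

With these rounded inputs the crude ratios come out at about 0.74 and 1.15. Both point in the same direction as the reported estimates: below 1 at the interim analysis, and above 1 once the intervention group had received a second reminder.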