Text statistics
This dataset is a combination of the following datasets:
agentlans/text-quality-v2
agentlans/readability
agentlans/twitter-sentiment-meta-analysis
The main purpose is to gather these large datasets in one place for easy training and evaluation.
Data Preparation and Transformation
Quality Score Normalization
The dataset was enhanced with additional columns, and quality scores (n = 909 533) were normalized using Ordered Quantile… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/text-stats.
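The description above is truncated, so the exact normalization procedure is not shown here. As a rough illustration only, and assuming the phrase refers to ordered quantile (rank-based inverse normal) normalization, the sketch below maps each score's average rank to a standard normal quantile. The function name `ordered_quantile_normalize` and the sample scores are hypothetical and do not come from the dataset.

```python
from statistics import NormalDist


def ordered_quantile_normalize(scores):
    """Map scores to standard-normal quantiles by rank (ties share the average rank)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    # (rank - 0.5) / n always lies strictly between 0 and 1, so inv_cdf is defined.
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical raw quality scores, not taken from the dataset.
raw_scores = [0.2, 3.5, 1.1, 1.1, 9.0]
normalized = ordered_quantile_normalize(raw_scores)
```

The transform preserves the ordering of the scores and depends only on ranks, which is why it is often used when raw scores from different sources sit on incompatible scales.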
https://electroiq.com/privacy-policy
SMS Marketing Statistics: With mobile marketing on the rise, it’s no wonder SMS marketing is becoming more popular among businesses. SMS, which stands for short message service, is a key part of mobile and text message marketing. This tool helps you boost brand awareness, engage with customers, and increase sales. It lets e-commerce marketers build a stronger relationship with their customers and connect with them more personally. SMS marketing is both practical and smart, changing how you grow your business.
Using SMS ensures quick delivery and efficiency, as your message can be sent instantly and reach customers’ phones within seconds. SMS is especially effective for advertising time-sensitive sales or special promotions exclusive to your mobile customers. We shall shed more light on SMS Marketing Statistics through this article.
Information and communication was the industry with the highest usage of artificial intelligence (AI) for text analysis in Denmark in 2023, with ** enterprises. Construction had the smallest share, with only * company.
https://lss01.lingsoft.fi/assets/docs/Lingsoft%20Language%20Management%20Central%20-%20Terms%20of%20Service.pdf
Named entity recognition
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset consists of 925 sentences in English paired with a broad topic descriptor for use as example data in product demonstrations or student projects.
Curated by: billingsmoore
Language(s) (NLP): English
License: Apache License 2.0
Direct Use
This data can be loaded using the following Python code:
from datasets import load_dataset
ds = load_dataset('billingsmoore/text-clustering-example-data')
It can then be clustered using the… See the full description on the dataset page: https://huggingface.co/datasets/billingsmoore/text-clustering-example-data.
This dataset was created by DEBJYOTI SAHA
https://cubig.ai/store/terms-of-service
1) Data introduction
• Emotion-analysis dataset is data for analyzing the emotions of text.
2) Data utilization
(1) Emotion-analysis data has characteristics that:
• Contains a variety of texts that convey emotions ranging from happiness to anger to sadness. The goal is to build an efficient model for detecting emotions in text.
(2) Emotion-analysis data can be used to:
• Sentiment classification models: This dataset can be used to train machine learning models that classify text based on sentiment, which helps companies and researchers understand public opinion and sentiment trends.
• Market research: Researchers can analyze sentiment data to understand consumer preferences and market trends and support data-driven decision making.
This data asset contains data files of text extracted from PDF reports on the Development Experience Clearinghouse (DEC) for the years 2011 to 2021 (as of July 2021). It includes three specific "Document types" identified by the DEC: Final Contractor/Grantee Report, Final Evaluation Report, and Special Evaluation. Each PDF document labeled as one of these three document types and labeled with a publication year from 2011 to 2021 was downloaded from the DEC in July 2021. The dataset includes text data files from 2,579 Final Contractor/Grantee Reports, 1,299 Final Evaluation Reports, and 1,323 Special Evaluation reports. Raw text from each of these PDFs was extracted and saved as an individual CSV file whose name corresponds to the Document ID of the PDF document on the DEC. Within each CSV file, the raw text is split into paragraphs and corresponding sentences. In addition, to enable Natural Language Processing of the data, the sentences are cleaned by removing unnecessary special characters, punctuation, and numbers, and each word is stemmed to its root to remove inflections (e.g. pluralization and conjugation). This data could be used to analyze trends in USAID's programming approaches and terminology. This data was compiled for USAID/PPL/LER with the Program Cycle Mechanism.
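The cleaning steps described above (dropping special characters, punctuation, and numbers, then stemming) can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used to build the dataset: the function names are hypothetical, and the toy suffix stripper `naive_stem` merely stands in for whatever stemmer was actually applied.

```python
import re

# Toy suffix list standing in for a real stemming algorithm; an assumption,
# not the stemmer used for the DEC data.
_SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_sentence(sentence):
    """Remove non-letter characters, lowercase, tokenize, and stem each token."""
    # Keep ASCII letters and whitespace only: drops punctuation, digits, and
    # other special characters.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", sentence)
    tokens = letters_only.lower().split()
    return [naive_stem(token) for token in tokens]


# Hypothetical sentence, not taken from the dataset.
example = clean_sentence("The 3 programs were evaluated in 2019!")
```

A production pipeline would more likely use an established stemmer; the point here is only the order of operations: strip, lowercase, tokenize, stem.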
This statistic shows mobile messaging volumes in the U.S. for selected years between 2004 and 2014. In 2010, approximately ***** billion messages were sent in total, up from ** billion in 2004.
U.S. mobile messaging volumes - additional information
A total of around *** trillion text messages were sent in the United States in 2012, marking an almost tenfold increase on the figure from 2006. A further ** million MMS messages were sent in the country in 2012, an increase from * million in 2006. In 2013, the United States was the country with the highest average number of text messages sent per month and per mobile connection. Over *** messages were sent monthly per mobile connection in the United States, in comparison to *** in the United Kingdom and *** in Germany.
The most active age group for sending and receiving text messages in the United States was those aged 18 to 29, as ** percent of respondents said that they used mobile messaging in 2013. By comparison, only ** percent of those aged 65 and older said that they used their mobile phone for text messaging in 2013.
Rather than using a mobile phone’s integrated text messaging service, many users are opting for third-party apps to communicate. As of January 2015, mobile messaging service WhatsApp had around 700 million monthly active users, double the number of users it had in October 2013. Within the U.S. market, iOS and Android users spent a total of 680 million minutes on WhatsApp in February 2013, with those aged between 25 and 34 years most likely to use the service in 2014.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific to this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff’s alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
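For readers unfamiliar with the agreement measure quoted above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It assumes categorical sentiment labels; the function name and the toy annotations are hypothetical and are not drawn from the corpus.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels given to one item."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items rated only once carry no agreement information
        # Every ordered pair of ratings within an item contributes 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # alpha is undefined when only one label occurs
    return 1 - (n - 1) * observed / expected


# Hypothetical annotations: three items, two annotators each.
toy_units = [["positive", "positive"], ["positive", "negative"], ["negative", "negative"]]
alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which the reported value above 0.8 should be read.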
https://www.archivemarketresearch.com/privacy-policy
The global Text Analysis System market is experiencing robust growth, driven by the increasing need for businesses to extract actionable insights from unstructured textual data. The market, valued at approximately $5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by several key factors, including the rising adoption of cloud-based solutions offering scalability and cost-effectiveness, the expanding use of text analytics in various sectors like customer service, marketing, and risk management, and the increasing availability of sophisticated AI-powered tools capable of handling complex natural language processing tasks. Large enterprises are currently the dominant segment, but the SME sector is demonstrating rapid growth potential, driven by the accessibility of user-friendly and cost-effective solutions. While data privacy and security concerns present a restraint, the overall market trajectory remains positive, fueled by continued technological advancements and growing data volumes. The competitive landscape is marked by a mix of established players like SAP, Microsoft, and IBM, alongside innovative technology providers such as RapidMiner and Luminoso. Regional analysis indicates North America currently holds the largest market share, driven by early adoption of advanced analytics and a strong technology infrastructure. However, significant growth opportunities exist in the Asia Pacific region, particularly in countries like China and India, due to their burgeoning digital economies and increasing demand for data-driven decision-making across various industries. The ongoing development of sophisticated algorithms capable of handling multilingual text and sentiment analysis, along with the integration of text analysis into broader business intelligence platforms, will further propel market expansion in the forecast period (2025-2033). 
This continuous evolution ensures the text analysis system market remains dynamic and highly lucrative.
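To make the growth figures quoted above concrete, the sketch below compounds the stated 2025 value (approximately $5 billion) at the stated 15% CAGR through 2033. The function name `project_value` is hypothetical, and the result is plain arithmetic on the quoted figures rather than a number taken from the report.

```python
def project_value(base_value, cagr, years):
    """Compound a starting value at a constant annual growth rate."""
    return base_value * (1 + cagr) ** years


# Figures quoted in the description: about USD 5 billion in 2025, 15% CAGR to 2033.
projected_2033 = project_value(5.0, 0.15, 2033 - 2025)  # in USD billions
```

Under these assumptions the market would roughly triple over the eight-year forecast period, reaching about $15.3 billion in 2033.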
https://www.archivemarketresearch.com/privacy-policy
The global text analysis system market is estimated to be valued at USD XXX million in 2025 and is projected to reach USD XXX million by 2033, at a CAGR of XX%. The increasing need for extracting insights from unstructured data, government initiatives promoting digital transformation, and the growing adoption of cloud-based solutions drive market growth. Various industries, including healthcare, retail, and finance, are increasingly using text analysis systems to analyze customer feedback, monitor social media trends, and improve product development. Key market trends include the rising adoption of artificial intelligence (AI) and machine learning (ML) algorithms, the proliferation of cloud-based solutions, and the growing emphasis on data privacy and security. The increasing availability of open-source text analysis tools and the emergence of low-code/no-code platforms are also expected to fuel market expansion. Moreover, the growing adoption of text analysis systems in emerging economies presents significant growth opportunities. Key players in the market include SAP SE, Microsoft Corporation, RapidMiner Inc., OpenText Corporation, Luminoso Technologies Inc., Lexalytics Inc., Infegy Inc., Micro Focus International PLC, IBM Corporation, Clarabridge Inc., Medallia Inc., SAS Institute Inc., and others. These companies offer a wide range of solutions, from on-premise to cloud-based, to cater to the diverse needs of various industries. Mergers and acquisitions, strategic partnerships, and new product launches are some of the key growth strategies adopted by these companies.
This dataset was created by Ankush Tiwari
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled in the work by Böschen et al. that we cited. We excluded the DeGruyter dataset and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction from detected text rows. This is inside our code (code.tar) as text_recognition_multipro.py.
We used a Java script provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Climate Policy Radar Open Data
This repo contains the full text data of all of the documents from the Climate Policy Radar database (CPR), which is also available at Climate Change Laws of the World (CCLW). Please note that this replaces the Global Stocktake open dataset: that data, including all NDCs and IPCC reports is now a subset of this dataset.
What’s in this dataset
This dataset contains two corpus types (groups of the same types or sources of documents) which… See the full description on the dataset page: https://huggingface.co/datasets/ClimatePolicyRadar/all-document-text-data.
https://coolest-gadgets.com/privacy-policy
SMS Marketing Statistics: SMS marketing is one of the most effective tools for businesses to connect with their customers. With nearly everyone owning a mobile phone, text messages offer a direct and personal way to share information. Statistics show that SMS messages have an incredibly high open rate, often exceeding 90%. Unlike emails that can go unread or calls that may be ignored, texts are usually seen within minutes.
Businesses are using SMS marketing to promote products, share offers, and provide updates quickly. It’s a cost-effective method that works for companies of all sizes. As mobile phone usage continues to grow, SMS marketing is becoming an essential part of any marketing strategy. Understanding its impact can help businesses improve customer engagement.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We address the question of authorship of biblical texts by applying statistical analysis to word frequencies, using a new method that is particularly sensitive to deviations in the frequencies of a few words out of potentially many. The data below consists of the "discriminating words" which have the biggest effect on the value of the Higher Criticism statistic from the analyses of 50 chapters.
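The description does not spell out how the Higher Criticism (HC) statistic is computed. As a hedged illustration, the sketch below uses the standard Donoho-Jin form of HC on a list of per-word P-values, which may differ from the variant the authors applied. The function name, the `gamma` cutoff, and the P-values are all hypothetical.

```python
import math


def higher_criticism(p_values, gamma=0.5):
    """Return (HC statistic, count of smallest P-values at or below the HC threshold)."""
    p_sorted = sorted(p_values)
    n = len(p_sorted)
    best_hc, best_i = -math.inf, 0
    # Only the smallest gamma * n P-values are scanned (at least one).
    for i in range(1, max(1, int(gamma * n)) + 1):
        p = p_sorted[i - 1]
        if not 0.0 < p < 1.0:
            continue  # the standardization below is undefined at 0 or 1
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p))
        if z > best_hc:
            best_hc, best_i = z, i
    return best_hc, best_i


# Hypothetical per-word P-values; one word deviates strongly from the rest.
hc_value, n_selected = higher_criticism([0.001, 0.5, 0.9, 0.95])
```

In this reading, the words whose P-values fall at or below the maximizing index are the ones that drive the statistic, which corresponds loosely to the "discriminating words" mentioned above.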
Two datasets were used to conduct the quantitative analysis of the text classification area:
biblio.bib contains all articles, grouped into categories
biblio.csv contains processed records from biblio.bib; the statistics presented in the article were built from it
The ViTexOCR script presents a new method for extracting navigation data from videos with text overlays using optical character recognition (OCR) software. Over the past few decades, it was common for videos recorded during surveys to be overlaid with real-time geographic positioning satellite chyrons including latitude, longitude, date and time, as well as other ancillary data (such as speed, heading, or user input identifying fields). Embedding these data into videos provides them with utility and accuracy, but using the location data for other purposes, such as analysis in a geographic information system, is not possible when only available on the video display. Extracting the text data from imagery using software allows these videos to be located and analyzed in a geospatial context. The script allows a user to select a video, specify the text data types (e.g. latitude, longitude, date, time, or other), text color, and the pixel locations of overlay text data on a sample video frame. The script’s output is a data file containing the retrieved geospatial and temporal data. All functionality is bundled in a Python script that incorporates a graphical user interface and several other software dependencies.