According to the results of a survey on customer experience (CX) among businesses conducted in the United States in 2021, the main challenge affecting data analysis capability for CX is the lack of reliability and integrity of available data. Data security followed, being chosen by almost 46 percent of the respondents.
According to a survey of TV marketers in the United States released in May 2023, 53 percent of respondents identified the lack of common metrics across channels as the main challenge of merging linear and digital data. The creation of a holistic framework for planning and measurement was mentioned by 41 percent of respondents, while 40 percent cited data-sharing restrictions by walled gardens.
In 2020, 54 percent of healthcare providers and 50 percent of healthcare payers surveyed in the United States indicated that lack of technical interoperability was the biggest challenge around health data sharing. Some 52 percent of providers noted that the timeliness of shared data was a challenge, whereas only 21 percent of payers shared that concern.
Overview
The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.4 (published 24 February 2025). The 11.4 release contains updated boundary lines and data refinements designed to extend the functionality of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control.
National Geospatial Data Asset
This dataset is a National Geospatial Data Asset (NGDAID 194) managed by the Department of State. It is a part of the International Boundaries Theme created by the Federal Geographic Data Committee.
Dataset Source Details
Sources for these data include treaties, relevant maps, and data from boundary commissions, as well as national mapping agencies. Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process includes analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.
Cartographic Visualization
The LSIB is a geospatial dataset that, when used for cartographic purposes, requires additional styling. The LSIB download package contains example style files for commonly used software applications. The attribute table also contains embedded information to guide the cartographic representation. Additional discussion of these considerations can be found in the Use of Core Attributes in Cartographic Visualization section below.
Additional cartographic information pertaining to the depiction and description of international boundaries or areas of special sovereignty can be found in Guidance Bulletins published by the Office of the Geographer and Global Issues: https://hiu.state.gov/data/cartographic_guidance_bulletins/
Contact
Direct inquiries to internationalboundaries@state.gov.
Direct download: https://data.geodata.state.gov/LSIB.zip
Attribute Structure
The dataset uses the following attributes divided into two categories:
ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | Core
CC1_GENC3 | Extension
CC1_WPID | Extension
COUNTRY1 | Core
CC2 | Core
CC2_GENC3 | Extension
CC2_WPID | Extension
COUNTRY2 | Core
RANK | Core
LABEL | Core
STATUS | Core
NOTES | Core
LSIB_ID | Extension
ANTECIDS | Extension
PREVIDS | Extension
PARENTID | Extension
PARENTSEG | Extension
These attributes have external data sources that update separately from the LSIB:
ATTRIBUTE NAME | EXTERNAL SOURCE
CC1 | GENC
CC1_GENC3 | GENC
CC1_WPID | World Polygons
COUNTRY1 | DoS Lists
CC2 | GENC
CC2_GENC3 | GENC
CC2_WPID | World Polygons
COUNTRY2 | DoS Lists
LSIB_ID | BASE
ANTECIDS | BASE
PREVIDS | BASE
PARENTID | BASE
PARENTSEG | BASE
The core attributes listed above describe the boundary lines contained within the LSIB dataset. Removal of core attributes from the dataset will change the meaning of the lines. An attribute status of "Extension" represents a field containing data interoperability information. Other attributes not listed above include "FID", "Shape_length" and "Shape." These are components of the shapefile format and do not form an intrinsic part of the LSIB.
Core Attributes
The eight core attributes listed above contain unique information which, when combined with the line geometry, comprise the LSIB dataset. These Core Attributes are further divided into Country Code and Name Fields and Descriptive Fields.
Country Code and Country Name Fields
"CC1" and "CC2" fields are machine readable fields that contain political entity codes. These are two-character codes derived from the Geopolitical Entities, Names, and Codes Standard (GENC), Edition 3 Update 18. "CC1_GENC3" and "CC2_GENC3" fields contain the corresponding three-character GENC codes and are extension attributes discussed below. The codes "Q2" or "QX2" denote a line in the LSIB representing a boundary associated with areas not contained within the GENC standard. The "COUNTRY1" and "COUNTRY2" fields contain the names of corresponding political entities. These fields contain names approved by the U.S. Board on Geographic Names (BGN) as incorporated in the "Independent States in the World" and "Dependencies and Areas of Special Sovereignty" lists maintained by the Department of State. To ensure maximum compatibility, names are presented without diacritics and certain names are rendered using common cartographic abbreviations. Names for lines associated with the code "Q2" are descriptive and not necessarily BGN-approved. Names rendered in all CAPITAL LETTERS denote independent states. Names rendered in normal text represent dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user.
Descriptive Fields
The following text fields are a part of the core attributes of the LSIB dataset and do not update from external sources. They provide additional information about each of the lines and are as follows:
ATTRIBUTE NAME | CONTAINS NULLS
RANK | No
STATUS | No
LABEL | Yes
NOTES | Yes
Neither the "RANK" nor "STATUS" fields contain null values; the "LABEL" and "NOTES" fields do. The "RANK" field is a numeric expression of the "STATUS" field. Combined with the line geometry, these fields encode the views of the United States Government on the political status of the boundary line. A value of "1" in the "RANK" field corresponds to an "International Boundary" value in the "STATUS" field.
Values of "2" and "3" correspond to "Other Line of International Separation" and "Special Line," respectively. The "LABEL" field contains required text to describe the line segment on all finished cartographic products, including but not limited to print and interactive maps. The "NOTES" field contains an explanation of special circumstances modifying the lines. This information can pertain to the origins of the boundary lines, limitations regarding the purpose of the lines, or the original source of the line.
Use of Core Attributes in Cartographic Visualization
Several of the Core Attributes provide information required for the proper cartographic representation of the LSIB dataset. The cartographic usage of the LSIB requires a visual differentiation between the three categories of boundary lines. Specifically, this differentiation must be between:
- International Boundaries (Rank 1);
- Other Lines of International Separation (Rank 2); and
- Special Lines (Rank 3).
Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the "Label" field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Please consult the style files in the download package for examples of this depiction. The requirement to incorporate the contents of the "LABEL" field on cartographic products is scale dependent. If a label is legible at the scale of a given static product, a proper use of this dataset would encourage the application of that label. Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field contains the preferred description for the three LSIB line types when they are incorporated into a map legend but is otherwise not to be used for labeling.
Use of the "CC1," "CC1_GENC3," "CC2," "CC2_GENC3," "RANK," or "NOTES" fields for cartographic labeling purposes is prohibited.
Extension Attributes
Certain elements of the attributes within the LSIB dataset extend data functionality to make the data more interoperable or to provide clearer linkages to other datasets. The fields "CC1_GENC3" and "CC2_GENC3" contain the corresponding three-character GENC code to the "CC1" and "CC2" attributes. The code "QX2" is the three-character counterpart of the code "Q2," which denotes a line in the LSIB representing a boundary associated with a geographic area not contained within the GENC standard. To allow for linkage between individual lines in the LSIB and World Polygons dataset, the "CC1_WPID" and "CC2_WPID" fields contain a Universally Unique Identifier (UUID), version 4, which provides a stable description of each geographic entity in a boundary pair relationship. Each UUID corresponds to a geographic entity listed in the World Polygons dataset. These fields allow for linkage between individual lines in the LSIB and the overall World Polygons dataset. Five additional fields in the LSIB expand on the UUID concept and either describe features that have changed across space and time or indicate relationships between previous versions of the feature. The "LSIB_ID" attribute is a UUID value that defines a specific instance of a feature. Any change to the feature in a lineset requires a new "LSIB_ID." The "ANTECIDS," or antecedent ID, is a UUID that references line geometries from which a given line is descended in time. It is used when there is a feature that is entirely new, not when there is a new version of a previous feature. This is generally used to reference countries that have dissolved. The "PREVIDS," or Previous ID, is a UUID field that contains old versions of a line. This is an additive field that houses all Previous IDs.
A new version of a feature is defined by any change to the feature (either line geometry or attribute), but it is still conceptually the same feature. The "PARENTID" field
Version 11.1 Release Date: August 22, 2022
The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. These data and their derivatives are the only international boundary lines approved for U.S. Government use. They reflect U.S. Government policy, and not necessarily de facto limits of control. This dataset is a National Geospatial Data Asset.
Sources for these data include treaties, relevant maps, and data from boundary commissions and national mapping agencies. Where available, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery of the data involves analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.
The dataset uses the following attributes:
Attribute Name | Explanation
Country Code | Country-level codes are from the Geopolitical Entities, Names, and Codes Standard (GENC). The Q2 code denotes a line representing a boundary associated with an area not in GENC.
Country Names | Names approved by the U.S. Board on Geographic Names (BGN). Names for lines associated with a Q2 code are descriptive and are not necessarily BGN-approved.
Label | Required text label for the line segment where scale permits
Rank/Status | Rank 1: International Boundary; Rank 2: Other Line of International Separation; Rank 3: Special Line
Notes | Explanation of any applicable special circumstances
Cartographic Usage
Depiction of the LSIB requires a visual differentiation between the three categories of boundaries: International Boundaries (Rank 1), Other Lines of International Separation (Rank 2), and Special Lines (Rank 3). Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the "Label" field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Additional cartographic information can be found in Guidance Bulletins (https://hiu.state.gov/data/cartographic_guidance_bulletins/) published by the Office of the Geographer and Global Issues. Please direct inquiries to internationalboundaries@state.gov.
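The rank-based depiction rules described above can be sketched in code. The following is a minimal illustration, not the styling shipped in the LSIB download package: the line widths and dash patterns are hypothetical values chosen only so that Rank 1 is most prominent and Rank 3 least, and the function name `style_for` is an assumption.

```python
from typing import Optional

# Rank-to-status mapping as stated in the dataset description.
RANK_STATUS = {
    1: "International Boundary",
    2: "Other Line of International Separation",
    3: "Special Line",
}

# Hypothetical widths and dash patterns (illustrative only): they just
# preserve the required visual ordering Rank 1 > Rank 2 > Rank 3.
RANK_STYLE = {
    1: {"width": 1.5, "dash": "solid"},
    2: {"width": 1.0, "dash": "dashed"},
    3: {"width": 0.5, "dash": "dotted"},
}


def style_for(rank: int, label: Optional[str], scale_permits_labels: bool) -> dict:
    """Return a drawing style for one boundary line."""
    if rank not in RANK_STATUS:
        raise ValueError(f"unknown rank: {rank}")
    style = dict(RANK_STYLE[rank])
    # The legend text comes from the status description, never from the codes.
    style["legend"] = RANK_STATUS[rank]
    # The label text is shown only where scale permits and a label exists.
    style["label"] = label if (scale_permits_labels and label) else None
    return style


print(style_for(2, "Example label text", scale_permits_labels=True))
```

Keeping the legend text and the label text as separate outputs mirrors the description's distinction between the status wording used in a legend and the label wording used on the line itself.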
The lines in the LSIB dataset are the product of decades of collaboration between geographers at the Department of State and the National Geospatial-Intelligence Agency with contributions from the Central Intelligence Agency and the UK Defence Geographic Centre. Attribution is welcome: U.S. Department of State, Office of the Geographer and Global Issues.
This version of the LSIB contains changes and accuracy refinements for the following line segments. These changes reflect improvements in spatial accuracy derived from newly available source materials, an ongoing review process, or the publication of new treaties or agreements. Changes to lines include:
• Akrotiri (UK) / Cyprus
• Albania / Montenegro
• Albania / Greece
• Albania / North Macedonia
• Armenia / Turkey
• Austria / Czechia
• Austria / Slovakia
• Austria / Hungary
• Austria / Slovenia
• Austria / Germany
• Austria / Italy
• Austria / Switzerland
• Azerbaijan / Turkey
• Azerbaijan / Iran
• Belarus / Latvia
• Belarus / Russia
• Belarus / Ukraine
• Belarus / Poland
• Bhutan / India
• Bhutan / China
• Bulgaria / Turkey
• Bulgaria / Romania
• Bulgaria / Serbia
• China / Tajikistan
• China / India
• Croatia / Slovenia
• Croatia / Hungary
• Croatia / Serbia
• Croatia / Montenegro
• Czechia / Slovakia
• Czechia / Poland
• Czechia / Germany
• Finland / Russia
• Finland / Norway
• Finland / Sweden
• France / Italy
• Georgia / Turkey
• Germany / Poland
• Germany / Switzerland
• Greece / North Macedonia
• Guyana / Suriname
• Hungary / Slovenia
• Hungary / Serbia
• Hungary / Romania
• Hungary / Ukraine
• Iran / Turkey
• Iraq / Turkey
• Italy / Slovenia
• Italy / Switzerland
• Italy / Vatican City
• Italy / San Marino
• Kazakhstan / Russia
• Kazakhstan / Uzbekistan
• Kosovo / North Macedonia
• Kosovo / Serbia
• Kyrgyzstan / Tajikistan
• Kyrgyzstan / Uzbekistan
• Latvia / Russia
• Latvia / Lithuania
• Lithuania / Poland
• Lithuania / Russia
• Moldova / Ukraine
• Moldova / Romania
• Norway / Russia
• Norway / Sweden
• Poland / Russia
• Poland / Ukraine
• Poland / Slovakia
• Romania / Ukraine
• Romania / Serbia
• Russia / Ukraine
• Syria / Turkey
• Tajikistan / Uzbekistan
This release also contains topology fixes, land boundary terminus refinements, and tripoint adjustments.
While U.S. Government works prepared by employees of the U.S. Government as part of their official duties are not subject to Federal copyright protection (see 17 U.S.C. § 105), copyrighted material incorporated in U.S. Government works retains its copyright protection. The works on or made available through download from the U.S. Department of State's website may not be used in any manner that infringes any intellectual property rights or other proprietary rights held by any third party. Use of any copyrighted material beyond what is allowed by fair use or other exemptions may require appropriate permission from the relevant rightsholder. With respect to works on or made available through download from the U.S. Department of State's website, neither the U.S. Government nor any of its agencies, employees, agents, or contractors make any representations or warranties, express, implied, or statutory, as to the validity, accuracy, completeness, or fitness for a particular purpose; nor represent that use of such works would not infringe privately owned rights; nor assume any liability resulting from use of such works; and shall in no way be liable for any costs, expenses, claims, or demands arising out of use of such works.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States SBOI: sa: Most Pressing Problem: Survey High: Competit'n frm Big Bus data was reported at 14.000 % in Feb 2025. This stayed constant from the previous number of 14.000 % for Jan 2025. United States SBOI: sa: Most Pressing Problem: Survey High: Competit'n frm Big Bus data is updated monthly, averaging 14.000 % from Jan 2014 (Median) to Feb 2025, with 130 observations. The data reached an all-time high of 14.000 % in Feb 2025 and a record low of 14.000 % in Feb 2025. United States SBOI: sa: Most Pressing Problem: Survey High: Competit'n frm Big Bus data remains in active status in CEIC and is reported by National Federation of Independent Business. The data is categorized under Global Database's United States – Table US.S032: NFIB Index of Small Business Optimism. [COVID-19-IMPACT]
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
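Because the files hold cumulative counts, daily new cases have to be derived by differencing consecutive days within each state. The sketch below illustrates that step on a few made-up rows; the column names (`date`, `state`, `cases`) and the sample values are assumptions for illustration, not an excerpt from the repository.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in an assumed layout: date, state, cumulative cases.
SAMPLE = """date,state,cases
2020-03-01,Washington,10
2020-03-02,Washington,14
2020-03-03,Washington,21
2020-03-02,New York,1
2020-03-03,New York,3
"""


def daily_new_cases(csv_text):
    """Convert cumulative counts into per-day new counts for each state."""
    rows = sorted(
        csv.DictReader(io.StringIO(csv_text)),
        key=lambda r: (r["state"], r["date"]),  # ISO dates sort correctly as text
    )
    previous = {}
    result = defaultdict(list)
    for row in rows:
        state = row["state"]
        total = int(row["cases"])
        # On a state's first day, the whole cumulative total counts as new.
        result[state].append((row["date"], total - previous.get(state, 0)))
        previous[state] = total
    return dict(result)


for state, series in daily_new_cases(SAMPLE).items():
    print(state, series)
```

A negative difference would indicate a downward revision in the cumulative series; this sketch leaves such values as they are rather than correcting them.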
The U.S. Census defines Asian Americans as individuals having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent (U.S. Office of Management and Budget, 1997). As a broad racial category, Asian Americans are the fastest-growing minority group in the United States (U.S. Census Bureau, 2012). The growth rate of 42.9% in Asian Americans between 2000 and 2010 is phenomenal given that the corresponding figure for the U.S. total population is only 9.3% (see Figure 1). Currently, Asian Americans make up 5.6% of the total U.S. population and are projected to reach 10% by 2050. It is particularly notable that Asians have recently overtaken Hispanics as the largest group of new immigrants to the U.S. (Pew Research Center, 2015). The rapid growth rate and unique challenges as a new immigrant group call for a better understanding of the social and health needs of the Asian American population.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Brackish groundwater (BGW), defined for this assessment as having a dissolved-solids concentration between 1,000 and 10,000 milligrams per liter, is an unconventional source of water that may offer a partial solution to current (2016) and future water challenges. In support of the National Water Census, the U.S. Geological Survey has completed a BGW assessment to gain a better understanding of the occurrence and character of BGW resources of the United States as an alternative source of water. Analyses completed as part of this assessment relied on previously collected data from multiple sources, and no new data were collected. One of the most important contributions of this assessment was the creation of a database containing chemical data and aquifer information for the known quantities of BGW in the United States. Data were compiled from single publications to large datasets and from local studies to national assessments, and include chemical data on the concentrations of disso ...
Dataset Summary
This dataset aims to facilitate the creation of sophisticated, multi-turn dialogue datasets focused on coding for Large Language Models (LLMs).
It also serves as a robust foundation for problem-solving in Large Language Models (LLMs).
The dataset includes both accepted and failed solutions from AtCoder's (ABC) contests.
In total, it features 1911 unique problems and 384,536 submissions across over 50 different programming languages.
It covers contests from ABC… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/atcoder_contests.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
There are several works applying Natural Language Processing to newspaper reports. Rameshbhai et al. mined opinions from headlines [1] using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, there is not much NLP research devoted to studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight the virus and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done by Jelodar et al. [9], using an LSTM network to classify sentiments from online discussion forums. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, and it contributes to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports on COVID-19 from the United States of America (USA). The newspapers include The Washington Post (USA) and StarTribune (USA). We named this dataset "Covid-News-USA-NNK". We also accumulated 50 online newspaper reports on the issue from Bangladesh and named that dataset "Covid-News-BD-NNK". The newspapers include The Daily Star (BD) and Prothom Alo (BD). All of these newspapers are among the top providers and most widely read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure that the news reports were highly relevant to the subject. The newspaper websites had dynamic content with advertisements in no particular order, so online scrapers were highly likely to collect inaccurate news reports. One of the challenges in collecting the data was the subscription requirement: each newspaper charged $1 per subscription. The following criteria were provided as guidelines to the human data-collectors:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news report must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above mentioned newspapers.
To collect these data, we used a Google Form for both USA and BD. Two human editors went through each entry to check for spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
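The four pre-processing steps listed above can be sketched with the standard library alone. This is not the authors' script: the stop-word list is a tiny illustrative subset, lowercasing is an added assumption to make stop-word matching work, and `naive_lemma` is a crude suffix-stripping stand-in for a real lemmatizer.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Assumed reading of "non-English alphanumeric": keep only A-Z, a-z, 0-9, whitespace.
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def naive_lemma(word):
    """Crude stand-in for lemmatization: strips common plural endings only."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    text = URL_RE.sub(" ", text)        # 1. remove hyperlinks
    text = NON_ALNUM_RE.sub(" ", text)  # 2. remove other characters
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. remove stop words
    return [naive_lemma(t) for t in tokens]              # 4. lemmatize (naively)


print(preprocess("The cases are rising in cities: see https://example.com/report"))
```

Hyperlinks are removed before punctuation so that a URL is dropped as a whole rather than being split into stray word fragments.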
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100
Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository, under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON files using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
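Regenerating JSON from CSV, as described above, could look like the following sketch. It is not the repository's script: the function name and the column names `headline` and `body` are hypothetical, and the example works on in-memory text rather than on the actual files.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Turn CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # ensure_ascii=False keeps any non-ASCII characters readable in the output.
    return json.dumps(list(reader), ensure_ascii=False, indent=2)


# Hypothetical two-column layout used only for this example.
sample_csv = "headline,body\nExample headline,Example body text\n"
print(csv_to_json(sample_csv))
```

For files on disk, the same function can be fed the result of reading the CSV file as text, with the returned string written to the JSON file.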
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text into numerical representations, which can be fed to multiple machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms designed solely for Natural Language Processing tasks. The two most prominent are BERT [14], which is built on a bidirectional Transformer encoder and performs strongly on classification and masked-word prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are typically used as pre-trained models because training them carries a huge computational cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].
Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the highest weights.
Word clouds are a great visualization technique to understand the overall "talk of the topic". The clustered words give us a quick understanding of the content.
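To make the graph-based weighting concrete, here is a simplified TextRank-style sketch: words that co-occur within a small window are linked, and PageRank-style scores are iterated over that graph. The window size, damping factor, iteration count, and function name are illustrative assumptions, and the sketch omits refinements such as part-of-speech filtering.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank-style scores on a co-occurrence graph."""
    # Build an undirected graph linking words that appear near each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor shares its score equally among its own links.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


tokens = "covid cases covid deaths covid masks covid".split()
print(textrank_keywords(tokens, window=1, top_k=1))
```

In this toy input every other word is linked only to "covid", so "covid" receives the largest score; on real text the tokens would first go through stop-word removal so that function words do not dominate the graph.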
4 Our experiments and Result analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can note the following:
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the state of greatest concern. In April, it seemed to be even more concerned.
Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.
Washington Post discussed global issues more than StarTribune.
StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread in the United States is highlighted more heavily from March through May, displaying the critical impact caused by the virus.
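Month-by-month observations like those above rest on word frequencies, which is what a word cloud visualizes. The sketch below counts words per month with the standard library; the `(month, text)` input format and the sample sentences are made up for illustration and are not drawn from the dataset.

```python
from collections import Counter, defaultdict


def word_counts_by_month(reports):
    """Count word frequencies per month from (month, text) pairs."""
    counts = defaultdict(Counter)
    for month, text in reports:
        counts[month].update(text.lower().split())
    return counts


# Made-up example reports, not taken from the dataset.
sample = [
    ("February", "China outbreak China"),
    ("April", "Minnesota economy"),
    ("April", "economy markets"),
]

counts = word_counts_by_month(sample)
for month, counter in counts.items():
    print(month, counter.most_common(2))
```

Frequency tables of this kind can be compared across months or across newspapers before any rendering step.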
We used a script to extract all numbers related to certain keywords such as "Deaths", "Infected", "Died", "Infections", "Quarantined", "Lock-down", and "Diagnosed" from the news reports and compiled a series of case numbers for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as the numbers rose gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows a positive response against the attack. We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments ranged from -0.5 to -0.9. The Vader Sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: "China", "Government", "Masks", "Economy", "Crisis", "Theft", "Stock market", "Jobs", "Election", "Missteps", "Health", "Response". Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
Alternative Data Market Size 2025-2029
The alternative data market size is forecast to increase by USD 60.32 billion at a CAGR of 52.5% between 2024 and 2029.
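For readers who want to relate the two figures, the sketch below derives the base-year and end-year market sizes that the stated increase and CAGR would imply. It assumes 2024 is the base year and that growth compounds over five annual periods to 2029; the report excerpt does not state the base value, so the results are implied figures, not reported ones.

```python
def implied_base(increase, cagr, years):
    """Base size B such that B * ((1 + cagr) ** years - 1) equals the increase."""
    return increase / ((1 + cagr) ** years - 1)


increase_usd_bn = 60.32  # forecast increase, from the text
cagr = 0.525             # 52.5% per year, from the text
years = 5                # assumption: 2024 base year, five compounding periods

base = implied_base(increase_usd_bn, cagr, years)
final = base + increase_usd_bn

print(f"implied 2024 size: USD {base:.2f} billion")
print(f"implied 2029 size: USD {final:.2f} billion")
```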
The market is experiencing significant growth due to the increased availability and diversity of data sources. This trend is driven by the rise of alternative data-driven investment strategies, which offer unique insights and opportunities for businesses and investors. However, challenges persist in the form of issues related to data quality and standardization. Big data analytics and machine learning help businesses gain insights from vast amounts of data, enabling data-driven innovation and competitive advantage. Data governance, data security, and data ethics are crucial aspects of managing alternative data.
As more data becomes available, ensuring its accuracy and consistency is crucial for effective decision-making. The market analysis report provides an in-depth examination of these factors and their impact on the growth of the market. With the increasing importance of data-driven strategies, staying informed about the latest trends and challenges is essential for businesses looking to remain competitive in today's data-driven economy.
What will be the Size of the Alternative Data Market During the Forecast Period?
Alternative data, the non-traditional information sourced from various industries and domains, is revolutionizing business landscapes by offering new opportunities for data monetization. This trend is driven by the increasing availability of data from various sources such as credit card transactions, IoT devices, satellite data, social media, and more. Data privacy is a critical consideration in the market. With the increasing focus on data protection regulations, businesses must ensure they comply with stringent data privacy standards. Data storytelling and data-driven financial analysis are essential applications of alternative data, providing valuable insights for businesses to make informed decisions. Data-driven product development and sales prediction are other significant areas where alternative data plays a pivotal role.
Moreover, data management platforms and analytics tools facilitate data integration, data quality, and data visualization, ensuring data accuracy and consistency. Predictive analytics and data-driven risk management help businesses anticipate trends and mitigate risks. Data enrichment and data-as-a-service are emerging business models that enable businesses to access and utilize alternative data. Economic indicators and data-driven operations are other areas where alternative data is transforming business processes.
How is the Alternative Data Market Segmented?
The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Type
  Credit and debit card transactions
  Social media
  Mobile application usage
  Web scraped data
  Others
End-user
  BFSI
  IT and telecommunication
  Retail
  Others
Geography
  North America
    Canada
    Mexico
    US
  Europe
    Germany
    UK
    France
    Italy
  APAC
    China
    India
    Japan
  South America
  Middle East and Africa
By Type Insights
The credit and debit card transactions segment is estimated to witness significant growth during the forecast period.
Alternative data derived from credit and debit card transactions offers valuable insights into consumer spending behaviors and lifestyle choices. This data is essential for market analysts, financial institutions, and businesses seeking to enhance their strategies and customer experiences. The two primary categories of card transactions are credit and debit. Credit card transactions provide information on discretionary spending, luxury purchases, and credit management skills. In contrast, debit card transactions reveal essential spending habits, budgeting strategies, and daily expenses. By analyzing this data using advanced methods, businesses can gain a competitive advantage, understand market trends, and cater to consumer needs effectively. IT and telecommunications companies, hedge funds, and other organizations rely on web scraped data, social and sentiment analysis, and public data to supplement their internal data sources. Adhering to GDPR regulations ensures ethical data usage and compliance.
The credit and debit card transactions segment was valued at USD 228.40 million in 2019 and showed a gradual increase during the forecast period.
Regional Analysis
North America is estimated to contribute 56% to the growth of the global market during the forecast period.
Judgement on the presence of American troops in West Germany. Topics: most important problems of the FRG; attitude to participation of the FRG in the costs of stationing NATO military forces and to American troops remaining in the FRG; attitude to a reduction in American military forces; general judgement on the American soldiers; perceived changes in the relationship of American soldiers to the German civilian population; criticism of the way of life of American soldiers; frequency of contact with American soldiers after the war; attitude to construction of housing settlements for the families living in Germany; perception of the Americans as occupying forces or protective forces; attitude to children of members of the occupying forces and their mothers; judgement on the confiscation of buildings by Americans; residency; participation in the world war and deployment in battle against the Americans. Demography: membership in clubs, trade unions or a party and offices held there; party preference; age (classified); sex; marital status; religious denomination; school education; occupation; employment; household income; head of household; state; refugee status. Interviewer rating: social class and willingness of respondent to cooperate; number of contact attempts; city size. Also encoded was: identification of interviewer; sex of interviewer and age of interviewer.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States SBOI: sa: Most Pressing Problem: Competition from Large Businesses data was reported at 5.000 % in Jan 2025. This records an increase from the previous number of 4.000 % for Dec 2024. United States SBOI: sa: Most Pressing Problem: Competition from Large Businesses data is updated monthly, averaging 8.000 % from Jan 2014 (Median) to Jan 2025, with 129 observations. The data reached an all-time high of 11.000 % in Dec 2019 and a record low of 0.000 % in May 2022. United States SBOI: sa: Most Pressing Problem: Competition from Large Businesses data remains in active status in CEIC and is reported by National Federation of Independent Business. The data is categorized under Global Database's United States – Table US.S032: NFIB Index of Small Business Optimism. [COVID-19-IMPACT]
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Use Cambridge's open data to help our city come up with innovative solutions to its biggest challenges. This dataset lists city issues that you can help us solve by analyzing or hacking on our open data. It's certainly not an exhaustive list, but we hope it will at least point you in the right direction. Feel free to reach out at OpenData@cambridgema.gov with questions or ideas. Thanks for your help. We're glad you're on our team!
This dataset of historical poor law cases was created as part of a project aiming to assess the implications of the introduction of Artificial Intelligence (AI) into legal systems in Japan and the United Kingdom. The project was jointly funded by the UK's Economic and Social Research Council, part of UKRI, and the Japan Science and Technology Agency (JST), and involved collaboration between Cambridge University (the Centre for Business Research, Department of Computer Science and Faculty of Law) and Hitotsubashi University, Tokyo (the Graduate Schools of Law and Business Administration). As part of the project, a dataset of historic poor law cases was created to facilitate the analysis of legal texts using natural language processing methods. The dataset contains judgments of cases which have been annotated to facilitate computational analysis. Specifically, the annotations make it possible to see how legal terms have evolved over time in the area of disputes over the law governing settlement by hiring.
A World Economic Forum meeting at Davos 2019 heralded the dawn of 'Society 5.0' in Japan. Its goal: creating a 'human-centred society that balances economic advancement with the resolution of social problems by a system that highly integrates cyberspace and physical space.' Using Artificial Intelligence (AI), robotics and data, 'Society 5.0' proposes to '...enable the provision of only those products and services that are needed to the people that need them at the time they are needed, thereby optimizing the entire social and organizational system.' The Japanese government accepts that realising this vision 'will not be without its difficulties,' but intends 'to face them head-on with the aim of being the first in the world as a country facing challenging issues to present a model future society.' The UK government is similarly committed to investing in AI and likewise views AI as central to engineering a more profitable economy and prosperous society.
This vision is, however, starting to crystallise in the rhetoric of LegalTech developers who have the data-intensive (and thus target-rich) environment of law in their sights. Buoyed by investment and claims of superior decision-making capabilities over human lawyers and judges, LegalTech is now being deputised to usher in a new era of 'smart' law built on AI and Big Data. While there are a number of bold claims made about the capabilities of these technologies, comparatively little attention has been directed to more fundamental questions about how we might assess the feasibility of using them to replicate core aspects of legal process, and how to ensure the public has a meaningful say in their development and implementation.
This innovative and timely research project intends to approach these questions from a number of vectors. At a theoretical level, we consider the likely consequences of this step using a Horizon Scanning methodology developed in collaboration with our Japanese partners and an innovative systemic-evolutionary model of law. Many aspects of legal reasoning have algorithmic features which could lend themselves to automation. However, an evolutionary perspective also points to features of legal reasoning which are inconsistent with ML, including the reflexivity of legal knowledge and the incompleteness of legal rules at the point where they encounter the 'chaotic' and unstructured data generated by other social sub-systems. We will test our theory by developing a hierarchical model (or ontology), derived from our legal expertise and publicly available datasets, for classifying employment relationships under UK law. This will let us probe the extent to which legal reasoning can be modelled using less computationally intensive methods such as Markov Models and Monte Carlo Trees.
Building upon these theoretical innovations, we will then turn our attention from modelling a legal domain using historical data to exploring whether the outcome of legal cases can be reliably predicted using various techniques for optimising datasets. For this we will use a data set comprised of 24,179 cases from the High Court of England and Wales. This will allow us to harness Natural Language Processing (NLP) techniques such as named entity recognition (to identify relevant parties) and sentiment analysis (to analyse opinions and determine the disposition of a party) in addition to identifying the main legal and factual points of the dispute, remedies, costs, and trial durations. By trialling various predictive heuristics and ML techniques against this dataset we hope to develop a more granular understanding of the feasibility of predicting dispute outcomes and insight into what factors are relevant for legal decision-making. This will allow us to then undertake a comparative analysis with the results of existing studies and shed light on the legal contexts and questions where AI can and cannot be used to produce accurate and repeatable results.
Attitudes to current national and international questions. Topics: most important national problem; most important international problem; countries in conflict with the FRG; major problems and differences between FRG and USA; major problems between FRG and other countries; opinion on France, Great Britain, USA, USSR, Red China; reasons for negative and positive attitude to countries USA, USSR and China; trust in USA and USSR in treatment of world problems; reasons for little trust in USA and USSR; effort of USA and USSR for world peace; relationship of USA to USSR; strongest current nuclear power; strongest nuclear power in 5 years; desired strongest nuclear power; reasons for desire for balanced nuclear potential between USA and USSR; knowledge about the SALT negotiations; countries participating in the SALT negotiations; purpose and chances for success of the SALT negotiations; beneficiary of a treaty between USA and USSR; relying on USA in negotiations; security conference; threat to national security of Germany; support for FRG in the case of conflict; knowledge of international organizations; purpose of NATO; membership in NATO; reasons for desired membership; trust in defense ability of NATO; stationing troops in Western Europe; reduction of US troop strength in Europe; necessity of USA for security of Western Europe; defense budget of FRG; navy forces in the Mediterranean; strongest naval power in the Mediterranean; relationship of Israel and Arab nations; support of FRG for Israel; significance of result of the Middle East Conflict for FRG; peace process in the Middle East; European unification process; powers of a European Government; attitude of the USA to European integration; solving the problem of environmental pollution by international organizations; economic aid for other countries. Demography: age; marital status; education; occupation; income; religious denomination; church attendance; sex; city size; state. 
Also encoded was: length of interview; number of contact attempts; presence of others during interview; willingness to cooperate; difficulty; end time; date of interview; interviewer number.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
Big-Math is the largest open-source dataset of high-quality mathematical problems, curated specifically for reinforcement learning (RL) training in language models. With over 250,000 rigorously filtered and verified problems, Big-Math bridges the gap between quality and quantity, establishing a robust foundation for advancing reasoning in LLMs.
Request Early Access to Private… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified.
This dataset contains model-based census tract estimates. PLACES covers the entire United States (50 states and the District of Columbia) at county, place, census tract, and ZIP Code Tabulation Area levels. It provides information uniformly on this large scale for local areas at four geographic levels. Estimates were provided by the Centers for Disease Control and Prevention (CDC), Division of Population Health, Epidemiology and Surveillance Branch. PLACES was funded by the Robert Wood Johnson Foundation in conjunction with the CDC Foundation. The dataset includes estimates for 36 measures: 13 for health outcomes, 9 for preventive services use, 4 for chronic disease-related health risk behaviors, 7 for disabilities, and 3 for health status. These estimates can be used to identify emerging health problems and to help develop and carry out effective, targeted public health prevention activities. Because the small area model cannot detect effects due to local interventions, users are cautioned against using these estimates for program or policy evaluations. Data sources used to generate these model-based estimates are Behavioral Risk Factor Surveillance System (BRFSS) 2021 or 2020 data, Census Bureau 2010 population data, and American Community Survey 2015–2019 estimates. The 2023 release uses 2021 BRFSS data for 29 measures and 2020 BRFSS data for seven measures (all teeth lost, dental visits, mammograms, cervical cancer screening, colorectal cancer screening, core preventive services among older adults, and sleeping less than 7 hours) that the survey collects data on every other year. More information about the methodology can be found at www.cdc.gov/places.