Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sentiment analysis of tech media articles using VADER package and co-occurrence analysis
Sources: over 140k articles (January 2016 to March 2019):
Gigaom 0.5%
Euractiv 0.9%
The Conversation 1.3%
Politico Europe 1.3%
IEEE Spectrum 1.8%
Techforge 4.3%
Fastcompany 4.5%
The Guardian (Tech) 9.2%
Arstechnica 10.0%
Reuters 11.0%
Gizmodo 17.5%
ZDNet 18.3%
The Register 19.5%
Methodology
The sentiment analysis was prepared using VADER*, an open-source lexicon- and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis but can also be applied to other text sources. The sentiment lexicon was compiled from various sources (other sentiment datasets, Twitter, etc.) and validated by human input. The advantage of VADER is that its rule-based engine includes word-order-sensitive relations and degree modifiers.
As VADER is more robust on shorter social media texts, the analysed articles were divided into paragraphs. The analysis was carried out for the social issues presented in the co-occurrence exercise.
The process included the following main steps:
The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)
The articles containing the given social issue and co-occurring term are identified
The identified articles are divided into paragraphs
Social issue and co-occurring words are removed from the paragraph
The VADER sentiment analysis is carried out for every identified and modified paragraph
The average for the given word pair is calculated for the final result
The procedure was thus repeated for 100 terms for every identified social issue.
The sentiment analysis yields a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph and normalised to between -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. The terms are removed to exclude the sentiment of the co-occurring word itself, which may be misleading, e.g. when a technology or company attempts to solve a negative issue: the surrounding text would score positive, but the negative term would drag the paragraph's score down.
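The per-pair procedure above can be sketched as follows. This is a minimal illustration only: a toy valence lexicon stands in for VADER's, and the normalisation uses the x / sqrt(x^2 + alpha) form that VADER's compound score is based on. Paragraphs containing both terms are selected, both terms are removed, each paragraph is scored, and the scores are averaged.

```python
import re

# Toy stand-in valence lexicon; the real pipeline uses VADER's lexicon.
TOY_LEXICON = {"great": 3.1, "solve": 1.4, "problem": -1.7, "failure": -2.5}

def compound_score(words, alpha=15):
    """Sum valences and normalise to [-1, 1] (VADER-style normalisation)."""
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return total / (total * total + alpha) ** 0.5

def pair_sentiment(articles, issue, cooc):
    """Average compound score over paragraphs mentioning both terms,
    with the issue and co-occurring term removed before scoring."""
    scores = []
    for article in articles:
        for para in article.split("\n\n"):
            words = re.findall(r"[a-z']+", para.lower())
            if issue in words and cooc in words:
                kept = [w for w in words if w not in (issue, cooc)]
                scores.append(compound_score(kept))
    return sum(scores) / len(scores) if scores else 0.0
```

The removal step is visible in `kept`: the two query terms are dropped so only the surrounding context contributes to the score.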
The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the lists of the 30 most positive and 30 most negative words by sentiment. The presented graphs show the evolution of sentiment for social issues. The analysed paragraphs are selected in the following way:
The articles containing the given social issue are identified
The paragraphs containing the social issue are selected for sentiment analysis
*Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Files
sentiments_mod11.csv: sentiment scores based on chosen unigrams
sentiments_mod22.csv: sentiment scores based on chosen bigrams
sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv: combinations of co-occurrences (unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains polarity word lists that have been specifically induced for dictionary-based sentiment analysis of the communication of financial analysts. For details on the preparation of the polarity word lists, please refer to the publication listed below.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Italian Sentiment Lexicon was semi-automatically developed from ItalWordNet v.2, starting from a list of 1,000 manually checked seed key-words. It contains 24,293 lexical entries annotated for positive/negative/neutral polarity. It is distributed in XML-LMF format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sentiment analysis of tech media articles using VADER package and co-occurrence analysis
Sources with weights:
Euractiv 5%
The Conversation 5%
Politico Europe 5%
IEEE Spectrum 5%
Techforge 5%
Fastcompany 5%
The Guardian (Tech) 12%
Arstechnica 5%
Reuters 5%
Gizmodo 9%
ZDNet 9%
The Register 12%
The Verge 9%
TechCrunch 9%
Methodology
The sentiment analysis was prepared using VADER*, an open-source lexicon- and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis but can also be applied to other text sources. The sentiment lexicon was compiled from various sources (other sentiment datasets, Twitter, etc.) and validated by human input. The advantage of VADER is that its rule-based engine includes word-order-sensitive relations and degree modifiers.
As VADER is more robust on shorter social media texts, the analysed articles were divided into paragraphs. The analysis was carried out for the social issues presented in the co-occurrence exercise.
The process included the following main steps:
The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)
The articles containing the given social issue and co-occurring term are identified
The identified articles are divided into paragraphs
Social issue and co-occurring words are removed from the paragraph
The VADER sentiment analysis is carried out for every identified and modified paragraph
The average for the given word pair is calculated for the final result
The procedure was thus repeated for 100 terms for every identified social issue.
The sentiment analysis yields a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph and normalised to between -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. The terms are removed to exclude the sentiment of the co-occurring word itself, which may be misleading, e.g. when a technology or company attempts to solve a negative issue: the surrounding text would score positive, but the negative term would drag the paragraph's score down.
The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the lists of the 30 most positive and 30 most negative words by sentiment. The presented graphs show the evolution of sentiment for social issues. The analysed paragraphs are selected in the following way:
The articles containing the given social issue are identified
The paragraphs containing the social issue are selected for sentiment analysis
*Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
https://creativecommons.org/publicdomain/zero/1.0/
This is a list in dictionary/json format for smart 'spinning' content.
Keys:
"word": the word to look up.
"pos": the part of speech of the word being looked up; only present for words that can have different parts of speech.
"sentiment": the sentiment of the lookup word:
"neg": the negativity score.
"neu": the neutrality score.
"pos": the positivity score.
"syn list": the list of synonyms of the word being looked up.
"synonym": a synonym in syn list; sentiment keys same as for the lookup word.
"compound": takes all sentiment values and shows the score variance.
"talk syn": list of synonym usages; the word marked with underscores, e.g. '_cedes_', is the synonym in question.
"ant list": the list of antonyms of the word being looked up.
"antonym": an antonym in ant list; sentiment keys same as for the lookup word.
"prep list": list of usages with prepositions.
This dictionary is made from: Project Gutenberg's English Synonyms and Antonyms, by James Champlin Fernald https://www.gutenberg.org/cache/epub/28900/pg28900.txt
For Sentiment nltk.sentiment.vader was used: https://www.nltk.org/_modules/nltk/sentiment/vader.html
For word similarity I used spaCy's large trained model for vector similarity: https://spacy.io/usage/spacy-101#vectors-similarity https://spacy.io/models/en#en_core_web_lg
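As a sketch of the key layout described above, a single entry might look like the following in Python. All field values and the exact nesting here are invented for demonstration; they are not taken from the dataset.

```python
# Hypothetical entry illustrating the documented keys; values are invented.
entry = {
    "word": "cede",
    "pos": "verb",
    "sentiment": {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.25},
    "syn list": [
        {"synonym": "surrender",
         "sentiment": {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.3},
         "talk syn": ["he _surrenders_ the point"]},
        {"synonym": "yield",
         "sentiment": {"neg": 0.2, "neu": 0.8, "pos": 0.0, "compound": -0.1},
         "talk syn": ["the army _yields_ ground"]},
    ],
    "ant list": [
        {"antonym": "retain",
         "sentiment": {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.2}},
    ],
    "prep list": ["cede to"],
}

def most_neutral_synonym(entry):
    """For 'spinning', pick the synonym whose compound score is closest
    to zero, i.e. the least sentiment-shifting substitute."""
    return min(entry["syn list"],
               key=lambda s: abs(s["sentiment"]["compound"]))["synonym"]
```

A spinner would consult the per-synonym sentiment scores like this to swap in a synonym without changing the tone of the text.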
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two computational studies provide different sentiment analyses for text segments (e.g., “fearful” passages) and figures (e.g., “Voldemort”) from the Harry Potter books (Rowling, 1997, 1998, 1999, 2000, 2003, 2005, 2007) based on a novel simple tool called SentiArt. The tool uses vector space models together with theory-guided, empirically validated label lists to compute the valence of each word in a text by locating its position in a 2d emotion potential space spanned by the words of the vector space model. After testing the tool's accuracy with empirical data from a neurocognitive poetics study, it was applied to compute emotional figure and personality profiles (inspired by the so-called “big five” personality theory) for main characters from the book series. The results of comparative analyses using different machine-learning classifiers (e.g., AdaBoost, Neural Net) show that SentiArt performs very well in predicting the emotion potential of text passages. It also produces plausible predictions regarding the emotional and personality profile of fiction characters which are correctly identified on the basis of eight character features, and it achieves a good cross-validation accuracy in classifying 100 figures into “good” vs. “bad” ones. The results are discussed with regard to potential applications of SentiArt in digital literary, applied reading and neurocognitive poetics studies such as the quantification of the hybrid hero potential of figures.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An extended version of this dataset that also covers nominal and adjectival polarity shifters can be found at doi:10.5281/zenodo.3365601.
We provide a bootstrapped lexicon of English verbal polarity shifters. Our lexicon covers 3043 verbs of WordNet v3.1 (Miller et al., 1990) that are single word or particle verbs. Polarity shifter labels are given for each word lemma.
Data
The data consists of:
Two lists of WordNet verbs (Miller et al., 1990), annotated for whether they cause shifting.
The initial gold standard (§2) of 2000 randomly chosen verbs.
The bootstrapped 1043 verbs (§5.3) that were labelled as shifters by our best classifier and then manually annotated.
Data set of verb phrases from the Amazon Product Review Data corpus (Jindal & Liu, 2008), annotated for polarity of phrase and polar noun.
Files
The initial gold standard: verbal_shifters.gold_standard.txt
The bootstrapped verbs: verbal_shifters.bootstrapping.txt
Format
Each line contains a verb and its label, separated by whitespace.
Multiword expressions are separated by an underscore (WORD_WORD).
All labels were assigned by an expert annotator.
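Given the format above, a minimal loader might look like the following sketch. The sample verbs and label strings used below are illustrative assumptions, not rows from the actual files.

```python
def load_shifter_lexicon(text):
    """Parse lines of 'VERB LABEL' (whitespace-separated) into a dict.
    Multiword expressions use underscores, e.g. 'give_up'."""
    lexicon = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        verb, label = line.split()
        lexicon[verb] = label
    return lexicon
```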
Files
All annotated verb phrases: sentiment_phrases.txt
Content
The file starts with 400 phrases containing shifter verbs, followed by 2231 phrases containing non-shifter verbs.
Format
Every item consists of:
The sentence from which the VP and the polar noun were extracted.
The VP, polar noun and the verb heading the VP.
Constituency parse for the VP.
Gold labels for VP and polar noun by a human annotator.
Predicted labels for VP and polar noun by RNTN tagger (Socher et al., 2013) and LEX_gold approach.
Items are separated by a line of asterisks (*).
Related Resources
Paper: ACL Anthology or DOI: 10.5281/zenodo.3365609
Presentation: ACL Anthology
Word Embedding: DOI: 10.5281/zenodo.3370051
Attribution
This dataset was created as part of the following publication:
Marc Schulder, Michael Wiegand, Josef Ruppenhofer and Benjamin Roth (2017). "Towards Bootstrapping a Polarity Shifter Lexicon using Linguistic Features". Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP). Taipei, Taiwan, November 27 - December 3, 2017. DOI: 10.5281/zenodo.3365609.
If you use the data in your research or work, please cite the publication.
Attribution-NonCommercial 2.0 (CC BY-NC 2.0) https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative polarity bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms incl. their inflections, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one.
See: R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), 2010.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a complete lexicon of English verbal polarity shifters and their shifting scope. Our lexicon covers all verbs of WordNet v3.1 that are single word or particle verbs. Polarity shifter and scope labels are given for each lemma-synset pair (i.e. each word sense of a lemma).
Data
The data is presented in the following forms:
A complete lexicon of all verbal shifters and their shifting scopes.
Two auxiliary lists:
A list of all lemmas with shifter labels
A list of all word senses with shifter labels
All files are in CSV (comma-separated value) format.
1. Main Lexicon
File name: shifter_lexicon.csv
The main lexicon lists all verbal shifters and their shifting scopes. Verbal shifters are modelled as lemma-sense pairs with one or more shifting scopes.
Each line of the lexicon file contains a single lemma-sense-scope triple, using the format:
LEMMA,SYNSET,SCOPE
The elements are defined as follows:
LEMMA: The lemma form of the verb.
SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).
SCOPE: The scope of the shifting:
subj: The verbal shifter affects its subject.
dobj: The verbal shifter affects its direct object.
pobj_*: The verbal shifter affects objects within a prepositional phrase. The preposition in question is included in the annotation; for example, a from-preposition scope receives the label pobj_from and a for-preposition receives pobj_for.
comp: The verbal shifter affects a clausal complement, such as infinitive clauses or gerunds.
The lexicon lists all lemma-sense pairs that are verbal shifters. Any lemma-sense pair not listed is not a verbal shifter. When a lemma-sense pair has more than one possible scope, a separate entry is made for each scope.
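A minimal reader for this triple format might look like the following sketch, collecting all scopes per lemma-sense pair. The example lemma and its scopes in the test data are invented for illustration; the synset identifier format follows the 8-digit convention described above.

```python
import csv
import io

def load_scope_lexicon(csv_text):
    """Read LEMMA,SYNSET,SCOPE triples into {(lemma, synset): set_of_scopes}.
    A lemma-sense pair with several scopes appears on several lines."""
    scopes = {}
    for lemma, synset, scope in csv.reader(io.StringIO(csv_text)):
        scopes.setdefault((lemma, synset), set()).add(scope)
    return scopes
```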
2. Auxiliary Lists
The auxiliary files represent the same shifter information as the main lexicon, but for lemmas and synsets, respectively, instead of for lemma-sense pairs. Due to their nature, these lists are more coarse-grained than the main lexicon and contain no information on shifter scope. They are provided as a convenience for fast experimentation.
2.1. List of Lemmas
File name: shifter_lemma_lexicon.csv
List of all verb lemmas and whether they are shifters in at least one of their word senses.
LEMMA,LABEL
LEMMA: The lemma form of the verb.
LABEL: shifter if the verb is a shifter in at least one of its word senses, otherwise nonshifter.
Many verbal shifter lemmas only cause shifting in some of their word senses. This list is therefore considerably more coarse-grained than the main lexicon.
2.2. List of Synsets
File name: shifter_synset_lexicon.csv
List of all synsets and whether their lemmas are shifters in this specific word sense.
SYNSET,LABEL
SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).
LABEL: shifter if the word sense causes shifting, otherwise nonshifter.
Shifting is shared among lemmas of the same word sense. This list, therefore, provides (almost) the same granularity for the shifter label as the main lexicon. However, in a few exceptions, synsets contained words with subtly different senses that did not all cause shifting. These senses are considered shifters in this list, analogous to the generalisation in the list of lemmas.
Attribution
This dataset was created as part of the following publication:
Schulder, Marc and Wiegand, Michael and Ruppenhofer, Josef and Köser, Stephanie (2018). "Introducing a Lexicon of Verbal Polarity Shifters for English". Proceedings of the 11th Conference on Language Resources and Evaluation (LREC). Miyazaki, Japan, May 7-12, 2018. DOI: 10.5281/zenodo.3365683.
If you use the data in your research or work, please cite the publication.
This work was partially supported by the German Research Foundation (DFG) under grants RU 1873/2-1 and WI4204/2-1.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a bootstrapped lexicon of German verbal polarity shifters. Our lexicon covers 2595 verbs of GermaNet. Polarity shifter labels are given for each word lemma. All labels were assigned by an expert annotator who is a native speaker of German.
Data
The data consists of two lists of GermaNet verbs annotated for whether they cause shifting:
verbal_shifters.gold_standard.txt: The initial gold standard (§3) of 2000 randomly sampled verbs.
verbal_shifters.bootstrapping.txt: The bootstrapped 595 verbs (§5.3) that were labelled as shifters by our best classifier and then manually annotated.
Format
Each line contains a verb and its label, separated by whitespace.
Attribution
This dataset was created as part of the following publication:
Marc Schulder, Michael Wiegand, Josef Ruppenhofer (2018). "Automatically Creating a Lexicon of Verbal Polarity Shifters: Mono- and Cross-lingual Methods for German". Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Santa Fe, New Mexico, USA, August 20 - August 26, 2018. DOI: 10.5281/zenodo.3365694.
If you use the data in your research or work, please cite the publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created as part of Marc Schulder's doctoral thesis Sentiment Polarity Shifters: Creating Lexical Resources through Manual Annotation and Bootstrapped Machine Learning
The collection of polarity shifter resources presented herein is also connected to a number of publications:
Schulder et al. (IJCNLP 2017): Lexicon of English Verbal Shifters (bootstrapped, lemma-level) and sentiment verb phrase dataset. doi: 10.5281/zenodo.3364812
Schulder et al. (LREC 2018): Lexicon of English Verbal Shifters (manual, sense-level). doi: 10.5281/zenodo.3365288
Schulder et al. (COLING 2018): Lexicon of German Verbal Shifters (bootstrapped, lemma-level). doi: 10.5281/zenodo.3365370
Schulder et al. (LREC 2020): Lexicon of Polarity Shifting Directions (supervised classification, lemma-level). doi: 10.5281/zenodo.3545947
Schulder et al. (JNLE 2020): General Lexicon of English Shifters (bootstrapped, lemma-level). doi: 10.5281/zenodo.3365601
Data
The repository contains the following resources:
A general lexicon of English polarity shifters, covering verbs, adjectives and nouns. Provides lemma labels for shifters and for which polarities they can affect.
A lexicon of English verbal shifters. Provides word sense labels for shifters and their shifting scopes.
A lexicon of German verbal shifters. Provides lemma labels for shifters.
A set of verb phrases annotated for shifting polarities.
1. English Shifter Lexicon (Lemma)
A lexicon of 9145 English words, annotated for whether they are polarity shifters and which polarities they affect. The lexicon is based on the vocabulary of WordNet v3.1 (Miller et al., 1990). It contains 2631 shifters and 6514 non-shifters.
File: shifters.english.all.lemma.txt
The lexicon is a comma-separated value (CSV) table.
Each line follows the format POS,LEMMA,SHIFTER_LABEL,DIRECTION_LABEL,SOURCE.
POS: The part of speech of the word (verb, noun, adj)
LEMMA: The lemma representation of the word in question. Multiword expressions are separated by an underscore (WORD_WORD).
SHIFTER_LABEL: Whether the word is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).
DIRECTION_LABEL: Whether the shifter affects only positive polarities (AFFECTS_POSITIVE), only negative polarities (AFFECTS_NEGATIVE), or can shift in both directions (AFFECTS_BOTH). Non-shifters are all labeled NONE.
SOURCE: Whether the word was part of the gold standard (GOLD_STANDARD) or was bootstrapped (BOOTSTRAPPED). Note that while bootstrapped shifter labels are verified by a human annotator, their direction label is automatically classified without verification.
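Consuming this five-column table can be sketched as below, e.g. to tally shifting directions among shifter entries. The sample rows in the test are invented for illustration; only the column layout and label strings follow the format described above.

```python
import csv
import io
from collections import Counter

def direction_counts(csv_text):
    """Tally DIRECTION_LABEL values for shifter entries in a
    POS,LEMMA,SHIFTER_LABEL,DIRECTION_LABEL,SOURCE table."""
    counts = Counter()
    for pos, lemma, shifter, direction, source in csv.reader(io.StringIO(csv_text)):
        if shifter == "SHIFTER":  # ignore non-shifter rows
            counts[direction] += 1
    return counts
```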
2. English Verbal Shifter Lexicon (Word Sense)
A lexicon of word senses of English verbs, annotated for whether they are polarity shifters and their shifting scope. The lexicon covers all verbs of WordNet v3.1 (Miller et al., 1990) that are single word or particle verbs. Polarity shifter and scope labels are given for each lemma-synset pair (i.e. each word sense of a lemma).
The data is presented in the following forms:
A complete lexicon of all verbal shifters and their shifting scopes.
Two auxiliary lists containing simplified information:
A list of all lemmas with shifter labels
A list of all word senses with shifter labels
All files are in CSV (comma-separated value) format.
2.1. Complete Lexicon
The main lexicon lists all verbal shifters and their shifting scopes. Verbal shifters are modeled as lemma-sense pairs with one or more shifting scopes.
The lexicon lists all lemma-sense pairs that are verbal shifters. Any lemma-sense pair not listed is not a verbal shifter. When a lemma-sense pair has more than one possible scope, a separate entry is made for each scope.
File name: shifters.english.verb.sense.csv
Each line contains a single lemma-sense-scope triple, using the format LEMMA,SYNSET,SCOPE.
LEMMA: The lemma representation of the verb in question. Multiword expressions are separated by an underscore (WORD_WORD).
SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).
SCOPE: The scope of the shifting:
subj: The verbal shifter affects its subject.
dobj: The verbal shifter affects its direct object.
pobj_*: The verbal shifter affects objects within a prepositional phrase. The preposition in question is included in the annotation; for example, a from-preposition scope receives the label pobj_from and a for-preposition receives pobj_for.
comp: The verbal shifter affects a clausal complement, such as infinitive clauses or gerunds.
2.2. List of Lemmas
List of all verb lemmas and whether they are shifters in at least one of their word senses. Does not provide shifter scope information.
Many verbal shifter lemmas only cause shifting in some of their word senses. This list is therefore considerably more coarse-grained than the main lexicon. It is intended as a convenience measure for quick experimentation.
File name: shifters.english.verb.sense.lemmas_only.csv
Each line follows the format LEMMA,SHIFTER_LABEL.
LEMMA: The lemma representation of the verb in question. Multiword expressions are separated by an underscore (WORD_WORD).
SHIFTER_LABEL: Whether the verb is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).
2.3. List of Synsets
List of all synsets and whether their lemmas are shifters in this specific word sense. Does not provide shifter scope information.
Shifting is shared among lemmas of the same word sense. This list, therefore, provides (almost) the same granularity for the shifter label as the main lexicon. However, in a few exceptions, synsets contained words with subtly different senses that did not all cause shifting. These senses are considered shifters in this list, analogous to the generalization in the list of lemmas.
File name: shifters.english.verb.sense.synsets_only.csv
Each line follows the format SYNSET,SHIFTER_LABEL.
SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).
SHIFTER_LABEL: Whether the verb is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).
3. German Verbal Shifter Lexicon (Lemma)
A lexicon of 2595 German verbs, annotated for whether they are polarity shifters and which polarities they affect. The lexicon is based on the vocabulary of GermaNet (Hamp and Feldweg, 1997). It contains 677 shifters and 1918 non-shifters.
File: shifters.german.verb.lemma.txt
The lexicon is a comma-separated value (CSV) table.
Each line follows the format LEMMA,SHIFTER_LABEL,SOURCE.
LEMMA: The lemma representation of the verb in question. Multiword expressions are separated by an underscore (WORD_WORD).
SHIFTER_LABEL: Whether the verb is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).
SOURCE: Whether the word was part of the gold standard (GOLD_STANDARD) or was bootstrapped (BOOTSTRAPPED). In either case the verbs were verified by a human annotator.
4. Sentiment Verb Phrases
A set of verb phrases, annotated for the polarity of the verb phrase and the polarity of a polar noun that it contains. Can be used to evaluate whether a polarity classifier correctly recognizes polarity shifting. The file starts with 400 phrases containing shifter verbs, followed by 2231 phrases containing non-shifter verbs.
File: sentiment_phrases.txt
Every item consists of:
The sentence from which the VP and the polar noun were extracted.
The VP, polar noun and the verb heading the VP.
Constituency parse for the VP.
Gold labels for VP and polar noun by a human annotator.
Predicted labels for VP and polar noun by RNTN tagger (Socher et al., 2013) and LEX_gold approach.
Items are separated by a line of asterisks (*).
This work was partially supported by the German Research Foundation (DFG) under grants RU 1873/2-1 and WI4204/2-1.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A Kannada (kn) lexicon dataset of more than 8k words, consisting of positive, neutral, and negative entries with polarities of +1, 0, and -1, built for sentiment analysis and verified by Kannada annotators.
According to Liu (2012), sentiment analysis, a sub-domain of natural language processing, is one of its most active research areas and is also widely studied in data mining, web mining, and text mining.
This dataset was created for Kannada lexicon sentiment analysis. It is useful for classifying data as positive, neutral, or negative; for example, Kannada movie reviews or text documents can be classified as positive or negative, and the opinion of the author on a particular subject can be determined.
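The lexicon-based classification described here can be sketched as follows. The toy lexicon uses placeholder tokens rather than actual Kannada entries; the real dataset maps Kannada words to +1, 0, or -1.

```python
# Placeholder lexicon; the real dataset maps Kannada words to +1/0/-1.
LEXICON = {"good_w": 1, "bad_w": -1, "ok_w": 0}

def classify(tokens, lexicon):
    """Sum the polarities of known tokens; unknown tokens count as 0.
    A positive sum labels the text 'positive', a negative sum 'negative'."""
    total = sum(lexicon.get(t, 0) for t in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

In practice the tokens would come from a tokenised Kannada review and the lexicon from the dataset's word list.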
For the creation of the Kannada lexicons, the English Subjectivity lexicon, Opinion lexicon, and AFINN-111 lexicon were the three most reliable resources. On reviewing English AFINN-111 and comparing it with the other lexicon sources, many duplications and word ambiguities were found and removed. Additionally, new words not previously present in the dataset were added by annotators to the final set of lexicons. The resources used are described below:
| List Name | Word Count (No. of Tokens) |
|--|--|
| AFINN-111 | 2477 |
| Subjectivity lexicon | 5569 |
| Opinion lexicon | 6789 |
| VADER lexicons | 7517 |
| Manually added lexicons | 30 |
Next, Google Translate was used to translate the English AFINN-111, Subjectivity, and Opinion lexicons to Kannada. Google Translate translated most of the English words to Kannada and transliterated the words that had no native Kannada equivalent; the untranslated English words were removed.
Google Translate is a free multilingual statistical and neural machine translation service developed by Google, to translate text and websites from one language into another.
The link to Google Translate can be found here : https://translate.google.co.in/
The flow design of building the dataset is given below.
[Figure: flow design of the dataset creation process]
I would also like to thank our mentor Prof. Hemanth Kumar A for mentoring us in creating this dataset and guiding us throughout this project.
Liu, B., 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), pp.1-167.
Ding, X., Liu, B. and Yu, P.S., 2008, February. A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 international conference on web search and data mining (pp. 231-240).
Conrad, C., 1974. Context effects in sentence comprehension: A study of the subjective lexicon. Memory & Cognition, 2(1), pp.130-138.
Hutto, C.J. and Gilbert, E., 2014, May. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth international AAAI conference on weblogs and social media.
Kannan, A., Mohanty, G. and Mamidi, R., 2016, December. Towards building a sentiwordnet for tamil. In Proceedings of the 13th International Conference on Natural Language Processing (pp. 30-35).
This dataset is licensed under CC BY-SA 4.0.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With an increasingly large number of Chinese tourists in Japan, the hotel industry needs an affordable market research tool that does not rely on expensive and time-consuming surveys or interviews. Because this problem is real and relevant to the hotel industry in Japan, and otherwise unexplored in other studies, we extracted a list of potential keywords from Chinese reviews of Japanese hotels on the hotel portal site Ctrip using a mathematical model, and then used them in a sentiment analysis with a machine learning classifier. While most studies using information collected from the internet rely on pre-existing data analysis tools, we designed our mathematical model to achieve the highest possible classification performance, while also exploring the potential business implications of the results.
https://spdx.org/licenses/CC0-1.0.html
Supply chains play a pivotal role in driving economic growth and societal well-being, facilitating the efficient movement of goods from producers to consumers. However, the increasing frequency of disruptions caused by geopolitical events, pandemics, natural disasters, and shifts in commerce poses significant challenges to supply chain resilience. This draft update report discusses the development of a dynamic spatio-temporal monitoring and analysis tool to assess supply chain vulnerability, resilience, and sustainability. Leveraging news data, macroeconomic metrics, inbound cargo data (for sectors in California), and operational conditions of California's highways, the tool employs Natural Language Processing (NLP) and empirical regression analyses to identify emerging trends and extract valuable information about disruptions to inform decision-making. Key features of the tool include sentiment analysis of news articles, topic classification, visualization of geographic locations, and tracking of macroeconomic indicators. By integrating diverse and dynamic data sources (e.g., news articles) and using empirical and analytical techniques, the tool offers a comprehensive framework to enhance our understanding of supply chain vulnerabilities and resilience, ultimately contributing to more effective strategies for decision-making in supply chain management. The dynamic nature of this tool enables continuous monitoring and adaptation to evolving conditions, thereby enhancing the analysis of resilience and sustainability in global supply chains.
Methods
The research team implemented a two-stage procedure to streamline the collection, processing, and analysis of news data. The stages are as follows:
Lexicon Setup: This stage establishes the lexicons required for sentiment and topic analysis. Topics are categorized into eight groups relevant to supply chain risks: political, environmental, financial, supply and demand, logistics, system, infrastructure, and sector. Sentiments are evaluated using a dictionary-based approach with the AFINN lexicon. Three comprehensive lists of countries, states, and cities are used to classify geographical locations at three hierarchical levels: countries, states within the U.S., and cities/municipalities within California.
News collection and processing: Automated algorithms collect the most recent news daily, with a 24-hour lag, based on the predefined query: (USA or United States) and (supply chain or supply-chain) and (disruption or resilience) and (retailer or warehouse or transportation or factory). Text mining tasks are performed to extract key performance metrics, including n-grams, topics, sentiments, and geographical locations. The process involves several steps:
Corpus setup
Term Frequency-Inverse Document Frequency (TF-IDF) for measuring word relevance in documents
Entity recognition and consolidation
Conversion of the corpus into a Document-Feature Matrix (DFM)
Dictionary-based extraction of sentiments, topics, and geographical locations
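As a rough illustration of the dictionary-based extraction step, the sketch below scores a toy article against a hypothetical AFINN-style lexicon and small topic and location dictionaries. The lexicon entries, topic groups, and place names are illustrative stand-ins, not the tool's actual dictionaries.

```python
# Sketch of dictionary-based extraction: sentiment, topics, locations.
from collections import Counter

afinn = {"disruption": -2, "shortage": -2, "resilient": 2, "recovery": 2}  # toy subset
topics = {"logistics": {"warehouse", "freight", "port"},
          "political": {"tariff", "sanction", "election"}}
locations = {"california", "china", "texas"}

def analyse(article: str) -> dict:
    tokens = [t.strip(".,!?").lower() for t in article.split()]
    counts = Counter(tokens)
    # Sentiment: sum of lexicon valences weighted by occurrence count
    sentiment = sum(afinn.get(t, 0) * n for t, n in counts.items())
    # Topics: any group whose word list intersects the article's vocabulary
    found_topics = [name for name, words in topics.items() if words & counts.keys()]
    found_locations = sorted(locations & counts.keys())
    return {"sentiment": sentiment, "topics": found_topics, "locations": found_locations}

result = analyse("Port congestion in California caused a freight disruption and parts shortage.")
```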
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data produced by running the Naive Bayes classifier algorithm. It lists every word in the classifier's vocabulary, the number of occurrences of each word, and the likelihood ratio of the word. Please note that the likelihood ratio is calculated as the likelihood of the word given a positive label divided by the likelihood of the word given a negative label. This data is licensed under the CC BY 4.0 international license and may be used freely with credit given. The data was produced from two different datasets using a Naive Bayes classifier: the Polarity Review v2.0 dataset from Cornell and the Large Movie Review Dataset from Stanford.
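A minimal sketch of how such a likelihood ratio can be computed from word counts is shown below. Laplace smoothing is an assumption added here to avoid division by zero for words seen in only one class; the dataset description does not state how zero counts were actually handled.

```python
from collections import Counter

def likelihood_ratios(pos_docs, neg_docs, alpha=1.0):
    """Per-word ratio P(word | positive) / P(word | negative).
    alpha is Laplace smoothing (an assumption, see lead-in)."""
    pos = Counter(w for d in pos_docs for w in d.split())
    neg = Counter(w for d in neg_docs for w in d.split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {w: ((pos[w] + alpha) / pos_total) / ((neg[w] + alpha) / neg_total)
            for w in vocab}

# Toy corpora: one positive review, one negative review
ratios = likelihood_ratios(["great great film"], ["awful film"])
```

Words with a ratio above 1 lean positive, below 1 lean negative.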
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘SemEval 2014 Task 4: AspectBasedSentimentAnalysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/charitarth/semeval-2014-task-4-aspectbasedsentimentanalysis on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Copied from https://alt.qcri.org/semeval2014/task4/#, all credits to respective authors.
Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of current approaches, however, attempt to detect the overall polarity of a sentence, paragraph, or text span, regardless of the entities mentioned (e.g., laptops, restaurants) and their aspects (e.g., battery, screen; food, service). By contrast, this task is concerned with aspect-based sentiment analysis (ABSA), where the goal is to identify the aspects of given target entities and the sentiment expressed towards each aspect. Datasets consisting of customer reviews with human-authored annotations identifying the mentioned aspects of the target entities and the sentiment polarity of each aspect will be provided.
In particular, the task consists of the following subtasks:
Subtask 1: Aspect term extraction. Given a set of sentences with pre-identified entities (e.g., restaurants), identify the aspect terms present in the sentence and return a list containing all the distinct aspect terms. An aspect term names a particular aspect of the target entity.
For example, "I liked the service and the staff, but not the food”, “The food was nothing much, but I loved the staff”. Multi-word aspect terms (e.g., “hard disk”) should be treated as single terms (e.g., in “The hard disk is very noisy” the only aspect term is “hard disk”).
Subtask 2: Aspect term polarity. For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative).
For example:
“I loved their fajitas” → {fajitas: positive}
“I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive}
“The fajitas are their first plate” → {fajitas: neutral}
“The fajitas were great to taste, but not to see” → {fajitas: conflict}
Subtask 3: Aspect category detection. Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Subtask 1, and they do not necessarily occur as terms in the given sentence.
For example, given the set of aspect categories {food, service, price, ambience, anecdotes/miscellaneous}:
“The restaurant was too expensive” → {price}
“The restaurant was expensive, but the menu was great” → {price, food}
Subtask 4: Aspect category polarity. Given a set of pre-identified aspect categories (e.g., {food, price}), determine the polarity (positive, negative, neutral or conflict) of each aspect category.
For example:
“The restaurant was too expensive” → {price: negative}
“The restaurant was expensive, but the menu was great” → {price: negative, food: positive}
Two domain-specific datasets for laptops and restaurants, consisting of over 6K sentences with fine-grained aspect-level human annotations, have been provided for training.
Restaurant reviews:
This dataset consists of over 3K English sentences from the restaurant reviews of Ganu et al. (2009). The original dataset of Ganu et al. included annotations for coarse aspect categories (Subtask 3) and overall sentence polarities; we modified the dataset to include annotations for aspect terms occurring in the sentences (Subtask 1), aspect term polarities (Subtask 2), and aspect category-specific polarities (Subtask 4). We also corrected some errors (e.g., sentence splitting errors) of the original dataset. Experienced human annotators identified the aspect terms of the sentences and their polarities (Subtasks 1 and 2). Additional restaurant reviews, not in the original dataset of Ganu et al. (2009), are being annotated in the same manner, and they will be used as test data.
Laptop reviews:
This dataset consists of over 3K English sentences extracted from customer reviews of laptops. Experienced human annotators tagged the aspect terms of the sentences (Subtask 1) and their polarities (Subtask 2). This dataset will be used only for Subtasks 1 and 2. Part of this dataset will be reserved as test data.
The sentences in the datasets are annotated using XML tags.
The following example illustrates the format of the annotated sentences of the restaurants dataset (the sentence is taken from the examples above):
```xml
<sentence id="1">
  <text>I hated their fajitas, but their salads were great</text>
  <aspectTerms>
    <aspectTerm term="fajitas" polarity="negative" from="14" to="21"/>
    <aspectTerm term="salads" polarity="positive" from="33" to="39"/>
  </aspectTerms>
  <aspectCategories>
    <aspectCategory category="food" polarity="conflict"/>
  </aspectCategories>
</sentence>
```
The possible values of the polarity field are: “positive”, “negative”, “conflict”, “neutral”. The possible values of the category field are: “food”, “service”, “price”, “ambience”, “anecdotes/miscellaneous”.
The following example illustrates the format of the annotated sentences of the laptops dataset. The format is the same as in the restaurants dataset, with the only exception that there are no annotations for aspect categories. Notice that we annotate only aspect terms naming particular aspects (e.g., “everything about it” does not name a particular aspect).
```xml
<sentence id="2">
  <text>The hard disk is very noisy</text>
  <aspectTerms>
    <aspectTerm term="hard disk" polarity="negative" from="4" to="13"/>
  </aspectTerms>
</sentence>
```
In the sentences of both datasets, there is an
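Annotations in this format can be read with a standard XML parser. The following sketch uses Python's xml.etree.ElementTree on a small inline example constructed from the sentences above; the sentence text and character offsets are illustrative, not taken from the released files.

```python
import xml.etree.ElementTree as ET

# Element and attribute names follow the dataset format:
# sentence > text, aspectTerms/aspectTerm, aspectCategories/aspectCategory
sample = """
<sentences>
  <sentence id="1">
    <text>The fajitas were great, but the service was slow.</text>
    <aspectTerms>
      <aspectTerm term="fajitas" polarity="positive" from="4" to="11"/>
      <aspectTerm term="service" polarity="negative" from="32" to="39"/>
    </aspectTerms>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="service" polarity="negative"/>
    </aspectCategories>
  </sentence>
</sentences>
"""

root = ET.fromstring(sample)
for sentence in root.iter("sentence"):
    text = sentence.findtext("text")
    terms = [(t.get("term"), t.get("polarity"))
             for t in sentence.iter("aspectTerm")]
    categories = [(c.get("category"), c.get("polarity"))
                  for c in sentence.iter("aspectCategory")]
```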
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes
[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2
[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for using the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018.
The total number of documents in LSC is 1,673,824.
LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus
Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All of the following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts. Metadata are then saved as MetaData.R. The fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approach to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with a space.
We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose their actual meaning. The process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating the same words, such as “Corpus”, “corpus” and “CORPUS”, differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united as one word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted with “ztest”, “wellknown” and “chisquare”. Identification of such words was done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining “-” characters are replaced by a space.
6. Removing numbers: All digits that are not part of a word are replaced by a space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5].
All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’, etc. We used the ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.
The Organisation of the LScD
The total number of words in the file “LScD.csv” is 974,238. Each field is described below:
Word: Contains the unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents containing the word, in descending order.
Number of Documents Containing the Word: A binary calculation is used: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: Contains how many times a word occurs in the corpus when the corpus is considered as one large document.
Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:
Metadata File: Includes all fields in a document except the abstract. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6].
Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from LSC as defined in the previous section.
The code can be used as follows:
1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
2. Open the LScD_Creation.R script
3. Change the parameters in the script: supply the full path of the directory with the source files and the full path of the directory to write the output files
4. Run the full code.
References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
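To illustrate the two counts stored in LScD.csv (the number of documents containing a word, counted at most once per document, and the total number of appearances in the corpus), a minimal sketch might look as follows. It is written in Python for illustration only (the released code is in R), and the toy "abstracts" stand in for the pre-processed, stemmed texts of Step 4.

```python
from collections import Counter

# Toy stand-ins for pre-processed, stemmed abstracts
abstracts = ["corpus analysi of corpus data", "text mine of corpus"]

doc_freq = Counter()     # number of documents containing the word (binary per doc)
corpus_freq = Counter()  # number of appearances in the whole corpus
for abstract in abstracts:
    tokens = abstract.split()
    corpus_freq.update(tokens)
    doc_freq.update(set(tokens))  # each word counted once per document

# Sorted as in LScD.csv: by document frequency, descending
lscd = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
```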
Lä wordlist elicitation continues, starting with the word for 'mouth'. At approximately 40 minutes into the recording, a woman shares a story. Recorded by Nick Evans in the Community Centre at Tais, Morehead District. Language as given: Lä
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Keyword list for trees and forests that was used to assign sentiment scores to workshop data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Argumentative skills are personally and professionally essential for digesting complicated information (CoI) associated with the critical reconstruction of meaning (critical thinking, CT). This is a vital goal, especially in the age of social media and artificial intelligence-mediated information. Recently, the introduction of generative artificial intelligence (GenAI), particularly ChatGPT (OpenAI, 2022), has made it much simpler to collect and exchange knowledge. New tools are urgently needed to deal with the glut of post-digital information without becoming lost.
After exploring the landscape of argumentative skills and techniques for their development, an investigation of the practices in use in Italian universities was undertaken. The analysis used university syllabi, which are considered key educational tools because they provide a comprehensive overview of the course to be undertaken. Syllabi contain key information such as objectives, competencies, assignments and assessment strategies.
The research examined education science courses to understand the importance given to argumentative skills, using stratified random sampling to proportionally represent all public universities with education science departments. A total of 133 syllabi were collected through web scraping and web crawling techniques using R (https://cran.r-project.org/bin/windows/base/), with manual additions to overcome technical limitations.
The analysis included text mining techniques to identify documents containing keywords related to argumentative skills. These documents were then subjected to quantitative and qualitative content analysis. Biggs' (2003) "Constructive Alignment" principles were used to assess the alignment of goals, activities, and assessments in the syllabi. Categories of analysis included the detection of argumentative skills, their alignment, and connection to the course, with a focus on presence, level of treatment, and consistency of alignment.
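The keyword-detection step described above can be sketched as follows. The sketch is in Python for illustration (the record's actual scripts are in R), and the keyword list is a hypothetical stand-in, not the one used in the study.

```python
import re

# Hypothetical keywords related to argumentative skills
keywords = ["argumentation", "argumentative", "critical thinking", "debate"]
pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

# Toy stand-ins for scraped syllabus texts
syllabi = {
    "course_a.txt": "Students develop Critical Thinking through structured debate.",
    "course_b.txt": "Introduction to developmental psychology.",
}

# Flag documents containing at least one keyword for content analysis
matches = {name: pattern.findall(text) for name, text in syllabi.items()}
selected = [name for name, found in matches.items() if found]
```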
This Zenodo record documents the full analysis process with R and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts and results:
1. List of Universities with URLs - Elenco Università.xlsx
2. Web Scraping Script - WebScarping.R
3. Text Mining Script - TextMining.R
4. List of the most frequent words - Vocabulary.csv
5. Sentiment Analysis of the corpus - Sentiment Analysis.R
6. List of documents from sorting by Keywords - frasi_chiave.docx
7. Codebook qualitative Analysis with Nvivo - Codebook.xlsx
8. Results Nvivo Analysis Syllabi - Codebook-Syllabi.docx
Any comments or improvements are welcome!
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sentiment analysis of tech media articles using VADER package and co-occurrence analysis
Sources: over 140k articles (01.2016-03.2019):
Gigaom 0.5%
Euractiv 0.9%
The Conversation 1.3%
Politico Europe 1.3%
IEEE Spectrum 1.8%
Techforge 4.3%
Fastcompany 4.5%
The Guardian (Tech) 9.2%
Arstechnica 10.0%
Reuters 11.0%
Gizmodo 17.5%
ZDNet 18.3%
The Register 19.5%
Methodology
The sentiment analysis has been prepared using VADER*, an open-source lexicon- and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis but can also be applied to other text sources. The sentiment lexicon was compiled from various sources (other sentiment datasets, Twitter, etc.) and was validated by human raters. The advantage of VADER is that its rule-based engine includes word-order-sensitive relations and degree modifiers.
As VADER is more robust on shorter social media texts, the analysed articles have been divided into paragraphs. The analysis has been carried out for the social issues presented in the co-occurrence exercise.
The process included the following main steps:
The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)
The articles containing the given social issue and co-occurring term are identified
The identified articles are divided into paragraphs
Social issue and co-occurring words are removed from the paragraph
The VADER sentiment analysis is carried out for every identified and modified paragraph
The average for the given word pair is calculated for the final result
The procedure has therefore been repeated for 100 words for each identified social issue.
The sentiment analysis resulted in a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph and normalised to between -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. Removal of the terms is meant to exclude the sentiment of the co-occurring word itself, because the word may be misleading, e.g. when technologies or companies attempt to solve a negative issue: the neighbourhood's scores would be positive, but the negative term would drag the paragraph's score down.
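The scoring steps above can be sketched as follows. VADER normalises the summed valence s of a text to s/√(s² + α) with α = 15, which maps it into (-1, +1). The mini-lexicon below is a toy stand-in for VADER's full human-validated lexicon, and the sketch deliberately omits VADER's rule-based modifiers (negation, intensifiers, punctuation emphasis).

```python
import math
import re

# Hypothetical mini-lexicon; VADER's real lexicon has thousands of entries
valence = {"great": 3.1, "problem": -1.7, "harm": -2.5}

def compound(paragraph, alpha=15.0):
    words = re.findall(r"[a-z']+", paragraph.lower())
    total = sum(valence.get(w, 0.0) for w in words)
    # VADER-style normalisation of the summed valence into (-1, +1)
    return total / math.sqrt(total * total + alpha)

def pair_score(paragraphs, issue, term):
    # Steps 4-6 above: remove the social issue and the co-occurring term,
    # score each paragraph, then average the paragraph scores.
    cleaned = [p.lower().replace(issue, " ").replace(term, " ") for p in paragraphs]
    return sum(compound(p) for p in cleaned) / len(cleaned)
```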
The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the lists of the 30 words with the most positive and the 30 words with the most negative sentiment. The presented graphs show the evolution of sentiments for social issues. The analysed paragraphs are selected in the following way:
The articles containing the given social issue are identified
The paragraphs containing the social issue are selected for sentiment analysis
*Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Files
sentiments_mod11.csv – sentiment scores based on chosen unigrams
sentiments_mod22.csv – sentiment scores based on chosen bigrams
sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv – combinations of co-occurrences: unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams