38 datasets found
  1. Sentiment analysis of tech media articles using VADER package and...

    • live.european-language-grid.eu
    • data.europa.eu
    csv
    Updated Aug 16, 2023
    Cite
    (2023). Sentiment analysis of tech media articles using VADER package and co-occurrence analysis [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1351
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 16, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sentiment analysis of tech media articles using VADER package and co-occurrence analysis

    Sources: over 140k articles (01.2016-03.2019), with the following shares:

    Gigaom 0.5%

    Euractiv 0.9%

    The Conversation 1.3%

    Politico Europe 1.3%

    IEEE Spectrum 1.8%

    Techforge 4.3%

    Fastcompany 4.5%

    The Guardian (Tech) 9.2%

    Arstechnica 10.0%

    Reuters 11%

    Gizmodo 17.5%

    ZDNet 18.3%

    The Register 19.5%

    Methodology

    The sentiment analysis has been prepared using VADER*, an open-source lexicon- and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis but can also be applied to other text sources. The sentiment lexicon was compiled from various sources (other sentiment datasets, Twitter, etc.) and validated by human input. The advantage of VADER is that its rule-based engine includes word-order-sensitive relations and degree modifiers.

    As VADER is more robust on shorter, social-media-style texts, the analysed articles have been divided into paragraphs. The analysis has been carried out for the social issues presented in the co-occurrence exercise.

    The process included the following main steps:

    The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)

    The articles containing the given social issue and co-occurring term are identified

    The identified articles are divided into paragraphs

    Social issue and co-occurring words are removed from the paragraph

    The VADER sentiment analysis is carried out for every identified and modified paragraph

    The average for the given word pair is calculated for the final result

    Therefore, the procedure has been repeated for 100 words for all identified social issues.

    The sentiment analysis resulted in a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph, and normalised between the values -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. Removal of terms is meant to exclude sentiment of the co-occurring word itself, because the word may be misleading, e.g. when some technologies or companies attempt to solve a negative issue. The neighbourhood's scores would be positive, but the negative term would bring the paragraph's score down.
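    For orientation, the paragraph-level scoring described above can be approximated with a short script. This is a minimal sketch assuming the open-source vaderSentiment package and a double-newline paragraph split; the dataset's exact splitting and term-removal rules are not spelled out here.

    ```python
    from statistics import mean

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def pair_sentiment(articles, issue, cooc_term):
        """Average VADER compound score for paragraphs mentioning an
        issue / co-occurring-term pair, with both terms removed first."""
        analyzer = SentimentIntensityAnalyzer()
        scores = []
        for article in articles:
            if issue not in article or cooc_term not in article:
                continue                                  # keep only matching articles
            for paragraph in article.split("\n\n"):       # assumed paragraph delimiter
                if issue in paragraph and cooc_term in paragraph:
                    cleaned = paragraph.replace(issue, " ").replace(cooc_term, " ")
                    scores.append(analyzer.polarity_scores(cleaned)["compound"])
        return mean(scores) if scores else None


    # Toy usage (illustrative strings only):
    # pair_sentiment(["AI helps fight climate change.\n\nAI is great."],
    #                "climate change", "AI")
    ```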

    The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the lists of the 30 most positive and 30 most negative terms. The presented graphs show the evolution of sentiments for social issues. The analysed paragraphs are selected in the following way:

    The articles containing the given social issue are identified

    The paragraphs containing the social issue are selected for sentiment analysis

    *Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

    Files

    sentiments_mod11.csv: sentiment scores based on chosen unigrams

    sentiments_mod22.csv: sentiment scores based on chosen bigrams

    sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv: combinations of co-occurrences (unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams)

  2. Polarity word lists for financial analyst communication - Vdataset - LDM

    • service.tib.eu
    Updated May 16, 2025
    Cite
    (2025). Polarity word lists for financial analyst communication - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/goe-doi-10-25625-tyuglf
    Explore at:
    Dataset updated
    May 16, 2025
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains polarity word lists that have been specifically induced for dictionary-based sentiment analysis of the communication of financial analysts. For details on the preparation of the polarity word lists, please refer to the publication listed below.

  3. OpeNER Sentiment Lexicon Italian - LMF

    • dspace-clarin-it.ilc.cnr.it
    Updated Oct 18, 2016
    + more versions
    Cite
    Irene Russo; Francesca Frontini; Valeria Quochi (2016). OpeNER Sentiment Lexicon Italian - LMF [Dataset]. https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-73
    Explore at:
    Dataset updated
    Oct 18, 2016
    Authors
    Irene Russo; Francesca Frontini; Valeria Quochi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Italian Sentiment Lexicon was semi-automatically developed from ItalWordNet v.2, starting from a list of 1,000 manually checked seed keywords. It contains 24,293 lexical entries annotated for positive/negative/neutral polarity. It is distributed in XML-LMF format.

  4. Sentiment analysis of tech media articles using VADER package and...

    • live.european-language-grid.eu
    • zenodo.org
    csv
    Updated Aug 16, 2023
    Cite
    (2023). Sentiment analysis of tech media articles using VADER package and co-occurrence analysis (01.2016-04.2019) [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1352
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 16, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sentiment analysis of tech media articles using VADER package and co-occurrence analysis

    Sources with weights:

    Euractiv 5%

    The Conversation 5%

    Politico Europe 5%

    IEEE Spectrum 5%

    Techforge 5%

    Fastcompany 5%

    The Guardian (Tech) 12%

    Arstechnica 5%

    Reuters 5%

    Gizmodo 9%

    ZDNet 9%

    The Register 12%

    The Verge 9%

    TechCrunch 9%

    Methodology

    The sentiment analysis has been prepared using VADER*, an open-source lexicon- and rule-based sentiment analysis tool. VADER is specifically designed for social media analysis but can also be applied to other text sources. The sentiment lexicon was compiled from various sources (other sentiment datasets, Twitter, etc.) and validated by human input. The advantage of VADER is that its rule-based engine includes word-order-sensitive relations and degree modifiers.

    As VADER is more robust on shorter, social-media-style texts, the analysed articles have been divided into paragraphs. The analysis has been carried out for the social issues presented in the co-occurrence exercise.

    The process included the following main steps:

    The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology)

    The articles containing the given social issue and co-occurring term are identified

    The identified articles are divided into paragraphs

    Social issue and co-occurring words are removed from the paragraph

    The VADER sentiment analysis is carried out for every identified and modified paragraph

    The average for the given word pair is calculated for the final result

    Therefore, the procedure has been repeated for 100 words for all identified social issues.

    The sentiment analysis resulted in a compound score for every paragraph. The score is calculated from the sum of the valence scores of each word in the paragraph, and normalised between the values -1 (most extreme negative) and +1 (most extreme positive). Finally, the average is calculated from the paragraph results. Removal of terms is meant to exclude sentiment of the co-occurring word itself, because the word may be misleading, e.g. when some technologies or companies attempt to solve a negative issue. The neighbourhood's scores would be positive, but the negative term would bring the paragraph's score down.
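    For reference, the normalisation mentioned above is, in the vaderSentiment reference implementation, the sum of the word valences mapped through score / sqrt(score^2 + alpha) with alpha = 15, which bounds the compound score in (-1, +1). A one-line sketch; the constant is that implementation's default and may differ in other versions:

    ```python
    import math


    def normalize(score_sum: float, alpha: float = 15.0) -> float:
        """Map an unbounded sum of word valences into (-1, 1), as VADER's compound score does."""
        return score_sum / math.sqrt(score_sum * score_sum + alpha)


    print(normalize(4.2))   # roughly 0.7 for a clearly positive sum of valences
    ```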

    The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the lists of the 30 most positive and 30 most negative terms. The presented graphs show the evolution of sentiments for social issues. The analysed paragraphs are selected in the following way:

    The articles containing the given social issue are identified

    The paragraphs containing the social issue are selected for sentiment analysis

    *Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

  5. Comprehensive Synonyms & Antonyms Dictionary List

    • kaggle.com
    Updated May 19, 2021
    Cite
    christernyc (2021). Comprehensive Synonyms & Antonyms Dictionary List [Dataset]. http://doi.org/10.34740/kaggle/dsv/2250932
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    christernyc
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Built with content spinning and sentiment changing in mind.

    This is a list in dictionary/JSON format for smart 'spinning' of content.

    Keys:

    "word" - The word to look up.

    "pos" - The part of speech of the word being looked up. Only present for words that can have different parts of speech.

    "sentiment" - The sentiment of the lookup word.

    "neg" - The negativity score.

    "neu" - The neutrality score.

    "pos" - The positivity score.

    "syn list" - The list of synonyms of the word being looked up.

    "synonym" - A synonym in "syn list". Sentiment keys are the same as for the lookup word.

    "compound" - Combines all sentiment values and shows the score variance.

    "talk syn" - List of synonym usages. Words wrapped in '_', e.g. '_cedes_', mark the synonym in question.

    "ant list" - The list of antonyms of the word being looked up.

    "antonym" - An antonym in "ant list". Sentiment keys are the same as for the lookup word.

    "prep list" - List of usages with prepositions.
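    To make the key layout concrete, a hypothetical entry could look like the Python dictionary below. All values are illustrative and the exact nesting may differ; consult the file itself for the authoritative structure.

    ```python
    # Hypothetical entry; values invented for illustration only.
    entry = {
        "word": "concede",
        "pos": "verb",                      # only present when the POS is ambiguous
        "sentiment": {"neg": 0.0, "neu": 0.8, "pos": 0.2},
        "compound": 0.13,                   # combined score across sentiment values
        "syn list": [
            {"synonym": "yield",
             "sentiment": {"neg": 0.0, "neu": 1.0, "pos": 0.0}},
        ],
        "talk syn": ["The candidate _cedes_ the election."],  # '_' marks the synonym
        "ant list": [
            {"antonym": "dispute",
             "sentiment": {"neg": 0.3, "neu": 0.7, "pos": 0.0}},
        ],
        "prep list": ["concede to", "concede in"],
    }
    ```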

    Credits:

    This dictionary is made from: Project Gutenberg's English Synonyms and Antonyms, by James Champlin Fernald https://www.gutenberg.org/cache/epub/28900/pg28900.txt

    For Sentiment nltk.sentiment.vader was used: https://www.nltk.org/_modules/nltk/sentiment/vader.html

    For word similarity I used spaCy's large trained model for vector similarity: https://spacy.io/usage/spacy-101#vectors-similarity https://spacy.io/models/en#en_core_web_lg

  6. Data_Sheet_1_Sentiment Analysis for Words and Fiction Characters From the...

    • frontiersin.figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Arthur M. Jacobs (2023). Data_Sheet_1_Sentiment Analysis for Words and Fiction Characters From the Perspective of Computational (Neuro-)Poetics.docx [Dataset]. http://doi.org/10.3389/frobt.2019.00053.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Arthur M. Jacobs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two computational studies provide different sentiment analyses for text segments (e.g., “fearful” passages) and figures (e.g., “Voldemort”) from the Harry Potter books (Rowling, 1997, 1998, 1999, 2000, 2003, 2005, 2007) based on a novel simple tool called SentiArt. The tool uses vector space models together with theory-guided, empirically validated label lists to compute the valence of each word in a text by locating its position in a 2d emotion potential space spanned by the words of the vector space model. After testing the tool's accuracy with empirical data from a neurocognitive poetics study, it was applied to compute emotional figure and personality profiles (inspired by the so-called “big five” personality theory) for main characters from the book series. The results of comparative analyses using different machine-learning classifiers (e.g., AdaBoost, Neural Net) show that SentiArt performs very well in predicting the emotion potential of text passages. It also produces plausible predictions regarding the emotional and personality profile of fiction characters which are correctly identified on the basis of eight character features, and it achieves a good cross-validation accuracy in classifying 100 figures into “good” vs. “bad” ones. The results are discussed with regard to potential applications of SentiArt in digital literary, applied reading and neurocognitive poetics studies such as the quantification of the hybrid hero potential of figures.

  7. Bootstrapped Lexicon of English Verbal Polarity Shifters

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 9, 2020
    + more versions
    Cite
    Wiegand, Michael (2020). Bootstrapped Lexicon of English Verbal Polarity Shifters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3364811
    Explore at:
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Roth, Benjamin
    Schulder, Marc
    Wiegand, Michael
    Ruppenhofer, Josef
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An extended version of this dataset that also covers nominal and adjectival polarity shifters can be found at doi:10.5281/zenodo.3365601.

    We provide a bootstrapped lexicon of English verbal polarity shifters. Our lexicon covers 3043 verbs of WordNet v3.1 (Miller et al., 1990) that are single word or particle verbs. Polarity shifter labels are given for each word lemma.

    Data

    The data consists of:

    Two lists of WordNet verbs (Miller et al., 1990), annotated for whether they cause shifting.

    The initial gold standard (§2) of 2000 randomly chosen verbs.

    The bootstrapped 1043 verbs (§5.3) that were labelled as shifters by our best classifier and then manually annotated.

    Data set of verb phrases from the Amazon Product Review Data corpus (Jindal & Liu, 2008), annotated for polarity of phrase and polar noun.

    1. Verbal Shifters

    Files

    The initial gold standard: verbal_shifters.gold_standard.txt

    The bootstrapped verbs: verbal_shifters.bootstrapping.txt

    Format

    Each line contains a verb and its label, separated by whitespace.

    Multiword expressions are separated by an underscore (WORD_WORD).

    All labels were assigned by an expert annotator.
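    A minimal loader for the two list files, assuming the whitespace-separated verb/label layout described above; the label strings are whatever the files actually use.

    ```python
    def load_shifter_list(path):
        """Read 'VERB LABEL' lines into a {lemma: label} mapping."""
        labels = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                verb, label = line.split()   # verb and its label, whitespace-separated
                labels[verb] = label
        return labels


    gold = load_shifter_list("verbal_shifters.gold_standard.txt")
    bootstrapped = load_shifter_list("verbal_shifters.bootstrapping.txt")
    ```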

    2. Sentiment Verb Phrases

    Files

    All annotated verb phrases: sentiment_phrases.txt

    Content

    The file starts with 400 phrases containing shifter verbs, followed by 2231 phrases containing non-shifter verbs.

    Format

    Every item consists of:

    The sentence from which the VP and the polar noun were extracted.

    The VP, polar noun and the verb heading the VP.

    Constituency parse for the VP.

    Gold labels for VP and polar noun by a human annotator.

    Predicted labels for VP and polar noun by RNTN tagger (Socher et al., 2013) and LEX_gold approach.

    Items are separated by a line of asterisks (*)

    Related Resources

    Paper: ACL Anthology or DOI: 10.5281/zenodo.3365609

    Presentation: ACL Anthology

    Word Embedding: DOI: 10.5281/zenodo.3370051

    Attribution

    This dataset was created as part of the following publication:

    Marc Schulder, Michael Wiegand, Josef Ruppenhofer and Benjamin Roth (2017). "Towards Bootstrapping a Polarity Shifter Lexicon using Linguistic Features". Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP). Taipei, Taiwan, November 27 - December 3, 2017. DOI: 10.5281/zenodo.3365609.

    If you use the data in your research or work, please cite the publication.

  8. SentimentWortschatz

    • data.wu.ac.at
    api/sparql +3
    Updated Mar 18, 2015
    + more versions
    Cite
    AKSW (2015). SentimentWortschatz [Dataset]. https://data.wu.ac.at/schema/datahub_io/ODQyZDExYzAtMTVjZC00ZjM3LThiMTUtMmY0ZTc5NGQ5M2Nk
    Explore at:
    Available download formats: example/turtle (991.0), zip (89059.0), ttl (162746.0), api/sparql (9514.0)
    Dataset updated
    Mar 18, 2015
    Dataset provided by
    AKSW
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative polarity bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms incl. their inflections, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one.

    See: R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), 2010

  9. Lexicon of English Verbal Polarity Shifters

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    csv
    Updated Aug 25, 2023
    + more versions
    Cite
    (2023). Lexicon of English Verbal Polarity Shifters [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1342
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 25, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We provide a complete lexicon of English verbal polarity shifters and their shifting scope. Our lexicon covers all verbs of WordNet v3.1 that are single word or particle verbs. Polarity shifter and scope labels are given for each lemma-synset pair (i.e. each word sense of a lemma).

    Data

    The data is presented in the following forms:

    A complete lexicon of all verbal shifters and their shifting scopes.

    Two auxiliary lists:

    A list of all lemmas with shifter labels

    A list of all word senses with shifter labels

    All files are in CSV (comma-separated value) format.

    1. Main Lexicon

    File name: shifter_lexicon.csv

    The main lexicon lists all verbal shifters and their shifting scopes. Verbal shifters are modelled as lemma-sense pairs with one or more shifting scopes.

    Each line of the lexicon file contains a single lemma-sense-scope triple, using the format:

    LEMMA,SYNSET,SCOPE

    The elements are defined as follows:

    LEMMA: The lemma form of the verb.

    SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).

    SCOPE: The scope of the shifting:

    subj: The verbal shifter affects its subject.

    dobj: The verbal shifter affects its direct object.

    pobj_*: The verbal shifter affects objects within a prepositional phrase. The preposition in question is included in the annotation. For example, a from-preposition scope receives the label pobj_from and a for-preposition scope receives pobj_for.

    comp: The verbal shifter affects a clausal complement, such as infinitive clauses or gerunds.

    The lexicon lists all lemma-sense pairs that are verbal shifters. Any lemma-sense pair not listed is not a verbal shifter. When a lemma-sense pair has more than one possible scope, a separate entry is made for each scope.
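    A small sketch of how the main lexicon could be read into a lookup table keyed by lemma-synset pair, assuming the three-column CSV layout described above (skip the first row first if the file carries a header):

    ```python
    import csv
    from collections import defaultdict


    def load_shifter_scopes(path="shifter_lexicon.csv"):
        """Map each (lemma, synset) pair to the set of its shifting scopes."""
        scopes = defaultdict(set)
        with open(path, newline="", encoding="utf-8") as handle:
            for lemma, synset, scope in csv.reader(handle):
                scopes[(lemma, synset)].add(scope)   # 'subj', 'dobj', 'pobj_*', 'comp'
        return scopes


    # A (lemma, synset) pair absent from the table is, per the description above,
    # not a verbal shifter.
    ```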

    2. Auxiliary Lists

    The auxiliary files represent the same shifter information as the main lexicon, but for lemmas and synsets, respectively, instead of for lemma-sense pairs. Due to their nature, these lists are more coarse-grained than the main lexicon and contain no information on shifter scope. They are provided as a convenience for fast experimentation.

    2.1. List of Lemmas

    File name: shifter_lemma_lexicon.csv

    List of all verb lemmas and whether they are shifters in at least one of their word senses.

    LEMMA,LABEL

    LEMMA: The lemma form of the verb.

    LABEL: shifter if the verb is a shifter in at least one of its word senses, otherwise nonshifter.

    Many verbal shifter lemmas only cause shifting in some of their word senses. This list is therefore considerably more coarse-grained than the main lexicon.

    2.2. List of Synsets

    File name: shifter_synset_lexicon.csv

    List of all synsets and whether their lemmas are shifters in this specific word sense.

    SYNSET,LABEL

    SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).

    LABEL: shifter if the word sense causes shifting, otherwise nonshifter.

    Shifting is shared among lemmas of the same word sense. This list, therefore, provides (almost) the same granularity for the shifter label as the main lexicon. However, in a few exceptions, synsets contained words with subtly different senses that did not all cause shifting. These senses are considered shifters in this list, analogous to the generalisation in the list of lemmas.

    Attribution

    This dataset was created as part of the following publication:

    Schulder, Marc and Wiegand, Michael and Ruppenhofer, Josef and Köser, Stephanie (2018). "Introducing a Lexicon of Verbal Polarity Shifters for English". Proceedings of the 11th Conference on Language Resources and Evaluation (LREC). Miyazaki, Japan, May 7-12, 2018. DOI: 10.5281/zenodo.3365683.

    If you use the data in your research or work, please cite the publication.

    This work was partially supported by the German Research Foundation (DFG) under grants RU 1873/2-1 and WI4204/2-1.

  10. Bootstrapped Lexicon of German Verbal Polarity Shifters

    • zenodo.org
    • live.european-language-grid.eu
    • +1 more
    bin, txt
    Updated Jul 9, 2020
    + more versions
    Cite
    Marc Schulder; Michael Wiegand; Josef Ruppenhofer (2020). Bootstrapped Lexicon of German Verbal Polarity Shifters [Dataset]. http://doi.org/10.5281/zenodo.3365370
    Explore at:
    Available download formats: bin, txt
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Schulder; Michael Wiegand; Josef Ruppenhofer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    We provide a bootstrapped lexicon of German verbal polarity shifters. Our lexicon covers 2595 verbs of GermaNet. Polarity shifter labels are given for each word lemma. All labels were assigned by an expert annotator who is a native speaker of German.

    Data

    The data consists of two lists of GermaNet verbs annotated for whether they cause shifting:

    1. verbal_shifters.gold_standard.txt: The initial gold standard (§3) of 2000 randomly sampled verbs.
    2. verbal_shifters.bootstrapping.txt: The bootstrapped 595 verbs (§5.3) that were labelled as shifters by our best classifier and then manually annotated.

    Format

    Each line contains a verb and its label, separated by whitespace.

    Attribution

    This dataset was created as part of the following publication:

    Marc Schulder, Michael Wiegand, Josef Ruppenhofer (2018). "Automatically Creating a Lexicon of Verbal Polarity Shifters: Mono- and Cross-lingual Methods for German". Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Santa Fe, New Mexico, USA, August 20 - August 26, 2018. DOI: 10.5281/zenodo.3365694.

    If you use the data in your research or work, please cite the publication.

  11. Polarity Shifter Resources

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    csv
    Updated Aug 25, 2023
    + more versions
    Cite
    (2023). Polarity Shifter Resources [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1350
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 25, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created as part of Marc Schulder's doctoral thesis "Sentiment Polarity Shifters: Creating Lexical Resources through Manual Annotation and Bootstrapped Machine Learning".

    The collection of polarity shifter resources presented herein is also connected to a number of publications:

    Schulder et al. (IJCNLP 2017): Lexicon of English Verbal Shifters (bootstrapped, lemma-level) and sentiment verb phrase dataset. doi: 10.5281/zenodo.3364812

    Schulder et al. (LREC 2018): Lexicon of English Verbal Shifters (manual, sense-level). doi: 10.5281/zenodo.3365288

    Schulder et al. (COLING 2018): Lexicon of German Verbal Shifters (bootstrapped, lemma-level). doi: 10.5281/zenodo.3365370

    Schulder et al. (LREC 2020): Lexicon of Polarity Shifting Directions (supervised classification, lemma-level). doi: 10.5281/zenodo.3545947

    Schulder et al. (JNLE 2020): General Lexicon of English Shifters (bootstrapped, lemma-level). doi: 10.5281/zenodo.3365601

    Data

    The repository contains the following resources:

    A general lexicon of English polarity shifters, covering verbs, adjectives and nouns. Provides lemma labels for shifters and for which polarities they can affect.

    A lexicon of English verbal shifters. Provides word sense labels for shifters and their shifting scopes.

    A lexicon of German verbal shifters. Provides lemma labels for shifters.

    A set of verb phrases annotated for shifting polarities.

    1. English Shifter Lexicon (Lemma)

    A lexicon of 9145 English words, annotated for whether they are polarity shifters and which polarities they affect. The lexicon is based on the vocabulary of WordNet v3.1 (Miller et al., 1990). It contains 2631 shifters and 6514 non-shifters.

    File: shifters.english.all.lemma.txt

    The lexicon is a comma-separated value (CSV) table.

    Each line follows the format POS,LEMMA,SHIFTER_LABEL,DIRECTION_LABEL,SOURCE.

    POS: The part of speech of the word (verb, noun, adj)

    LEMMA: The lemma representation of the word in question. Multiword expressions are separated by an underscore (WORD_WORD).

    SHIFTER_LABEL: Whether the word is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).

    DIRECTION_LABEL: Whether the shifter affects only positive polarities (AFFECTS_POSITIVE), only negative polarities (AFFECTS_NEGATIVE), or can shift in both directions (AFFECTS_BOTH). Non-shifters are all labeled (NONE).

    SOURCE: Whether the word was part of the gold standard (GOLD_STANDARD) or was bootstrapped (BOOTSTRAPPED). Note that while bootstrapped shifter labels are verified by a human annotator, their direction label is automatically classified without verification.
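    As an illustration of how the lemma-level lexicon might be filtered, a sketch that keeps only shifters able to affect negative polarities, assuming the five-column layout just described (adjust if the file carries a header row):

    ```python
    import csv


    def negative_polarity_shifters(path="shifters.english.all.lemma.txt"):
        """Return (lemma, pos, source) for shifters that can affect negative polarities."""
        hits = []
        with open(path, newline="", encoding="utf-8") as handle:
            for pos, lemma, shifter, direction, source in csv.reader(handle):
                if shifter == "SHIFTER" and direction in ("AFFECTS_NEGATIVE", "AFFECTS_BOTH"):
                    hits.append((lemma, pos, source))
        return hits
    ```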

    2. English Verbal Shifter Lexicon (Word Sense)

    A lexicon of word senses of English verbs, annotated for whether they are polarity shifters and their shifting scope. The lexicon covers all verbs of WordNet v3.1 (Miller et al., 1990) that are single word or particle verbs. Polarity shifter and scope labels are given for each lemma-synset pair (i.e. each word sense of a lemma).

    The data is presented in the following forms:

    A complete lexicon of all verbal shifters and their shifting scopes.

    Two auxiliary lists containing simplified information:

    A list of all lemmas with shifter labels

    A list of all word senses with shifter labels

    All files are in CSV (comma-separated value) format.

    2.1. Complete Lexicon

    The main lexicon lists all verbal shifters and their shifting scopes. Verbal shifters are modeled as lemma-sense pairs with one or more shifting scopes.

    The lexicon lists all lemma-sense pairs that are verbal shifters. Any lemma-sense pair not listed is not a verbal shifter. When a lemma-sense pair has more than one possible scope, a separate entry is made for each scope.

    File name: shifters.english.verb.sense.csv

    Each line contains a single lemma-sense-scope triple, using the format LEMMA,SYNSET,SCOPE.

    LEMMA: The lemma representation of the verb in question. Multiword expressions are separated by an underscore (WORD_WORD).

    SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).

    SCOPE: The scope of the shifting:

    subj: The verbal shifter affects its subject.

    dobj: The verbal shifter affects its direct object.

    pobj_*: The verbal shifter affects objects within a prepositional phrase. The preposition in question is included in the annotation. For example, a from-preposition scope receives the label pobj_from and a for-preposition scope receives pobj_for.

    comp: The verbal shifter affects a clausal complement, such as infinitive clauses or gerunds.

    2.2. List of Lemmas

    List of all verb lemmas and whether they are shifters in at least one of their word senses. Does not provide shifter scope information.

    Many verbal shifter lemmas only cause shifting in some of their word senses. This list is therefore considerably more coarse-grained than the main lexicon. It is intended as a convenience measure for quick experimentation.

    File name: shifters.english.verb.sense.lemmas_only.csv

    Each line follows the format LEMMA,SHIFTER_LABEL.

    LEMMA: The lemma representation of the verb in question. Multiword expressions are separated by an underscore (WORD_WORD).

    SHIFTER_LABEL: Whether the verb is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).

    2.3. List of Synsets

    List of all synsets and whether their lemmas are shifters in this specific word sense. Does not provide shifter scope information.

    Shifting is shared among lemmas of the same word sense. This list, therefore, provides (almost) the same granularity for the shifter label as the main lexicon. However, in a few exceptions, synsets contained words with subtly different senses that did not all cause shifting. These senses are considered shifters in this list, analogous to the generalization in the list of lemmas.

    File name: shifters.english.verb.sense.synsets_only.csv

    Each line follows the format SYNSET,SHIFTER_LABEL.

    SYNSET: The numeric identifier of the synset, commonly referred to as offset or database location. It consists of 8 digits, including leading zeroes (e.g. 00334568).

    SHIFTER_LABEL: Whether the verb is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).

    3. German Verbal Shifter Lexicon (Lemma)

    A lexicon of 2595 German verbs, annotated for whether they are polarity shifters and which polarities they affect. The lexicon is based on the vocabulary of GermaNet (Hamp and Feldweg, 1997). It contains 677 shifters and 1918 non-shifters.

    File: shifters.german.verb.lemma.txt

    The lexicon is a comma-separated value (CSV) table.

    Each line follows the format LEMMA,SHIFTER_LABEL,SOURCE.

    LEMMA: The lemma representation of the verb in question. Multiword expressions are separated by an underscore (WORD_WORD).

    SHIFTER_LABEL: Whether the verb is a polarity shifter (SHIFTER) or a non-shifter (NONSHIFTER).

    SOURCE: Whether the word was part of the gold standard (GOLD_STANDARD) or was bootstrapped (BOOTSTRAPPED). In either case the verbs were verified by a human annotator.

    4. Sentiment Verb Phrases

    A set of verb phrases, annotated for the polarity of the verb phrase and the polarity of a polar noun that it contains. Can be used to evaluate whether a polarity classifier correctly recognizes polarity shifting. The file starts with 400 phrases containing shifter verbs, followed by 2231 phrases containing non-shifter verbs.

    File: sentiment_phrases.txt

    Every item consists of:

    The sentence from which the VP and the polar noun were extracted.

    The VP, polar noun and the verb heading the VP.

    Constituency parse for the VP.

    Gold labels for VP and polar noun by a human annotator.

    Predicted labels for VP and polar noun by RNTN tagger (Socher et al., 2013) and LEX_gold approach.

    Items are separated by a line of asterisks (*)

    This work was partially supported by the German Research Foundation (DFG) under grants RU 1873/2-1 and WI4204/2-1.

  12. Kannada Lexicon Dataset

    • kaggle.com
    Updated Jun 21, 2020
    Cite
    Tejasvi Sridhar (2020). Kannada Lexicon Dataset [Dataset]. https://www.kaggle.com/kushtej/kannada-lexicon-dataset/notebooks
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tejasvi Sridhar
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Building Kannada Lexicons

    A Kannada (kn) lexicon dataset of more than 8k words, consisting of positive, neutral and negative entries with polarities of +1, 0 and -1, built for the purpose of sentiment analysis and verified by Kannada annotators.

    1. Dataset Purpose:

    According to Liu (2012), sentiment analysis, a sub-domain of natural language processing, is one of the most active research areas in natural language processing and is also widely studied in data mining, web mining, and text mining.

    This dataset was created for the purpose of Kannada lexicon sentiment analysis. It is useful for classifying text as positive, neutral or negative: for example, Kannada movie reviews or other text documents can be classified as positive or negative, and the opinion of the author on a particular subject can be determined.
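    A toy sketch of the kind of lexicon-based classification this dataset is intended for, assuming the lexicon has been loaded as a word-to-polarity mapping (+1, 0, -1); the tokenisation and thresholds are illustrative only.

    ```python
    def classify(text, lexicon):
        """Sum the polarities (+1/0/-1) of known words and map the sum to a label."""
        score = sum(lexicon.get(token, 0) for token in text.split())
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"


    # classify(kannada_review_text, kannada_lexicon) -> "positive" / "negative" / "neutral"
    ```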

    2. Resources Used:

    For the creation of the Kannada lexicons, the English Subjectivity lexicon, Opinion lexicon and AFINN-111 lexicon were the three most reliable resources. On reviewing the English AFINN-111 list and comparing it with the other lexicon sources, many duplications and word ambiguities were found and removed. Additionally, new words that were not previously present in the dataset were added by the annotators. The resources used are described below:

    | List Name | Word Count (No. of Tokens) |
    |--|--|
    | AFINN-111 | 2477 |
    | Subjectivity lexicon | 5569 |
    | Opinion lexicon | 6789 |
    | VADER lexicons | 7517 |
    | Manually added lexicons | 30 |

    Next, Google Translate was used to translate the English AFINN-111, Subjectivity and Opinion lexicons to Kannada. Google Translate translated most of the English words to Kannada and transliterated the words that had no native Kannada equivalent; the untranslated English words were removed.

    Google Translate is a free multilingual statistical and neural machine translation service developed by Google, to translate text and websites from one language into another.

    The link to Google Translate can be found here : https://translate.google.co.in/

    Flow Design of Building the Kn-Lexicon Dataset:

    The flow design of building the dataset is given below.

    [Flow design diagram: flow-design.png]

    3. Collaborators:

    I would also like to thank our mentor, Prof. Hemanth Kumar A, for mentoring us in creating this dataset and guiding us throughout this project.

    4. Acknowledgements

    1. Liu, B., 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), pp.1-167.

    2. Ding, X., Liu, B. and Yu, P.S., 2008, February. A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 international conference on web search and data mining (pp. 231-240).

    3. Conrad, C., 1974. Context effects in sentence comprehension: A study of the subjective lexicon. Memory & Cognition, 2(1), pp.130-138.

    4. Hutto, C.J. and Gilbert, E., 2014, May. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth international AAAI conference on weblogs and social media.

    5. Kannan, A., Mohanty, G. and Mamidi, R., 2016, December. Towards building a sentiwordnet for tamil. In Proceedings of the 13th International Conference on Natural Language Processing (pp. 30-35).

    License

    This dataset is licensed under CC BY-SA 4.0.

  13. Data from: Analysis of Chinese Tourists in Japan by Text Mining of a Hotel...

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Elisa Claire Alemán Carreón; Hirofumi Nonaka; Toru Hiraoka (2023). Analysis of Chinese Tourists in Japan by Text Mining of a Hotel Portal Site [Dataset]. http://doi.org/10.6084/m9.figshare.7831853.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elisa Claire Alemán Carreón; Hirofumi Nonaka; Toru Hiraoka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan, China
    Description

    With an increasingly large number of Chinese tourists in Japan, the hotel industry is in need of an affordable market research tool that does not rely on expensive and time-consuming surveys or interviews. Because this problem is real and relevant to the hotel industry in Japan, and otherwise completely unexplored in other studies, we have extracted a list of potential keywords from Chinese reviews of Japanese hotels on the hotel portal site Ctrip using a mathematical model, and then used them in a sentiment analysis with a machine learning classifier. While most studies that use information collected from the internet rely on pre-existing data analysis tools, in our study we designed the mathematical model to achieve the highest possible classification performance, while also exploring the potential business implications these results may have.

  14. Data from: Development of public dynamic spatio-temporal monitoring and...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 13, 2024
    Cite
    Juan C. López; Miguel Jaller (2024). Development of public dynamic spatio-temporal monitoring and analysis tool of supply chain vulnerability, resilience, and sustainability [Dataset]. http://doi.org/10.5061/dryad.qjq2bvqqj
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 13, 2024
    Dataset provided by
    University of California, Davis
    Authors
    Juan C. López; Miguel Jaller
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    Supply chains play a pivotal role in driving economic growth and societal well-being, facilitating the efficient movement of goods from producers to consumers. However, the increasing frequency of disruptions caused by geopolitical events, pandemics, natural disasters, and shifts in commerce poses significant challenges to supply chain resilience. This draft update report discusses the development of a dynamic spatio-temporal monitoring and analysis tool to assess supply chain vulnerability, resilience, and sustainability. Leveraging news data, macroeconomic metrics, inbound cargo data (for sectors in California), and operational conditions of California’s highways, the tool employs Natural Language Processing (NLP) and empirical regression analyses to identify emerging trends and extract valuable information about disruptions to inform decision-making. Key features of the tool include sentiment analysis of news articles, topic classification, visualization of geographic locations, and tracking of macroeconomic indicators. By integrating diverse and dynamic data sources (e.g., news articles) and using empirical and analytical techniques, the tool offers a comprehensive framework to enhance our understanding of supply chain vulnerabilities and resilience, ultimately contributing to more effective strategies for decision-making in supply chain management. The dynamic nature of this tool enables continuous monitoring and adaptation to evolving conditions, thereby enhancing the analysis of resilience and sustainability in global supply chains.

    Methods

    The research team implemented a two-stage procedure to streamline the collection, processing, and analysis of news data. The stages are as follows:

    Lexicon Setup: This stage establishes the lexicons required for sentiment and topic analysis. Topics are categorized into eight groups relevant to supply chain risks: political, environmental, financial, supply and demand, logistics, system, infrastructure, and sector. Sentiments are evaluated using a dictionary-based approach with the AFINN lexicon. Three comprehensive lists of countries, states, and cities are used to classify geographical locations at three hierarchical levels: countries, states within the U.S., and cities/municipalities within California.

    News collection and processing: Automated algorithms collect the most recent news daily, with a 24-hour lag, based on the predefined query: (USA or United States) and (supply chain or supply-chain) and (disruption or resilience) and (retailer or warehouse or transportation or factory). Text mining tasks are performed to extract key performance metrics, including n-grams, topics, sentiments, and geographical locations. The process involves several steps:

    Corpus setup

    Term Frequency-Inverse Document Frequency (TF-IDF) for measuring word relevance in documents

    Entity recognition and consolidation

    Conversion of the corpus into a Document-Feature Matrix (DFM)

    Dictionary-based extraction of sentiments, topics, and geographical locations
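    A compressed sketch of these text-mining steps (TF-IDF weighting plus dictionary-based sentiment and topic tagging), using scikit-learn and the afinn package as stand-ins for the report's actual implementation; the article strings and topic keyword lists are invented for illustration.

    ```python
    from afinn import Afinn
    from sklearn.feature_extraction.text import TfidfVectorizer

    articles = [
        "Port congestion disrupts retailer supply chains in California.",
        "New warehouse capacity improves transportation resilience.",
    ]

    # Document-feature matrix with TF-IDF weights (corpus -> DFM step).
    vectorizer = TfidfVectorizer(stop_words="english")
    dfm = vectorizer.fit_transform(articles)

    # Dictionary-based sentiment with the AFINN lexicon.
    afinn = Afinn()
    sentiments = [afinn.score(text) for text in articles]

    # Dictionary-based topic tagging against a hand-built keyword list.
    topics = {"logistics": {"port", "warehouse", "transportation"},
              "infrastructure": {"highway", "bridge"}}
    tags = [[name for name, words in topics.items()
             if words & set(text.lower().split())] for text in articles]
    ```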

  15. Produced Data of Naive Bayes Sentiment Classifier

    • data.niaid.nih.gov
    Updated Jul 12, 2024
    Cite
    Jeffrey Resnik (2024). Produced Data of Naive Bayes Sentiment Classifier [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7934163
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset authored and provided by
    Jeffrey Resnik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data produced by running the Naive Bayes classifier algorithm. It lists every word in the classifier's vocabulary, together with the number of occurrences of each word and its likelihood ratio. Please note that the likelihood ratio is calculated as the likelihood of a word given a positive label divided by the likelihood of that word given a negative label. This data is licensed under the CC BY 4.0 international license and may be used freely with credit given. The data was produced from two different datasets using a Naive Bayes classifier: the Polarity Review v2.0 dataset from Cornell and the Large Movie Review Dataset from Stanford.
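    A small sketch of how such a per-word likelihood ratio can be computed from labelled documents, with Laplace (add-one) smoothing so unseen words do not divide by zero; the smoothing used for the published data is not stated here and may differ.

    ```python
    from collections import Counter


    def likelihood_ratios(pos_docs, neg_docs):
        """P(word | positive) / P(word | negative), with add-one smoothing."""
        pos_counts = Counter(word for doc in pos_docs for word in doc.split())
        neg_counts = Counter(word for doc in neg_docs for word in doc.split())
        vocab = set(pos_counts) | set(neg_counts)
        pos_total = sum(pos_counts.values()) + len(vocab)
        neg_total = sum(neg_counts.values()) + len(vocab)
        return {
            word: ((pos_counts[word] + 1) / pos_total)
                  / ((neg_counts[word] + 1) / neg_total)
            for word in vocab
        }


    ratios = likelihood_ratios(["great fun great"], ["boring plot"])
    # ratios["great"] > 1, i.e. the word is more likely under the positive class
    ```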

  16. ‘SemEval 2014 Task 4: AspectBasedSentimentAnalysis’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘SemEval 2014 Task 4: AspectBasedSentimentAnalysis’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-semeval-2014-task-4-aspectbasedsentimentanalysis-d634/322bfe9d/?iid=004-095&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘SemEval 2014 Task 4: AspectBasedSentimentAnalysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/charitarth/semeval-2014-task-4-aspectbasedsentimentanalysis on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Copied from https://alt.qcri.org/semeval2014/task4/#, all credits to respective authors.

    SemEval-2014 Task 4

    Task Description: Aspect Based Sentiment Analysis (ABSA)

    Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of current approaches, however, attempt to detect the overall polarity of a sentence, paragraph, or text span, regardless of the entities mentioned (e.g., laptops, restaurants) and their aspects (e.g., battery, screen; food, service). By contrast, this task is concerned with aspect based sentiment analysis (ABSA), where the goal is to identify the aspects of given target entities and the sentiment expressed towards each aspect. Datasets consisting of customer reviews with human-authored annotations identifying the mentioned aspects of the target entities and the sentiment polarity of each aspect will be provided.

    In particular, the task consists of the following subtasks:

    Subtask 1: Aspect term extraction

    Given a set of sentences with pre-identified entities (e.g., restaurants), identify the aspect terms present in the sentence and return a list containing all the distinct aspect terms. An aspect term names a particular aspect of the target entity.

    For example, "I liked the service and the staff, but not the food”, “The food was nothing much, but I loved the staff”. Multi-word aspect terms (e.g., “hard disk”) should be treated as single terms (e.g., in “The hard disk is very noisy” the only aspect term is “hard disk”).

    Subtask 2: Aspect term polarity

    For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative).

    For example:

    “I loved their fajitas” → {fajitas: positive} “I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive} “The fajitas are their first plate” → {fajitas: neutral} “The fajitas were great to taste, but not to see” → {fajitas: conflict}

    Subtask 3: Aspect category detection

    Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Subtask 1, and they do not necessarily occur as terms in the given sentence.

    For example, given the set of aspect categories {food, service, price, ambience, anecdotes/miscellaneous}:

    “The restaurant was too expensive” → {price} “The restaurant was expensive, but the menu was great” → {price, food}

    Subtask 4: Aspect category polarity

    Given a set of pre-identified aspect categories (e.g., {food, price}), determine the polarity (positive, negative, neutral or conflict) of each aspect category.

    For example:

    “The restaurant was too expensive” → {price: negative} “The restaurant was expensive, but the menu was great” → {price: negative, food: positive}

    Datasets:

    Two domain-specific datasets for laptops and restaurants, consisting of over 6K sentences with fine-grained aspect-level human annotations have been provided for training.

    Restaurant reviews:

    This dataset consists of over 3K English sentences from the restaurant reviews of Ganu et al. (2009). The original dataset of Ganu et al. included annotations for coarse aspect categories (Subtask 3) and overall sentence polarities; we modified the dataset to include annotations for aspect terms occurring in the sentences (Subtask 1), aspect term polarities (Subtask 2), and aspect category-specific polarities (Subtask 4). We also corrected some errors (e.g., sentence splitting errors) of the original dataset. Experienced human annotators identified the aspect terms of the sentences and their polarities (Subtasks 1 and 2). Additional restaurant reviews, not in the original dataset of Ganu et al. (2009), are being annotated in the same manner, and they will be used as test data.

    Laptop reviews:

    This dataset consists of over 3K English sentences extracted from customer reviews of laptops. Experienced human annotators tagged the aspect terms of the sentences (Subtask 1) and their polarities (Subtask 2). This dataset will be used only for Subtasks 1 and 2. Part of this dataset will be reserved as test data.

    Dataset format:

    The sentences in the datasets are annotated using XML tags.

    The following example illustrates the format of the annotated sentences of the restaurants dataset (see the sketch below).

    The possible values of the polarity field are: “positive”, “negative”, “conflict”, “neutral”. The possible values of the category field are: “food”, “service”, “price”, “ambience”, “anecdotes/miscellaneous”.

    The following example illustrates the format of the annotated sentences of the laptops dataset. The format is the same as in the restaurant datasets, with the only exception that there are no annotations for aspect categories. Notice that we annotate only aspect terms naming particular aspects (e.g., “everything about it” does not name a particular aspect).

    In the sentences of both datasets, there is an
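    The XML snippets of the original task description did not survive in this copy. As an orientation, an annotated restaurant-review sentence follows a structure along these lines (element and attribute names reconstructed from the publicly distributed SemEval-2014 data; verify against the official download before relying on them):

    ```xml
    <sentences>
      <sentence id="813">
        <text>The fajitas were great, but the service was awful.</text>
        <aspectTerms>
          <aspectTerm term="fajitas" polarity="positive" from="4" to="11"/>
          <aspectTerm term="service" polarity="negative" from="32" to="39"/>
        </aspectTerms>
        <aspectCategories>
          <aspectCategory category="food" polarity="positive"/>
          <aspectCategory category="service" polarity="negative"/>
        </aspectCategories>
      </sentence>
    </sentences>
    ```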

    --- Original source retains full ownership of the source dataset ---

  17. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC and instructions for using the code are available in [2]. The code can also be applied to lists of texts from other sources, although amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from the publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting abstracts and saving metadata: The metadata (all fields in a document excluding the abstract) and the abstract field are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text pre-processing steps on the collection of abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. The character "-" is not substituted in this step, because words like "z-score", "non-payment" and "pre-processing" need to be kept so that their actual meaning is not lost. Uniting prefixes with words is performed in later pre-processing steps.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The prefixes united for this research are listed in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing their meaning when the character "-" is removed. Examples of such words are "z-test", "well-known" and "chi-square", which are substituted by "ztest", "wellknown" and "chisquare". Such words are identified by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining "-" characters are replaced by a space.
    6. Removing numbers: All digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis; examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language, such as 'I', 'the' and 'a'. We used the 'tm' package in R to remove stop words [6]; the package lists 174 English stop words.

    Step 5. Writing the LScD into CSV format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:

    Word: Unique words from the corpus, in lowercase and in their stemmed form. The field is sorted by the number of documents containing the word, in descending order.

    Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.

    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

    Metadata File: All fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used as follows:

    1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
    2. Open the LScD_Creation.R script
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write the output files
    4. Run the full code

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
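    A minimal R sketch of the Step 4 pipeline and the two dictionary counts is shown below, using the 'tm' package referenced above. It is an illustration, not the authors' LScD_Creation.R: the input is a toy vector of abstracts, the prefix-uniting and substitution steps driven by "list_of_prefixes.csv" and "list_of_substitution.csv" are omitted, and stop words are removed before stemming so that tm's unstemmed stop-word list still matches.

    library(tm)

    # Toy input standing in for the LSC abstracts
    abstracts <- c("We compute the z-score for CO2 measurements in 2014 samples.",
                   "Pre-processing of 21st century corpora includes stemming and stop word removal.")

    corpus <- VCorpus(VectorSource(abstracts))

    # Steps 1-2: replace non-alphanumeric characters (keeping "-") with a space, then lowercase
    corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^[:alnum:]-]", " ", x)))
    corpus <- tm_map(corpus, content_transformer(tolower))

    # Steps 5-6: drop remaining "-" characters and free-standing numbers;
    # alphanumeric tokens such as "co2" and "21st" are kept
    corpus <- tm_map(corpus, content_transformer(function(x) gsub("-", " ", x)))
    corpus <- tm_map(corpus, content_transformer(function(x) gsub("\\b[0-9]+\\b", " ", x, perl = TRUE)))

    # Steps 7-8 (order swapped here, see note above): stop word removal, then stemming
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)

    # Document Term Matrix: entry (i, j) = number of times word j occurs in abstract i
    dtm <- DocumentTermMatrix(corpus)

    # Dictionary fields: number of documents containing each word and total occurrences
    m <- as.matrix(dtm)                  # fine for a toy sample; use the slam package for 1.6M documents
    lscd <- data.frame(word          = colnames(m),
                       n_docs        = colSums(m > 0),
                       n_occurrences = colSums(m))
    lscd <- lscd[order(-lscd$n_docs), ]  # sorted by document count, descending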

  18. r

    LSNG03-La20130921c - Lä word list elicitation, continued; Lä text

    • researchdata.edu.au
    Updated Apr 22, 2021
    Cite
    PARADISEC (2021). LSNG03-La20130921c - Lä word list elicitation, continued; Lä text [Dataset]. http://doi.org/10.26278/K76B-FS23
    Explore at:
    Dataset updated
    Apr 22, 2021
    Dataset provided by
    PARADISEC
    Time period covered
    Jan 1, 1970 - Present
    Area covered
    Description

    Lä wordlist elicitation continues, starting with the word for 'mouth'. At approximately 40 minutes into the recording, a woman shares a story. Recorded by Nick Evans in the Community Centre at Tais, Morehead District. Language as given: Lä

  19. f

    Key word list

    • figshare.com
    xlsx
    Updated Jul 1, 2024
    Cite
    Eleanor Durrant (2024). Key word list [Dataset]. http://doi.org/10.6084/m9.figshare.26135752.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    figshare
    Authors
    Eleanor Durrant
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Key word list for trees and forest that was used to assign sentiment scores to workshop data.

  20. Argumentative Skills in Higher Education: An analysis of university course...

    • zenodo.org
    Updated Dec 20, 2024
    Cite
    Francesca Crudele; Juliana Elisa Raffaghelli (2024). Argumentative Skills in Higher Education: An analysis of university course Syllabus. [Dataset]. http://doi.org/10.5281/zenodo.13255690
    Explore at:
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesca Crudele; Juliana Elisa Raffaghelli
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Argumentative skills are personally and professionally essential to digest complicated information (CoI) associated with the critical reconstruction of meaning (critical thinking - CT). This is a vital goal, especially in the age of social media and artificial intelligence-mediated information. Recently, the introduction of generative artificial intelligence (GenAI), particularly ChatGPT (OpenAI, 2022), has made it much simpler to collect and exchange knowledge. New tools are desperately needed to deal with the glut of post-digital information without becoming lost.

    After exploring the landscape of argumentative skills and techniques for their development, an investigation of practices in use in Italian universities was undertaken. The analysis used university syllabi, which are considered key educational tools because they provide a comprehensive overview of the course to be undertaken. Syllabi contain key information such as objectives, competencies, assignments and assessment strategies.

    The research examined education science courses to understand the importance given to argumentative skills, using stratified random sampling to proportionally represent all public universities with education science departments. A total of 133 syllabi were selected through web scraping and web crawling techniques using R software (https://cran.r-project.org/bin/windows/base/), with manual additions to overcome technical limitations.

    The analysis included text mining techniques to identify documents containing keywords related to argumentative skills. These documents were then subjected to quantitative and qualitative content analysis. Biggs' (2003) "Constructive Alignment" principles were used to assess the alignment of goals, activities, and assessments in the syllabi. Categories of analysis included the detection of argumentative skills, their alignment, and connection to the course, with a focus on presence, level of treatment, and consistency of alignment.
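    A minimal R sketch of the keyword-screening step is shown below. It is not the authors' TextMining.R: the two-row syllabi data frame and the keyword list are purely illustrative, and the real corpus was assembled with the web-scraping scripts listed further down.

    # Toy corpus standing in for the scraped syllabi
    syllabi <- data.frame(
      course = c("Pedagogia generale", "Didattica e tecnologie"),
      text   = c("Il corso sviluppa competenze di argomentazione e pensiero critico.",
                 "Introduzione alle tecnologie per la didattica."),
      stringsAsFactors = FALSE
    )

    # Illustrative keyword list related to argumentative skills (Italian and English)
    keywords <- c("argomentazione", "argumentative", "critical thinking", "pensiero critico")
    pattern  <- paste(keywords, collapse = "|")

    # Flag syllabi containing at least one keyword; these go on to content analysis
    syllabi$contains_keyword <- grepl(pattern, tolower(syllabi$text))
    selected <- syllabi[syllabi$contains_keyword, ]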

    This Zenodo record covers the full analysis process with R and NVivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts and results:

    1. List of Universities with URLs - Elenco Università.xlsx

    2. Web Scraping Script - WebScarping.R

    3. Text Mining Script - TextMining.R

    4. List of the most frequent words - Vocabulary.csv

    5. Sentiment Analysis of the corpus - Sentiment Analysis.R

    6. List of documents from sorting by Keywords - frasi_chiave.docx

    7. Codebook qualitative Analysis with Nvivo - Codebook.xlsx

    8. Results Nvivo Analysis Syllabi - Codebook-Syllabi.docx

    Any comments or improvements are welcome!


For each social issue, the paragraphs containing the issue are selected for sentiment analysis.

*Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
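As a rough sketch of how such paragraph-level scores can be produced (this is an illustration, not the code behind the files listed below), the example assumes the CRAN 'vader' package, an R port of the Python implementation cited above, and a toy vector of paragraphs; get_vader() returns a named vector that includes the compound score.

library(vader)

# Toy paragraphs standing in for paragraphs that mention a given social issue
paragraphs <- c("Regulators praised the effort to strengthen user privacy.",
                "Critics warned that the system could deepen discrimination.")

# Compound score per paragraph, normalised between -1 and +1
compound <- sapply(paragraphs, function(p) as.numeric(get_vader(p)["compound"]))

# Average sentiment for the issue / co-occurring term pair
mean(compound)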

Files

sentiments_mod11.csv: sentiment scores based on chosen unigrams

sentiments_mod22.csv: sentiment scores based on chosen bigrams

sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv: combinations of co-occurrences (unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams)
