By Huggingface Hub [source]
This cricket commentary dataset is a useful tool for understanding and analyzing the game. Its three files - Validation.csv, Train.csv, and Test.csv - support text classification and natural language processing work on cricket commentary. The features in this dataset let researchers examine player performance, team strategies, and fan excitement patterns across matches and tournaments, and the context-specific sentiment analysis attached to each comment adds further depth of insight.
1. Download the three files - Test.csv, Train.csv, Validation.csv - from Kaggle into a single folder on your local machine or Google Drive.
2. Pre-process the text by removing punctuation and stop words to produce a clean dataset for further analysis.
3. Apply supervised or unsupervised machine learning algorithms, such as Naive Bayes or Support Vector Machines (SVMs), to build models that classify new cricket commentary records (see the sketch after this list).
4. Set the parameters that correspond to the classifications you want to predict, using the train and validation data where applicable.
   a) For supervised learning, split the data into training (Train.csv), validation (Validation.csv), and test (Test.csv) sets, and define your labels consistently across all three files before training begins.
   b) For unsupervised clustering, visualize the labelled training set in 2D using PCA from the sklearn library before continuing with model building.
5. Generate insights by combining feature engineering with an ML approach appropriate to the questions you are trying to answer (for example, sentiment analysis), then evaluate the performance metrics of the results.
6. Finally, once the results have been analysed and stored, deploy the trained model along with its required libraries to a server so it can be accessed remotely as an API.
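As a minimal sketch of steps 2, 3, and 5, assuming each CSV has a text column named commentary and a target column named label (check the actual schema before running):

```python
# A minimal sketch of steps 2, 3, and 5. The column names "commentary" and
# "label" are assumptions -- check the actual CSV schema before running.
import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB


def clean(text: str) -> str:
    # Lowercase and strip punctuation; stop words are handled by the vectorizer.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


train = pd.read_csv("Train.csv")
val = pd.read_csv("Validation.csv")

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train["commentary"].astype(str).map(clean))
X_val = vectorizer.transform(val["commentary"].astype(str).map(clean))

model = MultinomialNB()
model.fit(X_train, train["label"])
print(classification_report(val["label"], model.predict(X_val)))
```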
- Generating sentiment analysis of the commentators’ remarks about a match; this can help commentators keep their remarks unbiased and track dismissed players’ emotional reactions.
- Creating analytics on which type of commentary garners the most engagement from viewers, so commentators can know how to drive viewership during a particular match.
- Applying natural language processing techniques to detect actionable insights in commentary data, such as score patterns or winning-team correlations, that could inform future commentating decisions.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
File: train.csv
File: test.csv
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
This dataset is a cleaned version of International Job Postings combined with open-source datasets.
Here are potential ideas for leveraging this dataset from both Machine Learning (ML) and Deep Learning (DL) perspectives:
Job Title Classification: Predicting or classifying job titles based on their respective descriptions. You could use various ML/DL techniques such as text classification models (like RNNs, CNNs, or Transformers) to build classifiers capable of accurately assigning job titles.
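As a hedged baseline for this idea, here is a simple TF-IDF plus logistic regression classifier; the file name job_postings.csv and the column names title and description are assumptions, so adjust them to the actual schema:

```python
# Hypothetical sketch: classify job titles from descriptions.
# File and column names are assumptions -- check the CSV schema.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("job_postings.csv").dropna(subset=["title", "description"])
X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["title"], test_size=0.2, random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=50_000),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```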
Skill Extraction and Matching: Develop algorithms that extract specific skills or qualifications from job descriptions and use them to match candidates to suitable job titles. This could involve NLP techniques for named entity recognition, keyword extraction, and candidate-job matching from resumes.
Job Recommendation System: Task participants with creating recommendation systems that suggest job titles to candidates based on their qualifications, preferences, and the content of job descriptions. This could involve collaborative filtering, content-based recommendation algorithms, or hybrid methods that combine multiple techniques.
Ethical Bias Detection and Mitigation: Detect biases within the dataset and develop methods to mitigate them, ensuring fairness in job recommendations or classifications. This could involve fairness metrics, bias detection algorithms, and fairness-aware learning techniques.
Job Description Generation Task: Encourage participants to build models capable of generating job descriptions based on specific criteria or input. This could involve sequence-to-sequence models, generative language models (e.g., GPT), or encoder-decoder architectures.
Job Market Analysis: Performing clustering or topic modeling to group job descriptions into categories or industries. This challenge could involve unsupervised learning techniques such as K-means clustering or latent Dirichlet allocation (LDA) to uncover patterns or segments within the dataset.
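As a minimal sketch of the LDA variant of this idea (again assuming a description column and the hypothetical file name job_postings.csv):

```python
# Sketch: LDA topic modeling over job descriptions.
# File and column names are assumptions.
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = pd.read_csv("job_postings.csv")["description"].dropna()
vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
dtm = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Print the top words for each discovered topic.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = terms[weights.argsort()[-8:][::-1]]
    print(f"Topic {k}: {' '.join(top)}")
```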
Feel free to use this dataset.
By imdb (From Huggingface) [source]
The IMDb Large Movie Review Dataset is a comprehensive collection of movie reviews used for sentiment classification. The dataset includes a wide range of movie reviews along with their corresponding sentiment labels, which indicate whether the review is positive or negative in nature. This invaluable dataset is aimed at facilitating sentiment analysis and classification tasks in the field of natural language processing.
The main purpose of the train.csv file within this dataset is to provide a curated collection of movie reviews, each accompanied by its respective sentiment label. This file proves particularly useful for training machine learning models to accurately predict sentiment and classify reviews based on their emotional tone.
Similarly, the test.csv file contains another set of movie reviews along with corresponding sentiment labels. Meant for testing and validating the performance of trained models, this dataset enables researchers and developers to evaluate their models' effectiveness in real-world scenarios.
Additionally, the unsupervised.csv file offers an alternative subset within the dataset. Unlike train.csv and test.csv, unsupervised.csv does not include any associated sentiment labels for individual movie reviews. This specific subset serves as a valuable resource for exploring unsupervised learning techniques within the domain of sentiment classification.
By utilizing this meticulously compiled IMDb Large Movie Review Dataset, researchers and data scientists can delve into various aspects of analyzing sentiment in textual data. With carefully labeled data points covering both positive and negative sentiments expressed in diverse film critiques, this dataset empowers users to develop machine learning algorithms that accurately assess subjective opinions in text.
Introduction:
Dataset Overview:
- Train.csv: This file contains a set of movie reviews along with their sentiment labels. It is intended for training your sentiment analysis models.
- Test.csv: This file provides another set of movie reviews along with their corresponding sentiment labels. You can use this file to evaluate the performance of your trained models.
- Unsupervised.csv: This file includes movie reviews without any associated sentiment labels. It can be used for unsupervised sentiment classification tasks.
Columns in the Dataset:
- text: The main column containing the text of each movie review.
- label: The sentiment label assigned to each review, indicating whether it is positive or negative.
Guidelines for Using the Dataset:
Training Your Model:
- Begin by loading and preprocessing the data from train.csv
- Treat 'text' as your input feature and 'label' as your target variable
- Explore different machine learning or deep learning algorithms suitable for text classification
- Train your model using various techniques, such as bag-of-words, word embeddings, or transformers
- Evaluate and fine-tune your model's performance using test.csv (a minimal training sketch follows this list)
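A minimal sketch of the training steps, assuming train.csv exposes the text and label columns described above:

```python
# Minimal sketch: train a sentiment classifier on train.csv ("text", "label").
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # bag-of-words style features
    LogisticRegression(max_iter=1000),
)
model.fit(train["text"], train["label"])
```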
Evaluating Your Model:
- Load test.csv and preprocess the data similar to what you did with train.csv
- Use this preprocessed test data to evaluate the accuracy, precision, recall, F1 score, or other relevant metrics of your trained model on unseen data
- Analyze these metrics to understand how well your model predicts sentiment (see the sketch after this list)
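A sketch of the evaluation step; it continues from the training sketch above, so `model` must already be defined:

```python
# Sketch: evaluate the trained model on test.csv.
# Run the training sketch first so `model` is defined.
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

test = pd.read_csv("test.csv")
preds = model.predict(test["text"])

print("Accuracy:", accuracy_score(test["label"], preds))
print(classification_report(test["label"], preds))  # precision, recall, F1 per class
```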
Advancing Your Model (Unsupervised Classification):
- Utilize unsupervised.csv for unsupervised sentiment classification tasks
- Preprocess the movie reviews in this file and explore techniques like clustering, topic modeling, or self-supervised learning
- Extract patterns, themes, or sentiments from the reviews without any guidance from labeled data (a clustering sketch follows this list)
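As one hedged example of the clustering route, a TF-IDF plus k-means sketch over unsupervised.csv (assuming the same text column):

```python
# Sketch: unsupervised exploration of unsupervised.csv via TF-IDF + k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

unsup = pd.read_csv("unsupervised.csv")
X = TfidfVectorizer(stop_words="english", max_features=20_000).fit_transform(
    unsup["text"]
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
unsup["cluster"] = kmeans.labels_  # candidate pseudo-labels for the two sentiments
```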
Conclusion:
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis, where the goal is to predict whether a movie review is positive or negative based on its text.
- NLP Research: The dataset can be used for various natural language processing (NLP) tasks such as text classification, information extraction, or named entity recognition. Researchers and practitioners can leverage this dataset to develop and evaluate new algorithms and techniques in the field of NLP.
- Recommendation Systems: The sentiment labels in this dataset can be used as a source of feedback or user preferences for recommendation systems. By analyzing the sentiments expressed in reviews,...
By Huggingface Hub [source]
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a large-scale resource for machine learning researchers exploring natural language understanding. It offers 433,000 sentence pairs, each annotated with textual entailment information, drawn from both spoken and written genres. With its matched and mismatched evaluation sets, MultiNLI also supports cross-genre evaluation, enabling study of how inference models generalize to new domains and of distinct linguistic patterns across sources.
How to Use the MultiNLI Corpus
The MultiNLI Corpus is an invaluable resource for machine learning researchers who are exploring the power of natural language inference and understanding. This dataset contains 433,000 sentence pairs annotated with textual entailment information, genre, and label. Follow these steps to utilize this dataset for research purposes:
- Identify the columns you require from the dataset. The columns available in this dataset are premise, premise_binary_parse, premise_parse, hypothesis, hypothesis_binary_parse, hypothesis_parse, genre and label.
- Select the subset of the data you need, from train.csv or from the matched/mismatched validation files, depending on whether you intend to train or evaluate.
- Pre-process your sentences by tokenizing them (splitting text into tokens, e.g. words) and running them through a parser to produce linguistic representations such as dependency trees or binary parse trees for each sentence pair. These can serve as model features in place of labour-intensive manual feature extraction; note that the *_parse and *_binary_parse columns already provide parsed versions.
- Build your model with a deep learning architecture suited to NLP tasks, such as attentive RNNs that learn contextual representations from raw text by applying local context at each step. Then train, evaluate, and tune hyperparameters until you achieve the desired results (a baseline sketch follows this list). Used appropriately with cutting-edge models, this resource supports substantial progress toward reliably inferring natural language, unlocking research possibilities and insights into real-world applications involving language comprehension.
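Before reaching for deep architectures, a crude bag-of-words baseline is often worth establishing. A hedged sketch follows; the validation file name and the label encoding are assumptions to verify against the downloaded files:

```python
# A hedged bag-of-words NLI baseline. File names and the label encoding
# are assumptions -- verify against the downloaded files.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")
val = pd.read_csv("validation_matched.csv")  # hypothetical file name


def pairs(df):
    # Crude featurization: concatenate premise and hypothesis into one string.
    return df["premise"].astype(str) + " [SEP] " + df["hypothesis"].astype(str)


clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
clf.fit(pairs(train), train["label"])
print(f"Matched validation accuracy: {clf.score(pairs(val), val['label']):.3f}")
```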
- Investigating the effects of out-of-domain and cross-genre evaluation on natural language processing tasks such as sentiment analysis, text classification, and summarization.
- Exploring unsupervised methods of identifying textual entailment relationships between sentences.
- Developing genre- or context-specific semantic inference systems that identify relationships across different types of language usage (spoken vs. written).
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description | |:----------------------------|:---------------------------------------------------------------------------------------| | premise | The premise of the sentence pair. (String) | | premise_binary_parse | The binary parse of the premise sentence. (String) | | premise_parse | The parse of the premise sentence. (String) | | hypothesis | The hypothes...
License: ODC Public Domain Dedication and Licence (PDDL) v1.0 - http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Supervised classification dataset produced as part of a blog series on classifying corporate email for morale and professional alignment. Series covers raw data extraction, analysis, unsupervised topic discovery and supervised model development.
The blog posts are available at:
Part 1. Raw email processing. https://www.avemacconsulting.com/2021/08/24/email-insights-from-data-science-techniques-part-1/
Part 2. Data analysis. https://www.avemacconsulting.com/2021/08/27/email-insights-from-data-science-part-2/
Part 3. Unsupervised topic classification (creates this dataset). https://www.avemacconsulting.com/2021/09/23/email-insights-from-data-science-part-3/
Part 4. Supervised modeling (uses this dataset). https://www.avemacconsulting.com/2021/10/12/email-insights-from-data-science-part-4/
Note: this data is part of a blog series and has not been fully vetted. In particular, the unsupervised topic extraction step should be further tuned for accuracy.
The original email content was taken from the public Enron email repository located at https://www.cs.cmu.edu/~enron/.
Dataset contains email body text, various supporting features (email addresses, data/time, etc.) plus multiple classification labels.
Three (3) labels were generated for sentiment, each with three (3) classes (positive/negative/neutral-unknown). Three (3) labels were also created for business/personal alignment, each with two (2) classes (fun/work).
Uses sentiment lexicon from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
Uses VADER from https://www.nltk.org/api/nltk.sentiment.html?highlight=vader#module-nltk.sentiment.vader
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
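For reference, a minimal VADER usage sketch with NLTK; the example sentence is purely illustrative:

```python
# Minimal NLTK VADER usage sketch; requires the vader_lexicon resource.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The quarterly numbers look great, nice work team!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```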
Uses AFINN from http://corpustext.com/reference/sentiment_afinn.html
Finn Årup Nielsen. "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs." Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages. CEUR Workshop Proceedings 718, 93-98. May 2011.
Large Movie Review Dataset v1.0
Overview
This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.
Dataset
The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.
In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
Files
There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset.
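As a hedged sketch, here is one way to load the labeled training reviews, parsing the id and rating out of each [id]_[rating].txt filename; the paths assume the archive was extracted to ./aclImdb:

```python
# Sketch: load labeled train reviews from the directory layout described above.
# Assumes the archive is extracted to ./aclImdb.
from pathlib import Path

rows = []
for label in ("pos", "neg"):
    for path in Path("aclImdb/train", label).glob("*.txt"):
        review_id, rating = path.stem.split("_")  # e.g. "200_8" -> ("200", "8")
        rows.append({
            "id": int(review_id),
            "rating": int(rating),
            "label": label,
            "text": path.read_text(encoding="utf-8"),
        })
print(f"Loaded {len(rows)} labeled training reviews")
```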
We also include the IMDb URLs for each review in a separate [urls_[pos, neg, unsup].txt] file. A review with unique id 200 will have its URL on line 200 of this file. Due to the ever-changing IMDb, we are unable to link directly to the review, but only to the movie's review page.
In addition to the review text files, we include already-tokenized bag of words (BoW) features that were used in our experiments. These are stored in .feat files in the train/test directories. Each .feat file is in LIBSVM format, an ASCII sparse-vector format for labeled data. The feature indices in these files start from 0, and the text token corresponding to a feature index is found in [imdb.vocab]. So a line with 0:7 in a .feat file means the first word in imdb.vocab appears 7 times in that review.
See the LIBSVM page for details on the .feat file format: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
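A hedged sketch of reading these .feat files with scikit-learn's LIBSVM loader; the labeledBow.feat filename is an assumption, so check the extracted archive:

```python
# Sketch: read a LIBSVM-format .feat file with scikit-learn.
# The filename "labeledBow.feat" is an assumption -- check the archive contents.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("aclImdb/train/labeledBow.feat")
vocab = open("aclImdb/imdb.vocab", encoding="utf-8").read().splitlines()

print(X.shape)   # (n_reviews, n_features); sparse bag-of-words counts
print(y[:5])     # star ratings stored as the LIBSVM labels
print(vocab[0], X[0, 0])  # feature index 0 maps to the first word in imdb.vocab
```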
We also include [imdbEr.txt] which contains the expected rating for each token in [imdb.vocab] as computed by (Potts, 2011). The expected rating is a good way to get a sense for the average polarity of a word in the dataset.
Citing the dataset
When using this dataset please cite our ACL 2011 paper which introduces it. This paper also contains classification results which you may want to compare against.
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
References
Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.
For questions/comments/corrections please contact Andrew Maas amaas@cs.stanford.edu