Recording Environment : In-car; 1 quiet scene, 1 low-noise scene, 3 medium-noise scenes, and 2 high-noise scenes
Recording Content : Covers 5 domains: navigation, multimedia, telephone, car control, and question answering; 500 sentences per person
Speaker : Speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Device : High-fidelity microphone; binocular camera
Language : 20 languages
Transcription content : Text
Accuracy rate : 98%
Application scenarios : Speech recognition; human-computer interaction; natural language processing and text analysis; visual content understanding, etc.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Questions from Reddit.com/r/AskNYC, downloaded from PushShift and filtered to direct responses from humans where the post net score is >= 3. One month of posts was collected from each year 2015-2019 (i.e., no content from July 2019 onward). Adapted from the CSV used to fine-tune https://huggingface.co/monsoon-nlp/gpt-nyc. Blog about the original model: https://medium.com/geekculture/gpt-nyc-part-1-9cb698b2e3d
The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare, which were preprocessed, converted into XML format, and separated into training and test sets. i2b2 is a data warehouse containing clinical data on over 150k patients, including outpatient diagnoses, lab results, medications, and inpatient procedures; ETL processes were authored to pull data from EMR and finance systems. The institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists, who classified patients as Past Smokers, Current Smokers, Smokers, Non-smokers, or Unknown; second-hand smokers were considered non-smokers. Other institutions involved include the Massachusetts Institute of Technology and the State University of New York at Albany.

i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects), it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. To enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the second i2b2 Challenge will be released on the one-year anniversary of that challenge (November 2010).
Recording environment : Quiet indoor environment, low background noise, no echo.
Recording content (read speech) : Generic category; human-machine interaction category; smart home command and control category; in-car command and control category; numbers.
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, the middle-aged, the elderly, etc.
Device : Android mobile phone, iPhone.
Language : American English, British English, Canadian English, Australian English, French English, German English, Spanish English, Italian English, Portuguese English, Russian English, Indian English, Japanese English, Korean English, Singaporean English, etc.
Application scenarios : Speech recognition; voiceprint recognition.
Dataset Summary
SWE-bench Lite is a subset of SWE-bench, a dataset that tests systems' ability to solve GitHub issues automatically. It collects 300 test Issue-Pull Request pairs from 11 popular Python repositories. Evaluation is performed by unit-test verification, using post-PR behavior as the reference solution. The dataset was released as part of "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" This dataset, SWE-bench_Lite_bm25_27K, includes a formatting of each instance… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite_bm25_27K.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article contains 3 core entity types, manually annotated by curators: Gene/Protein, Disease and Organism.
Corpus Directory Structure
annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.
hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
GROUP0/: contains raw manual annotations made by curator GROUP0.
GROUP1/: contains raw manual annotations made by curator GROUP1.
GROUP2/: contains raw manual annotations made by curator GROUP2.
IOB/: contains annotations automatically extracted from the raw manual annotations in hypothesis/csv/, in Inside-Outside-Beginning (IOB) tagging format (see the parsing sketch after this directory listing).
dev/: contains IOB-format annotations of 45 articles, intended to be used as the dev set in machine learning tasks.
test/: contains IOB-format annotations of 45 articles, intended to be used as the test set in machine learning tasks.
train/: contains IOB-format annotations of 210 articles, intended to be used as the training set in machine learning tasks.
JSON/: contains annotations automatically extracted from the raw manual annotations in hypothesis/csv/, in JSON format.
README.md: a detailed description of all the annotation formats.
articles/: contains the full-text articles annotated in Europe PMC corpus.
Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
XML/: contains XML articles fetched directly using the Europe PMC Articles RESTful API.
README.md: a detailed description of the sentencising and fetching of XML articles.
docs/: contains related documents that were used for generating the corpus.
Annotation guideline.pdf: the annotation guideline provided to curators to assist the manual annotation.
demo to molecular conenctions.pdf: the annotation platform guideline provided to curators to help them get familiar with the Hypothes.is platform.
Training set development.pdf: the initial document that details the paper selection procedures.
pilot/: contains annotations and articles that were used in a pilot study.
annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
articles/: contains the full-text articles annotated in the pilot study.
Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
XML/: contains XML articles fetched directly using the Europe PMC Articles RESTful API.
README.md: a detailed description of the sentencising and fetching of XML articles.
src/: source code for cleaning annotations and generating IOB files.
metrics/ner_metrics.py: Python script containing the SemEval evaluation metrics.
annotations.py: Python script used to extract annotations from raw Hypothes.is annotations.
generate_IOB_dataset.py: Python script used to convert JSON-format annotations to IOB tagging format.
generate_json_dataset.py: Python script used to extract annotations into JSON format.
hypothesis.py: Python script used to fetch raw Hypothes.is annotations.
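The IOB files above follow the usual one-token-per-line arrangement. As a quick illustration, here is a minimal sketch of a reader, assuming whitespace-separated token/tag columns and blank lines between sentences; the tag names in the comment are illustrative, so consult the corpus README.md for the authoritative format:

```python
# Minimal sketch of a CoNLL-style IOB reader. The exact column layout
# (token first, IOB tag last, blank line between sentences) is an assumption;
# see the corpus README.md for the authoritative description.

def read_iob(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                         # blank line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
            else:
                parts = line.split()
                token, tag = parts[0], parts[-1]  # e.g. ("BRCA1", "B-GenePro"); tag names illustrative
                current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```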
License
CC BY
Feedback
For any comments, questions, or suggestions, please contact us at helpdesk@europepmc.org or via the Europe PMC contact page.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generation of multiple true-false questions
This project provides a natural language pipeline that processes German textbook sections as input and generates multiple true-false questions using GPT-2.
Assessments are an important part of the learning cycle and enable the development and promotion of competencies. However, the manual creation of assessments is very time-consuming, so the number of tasks in learning systems is often limited. In this repository, we provide an algorithm that can automatically generate an arbitrary number of German true-false statements from a textbook using the GPT-2 model. The algorithm was evaluated on a selection of textbook chapters from four academic disciplines (see the `data` folder) and rated by individual domain experts; one-third of the generated MTF questions were found suitable for learning. The algorithm provides instructors with an easier way to create assessments on textbook chapters to test factual knowledge.
As a type of multiple-choice question, multiple true-false (MTF) questions are, among other question types, a simple and efficient way to objectively test factual knowledge. The learner is challenged to distinguish between true and false statements. MTF questions can be presented in different ways, e.g., by locating a true statement in a series of false statements, identifying false statements among a list of true statements, or evaluating each statement separately as either true or false. Learners must evaluate each statement individually because a question stem can contain both incorrect and correct statements. Thus, MTF questions, as a machine-gradable format, have the potential to identify learners' misconceptions and knowledge gaps.
Example MTF question:
Check the correct statements:
[ ] All trees have green leaves.
[ ] Trees grow towards the sky.
[ ] Leaves can fall from a tree.
Features
- generation of false statements
- automatic selection of true statements
- selection of an arbitrary similarity between true and false statements, as well as the number of false statements
- generation of false statements by adding or deleting negations, as well as using a German GPT-2
Setup
Installation
1. Create a new environment: `conda create -n mtfenv python=3.9`
2. Activate the environment: `conda activate mtfenv`
3. Install dependencies using anaconda:
```
conda install -y -c conda-forge pdfplumber
conda install -y -c conda-forge nltk
conda install -y -c conda-forge pypdf2
conda install -y -c conda-forge pylatexenc
conda install -y -c conda-forge packaging
conda install -y -c conda-forge transformers
conda install -y -c conda-forge essential_generators
conda install -y -c conda-forge xlsxwriter
```
4. Download the spaCy model: `python3.9 -m spacy download de_core_news_lg`
Getting started
After installation, you can execute the bash script `bash run.sh` in the terminal to compile MTF questions for the provided textbook chapters.
To create MTF questions for your own texts use the following command:
`python3 main.py --answers 1 --similarity 0.66 --input ./`
The parameter `answers` indicates how many false answers should be generated.
By configuring the parameter `similarity` you can determine what portion of a sentence should remain the same. The remaining portion will be extracted and used to generate a false part of the sentence.
History and roadmap
* Outlook third iteration: Automatic augmentation of text chapters with generated questions
* Second iteration: Generation of multiple true-false questions with improved text summarizer and German GPT2 sentence generator
* First iteration: Generation of multiple true-false questions in the Bachelor's thesis of Mirjam Wiemeler
Publications, citations, license
Publications
Citation of the Dataset
The source code and data are maintained at GitHub: https://github.com/D2L2/multiple-true-false-question-generation
Contact
License Distributed under the MIT License. See [LICENSE.txt](https://gitlab.pi6.fernuni-hagen.de/la-diva/adaptive-assessment/generationofmultipletruefalsequestions/-/blob/master/LICENSE.txt) for more information.
Acknowledgments This research was supported by CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics of the FernUniversität in Hagen, Germany.
This project was carried out as part of research in the CATALPA project [LA DIVA](https://www.fernuni-hagen.de/forschung/schwerpunkte/catalpa/forschung/projekte/la-diva.shtml)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
QASPER: NLP Questions and Evidence Discovering Answers with Expertise By Huggingface Hub [source]
About this dataset QASPER is a collection of over 5,000 questions and answers on a vast range of Natural Language Processing (NLP) papers, all crowdsourced from experienced NLP practitioners. Each question in the dataset is written based only on the title and abstract of the corresponding paper, providing insight into how the experts understood and parsed various materials. The answers to each query have been expertly enriched by evidence taken directly from the full text of each paper. Moreover, QASPER comes with carefully crafted fields that contain relevant information, including 'qas' (questions and answers), 'evidence' (evidence provided for answering questions), title, abstract, figures_and_tables, and full_text. All this adds up to a remarkable dataset for researchers looking to gain insight into how practitioners interpret NLP topics, while providing effective validation for finding clear-cut solutions to problems encountered in the existing literature.
More Datasets For more datasets, click here.
How to use the dataset This guide provides instructions on how to use the QASPER dataset of Natural Language Processing (NLP) questions and evidence. The QASPER dataset contains 5,049 questions over 1,585 papers, crowdsourced from NLP practitioners. To get the most out of this dataset, we will show you how to access the questions and evidence, and provide tips for getting started.
Step 1: Accessing the Dataset To access the data, you can download it from Kaggle's website or through a code version control system like GitHub. Once downloaded, you will find five files: two test data sets (test.csv and validation.csv), two train data sets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and one figure data set (figures_and_tables_.json). Each .csv file contains a different dataset, with columns representing titles, abstracts, full texts, and Q&A fields with evidence for each paper mentioned in each row.
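As a quick start, something like the following sketch loads the splits with pandas; the file names come from the listing above, while the column names should be inspected rather than assumed:

```python
import pandas as pd

# File names as listed in Step 1; adjust paths to where you unpacked the download.
train = pd.read_csv("train-v2-0_lessons_only_.csv")
test = pd.read_csv("test.csv")
validation = pd.read_csv("validation.csv")

# Inspect the schema first -- the column names (title, abstract, full_text, qas, ...)
# are assumptions based on the description, not guaranteed.
print(train.columns.tolist())
print(train.head())
```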
**Step 2: Analyzing Your Data Sets** Now is a good time to explore your datasets using basic descriptive statistics, or more advanced predictive analytics such as logistic regression or naive Bayes models, depending on what kind of analysis you would like to undertake. You can start simple by summarizing basic crosstabs between any two variables in your dataset (titles, abstracts, etc.). As an example, try correlating title lengths with the number of words in the corresponding abstracts, then check whether there is anything worth investigating further.
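For instance, the suggested title-length/abstract-length correlation could be sketched in a few lines; the 'title' and 'abstract' column names are assumptions, so adjust them to the real schema:

```python
import pandas as pd

df = pd.read_csv("train-v2-0_lessons_only_.csv")  # file name from Step 1
# 'title' and 'abstract' are assumed column names -- verify against the schema.
df["title_words"] = df["title"].str.split().str.len()
df["abstract_words"] = df["abstract"].str.split().str.len()
print(df[["title_words", "abstract_words"]].corr())
```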
**Step 3: Define Your Research Questions & Perform Further Analysis** Once satisfied with your initial exploration, it is time to dig deeper into the underlying question-answer relationships among the variables in your documents. One approach is to use text-mining technologies such as topic modeling, machine learning techniques, or automated processes that help summarize underlying patterns. Another approach could involve filtering terms relevant to a specific research hypothesis, then processing those terms via web crawlers, search engines, document similarity algorithms, etc.
Finally, once all relevant parameters are defined, analyzed, and searched, it makes sense to draw preliminary conclusions linking them back together before conducting replicable tests to ensure reproducible results.
Research Ideas Developing AI models to automatically generate questions and answers from paper titles and abstracts. Enhancing machine learning algorithms by combining the answers with the evidence provided in the dataset to find relationships between papers. Creating online forums for NLP practitioners that use questions from this dataset to spark discussion within the community.
CC0
Original Data Source: QASPER: NLP Questions and Evidence
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset consists of user reviews of ChatGPT, including textual feedback, ratings, and review dates. The reviews range from brief comments to more detailed feedback, covering a wide range of user sentiments. Ratings are on a scale of 1 to 5, representing varying levels of satisfaction. The dataset spans multiple months, providing a temporal dimension for analysis. Each review is accompanied by a timestamp, allowing for time-series analysis of sentiment trends.
Original Data Source: ChatGPT Users Reviews
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
movies_overview.csv:
movies_genres.csv:
This setup allows you to map each movie’s genre_ids from the movies_overview file to its actual genre names using the movies_genres mapping.
Objective:
Create an NLP model that, given a movie’s overview, predicts the correct genre(s) for that movie. Since a movie may belong to multiple genres, this is a multi-label classification task.
Participants are tasked with designing and training an NLP model that takes as input the movie overview text and outputs one or more genres. The challenge could encourage approaches that span from classical text classification methods (e.g., TF-IDF with logistic regression) to modern transformer-based models (e.g., BERT, RoBERTa).
Encourage participants to start with simple models (e.g., bag-of-words, TF-IDF combined with logistic regression or random forest) and progress towards deep learning approaches like LSTM-based networks or transformer models.
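As a rough illustration of such a baseline, the sketch below fits TF-IDF features with one-vs-rest logistic regression. The file and column names (movies_overview.csv, overview, genre_ids) follow the description above, but the exact genre_ids serialization is an assumption:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

movies = pd.read_csv("movies_overview.csv")
# genre_ids is assumed to serialize lists like "[28, 12]" or "28,12" -- adjust as needed.
labels = movies["genre_ids"].apply(
    lambda s: [int(g) for g in str(s).strip("[]").split(",") if g.strip()]
)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                       # multi-hot genre matrix
X = TfidfVectorizer(max_features=50_000).fit_transform(movies["overview"].fillna(""))

# One independent logistic regression per genre = a simple multi-label baseline.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
```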
Since this is a multi-label task, consider evaluation metrics such as:
- F1 Score (Macro / Micro): Balances precision and recall.
- Hamming Loss: Measures how many labels are incorrectly predicted.
- Subset Accuracy: For stricter evaluation (all labels must match exactly).
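All three metrics are available in scikit-learn; a minimal sketch on toy multi-hot matrices:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Tiny hypothetical multi-hot matrices: rows = movies, columns = genres.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("F1 (micro):", f1_score(y_true, y_pred, average="micro"))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Subset accuracy:", accuracy_score(y_true, y_pred))  # all labels must match
```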
Baseline Code & Notebooks:
Provide a starter notebook with initial data loading, preprocessing, and a simple baseline model. This helps lower the entry barrier for participants who may be new to multi-label NLP tasks.
Evaluation Server & Leaderboard:
Ensure that your Kaggle competition setup allows for automatic evaluation using the selected metrics and that a public leaderboard is available for continuous feedback.
Documentation & Discussion:
Include detailed documentation describing the datasets, the task requirements, and the evaluation procedure. Additionally, host a discussion forum to foster collaboration among participants.
This challenge not only tests participants’ ability to handle multi-label classification and text processing but also encourages them to explore advanced NLP techniques and model evaluation strategies. The combination of movie overviews and genre mapping offers a rich and interesting dataset for an engaging Kaggle competition.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
GloVe word vectors, trained on Twitter data, taken from https://nlp.stanford.edu/projects/glove/ and translated to the gensim KeyedVectors (kv) format for efficient loading in gensim.
It is unclear what time period the tweets cover. The vocabulary spans several languages.
https://nlp.stanford.edu/projects/glove/
I'm trying to solve Twitter-based NLP tasks with these.
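Loading the converted vectors in gensim is then straightforward; a minimal sketch, where the .kv file name is an assumption (substitute whichever file the dataset ships):

```python
from gensim.models import KeyedVectors

# File name is an assumption -- use the .kv file included with the dataset.
kv = KeyedVectors.load("glove-twitter-200.kv")
print(kv.most_similar("nlp", topn=5))  # nearest neighbours in embedding space
```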
Overview Off-the-shelf parallel corpus data (translation data) covering many fields, including spoken language, travel, medical treatment, news, and finance. Data cleaning, desensitization, and quality inspection have been carried out.
Specifications
Storage format : TXT
Data content : Parallel corpus data
Data size : 200 million pairs
Language : 20 languages
Application scenario : Machine translation
Accuracy rate : 90%
About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800TB of annotated imagery data. These ready-to-go translation data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/nlu?source=Datarade
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
VideoRefer-700K
VideoRefer-700K is a large-scale, high-quality object-level video instruction dataset, curated using a sophisticated multi-agent data engine to fill the gap for high-quality object-level video instruction data.
VideoRefer consists of three types of data:
Object-level Detailed Caption
Object-level Short Caption
Object-level QA
Video sources:
Detailed & Short Caption: Panda-70M
QA: MeViS, A2D, Youtube-VOS
Data format: [ { "video": "videos/xxx.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A curated dataset of 241,000+ English-language comments labeled for sentiment (negative, neutral, positive). Ideal for training and evaluating NLP models in sentiment analysis.
1. text: Contains individual English-language comments or posts sourced from various online platforms.
2. label: Represents the sentiment classification assigned to each comment. It uses the following encoding:
0 — Negative sentiment
1 — Neutral sentiment
2 — Positive sentiment
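A minimal sketch of decoding these labels after loading the file (the comments.csv file name is an assumption):

```python
import pandas as pd

LABELS = {0: "negative", 1: "neutral", 2: "positive"}

df = pd.read_csv("comments.csv")           # file name is an assumption
df["sentiment"] = df["label"].map(LABELS)  # decode the integer labels
print(df["sentiment"].value_counts())
```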
This dataset is ideal for a variety of applications:
1. Sentiment Analysis Model Training: Train machine learning or deep learning models to classify text as positive, negative, or neutral.
2. Text Classification Projects: Use as a labeled dataset for supervised learning in text classification tasks.
3. Customer Feedback Analysis: Train models to automatically interpret user reviews, support tickets, or survey responses.
Geographic Coverage: Primarily English-language content from global online platforms
Time Range: The exact time range of data collection is unspecified; however, the dataset reflects contemporary online language patterns and sentiment trends typically observed in the 2010s to early 2020s.
Demographics: Specific demographic information (e.g., age, gender, location, industry) is not included in the dataset, as the focus is purely on textual sentiment rather than user profiling.
CC0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an Albanian named-entity annotation corpus generated automatically (silver standard) from Wikipedia and WikiData. It is offered in the Apache OpenNLP annotation format.
Details of the generation approach may be found in the respective published paper: https://doi.org/10.2478/cait-2018-0009
Also attached are the files that were used for generating the Albanian named-entity gazetteer, as well as the gazetteer itself in JSON format.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Introduction
We provide a dataset designed for analyzing cybersecurity incidents, comprising two primary task categories, understanding and generation, with a further breakdown into 28 subcategories of tasks. The dataset is in question-and-answer format, using structured JSON for understanding tasks and unstructured text for generation tasks. We also provide some multiple-choice questions to test the cognitive ability of the model in different vertical fields.… See the full description on the dataset page: https://huggingface.co/datasets/Multilingual-Multimodal-NLP/SEVENLLM-Dataset.
The Bangla News Classification dataset is a large collection of text articles in the Bengali (Bangla) language from the Jamuna TV website. This dataset, which contains over 11,000 rows, is designed for machine learning and natural language processing (NLP) tasks.
Key Features:
Text Articles: A wide variety of news articles, including those on events, updates, and diverse topics.
Categories: Articles are organized into five main categories:
Sports, All-Bangladesh, International, Entertainment, National
Metadata: Each article includes:
Title: The headline.
Published Date: The date and time of publication.
Reporter: The name of the reporter (if available).
Category: The category of the article.
URL: The link to the full article.
Content: A brief summary or excerpt.
Language: The dataset is entirely in Bengali, focusing on NLP tasks specific to this language.
Applications:
Text Classification: Training models to automatically categorize articles.
Sentiment Analysis: Assessing the sentiment expressed in articles.
Information Retrieval: Developing systems to find relevant articles based on queries.
Language Modeling: Creating language models and tools for Bengali.
Usage:
Research: A useful resource for NLP research related to the Bengali language.
Education: Employed in educational settings for teaching machine learning and NLP.
Application Development: Assists in developing applications for processing Bengali text, such as news aggregators and recommendation systems.
Availability:
Access: Usually available through academic institutions, research repositories, or directly from publishers.
Format: Provided in CSV format for easy integration with NLP tools.
Conclusion:
The Bangla News Classification dataset from Jamuna TV is a valuable resource for advancing research and applications in NLP for the Bengali language. It helps improve text classification, sentiment analysis, and the understanding of linguistic nuances in Bangladeshi media.
File format: csv
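A short sketch of loading and filtering the file with pandas; the file name is an assumption, and the column names follow the metadata list above (verify the exact casing against the actual CSV):

```python
import pandas as pd

df = pd.read_csv("bangla_news.csv")      # file name is an assumption
print(df["Category"].value_counts())     # Sports, All-Bangladesh, International, ...
sports = df[df["Category"] == "Sports"]  # filter one of the five categories
print(sports[["Title", "Published Date"]].head())
```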
CC BY 4.0
Original Data Source: Over 11,500 Bangla News for NLP
https://crawlfeeds.com/privacy_policy
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.
This large dataset is ideal for:
Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.
The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
```
@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}
```
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English) Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches of roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem; the task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of id, title, text, rating, and domain; the description of the columns is as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
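A minimal sketch of producing a submission file in the sample format above (the predictions here are placeholders, not real model output):

```python
import csv

# Placeholder predictions keyed by public_id; replace with your model's output.
predictions = {1: "false", 2: "true"}

with open("subtask3a_submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions.items():
        writer.writerow([public_id, rating])
```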
Additional data for Training
To train your model, participants can use additional data with a similar format; some datasets are available on the web. We don't provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for ranking teams. There is a limit of 5 runs (total, not per day), and only one person from a team is allowed to submit runs.
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a comprehensive collection of tweets designed for multilingual sentiment analysis tasks. It consists of three primary columns: "tweet","language" and "sentiment". The dataset's goal is to facilitate research and development in natural language processing and multi-task learning applications.
The "tweet" column contains the actual tweet scraped. They encompass a wide range of topics, opinions and sentiments expressed by users across the social media platforms. THe tweets are provided in their original text format.
The "language" column specifies the language in which each tweet is written. The dataset is carefully curated to include tweets from multiple languages.
The "sentiment" column contains sentiment ratings for each tweet in the range of 1 to 5 stars. 1 star represent strongly negative sentiment whereas 5 stars represent strongly positive sentiment. Intermediate values like 2,3 and 4 stars represent negative, neutral and positive sentiment.
Researchers and developers can leverage this dataset for a wide range of NLP and MTL tasks where they can jointly try to predict language and sentiment given a tweet.
The dataset is well structured and formatted in CSV file format.
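As a minimal sketch of the joint prediction idea, one can share a TF-IDF representation and fit one classifier per target; the tweets.csv file name is an assumption, while the column names follow the description above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("tweets.csv")  # file name is an assumption
X = TfidfVectorizer(max_features=30_000).fit_transform(df["tweet"])

# Shared features, one head per task: language ID and 1-5 star sentiment.
lang_clf = LogisticRegression(max_iter=1000).fit(X, df["language"])
sent_clf = LogisticRegression(max_iter=1000).fit(X, df["sentiment"])
```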