100+ datasets found

British English Language Datasets | 150+ Years of Research | Natural...
datarade.ai
Updated Jul 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage [Dataset]. https://datarade.ai/data-products/british-english-language-datasets-150-years-of-research-oxford-languages
Explore at:
.json, .xml, .csv, .xls, .mp3, .wavAvailable download formats
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
United Kingdom
Description
Our British English language datasets are meticulously curated and annotated by experienced linguistics and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The below datasets in British English are available for license:

British English Monolingual Dictionary Data

British English Synonyms and Antonyms Data

British English Pronunciations with Audio

Key Features (approximate numbers):

British English Monolingual Dictionary Data

Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

Headwords: 146,000

Senses: 230,000

Sentence examples: 149,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Updated frequency: twice a year

British English Synonyms and Antonyms Data

This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

Synonyms: 600,000

Antonyms: 22,000

Usage Examples: 39,000

Format: XML and JSON format

Delivery: Email (link-based file sharing)

Updated frequency: annually

British English Pronunciations with audio (word-level)

This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

Transcriptions (IPA): 250,000

Audio files: 180,000

Format: XLSX (for transcriptions), MP3 and WAV (audio files)

Updated frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
Dataset for Generation of multiple true false questions
zenodo.org
zip
Updated Nov 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Regina Kasakowskij; Regina Kasakowskij; Thomas Kasakowskij; Niels Seidel; Niels Seidel; Thomas Kasakowskij (2022). Dataset for Generation of multiple true false questions [Dataset]. http://doi.org/10.5281/zenodo.7303300
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7303300
Dataset updated
Nov 8, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Regina Kasakowskij; Regina Kasakowskij; Thomas Kasakowskij; Niels Seidel; Niels Seidel; Thomas Kasakowskij
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Generation of multiple true-false questions

This project provides a Natural Language Pipeline for processing German Textbook sections as an input generating Multiple True-False Questions using GPT2.

Assessments are an important part of the learning cycle and enable the development and promotion of competencies. However, the manual creation of assessments is very time-consuming. Therefore, the number of tasks in learning systems is often limited. In this repository, we provide an algorithm that can automatically generate an arbitrary number of German True False statements from a textbook using the GPT-2 model. The algorithm was evaluated with a selection of textbook chapters from four academic disciplines (see `data` folder) and rated by individual domain experts. One-third of the generated MTF Questions are suitable for learning. The algorithm provides instructors with an easier way to create assessments on chapters of textbooks to test factual knowledge.

As a type of Multiple-Choice question, Multiple True False (MTF) Questions are, among other question types, a simple and efficient way to objectively test factual knowledge. The learner is challenged to distinguish between true and false statements. MTF questions can be presented differently, e.g. by locating a true statement from a series of false statements, identifying false statements among a list of true statements, or separately evaluating each statement as either true or false. Learners must evaluate each statement individually because a question stem can contain both incorrect and correct statements. Thus, MTF Questions as a machine-gradable format have the potential to identify learners’ misconceptions and knowledge gaps.

Example MTF question:

Check the correct statements:

[ ] All trees have green leafs.

[ ] Trees grow towards the sky.

[ ] Leafes can fall from a tree.

Features

- generation of false statements

- automatic selection of true statements

- selection of an arbitrary similarity for true and false statements as well as the number of false statements

- generating false statements by adding or deleting negations as well as using a german gpt2

Setup

Installation

1. Create a new environment: `conda create -n mtfenv python=3.9`

2. Activate the environment: `conda activate mtfenv`

3. Install dependencies using anaconda:

```

conda install -y -c conda-forge pdfplumber

conda install -y -c conda-forge nltk

conda install -y -c conda-forge pypdf2

conda install -y -c conda-forge pylatexenc

conda install -y -c conda-forge packaging

conda install -y -c conda-forge transformers

conda install -y -c conda-forge essential_generators

conda install -y -c conda-forge xlsxwriter

```

3. Download spacy: `python3.9 -m spacy download de_core_news_lg`

Getting started

After installation, you can execute the bash script `bash run.sh` in the terminal to compile MTF questions for the provided textbook chapters.

To create MTF questions for your own texts use the following command:

`python3 main.py --answers 1 --similarity 0.66 --input ./

The parameter `answers` indicates how many false answers should be generated.

By configuring the parameter `similarity` you can determine what portion of a sentence should remain the same. The remaining portion will be extracted and used to generate a false part of the sentence.

## History and roadmap

* Outlook third iteration: Automatic augmentation of text chapters with generated questions

* Second iteration: Generation of multiple true-false questions with improved text summarizer and German GPT2 sentence generator

* First iteration: Generation of multiple true false questions in the Bachelor thesis of Mirjam Wiemeler

Publications, citations, license

Publications

Kasakowskij, R., Kasakowskij, T. & Seidel, N., (2022). Generation of Multiple True False Questions. In: Henning, P. A., Striewe, M. & Wölfel, M. (Hrsg.), 20. Fachtagung Bildungstechnologien (DELFI). Bonn: Gesellschaft für Informatik e.V.. (S. 147-152). DOI: [10.18420/delfi2022-026](https://dl.gi.de/handle/20.500.12116/38826)

Citation of the Dataset

Kasakowskij, R., Kasakowskij, T., & Seidel, N. (2022). Dataset for Generation of multiple true false questions. Zenodo. https://doi.org/10.5281/zenodo.7303300

The source code and data are maintained at GitHub: https://github.com/D2L2/multiple-true-false-question-generation

Contact

Regina Kasakowskij (M.A.) - regina.kasakowskij@fernuni-hagen.de

Dr. Niels Seidel - niels.seidel@fernuni-hagen.de

License Distributed under the MIT License. See [LICENSE.txt](https://gitlab.pi6.fernuni-hagen.de/la-diva/adaptive-assessment/generationofmultipletruefalsequestions/-/blob/master/LICENSE.txt) for more information.

Acknowledgments This research was supported by CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics of the FernUniversität in Hagen, Germany.

This project was carried out as part of research in the CATALPA project [LA DIVA](https://www.fernuni-hagen.de/forschung/schwerpunkte/catalpa/forschung/projekte/la-diva.shtml)
h
instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j
huggingface.co
Updated Mar 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NLP Cloud (2023). instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j [Dataset]. https://huggingface.co/datasets/nlpcloud/instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 16, 2023
Dataset authored and provided by
NLP Cloud
License
https://choosealicense.com/licenses/gpl-3.0/https://choosealicense.com/licenses/gpl-3.0/
Description
This dataset is an adaptation of the Stanford Alpaca dataset in order to turn a text generation model like GPT-J into an "instruct" model. The initial dataset was slightly reworked in order to match the GPT-J fine-tuning format with Mesh Transformer Jax on TPUs.
s
Smoking NLP Challenge Data
scicrunch.org
neuinfo.org
+2more
Updated Mar 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Smoking NLP Challenge Data [Dataset]. http://identifiers.org/RRID:SCR_008644?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008644 https://identifiers.org/RRID:SCR_008644?q=&i=rrid
Dataset updated
Mar 7, 2024
Description
The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare which were preprocessed and converted into XML format, and separated into training and test sets. I2B2 is a data warehouse containing clinical data on over 150k patients, including outpatient DX, lab results, medications, and inpatient procedures. ETL processes authored to pull data from EMR and finance systems Institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists and classified patients into Past Smokers, Current Smokers, Smokers, Non-smokers, and unknown. Second-hand smokers were considered non-smokers. Other institutions involved include Massachusetts Institute of Technology, and the State University of New York at Albany. i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects) it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the Second i2b2 Challenge will be released on the one year anniversary of that Challenge (November, 2010).
LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...
datarade.ai
Updated Jul 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data | TTS | Dictionary Display | Translation Data | LATAM Coverage [Dataset]. https://datarade.ai/data-products/latam-data-suite-1-8m-sentences-nlp-tts-dictionary-d-oxford-languages
Explore at:
.csv, .json, .mp3, .wav, .xls, .xmlAvailable download formats
Dataset updated
Jul 22, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
Mexico, Uruguay, Peru, Panama, Ecuador, Colombia, Spain, Dominican Republic, Puerto Rico, Bolivia (Plurinational State of)
Description
Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

Monolingual and Bilingual Dictionary Data Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

Sentences Curated examples of real-world usage with contextual annotations.

Synonyms & Antonyms Lexical relations to support semantic search, paraphrasing, and language understanding.

Audio Data Native speaker recordings for TTS and pronunciation modeling.

Word Lists Frequency-ranked and thematically grouped lists.

Learn more about the datasets included in the data suite:

Portuguese Monolingual Dictionary Data

Portuguese Bilingual Dictionary Data

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Spanish Synonyms and Antonyms Data

Spanish Audio Data

Spanish Word List Data

American English Monolingual Dictionary Data

American English Synonyms and Antonyms Data

American English Pronunciations with Audio

Key Features (approximate numbers):

Portuguese Monolingual Dictionary Data

Our Portuguese monolingual covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

Headwords: 143,600

Senses: 285,500

Sentence examples: 69,300

Format: XML format

Delivery: Email (link-based file sharing)

Portuguese Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality that span both European and Latin American Portuguese varieties.

Translations: 300,000

Senses: 158,000

Example sentences: 117,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Monolingual Dictionary Data

Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

American English Monolingual Dictionary Data

Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure rel...
f
Dataset for Fine-Tuning Code Generation Models: Kannada-English Algorithmic...
figshare.com
json
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Goutami Sooda; Arya Hariharan; Ahana Patil; Arya Vinod; Chandana S; Manvith L B; Nihar Mandahas; Siri H; Shobha G; Deepamala N; Jyoti Shetty (2025). Dataset for Fine-Tuning Code Generation Models: Kannada-English Algorithmic Statements and Python Code [Dataset]. http://doi.org/10.6084/m9.figshare.28401488.v1
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28401488.v1
Dataset updated
Jul 6, 2025
Dataset provided by
figshare
Authors
Goutami Sooda; Arya Hariharan; Ahana Patil; Arya Vinod; Chandana S; Manvith L B; Nihar Mandahas; Siri H; Shobha G; Deepamala N; Jyoti Shetty
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was designed to fine-tune code-generation models for converting algorithmic statements written in Kannada or English into Python code. It consists of 1,250 triplets of Kannada algorithmic statements, their English translations, and corresponding Python code. The data is stored in JSON format under the following labels:"kannada text" – Kannada algorithmic statement"text" – English algorithmic statement"code" – Expected Python outputDataset Highlights: -Triplet Structure: Kannada algorithmic statements, their English translations, and Python code.Diverse Algorithmic Concepts: Covers variable assignments, print statements, conditionals, loops, and comparison operations.Translation Methodology: English statements were generated from Kannada using the Google Translate API (via Python’s deep_translator module).This dataset supports Kannada-to-Python and English-to-Python code generation tasks, enabling step-by-step code generation from algorithmic statements.
Z
Datasets of "An Automatically Generated Annotated Corpus for Albanian Named...
data.niaid.nih.gov
zenodo.org
Updated Nov 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Baxhaku, Artur (2022). Datasets of "An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7339198
Explore at:
Dataset updated
Nov 21, 2022
Dataset provided by
Hoxha, Klesti
Baxhaku, Artur
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Albania
Description
This is an Albanian named entities annotation corpus generated automatically (silver-standard) from Wikipedia and WikiData. It is offered in Apache OpenNLP annotation format.

Details of the generation approach may be found in the respective published paper: https://doi.org/10.2478/cait-2018-0009

Attached are also the files that were used for generating the Albanian named entities gazetteer and the gazetteer itself in JSON format.

Social Media Posts in Arabic Dialect

kaggle.com

Updated Jul 11, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

UM6P Open Data (2024). Social Media Posts in Arabic Dialect [Dataset]. https://www.kaggle.com/datasets/um6popendata/sentiment-analysis-for-sm-posts-in-arabic-dialect

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 11, 2024

Dataset provided by

Kaggle

Authors

UM6P Open Data

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Dataset: Sentiment Analysis for Social Media Posts in Arabic Dialect

Overview

This dataset contains a labeled collection of approximately 50,000 social media posts in various Arabic dialects. Each post has been manually annotated with sentiment labels, providing a rich resource for natural language processing and sentiment analysis research.

Dataset Owner

UM6P College of Computing

Content

Posts: The dataset includes raw text data of social media posts written in different Arabic dialects.
Sentiment Labels: Each post is labeled with one of the following sentiment categories:
- Positive
- Negative
- Neutral

Features

Post ID: A unique identifier for each social media post.
Text: The content of the social media post in Arabic.
Sentiment: The sentiment label assigned to the post (Positive, Negative, Neutral).

Format

The dataset is provided in a CSV format with the following columns: - Post_ID: Integer - Text: String - Sentiment: String (Positive, Negative, Neutral)

Usage

This dataset is ideal for tasks such as: - Training sentiment analysis models - Studying sentiment trends in Arabic social media - Exploring the linguistic characteristics of Arabic dialects - Benchmarking sentiment analysis tools

Example Data

Post_ID	Text	Sentiment
1	"هذا المنتج رائع جدًا وأحببته كثيرًا"	Positive
2	"لم يعجبني هذا الفيلم، كان مملًا جدًا"	Negative
3	"الطقس اليوم عادي، لا يوجد شيء مميز"	Neutral

Licensing

Please refer to the dataset license included in the dataset files for information on usage rights and restrictions.

Citation

An open access NLP dataset for Arabic dialects: data collection, labeling, and model construction, Elmehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi and Ismail Berrada MENACIS 2020 conference, In press.

h
SWE-bench_Lite_bm25_27K
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Princeton NLP group, SWE-bench_Lite_bm25_27K [Dataset]. https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite_bm25_27K
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Princeton NLP group
Description
Dataset Summary

SWE-bench Lite is subset of SWE-bench, a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 300 test Issue-Pull Request pairs from 11 popular Python. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues? This dataset SWE-bench_Lite_bm25_27K includes a formatting of each instance… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite_bm25_27K.
h
SEVENLLM-Dataset
huggingface.co
Updated Dec 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Multilingual-Multimodal-NLP (2024). SEVENLLM-Dataset [Dataset]. https://huggingface.co/datasets/Multilingual-Multimodal-NLP/SEVENLLM-Dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 1, 2024
Dataset authored and provided by
Multilingual-Multimodal-NLP
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Introduce

We provided, designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks. The dataset is in question and answer format, using structured json format for understanding tasks and unstructured text format for generation tasks. We also provide some multiple-choice questions to test the cognitive ability of the model in different vertical fields.… See the full description on the dataset page: https://huggingface.co/datasets/Multilingual-Multimodal-NLP/SEVENLLM-Dataset.
f
Europe PMC Full Text Corpus
figshare.com
zip
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre (2023). Europe PMC Full Text Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.22848380.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.22848380.v2
Dataset updated
May 30, 2023
Dataset provided by
figshare
Authors
Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article contains 3 core entity types, manually annotated by curators: Gene/Protein, Disease and Organism.

Corpus Directory Structure

annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.

hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
GROUP0/: contains raw manual annotations made by curator GROUP0. GROUP1/: contains raw manual annotations made by curator GROUP1. GROUP2/: contains raw manual annotations made by curator GROUP2.

IOB/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, which is in Inside–Outside–Beginning tagging format.
dev/: contains IOB format annotations of 45 articles, suppose to be used a dev set in machine learning task. test/: contains IOB format annotations of 45 articles, suppose to be used a test set in machine learning task. train/: contains IOB format annotations of 210 articles, suppose to be used a training set in machine learning task.

JSON/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, which is in JSON format. README.md: a detailed description of all the annotation formats.

articles/: contains the full-text articles annotated in Europe PMC corpus.

Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser. XML/: contains XML articles directly fetched using Europe PMC Article Restful API. README.md: a detailed description of the sentencising and fetching of XML articles.

docs/: contains related documents that were used for generating the corpus.

Annotation guideline.pdf: annotation guideline that is provided to curators to assist the manual annotation. demo to molecular conenctions.pdf: annotation platform guideline that is provided to curator to help them get familiar with the Hypothes.is platform. Training set development.pdf: initial document that details the paper selection procedures.

pilot/: contains annotations and articles that were used in a pilot study.

annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format. articles/: contains the full-text articles annotated in the pilot study.

Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser. XML/: contains XML articles directly fetched using Europe PMC Article Restful API.

README.md: a detailed description of the sentencising and fetching of XML articles.

src/: source codes for cleaning annotations and generating IOB files

metrics/ner_metrics.py: Python script contains SemEval evaluation metrics. annotations.py: Python script used to extract annotations from raw Hypothes.is annotations. generate_IOB_dataset.py: Python script used to convert JSON format annotations to IOB tagging format. generate_json_dataset.py: Python script used to extract annotations to JSON format. hypothesis.py: Python script used to fetch raw Hypothes.is annotations.

License

CCBY

Feedback

For any comment, question, and suggestion, please contact us through helpdesk@europepmc.org or Europe PMC contact page.
f
Data_Sheet_1_Auto-CORPus: A Natural Language Processing Tool for...
datasetcatalog.nlm.nih.gov
Updated Feb 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
McQuibban, Nicholas A. R.; Posma, Joram M.; Hu, Yan; Shorter, Tom; Li, Zhuoyu; Sun, Shujian; Popovici, Casiana M.; Yeung, Cheng S.; Beck, Tim; Makraduli, Filip; Rowlands, Thomas (2022). Data_Sheet_1_Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000308109
Explore at:
Dataset updated
Feb 15, 2022
Authors
McQuibban, Nicholas A. R.; Posma, Joram M.; Hu, Yan; Shorter, Tom; Li, Zhuoyu; Sun, Shujian; Popovici, Casiana M.; Yeung, Cheng S.; Beck, Tim; Makraduli, Filip; Rowlands, Thomas
Description
To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogenous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.
Z
MESINESP: Medical Semantic Indexing in Spanish - Train dataset
data.niaid.nih.gov
live.european-language-grid.eu
Updated Nov 5, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Miranda-Escalada, Antonio (2022). MESINESP: Medical Semantic Indexing in Spanish - Train dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3826491
Explore at:
Dataset updated
Nov 5, 2022
Dataset provided by
Rana, Ankush
Krallinger, Martin
Gonzalez-Agirre, Aitor
Miranda-Escalada, Antonio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please use the MESINESP2 corpus (the second edition of the shared-task) since it has a higher level of curation, quality and is organized by document type (scientific articles, patents and clinical trials).

INTRODUCTION:

The Mesinesp (Spanish BioASQ track, see https://temu.bsc.es/mesinesp) training set has a total of 369,368 records.

The training dataset contains all records from LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. The URL used to retrieve records is as follows: http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&

We have filtered out empty abstracts and non-Spanish abstracts.

The training dataset was crawled on 10/22/2019. This means that the data is a snapshot of that moment and that may change over time. In fact, it is very likely that the data will undergo minor changes as the different databases that make up LILACS and IBECS may add or modify the indexes.

ZIP STRUCTURE:

The training data sets contain 369,368 records from 26,609 different journals. Two different data sets are distributed as described below:

Original Train set with 369,368 records that also include the qualifiers, as retrieved from VHL.

Pre-processed Train set with the 318,658 records with at least one DeCS code and with no qualifiers.

STATISTICS:

Abstracts’ length (measured in characters) Min: 12 Avg: 1140.41 Median: 1094 Max: 9428

Number of DeCS codes per file Min: 1 Avg: 8.12 Median: 7 Max: 53

CORPUS FORMAT:

The training data sets are distributed as a JSON file with the following format:

{ "articles": [ { "id": "Id of the article", "title": "Title of the article", "abstractText": "Content of the abstract", "journal": "Name of the journal", "year": 2018, "db": "Name of the database", "decsCodes": [ "code1", "code2", "code3" ] } ] }

Note that the decsCodes field lists the DeCs Ids assigned to a record in the source data. Since the original XML data contain descriptors (no codes), we provide a DeCs conversion table (https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip) with:

DeCs codes

Preferred descriptor (the label used in the European DeCs 2019 set)

List of synonyms (the descriptors and synonyms from both European and Latin Spanish DeCs 2019 data sets, separated by pipes)

For more details on the Latin and European Spanish DeCs codes see: http://decs.bvs.br and http://decses.bvsalud.org/ respectively.

Please, cite: Krallinger M, Krithara A, Nentidis A, Paliouras G, Villegas M. BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering. InEuropean Conference on Information Retrieval 2020 Apr 14 (pp. 550-556). Springer, Cham.

Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial
h
SWE-bench
huggingface.co
opendatalab.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Princeton NLP group, SWE-bench [Dataset]. https://huggingface.co/datasets/princeton-nlp/SWE-bench
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Princeton NLP group
Description
Dataset Summary

SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Want to run inference now?

This dataset only contains the problem_statement… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench.
Unlocking User Sentiment: The App Store Reviews Dataset
crawlfeeds.com
json, zip
Updated Jun 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Unlocking User Sentiment: The App Store Reviews Dataset [Dataset]. https://crawlfeeds.com/datasets/app-store-reviews-dataset
Explore at:
json, zipAvailable download formats
Dataset updated
Jun 20, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
This dataset offers a focused and invaluable window into user perceptions and experiences with applications listed on the Apple App Store. It is a vital resource for app developers, product managers, market analysts, and anyone seeking to understand the direct voice of the customer in the dynamic mobile app ecosystem.

Dataset Specifications:

Investment: $45.0

Status: Published and immediately available.

Category: Ratings and Reviews Data

Format: Compressed ZIP archive containing JSON files, ensuring easy integration into your analytical tools and platforms.

Volume: Comprises 10,000 unique app reviews, providing a robust sample for qualitative and quantitative analysis of user feedback.

Timeliness: Last crawled: (This field is blank in your provided info, which means its recency is currently unknown. If this were a real product, specifying this would be critical for its value proposition.)

Richness of Detail (11 Comprehensive Fields):

Each record in this dataset provides a detailed breakdown of a single App Store review, enabling multi-dimensional analysis:

Review Content:

review: The full text of the user's written feedback, crucial for Natural Language Processing (NLP) to extract themes, sentiment, and common keywords.

title: The title given to the review by the user, often summarizing their main point.

isEdited: A boolean flag indicating whether the review has been edited by the user since its initial submission. This can be important for tracking evolving sentiment or understanding user behavior.

Reviewer & Rating Information:

username: The public username of the reviewer, allowing for analysis of engagement patterns from specific users (though not personally identifiable).

rating: The star rating (typically 1-5) given by the user, providing a quantifiable measure of satisfaction.

App & Origin Context:

app_name: The name of the application being reviewed.

app_id: A unique identifier for the application within the App Store, enabling direct linking to app details or other datasets.

country: The country of the App Store storefront where the review was left, allowing for geographic segmentation of feedback.

Metadata & Timestamps:

_id: A unique identifier for the specific review record in the dataset.

crawled_at: The timestamp indicating when this particular review record was collected by the data provider (Crawl Feeds).

date: The original date the review was posted by the user on the App Store.

Expanded Use Cases & Analytical Applications:

This dataset is a goldmine for understanding what users truly think and feel about mobile applications. Here's how it can be leveraged:

Product Development & Improvement:

Bug Detection & Prioritization: Analyze negative review text to identify recurring technical issues, crashes, or bugs, allowing developers to prioritize fixes based on user impact.

Feature Requests & Roadmap Prioritization: Extract feature suggestions from positive and neutral review text to inform future product roadmap decisions and develop features users actively desire.

User Experience (UX) Enhancement: Understand pain points related to app design, navigation, and overall usability by analyzing common complaints in the review field.

Version Impact Analysis: If integrated with app version data, track changes in rating and sentiment after new app updates to assess the effectiveness of bug fixes or new features.

Market Research & Competitive Intelligence:

Competitor Benchmarking: Analyze reviews of competitor apps (if included or combined with similar datasets) to identify their strengths, weaknesses, and user expectations within a specific app category.

Market Gap Identification: Discover unmet user needs or features that users desire but are not adequately provided by existing apps.

Niche Opportunities: Identify specific use cases or user segments that are underserved based on recurring feedback.

Marketing & App Store Optimization (ASO):

Sentiment Analysis: Perform sentiment analysis on the review and title fields to gauge overall user satisfaction, pinpoint specific positive and negative aspects, and track sentiment shifts over time.

Keyword Optimization: Identify frequently used keywords and phrases in reviews to optimize app store listings, improving discoverability and search ranking.

Messaging Refinement: Understand how users describe and use the app in their own words, which can inform marketing copy and advertising campaigns.

Reputation Management: Monitor rating trends and identify critical reviews quickly to facilitate timely responses and proactive customer engagement.

Academic & Data Science Research:

Natural Language Processing (NLP): The review and title fields are excellent for training and testing NLP models for sentiment analysis, topic modeling, named entity recognition, and text summarization.

User Behavior Analysis: Study patterns in rating distribution, isEdited status, and date to understand user engagement and feedback cycles.

Cross-Country Comparisons: Analyze country-specific reviews to understand regional differences in app perception, feature preferences, or cultural nuances in feedback.

This App Store Reviews dataset provides a direct, unfiltered conduit to understanding user needs and ultimately driving better app performance and greater user satisfaction. Its structured format and granular detail make it an indispensable asset for data-driven decision-making in the mobile app industry.
Z
Data from: OpenChart-SE: A corpus of artificial Swedish electronic health...
data.niaid.nih.gov
zenodo.org
Updated Jul 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Berg, Johanna (2024). OpenChart-SE: A corpus of artificial Swedish electronic health records for imagined emergency care patients written by physicians in a crowd-sourcing project [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7499830
Explore at:
Dataset updated
Jul 15, 2024
Dataset provided by
Berg, Johanna
Appelgren Thorell, Björn
Aasa, Carl Ollvik
Aits, Sonja
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Sweden
Description
Electronic health records (EHRs) are a rich source of information for medical research and public health monitoring. Information systems based on EHR data could also assist in patient care and hospital management. However, much of the data in EHRs is in the form of unstructured text, which is difficult to process for analysis. Natural language processing (NLP), a form of artificial intelligence, has the potential to enable automatic extraction of information from EHRs and several NLP tools adapted to the style of clinical writing have been developed for English and other major languages. In contrast, the development of NLP tools for less widely spoken languages such as Swedish has lagged behind. A major bottleneck in the development of NLP tools is the restricted access to EHRs due to legitimate patient privacy concerns. To overcome this issue we have generated a citizen science platform for collecting artificial Swedish EHRs with the help of Swedish physicians and medical students. These artificial EHRs describe imagined but plausible emergency care patients in a style that closely resembles EHRs used in emergency departments in Sweden. In the pilot phase, we collected a first batch of 50 artificial EHRs, which has passed review by an experienced Swedish emergency care physician. We make this dataset publicly available as OpenChart-SE corpus (version 1) under an open-source license for the NLP research community. The project is now open for general participation and Swedish physicians and medical students are invited to submit EHRs on the project website (https://github.com/Aitslab/openchart-se), where additional batches of quality-controlled EHRs will be released periodically.

Dataset content

OpenChart-SE, version 1 corpus (txt files and and dataset.csv)

The OpenChart-SE corpus, version 1, contains 50 artificial EHRs (note that the numbering starts with 5 as 1-4 were test cases that were not suitable for publication). The EHRs are available in two formats, structured as a .csv file and as separate textfiles for annotation. Note that flaws in the data were not cleaned up so that it simulates what could be encountered when working with data from different EHR systems. All charts have been checked for medical validity by a resident in Emergency Medicine at a Swedish hospital before publication.

Codebook.xlsx

The codebook contain information about each variable used. It is in XLSForm-format, which can be re-used in several different applications for data collection.

suppl_data_1_openchart-se_form.pdf

OpenChart-SE mock emergency care EHR form.

suppl_data_3_openchart-se_dataexploration.ipynb

This jupyter notebook contains the code and results from the analysis of the OpenChart-SE corpus.

More details about the project and information on the upcoming preprint accompanying the dataset can be found on the project website (https://github.com/Aitslab/openchart-se).
VideoRefer-700K
huggingface.co
Updated Jun 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Language Technology Lab at Alibaba DAMO Academy (2025). VideoRefer-700K [Dataset]. https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K
Explore at:
Dataset updated
Jun 26, 2025
Dataset provided by
Damo Academy
Authors
Language Technology Lab at Alibaba DAMO Academy
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
VideoRefer-700K

VideoRefer-700K is a large-scale, high-quality object-level video instruction dataset. Curated using a sophisticated multi-agent data engine to fill the gap for high-quality object-level video instruction data.

VideoRefer consists of three types of data:

Object-level Detailed Caption Object-level Short Caption Object-level QA

Video sources:

Detailed&Short Caption Panda-70M.

QA MeViSA2D Youtube-VOS

Data format: [ { "video": "videos/xxx.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K.
Portuguese Language Datasets | 300K Translations | Natural Language...
datarade.ai
.json, .xml
Updated Jul 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). Portuguese Language Datasets | 300K Translations | Natural Language Processing (NLP) Data | Dictionary Display | Translation | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/portuguese-language-datasets-140k-words-300k-translations-oxford-languages
Explore at:
.json, .xmlAvailable download formats
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
Timor-Leste, Cabo Verde, Macao, Guinea-Bissau, Portugal, Sao Tome and Principe, Angola, Mozambique, Brazil
Description
Our Portuguese language datasets are carefully compiled and annotated by language and linguistic experts. The below datasets in Portuguese are available for license:

Portuguese Monolingual Dictionary Data

Portuguese Bilingual Dictionary Data

Key Features (approximate numbers):

Portuguese Monolingual Dictionary Data

Our Portuguese monolingual covers both EU and LATAM varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

Headwords:143,600

Senses: 285,500

Example sentences: 69,300

Format: XML format

Delivery: Email (link-based file sharing)

Portuguese Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality that span both EU and LATAM Portuguese varieties.

Translations: 300,000

Senses: 158,000

Example sentences: 117,800

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
Z
CT-FAN: A Multilingual dataset for Fake News Detection
data.niaid.nih.gov
Updated Oct 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Wiegand (2022). CT-FAN: A Multilingual dataset for Fake News Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4714516
Explore at:
Dataset updated
Oct 23, 2022
Dataset provided by
Thomas Mandl
Gautam Kishore Shahi
Juliane Köhler
Julia Maria Struß
Michael Wiegand
Melanie Siegel
Description
By downloading the data, you agree with the terms & conditions mentioned below:

Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.

Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.

We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.

Citation

Please cite our work as

@InProceedings{clef-checkthat:2022:task3, author = {K{"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas}, title = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection", year = {2022}, booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum", series = {CLEF~'2022}, address = {Bologna, Italy},}

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.

Task 3: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and roughly about 1264 articles with the respective label in English language. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Cross-Lingual Task (German)

Along with the multi-class task for the English language, we have introduced a task for low-resourced language. We will provide the data for the test in the German language. The idea of the task is to use the English data and the concept of transfer to build a classification model for the German language.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Output data format

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating 1, false 2, true

IMPORTANT!

We have used the data from 2010 to 2022, and the content of fake news is mixed up with several topics like elections, COVID-19 etc.

Baseline: For this task, we have created a baseline system. The baseline system can be found at https://zenodo.org/record/6362498

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingual cross-domain fact check news dataset for covid-19,” in workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104

Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeno, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021, March). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021, September). Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
Labelled data for fine tuning a geological Named Entity Recognition and...
metadata.bgs.ac.uk
ckan.publishing.service.gov.uk
+1more
html
Updated Feb 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
British Geological Survey (2024). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://metadata.bgs.ac.uk/geonetwork/srv/api/records/15ac4ca9-3be0-119e-e063-0937940a8990
Explore at:
htmlAvailable download formats
Dataset updated
Feb 15, 2024
Dataset authored and provided by
British Geological Surveyhttps://www.bgs.ac.uk/
License
http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitationshttp://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations
Time period covered
Nov 1, 2023 - Feb 15, 2024
Description
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated to enable the dataset to be used as labelled training data for a Named Entity Recognition model and Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies, observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work https://github.com/BritishGeologicalSurvey/geo-ner-model https://zenodo.org/records/4181488 . The data can be used to fine tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to automatically create the labels, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format which is the export format from doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in pdf image form. These latter documents had to undergo OCR and resulted in lower quality text and a lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model. The development of this training data and the text processing scripts were supported by a grant from UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund Project 10083604

Facebook

Twitter

Click to copy link

Link copied

Cite

British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage

Explore at:

.json, .xml, .csv, .xls, .mp3, .wavAvailable download formats

Dataset updated

Jul 30, 2025

Dataset authored and provided by

Oxford Languageshttps://lexico.com/es

Area covered

United Kingdom

Description

Our British English language datasets are meticulously curated and annotated by experienced linguistics and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The below datasets in British English are available for license:

British English Monolingual Dictionary Data
British English Synonyms and Antonyms Data
British English Pronunciations with Audio

Key Features (approximate numbers):

British English Monolingual Dictionary Data

Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

Headwords: 146,000
Senses: 230,000
Sentence examples: 149,000
Format: XML and JSON format
Delivery: Email (link-based file sharing) and REST API
Updated frequency: twice a year

British English Synonyms and Antonyms Data

This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

Synonyms: 600,000
Antonyms: 22,000
Usage Examples: 39,000
Format: XML and JSON format
Delivery: Email (link-based file sharing)
Updated frequency: annually

British English Pronunciations with audio (word-level)

This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

Transcriptions (IPA): 250,000
Audio files: 180,000
Format: XLSX (for transcriptions), MP3 and WAV (audio files)
Updated frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

Clear search

Close search

Google apps

Main menu

British English Language Datasets | 150+ Years of Research | Natural...

Dataset for Generation of multiple true false questions

instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j

Smoking NLP Challenge Data

LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...

Dataset for Fine-Tuning Code Generation Models: Kannada-English Algorithmic...

Datasets of "An Automatically Generated Annotated Corpus for Albanian Named...

Social Media Posts in Arabic Dialect

Dataset: Sentiment Analysis for Social Media Posts in Arabic Dialect

Overview

Dataset Owner

Content

Features

Format

Usage

Example Data

Licensing

Citation

SWE-bench_Lite_bm25_27K

SEVENLLM-Dataset

Europe PMC Full Text Corpus

Data_Sheet_1_Auto-CORPus: A Natural Language Processing Tool for...

MESINESP: Medical Semantic Indexing in Spanish - Train dataset

SWE-bench

Unlocking User Sentiment: The App Store Reviews Dataset

Data from: OpenChart-SE: A corpus of artificial Swedish electronic health...

VideoRefer-700K

Portuguese Language Datasets | 300K Translations | Natural Language...

CT-FAN: A Multilingual dataset for Fake News Detection

Labelled data for fine tuning a geological Named Entity Recognition and...

British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage