Dataset Card for poetry
Dataset Summary
It contains poems on three subjects (Love, Nature, and Mythology & Folklore) that belong to two periods: Renaissance and Modern.
Supported Tasks and Leaderboards
[Needs More Information]
Languages
[Needs More Information]
Dataset Structure
Data Instances
[Needs More Information]
Data Fields
Has 5 columns: Content, Author, Poem name, Age, Type
Data Splits
Only training… See the full description on the dataset page: https://huggingface.co/datasets/merve/poetry.
This dataset comprises a collection of 450 poems, curated to facilitate the analysis of emotional content in textual form. Each poem is labeled with one of six emotional classes: Anger, Disgust, Fear, Joy, Neutral, and Sadness. This classification enables the development and testing of models for sentiment analysis, emotional understanding, and literary studies. The dataset is designed to provide a diverse range of poetic expressions, making it a valuable resource for machine learning researchers and computational linguists interested in emotion detection and the nuances of poetic language.
Applications:
Classes:
Photo by Thought Catalog on Unsplash
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Allison Parrish's Gutenberg Poetry Corpus
This corpus was originally published under the CC0 license by Allison Parrish. Please visit Allison's accompanying GitHub repository for usage inspiration as well as more information on how the data was mined, how to create your own version of the corpus, and examples of projects using it.
This dataset contains 3,085,117 lines of poetry from hundreds of Project Gutenberg books. Each line has a corresponding gutenberg_id (1191 unique values) identifying the Project Gutenberg book it came from.
A row of data looks like this:
{'s': 'And retreated, baffled, beaten,', 'gutenberg_id': 19}
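As a rough illustration of working with rows of this shape, the sketch below groups poetry lines by their gutenberg_id. It assumes the corpus is available as newline-delimited JSON, one object per line, which may not match the file you download. Only the first sample row comes from the description above; the other two rows and the function name group_lines_by_book are invented for illustration.

```python
import json
from collections import defaultdict

# Sample standing in for the real file. The first row is the example quoted
# in the description; the other two rows are invented placeholders.
SAMPLE_NDJSON = """\
{"s": "And retreated, baffled, beaten,", "gutenberg_id": 19}
{"s": "Placeholder line from the same book", "gutenberg_id": 19}
{"s": "Placeholder line from another book", "gutenberg_id": 42}
"""


def group_lines_by_book(ndjson_text):
    """Map each gutenberg_id to the list of poetry lines attributed to it."""
    books = defaultdict(list)
    for raw in ndjson_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        row = json.loads(raw)
        books[row["gutenberg_id"]].append(row["s"])
    return dict(books)


books = group_lines_by_book(SAMPLE_NDJSON)
# Number of lines found per book in the sample.
print({book_id: len(lines) for book_id, lines in books.items()})
```

With the full corpus, the same grouping could be applied while streaming the file line by line rather than loading all 3,085,117 lines into one string.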
Capturing emotion from reviews and tweets is a well-studied task. Reviews and tweets are not rich in emotion, whereas poetry is, so capturing emotions from poetry is an interesting task. To this end, we collected poems from Poemhunter.com (we thank the website owners), created a dataset, and manually annotated the poems with 5 emotions: Fear, Sad, Surprise, Happy, and Angry. The dataset comprises 3 files:
1. ABIEMO: poems by American, British, and Indian poets
2. CAPEMO: augmented poems created with the NLPAUG library to resolve the class imbalance problem (we thank the library developers)
3. BAPEMO: extended augmented poems to resolve the class imbalance problem
Along with the emotion, the country of each poem is also assigned. The dataset can be used for poet style analysis, emotion analysis, the study of country-wise differences in poetry, etc.
I obtained this data from here.
The Arabic dataset is scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total is 1,831,770 poetic verses. Each verse is labeled with its meter, the poet who wrote it, and the age in which it was written. There are 22 meters, 3701 poets, and 11 ages: Pre-Islamic, Islamic, Umayyad, Mamluk, Abbasid, Ayyubid, Ottoman, Andalusian, the era between Umayyad and Abbasid, Fatimid, and finally the modern age. We are only interested in the 16 classic meters attributed to Al-Farahidi, which comprise the majority of the dataset with a total of around 1.7M verses. It is important to note that the diacritic state of the verses is not consistent: a verse can carry full diacritics, partial diacritics, or none at all.
https://choosealicense.com/licenses/agpl-3.0/
From: https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems
Poetry Foundation Poems Dataset
Overview
This dataset contains a collection of 13.9k poems sourced from the Poetry Foundation website. Each poem entry includes its title, author, and associated tags (if available). The dataset provides a robust resource for exploring poetry, analyzing thematic trends, or creating applications such as poem generators.
Dataset Structure
The dataset consists of the following columns: 1. Title:… See the full description on the dataset page: https://huggingface.co/datasets/suayptalha/Poetry-Foundation-Poems.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Poetry Foundation Emotion-Annotated Dataset is a collection of poems scraped from the Poetry Foundation website. It comprises four main columns: Title, Poem, Poet, and Genre. This dataset has been enriched by incorporating emotion annotations derived from a fine-tuned BERT model trained to classify emotions in text.
Title: This column contains the titles of the poems included in the dataset.
Poem: The Poem column stores the text of the poems scraped from the Poetry Foundation website.
Poet: This column lists the poets who authored the poems.
Genre: The Genre column represents the emotional classification assigned to each poem based on the text content.
The emotion annotation process employed a state-of-the-art BERT-based model specifically trained to recognize emotions in text. By leveraging this model, each poem was analyzed to identify the prevalent emotions conveyed within its text. These emotions were then mapped to corresponding emotional genres, providing insights into the overarching emotional themes of each poem.
The Poetry Foundation Emotion-Annotated Dataset offers a valuable resource for researchers, poets, literary enthusiasts, and AI practitioners interested in exploring the intersection of poetry and emotional expression. By associating emotional genres with individual poems, this dataset enables nuanced analyses of emotional themes and provides inspiration for further exploration in the realms of literature, psychology, and computational linguistics.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by scraping the Poetry Foundation website and is intended for classification.
It contains five different topics: nature, art & sciences, love, relationships, and religion, which are fairly evenly distributed.
https://choosealicense.com/licenses/afl-3.0/
Ozziey/poems_dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
I obtained this data from here.
The English dataset is scraped from many different web resources. It consists of 199,002 verses, each labeled with one of four meters: Iambic, Trochee, Dactyl, and Anapaestic. The Iambic class dominates the dataset: there are 186,809 Iambic verses, 5,418 Trochee verses, 5,378 Anapaestic verses, and 1,397 Dactyl verses.
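To make the imbalance concrete, the following sketch computes each meter's share of the dataset and a set of inverse-frequency class weights from the counts quoted above. The weighting formula, total / (number of classes × class count), is one common convention chosen here for illustration; the description does not say how, or whether, the dataset authors handled the imbalance.

```python
# Verse counts per meter, as quoted in the dataset description.
METER_COUNTS = {
    "Iambic": 186_809,
    "Trochee": 5_418,
    "Anapaestic": 5_378,
    "Dactyl": 1_397,
}

total = sum(METER_COUNTS.values())

# Fraction of all verses that belong to each meter.
shares = {meter: count / total for meter, count in METER_COUNTS.items()}

# Inverse-frequency weights (an assumed convention, not the authors' method):
# rarer meters receive proportionally larger weights.
n_classes = len(METER_COUNTS)
class_weights = {
    meter: total / (n_classes * count) for meter, count in METER_COUNTS.items()
}

for meter in METER_COUNTS:
    print(f"{meter:<11} share={shares[meter]:.4f} weight={class_weights[meter]:.2f}")
```

The four counts sum to the stated 199,002 verses, and the Iambic class alone accounts for roughly 94% of them, so a classifier that always predicts Iambic would already look accurate without learning anything about the other meters.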
Automatic analysis of rhythmic poetry with applications to generation and translation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
dataPOEM.csv
The dataPOEM.csv data set contains data on the level of each poem.
scoresAes = factor scores of moving, beauty, and melodious ratings.
participant = participant number
poemVersion = Version of poem presented: (A = original poem with rhyme and meter, B = poem variant with only rhyme, C = poem variant with only meter, D = poem variant without rhyme and meter)
poemIdentity = poem number
avgWFreq = average word frequency of poem
totalGazeSlopeLineLength
totalGazeWordMeanNAByWordLen
totalGazeWordMeanNADiff
order = order of presentation (1 = from A to D, 2 = from D to A; between participant factor)
firstFixDurMS_MINFIX_AVG = first fixation duration
totalGazeMS_MINFIX_AVG = total gaze durations
fixDurMS_MINFIX_NUM = number of fixations
sacLenMS_MINFIX_AVG = average saccade length
percRegMS_MINFIX_AVG = percentage of regressive eye movements
pupilDial_AVG = average pupil dilation
blink_NUM_TotalRT = number of blinks relative to total reading time
totalReadingTime = total reading time of the poem
areaTT = total score of the Aesthetic Responsiveness Assessment questionnaire
dataIntegrity = percentage of valid position measurements by eye tracker during reading of a poem
moving = rating of how moving the poem was
beauty = rating of how beautiful the poem was
melodious = rating of how melodious the poem was
dataROI.csv
The dataROI.csv data set contains data on the level of each line within a poem.
order = order of presentation (1 = from A to D, 2 = from D to A; between participant factor)
participant = participant number
poemIdentity = poem number
lineNr = line number within poem
poemVersion = Version of poem presented: (A = original poem with rhyme and meter, B = poem variant with only rhyme, C = poem variant with only meter, D = poem variant without rhyme and meter)
verseEnd = whether a particular word/line was the last word/line of a stanza (0 = word/line within a stanza, 1 = last word/line of a stanza)
BeginCloseRhyme = whether a particular line’s final word marked the opening or closing of a rhyme pair (1 = opening of rhyme, 2 = closing of rhyme)
lastFix = whether a particular line or word was the last one of the poem (0 = word/line within a poem, 1 = last word/line of poem)
totalGazeByWordNA = total gaze duration of final word of a line relative to word length
gazeByLineLengthNA = total gaze duration of a line relative to line length
dataIntegrity = percentage of valid position measurements by eye tracker during reading of a poem
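The poemVersion codes used in both files can be decoded into explicit rhyme and meter flags, which is convenient when modelling the two manipulations as separate factors. The sketch below is a minimal illustration based only on the code descriptions above; the names VersionFeatures and decode_version are hypothetical and not part of the dataset.

```python
from typing import NamedTuple


class VersionFeatures(NamedTuple):
    """Whether a poem variant keeps rhyme and/or meter."""
    rhyme: bool
    meter: bool


# Codes as documented for the poemVersion column.
POEM_VERSIONS = {
    "A": VersionFeatures(rhyme=True, meter=True),    # original poem
    "B": VersionFeatures(rhyme=True, meter=False),   # only rhyme
    "C": VersionFeatures(rhyme=False, meter=True),   # only meter
    "D": VersionFeatures(rhyme=False, meter=False),  # neither
}


def decode_version(code):
    """Return the rhyme/meter flags for a poemVersion code such as 'B'."""
    return POEM_VERSIONS[code.strip().upper()]


for code in "ABCD":
    features = decode_version(code)
    print(code, "rhyme" if features.rhyme else "no rhyme",
          "meter" if features.meter else "no meter")
```

Splitting the single four-level code into two binary flags mirrors the 2 × 2 structure implied by the version descriptions.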
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image:
https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is The poetry life : ten stories. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is The poetry of praise. It features 7 columns including author, publication date, language, and book publisher.
The Chinese Poetry dataset is a dataset of Chinese poems used for language modeling.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book is Poetry and the meaning of life : reading and writing poetry in language arts classrooms. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Understanding how the brain engages with poetic language is key to advancing empirical research on aesthetic and creative cognition. This experiment involved 64-channel EEG recordings and behavioural ratings from 51 participants who read and evaluated 210 short English-language texts — 70 Haiku (nature-themed), 70 Senryu (emotion-themed), and 70 non-poetic Control texts. Each poem/text was rated on five subjective dimensions: Aesthetic Appeal, Vivid Imagery, Being Moved, Originality, and Creativity — using a 7-point scale.
The full study involved 51 participants, and the data were divided into two BIDS-compliant datasets to ensure technical validation and facilitate upload to OpenNeuro.
Poetry Assessment EEG Dataset 1 (this dataset) contains data from 47 participants whose continuous EEG recordings passed technical validation and were used in the primary analyses. In this dataset, the participants.tsv file maps anonymized BIDS IDs (sub-001 to sub-047) to the original participant codes used during data collection (P101–P151).
Poetry Assessment EEG Dataset 2 includes the remaining 4 participants (P105, P141, P142, P146), whose EEG recordings were acquired in segments due to session interruptions and later concatenated during preprocessing. These participants were excluded from the PSD analysis to avoid potential artifacts but are included here for completeness and transparency.
Dataset Structure and Navigation: Each subject folder contains four core EEG files:
channels.tsv – EEG channel metadata
eeg.json – EEG recording metadata
eeg.set – Raw EEG data (EEGLAB format)
events.tsv – Event markers aligned with poem presentation
The /code/ directory includes:
Preprocessing.m – MATLAB preprocessing script
BioSemi64.loc – 64-channel coordinate file
The /derivatives/ directory contains:
Behavioural_Ratings/ – One .csv file per participant (e.g., P101.csv), including trial-by-trial ratings across five dimensions: Aesthetic Appeal, Vivid Imagery, Emotional Impact (labeled as 'being moved'), Originality, and Creativity.
Psychometric_Responses/ – A single .csv file with demographic and trait-level questionnaire responses per participant, including: PANAS (mood), Openness, Curiosity, VVIQ (visual imagery), AVIQ (auditory imagery), MAAS (mindfulness), and AReA (aesthetic responsiveness).
The directory also includes questionnaires.pdf with full questionnaire texts and scoring keys.
The /stimuli/ directory includes:
All 210 texts used in the experiment: 70 Haiku (nature-themed poetry), 70 Senryu (emotion-themed poetry), 70 Control (non-poetic matched prose).
Block-wise trial assignments for all seven blocks
Resting-state EEG was recorded at the beginning and end of each session. These segments are embedded within the raw EEG files and can be identified using the following trigger codes in events.tsv:
65285, 65286 → Resting state (before experiment); 65287, 65288 → Resting state (after experiment)
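A minimal sketch of locating the resting-state segments from these trigger codes follows. It rests on several assumptions that are not confirmed by the description: that events.tsv has tab-separated onset, duration, and value columns, that the trigger code is stored in value, and that the first code of each pair marks the start of a segment and the second marks its end. The sample rows are invented; check the real events.tsv and its sidecar metadata before relying on any of this.

```python
import csv
import io

# Trigger-code pairs from the description. Treating the first code of each
# pair as "start" and the second as "end" is an assumption.
RESTING_CODES = {
    "pre": ("65285", "65286"),
    "post": ("65287", "65288"),
}

# Invented events.tsv excerpt; column names and values are assumed.
SAMPLE_EVENTS_TSV = (
    "onset\tduration\tvalue\n"
    "1.0\t0\t65285\n"
    "121.0\t0\t65286\n"
    "130.5\t0\t11\n"
    "3000.0\t0\t65287\n"
    "3120.0\t0\t65288\n"
)


def resting_segments(tsv_text):
    """Return {label: (start_onset, end_onset)} for each resting-state block."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # If a code repeats, the last occurrence wins in this simple sketch.
    onset_by_code = {row["value"]: float(row["onset"]) for row in rows}
    segments = {}
    for label, (start_code, end_code) in RESTING_CODES.items():
        if start_code in onset_by_code and end_code in onset_by_code:
            segments[label] = (onset_by_code[start_code], onset_by_code[end_code])
    return segments


segments = resting_segments(SAMPLE_EVENTS_TSV)
print(segments)
```

The returned onsets are in whatever unit the onset column uses (seconds in BIDS) and could be passed to an EEG toolbox to crop the corresponding portions of the recording.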
Interested users may also consult Poetry Assessment EEG Dataset 2 to access recordings from the remaining 4 participants excluded from the main analyses. All preprocessing steps, event markers, and metadata structures were applied identically across both datasets (Poetry Assessment EEG Dataset 1 and Poetry Assessment EEG Dataset 2), ensuring consistency. This enables users to apply their own quality control pipelines and include these data if desired.
Of note, the anonymized participant IDs (e.g., PXXX) are used consistently across all data modalities, enabling reliable cross-referencing between EEG data, behavioural ratings, and psychometric responses. Data collection took place at the Department of Psychology at Goldsmiths, University of London, UK. The project was approved by the Local Ethics Committee at the Department of Psychology, Goldsmiths University of London. The experiment was conducted in accordance with the Declaration of Helsinki.
All EEG, behavioural, and psychometric data were anonymized. Participant identifiers were coded (P101–P151), and no names, dates of birth, or other direct identifiers are included.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset estimates the duration of Malayalam Poem syllables written in three Vruthas, Kakali, Manjari, and Keka.