Dataset Card for poetry
Dataset Summary
It contains poems on three subjects (Love, Nature, and Mythology & Folklore) from two periods: Renaissance and Modern.
Supported Tasks and Leaderboards
[Needs More Information]
Languages
[Needs More Information]
Dataset Structure
Data Instances
[Needs More Information]
Data Fields
The dataset has 5 columns: Content, Author, Poem name, Age, Type.
Data Splits
Only training… See the full description on the dataset page: https://huggingface.co/datasets/merve/poetry.
https://choosealicense.com/licenses/cc0-1.0/
Allison Parrish's Gutenberg Poetry Corpus
This corpus was originally published under the CC0 license by Allison Parrish. Please visit Allison's fantastic accompanying GitHub repository for usage inspiration as well as more information on how the data was mined, how to create your own version of the corpus, and examples of projects using it. This dataset contains 3,085,117 lines of poetry from hundreds of Project Gutenberg books. Each line has a corresponding gutenberg_id (1191… See the full description on the dataset page: https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Pham Tuyet
Released under MIT
Dataset Description
This dataset is a curated collection of poems, each categorized by a specific emotion: Anger, Courage, Fear, Joy, Love, Peace, Sad, and Surprise. Each line of poetry captures the depth and essence of human emotions.
Dataset Highlights
- Structure: Two columns, "Poem" (a single poetic line) and "Emotion" (the associated emotional category).
- Versatility: Combines artistic creativity with analytical rigor, suitable for academic, creative, and technical applications.
- Volume: A comprehensive and growing repository of poetry that bridges art and machine learning.
This dataset is designed to inspire both humans and machines to understand, generate, and respond to the spectrum of human emotions in literature.
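As a minimal sketch of working with the two-column layout described above, the snippet below tallies lines per emotion. The description does not name the file, so the CSV text is a made-up inline sample; only the column names "Poem" and "Emotion" come from the description, and a comma delimiter is assumed.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file name and contents are not given here.
SAMPLE_CSV = """Poem,Emotion
The storm inside me will not rest,Anger
I rise again to meet the day,Courage
Soft light upon the quiet lake,Peace
I stand although my knees are weak,Courage
"""


def count_emotions(csv_text):
    """Return a Counter mapping each Emotion label to its number of lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Emotion"] for row in reader)


emotion_counts = count_emotions(SAMPLE_CSV)
```

A per-label tally like this is a quick way to check class balance before training an emotion classifier.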
This file contains nearly all poems from the Poetry Foundation website.
Content
All poems have a title and an author. Most poems are also labeled with the tags available from the Poetry Foundation website.
Inspiration
This dataset can be used for a variety of tasks related to poetry writing.
Automatic analysis of rhythmic poetry with applications to generation and translation.
dataPOEM.csv
The dataPOEM.csv data set contains data on the level of each poem.
scoresAes = factor scores of moving, beauty, and melodious ratings
participant = participant number
poemVersion = version of poem presented (A = original poem with rhyme and meter, B = poem variant with only rhyme, C = poem variant with only meter, D = poem variant without rhyme and meter)
poemIdentity = poem number
avgWFreq = average word frequency of poem
totalGazeSlopeLineLength
totalGazeWordMeanNAByWordLen
totalGazeWordMeanNADiff
order = order of presentation (1 = from A to D, 2 = from D to A; between participant factor)
firstFixDurMS_MINFIX_AVG = first fixation duration
totalGazeMS_MINFIX_AVG = total gaze durations
fixDurMS_MINFIX_NUM = number of fixations
sacLenMS_MINFIX_AVG = average saccade length
percRegMS_MINFIX_AVG = percentage of regressive eye movements
pupilDial_AVG = average pupil dilation
blink_NUM_TotalRT = number of blinks relative to total reading time
totalReadingTime = total reading time of the poem
areaTT = total score of the Aesthetic Responsiveness Assessment questionnaire
dataIntegrity = percentage of valid position measurements by eye tracker during reading of a poem
moving = rating of how moving the poem was
beauty = rating of how beautiful the poem was
melodious = rating of how melodious the poem was
dataROI.csv
The dataROI.csv data set contains data on the level of each line within a poem.
order = order of presentation (1 = from A to D, 2 = from D to A; between participant factor)
participant = participant number
poemIdentity = poem number
lineNr = line number within poem
poemVersion = version of poem presented (A = original poem with rhyme and meter, B = poem variant with only rhyme, C = poem variant with only meter, D = poem variant without rhyme and meter)
verseEnd = whether a particular word/line was the last line of a stanza (0 = word/line within a stanza, 1 = last word/line of a stanza)
BeginCloseRhyme = whether a particular line's final word marked the opening or closing of a rhyme pair (1 = opening of rhyme, 2 = closing of rhyme)
lastFix = whether a particular line or word was the last one of the poem (0 = word/line within a poem, 1 = last word/line of poem)
totalGazeByWordNA = total gaze duration of final word of a line relative to word length
gazeByLineLengthNA = total gaze duration of a line relative to line length
dataIntegrity = percentage of valid position measurements by eye tracker during reading of a poem
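To illustrate how the poem-level columns fit together, here is a small sketch that averages one rating within each poemVersion. The rows are invented for illustration and a comma delimiter is assumed; only the column names (participant, poemVersion, poemIdentity, moving, beauty, melodious) come from the data dictionary above.

```python
import csv
import io
from collections import defaultdict

# Made-up illustrative rows; only the column names come from the data dictionary.
SAMPLE = """participant,poemVersion,poemIdentity,moving,beauty,melodious
1,A,1,5,6,6
1,D,2,3,3,2
2,A,2,4,5,6
2,D,1,2,4,3
"""


def mean_rating_by_version(csv_text, rating="beauty"):
    """Average one rating column within each poemVersion (A, B, C, D)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["poemVersion"]] += float(row[rating])
        counts[row["poemVersion"]] += 1
    return {version: totals[version] / counts[version] for version in totals}


beauty_means = mean_rating_by_version(SAMPLE, "beauty")
```

With the real dataPOEM.csv, the same function could be pointed at moving or melodious to compare the original poems (A) against the variants without rhyme and meter (D).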
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book is Poetry : an introduction. It features 10 columns, including number of authors, number of books, earliest publication date, and latest publication date.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Since I could not find a good dataset online for Hindi poems, I decided to scrape public sites to find some beautiful poems. This dataset is the result of that scraping process, carried out using the scrapy module in Python.
The dataset can be loaded as a Python list of dictionaries by reading the JSON file line by line and converting each line with the json module.
Example:
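import json

data = []
# Each line of the file holds one JSON object, so parse line by line.
with open("scraped_all.json", "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))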
Dataset is scraped from: https://www.amarujala.com/kavya/kavita.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Frontier poetry is one of the most important themes in classical Chinese poetry, focusing on life and scenery in border regions. Imagery is a semantic composite of subjective and objective interactions, in which objective objects carry the poet's subjective emotions. The imagery system of frontier poetry exhibits significant regional convergence and cultural symbolism. This paper constructs a dataset of imagery sentiment in frontier poetry, which includes 40,000 frontier poems from the pre-Qin period to the present. It uses a combination of textual criticism and computational linguistics theories and methods to annotate and proofread the imagery and sentiments expressed in frontier poetry. This dataset not only provides rich research data for the study of frontier poetry, but also provides a macro perspective for in-depth exploration of the evolution of imagery sentiment in poetry.
For this dataset, 42,836 frontier poems were crawled from the Internet, covering war poems from the Book of Songs in the pre-Qin period through contemporary new poems, with the aim of being complete, accurate, and reliable. The crawled data was cleaned and standardized: non-text symbols and redundant format tags were removed, a table of variant characters was established, and ancient texts were used to restore garbled characters through exegesis. Incorrectly identified poems were deleted, and finally sentence segmentation and error correction were performed, with each sentence separated by commas and periods. In the end, a total of 42,807 high-quality frontier poems were obtained.
Based on the collected poem texts, we constructed a data annotation system containing the encoding, author, name, imagery, and sentiment information of the poems. Each poem has a unique number: the first two digits represent the dynasty number (such as "01" for the pre-Qin period), the middle four digits represent the author number (poets are sorted by their birth and death years), and the last two digits represent the serial number of the work (sorted by the first letter of the title). The imagery data of the poems and lyrics is annotated using a pre-trained model and manual review, while the sentiment is annotated manually.
The final dataset consists of 11 CSV tables, one table for each dynasty, and the files are named after the dynasty. Each data point consists of six parts: code, author, name, text, imagery, and sentiment.
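The numbering scheme described above (two digits for dynasty, four for author, two for work) can be sketched as a small parser. This assumes the code is stored as a zero-padded eight-character string; if a CSV reader converts it to an integer, the leading zero of a code such as "01..." would be lost. The names PoemCode and parse_poem_code, and the example code, are hypothetical.

```python
from typing import NamedTuple


class PoemCode(NamedTuple):
    dynasty: str  # first two digits, e.g. "01" for the pre-Qin period
    author: str   # middle four digits
    work: str     # last two digits


def parse_poem_code(code: str) -> PoemCode:
    """Split an eight-digit poem number into its three parts."""
    if len(code) != 8 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected an 8-digit code, got {code!r}")
    return PoemCode(dynasty=code[:2], author=code[2:6], work=code[6:])


# Hypothetical code for illustration; not taken from the dataset.
example = parse_poem_code("01000301")
```

Keeping the parts as strings preserves the zero padding, which matters when grouping poems by dynasty or author.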
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image: https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Understanding how the brain engages with poetic language is key to advancing empirical research on aesthetic and creative cognition. This experiment involved 64-channel EEG recordings and behavioural ratings from 51 participants who read and evaluated 210 short English-language texts — 70 Haiku (nature-themed), 70 Senryu (emotion-themed), and 70 non-poetic Control texts. Each poem/text was rated on five subjective dimensions: Aesthetic Appeal, Vivid Imagery, Being Moved, Originality, and Creativity — using a 7-point scale.
The full study involved 51 participants, and the data were divided into two BIDS-compliant datasets to ensure technical validation and facilitate upload to OpenNeuro.
Poetry Assessment EEG Dataset 1 contains data from 47 participants whose continuous EEG recordings passed technical validation and were used in the primary analyses.
Poetry Assessment EEG Dataset 2 (this dataset) includes the remaining 4 participants (P105, P141, P142, P146), whose EEG recordings were acquired in segments due to session interruptions and later concatenated during preprocessing. These participants were excluded from the PSD analysis to avoid potential artifacts but are included here for completeness and transparency. In this dataset, the participants.tsv file maps anonymized BIDS IDs (sub-001 to sub-004) to the original participant codes used during data collection (P105–P146), as follows:
sub-001 → P105
sub-002 → P141
sub-003 → P142
sub-004 → P146
Dataset Structure and Navigation: Each subject folder contains four core EEG files:
channels.tsv – EEG channel metadata
eeg.json – EEG recording metadata
eeg.set – Raw EEG data (EEGLAB format)
events.tsv – Event markers aligned with poem presentation
The /code/ directory includes:
Preprocessing.m – MATLAB preprocessing script
BioSemi64.loc – 64-channel coordinate file
The /derivatives/ directory contains:
Behavioural_Ratings/ – One .csv file per participant (e.g., P105.csv), including trial-by-trial ratings across five dimensions: Aesthetic Appeal, Vivid Imagery, Emotional Impact (labeled as 'being moved'), Originality, and Creativity.
Psychometric_Responses/ – A single .csv file with demographic and trait-level questionnaire responses per participant, including: PANAS (mood), Openness, Curiosity, VVIQ (visual imagery), AVIQ (auditory imagery), MAAS (mindfulness), and AReA (aesthetic responsiveness).
Also includes questionnaires.pdf with full questionnaire texts and scoring keys
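Because the EEG folders use BIDS IDs while the behavioural ratings files use the original participant codes, a lookup between the two is handy. The sketch below hard-codes the four-entry mapping given above and builds the expected ratings path; the exact nesting under derivatives/ is an assumption based on the directory description, and ratings_path is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Mapping stated in the dataset description (BIDS ID -> original code).
BIDS_TO_PARTICIPANT = {
    "sub-001": "P105",
    "sub-002": "P141",
    "sub-003": "P142",
    "sub-004": "P146",
}


def ratings_path(bids_id, root="."):
    """Build the assumed path to a participant's behavioural ratings file."""
    code = BIDS_TO_PARTICIPANT[bids_id]
    return PurePosixPath(root) / "derivatives" / "Behavioural_Ratings" / f"{code}.csv"


path = ratings_path("sub-002")
```

In practice the mapping could instead be read from participants.tsv, whose column names are not listed here.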
The /stimuli/ directory includes:
All 210 texts used in the experiment: 70 Haiku (nature-themed poetry), 70 Senryu (emotion-themed poetry), 70 Control (non-poetic matched prose).
Block-wise trial assignments for all seven blocks
Resting-state EEG was recorded at the beginning and end of each session. These segments are embedded within the raw EEG files and can be identified using the following trigger codes in events.tsv:
65285, 65286 → Resting state (before experiment); 65287, 65288 → Resting state (after experiment)
Interested users are encouraged to consult Poetry Assessment EEG Dataset 1 to gain a complete understanding of the full experiment and its validated main dataset. All preprocessing steps, event markers, and metadata structures were applied identically across both datasets (Poetry Assessment EEG Dataset 1 and Poetry Assessment EEG Dataset 2), ensuring consistency. This enables users to apply their own quality control pipelines and include these data if desired.
Of note, the anonymized participant IDs (e.g., PXXX) are used consistently across all data modalities, enabling reliable cross-referencing between EEG data, behavioural ratings, and psychometric responses. Data collection took place at the Department of Psychology at Goldsmiths, University of London, UK. The project was approved by the Local Ethics Committee at the Department of Psychology, Goldsmiths University of London. The experiment was conducted in accordance with the Declaration of Helsinki.
All EEG, behavioural, and psychometric data were anonymized. Participant identifiers were coded (P101–P151), and no names, dates of birth, or other direct identifiers are included.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is Some of me poetry. It features 7 columns including author, publication date, language, and book publisher.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Understanding how the brain engages with poetic language is key to advancing empirical research on aesthetic and creative cognition. This experiment involved 64-channel EEG recordings and behavioural ratings from 51 participants who read and evaluated 210 short English-language texts — 70 Haiku (nature-themed), 70 Senryu (emotion-themed), and 70 non-poetic Control texts. Each poem/text was rated on five subjective dimensions: Aesthetic Appeal, Vivid Imagery, Being Moved, Originality, and Creativity — using a 7-point scale.
The full study involved 51 participants, and the data were divided into two BIDS-compliant datasets to ensure technical validation and facilitate upload to OpenNeuro.
Poetry Assessment EEG Dataset 1 (this dataset) contains data from 47 participants whose continuous EEG recordings passed technical validation and were used in the primary analyses. In this dataset, the participants.tsv file maps anonymized BIDS IDs (sub-001 to sub-047) to the original participant codes used during data collection (P101–P151).
Poetry Assessment EEG Dataset 2 includes the remaining 4 participants (P105, P141, P142, P146), whose EEG recordings were acquired in segments due to session interruptions and later concatenated during preprocessing. These participants were excluded from the PSD analysis to avoid potential artifacts but are included here for completeness and transparency.
Dataset Structure and Navigation: Each subject folder contains four core EEG files:
channels.tsv – EEG channel metadata
eeg.json – EEG recording metadata
eeg.set – Raw EEG data (EEGLAB format)
events.tsv – Event markers aligned with poem presentation
The /code/ directory includes:
Preprocessing.m – MATLAB preprocessing script
BioSemi64.loc – 64-channel coordinate file
The /derivatives/ directory contains:
Behavioural_Ratings/ – One .csv file per participant (e.g., P101.csv), including trial-by-trial ratings across five dimensions: Aesthetic Appeal, Vivid Imagery, Emotional Impact (labeled as 'being moved'), Originality, and Creativity.
Psychometric_Responses/ – A single .csv file with demographic and trait-level questionnaire responses per participant, including: PANAS (mood), Openness, Curiosity, VVIQ (visual imagery), AVIQ (auditory imagery), MAAS (mindfulness), and AReA (aesthetic responsiveness).
Also includes questionnaires.pdf with full questionnaire texts and scoring keys
The /stimuli/ directory includes:
All 210 texts used in the experiment: 70 Haiku (nature-themed poetry), 70 Senryu (emotion-themed poetry), 70 Control (non-poetic matched prose).
Block-wise trial assignments for all seven blocks
Resting-state EEG was recorded at the beginning and end of each session. These segments are embedded within the raw EEG files and can be identified using the following trigger codes in events.tsv:
65285, 65286 → Resting state (before experiment); 65287, 65288 → Resting state (after experiment)
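One way to locate the resting-state segments is to filter events.tsv for the four trigger codes listed above. This is only a sketch: BIDS events files carry an onset column, but the name of the column holding the trigger code is not stated here, so "value" is an assumption exposed as a parameter, and the sample rows are invented.

```python
import csv
import io

# Trigger codes from the description; the "value" column name is an assumption.
RESTING_TRIGGERS = {
    "65285": "rest_before",
    "65286": "rest_before",
    "65287": "rest_after",
    "65288": "rest_after",
}


def resting_events(tsv_text, code_column="value"):
    """Return (onset_seconds, label) pairs for rows carrying a resting-state trigger."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    found = []
    for row in reader:
        label = RESTING_TRIGGERS.get(row[code_column].strip())
        if label is not None:
            found.append((float(row["onset"]), label))
    return found


# Made-up events.tsv content for illustration only.
SAMPLE_EVENTS = "onset\tduration\tvalue\n1.5\t0\t65285\n120.0\t0\t65286\n130.2\t0\t11\n"
rest = resting_events(SAMPLE_EVENTS)
```

The returned onsets could then be used to crop the resting-state portions out of the continuous recording in whichever EEG toolbox is preferred.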
Interested users may also consult Poetry Assessment EEG Dataset 2 to access recordings from the remaining 4 participants excluded from the main analyses. All preprocessing steps, event markers, and metadata structures were applied identically across both datasets (Poetry Assessment EEG Dataset 1 and Poetry Assessment EEG Dataset 2), ensuring consistency. This enables users to apply their own quality control pipelines and include these data if desired.
Of note, the anonymized participant IDs (e.g., PXXX) are used consistently across all data modalities, enabling reliable cross-referencing between EEG data, behavioural ratings, and psychometric responses. Data collection took place at the Department of Psychology at Goldsmiths, University of London, UK. The project was approved by the Local Ethics Committee at the Department of Psychology, Goldsmiths University of London. The experiment was conducted in accordance with the Declaration of Helsinki.
All EEG, behavioural, and psychometric data were anonymized. Participant identifiers were coded (P101–P151), and no names, dates of birth, or other direct identifiers are included.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Dnyanesh Walwadkar
Released under Apache 2.0
The Chinese Poetry dataset is a dataset of Chinese poems used for language modeling.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a table of word counts for a collection of 75,297 English-language poems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is The complete poetry of Catullus. It features 7 columns including author, publication date, language, and book publisher.
VidaEdco/prompt-poem-dataset-20240921_004141 dataset hosted on Hugging Face and contributed by the HF Datasets community