Dataset Card for poetry
Dataset Summary
It contains poems on three subjects (Love, Nature, and Mythology & Folklore) from two periods: Renaissance and Modern.
Supported Tasks and Leaderboards
[Needs More Information]
Languages
[Needs More Information]
Dataset Structure
Data Instances
[Needs More Information]
Data Fields
Has 5 columns:
Content, Author, Poem name, Age, Type
Data Splits
Only training… See the full description on the dataset page: https://huggingface.co/datasets/merve/poetry.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The largest Arabic poetry dataset, containing more than 2.09 million verses. The dataset is comprehensive and includes additional information associated with each verse, such as the poet's name, the poem's title, era, meter, sub-meter, etc.
This file contains nearly all poems from the Poetry Foundation website.
Content
All poems have a title and author. Most poems are also labeled with tags, as available on the Poetry Foundation website.
Inspiration
This dataset can be used for a variety of tasks related to poetry writing.
https://creativecommons.org/publicdomain/zero/1.0/
A blackout poetry dataset constructed from publicly available short stories and large poems. The dataset consists of two variants: 8K and 16K examples of passages along with a poem generated from the passage and the indices of the words in the passage from which words in the poem have been selected. The dataset also contains perplexity scores for each of the poems indicating the language quality of the poems.
The dataset was constructed synthetically, and hence contains multiple poor poems and frequent grammatical errors. However, it is a great starting point for the task of applying machine learning to blackout poetry generation.
The dataset was first introduced in MAPLE – MAsking words to generate blackout Poetry using sequence-to-sequence LEarning.
The dataset has two variants:
- 8K (poems sampled from the 16K dataset with the lowest perplexity scores)
- 16K
Both variants contain data in the following format:
passage | poem | indices |
---|---|---|
Did the CIA tell the FBI that it knows the wor... | cia fbi the biggest weapon | [2, 5, 9, 24, 25] |
A vigilante lacking of heroic qualities that ... | lacking qualities that damn criminals | [2, 5, 6, 11, 12] |
The passage is the text from which the poem is generated. The poem is the generated poem. The indices are the indices of the words in the text that are chosen for the poem.
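The relationship between the three columns can be sketched in a few lines of Python. This is only an illustration, assuming that the indices are 0-based positions in a whitespace-split passage and that poem words are lowercased, which is what the visible part of the sample rows suggests. The helper name `poem_from_indices` is hypothetical, and the dataset's real tokenization (for example its punctuation handling) may differ.

```python
def poem_from_indices(passage: str, indices: list[int]) -> str:
    """Rebuild a blackout poem by picking words out of the passage.

    Assumptions (not confirmed by the dataset description):
    whitespace tokenization, 0-based indices, lowercased output.
    """
    words = passage.split()
    return " ".join(words[i].lower() for i in indices)


# Only the visible prefix of the second sample row is used here,
# so only the first three of its indices can be resolved.
passage = "A vigilante lacking of heroic qualities that"
print(poem_from_indices(passage, [2, 5, 6]))  # lacking qualities that
```

Running the same function over full rows is a quick way to check whether a downloaded copy of the data follows the same indexing convention.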
This dataset was generated synthetically using Liza Daly's pattern-matching-based blackout poetry generator.
VidaEdco/prompt-poem-dataset-20240921_004141 dataset hosted on Hugging Face and contributed by the HF Datasets community
Introduction
CCPM is a large Chinese classical poetry matching dataset that can be used for poetry matching, understanding and translation.
The main task of this dataset is as follows: given a description in modern Chinese, the model must select, from four candidate lines of Chinese classical poetry, the one that best matches the description semantically.
Size
It contains 27,218 instances in total, which are split into training (21,778), validation (2,720) and test (2,720) sets.
Format
Each instance is composed of translation (the description in modern Chinese, a string), choice (four candidate lines of Chinese classical poetry, a list) and answer (the index of the correct line, an integer between 0 and 3).
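As a rough illustration of this format, the sketch below checks that an instance has the three fields described above and scores a list of predictions. The field names `translation`, `choice` and `answer` come from the description; the function names and the placeholder instance are hypothetical, and the strings stand in for real Chinese text.

```python
def is_valid_instance(inst: dict) -> bool:
    """Check one instance against the described format."""
    return (
        isinstance(inst.get("translation"), str)
        and isinstance(inst.get("choice"), list)
        and len(inst["choice"]) == 4
        and all(isinstance(c, str) for c in inst["choice"])
        and isinstance(inst.get("answer"), int)
        and 0 <= inst["answer"] <= 3
    )


def accuracy(instances: list[dict], predictions: list[int]) -> float:
    """Fraction of instances whose predicted index equals `answer`."""
    correct = sum(
        1 for inst, pred in zip(instances, predictions) if inst["answer"] == pred
    )
    return correct / len(instances)


# Placeholder instance: the strings are stand-ins, not real data.
example = {
    "translation": "<description in modern Chinese>",
    "choice": ["<line 0>", "<line 1>", "<line 2>", "<line 3>"],
    "answer": 2,
}
print(is_valid_instance(example))   # True
print(accuracy([example], [2]))     # 1.0
```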
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image: https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, among others, has had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 4,938 rows and is filtered to books whose subject is English poetry. It features 9 columns, including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set includes three Excel data sheets: a famous mountain table, a poetry table, and a poet table. The famous mountain table includes fields such as famous mountain number, famous mountain type, social characteristics, famous mountain name, and province. The poetry table includes fields such as poetry number, poetry name, author, dynasty, and creation time. The poet table includes the poet's number, name, alias, time of birth, and time of death.
This is a table of word counts for a collection of 75,297 English-language poems.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 12 rows and is filtered to books whose subject is Italian poetry. It features 9 columns, including author, publication date, language, and book publisher.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Hindi Transliteration of Urdu Poetry Dataset
Welcome to the Hindi Transliteration of Urdu Poetry Dataset! This dataset features Hindi transliterations of traditional Urdu poetry. Each entry in the dataset includes two columns:
Title: The transliterated title of the poem in Hindi.
Poem: The transliterated text of the Urdu poem rendered in Hindi script.
This dataset is perfect for researchers and developers working on cross-script language processing, transliteration models, and… See the full description on the dataset page: https://huggingface.co/datasets/ReySajju742/Hindi-Poetry-Dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered to the book "Poetry : reading, reacting, writing". It features 7 columns, including author, publication date, language, and book publisher.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This data set comprises 98k limericks scraped from The Omnificent English Dictionary In Limerick Form (OEDILF). It is a subset of the full data set, filtered to pass a basic test of standard limerick form (i.e., ensuring five lines, no emojis, no symbols). Each limerick was written by a human contributor whose work has passed through rigorous moderation.
This dataset is released alongside two companion papers: "BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence" (Abdibayev, Riddell, Rockmore, RANLP 2021) and "Automating the Detection of Poetic Features: The Limerick as Model Organism" (Abdibayev, Riddell, Igarashi, Rockmore, SIGHUM 2021). The dataset is primarily released for use by NLP researchers interested in studying the formal structure of poetry and, more generally, in computational poetics.
Each limerick is accompanied by metadata: author information, its id within the website, and an "is_limerick" field, which denotes whether the limerick was recognized by our custom filter, built to check for formal limerick properties (this tagging was a goal of the SIGHUM paper and reflects the results reported there; see the paper for details). Thus, "is_limerick"=True is a true positive, and "is_limerick"=False is (almost surely) a false negative. We identify 70% of these as limericks and provide the tagging as a benchmark for the community to improve upon.
With these considerations in mind, we hope that the NLP community will use this dataset to study the poetic knowledge of language models trained on large corpora, as many of their properties still remain a mystery to the community at large. We are excited for the possibilities ahead!
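A basic form test of the kind mentioned above (five lines, no emojis, no symbols) could look roughly like the sketch below. This is not the authors' filter and not the "is_limerick" classifier from the SIGHUM paper. In particular, the definition of a "symbol" as anything outside ASCII letters, digits, whitespace and common punctuation is an assumption made only for illustration, and the function name is hypothetical.

```python
import re

# Assumption: a line is "clean" if it contains only ASCII letters, digits,
# whitespace and common punctuation. Emojis and other symbols fail this test.
# The real filter's character set is not documented here and may differ.
CLEAN_LINE = re.compile(r"^[A-Za-z0-9\s.,;:!?()'-]+$")


def passes_basic_form(text: str) -> bool:
    """Return True if the text has exactly five non-empty, clean lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) == 5 and all(CLEAN_LINE.match(ln) for ln in lines)


# Made-up placeholder text, not taken from the dataset.
five_lines = "line one,\nline two,\nline three,\nline four,\nline five."
print(passes_basic_form(five_lines))                 # True
print(passes_basic_form("only\nfour\nlines\nhere"))  # False
```

Note that an ASCII-only rule would also reject accented letters, so a real filter would probably need a more permissive character set.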
UPDATE: We released a new version of our dataset that contains all of the limericks that we planned to publish. The previous version (v2) was created using code that contained a bug, which in turn lowered the number of available limericks.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the cleaned fragmented dataset described in the paper "Classical Arabic Poetry: Classification based on Era". The dataset was originally scraped from Adab.com in April 2020.
This dataset was created by Likai Peng
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Frontier poetry is one of the most important themes in classical Chinese poetry, focusing on life and scenery in border regions. Imagery is a semantic composite of subjective and objective interactions, representing the objective objects of the poet's subjective emotions. The imagery system of frontier poetry exhibits significant regional convergence and cultural symbolism. This paper constructs a dataset of imagery sentiment in frontier poetry, which includes 40,000 frontier poems from the pre-Qin period to the present. It combines theories and methods from textual criticism and computational linguistics to annotate and proofread the imagery and sentiments expressed in frontier poetry. This dataset not only provides rich research data for the study of frontier poetry, but also offers a macro perspective for in-depth exploration of the evolution of imagery sentiment in poetry.
For this dataset, 42,836 frontier poems were crawled from the Internet, ranging from war poems in the Book of Songs of the pre-Qin period to contemporary new poems, with the aim of being complete, accurate, and reliable. The crawled data was cleaned and standardized: non-text symbols and redundant format tags were removed, a table of variant characters was established, and ancient texts were used to restore garbled characters through exegesis. Incorrectly identified poems were deleted, and finally sentence segmentation and error correction were performed, with each sentence separated by commas and periods. In the end, a total of 42,807 high-quality frontier poems were obtained. Based on the collected poem texts, we constructed a data annotation system containing the encoding, author, name, imagery, and sentiment information of the poems.
Each poem has a unique number. The first two digits represent the dynasty number, such as “01” for the pre-Qin period. The middle four digits represent the author number, with poets sorted by their birth and death years. The last two digits represent the serial number of the work, sorted by the first letter of the title. The imagery of the poems and lyrics is annotated using a pre-trained model followed by manual review, while the sentiment is annotated manually.
The final dataset consists of 11 CSV tables, one table per dynasty, with each file named after its dynasty. Each data point consists of six parts: code, author, name, text, imagery, and sentiment.
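The numbering scheme described above can be illustrated with a small parser. This is a sketch under the assumption that each code is stored as an 8-character string of digits (2 + 4 + 2); if a CSV reader has converted the codes to integers, leading zeros such as the "01" of the pre-Qin period would need to be restored first. The names `PoemCode` and `parse_poem_code` and the sample code value are hypothetical.

```python
from typing import NamedTuple


class PoemCode(NamedTuple):
    dynasty: str  # first two digits, e.g. "01" for the pre-Qin period
    author: str   # middle four digits
    work: str     # last two digits


def parse_poem_code(code: str) -> PoemCode:
    """Split an 8-digit poem code into dynasty, author and work parts."""
    if len(code) != 8 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected an 8-digit code, got {code!r}")
    return PoemCode(dynasty=code[:2], author=code[2:6], work=code[6:])


# Made-up code value for illustration only.
print(parse_poem_code("01000102"))
# PoemCode(dynasty='01', author='0001', work='02')
```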
The data was scraped from a website and consists of Gulzaar's Pukhraaj and Rahat's Dhoop Bahut Hai and Naaraz.
These files relate to three poetry series by Gulzaar and Rahat Indauri: Pukhraaj, Dhoop Bahut Hai, and Naaraz. All of these poems are written in a mixture of Hindi and Urdu words. The files are incremental and overlap considerably.
This is all possible because of these wonderful poets and the people who made their work available online.
The idea of scraping Hindi poetry came to mind after the launch of OpenAI's GPT-3. I decided to check its output in Hindi, mainly on poetry, and the output was really good, although some of it used core Urdu words that I was not able to understand. Overall, the experience was good. You can also use this dataset to explore natural language generation and the analysis of Hindi-Urdu literature.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
dataPOEM.csv
The dataPOEM.csv data set contains data on the level of each poem.
scoresAes = factor scores of moving, beauty, and melodious ratings.
participant = participant number
poemVersion = Version of poem presented: (A = original poem with rhyme and meter, B = poem variant with only rhyme, C = poem variant with only meter, D = poem variant without rhyme and meter)
poemIdentity = poem number
avgWFreq = average word frequency of poem
totalGazeSlopeLineLength
totalGazeWordMeanNAByWordLen
totalGazeWordMeanNADiff
order = order of presentation (1 = from A to D, 2 = from D to A; between participant factor)
firstFixDurMS_MINFIX_AVG = first fixation duration
totalGazeMS_MINFIX_AVG = total gaze durations
fixDurMS_MINFIX_NUM = number of fixations
sacLenMS_MINFIX_AVG = average saccade length
percRegMS_MINFIX_AVG = percentage of regressive eye movements
pupilDial_AVG = average pupil dilation
blink_NUM_TotalRT = number of blinks relative to total reading time
totalReadingTime = total reading time of the poem
areaTT = total score of the Aesthetic Responsiveness Assessment questionnaire
dataIntegrity = percentage of valid position measurements by eye tracker during reading of a poem
moving = rating of how moving the poem was
beauty = rating of how beautiful the poem was
melodious = rating of how melodious the poem was
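As a hedged illustration of how these poem-level variables might be used, the sketch below averages one rating column per `poemVersion` using only the standard library. The column names come from the codebook above; the sample rows, the comma delimiter, and the rating values are invented for the example and do not come from dataPOEM.csv.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; a real analysis would open dataPOEM.csv instead.
SAMPLE = """participant,poemIdentity,poemVersion,beauty
1,1,A,5
1,2,D,3
2,1,A,4
2,2,D,2
"""


def mean_by_version(rows: list[dict], column: str) -> dict[str, float]:
    """Average a numeric column separately for each poemVersion."""
    totals = defaultdict(lambda: [0.0, 0])  # version -> [sum, count]
    for row in rows:
        acc = totals[row["poemVersion"]]
        acc[0] += float(row[column])
        acc[1] += 1
    return {version: s / n for version, (s, n) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
means = mean_by_version(rows, "beauty")
print(means)  # {'A': 4.5, 'D': 2.5}
```

Because each participant rated several poems, a proper analysis would account for repeated measures rather than rely on plain averages like these.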
dataROI.csv
The dataROI.csv data set contains data on the level of each line within a poem.
order = order of presentation (1 = from A to D, 2 = from D to A; between participant factor)
participant = participant number
poemIdentity = poem number
lineNr = line number within poem
poemVersion = Version of poem presented: (A = original poem with rhyme and meter, B = poem variant with only rhyme, C = poem variant with only meter, D = poem variant without rhyme and meter)
verseEnd = whether a particular word/line was the last word/line of a stanza (0 = word/line within a stanza, 1 = last word/line of a stanza)
BeginCloseRhyme = whether a particular line’s final word marked the opening or closing of a rhyme pair (1 = opening of rhyme, 2 = closing of rhyme)
lastFix = whether a particular line or word was the last one of the poem (0 = word/line within a poem, 1 = last word/line of poem)
totalGazeByWordNA = total gaze duration of final word of a line relative to word length
gazeByLineLengthNA = total gaze duration of a line relative to line length
dataIntegrity = percentage of valid position measurements by eye tracker during reading of a poem
This statistic shows the share of adults reading poetry in the United States in 2012 and 2017, broken down by ethnicity. The data reveals that the share of surveyed Asian Americans in the U.S. reading poetry more than doubled in five years, increasing from 4.8 percent in 2012 to 12.6 percent in 2017. In fact, there was a significant increase in poetry readership among all surveyed ethnic groups.