100+ datasets found
  1. autonlp-data-song-lyrics

    • huggingface.co
    Updated Mar 1, 2022
    Cite
    Julien Simon (2022). autonlp-data-song-lyrics [Dataset]. https://huggingface.co/datasets/juliensimon/autonlp-data-song-lyrics
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 1, 2022
    Authors
    Julien Simon
    Description

    AutoNLP Dataset for project: song-lyrics

      Table of contents

    • Dataset Description: Languages
    • Dataset Structure: Data Instances, Data Fields, Data Splits

      Dataset Description
    

    This dataset has been automatically processed by AutoNLP for project song-lyrics.

      Languages
    

    The BCP-47 code for the dataset's language is en.

      Dataset Structure

      Data Instances
    

    A sample from this dataset looks as follows: [ { "target": 2… See the full description on the dataset page: https://huggingface.co/datasets/juliensimon/autonlp-data-song-lyrics.

  2. Indian Hindi songs lyrics dataset

    • kaggle.com
    Updated Aug 24, 2020
    Cite
    MakarandVelankar (2020). Indian Hindi songs lyrics dataset [Dataset]. https://www.kaggle.com/datasets/makvel/indian-hindi-songs-lyrics-dataset
    Dataset updated
    Aug 24, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    MakarandVelankar
    Area covered
    India
    Description

    Context

    Contextual analysis of Hindi lyrics using natural language processing techniques for the Hindi language (Devanagari script). The algorithms developed will be useful for summarizing Hindi literary work and for context-based classification.

    Content

    People who want to work on a project related to the Devanagari script find it difficult to get hold of a suitable dataset. After an extensive search, we observed that little work has been done with the Devanagari script in the field of natural language processing.

    Acknowledgements

    Rachita Kotian, Chaitrali Mote and Anuja Patil were instrumental in preparing the data set.

    Inspiration

    People have always found songs and music significant in their lives. Lyrics provide high-level information about a song and can help in understanding the music. The aim is contextual analysis of Hindi lyrics and automation of this process.

  3. Data from: Song Interpretation Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jan 9, 2023
    Cite
    Yixiao Zhang; Junyan Jiang; Gus Xia; Simon Dixon (2023). Song Interpretation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7019124
    Available download formats: bin
    Dataset updated
    Jan 9, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yixiao Zhang; Junyan Jiang; Gus Xia; Simon Dixon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Song Interpretation Dataset combines data from two sources: (1) music and metadata from the Music4All Dataset and (2) lyrics and user interpretations from SongMeanings.com. We design a music metadata-based matching algorithm that aligns matching items in the two datasets with each other. In the end, we successfully match 25.47% of the tracks in the Music4All Dataset.

    The dataset contains audio excerpts from 27,834 songs (30 seconds each, recorded at 44.1 kHz), the corresponding music metadata, about 490,000 user interpretations of the lyric text, and the number of votes given for each of these user interpretations. The average length of the interpretations is 97 words. Music in the dataset covers various genres, of which the top 5 are: Rock (11,626), Pop (6,071), Metal (2,516), Electronic (2,213) and Folk (1,760).

    For more details, please refer to our paper "Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model".

  4. Top Artist Songs with Lyrics (2017–2024)

    • kaggle.com
    Updated Jun 24, 2025
    Cite
    uvais saifi (2025). Top Artist Songs with Lyrics (2017–2024) [Dataset]. https://www.kaggle.com/datasets/uvaissaifi/top-artist-songs-with-lyrics-20172024
    Dataset updated
    Jun 24, 2025
    Dataset provided by
    Kaggle
    Authors
    uvais saifi
    Description

    🎵 Top Artist Song Lyrics Dataset (2017–2024)

    This dataset contains a curated collection of song lyrics from top global and trending artists between 2017 and 2024, intended strictly for educational, research, and NLP development purposes.

    It includes the following:

    🎤 Artist Name

    🎶 Song Title

    📜 Full Song Lyrics

    Column    Description
    artist    Name of the musical artist or band
    songs     Title of the song
    lyrics    Full lyrics text (for language modeling, text analysis, etc.)

    🎯 Use Cases

    This dataset is designed for developers, data scientists, and researchers working on:

    • Natural Language Processing (NLP)
    • Song lyric generation or completion models
    • Sentiment or emotion analysis
    • Lyrics-based recommendation systems
    • Music trend and theme exploration
    • Language model fine-tuning with artistic text

    📅 Dataset Scope

    • Covers top and trending artists primarily from 2017 to 2024
    • Focuses on modern songs (older songs before 2017 have been filtered out)
    • Contains artists known for significant impact in global music culture

    🔐 License & Disclaimer

    ⚠️ Disclaimer: This dataset is for educational and non-commercial research use only. All lyrics are the intellectual property of their respective copyright holders. No copyright ownership is claimed. If any rights holder objects to the inclusion of their content, this dataset will be promptly removed upon request.

    🛠️ Data Collection Method

    The lyrics were collected from publicly available sources. This dataset is provided as-is, without any guarantee of completeness or accuracy. Please review local copyright laws and platform terms of service before using this dataset for redistribution or commercial purposes.

    🙋‍♂️ Author Note

    This dataset was created to support language-focused research and music-related AI projects. If you find this dataset useful, consider citing it or sharing feedback!

  5. turkish-lyric-to-genre

    • huggingface.co
    Updated Aug 10, 2023
    Cite
    Efe (2023). turkish-lyric-to-genre [Dataset]. https://huggingface.co/datasets/Veucci/turkish-lyric-to-genre
    Dataset updated
    Aug 10, 2023
    Authors
    Efe
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Song Lyrics Dataset

      Description
    

    This dataset contains a collection of song lyrics from various artists and genres in Turkish. It is intended to be used for research, analysis, and other non-commercial purposes.

      Dataset Details
    

    The dataset is organized in a tabular format with the following columns:

    Genre (int): Genre of the lyrics

    Lyrics (str): The lyrics of the song.

    Pop: 1085 rows

    Rock: 765 rows

    Hip-Hop: 969 rows

    Arabesk: 353 rows
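
    The row counts above show that the genres are not evenly represented. As a rough illustration, the sketch below computes each genre's share and a set of inverse-frequency class weights. The weighting formula is an assumption chosen for the example, not something the dataset page prescribes.

```python
# Row counts per genre, copied from the dataset description above.
genre_counts = {"Pop": 1085, "Rock": 765, "Hip-Hop": 969, "Arabesk": 353}

total_rows = sum(genre_counts.values())

# Fraction of the dataset that belongs to each genre.
genre_share = {genre: n / total_rows for genre, n in genre_counts.items()}

# Assumed "balanced" weighting: total / (number_of_classes * class_count).
# Rarer genres receive larger weights.
class_weights = {
    genre: total_rows / (len(genre_counts) * n)
    for genre, n in genre_counts.items()
}

for genre in genre_counts:
    print(f"{genre:8s} share={genre_share[genre]:.3f} weight={class_weights[genre]:.3f}")
```

    Weights like these could be passed to a classifier's loss function so that the smallest class (Arabesk) is not drowned out by the largest (Pop).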

      Usage… See the full description on the dataset page: https://huggingface.co/datasets/Veucci/turkish-lyric-to-genre.
    
  6. LFM2b Lyrics Descriptor Analyses

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 15, 2024
    Cite
    Elisabeth Lex (2024). LFM2b Lyrics Descriptor Analyses [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7740044
    Dataset updated
    Apr 15, 2024
    Dataset provided by
    Markus Schedl
    Stefan Brandl
    Maximilian Mayerl
    Marcin Skowron
    Emilia Parada-Cabaleiro
    Eva Zangerle
    Elisabeth Lex
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LFM2b Lyrics Descriptor Analyses

    This dataset provides lyrics descriptors for roughly 580,000 songs, including lexical, diversity-related, readability, rhyme, structural, and emotional descriptors. This dataset was the basis of an analysis of the evolution of song lyrics over the course of five decades and five genres (pop, rock, rap, country, and R&B).

    Dataset Generation

    As a basis for the dataset, we relied on the LFM-2b dataset (http://www.cp.jku.at/datasets/LFM-2b) of listening events by Last.fm. It contains more than two billion listening records, and more than fifty million songs by more than five million artists. We enrich the dataset with information about songs' release year, genre, lyrics, and popularity information. For quantifying the popularity of tracks and lyrics, we distinguish between the listening count, i.e., the number of listening events in the LFM-2b dataset, and lyrics view count, i.e., the number of views of lyrics on the Genius platform (https://genius.com). Release years, genre information, and lyrics are obtained from the Genius platform. Genres are expressed by one primary genre. We used https://polyglot.readthedocs.io/ to automatically infer the language of the lyrics and considered only English lyrics. Adopting this procedure, we ultimately obtain complete information for 582,759 songs.

    Data and Features

    We provide the full dataset, containing features for 582,759 songs (full_dataset.json.gz). For each song, the dataset contains track title and artist information, genre, popularity, and release date information, and a wide variety of lexical, diversity-related, readability, rhyme, structural, and emotional descriptors.

    A short overview of the feature semantics follows. For further details, please check the implementation of the feature extractor at https://github.com/MaximilianMayerl/CorrelatesOfSongLyrics/.

    • Track and artist
    • Genre
    • Popularity descriptors:
      • Lyrics view count
      • Last.fm playcount
    • Lexical descriptors:
      • Line counts: Total number of lines, blank lines, unique lines, ratio of blank and repeated lines
      • Token counts: Number of tokens, characters, repeated token ratio, unique tokens per line, and avg. tokens per line
      • Character counts: Number of the punctuation characters !?.,:;"-() (total amount of these characters and individual counts per character) and digits, ratio of punctuation and digits
      • Token length: Average length of tokens
      • n-gram ratios: Ratio of unique bigrams and trigrams
      • Legomenon ratios: Ratio of hapax legomena, dis legomena and tris legomena
      • Parts of speech: Frequency of adjectives, adverbs, nouns, pronouns, verbs
      • Past tense: Ratio of verbs in past tense to other verbs
      • Stop words: Number and ratio of stop words, stop words per line
      • Uncommon words: Number of uncommon words (i.e., words not contained in WordNet)
    • Diversity descriptors
      • Compression ratio: Ratio of the size of zlib compressed lyrics vs the original lyrics
      • Diversity measures: Measure of Textual Lexical Diversity (MTLD), Herdan's C, Summer's S, Dugast's U^2, and Maas' a^2
    • Readability Descriptors
      • Readability formulas: Flesch Reading Ease, Flesch Kincaid Grade, SMOG (Simple Measure of Gobbledygook), Automated Readability Index, Coleman Liau Index, Dale Chall Readability Score, Linsear Write Formula, Gunning Fog, Fernandez Huerta, Szigriszt Pazos and Gutierrez Polini
      • Difficult words: Number of difficult words (three or more syllables)
    • Rhyme Descriptors
      • Rhyme structures: Numbers of couplets, clerihews, alternating rhymes and nested rhymes
      • Rhyme words: Number of unique rhyming words, percentage of rhyming lines in the lyrics
      • Alliterations: Number of alliterations of length two, three, and four or more
    • Structural Descriptors
      • Element counts: Number of sections and verses
      • Distribution: Relation between the number of verses vs. sections and number of choruses vs sections
      • Title occurrences: Number of times the song's title appears
      • Pattern: Verse and chorus alternating, two verses and at least one chorus, two choruses and at least one verse
      • Start: Starts with chorus (binary attribute)
      • Ending: Ends with two chorus repetitions (binary attribute)
    • Emotional/Affective Descriptors
      • Sentiment scores: Positivity and negativity scores via AFINN, the sentiment lexicon by Bing Liu et al., the MPQA opinion corpus, the sentiment140 dataset, and the SentiWordNet lexicon
      • NRC: Emotion scores according to the NRC affect intensity lexicon
      • LIWC: Descriptors provided by LIWC
      • Happiness: Happiness score according to labMT
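
    To make a few of the descriptors listed above more concrete, here is a minimal sketch of the compression ratio, a hapax legomena ratio, and the unique bigram ratio. It is not the authors' feature extractor (that lives in the repository linked above). The whitespace tokenizer and the choice of denominators are assumptions made for illustration.

```python
import zlib
from collections import Counter


def tokenize(lyrics: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; the real extractor may differ.
    return lyrics.lower().split()


def compression_ratio(lyrics: str) -> float:
    # Size of the zlib-compressed lyrics divided by the size of the original, in bytes.
    raw = lyrics.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def hapax_ratio(tokens: list[str]) -> float:
    # Assumed definition: tokens occurring exactly once, divided by the total token count.
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)


def unique_bigram_ratio(tokens: list[str]) -> float:
    # Number of distinct adjacent token pairs divided by the number of all adjacent pairs.
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)


# Placeholder texts, not taken from the dataset.
repetitive = "la la la la " * 40
varied = "the quick brown fox jumps over the lazy dog"

for name, text in [("repetitive", repetitive), ("varied", varied)]:
    tokens = tokenize(text)
    print(name, compression_ratio(text), hapax_ratio(tokens), unique_bigram_ratio(tokens))
```

    Highly repetitive lyrics compress well, so their compression ratio is low, and their bigram and hapax ratios drop for the same reason.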

  7. MUSDB18 lyrics extension

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jun 25, 2021
    Cite
    Gaël Richard (2021). MUSDB18 lyrics extension [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3989266
    Dataset updated
    Jun 25, 2021
    Dataset provided by
    Roland Badeau
    Gaël Richard
    Kilian Schulze-Forster
    Clement S. J. Doire
    Description

    This is a set of annotated lyrics transcripts for songs belonging to the MUSDB18 dataset. The set comprises lyrics of all songs which have English lyrics, i.e. 96 out of 100 songs for the training set and 45 out of 50 songs for the test set. MUSDB18 is a dataset for music source separation and provides the following separated tracks for each song: vocals, bass, drums, other (rest of the accompaniment), mixture.

    The lyrics transcripts, together with the audio files of MUSDB18, are a valuable resource for research on tasks such as text-informed singing voice separation, automatic lyrics alignment, automatic lyrics transcription, and singing voice synthesis and analysis. The provided data should be used for research purposes only.

    Disclaimer

    The lyrics were transcribed manually by the authors who are not native English speakers. It is likely that the transcriptions are not 100% correct. The composers of the songs are the copyright holders of the original lyrics.

    The songs were divided into sections of lengths between 3 and 12 seconds. The priority when choosing the section boundaries was that they correspond to natural pauses and do not cut vocal sounds. The sections do not necessarily correspond to lyrically meaningful lines. Most of the sections do not overlap, some have an overlap of 1 second. In some difficult cases, e.g. shouting in metal songs or mumbled words, where the words are barely intelligible, we made an effort to make the transcriptions as accurate as possible phonetically and did not prioritize semantically meaningful phrases.

    Citation

    The dataset was built for the paper

    Schulze-Forster, K., Doire, C., Richard, G., & Badeau, R. "Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (2021).

    If you use the data for your research, please cite the corresponding paper:

    @article{schulze2021phoneme,
     title={Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation},
     author={Schulze-Forster, Kilian and Doire, Clement and Richard, Ga{\"e}l and Badeau, Roland},
     journal={IEEE/ACM Transactions on Audio, Speech and Language Processing},
     year={2021},
     publisher={IEEE}
    }

    Annotations

    For each section, the annotations comprise: the start and end time, the corresponding lyrics, and a label indicating one of the following four properties:

    (a) only one person is singing
    (b) several singers are pronouncing the same phonemes at the same time (possibly singing different notes)
    (c) several singers are pronouncing different phonemes simultaneously (possibly singing different notes)
    (d) no singing

    Segments that are labelled with property (b) or (c) do not necessarily have this property over the whole segment duration. As soon as several singers are present somewhere in a segment, label (b) was assigned; as soon as they sang different phonemes at the same time somewhere in the segment, label (c) was assigned. Properties (a) and (d) are valid for the entire segment. Furthermore, segments with property (c) can contain either some (lead) singer(s) singing some words in the presence of background singers singing long vowels such as 'ah' or 'oh', or multiple singers who sing different words at the same time. In the latter case, it was very difficult to recognise the sung words and to decide in which order to transcribe words or phrases sung simultaneously. These segments are marked with a '*' and it is recommended to reject them for most use cases.

    The annotations have the following format:

    Example: 00:18 00:23 a i know the reasons why --> starts at 18 sec., ends at 23 sec., vocals type (a), lyrics: i know the reasons why
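
    Based on the example line above, an annotation can be read as four whitespace-separated fields: start time, end time, vocals type, and lyrics. The sketch below parses that shape. The names Segment, to_seconds, and parse_annotation are hypothetical, and the code does not handle the '*' marker because its exact position in a line is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_sec: int
    end_sec: int
    vocals_type: str  # one of "a", "b", "c", "d" as described above
    lyrics: str


def to_seconds(timestamp: str) -> int:
    # Convert an "mm:ss" timestamp to seconds.
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_annotation(line: str) -> Segment:
    # Split into at most four fields so the lyrics keep their internal spaces.
    parts = line.strip().split(maxsplit=3)
    if len(parts) < 3:
        raise ValueError(f"expected at least start, end and label: {line!r}")
    start, end, label = parts[:3]
    # Assumption: segments without singing may have no lyrics field at all.
    lyrics = parts[3] if len(parts) == 4 else ""
    return Segment(to_seconds(start), to_seconds(end), label, lyrics)


segment = parse_annotation("00:18 00:23 a i know the reasons why")
print(segment)
```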

    The Python script musdb_lyrics_cut_audio.py is provided to automatically cut the MUSDB songs into the annotated segments. The script requires the musdb and soundfile packages. The user needs to update the paths and select the desired sources and vocals types in lines 19-26. The script saves a wav file for each selected source for each annotated segment, as well as the corresponding lyrics as a txt file. The MUSDB training partition is divided into a training and validation set. The tracks for the validation set can be changed below line 29.

    The file words_and_phonemes.txt contains a list of all words and their decomposition into phonemes. The phonemes are written in 2-letter ARPABET style and obtained with the LOGIOS Lexicon Tool.

    License

    The data is licensed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, read the provided LICENSE.txt file, visit https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

    The creators of MUSDB18 lyrics extension and their corresponding affiliation institutes are not liable for, and expressly exclude, all liability for loss or damage however and whenever caused to anyone by any use of MUSDB18 lyrics extension or any part of it.

    Acknowledgment

    The authors would like to thank Olumide Okubadejo and Sinead Namur for their help with transcribing and correcting part of the lyrics.

  8. artist-lyrics-dataset

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Connor Homayouni (2024). artist-lyrics-dataset [Dataset]. https://huggingface.co/datasets/SpartanCinder/artist-lyrics-dataset
    Dataset updated
    Apr 22, 2024
    Authors
    Connor Homayouni
    License

    https://choosealicense.com/licenses/cc/

    Description

    SpartanCinder/artist-lyrics-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Lyrics Dataset

    • kaggle.com
    Updated Apr 25, 2024
    Cite
    Sanika Dhayabar (2024). Lyrics Dataset [Dataset]. https://www.kaggle.com/datasets/sanikadhayabar/lyrics-dataset/discussion
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sanika Dhayabar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Sanika Dhayabar

    Released under MIT

    Contents

  10. Doja Cat Song Lyrics

    • kaggle.com
    Updated Nov 13, 2022
    Cite
    Ashish Singh Chauhan (2022). Doja Cat Song Lyrics [Dataset]. https://www.kaggle.com/datasets/ashish51ngh/doja-cat-lyrics
    Dataset updated
    Nov 13, 2022
    Dataset provided by
    Kaggle
    Authors
    Ashish Singh Chauhan
    Description

    Context

    The following albums were included:

    1. Purrr! (EP) (2014)
    2. Amala (2018)
    3. Hot Pink (2019)
    4. Streets (Remixes) (EP) (2021)
    5. Planet Her (2021)

    Content

    To understand our data better, let's define each column.

    • Album Name - Name of the album
    • Track Title - Name of the song
    • Track Number - Track number
    • Lyric - Lyric at each line
    • Year Released - Release year of the album
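
    Because the Lyric column holds one line of lyrics per row, a common first step is to join the rows back into full songs. The sketch below does this with the column names listed above. The sample rows are made-up placeholders, the helper name lyrics_by_track is hypothetical, and the code assumes rows appear in lyric order.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows using the documented column names; the values are invented.
sample_csv = """Album Name,Track Title,Track Number,Lyric,Year Released
Album A,Song One,1,placeholder line one,2019
Album A,Song One,1,placeholder line two,2019
Album A,Song Two,2,another placeholder line,2019
"""


def lyrics_by_track(csv_text: str) -> dict[tuple[str, str], str]:
    # Group lyric lines by (album, track) and join them in file order.
    lines_per_track = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Album Name"], row["Track Title"])
        lines_per_track[key].append(row["Lyric"])
    return {key: "\n".join(lines) for key, lines in lines_per_track.items()}


songs = lyrics_by_track(sample_csv)
for (album, title), text in songs.items():
    print(album, "/", title)
    print(text)
```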

    Acknowledgements

    The dataset was extracted from genius.com.

  11. Jingju Lyrics Datasets

    • zenodo.org
    • data.europa.eu
    Updated Jan 24, 2020
    Cite
    R. Caro Repetto (2020). Jingju Lyrics Datasets [Dataset]. http://doi.org/10.5281/zenodo.1285632
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    R. Caro Repetto
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    To study the expressive functions of jingju metrical patterns in relation to their lyrics, a series of datasets has been created from the Jingju Lyrics Collection, which was collected by scraping the online repository of jingju libretti Zhongguo jingju xikao 中国京剧戏考. These datasets were created for analysing the lyrics of the banshi yuanban, manban, kuaiban and yaoban in both the xipi and erhuang shengqiang (kuaiban is not used in erhuang) by applying NLP techniques, namely topic modelling and document classification.

    Using this dataset

    We are interested in knowing if you find our datasets useful! If you use our dataset please email us at mtg-info@upf.edu and tell us about your research.

    http://compmusic.upf.edu/jingju-lyrics-datasets

  12. Desi Hip Hop Lyrics- verses reverse prompt

    • kaggle.com
    Updated Dec 15, 2024
    Cite
    Pranav Inani (2024). Desi Hip Hop Lyrics- verses reverse prompt [Dataset]. https://www.kaggle.com/datasets/pranavinani/desi-hip-hop-lyrics-verses-reverse-prompt/discussion
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Kaggle
    Authors
    Pranav Inani
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Lyrics Datasets for Creative and Linguistic Applications

    Overview

    This repository contains two datasets of song lyrics, curated and organized for diverse applications in natural language processing, machine learning, and creative AI. These datasets include song verses, descriptive prompts, and romanized lyrics, providing resources for tasks such as text generation, sentiment analysis, transliteration, and more. All the songs are from the hip hop genre, specifically from the Indian subcontinent, also known as DHH (Desi Hip Hop).

    Dataset 1: lyrics_described.csv

    This dataset features song verses paired with descriptive prompts for creative generation. It is ideal for applications in AI-generated songwriting, lyric analysis, or reverse-prompt engineering.

    • Number of Entries: [Add total entries here after inspection]
    • Columns:
      • artist: Name of the artist.
      • title: Title of the song.
      • verse: Specific verses from the song.
      • reverse_prompt: Descriptions or creative prompts associated with the verses.

    Dataset 2: lyrics_romanised.csv

    This dataset contains full lyrics in their original and romanized scripts, enabling transliteration studies and multilingual NLP tasks.

    • Number of Entries: [Add total entries here after inspection]
    • Columns:
      • title: Title of the song.
      • lyrics: Full lyrics in the original script.
      • artist: Name of the artist.
      • romanized_lyrics: Lyrics transliterated into the Roman script.

    Usage

    Applications

    • Creative AI: Train models to generate new song lyrics inspired by existing ones.
    • Text-to-Text Generation: Fine-tune models for generating new lyrics based on existing verses or prompts.
    • Sentiment Analysis: Analyze emotional tone and sentiment across songs and artists.
    • Transliteration Models: Develop and benchmark transliteration systems using the romanized_lyrics column.
    • Cultural Analysis: Study lyrical themes and trends across different artists and genres.

    Loading the Data

    The datasets are provided in CSV format and can be loaded using Python libraries such as pandas:

    import pandas as pd
    
    # Load lyrics_described.csv
    described = pd.read_csv('lyrics_described.csv')
    
    # Load lyrics_romanised.csv
    romanised = pd.read_csv('lyrics_romanised.csv')
    

    Citation

    If you use these datasets in your research or applications, please credit the creator:

    @dataset{pranav_inani_2024,
     title={Lyrics Datasets for Creative and Linguistic Applications},
     author={Pranav Inani},
     year={2024},
     note={Available at Hugging Face}
    }
    

    License

    MIT License

    Feedback and Contributions

    If you have any feedback or suggestions, feel free to reach out or submit a pull request. Contributions are always welcome!

  13. English-Czech parallel song lyrics - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 19, 2025
    Cite
    (2025). English-Czech parallel song lyrics - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/5e67a9cf-2e57-5ea8-b8cc-f42038d9cbff
    Dataset updated
    Aug 19, 2025
    Description

    English–Czech parallel corpus of song lyrics, aligned section by section. The songs are sourced from musical films. The dataset is provided in JSON format with the following structure: { "language": { "song_id": { "section_id": [list of lines in the section] } } }
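
    The nested structure described above can be walked to produce aligned section pairs. The following is a minimal sketch. The language keys "en" and "cs", the song and section IDs, and the helper name aligned_sections are all hypothetical, and it assumes that both languages share the same song and section IDs.

```python
import json

# Tiny stand-in with the documented shape; keys, IDs and lines are placeholders.
sample = json.loads("""
{
  "en": {"song_1": {"sec_1": ["first line", "second line"], "sec_2": ["third line"]}},
  "cs": {"song_1": {"sec_1": ["první řádek", "druhý řádek"]}}
}
""")


def aligned_sections(corpus, src, tgt):
    """Yield (song_id, section_id, src_lines, tgt_lines) for sections present in both languages."""
    for song_id, sections in corpus[src].items():
        tgt_sections = corpus[tgt].get(song_id, {})
        for section_id, src_lines in sections.items():
            if section_id in tgt_sections:
                yield song_id, section_id, src_lines, tgt_sections[section_id]


pairs = list(aligned_sections(sample, "en", "cs"))
for song_id, section_id, en_lines, cs_lines in pairs:
    print(song_id, section_id, en_lines, cs_lines)
```

    Sections that exist in only one language (such as sec_2 in the placeholder data) are skipped rather than paired with an empty list.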

  14. Music Dataset: Lyrics and Metadata from 1950 to 2019

    • data.mendeley.com
    • narcis.nl
    Updated Oct 23, 2020
    Cite
    Luan Moura (2020). Music Dataset: Lyrics and Metadata from 1950 to 2019 [Dataset]. http://doi.org/10.17632/3t9vbwxgr5.3
    Dataset updated
    Oct 23, 2020
    Authors
    Luan Moura
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was studied in the paper "Temporal Analysis and Visualisation of Music", available at the following link:

           https://sol.sbc.org.br/index.php/eniac/article/view/12155
    

    This dataset provides a list of lyrics from 1950 to 2019 together with music metadata such as sadness, danceability, loudness, and acousticness. The lyrics themselves are also provided and can be used for natural language processing.

    The audio data was scraped using the Echo Nest® API integrated engine with the spotipy Python package. The spotipy API lets the user search for specific genres, artists, songs, release dates, etc. To obtain the lyrics, we used the Lyrics Genius® API as the base URL for requesting data based on the song title and artist name.

  15. Arab-Andalusian music lyrics dataset

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    Updated Oct 28, 2023
    Cite
    (2023). Arab-Andalusian music lyrics dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7554
    Available download formats: json
    Dataset updated
    Oct 28, 2023
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset contains lyrics for the songs in the Arab-Andalusian music collection curated within the CompMusic project that belong to the nawbas "Isbahan", "Maya", "Raml Maya", "Gharibat al-Husayn", "Hijaz Kabir", "Hijaz Msharqi", "Istihlal", "Rasd", and "Rasd Dayl".

    Lyrics are stored in two formats: as Tab Separated Values (TSV) files and as JSON files.

    Each file is identified by its MusicBrainz recording ID (MBID).

    The lyrics are stored both in their original Arabic script (folder 'original') and in a romanized/transliterated version (folder 'transliterated') that follows the American Library Association-Library of Congress (ALA-LC) standard.

    Corresponding audio files are available from the Arab-Andalusian music corpus, as well as the Internet Archive URL included in the metadata file ('metadata.csv').

    For more information about the exact format and contents of the dataset, please consult the README provided in the archive.

    For more information, please refer to http://compmusic.upf.edu/corpora.

  16. Reasons for getting lyrics to songs worldwide 2017

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Reasons for getting lyrics to songs worldwide 2017 [Dataset]. https://www.statista.com/statistics/799899/music-song-lyrics-reasons/
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 2017
    Area covered
    Worldwide
    Description

    The statistic shows the most common reasons why music consumers get lyrics to songs worldwide as of *************. During the survey, ** percent of respondents stated that they got the lyrics to songs in order to be able to sing along.

  17. Lyrics_Dataset

    • huggingface.co
    Updated Jan 20, 2025
    + more versions
    Nave Cohen (2025). Lyrics_Dataset [Dataset]. https://huggingface.co/datasets/nave1616/Lyrics_Dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 20, 2025
    Authors
    Nave Cohen
    Description

    nave1616/Lyrics_Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. Multi-Lingual Lyrics for Genre Classification

    • kaggle.com
    Updated Jan 8, 2021
    Matei Bejan (2021). Multi-Lingual Lyrics for Genre Classification [Dataset]. https://www.kaggle.com/datasets/mateibejan/multilingual-lyrics-for-genre-classification/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2021
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Matei Bejan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    I gathered this dataset as part of my work for the Information Retrieval and Text Mining course at the Faculty of Mathematics and Computer Science, University of Bucharest.

    Content

    The data comes from four sources. The initial data came from Sparktech's 2018 Textract Hackathon. This was enhanced with data from three other Kaggle datasets: 150K Lyrics Labeled with Spotify Valence, dataset lyrics musics and AZLyrics song lyrics.

    Apart from the original Sparktech data, the other datasets did not provide a Genre feature. To deal with the missing Genre labels, I built a labeling function using the spotipy library, which uses the Spotify API to retrieve the genre of an Artist. Please note that the Spotify API returns a list of genres for one artist, so I considered the most common genre to be that artist's dominant genre.
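
    A minimal sketch of the "most common genre" step, for illustration only. The description does not say how the count was made, so this assumes a flat list of genre strings for one artist (possibly with repeats); the function name dominant_genre is hypothetical and no Spotify call is made here.

```python
from collections import Counter
from typing import Optional


def dominant_genre(genres: list[str]) -> Optional[str]:
    """Return the most frequent genre label, or None for an empty list.

    Assumption: `genres` is a flat list of genre strings already retrieved
    for a single artist. Ties go to the label seen first.
    """
    if not genres:
        return None
    # most_common(1) returns a one-element list of (label, count) pairs.
    return Counter(genres).most_common(1)[0][0]
```

    An artist with no genres yields None, which leaves the choice of a fallback label to the caller.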

    Additionally, the AZLyrics data was badly encoded: the column delimiter, the comma, was also used as a verse delimiter in the Lyrics column. Fortunately, the dataset comes with two URL columns that conveniently separate the Artist, Song and Lyrics columns, so with a bit of regex magic I was able to extract the useful data using https:// as a delimiter.
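
    The idea of using the URL columns as separators can be sketched roughly as follows. The row layout (Artist, URL, Song, URL, Lyrics) and the function name split_row are assumptions made for illustration; the author's actual regex is not given.

```python
import re


def split_row(line: str) -> tuple[str, str, str]:
    """Split one raw row into (artist, song, lyrics).

    Assumed layout: artist,https://...,song,https://...,lyrics
    where the lyrics may themselves contain commas, and the URLs do not.
    """
    # Each URL field (",https://<no commas>,") acts as a delimiter.
    # maxsplit=2 keeps any later URL-like text inside the lyrics intact.
    parts = re.split(r",https://[^,]*,", line, maxsplit=2)
    if len(parts) != 3:
        raise ValueError("expected exactly two URL columns in the row")
    artist, song, lyrics = (part.strip() for part in parts)
    return artist, song, lyrics
```

    Rows that do not contain two URL fields raise an error instead of being silently mis-split, so they can be inspected by hand.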

    On a last note, I used Nakatani Shuyo's langdetect library to automatically label the lyrics with a language. In total, the lyrics come in 34 languages.

    Acknowledgements

    I am grateful to the Kaggle users edenbd, Italo Marcelo and Albert Suarez, as well as to the Sparktech team who gathered the original data and to my professor who provided it for the project.

    Inspiration

    In case you stumble across this dataset in the wild, I encourage you to try the Genre classification task on it and different feature engineering approaches. I am excited to see how inventive you can get!

  19. The WASABI Dataset and RDF Knowledge Graph

    • zenodo.org
    • data.niaid.nih.gov
    tar
    Updated Feb 28, 2022
    + more versions
    Michel Buffa; Elena Cabrio; Michael Fell; Fabien Gandon; Alain Giboin; Romain Hennequin; Fabrice Jauvat; Elmahdi Korfed; Franck Michel; Johan Pauwels; Guillaume Pellerin; Maroua Tikat; Marco Winckler (2022). The WASABI Dataset and RDF Knowledge Graph [Dataset]. http://doi.org/10.5281/zenodo.5603369
    Explore at:
    Available download formats: tar
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Michel Buffa; Elena Cabrio; Michael Fell; Fabien Gandon; Alain Giboin; Romain Hennequin; Fabrice Jauvat; Elmahdi Korfed; Franck Michel; Johan Pauwels; Guillaume Pellerin; Maroua Tikat; Marco Winckler
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The WASABI Dataset and RDF Knowledge Graph is a rich dataset describing more than 2 million commercial songs, 200K albums and 77K artists (mainly from pop/rock culture). It comprises data extracted from music databases on the Web, as well as data resulting from the processing of song lyrics and from audio analysis.

    This is version 2 of the dataset. It consists of two representation formats:

    • The JSON format provides all data extracted from the MongoDB database that backs up the web application
    • The RDF Knowledge Graph that represents the same data following the WASABI ontology.

    WASABI project homepage: http://wasabihome.i3s.unice.fr/

    Github: https://github.com/micbuffa/WasabiDataset

  20. Data from: Music-Lyrics

    • huggingface.co
    Updated Sep 27, 2025
    Sweaterdog (2025). Music-Lyrics [Dataset]. https://huggingface.co/datasets/Sweaterdog/Music-Lyrics
    Explore at:
    Dataset updated
    Sep 27, 2025
    Authors
    Sweaterdog
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Sweaterdog/Music-Lyrics dataset hosted on Hugging Face and contributed by the HF Datasets community
