100+ datasets found
  1. 5 Million Song Lyrics Dataset

    • kaggle.com
    zip
    Updated Apr 22, 2022
    Cite
    Nikhil Nayak (2022). 5 Million Song Lyrics Dataset [Dataset]. https://www.kaggle.com/datasets/nikhilnayak123/5-million-song-lyrics-dataset
    Explore at:
    Available download formats: zip (3316858407 bytes)
    Dataset updated
    Apr 22, 2022
    Authors
    Nikhil Nayak
    Description

    All (I think) of the song lyrics from genius.com. If you find a specific song/artist that isn't in the dataset but is in Genius lyrics, let me know and I can check if the scraper scraped that song.

  2. Music Dataset: Lyrics and Metadata from 1950 to 2019

    • data.mendeley.com
    Updated Aug 24, 2020
    + more versions
    Cite
    Luan Moura (2020). Music Dataset: Lyrics and Metadata from 1950 to 2019 [Dataset]. http://doi.org/10.17632/3t9vbwxgr5.2
    Explore at:
    Dataset updated
    Aug 24, 2020
    Authors
    Luan Moura
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides lyrics for songs released from 1950 to 2019, together with music metadata such as sadness, danceability, loudness, and acousticness. The lyrics themselves are included and can be used for natural language processing.

    The audio data was scraped using the Echo Nest® API engine integrated with the spotipy Python package. The spotipy API lets the user search by genre, artist, song, release date, etc. To obtain the lyrics, we used the Lyrics Genius® API as the base URL for requests based on the song title and artist name.

  3. Rap Lyrics Dataset

    • kaggle.com
    zip
    Updated Apr 4, 2024
    Cite
    CeeBloop (2024). Rap Lyrics Dataset [Dataset]. https://www.kaggle.com/datasets/ceebloop/rap-lyrics-for-nlp
    Explore at:
    Available download formats: zip (907275 bytes)
    Dataset updated
    Apr 4, 2024
    Authors
    CeeBloop
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was compiled by me for a personal project. It contains lyrics from 11 different artists including: Drake, J. Cole, Kendrick Lamar, Eminem, Nas, Skepta, Rapsody, Nicki Minaj, Dave, 2Pac, and Future.

    All data was compiled using Spotify's API and Genius' API.

    FEATURES

    • track_name: the name of each track
    • artist: the name of each artist
    • raw_lyrics: raw text of lyrics scraped from Genius website
    • artist_verses: text extracted from raw_lyrics — verses performed by each artist only

    NOTE: Some entries in raw_lyrics are formatted differently from others, so text consistency will vary.
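
    The relationship between raw_lyrics and artist_verses can be illustrated with a minimal sketch. It assumes Genius-style bracketed section headers such as "[Verse 1: Drake]", which the dataset card does not confirm, and the function names are hypothetical. It is not the author's extraction method, only one way such a filter might look.

```python
import re

# Assumption: sections are introduced by a bracketed header on its own line,
# e.g. "[Verse 1: Drake]". Entries with other formatting will not split cleanly.
HEADER = re.compile(r"^\[(?P<header>[^\]]+)\]\s*$", re.MULTILINE)


def sections(raw_lyrics):
    """Split raw lyrics into (header, body) pairs.

    Text that appears before the first header is returned with header "".
    """
    parts = []
    last_header, last_end = "", 0
    for match in HEADER.finditer(raw_lyrics):
        body = raw_lyrics[last_end:match.start()].strip()
        if body or last_header:
            parts.append((last_header, body))
        last_header, last_end = match.group("header"), match.end()
    tail = raw_lyrics[last_end:].strip()
    if tail or last_header:
        parts.append((last_header, tail))
    return parts


def verses_by(raw_lyrics, artist):
    """Keep the bodies of sections whose header names the artist (case-insensitive)."""
    return [body for header, body in sections(raw_lyrics)
            if artist.lower() in header.lower()]
```

    Because formatting varies between entries, any row where sections() returns a single headerless block is worth inspecting by hand rather than trusting the filter.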



    What can this dataset be used for?

    • Text analysis
    • Text pre-processing
    • Text EDA
    • Text classification
  4. autonlp-data-song-lyrics

    • huggingface.co
    Updated Mar 1, 2022
    + more versions
    Cite
    Julien Simon (2022). autonlp-data-song-lyrics [Dataset]. https://huggingface.co/datasets/juliensimon/autonlp-data-song-lyrics
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 1, 2022
    Authors
    Julien Simon
    Description

    AutoNLP Dataset for project: song-lyrics

    Table of contents

    • Dataset Description
      • Languages
    • Dataset Structure
      • Data Instances
      • Data Fields
      • Data Splits

    Dataset Description

    This dataset has been automatically processed by AutoNLP for project song-lyrics.

    Languages

    The BCP-47 code for the dataset's language is en.

    Dataset Structure

    Data Instances

    A sample from this dataset looks as follows: [ { "target": 2… See the full description on the dataset page: https://huggingface.co/datasets/juliensimon/autonlp-data-song-lyrics.

  5. lyrics-dataset

    • huggingface.co
    Updated Sep 12, 2024
    + more versions
    Cite
    Younes Matrab (2024). lyrics-dataset [Dataset]. https://huggingface.co/datasets/mrYou/lyrics-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 12, 2024
    Authors
    Younes Matrab
    Description

    mrYou/lyrics-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Music Dataset: Song Information and Lyrics

    • kaggle.com
    zip
    Updated May 22, 2023
    Cite
    Suraj (2023). Music Dataset: Song Information and Lyrics [Dataset]. https://www.kaggle.com/datasets/suraj520/music-dataset-song-information-and-lyrics
    Explore at:
    Available download formats: zip (1992670 bytes)
    Dataset updated
    May 22, 2023
    Authors
    Suraj
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Dataset's Purpose: This dataset's goal is to give a complete collection of music facts and lyrics for study and development. It aspires to be a useful resource for a variety of applications such as music analysis, natural language processing, sentiment analysis, recommendation systems, and others. This dataset, which combines song information and lyrics, can help academics, developers, and music fans examine and analyse the link between listeners' preferences and lyrical content.

    Dataset Description:

    The music dataset contains around 660 songs, each with its own set of characteristics. The following characteristics are included in the dataset:

    • Name: The title of the song.
    • Lyrics: The lyrics of the song.
    • Singer: The name of the singer or artist who performed the song.
    • Movie: The movie or album associated with the song (if applicable).
    • Genre: The genre or genres to which the song belongs.
    • Rating: The rating or popularity score of the song from Spotify.

    The dataset is intended to give a wide variety of songs from various genres, performers, and films. It includes popular songs from numerous ages and places, as well as a wide spectrum of musical styles. The lyrics were obtained from publicly accessible services such as Spotify and Soundcloud, and were converted from audio to text using speech recognition algorithms. While every effort has been made to ensure correctness, please keep in mind that, owing to the limits of the data sources and speech recognition algorithms, the transcriptions may contain inaccuracies or missing lyrics.

    Use Cases in Research and Development:

    This music dataset has several research and development applications. Among the possible applications are:

    1. Music Analysis: By analysing the links between song elements such as genre, vocalist, and rating, researchers can acquire insights into the features and patterns of various music genres.
    2. Natural Language Processing (NLP): NLP researchers may use the lyrics to create language models, sentiment analysis algorithms, topic modelling approaches, and other text-based music studies.
    3. Recommendation Systems: Using the information, developers may create recommendation systems that offer music based on user preferences, lyrics sentiment, or genre similarities.
    4. Music Generating Machine Learning Models: The dataset may be used to train machine learning models for generating new lyrics or making music compositions.
    5. Music Sentiment Analysis: To get insights into the emotional components of music and its influence on listeners, researchers might analyse the feelings conveyed in song lyrics.
    6. Movie Soundtracks Analysis: Researchers can explore the association between song attributes and their use in movie soundtracks by investigating the movie attribute.

    Overall, the goal of this music dataset is to provide a rich resource for academics, developers, and music fans to investigate the complicated relationships between song features, lyrics, and numerous research and development applications in the music domain.

  7. Data from: Song Interpretation Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 9, 2023
    Cite
    Yixiao Zhang; Junyan Jiang; Gus Xia; Simon Dixon (2023). Song Interpretation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7019124
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 9, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Yixiao Zhang; Junyan Jiang; Gus Xia; Simon Dixon
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Song Interpretation Dataset combines data from two sources: (1) music and metadata from the Music4All Dataset and (2) lyrics and user interpretations from SongMeanings.com. We design a music metadata-based matching algorithm that aligns matching items in the two datasets with each other. In the end, we successfully match 25.47% of the tracks in the Music4All Dataset.

    The dataset contains audio excerpts from 27,834 songs (30 seconds each, recorded at 44.1 kHz), the corresponding music metadata, about 490,000 user interpretations of the lyric text, and the number of votes given for each of these user interpretations. The average length of the interpretations is 97 words. Music in the dataset covers various genres, of which the top 5 are: Rock (11,626), Pop (6,071), Metal (2,516), Electronic (2,213) and Folk (1,760).

    For more details, please refer to our paper "Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model".

  8. MUSDB18 lyrics extension

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    text/x-python, txt +1
    Updated Jun 25, 2021
    Cite
    Kilian Schulze-Forster; Clement S. J. Doire; Gaël Richard; Roland Badeau (2021). MUSDB18 lyrics extension [Dataset]. http://doi.org/10.5281/zenodo.3989267
    Explore at:
    Available download formats: zip, txt, text/x-python
    Dataset updated
    Jun 25, 2021
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Kilian Schulze-Forster; Clement S. J. Doire; Gaël Richard; Roland Badeau
    Description

    This is a set of annotated lyrics transcripts for songs belonging to the MUSDB18 dataset. The set comprises lyrics of all songs which have English lyrics, i.e. 96 out of 100 songs for the training set and 45 out of 50 songs for the test set. MUSDB18 is a dataset for music source separation and provides the following separated tracks for each song: vocals, bass, drums, other (rest of the accompaniment), mixture.

    The lyrics transcripts, together with the audio files of MUSDB18, are a valuable resource for research on tasks such as text-informed singing voice separation, automatic lyrics alignment, automatic lyrics transcription, and singing voice synthesis and analysis. The provided data should be used for research purposes only.

    Disclaimer

    The lyrics were transcribed manually by the authors who are not native English speakers. It is likely that the transcriptions are not 100% correct. The composers of the songs are the copyright holders of the original lyrics.

    The songs were divided into sections of lengths between 3 and 12 seconds. The priority when choosing the section boundaries was that they correspond to natural pauses and do not cut vocal sounds. The sections do not necessarily correspond to lyrically meaningful lines. Most of the sections do not overlap, some have an overlap of 1 second. In some difficult cases, e.g. shouting in metal songs or mumbled words, where the words are barely intelligible, we made an effort to make the transcriptions as accurate as possible phonetically and did not prioritize semantically meaningful phrases.

    Citation

    The dataset was built for the paper

    Schulze-Forster, K., Doire, C., Richard, G., & Badeau, R. "Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (2021).

    If you use the data for your research, please cite the corresponding paper:

    @article{schulze2021phoneme,
     title={Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation},
     author={Schulze-Forster, Kilian and Doire, Clement and Richard, Ga{\"e}l and Badeau, Roland},
     journal={IEEE/ACM Transactions on Audio, Speech and Language Processing},
     year={2021},
     publisher={IEEE}
    }

    Annotations

    For each section, the annotations comprise: the start and end time, the corresponding lyrics, and a label indicating one of the following four properties:

    (a) only one person is singing
    (b) several singers are pronouncing the same phonemes at the same time (possibly singing different notes)
    (c) several singers are pronouncing different phonemes simultaneously (possibly singing different notes)
    (d) no singing

    Segments labelled with property (b) or (c) do not necessarily have this property over the whole segment duration. Label (b) was assigned as soon as several singers were present anywhere in a segment; label (c) was assigned as soon as they sang different phonemes at the same time anywhere in it. Properties (a) and (d) are valid for the entire segment. Furthermore, segments with property (c) can contain either one or more lead singers singing words while background singers sing long vowels such as 'ah' or 'oh', or multiple singers who sing different words at the same time. In the latter case, it was very difficult to recognise the sung words and to decide in which order to transcribe words or phrases sung simultaneously. These segments are marked with a '*' and it is recommended to reject them for most use cases.

    The annotations have the following format:

    Example:
    00:18 00:23 a i know the reasons why --> starts at 18 sec., ends at 23 sec., vocals type (a), lyrics: i know the reasons why
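
    A minimal sketch of reading one annotation line in the format of the example above. The Segment type and function names are hypothetical, and because the description does not say where the '*' marker sits in a line, the sketch simply treats any '*' in the line as the rejection flag. It is an illustration, not the parser used by musdb_lyrics_cut_audio.py.

```python
from dataclasses import dataclass

VOCAL_LABELS = {"a", "b", "c", "d"}


@dataclass
class Segment:
    start: int      # start time in seconds
    end: int        # end time in seconds
    label: str      # vocals type: a, b, c or d
    lyrics: str     # transcribed text, empty for segments without singing
    flagged: bool   # True if the line carried a '*' marker


def to_seconds(stamp):
    """Convert an mm:ss timestamp to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_line(line):
    """Parse '<start> <end> <label> <lyrics>' into a Segment."""
    # Assumption: the '*' marker may appear anywhere in the line.
    flagged = "*" in line
    fields = line.replace("*", " ").split(maxsplit=3)
    start, end, label = fields[:3]
    if label not in VOCAL_LABELS:
        raise ValueError(f"unexpected vocals label: {label!r}")
    lyrics = fields[3].strip() if len(fields) > 3 else ""
    return Segment(to_seconds(start), to_seconds(end), label, lyrics, flagged)
```

    Dropping every segment with flagged set to True follows the recommendation above to reject those segments for most use cases.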

    The Python script musdb_lyrics_cut_audio.py is provided to automatically cut the MUSDB songs into the annotated segments. The script requires the musdb and soundfile package. The user needs to update the paths and select the desired sources and vocals types in lines 19-26. The script saves wav-files for each selected source for each annotated segment as well as the corresponding lyrics as txt-file. The MUSDB training partition is divided into a training and validation set. The tracks for the validation set can be changed below line 29.

    The file words_and_phonemes.txt contains a list of all words and their decomposition into phonemes. The phonemes are written in 2-letter ARPABET style and obtained with the LOGIOS Lexicon Tool.

    License

    The data is licensed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, read the provided LICENSE.txt file, visit https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

    The creators of MUSDB18 lyrics extension and their corresponding affiliation institutes are not liable for, and expressly exclude, all liability for loss or damage however and whenever caused to anyone by any use of MUSDB18 lyrics extension or any part of it.

    Acknowledgment

    The authors would like to thank Olumide Okubadejo and Sinead Namur for their help with transcribing and correcting part of the lyrics.

  9. artist-lyrics-dataset

    • huggingface.co
    Updated Apr 22, 2024
    + more versions
    Cite
    Connor Homayouni (2024). artist-lyrics-dataset [Dataset]. https://huggingface.co/datasets/SpartanCinder/artist-lyrics-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 22, 2024
    Authors
    Connor Homayouni
    License

    https://choosealicense.com/licenses/cc/

    Description

    SpartanCinder/artist-lyrics-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Rap Lyrics

    • kaggle.com
    zip
    Updated Mar 20, 2023
    Cite
    Jamie (2023). Rap Lyrics [Dataset]. https://www.kaggle.com/datasets/jamiewelsh2/rap-lyrics
    Explore at:
    Available download formats: zip (27220148 bytes)
    Dataset updated
    Mar 20, 2023
    Authors
    Jamie
    Description

    Rap lyrics were obtained via web scraping for 100 of the most influential rappers of all time (see https://beats-rhymes-lists.com/lists/best-rappers-of-all-time/). The data was then reshaped into an easy-to-understand format using pandas. Each row corresponds to an individual lyric in a song, with the song name and artist name as additional columns.
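
    With one lyric per row, whole songs have to be reassembled before song-level analysis. The sketch below shows one way to do that, assuming rows arrive in their original order and that the columns are named "artist", "song" and "lyric"; those names are hypothetical and should be checked against the actual file.

```python
from collections import defaultdict


def rebuild_songs(rows):
    """Group per-lyric rows back into full song texts.

    rows: an iterable of dicts, for example from csv.DictReader.
    Returns {(artist, song): lyrics joined with newlines}.
    """
    grouped = defaultdict(list)
    for row in rows:
        # Hypothetical column names; adjust to the real header row.
        grouped[(row["artist"], row["song"])].append(row["lyric"])
    return {key: "\n".join(lines) for key, lines in grouped.items()}
```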

  11. English Songs Lyrics

    • kaggle.com
    zip
    Updated Apr 24, 2023
    Cite
    raza (2023). English Songs Lyrics [Dataset]. https://www.kaggle.com/datasets/razauhaq/english-songs-lyrics
    Explore at:
    Available download formats: zip (1881323221 bytes)
    Dataset updated
    Apr 24, 2023
    Authors
    raza
    Description

    This dataset is a preprocessed version of the 5 Million Song Lyrics Dataset containing lyrics of English songs only, extracted by CARLOSGDCJ.

  12. Olivia Rodrigo Lyrics Dataset

    • kaggle.com
    zip
    Updated Mar 25, 2024
    Cite
    Mia (2024). Olivia Rodrigo Lyrics Dataset [Dataset]. https://www.kaggle.com/datasets/mehaksingal/olivia-rodrigo-lyrics-datasetl
    Explore at:
    Available download formats: zip (35700 bytes)
    Dataset updated
    Mar 25, 2024
    Authors
    Mia
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset presents a comprehensive compilation of song lyrics by Olivia Rodrigo, sourced from her albums 'SOUR', 'GUTS', and other releases. The lyrics for each song are contained within individual text files, organized into three folders corresponding to each album. The dataset provides a valuable resource for fans, researchers, and analysts interested in studying Olivia Rodrigo's music and lyrical themes. Through the extraction of text from the website https://www.azlyrics.com/, this dataset offers a curated selection of song lyrics in a structured format, facilitating further analysis and exploration.

  13. song_lyrics

    • huggingface.co
    Updated Sep 2, 2023
    Cite
    Shea (2023). song_lyrics [Dataset]. https://huggingface.co/datasets/sheacon/song_lyrics
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 2, 2023
    Authors
    Shea
    Description

    Creation Steps

    • Downloaded the 5 Million Song Dataset from Kaggle
    • Selected quality artists, as defined by me
    • Removed songs featuring any profanity
    • Added a normalized version of the lyrics (used for GloVe embedding only): lower case, remove punctuation, remove stopwords, lemmatize
    • Computed four sets of embeddings using all-MiniLM-L12-v2, all-distilroberta-v1, text-embedding-ada-002, and average_word_embeddings_glove.840B.300d
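
    The normalization step in the creation steps can be sketched with the standard library alone. The stopword set here is a tiny illustrative stand-in rather than the list the author used, and lemmatization is left out because it needs an external library such as NLTK or spaCy. This is a sketch of the general technique, not the author's code.

```python
import string

# Illustrative stopword list only; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "i", "you", "my", "me"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(lyrics):
    """Lower-case, strip punctuation and drop stopwords (no lemmatization)."""
    tokens = lyrics.lower().translate(_STRIP_PUNCTUATION).split()
    return " ".join(token for token in tokens if token not in STOPWORDS)
```

    Note that stripping punctuation before stopword removal turns contractions such as "don't" into "dont", so the stopword list needs to match that form if contractions should be removed.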

  14. LFM2b Lyrics Descriptor Analyses

    • data.niaid.nih.gov
    Updated Apr 15, 2024
    Cite
    Emilia Parada-Cabaleiro; Maximilian Mayerl; Stefan Brandl; Marcin Skowron; Markus Schedl; Elisabeth Lex; Eva Zangerle (2024). LFM2b Lyrics Descriptor Analyses [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7740044
    Explore at:
    Dataset updated
    Apr 15, 2024
    Dataset provided by
    Department of Music Pedagogy, Nuremberg University of Music, Germany
    Graz University of Technology, Austria
    Linz Institute of Technology, Austria
    Austrian Research Institute for Artificial Intelligence, Austria
    Universität Innsbruck, Austria
    Johannes Kepler University Linz, Austria
    Authors
    Emilia Parada-Cabaleiro; Maximilian Mayerl; Stefan Brandl; Marcin Skowron; Markus Schedl; Elisabeth Lex; Eva Zangerle
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LFM2b Lyrics Descriptor Analyses

    This dataset provides lyrics descriptors for 580,000 songs, including lexical, diversity-related, readability, rhyme, structural, and emotional descriptors. This dataset was the basis of an analysis of the evolution of song lyrics over the course of five decades and five genres (pop, rock, rap, country, and R&B).

    Dataset Generation

    As a basis for the dataset, we relied on the LFM-2b dataset (http://www.cp.jku.at/datasets/LFM-2b) of listening events by Last.fm. It contains more than two billion listening records and more than fifty million songs by more than five million artists. We enrich the dataset with information about songs' release year, genre, lyrics, and popularity. To quantify the popularity of tracks and lyrics, we distinguish between the listening count, i.e., the number of listening events in the LFM-2b dataset, and the lyrics view count, i.e., the number of views of the lyrics on the Genius platform (https://genius.com). Release years, genre information, and lyrics are obtained from the Genius platform. Genres are expressed by one primary genre. We used https://polyglot.readthedocs.io/ to automatically infer the language of the lyrics and considered only English lyrics. Adopting this procedure, we ultimately obtain complete information for 582,759 songs.

    Data and Features

    We provide the full dataset, containing features for 582,759 songs (full_dataset.json.gz). For each song, the dataset contains track title and artist information, genre, popularity, and release date information, and a wide variety of lexical, diversity-related, readability, rhyme, structural, and emotional descriptors.

    For further information on the semantics of the features, we provide a short overview below. Please check the implementation of the feature extractor at https://github.com/MaximilianMayerl/CorrelatesOfSongLyrics/ for further details.

    • Track and artist
    • Genre
    • Popularity descriptors:
      • Lyrics view count
      • Last.fm playcount
    • Lexical descriptors:
      • Line counts: Total number of lines, blank lines, unique lines, ratio of blank and repeated lines
      • Token counts: Number of tokens, characters, repeated token ratio, unique tokens per line, and avg. tokens per line
      • Character counts: Number of the characters [!?.,:;"-()] (total amount of these characters and individual counts per character) and digits, ratio of punctuation and digits
      • Token length: Average length of tokens
      • n-gram ratios: Ratio of unique bigrams and trigrams
      • Legomenon ratios: Ratio of hapax legomena, dis legomena and tris legomena
      • Parts of speech: Frequency of adjectives, adverbs, nouns, pronouns, verbs
      • Past tense: Ratio of verbs in past tense to other verbs
      • Stop words: Number and ratio of stop words, stop words per line
      • Uncommon words: Number of uncommon words (i.e., words not contained in WordNet)
    • Diversity descriptors
      • Compression ratio: Ratio of the size of zlib compressed lyrics vs the original lyrics
      • Diversity measures: Measure of Textual Lexical Diversity (MTLD), Herdan's C, Summer's S, Dugast's U^2, and Maas' a^2
    • Readability Descriptors
      • Readability formulas: Flesch Reading Ease, Flesch Kincaid Grade, SMOG (Simple Measure of Gobbledygook), Automated Readability Index, Coleman Liau Index, Dale Chall Readability Score, Linsear Write Formula, Gunning Fog, Fernandez Huerta, Szigriszt Pazos and Gutierrez Polini
      • Difficult words: Number of difficult words (three or more syllables)
    • Rhyme Descriptors
      • Rhyme structures: Numbers of couplets, clerihews, alternating rhymes and nested rhymes
      • Rhyme words: Number of unique rhyming words, percentage of rhyming lines in the lyrics
      • Alliterations: Number of alliterations of length two, three, and four or more
    • Structural Descriptors
      • Element counts: Number of sections and verses
      • Distribution: Relation between the number of verses vs. sections and number of choruses vs sections
      • Title occurrences: Number of times the song's title appears
      • Pattern: Verse and chorus alternating, two verses and at least one chorus, two choruses and at least one verse
      • Start: Starts with chorus (binary attribute)
      • Ending: Ends with two chorus repetitions (binary attribute)
    • Emotional/Affective Descriptors
      • Sentiment scores: Positivity and negativity scores via AFINN, the sentiment lexicon by Bing Liu et al., the MPQA opinion corpus, the sentiment140 dataset, and the SentiWordNet lexicon
      • NRC: Emotion scores according to the NRC affect intensity lexicon
      • LIWC: Descriptors provided by LIWC
      • Happiness: Happiness score according to labMT
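
    A few of the lexical and diversity descriptors listed above are simple enough to sketch with the standard library. The exact definitions used by the dataset live in the linked feature extractor; the ones below are assumptions made for illustration (whitespace tokenization, hapax ratio taken over all tokens), and the function name and dictionary keys are hypothetical.

```python
import zlib
from collections import Counter


def basic_descriptors(lyrics):
    """Compute a handful of lexical and diversity descriptors for one lyrics text."""
    lines = lyrics.splitlines()
    non_blank = [line for line in lines if line.strip()]
    tokens = lyrics.lower().split()  # assumption: whitespace tokenization
    counts = Counter(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    raw = lyrics.encode("utf-8")
    hapaxes = sum(1 for count in counts.values() if count == 1)
    return {
        "total_lines": len(lines),
        "blank_lines": len(lines) - len(non_blank),
        "unique_lines": len(set(non_blank)),
        "token_count": len(tokens),
        # Share of tokens that occur exactly once (assumed denominator: all tokens).
        "hapax_ratio": hapaxes / len(tokens) if tokens else 0.0,
        "unique_bigram_ratio": len(set(bigrams)) / len(bigrams) if bigrams else 0.0,
        # Size of zlib-compressed lyrics relative to the original bytes.
        "compression_ratio": len(zlib.compress(raw)) / len(raw) if raw else 0.0,
    }
```

    Highly repetitive lyrics tend to compress well, so a lower compression ratio indicates more repetition. For very short texts the zlib header overhead can push the ratio above 1.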
  15. song-lyrics

    • huggingface.co
    Updated Jul 11, 2023
    Cite
    Hleb Stenin (2023). song-lyrics [Dataset]. https://huggingface.co/datasets/halaction/song-lyrics
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 11, 2023
    Authors
    Hleb Stenin
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    halaction/song-lyrics dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. 🎧 500K+ Spotify Songs with Lyrics,Emotions & More

    • kaggle.com
    zip
    Updated May 15, 2025
    Cite
    DevDope (2025). 🎧 500K+ Spotify Songs with Lyrics,Emotions & More [Dataset]. https://www.kaggle.com/datasets/devdope/900k-spotify
    Explore at:
    Available download formats: zip (1078959401 bytes)
    Dataset updated
    May 15, 2025
    Authors
    DevDope
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Important Notice: Due to a technical issue during the upload process, the dataset currently includes approximately 500,000 tracks, although it was originally composed of 900,000 songs. We sincerely apologize for any inconvenience this may cause.

    Our team is actively working to recover the missing files and update the dataset accordingly. The remaining tracks will be re-added as soon as they become available.

    Thank you for your understanding and patience as we work to resolve this issue.

    This dataset was part of the Top 200 projects in the NVIDIA Llama-Index Contest, supporting the Abracadabra project — a Retrieval-Augmented Generation (RAG) system for intelligent playlist creation using LLMs.

    Overview

    This is a large-scale music dataset with over 500,000 tracks. It includes lyrics, structured metadata, emotion labels, and more than 30 contextual and audio features per song. It was designed with AI applications in mind, particularly those involving music understanding, semantic search, and playlist generation.

    Dataset Highlights

    • 500K+ unique songs with full metadata

    • Over 30 features including:

      • Popularity, Energy, Danceability, Speechiness, Tempo, Loudness, Key
      • Acousticness, Instrumentalness, Time Signature
      • Contextual tags (e.g., Good for Party, Relaxation, Study, Exercise, Driving, etc.)
    • 3 similar songs per track (with artist, title, and similarity score)

    🧠 How Emotions Were Extracted

    Emotions in the emotion column were automatically generated using the Hugging Face model:

    🔗 mrm8488/t5-base-finetuned-emotion

    📊 Column Descriptions

    Column Name | Description | Example
    Artist(s) | Name of the artist or music group performing the song. | !!!
    song | Title of the song. | Even When the Water's Cold
    text | Full lyrics or main textual content of the song. | "Friends told her she was better off..."
    Length | Duration of the song (mm:ss). | 03:47
    emotion | Main emotion extracted from lyrics using a fine-tuned emotion detection model. | sadness
    Genre | Primary musical genre. | hip hop
    Album | Name of the album. | Thr!!!er
    Release Date | Release date of the track (DD/MM/YYYY). | 29/04/2013
    Key | Musical key of the song. | D min
    Tempo | Tempo in BPM (may be normalized). | 0.437869823
    Loudness (db) | Loudness in decibels. | 0.785065407
    Time signature | Beats per bar. | 4/4
    Explicit | Whether the track has explicit content. | No
    Popularity | Popularity score. | 40
    Energy | Energy level (0–100). | 83
    Danceability | Danceability score (0–100). | 71
    Positiveness | Valence or positivity score (0–100). | 87
    Speechiness | Presence of spoken words (0–100). | 4
    Liveness | Live performance probability (0–100). | 16
    Acousticness | Acoustic level score (0–100). | 11
    Instrumentalness | Instrumental likelihood (0–100). | 0
    Good for Party | Suitable for party playlists (binary). | 0
    Good for Work/Study | Suitable for work/study (binary). | 0
    Good for Relaxation/Meditation | Suitable for relaxation (binary). | 0
    Good for Exercise | Suitable for workout (binary). | 0
    Good for Running | Suitable for running (binary). | 0
    Good for Yoga/Stretching | Suitable for yoga/stretching (binary). | 0
    Good for Driving | Suitable for driving (binary). | 0
    Good for Social Gatherings | Suitable for social events (binary). | 0
    Good for Morning Routine | Suitable for mornings (binary). | 0
    Similar Artist 1 | First most similar artist. | Corey Smith
    Similar Song 1 | First similar song. | If I Could Do It Again
    Similarity Score 1 | Similarity score (0–1). | 0.986060785
    Similar Artist 2 | Second most similar artist. | Toby Keith
    Similar Song 2 | Second similar song. | Drinks After Work
    Similarity Score 2 | Similarity score (0–1). | 0.983719477
    Similar Artist 3 | Third most similar artist. | Space
    Similar Song 3 | Third similar song. | Neighbourhood
    Similarity Score 3 | Similarity score (0–1). | 0.983236351
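
The column descriptions above give several fields in string form (a duration as mm:ss, a date as DD/MM/YYYY, and 0/1 suitability flags). As a minimal sketch, assuming the file's headers match the column names listed here exactly, a row could be converted to typed values as follows. The helper names (`length_to_seconds`, `parse_release_date`, `clean_row`) are hypothetical and not part of the dataset.

```python
from datetime import date, datetime


def length_to_seconds(length: str) -> int:
    """Convert a duration written as mm:ss (e.g. "03:47") to seconds."""
    minutes, seconds = length.strip().split(":")
    return int(minutes) * 60 + int(seconds)


def parse_release_date(text: str) -> date:
    """Parse a release date written as DD/MM/YYYY (e.g. "29/04/2013")."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


def clean_row(row: dict) -> dict:
    """Return a copy of one record with typed Length, Release Date, and flags.

    Assumes the keys match the documented column names; adjust them if the
    actual file uses different headers.
    """
    cleaned = dict(row)
    cleaned["Length"] = length_to_seconds(row["Length"])
    cleaned["Release Date"] = parse_release_date(row["Release Date"])
    for name, value in row.items():
        # The "Good for ..." columns are described as binary (0 or 1).
        if name.startswith("Good for "):
            cleaned[name] = value.strip() == "1"
    return cleaned


# Example values taken from the "Example" column of the table.
example = {"Length": "03:47", "Release Date": "29/04/2013", "Good for Party": "0"}
result = clean_row(example)
```

Rows read with `csv.DictReader` arrive as dictionaries of strings, so `clean_row` could be applied to each one directly.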
  17. turkish-lyric-to-genre

    • huggingface.co
    Updated Aug 10, 2023
    Cite
    Efe (2023). turkish-lyric-to-genre [Dataset]. https://huggingface.co/datasets/Veucci/turkish-lyric-to-genre
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 10, 2023
    Authors
    Efe
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Song Lyrics Dataset

      Description
    

    This dataset contains a collection of song lyrics from various artists and genres in Turkish. It is intended to be used for research, analysis, and other non-commercial purposes.

      Dataset Details
    

    The dataset is organized in a tabular format with the following columns:

    Genre (int): Genre of the lyrics

    Lyrics (str): The lyrics of the song.

    Pop: 1085 rows

    Rock: 765 rows

    Hip-Hop: 969 rows

    Arabesk: 353 rows

      Usage… See the full description on the dataset page: https://huggingface.co/datasets/Veucci/turkish-lyric-to-genre.
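
The row counts listed above show an uneven genre distribution, with Arabesk having far fewer rows than Pop. As a rough sketch, the shares and a set of inverse-frequency class weights can be computed from those counts. The "balanced" weighting formula used here is a common heuristic, not something the dataset description prescribes, and the mapping from the integer `Genre` values to genre names is not given in this excerpt, so the names are used as keys.

```python
# Row counts copied from the dataset description.
counts = {"Pop": 1085, "Rock": 765, "Hip-Hop": 969, "Arabesk": 353}

total = sum(counts.values())
n_classes = len(counts)

# Fraction of the dataset that each genre accounts for.
shares = {genre: n / total for genre, n in counts.items()}

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
# Rarer genres receive larger weights.
weights = {genre: total / (n_classes * n) for genre, n in counts.items()}

for genre in counts:
    print(f"{genre}: share={shares[genre]:.3f}, weight={weights[genre]:.3f}")
```

Such weights could be passed to a classifier's loss function to offset the imbalance, if that turns out to matter for the task.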
    
  18. Arab-Andalusian music lyrics dataset

    • zenodo.org
    • live.european-language-grid.eu
    • +1more
    zip
    Updated Jan 24, 2020
    Cite
    Mohamed Sordo; Mehdi Chaachoo; Xavier Serra (2020). Arab-Andalusian music lyrics dataset [Dataset]. http://doi.org/10.5281/zenodo.3337623
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Mohamed Sordo; Mehdi Chaachoo; Xavier Serra
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset contains lyrics for the songs in the Arab-Andalusian music collection curated within the CompMusic project that belong to the nawbas "Isbahan", "Maya", "Raml Maya", "Gharibat al-Husayn", "Hijaz Kabir", "Hijaz Msharqi", "Istihlal", "Rasd", and "Rasd Dayl".

    Lyrics are stored in two formats: as Tab Separated Values (TSV) files and as JSON files.

    Each file is identified by its MusicBrainz recording ID (MBID).

    The lyrics are stored both in their original Arabic script (folder 'original') and in a romanized/transliterated version (folder 'transliterated') that follows the American Library Association - Library of Congress (ALA-LC) romanization standard.

    Corresponding audio files are available from the Arab-Andalusian music corpus, as well as from the Internet Archive URL included in the metadata file ('metadata.csv').

    For more information about the exact format and contents of the dataset, please consult the README provided in the archive.

    For more information, please refer to http://compmusic.upf.edu/corpora.
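
As an illustration of the layout described above (TSV and JSON files named by MBID, in 'original' and 'transliterated' folders), the following sketch builds a file path and reads a TSV file. The exact directory structure and file naming are assumptions here; the archive's README is the authoritative source. The function names `lyrics_path` and `read_tsv_rows` are hypothetical.

```python
import csv
from pathlib import Path
from typing import List


def lyrics_path(root: Path, mbid: str, script: str = "original", ext: str = "tsv") -> Path:
    """Build the assumed path <root>/<script>/<mbid>.<ext> for one recording."""
    if script not in ("original", "transliterated"):
        raise ValueError("script must be 'original' or 'transliterated'")
    if ext not in ("tsv", "json"):
        raise ValueError("ext must be 'tsv' or 'json'")
    return root / script / f"{mbid}.{ext}"


def read_tsv_rows(path: Path) -> List[List[str]]:
    """Read a tab-separated file into a list of rows (each a list of fields)."""
    with path.open(encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Placeholder MBID for illustration only; real IDs come from the dataset.
placeholder_mbid = "00000000-0000-0000-0000-000000000000"
example_path = lyrics_path(Path("dataset"), placeholder_mbid, script="transliterated")
```

Reading with an explicit UTF-8 encoding matters for the files in the 'original' folder, which contain Arabic script.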

  19. Music Dataset: Lyrics and Metadata from 1950 to 2019

    • data.mendeley.com
    • narcis.nl
    Updated Oct 23, 2020
    Cite
    Luan Moura (2020). Music Dataset: Lyrics and Metadata from 1950 to 2019 [Dataset]. http://doi.org/10.17632/3t9vbwxgr5.3
    Explore at:
    Dataset updated
    Oct 23, 2020
    Authors
    Luan Moura
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was studied in the paper "Temporal Analysis and Visualisation of Music", available at the following link:

    https://sol.sbc.org.br/index.php/eniac/article/view/12155
    

    This dataset provides a list of lyrics from 1950 to 2019 together with music metadata such as sadness, danceability, loudness, and acousticness. The lyrics themselves are also included and can be used for natural language processing.

    The audio data was scraped using the Echo Nest® API integrated engine with the spotipy Python package. The spotipy API lets the user search for specific genres, artists, songs, release dates, etc. To obtain the lyrics, we used the Lyrics Genius® API as the base URL for requesting data based on the song title and artist name.

  20. dataset lyrics musics

    • kaggle.com
    zip
    Updated Jul 14, 2020
    Cite
    Italo Marcelo (2020). dataset lyrics musics [Dataset]. https://www.kaggle.com/datasets/italomarcelo/dataset-lyrics-musics
    Explore at:
    zip(76772632 bytes)Available download formats
    Dataset updated
    Jul 14, 2020
    Authors
    Italo Marcelo
    Description

    Dataset

    This dataset was created by Italo Marcelo

    Contents

Cite
Nikhil Nayak (2022). 5 Million Song Lyrics Dataset [Dataset]. https://www.kaggle.com/datasets/nikhilnayak123/5-million-song-lyrics-dataset

5 Million Song Lyrics Dataset

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
zip(3316858407 bytes)Available download formats
Dataset updated
Apr 22, 2022
Authors
Nikhil Nayak
Description

All (I think) of the song lyrics from genius.com. If you find a specific song/artist that isn't in the dataset but is in Genius lyrics, let me know and I can check if the scraper scraped that song.
