17 datasets found
  1. million-song-subset

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    trojblue (2025). million-song-subset [Dataset]. https://huggingface.co/datasets/trojblue/million-song-subset
    Dataset updated
    Apr 18, 2025
    Authors
    trojblue
    Description

    Million Song Subset (Processed Version)

      Overview
    

    This dataset is a structured extraction of the Million Song Subset, derived from HDF5 files into a tabular format for easier accessibility and analysis.

      Source
    

    Original dataset: Million Song Dataset (LabROSA, Columbia University & The Echo Nest)
    Subset used: Million Song Subset (10,000 songs)
    URL: http://millionsongdataset.com

      Processing Steps
    

    Extraction: Used hdf5_getters.py to retrieve all… See the full description on the dataset page: https://huggingface.co/datasets/trojblue/million-song-subset.

  2. MSD Dataset

    • paperswithcode.com
    Updated Jun 26, 2022
    + more versions
    Cite
    (2022). MSD Dataset [Dataset]. https://paperswithcode.com/dataset/msd
    Dataset updated
    Jun 26, 2022
    Description

    The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

    The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code provided by the authors.

    Paper: The Million Song Dataset

  3. Million Song Dataset

    • ieee-dataport.org
    Updated Aug 3, 2016
    Cite
    Alex Outman (2016). Million Song Dataset [Dataset]. https://ieee-dataport.org/documents/million-song-dataset
    Dataset updated
    Aug 3, 2016
    Authors
    Alex Outman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field.

  4. Data from: The Million Song Dataset.

    • academictorrents.com
    bittorrent
    Updated Aug 23, 2024
    + more versions
    Cite
    Thierry Bertin-Mahieux and Daniel P. W. Ellis and Brian Whitman and Paul Lamere (2024). The Million Song Dataset. [Dataset]. https://academictorrents.com/details/fecaeaf2f97a0cd9f62fdaafaac70a6a96fa4ac0
    Available download formats: bittorrent (214163931939)
    Dataset updated
    Aug 23, 2024
    Dataset authored and provided by
    Thierry Bertin-Mahieux and Daniel P. W. Ellis and Brian Whitman and Paul Lamere
    License

    https://academictorrents.com/nolicensespecified

    Description

    We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.

  5. Spotify Million Song Dataset

    • opendatabay.com
    • huggingface.co
    Updated Jun 6, 2025
    Cite
    Datasimple (2025). Spotify Million Song Dataset [Dataset]. https://www.opendatabay.com/data/dataset/db3c0ef7-dfe6-4d65-a588-ee33c43a002e
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This is the Spotify Million Song Dataset. It contains song names, artist names, links to the songs, and lyrics. It can be used for recommending, classifying, or clustering songs.

    Original Data Source: Spotify Million Song Dataset

  6. Million Song Dataset Subset

    • academictorrents.com
    bittorrent
    Updated Oct 12, 2015
    Cite
    Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere (2015). Million Song Dataset Subset [Dataset]. https://academictorrents.com/details/e0b6b5ff012fcda7c4a14e4991d8848a6a2bf52b
    Available download formats: bittorrent (1994614463)
    Dataset updated
    Oct 12, 2015
    Dataset authored and provided by
    Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere
    License

    https://academictorrents.com/nolicensespecified

    Description

    To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random. It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.

    The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field.

    The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, howeve

  7. Million Song Dataset

    • kaggle.com
    Updated Jul 28, 2022
    Cite
    Gaurav Dutta (2022). Million Song Dataset [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/million-song-dataset/data
    Dataset updated
    Jul 28, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Dutta
    Description

    Context: Songs, like any other audio signal, feature distinctive fundamental frequencies, timbre components, and other properties. Each song is unique in these respects, which is why they can be patterned.

    Objective: Your task is to use machine learning models to predict the release year (between 1922 and 2011) of a song that is described by 90 attributes of average timbre and covariance.

    Data Description:
    TA01 to TA12 – Timbre averages
    TC01 to TC78 – Timbre covariances
    Year – Release year
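
    As a rough illustration of the column layout described above, the following Python sketch builds the 90 feature names and checks the counts. It assumes the column names are zero-padded exactly as written in the description; the note about the covariance matrix is an interpretation (78 matches the number of distinct entries of a symmetric 12 x 12 matrix), not something the dataset page states.

```python
# Build the 90 feature column names from the description:
# 12 timbre averages (TA01..TA12) followed by 78 timbre covariances (TC01..TC78).
timbre_avg_cols = [f"TA{i:02d}" for i in range(1, 13)]
timbre_cov_cols = [f"TC{i:02d}" for i in range(1, 79)]
feature_cols = timbre_avg_cols + timbre_cov_cols
target_col = "Year"

# Assumption: 78 corresponds to the distinct entries of a symmetric
# 12 x 12 covariance matrix, diagonal included (12 * 13 / 2).
n_cov = 12 * 13 // 2

print(len(feature_cols), n_cov)
```

    The resulting list could be used to select feature columns from a loaded table, with `target_col` as the regression target.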

  8. Language in academics, fiction and song

    • test.researchdata.tuwien.at
    • test.researchdata.tuwien.ac.at
    • +2more
    bin, text/markdown
    Updated Jun 25, 2024
    Cite
    Theodor Seiser (2024). Language in academics, fiction and song [Dataset]. http://doi.org/10.70124/3c6eq-e6877
    Available download formats: bin, text/markdown
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Theodor Seiser
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 14, 2023
    Description

    Language in academics, fiction and song

    The research project showed how language differs between song lyrics and written text in academic and fictional contexts, using key verbs as an example. It compares the overall diversity of verbs used as well as the diversity within genres and individual texts. It also highlights the most frequently used verbs per genre.

    The research project used the following existing resources.

    Sönning, Lukas, 2023, "Key verbs in academic writing: Dataset for "Evaluation of keyness metrics: Performance and reliability"", https://doi.org/10.18710/EUXSMW, DataverseNO, V1

    Bertin-Mahieux, Thierry et al. (2011). "The Million Song Dataset". In: Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011)

    musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://millionsongdataset.com/musixmatch

    Last.fm dataset, the official song tags and song similarity collection for the Million Song Dataset, available at: http://millionsongdataset.com/lastfm

    The data was produced by comparing and querying the existing data sources. This is documented in queries.sql.

    A library or other software is needed to access the database. DB Browser for SQLite was used in this research project; it is free, open source and easy to use, and is therefore recommended for potential users.
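
    For readers who prefer a scripting route over a GUI, a SQLite database can also be opened with Python's standard `sqlite3` module. The sketch below uses an in-memory database as a stand-in; the table name, column names, and rows are hypothetical and are not taken from the project's queries.sql.

```python
import sqlite3

# In-memory stand-in for the project's database; the table and column
# names below are hypothetical and only illustrate the access pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE verbs (genre TEXT, verb TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO verbs VALUES (?, ?, ?)",
    [("song", "love", 5), ("song", "know", 3), ("academic", "show", 4)],
)

# List the tables, as one would when first exploring an unfamiliar file.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]

# Most frequent verb per genre. This relies on SQLite's behaviour of taking
# bare columns from the row that holds MAX(); ties are not handled here.
top_verbs = conn.execute(
    "SELECT genre, verb, MAX(count) FROM verbs GROUP BY genre ORDER BY genre"
).fetchall()
conn.close()
```

    With a real file, `sqlite3.connect` would take the path to the downloaded database instead of `":memory:"`.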

  9. Genre Annotations for the MSD: CD2C (truth by consensus)

    • tagtraum.com
    zip
    Updated Oct 25, 2015
    + more versions
    Cite
    Hendrik Schreiber (2015). Genre Annotations for the MSD: CD2C (truth by consensus) [Dataset]. https://www.tagtraum.com/msd_genre_datasets.html
    Available download formats: zip
    Dataset updated
    Oct 25, 2015
    Dataset provided by
    tagtraum industries incorporated
    Authors
    Hendrik Schreiber
    Description

    Genre ground-truth for the Million Song Dataset (MSD) generated based on the Last.fm dataset and beaTunes Genre Dataset (BGD) by consensus. When using this dataset, please cite the following paper: Hendrik Schreiber. Improving Genre Annotations for the Million Song Dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015. Additional dataset splits are available on the dataset's website.

  10. ESSENTIA analysis of audio snippets from the Million Song Dataset Taste...

    • data.niaid.nih.gov
    Updated May 27, 2020
    Cite
    Fricke, Kai (2020). ESSENTIA analysis of audio snippets from the Million Song Dataset Taste Profile subset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3860556
    Dataset updated
    May 27, 2020
    Dataset authored and provided by
    Fricke, Kai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload includes the ESSENTIA analysis output of (a subset of) song snippets from the Million Song Dataset, namely those included in the Taste Profile subset. The audio snippets were collected from 7digital.com and were subsequently analyzed with ESSENTIA 2.1-beta3. Pre-trained SVM models provided by the ESSENTIA authors on their website were applied.

    The file msd_song_jsons.rar contains the ESSENTIA analysis output after applying the SVM models for highlevel feature extraction. Please note that these are 204317 files.

    The file msd_played_songs_essentia.csv.gz contains all one-dimensional real-valued fields of the jsons merged into one csv file with 204317 rows.
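
    The merge of per-song JSON files into a single CSV can be pictured with a small Python sketch. This is not the author's procedure: it assumes that "one-dimensional real-valued fields" means scalar numeric values, and the sample JSON structure and field names are hypothetical.

```python
import csv
import io
import json

def flatten_scalars(node, prefix=""):
    """Return {dotted.path: value} for every numeric scalar in a nested dict."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_scalars(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[path] = value
        # Strings and lists are skipped: they are not scalar real values.
    return flat

# Hypothetical analysis output; real files have many more fields.
sample = json.loads(
    '{"highlevel": {"danceability": {"value": "danceable", "probability": 0.9}},'
    ' "metadata": {"audio_properties": {"length": 30.0}, "tags": ["a", "b"]}}'
)
row = flatten_scalars(sample)

# Write one CSV row per song; here a single row goes to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=sorted(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

    Applied to every JSON file in turn, this pattern would yield one row per song, which is consistent with the 204317 rows mentioned above.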

    The full procedure and subsequent analysis is described in

    Fricke, K. R., Greenberg, D. M., Rentfrow, P. J., & Herzberg, P. Y. (2019). Measuring musical preferences from listening behavior: Data from one million people and 200,000 songs. Psychology of Music, 0305735619868280.

  11. spotify-million-song-dataset-descriptions

    • huggingface.co
    Updated Apr 23, 2025
    Cite
    Petko Petkov (2025). spotify-million-song-dataset-descriptions [Dataset]. http://doi.org/10.57967/hf/4479
    Dataset updated
    Apr 23, 2025
    Authors
    Petko Petkov
    Description

    petkopetkov/spotify-million-song-dataset-descriptions dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. MSD-I: Million Song Dataset with Images for Multimodal Genre Classification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Sergio Oramas (2020). MSD-I: Million Song Dataset with Images for Multimodal Genre Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1240484
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Sergio Oramas
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/) is a collection of metadata and precomputed audio features for 1 million songs. Along with this dataset, a dataset with annotations of 15 top-level genres with a single label per song was released. In our work, we combine the CD2c version of this genre dataset (http://www.tagtraum.com/msd_genre_datasets.html) with a collection of album cover images.

    The final dataset contains 30,713 tracks from the MSD and their related album cover images, each annotated with a unique genre label among 15 classes. Based on an initial analysis of the images, we identified that this set of tracks is associated with 16,753 albums, yielding an average of 1.8 songs per album.

    We randomly divide the dataset into three parts: 70% for training, 15% for validation, and 15% for test, with no artist and album overlap across these sets. This is crucial to avoid possible overfitting, as the classifier may learn to predict the artist instead of the genre.
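
    A split with no artist overlap can be sketched by assigning whole artists, rather than individual tracks, to each part. The Python example below is a minimal illustration under stated assumptions, not the authors' procedure: the track and artist identifiers are hypothetical, the fractions are applied to artists (so track proportions are only approximate), and album overlap is assumed to follow from artist grouping.

```python
import random

def split_by_artist(tracks, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole artists to train/validation/test so no artist spans two sets."""
    artists = sorted({artist for _, artist in tracks})
    random.Random(seed).shuffle(artists)
    n_train = round(fractions[0] * len(artists))
    n_val = round(fractions[1] * len(artists))
    groups = {
        "train": set(artists[:n_train]),
        "validation": set(artists[n_train:n_train + n_val]),
        "test": set(artists[n_train + n_val:]),
    }
    # Every track follows its artist into exactly one split.
    return {
        name: [track for track, artist in tracks if artist in members]
        for name, members in groups.items()
    }

# Hypothetical (track_id, artist_id) pairs: 20 artists with 3 tracks each.
tracks = [(f"track{a}_{t}", f"artist{a}") for a in range(20) for t in range(3)]
splits = split_by_artist(tracks)
```

    Grouping by artist is what prevents a classifier from scoring well simply by recognising an artist it has already seen during training.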

    Content:

    MSD-I dataset (mapping, metadata, annotations and links to images)
    Data splits and feature vectors for TISMIR single-label classification experiments

    These data can be used together with the Tartarus deep learning python module https://github.com/sergiooramas/tartarus.

    Scientific References:

    Please cite the following paper if using MSD-I dataset or Tartarus software.

    Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018). Multimodal Deep Learning for Music Genre Classification, Transactions of the International Society for Music Information Retrieval, V(1).

  13. MusixMatch dataset

    • figshare.com
    bin
    Updated May 12, 2020
    Cite
    Matteo Ceccarello (2020). MusixMatch dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12287924.v1
    Available download formats: bin
    Dataset updated
    May 12, 2020
    Dataset provided by
    figshare
    Authors
    Matteo Ceccarello
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset obtained from http://millionsongdataset.com/musixmatch and preprocessed according to the script https://github.com/Cecca/diversity-maximization/blob/master/datasets.sh. The file genres.rank88.txt contains additional configuration for the experiments with this dataset.

  14. Lakh MIDI Dataset

    • opendatalab.com
    zip
    Updated Sep 21, 2022
    + more versions
    Cite
    Columbia University (2022). Lakh MIDI Dataset [Dataset]. https://opendatalab.com/OpenDataLab/Lakh_MIDI_Dataset
    Available download formats: zip (21903129607 bytes)
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Columbia University
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files). Around 10% of all MIDI files include timestamped lyrics events, with lyrics often transcribed at the word, syllable or character level. LMD-full denotes the whole dataset. LMD-matched is the subset of LMD-full that consists of MIDI files matched with the Million Song Dataset entries. LMD-aligned contains all the files of LMD-matched, aligned to preview MP3s from the Million Song Dataset. A lakh is a unit of measure used in the Indian number system which signifies 100,000.

  15. Gold-Caps_LMD-Matched_General

    • zenodo.org
    • data.niaid.nih.gov
    json
    Updated Nov 22, 2023
    Cite
    Nicolas Jonason; Luca Casini; Bob Sturm (2023). Gold-Caps_LMD-Matched_General [Dataset]. http://doi.org/10.5281/zenodo.10178563
    Available download formats: json
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicolas Jonason; Luca Casini; Bob Sturm
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 17, 2023
    Description

    This dataset contains captions for the Lakh MIDI Dataset-matched music dataset (~30,000 tracks with accompanying MIDI files).

    These captions were generated by the gpt-4-1106-preview chat endpoint prompted to describe each track based on the track title and artist. The captions have not been filtered or post-processed in any way.

    Prompt used:
    "Give a general description of the track

  16. ADL Piano MIDI

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Apr 10, 2023
    + more versions
    Cite
    University of Alberta (2023). ADL Piano MIDI [Dataset]. https://opendatalab.com/OpenDataLab/ADL_Piano_MIDI
    Available download formats: zip (92214334 bytes)
    Dataset updated
    Apr 10, 2023
    Dataset provided by
    University of California
    University of Alberta
    Description

    The ADL Piano MIDI is a dataset of 11,086 piano pieces from different genres. This dataset is based on the Lakh MIDI dataset, which is a collection of 45,129 unique MIDI files that have been matched to entries in the Million Song Dataset. Most pieces in the Lakh MIDI dataset have multiple instruments, so for each file the authors of the ADL Piano MIDI dataset extracted only the tracks with instruments from the "Piano Family" (MIDI program numbers 1-8). This process generated a total of 9,021 unique piano MIDI files. These 9,021 files were then combined with approximately 2,065 other files scraped from publicly-available sources on the internet. All the files in the final collection were de-duped according to their MD5 checksum.

    Source: ADL Piano MIDI
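
    De-duplication by MD5 checksum, as mentioned in the description, can be sketched in a few lines of Python. This is an illustrative sketch rather than the dataset authors' code; the file names and byte contents are hypothetical stand-ins for real MIDI files.

```python
import hashlib

def dedupe_by_md5(files):
    """Keep the first file seen for each distinct MD5 digest of its bytes."""
    seen = set()
    unique = []
    for name, data in files:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique

# Hypothetical in-memory stand-ins for MIDI files: (file name, raw bytes).
files = [
    ("a.mid", b"MThd-piece-1"),
    ("b.mid", b"MThd-piece-2"),
    ("copy_of_a.mid", b"MThd-piece-1"),
]
unique_files = dedupe_by_md5(files)
```

    Note that a checksum only catches byte-identical copies; two different encodings of the same piece would both be kept.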
    
  17. Supplementary material of the paper "The power of deep without going deep? A...

    • figshare.com
    bin
    Updated Jun 1, 2023
    Cite
    Jaehun Kim; Cynthia Liem (2023). Supplementary material of the paper "The power of deep without going deep? A study of HDPGMM music representation learning" [Dataset]. http://doi.org/10.4121/21981442.v1
    Available download formats: bin
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Jaehun Kim; Cynthia Liem
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Supplementary material of the paper "The power of deep without going deep? A study of HDPGMM music representation learning"

    Authors: Jaehun Kim (jaehun.j.kim@gmail.com), Cynthia C.S. Liem

    General Information

    This entry contains the following list of data that is the by-product of the experiment conducted for a study titled "The power of deep without going deep? A study of HDPGMM music representation learning". In addition, the program for the main experimental routine is provided in the separate repository.


Cite
trojblue (2025). million-song-subset [Dataset]. https://huggingface.co/datasets/trojblue/million-song-subset

million-song-subset

trojblue/million-song-subset

Explore at:
10 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 18, 2025
Authors
trojblue