17 datasets found

million-song-subset
huggingface.co
Updated Apr 18, 2025
Cite
trojblue (2025). million-song-subset [Dataset]. https://huggingface.co/datasets/trojblue/million-song-subset
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 18, 2025
Authors
trojblue
Description
Million Song Subset (Processed Version)

Overview

This dataset is a structured extraction of the Million Song Subset, derived from HDF5 files into a tabular format for easier accessibility and analysis.

Source

Original dataset: Million Song Dataset (LabROSA, Columbia University & The Echo Nest)
Subset used: Million Song Subset (10,000 songs)
URL: http://millionsongdataset.com

Processing Steps

Extraction: Used hdf5_getters.py to retrieve all… See the full description on the dataset page: https://huggingface.co/datasets/trojblue/million-song-subset.
MSD Dataset
paperswithcode.com
Updated Jun 26, 2022
+ more versions
Cite
(2022). MSD Dataset [Dataset]. https://paperswithcode.com/dataset/msd
Dataset updated
Jun 26, 2022
Description
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code provided by the authors.

Paper: The Million Song Dataset
Million Song Dataset
ieee-dataport.org
Updated Aug 3, 2016
Cite
Alex Outman (2016). Million Song Dataset [Dataset]. https://ieee-dataport.org/documents/million-song-dataset
Dataset updated
Aug 3, 2016
Authors
Alex Outman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:
To encourage research on algorithms that scale to commercial sizes
To provide a reference dataset for evaluating research
As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
To help new researchers get started in the MIR field
Data from: The Million Song Dataset.
academictorrents.com
bittorrent
Updated Aug 23, 2024
+ more versions
Cite
Thierry Bertin-Mahieux and Daniel P. W. Ellis and Brian Whitman and Paul Lamere (2024). The Million Song Dataset. [Dataset]. https://academictorrents.com/details/fecaeaf2f97a0cd9f62fdaafaac70a6a96fa4ac0
Explore at:
Available download formats: bittorrent (214163931939)
Dataset updated
Aug 23, 2024
Dataset authored and provided by
Thierry Bertin-Mahieux and Daniel P. W. Ellis and Brian Whitman and Paul Lamere
License
https://academictorrents.com/nolicensespecified
Description
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
Spotify Million Song Dataset
opendatabay.com
huggingface.co
Updated Jun 6, 2025
Cite
Datasimple (2025). Spotify Million Song Dataset [Dataset]. https://www.opendatabay.com/data/dataset/db3c0ef7-dfe6-4d65-a588-ee33c43a002e
Dataset updated
Jun 6, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
This is the Spotify Million Song Dataset. It contains song names, artist names, links to the songs, and lyrics. It can be used for recommending, classifying, or clustering songs.

Original Data Source: Spotify Million Song Dataset
Million Song Dataset Subset
academictorrents.com
bittorrent
Updated Oct 12, 2015
Cite
Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere (2015). Million Song Dataset Subset [Dataset]. https://academictorrents.com/details/e0b6b5ff012fcda7c4a14e4991d8848a6a2bf52b
Explore at:
Available download formats: bittorrent (1994614463)
Dataset updated
Oct 12, 2015
Dataset authored and provided by
Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere
License
https://academictorrents.com/nolicensespecified
Description
To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random. It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:
To encourage research on algorithms that scale to commercial sizes
To provide a reference dataset for evaluating research
As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, howeve
Million Song Dataset
kaggle.com
Updated Jul 28, 2022
Cite
Gaurav Dutta (2022). Million Song Dataset [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/million-song-dataset/data
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 28, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gaurav Dutta
Description
Context: Songs, like any other audio signal, feature distinctive fundamental frequencies, timbre components, and other properties. Each song is unique in these respects, which is why they can be patterned.

Objective: Your task is to use machine learning models to predict the release year (between 1922 and 2011) of a song that is described by 90 attributes of average timbre and covariance.

Data Description:
TA01 to TA12 – Timbre averages
TC01 to TC78 – Timbre covariances
Year – Release year
Language in academics, fiction and song
test.researchdata.tuwien.at
test.researchdata.tuwien.ac.at
+2 more
bin, text/markdown
Updated Jun 25, 2024
Cite
Theodor Seiser (2024). Language in academics, fiction and song [Dataset]. http://doi.org/10.70124/3c6eq-e6877
Explore at:
Available download formats: bin, text/markdown
Unique identifier
https://doi.org/10.70124/3c6eq-e6877
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Theodor Seiser
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 14, 2023
Description
Language in academics, fiction and song
The research project showed how language differs between song lyrics and written text in academic and fictional contexts, using key verbs as an example. It compares the overall diversity of the verbs used as well as the diversity within genres and individual texts. It also highlights the most frequently used verbs per genre.
The research project used the following existing resources.
Sönning, Lukas, 2023, "Key verbs in academic writing: Dataset for "Evaluation of keyness metrics: Performance and reliability"", https://doi.org/10.18710/EUXSMW, DataverseNO, V1
Bertin-Mahieux, Thierry et al. (2011). "The Million Song Dataset". In: Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011)
musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://millionsongdataset.com/musixmatch
Last.fm dataset, the official song tags and song similarity collection for the Million Song Dataset, available at: http://millionsongdataset.com/lastfm
The data was produced by comparing and querying the existing data sources. This is documented in queries.sql.
A library or other software is needed to access the database. DB Browser for SQLite was used in this research project; it is free, open source and easy to use, and is therefore recommended for potential users.
Genre Annotations for the MSD: CD2C (truth by consensus)
tagtraum.com
zip
Updated Oct 25, 2015
+ more versions
Cite
Hendrik Schreiber (2015). Genre Annotations for the MSD: CD2C (truth by consensus) [Dataset]. https://www.tagtraum.com/msd_genre_datasets.html
Explore at:
Available download formats: zip
Dataset updated
Oct 25, 2015
Dataset provided by
tagtraum industries incorporated
Authors
Hendrik Schreiber
Description
Genre ground-truth for the Million Song Dataset (MSD) generated based on the Last.fm dataset and beaTunes Genre Dataset (BGD) by consensus. When using this dataset, please cite the following paper: Hendrik Schreiber. Improving Genre Annotations for the Million Song Dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015. Additional dataset splits are available on the dataset's website.
ESSENTIA analysis of audio snippets from the Million Song Dataset Taste...
data.niaid.nih.gov
Updated May 27, 2020
Cite
Fricke, Kai (2020). ESSENTIA analysis of audio snippets from the Million Song Dataset Taste Profile subset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3860556
Dataset updated
May 27, 2020
Dataset authored and provided by
Fricke, Kai
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This upload includes the ESSENTIA analysis output of (a subset of) song snippets from the Million Song Dataset, namely those included in the Taste Profile subset. The audio snippets were collected from 7digital.com and were subsequently analyzed with ESSENTIA 2.1-beta3. Pre-trained SVM models provided by the ESSENTIA authors on their website were applied.

The file msd_song_jsons.rar contains the ESSENTIA analysis output after applying the SVM models for highlevel feature extraction. Please note that these are 204317 files.

The file msd_played_songs_essentia.csv.gz contains all one-dimensional real-valued fields of the jsons merged into one csv file with 204317 rows.

The full procedure and subsequent analysis is described in

Fricke, K. R., Greenberg, D. M., Rentfrow, P. J., & Herzberg, P. Y. (2019). Measuring musical preferences from listening behavior: Data from one million people and 200,000 songs. Psychology of Music, 0305735619868280.
spotify-million-song-dataset-descriptions
huggingface.co
Updated Apr 23, 2025
Cite
Petko Petkov (2025). spotify-million-song-dataset-descriptions [Dataset]. http://doi.org/10.57967/hf/4479
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/4479
Dataset updated
Apr 23, 2025
Authors
Petko Petkov
Description
petkopetkov/spotify-million-song-dataset-descriptions dataset hosted on Hugging Face and contributed by the HF Datasets community
MSD-I: Million Song Dataset with Images for Multimodal Genre Classification
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Sergio Oramas (2020). MSD-I: Million Song Dataset with Images for Multimodal Genre Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1240484
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Sergio Oramas
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/) is a collection of metadata and precomputed audio features for 1 million songs. Along with this dataset, a dataset with annotations of 15 top-level genres with a single label per song was released. In our work, we combine the CD2c version of this genre dataset (http://www.tagtraum.com/msd_genre_datasets.html) with a collection of album cover images.

The final dataset contains 30,713 tracks from the MSD and their related album cover images, each annotated with a unique genre label among 15 classes. Based on an initial analysis of the images, we identified that this set of tracks is associated with 16,753 albums, yielding an average of 1.8 songs per album.

We randomly divide the dataset into three parts: 70% for training, 15% for validation, and 15% for test, with no artist and album overlap across these sets. This is crucial to avoid possible overfitting, as the classifier may learn to predict the artist instead of the genre.
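
The artist-disjoint split described above can be illustrated with a minimal sketch. It is not the MSD-I authors' code: the function name `group_disjoint_split`, the track-to-artist mapping, and the toy data are all hypothetical. It groups by artist only, which keeps albums disjoint as well only under the assumption that each album belongs to a single artist. Because whole artists are assigned at once, the 70/15/15 proportions are approximate.

```python
import random
from collections import defaultdict


def group_disjoint_split(track_to_artist, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole artists to train/validation/test so no artist spans two sets."""
    by_artist = defaultdict(list)
    for track, artist in track_to_artist.items():
        by_artist[artist].append(track)

    # Shuffle artists (not tracks) so that each artist lands in exactly one part.
    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)

    total = len(track_to_artist)
    targets = [fraction * total for fraction in fractions]
    splits = ([], [], [])
    part = 0
    for artist in artists:
        # Move to the next part once the current one has reached its target size.
        while part < 2 and len(splits[part]) >= targets[part]:
            part += 1
        splits[part].extend(by_artist[artist])
    return splits


# Toy data (hypothetical): 50 artists with 4 tracks each.
toy = {f"track{a}_{i}": f"artist{a}" for a in range(50) for i in range(4)}
train, val, test = group_disjoint_split(toy)
print(len(train), len(val), len(test))
```

Splitting on artists rather than on individual tracks is what prevents a classifier from scoring well by recognising the artist instead of the genre.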

Content:

MSD-I dataset (mapping, metadata, annotations and links to images)
Data splits and feature vectors for TISMIR single-label classification experiments

These data can be used together with the Tartarus deep learning python module https://github.com/sergiooramas/tartarus.

Scientific References:

Please cite the following paper if using MSD-I dataset or Tartarus software.

Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018). Multimodal Deep Learning for Music Genre Classification, Transactions of the International Society for Music Information Retrieval, V(1).
MusixMatch dataset
figshare.com
bin
Updated May 12, 2020
Cite
Matteo Ceccarello (2020). MusixMatch dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12287924.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.12287924.v1
Dataset updated
May 12, 2020
Dataset provided by
figshare
Authors
Matteo Ceccarello
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset obtained from http://millionsongdataset.com/musixmatch and preprocessed according to the script https://github.com/Cecca/diversity-maximization/blob/master/datasets.sh. The file genres.rank88.txt contains additional configuration for the experiments with this dataset.
Lakh MIDI Dataset
opendatalab.com
zip
Updated Sep 21, 2022
+ more versions
Cite
Columbia University (2022). Lakh MIDI Dataset [Dataset]. https://opendatalab.com/OpenDataLab/Lakh_MIDI_Dataset
Explore at:
Available download formats: zip (21903129607 bytes)
Dataset updated
Sep 21, 2022
Dataset provided by
Columbia University
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files). Around 10% of all MIDI files include timestamped lyrics events, with lyrics often transcribed at the word, syllable or character level. LMD-full denotes the whole dataset. LMD-matched is the subset of LMD-full that consists of MIDI files matched with the Million Song Dataset entries. LMD-aligned contains all the files of LMD-matched, aligned to preview MP3s from the Million Song Dataset. A lakh is a unit of measure used in the Indian number system which signifies 100,000.
Gold-Caps_LMD-Matched_General
zenodo.org
data.niaid.nih.gov
json
Updated Nov 22, 2023
Cite
Nicolas Jonason; Luca Casini; Bob Sturm (2023). Gold-Caps_LMD-Matched_General [Dataset]. http://doi.org/10.5281/zenodo.10178563
Explore at:
Available download formats: json
Unique identifier
https://doi.org/10.5281/zenodo.10178563
Dataset updated
Nov 22, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nicolas Jonason; Luca Casini; Bob Sturm
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Nov 17, 2023
Description
This dataset contains captions for the Lakh MIDI Dataset-matched music dataset (~30,000 tracks with accompanying MIDI files).
These captions were generated by the gpt-4-1106-preview chat endpoint prompted to describe each track based on the track title and artist. The captions have not been filtered or post-processed in any way.
Prompt used:
"Give a general description of the track
ADL Piano MIDI
opendatalab.com
paperswithcode.com
zip
Updated Apr 10, 2023
+ more versions
Cite
University of Alberta (2023). ADL Piano MIDI [Dataset]. https://opendatalab.com/OpenDataLab/ADL_Piano_MIDI
Explore at:
Available download formats: zip (92214334 bytes)
Dataset updated
Apr 10, 2023
Dataset provided by
University of California
University of Alberta
Description
The ADL Piano MIDI is a dataset of 11,086 piano pieces from different genres. This dataset is based on the Lakh MIDI dataset, which is a collection of 45,129 unique MIDI files that have been matched to entries in the Million Song Dataset. Most pieces in the Lakh MIDI dataset have multiple instruments, so for each file the authors of the ADL Piano MIDI dataset extracted only the tracks with instruments from the "Piano Family" (MIDI program numbers 1-8). This process generated a total of 9,021 unique piano MIDI files. These 9,021 files were then combined with approximately 2,065 other files scraped from publicly-available sources on the internet. All the files in the final collection were de-duped according to their MD5 checksum.

Source: ADL Piano MIDI
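
The MD5-based de-duplication mentioned in the description can be sketched as follows. This is an illustrative example only, not the ADL Piano MIDI authors' script: the helper names `md5_of_file` and `dedupe_by_md5` and the demo files are hypothetical. It keeps the first file (in sorted path order) for each distinct checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_by_md5(paths):
    """Keep one path per distinct checksum (the first in sorted order)."""
    first_seen = {}
    for path in sorted(paths):
        first_seen.setdefault(md5_of_file(path), path)
    return list(first_seen.values())


# Demo with hypothetical files: a.mid and b.mid are byte-identical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.mid").write_bytes(b"same bytes")
    (root / "b.mid").write_bytes(b"same bytes")
    (root / "c.mid").write_bytes(b"other bytes")
    kept = [path.name for path in dedupe_by_md5(root.glob("*.mid"))]

print(kept)
```

A checksum comparison only catches byte-identical files; two MIDI files that encode the same music with different metadata or encodings would both be kept.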
Supplementary material of the paper "The power of deep without going deep? A...
figshare.com
bin
Updated Jun 1, 2023
Cite
Jaehun Kim; Cynthia Liem (2023). Supplementary material of the paper "The power of deep without going deep? A study of HDPGMM music representation learning" [Dataset]. http://doi.org/10.4121/21981442.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.4121/21981442.v1
Dataset updated
Jun 1, 2023
Dataset provided by
4TU.ResearchData
Authors
Jaehun Kim; Cynthia Liem
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Supplementary material of the paper "The power of deep without going deep? A study of HDPGMM music representation learning"

Authors: Jaehun Kim (jaehun.j.kim@gmail.com), Cynthia C.S. Liem

General Information

This entry contains the following data, a by-product of the experiment conducted for the study titled "The power of deep without going deep? A study of HDPGMM music representation learning". In addition, the program for the main experimental routine is provided in a separate repository.

10 scholarly articles cite the trojblue/million-song-subset dataset (view in Google Scholar).
