Million Song Subset (Processed Version)
Overview
This dataset is a structured extraction of the Million Song Subset, derived from HDF5 files into a tabular format for easier accessibility and analysis.
Source
Original dataset: Million Song Dataset (LabROSA, Columbia University & The Echo Nest)
Subset used: Million Song Subset (10,000 songs)
URL: http://millionsongdataset.com
Processing Steps
Extraction: Used hdf5_getters.py to retrieve all… See the full description on the dataset page: https://huggingface.co/datasets/trojblue/million-song-subset.
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code provided by the authors.
Paper: The Million Song Dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field.
https://academictorrents.com/nolicensespecified
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the Spotify Million Song Dataset. It contains song names, artist names, links to the songs, and lyrics. It can be used for recommending, classifying, or clustering songs.
Original Data Source: Spotify Million Song Dataset
https://academictorrents.com/nolicensespecified
To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random. It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field.
The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, howeve…
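The "develop on the subset, then port to the full dataset" workflow can be sketched as a query function that only depends on a database connection. This is a minimal illustration, not the dataset authors' code: the table name `songs`, its columns, and the sample rows are assumptions made for the example, and the in-memory database stands in for one of the subset's SQLite files.

```python
import sqlite3


def songs_per_year(conn, table="songs"):
    """Count songs per release year, skipping rows with an unknown (0) year.

    Assumes a table with an integer `year` column; the name is hypothetical.
    """
    query = (
        f"SELECT year, COUNT(*) FROM {table} "
        "WHERE year > 0 GROUP BY year ORDER BY year"
    )
    return conn.execute(query).fetchall()


# In-memory stand-in for a subset database (assumed schema, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (track_id TEXT PRIMARY KEY, title TEXT, "
    "artist_name TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [
        ("T1", "Song A", "Artist X", 1999),
        ("T2", "Song B", "Artist Y", 1999),
        ("T3", "Song C", "Artist X", 2005),
        ("T4", "Song D", "Artist Z", 0),  # unknown year, excluded
    ],
)

counts = songs_per_year(conn)
print(counts)
```

Because the function takes the connection as a parameter, porting to the full dataset would only require connecting to a different database file with the same schema.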
Context
Songs, like any other audio signal, have distinctive fundamental frequencies, timbre components, and other properties. Each song is unique in these respects, which makes it possible to find patterns in them.
Objective
Your task is to use machine learning models to predict the release year (between 1922 and 2011) of a song described by 90 attributes: timbre averages and timbre covariances.
Data Description
TA01 to TA12 – Timbre averages
TC01 to TC78 – Timbre covariances
Year – Release year
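As a rough illustration of the task, the sketch below builds the 90 column names from the data description and fits an ordinary least-squares baseline. It is only a baseline under stated assumptions: the data are synthetic stand-ins generated in the script (not the real dataset), and the choice of a linear model and the 400/100 split are arbitrary. The count of 78 covariances matches the number of distinct entries in a symmetric 12×12 covariance matrix, though the description does not state how they are ordered.

```python
import numpy as np

# Column names from the data description: 12 averages + 78 covariances.
FEATURES = [f"TA{i:02d}" for i in range(1, 13)] + [f"TC{i:02d}" for i in range(1, 79)]
YEAR_MIN, YEAR_MAX = 1922, 2011


def fit_linear(X, y):
    """Ordinary least squares with an intercept column; returns the weights."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict_year(w, X):
    """Predict, round to whole years, and clip to the valid year range."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.clip(np.rint(A @ w), YEAR_MIN, YEAR_MAX).astype(int)


# Synthetic stand-in data (NOT the real dataset), generated for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(FEATURES)))
true_w = 0.5 * rng.normal(size=len(FEATURES))
y = np.clip(1990 + X @ true_w + rng.normal(scale=2.0, size=500), YEAR_MIN, YEAR_MAX)

w = fit_linear(X[:400], y[:400])      # train on the first 400 rows
pred = predict_year(w, X[400:])       # evaluate on the remaining 100
mae = float(np.mean(np.abs(pred - y[400:])))
print(f"held-out MAE on synthetic data: {mae:.2f} years")
```

On the real data the features would be loaded from the provided file instead of generated, and the error on synthetic data says nothing about how the model would perform there.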
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The research project showed how language differs between song lyrics and written text in academic and fictional contexts, using key verbs as an example. It compares the overall diversity of the verbs used as well as the diversity within genres and individual texts. It also highlights the most frequently used verbs per genre.
The research project used the following existing resources.
Sönning, Lukas, 2023, "Key verbs in academic writing: Dataset for "Evaluation of keyness metrics: Performance and reliability"", https://doi.org/10.18710/EUXSMW, DataverseNO, V1
Bertin-Mahieux, Thierry et al. (2011). "The Million Song Dataset". In: Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011)
musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://millionsongdataset.com/musixmatch
Last.fm dataset, the official song tags and song similarity collection for the Million Song Dataset, available at: http://millionsongdataset.com/lastfm
The data was produced by comparing and querying the existing data sources. This is documented in queries.sql.
A library or other software is needed to access the database. DB Browser for SQLite was used in this research project; it is free, open source, and easy to use, and is therefore recommended for potential users.
Genre ground-truth for the Million Song Dataset (MSD) generated based on the Last.fm dataset and beaTunes Genre Dataset (BGD) by consensus. When using this dataset, please cite the following paper: Hendrik Schreiber. Improving Genre Annotations for the Million Song Dataset. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 241-247, Málaga, Spain, Oct. 2015. Additional dataset splits are available on the dataset's website.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload includes the ESSENTIA analysis output of (a subset of) song snippets from the Million Song Dataset, namely those included in the Taste Profile subset. The audio snippets were collected from 7digital.com and were subsequently analyzed with ESSENTIA 2.1-beta3. Pre-trained SVM models provided by the ESSENTIA authors on their website were applied.
The file msd_song_jsons.rar contains the ESSENTIA analysis output after applying the SVM models for high-level feature extraction. Please note that it contains 204317 files.
The file msd_played_songs_essentia.csv.gz contains all one-dimensional real-valued fields of the JSON files merged into one CSV file with 204317 rows.
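One way such a merge could be done is sketched below: each nested JSON document is flattened to dotted column names, only scalar real-valued fields are kept, and the rows are written to a single CSV. This is an assumption about the general approach, not the authors' script; the field names in the two sample documents are illustrative only.

```python
import csv
import io
import json
from numbers import Real


def flatten_scalars(obj, prefix=""):
    """Flatten nested dicts, keeping only real-valued scalars (no lists, strings, bools)."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten_scalars(value, name))
        elif isinstance(value, Real) and not isinstance(value, bool):
            out[name] = float(value)
    return out


def merge_to_csv(docs, out):
    """Write one CSV row per document; columns are the sorted union of all fields."""
    rows = [flatten_scalars(d) for d in docs]
    columns = sorted({c for r in rows for c in r})
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns


# Two made-up documents standing in for per-song analysis files.
docs = [json.loads(s) for s in (
    '{"lowlevel": {"average_loudness": 0.9, "mfcc": {"mean": [1.0, 2.0]}},'
    ' "highlevel": {"danceability": {"probability": 0.7, "value": "danceable"}}}',
    '{"lowlevel": {"average_loudness": 0.5}, "rhythm": {"bpm": 120}}',
)]

buf = io.StringIO()
columns = merge_to_csv(docs, buf)
print(buf.getvalue())
```

Fields missing from a document are left empty in its row, so documents with different sets of fields can share one table.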
The full procedure and subsequent analysis are described in:
Fricke, K. R., Greenberg, D. M., Rentfrow, P. J., & Herzberg, P. Y. (2019). Measuring musical preferences from listening behavior: Data from one million people and 200,000 songs. Psychology of Music, 0305735619868280.
petkopetkov/spotify-million-song-dataset-descriptions dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/) is a collection of metadata and precomputed audio features for 1 million songs. Along with this dataset, a dataset with annotations of 15 top-level genres with a single label per song was released. In our work, we combine the CD2c version of this genre dataset (http://www.tagtraum.com/msd_genre_datasets.html) with a collection of album cover images.
The final dataset contains 30,713 tracks from the MSD and their related album cover images, each annotated with a unique genre label among 15 classes. Based on an initial analysis of the images, we identified that this set of tracks is associated with 16,753 albums, yielding an average of 1.8 songs per album.
We randomly divide the dataset into three parts: 70% for training, 15% for validation, and 15% for test, with no artist and album overlap across these sets. This is crucial to avoid possible overfitting, as the classifier may learn to predict the artist instead of the genre.
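A split with no artist or album overlap can be sketched by grouping tracks that share an artist or an album into connected components and assigning whole components to a split. This is one possible implementation under assumptions, not the procedure used for MSD-I: the union-find grouping, the greedy assignment, and the toy track list are all choices made for this example.

```python
import random
from collections import defaultdict


def grouped_split(tracks, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split (track_id, artist, album) tuples so no artist or album spans two splits."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link every artist to each of their albums; shared albums merge artists.
    for _, artist, album in tracks:
        parent[find(("artist", artist))] = find(("album", album))

    groups = defaultdict(list)
    for track_id, artist, _ in tracks:
        groups[find(("artist", artist))].append(track_id)

    components = list(groups.values())
    random.Random(seed).shuffle(components)

    # Greedy: give each component to the split furthest below its target size.
    targets = [f * len(tracks) for f in fractions]
    splits = [[] for _ in fractions]
    for comp in components:
        i = max(range(len(splits)), key=lambda k: targets[k] - len(splits[k]))
        splits[i].extend(comp)
    return splits


# Toy data: 200 made-up tracks whose artists and albums interlock.
tracks = [(f"t{i}", f"artist{i % 40}", f"album{i % 60}") for i in range(200)]
train, val, test = grouped_split(tracks)
print(len(train), len(val), len(test))
```

Because whole components are assigned, the resulting sizes only approximate the 70/15/15 targets, and a very large component (for example, one created by compilation albums) can skew them.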
Content:
MSD-I dataset (mapping, metadata, annotations and links to images)
Data splits and feature vectors for TISMIR single-label classification experiments
These data can be used together with the Tartarus deep learning Python module: https://github.com/sergiooramas/tartarus.
Scientific References:
Please cite the following paper if using MSD-I dataset or Tartarus software.
Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018). Multimodal Deep Learning for Music Genre Classification, Transactions of the International Society for Music Information Retrieval, V(1).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset obtained from http://millionsongdataset.com/musixmatch and preprocessed according to the script https://github.com/Cecca/diversity-maximization/blob/master/datasets.sh. The file genres.rank88.txt contains additional configuration for the experiments with this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files). Around 10% of all MIDI files include timestamped lyrics events, with lyrics often transcribed at the word, syllable, or character level. LMD-full denotes the whole dataset. LMD-matched is the subset of LMD-full that consists of MIDI files matched with the Million Song Dataset entries. LMD-aligned contains all the files of LMD-matched, aligned to preview MP3s from the Million Song Dataset. A lakh is a unit of measure used in the Indian number system which signifies 100,000.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains captions for the Lakh MIDI Dataset-matched music dataset (~30,000 tracks with accompanying MIDI files).
These captions were generated by the gpt-4-1106-preview chat endpoint prompted to describe each track based on the track title and artist. The captions have not been filtered or post-processed in any way.
Prompt used:
"Give a general description of the track
The ADL Piano MIDI is a dataset of 11,086 piano pieces from different genres. This dataset is based on the Lakh MIDI dataset, which is a collection of 45,129 unique MIDI files that have been matched to entries in the Million Song Dataset. Most pieces in the Lakh MIDI dataset have multiple instruments, so for each file the authors of the ADL Piano MIDI dataset extracted only the tracks with instruments from the "Piano Family" (MIDI program numbers 1-8). This process generated a total of 9,021 unique piano MIDI files. These 9,021 files were then combined with approximately 2,065 other files scraped from publicly-available sources on the internet. All the files in the final collection were de-duped according to their MD5 checksum.
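The two filtering steps mentioned above, selecting "Piano Family" programs and de-duplicating by MD5 checksum, can be sketched as follows. This is a simplified illustration rather than the ADL authors' pipeline: it works on raw bytes and plain program numbers instead of parsing MIDI, and the helper names and sample data are hypothetical.

```python
import hashlib


def is_piano_family(program_number):
    """True for 1-based MIDI program numbers 1-8 (the "Piano Family")."""
    return 1 <= program_number <= 8


def dedupe_by_md5(files):
    """Keep the first file seen for each distinct MD5 checksum.

    `files` is an iterable of (name, content_bytes) pairs.
    """
    seen = set()
    kept = []
    for name, data in files:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept


# Made-up byte strings standing in for MIDI file contents.
files = [
    ("a.mid", b"MThd-piece-1"),
    ("b.mid", b"MThd-piece-1"),  # byte-identical copy of a.mid
    ("c.mid", b"MThd-piece-2"),
]
unique_names = dedupe_by_md5(files)
print(unique_names)
```

Note that many MIDI libraries expose program values as 0-based, in which case the piano range would be 0-7 rather than 1-8. A checksum comparison only removes byte-identical files; two different encodings of the same piece would both be kept.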
Source: ADL Piano MIDI
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Authors: Jaehun Kim (jaehun.j.kim@gmail.com), Cynthia C.S. Liem
This entry contains the following data, a by-product of the experiment conducted for the study titled "The power of deep without going deep? A study of HDPGMM music representation learning". In addition, the program for the main experimental routine is provided in a separate repository.