Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Many datasets are available nowadays for training and experimentation in the field of recommender systems. In the recommendation of audiovisual content specifically, the MovieLens dataset is a prominent example. It focuses on the user-item relationship, providing actual interaction data between users and movies. However, although movies can be described with many characteristics, this dataset offers only limited information about movie genres.
In this work, we propose enriching the MovieLens dataset by incorporating metadata available on the web (such as cast, description, keywords, etc.) and movie trailers. From the trailers, we extract audio information and generate a transcription for each one, introducing a crucial textual dimension to the dataset. The audio information was extracted through waveform and frequency analysis, followed by dimensionality reduction techniques. Transcriptions were generated with the Whisper deep learning model. Finally, metadata was obtained from TMDB, and the BERT model was applied to extract embeddings.
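As a rough illustration of the pipeline described above (not the authors' exact code), the sketch below transcribes a trailer's audio with Whisper and mean-pools BERT token embeddings for the resulting text; the file name and model checkpoints are assumptions.

import torch
import whisper  # openai-whisper
from transformers import AutoTokenizer, AutoModel

# Transcribe the trailer audio (hypothetical file name)
asr = whisper.load_model("base")
transcript = asr.transcribe("trailer_audio.mp3")["text"]

# Embed the transcript (or a TMDB overview) with BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(transcript, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # shape (1, seq_len, 768)
embedding = hidden.mean(dim=1).squeeze(0)      # 768-dimensional vector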
These additional attributes enrich the original dataset and enable deeper, more precise analysis. The extended and enhanced dataset could therefore drive significant advances in recommender systems, improving the user experience through more relevant movie recommendations tailored to individual tastes and preferences.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Sayha
YouTube Video Audio and Subtitles Extraction
Sayha is a tool designed to download YouTube videos and extract their audio and subtitle data. This can be particularly useful for creating datasets for machine learning projects, transcription services, or language studies.
Features
Download YouTube videos. Extract audio tracks from videos. Retrieve and process subtitle files. Prepare datasets for various applications.
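Sayha's own commands are documented on the dataset page linked below; as a rough sketch of the kind of workflow it automates (not Sayha's actual code), yt-dlp can fetch a video's audio track and subtitles like this:

from yt_dlp import YoutubeDL

# Hypothetical example; ffmpeg is required for the audio extraction step
opts = {
    "format": "bestaudio/best",
    "writesubtitles": True,         # uploader-provided subtitles
    "writeautomaticsub": True,      # fall back to auto-generated captions
    "subtitleslangs": ["en"],
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    "outtmpl": "%(id)s.%(ext)s",
}
with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])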
Installation
Clone… See the full description on the dataset page: https://huggingface.co/datasets/sadece/sayha.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This distribution includes details of the SLAC multimodal music dataset as well as features extracted from it. The dataset is intended to facilitate research comparing the relative musical influences of four different musical modalities: symbolic, lyrical, audio and cultural. SLAC was assembled by independently collecting, for each of its component musical pieces, a symbolic MIDI encoding, a lyrical text transcription, an audio MP3 recording and cultural information mined from the internet. It is important to emphasize the independence of how each of these components was collected; for example, the MIDI and MP3 encodings of each piece were collected entirely separately, and neither was generated from the other.
Features have been extracted from each of the musical pieces in SLAC using the jMIR (http://jmir.sourceforge.net) feature extractor corresponding to each of the modalities: jSymbolic for symbolic, jLyrics for lyrics, jAudio for audio and jWebMiner2 for mining cultural data from search engines and Last.fm (https://www.last.fm).
SLAC is quite small, consisting of only 250 pieces, due to the difficulty of independently finding matching information in all four modalities. Although its small size does impose certain constraints, it is nonetheless the largest (and only) known dataset to include all four independently collected modalities.
The dataset is divided into ten genres, with 25 pieces belonging to each genre: Modern Blues, Traditional Blues, Baroque, Romantic, Bop, Swing, Hardcore Rap, Pop Rap, Alternative Rock and Metal. These can be collapsed into a 5-genre taxonomy, with 50 pieces per genre: Blues, Classical, Jazz, Rap and Rock. This facilitates experiments with both coarser and finer classes.
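As a small illustration (not part of the distribution), the 10-to-5 genre collapse described above can be expressed as a simple mapping:

# Collapse SLAC's 10 fine genres into its 5 coarse genres
COARSE_GENRE = {
    "Modern Blues": "Blues", "Traditional Blues": "Blues",
    "Baroque": "Classical", "Romantic": "Classical",
    "Bop": "Jazz", "Swing": "Jazz",
    "Hardcore Rap": "Rap", "Pop Rap": "Rap",
    "Alternative Rock": "Rock", "Metal": "Rock",
}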
SLAC was published at the ISMIR 2010 conference, and was itself an expansion of the SAC dataset (published at the ISMIR 2008 conference), which is identical except that it excludes the lyrics and lyrical features found in SLAC. Both ISMIR papers are included in this distribution.
Due to copyright limitations, this distribution does not include the actual music or lyrics of the pieces comprising SLAC. It does, however, include details of the contents of the dataset as well as features extracted from each of its modalities using the jMIR software. These include the original features extracted for the 2010 ISMIR paper, as well as an updated set of symbolic features extracted in 2021 using the newer jSymbolic 2.2 feature extractor (published at ISMIR 2018). These jSymbolic 2.2 features include both the full MIDI feature set and a “conservative” feature set meant to limit potential biases due to encoding practice. Feature values are distributed as CSV files, Weka ARFF (https://www.cs.waikato.ac.nz/ml/weka/) files and ACE XML (http://jmir.sourceforge.net) files.
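As a minimal sketch of working with the distributed feature files (the file names below are assumptions, not the actual names in the archive), the CSV and Weka ARFF variants can be loaded with standard Python tooling:

import pandas as pd
from scipy.io import arff

# Hypothetical file names; see the distribution for the real ones
csv_features = pd.read_csv("jSymbolic_features.csv")

arff_data, arff_meta = arff.loadarff("jAudio_features.arff")
audio_features = pd.DataFrame(arff_data)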
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains transcripts of all tracks by the Adjutant in the Starcraft I Terran campaigns. It includes both the base game and the Brood War expansion. Will 65 entries be good enough to extract assistantship response patterns? I will find out.
Curation Process
I extracted all the sound files from the local Starcraft installation using a CascLib-based extractor I wrote. This gave me a lot of .ogg files. Starcraft I file nomenclature is nice: for example, all files containing Adjutant… See the full description on the dataset page: https://huggingface.co/datasets/yxzwayne/TerranAdjutant-1.
This dataset comprises the course material of the first years of elementary school in Greece. It includes 29,698 signed phrases drawn from the 33 issues of 13 distinct textbooks of the A, B and C years of primary school. The Elementary Dataset consists of the following courses:
9,507 videos of Greek Language (1st, 2nd and 3rd year)
6,599 videos of Mathematics (1st, 2nd and 3rd year)
4,163 videos of Anthology of Greek Literacy (1st, 2nd, 3rd and 4th year)
5,528 videos of Environmental Studies (1st, 2nd and 3rd year)
2,069 videos of History (3rd year)
1,832 videos of Religious Study (3rd year)
Version 2 - Major Features and Improvements: removed duplicated videos, improved transcriptions, audio extraction (for Greek STT tasks).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We present Da-TACOS: a dataset for cover song identification and understanding. It contains two subsets, namely the benchmark subset (for benchmarking cover song identification systems) and the cover analysis subset (for analyzing the links among cover songs), with pre-extracted features and metadata for 15,000 and 10,000 songs, respectively. The annotations included in the metadata are obtained with the API of SecondHandSongs.com. All audio files we use to extract features are encoded in MP3 format and their sample rate is 44.1 kHz. Da-TACOS does not contain any audio files. For the results of our analyses on modifiable musical characteristics using the cover analysis subset and our initial benchmarking of 7 state-of-the-art cover song identification algorithms on the benchmark subset, you can look at our publication.
For organizing the data, we use the structure of SecondHandSongs, where each song is called a ‘performance’ and each clique (cover group) is called a ‘work’. Based on this, the file names of the songs are their unique performance IDs (PID, e.g. P_22), and their labels with respect to their cliques are their work IDs (WID, e.g. W_14).
Metadata for each song includes the work and performance titles and artists, the release year, the work and performance IDs, and whether the performance is instrumental (see the example below).
In addition, we matched the original metadata with MusicBrainz to obtain MusicBrainz IDs (MBIDs), song lengths and genre/style tags. We would like to note that MusicBrainz-related information is not available for all the songs in Da-TACOS, and since we used only our metadata for matching, we include all possible MBIDs for a particular song.
To facilitate reproducibility in cover song identification (CSI) research, we propose a framework for feature extraction and benchmarking in our supplementary repository: acoss. The feature extraction component is designed to help CSI researchers find the most commonly used features for CSI in a single place. The parameter values we used to extract the features in Da-TACOS are shared in the same repository. Moreover, the benchmarking component includes our implementations of 7 state-of-the-art CSI systems, and we provide the results of an initial benchmarking of those 7 systems on the benchmark subset of Da-TACOS. We encourage other CSI researchers to contribute to acoss by implementing their favorite feature extraction algorithms and CSI systems, building up a knowledge base where CSI research can reach larger audiences.
The instructions for how to download and use the dataset are shared below. Please contact us if you have any questions or requests.
1. Structure
1.1. Metadata
We provide two metadata files that contain information about the benchmark subset and the cover analysis subset. Both metadata files are stored as Python dictionaries in .json format and have the same hierarchical structure.
An example of loading the metadata files in Python:
import json

with open('./da-tacos_metadata/da-tacos_benchmark_subset_metadata.json') as f:
    benchmark_metadata = json.load(f)
The Python dictionary obtained with the code above has the respective WIDs as keys. Each WID maps to the song dictionaries containing the metadata of the performances that belong to that clique. An example can be seen below:
"W_163992": { # work id
"P_547131": { # performance id of the first song belonging to the clique 'W_163992'
"work_title": "Trade Winds, Trade Winds",
"work_artist": "Aki Aleong",
"perf_title": "Trade Winds, Trade Winds",
"perf_artist": "Aki Aleong",
"release_year": "1961",
"work_id": "W_163992",
"perf_id": "P_547131",
"instrumental": "No",
"perf_artist_mbid": "9bfa011f-8331-4c9a-b49b-d05bc7916605",
"mb_performances": {
"4ce274b3-0979-4b39-b8a3-5ae1de388c4a": {
"length": "175000"
},
"7c10ba3b-6f1d-41ab-8b20-14b2567d384a": {
"length": "177653"
}
}
},
"P_547140": { # performance id of the second song belonging to the clique 'W_163992'
"work_title": "Trade Winds, Trade Winds",
"work_artist": "Aki Aleong",
"perf_title": "Trade Winds, Trade Winds",
"perf_artist": "Dodie Stevens",
"release_year": "1961",
"work_id": "W_163992",
"perf_id": "P_547140",
"instrumental": "No"
}
}
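Continuing the loading example above, the cliques and their member performances can be traversed directly (a minimal sketch):

# Count cliques (works) and performances in the benchmark subset
n_works = len(benchmark_metadata)
n_perfs = sum(len(perfs) for perfs in benchmark_metadata.values())
print(f"{n_works} cliques, {n_perfs} performances")

# List the performers and release years within one clique
for pid, song in benchmark_metadata["W_163992"].items():
    print(pid, song["perf_artist"], song["release_year"])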
1.2. Pre-extracted features
The list of features included in Da-TACOS can be seen below. All the features are extracted with the acoss repository, which uses open-source feature extraction libraries such as Essentia, LibROSA, and Madmom.
To facilitate the use of the dataset, we provide two options regarding the file structure.
1- In the da-tacos_benchmark_subset_single_files and da-tacos_coveranalysis_subset_single_files folders, the data is organized by clique, and a single file contains all the features for a given song.
{
"chroma_cens": numpy.ndarray,
"crema": numpy.ndarray,
"hpcp": numpy.ndarray,
"key_extractor": {
"key": numpy.str_,
"scale": numpy.str_,_
"strength": numpy.float64
},
"madmom_features": {
"novfn": numpy.ndarray,
"onsets": numpy.ndarray,
"snovfn": numpy.ndarray,
"tempos": numpy.ndarray
},
"mfcc_htk": numpy.ndarray,
"tags": list of (numpy.str_, numpy.str_)
"label": numpy.str_,
"track_id": numpy.str_
}
2- In the da-tacos_benchmark_subset_FEATURE and da-tacos_coveranalysis_subset_FEATURE folders, the data is also organized by clique, but each folder contains only one feature per song. For instance, if you want to test a system that uses HPCP features, you can download da-tacos_benchmark_subset_hpcp to access the pre-computed HPCP features. An example of the contents of those files can be seen below:
{
"hpcp": numpy.ndarray,
"label": numpy.str_,
"track_id": numpy.str_
}
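Assuming a per-feature file has been deserialized into a dictionary like the one above (the on-disk format is documented on the dataset page), cliques can be reconstructed from the label and track_id fields; a sketch:

from collections import defaultdict
import numpy as np

def group_by_clique(songs):
    # Group per-song feature dicts by their clique label ('label' field above)
    cliques = defaultdict(list)
    for song in songs:
        cliques[str(song["label"])].append(str(song["track_id"]))
    return cliques

def get_hpcp(song):
    # 2-D HPCP matrix, as listed in the structure above
    return np.asarray(song["hpcp"])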
2. Using the dataset
2.1. Requirements
git clone https://github.com/MTG/da-tacos.git
cd da-tacos
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
2.2. Downloading the data
The dataset is currently stored only in Google Drive (it will be uploaded to Zenodo soon) and can be downloaded from this link. We also provide a Python script that automatically downloads the folders you specify. Basic usage of this script can be seen below:
python download_da-tacos.py -h
usage: download_da-tacos.py [-h]
[--dataset {benchmark,coveranalysis,da-tacos}]
[--type {single_files,cens,crema,hpcp,key,madmom,mfcc,tags}]
[--source {gdrive,zenodo}]
[--outputdir OUTPUTDIR]
[--unpack]
[--remove]
Download script for Da-TACOS
optional arguments:
-h, --help show this help message and exit
--dataset {metadata,benchmark,coveranalysis,da-tacos}
which subset to download. 'da-tacos' option downloads
both subsets. the options other than 'metadata' will
download the metadata as well. (default: metadata)
--type {single_files,cens,crema,hpcp,key,madmom,mfcc,tags} [{single_files,cens,crema,hpcp,key,madmom,mfcc,tags} ...]
which folder to download. for downloading multiple
folders, you can enter multiple arguments (e.g. '--
type cens crema'). for detailed explanation, please
check https://mtg.github.io/da-tacos/ (default:
single_files)
--source {gdrive,zenodo}
from which source to download the files. you can
either download from Google Drive (gdrive) or from
Zenodo (zenodo) (default: gdrive)
--outputdir OUTPUTDIR
directory to store the dataset (default: ./)