https://creativecommons.org/publicdomain/zero/1.0/
The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:
Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.
Fork this kernel to get started.
Banner Photo by Edho Pratama from Unsplash.
What is the total number of transactions generated per device browser in July 2017?
The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?
What was the average number of product pageviews for users who made a purchase in July 2017?
What was the average number of product pageviews for users who did not make a purchase in July 2017?
What was the average total transactions per user that made a purchase in July 2017?
What is the average amount of money spent per session in July 2017?
What is the sequence of pages viewed?
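One of the questions above, the real bounce rate per traffic source, can be sketched in plain Python over hypothetical session records. The field names and sample values here are illustrative stand-ins, not the exact BigQuery export schema:

```python
from collections import defaultdict

# Hypothetical session records, loosely mirroring the GA sample's
# trafficSource / totals.pageviews fields (names are illustrative).
sessions = [
    {"source": "google", "pageviews": 1},
    {"source": "google", "pageviews": 4},
    {"source": "(direct)", "pageviews": 1},
    {"source": "(direct)", "pageviews": 1},
    {"source": "youtube.com", "pageviews": 3},
]

def bounce_rate_per_source(sessions):
    """Real bounce rate: percentage of visits with a single pageview, per source."""
    visits = defaultdict(int)
    bounces = defaultdict(int)
    for s in sessions:
        visits[s["source"]] += 1
        if s["pageviews"] == 1:
            bounces[s["source"]] += 1
    return {src: 100.0 * bounces[src] / visits[src] for src in visits}

rates = bounce_rate_per_source(sessions)
```

The same aggregation would be expressed as a `GROUP BY` over the sessions table when run against the actual BigQuery dataset.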
💁♀️Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.
This dataset focuses on a wide range of sabermetric metrics for analyzing batter performance in baseball. It provides a comprehensive view of a player's abilities in power, plate discipline, speed, and overall efficiency.
NOTE that only qualified players are included in this data: a player must reach the minimum number of plate appearances to qualify for season-long leaderboards and rate stats. The data was retrieved on October 18th, 2024.
AB (At-Bats): The total number of times a batter has a turn at the plate, excluding walks and sacrifices.
PA (Plate Appearances): The total number of times a batter completes a turn at the plate, including all outcomes.
Home Run: The number of times the batter hits the ball out of the field, allowing them to round all bases and score.
K% (Strikeout Percentage): The percentage of plate appearances that result in a strikeout.
BB% (Walk Percentage): The percentage of plate appearances that result in the batter receiving a walk.
SLG% (Slugging Percentage): A measure of the batter's power, calculated as total bases per at-bat.
OBP (On-Base Percentage): The percentage of times the batter reaches base via hits, walks, or hit-by-pitch events.
OPS (On-Base Plus Slugging): The sum of OBP and SLG, providing a combined measure of a batter's ability to get on base and hit for power.
Isolated Power (ISO): A measure of a batter's raw power, calculated by subtracting batting average from slugging percentage to focus on extra-base hits.
BABIP (Batting Average on Balls in Play): The batting average on balls hit into play, excluding home runs and strikeouts.
Total Stolen Bases: The total number of bases a player has stolen successfully.
xwOBA (Expected Weighted On-Base Average): A predictive version of wOBA (Weighted On-base Average) based on the quality of contact, such as exit velocity and launch angle.
wOBAdiff (wOBA Differential): The difference between a batter’s actual wOBA and expected wOBA (xwOBA), indicating performance versus expectations.
Exit Velocity Avg (Average Exit Velocity): The average speed of the ball off the bat, providing insight into the quality of contact.
Sweet Spot Percentage: The percentage of batted balls hit with a launch angle between 8 and 32 degrees, which typically leads to better offensive results.
Barrel Batted Rate: The percentage of batted balls hit with ideal exit velocity and launch angle, maximizing chances for extra-base hits.
Hard-Hit Percentage: The percentage of batted balls hit with an exit velocity of 95 mph or higher, reflecting the strength of contact.
Average Hyper Speed: The batter's average sprint speed during short, high-intensity runs like reaching base.
Whiff Percentage: The percentage of swings in which the batter misses the ball entirely.
Swing Percentage: The percentage of pitches at which the batter swings, regardless of whether they make contact.
HP to 1B (Home Plate to First Base Speed): The time it takes for a batter to sprint from home plate to first base after hitting the ball.
Sprint Speed: The player’s top running speed, usually measured during base running or fielding.
WAR (Wins Above Replacement): A measure of a player's value across all facets of the game, expressed as how many more wins he is worth than a replacement-level player at the same position.
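Several of the rate stats defined above are simple functions of counting stats. A minimal sketch, using made-up sample counts for illustration:

```python
# Derive the rate stats defined above from raw counting stats.
# The sample numbers passed in below are invented for illustration.
def rate_stats(pa, ab, h, doubles, triples, hr, bb, hbp, sf, so):
    singles = h - doubles - triples - hr
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab  # total bases per AB
    return {
        "K%": so / pa,
        "BB%": bb / pa,
        "OBP": obp,
        "SLG": slg,
        "OPS": obp + slg,                      # on-base plus slugging
        "ISO": slg - avg,                      # isolates extra-base power
        "BABIP": (h - hr) / (ab - so - hr + sf),
    }

stats = rate_stats(pa=570, ab=500, h=150, doubles=30, triples=3,
                   hr=25, bb=60, hbp=5, sf=5, so=120)
```

For these sample counts the sketch yields a .522 SLG and a .222 ISO, consistent with the definitions of each metric above.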
This dataset is designed to simplify the process of analyzing batter performance using advanced sabermetric principles, providing key insights into offensive effectiveness and expected outcomes.
The dataset was retrieved from the respective sources listed in the Provenance section. Users are urged to use this data responsibly and to respect the rights and guidelines specified by the original data providers. When utilizing or sharing insights derived from this dataset, ensure proper attribution to the sources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MusicOSet is an open and enhanced dataset of musical elements (artists, songs, and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS, and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical feature sources. Data from all three categories were initially collected between January and May 2019; the data was then updated and enhanced in June 2019.
The attractive features of MusicOSet include:
| Data | # Records |
|:-----------------:|:---------:|
| Songs | 20,405 |
| Artists | 11,518 |
| Albums | 26,522 |
| Lyrics | 19,664 |
| Acoustic Features | 20,405 |
| Genres | 1,561 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘K-Pop Hits Through The Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sberj127/kpop-hits-through-the-years on 12 November 2021.
--- Dataset description provided by original source is as follows ---
Each dataset contains the top songs of the era or year indicated in its name. Note that only the KPopHits90s dataset represents an era (1989-2001). Although easily available and reliable sources showing the actual K-Pop hits per year during the 90s are lacking, this era was still included because it was when the first generation of K-Pop stars appeared. Each of the other datasets represents a specific year after the 90s.
A song is considered to be a K-Pop hit during that era or year if it is included in the annual series of K-Pop Hits playlists, which is created officially by Apple Music. Note that for the dataset that represents the 90s, the playlist 90s K-Pop Essentials was used as the reference.
As someone with a particular curiosity about the field of data science and a genuine love for the musicality of the K-Pop scene, I created this dataset to make something out of my strong interest in these separate subjects.
I would like to express my sincere gratitude to Apple Music for creating the annual K-Pop playlists, Spotify for making their API very accessible, Spotipy for making it easier to get the desired data from the Spotify Web API, Tune My Music for automating the process of transferring one's library into another service's library and, of course, all those involved in the making of these songs and artists included in these datasets for creating such high quality music and concepts digestible even for the general public.
--- Original source retains full ownership of the source dataset ---
This dataset compiles the tracks from Spotify's official "Top Tracks of 2023" playlist, showcasing the most popular and influential music of the year according to Spotify's streaming data. It represents a wide array of genres, artists, and musical styles that defined the musical landscape of 2023. Each track in the dataset is detailed with a variety of audio features, its popularity, and metadata. This dataset serves as an excellent resource for music enthusiasts, data analysts, and researchers aiming to explore music trends or develop music recommendation systems based on empirical data.
The data was obtained directly from the Spotify Web API, specifically from the "Top Tracks of 2023" official playlist curated by Spotify. The Spotify API provides detailed information about tracks, artists, and albums through various endpoints.
To process and structure the data, I developed Python scripts using data science libraries such as pandas for data manipulation and spotipy for interacting with the Spotify Web API.
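A minimal sketch of the flattening step, assuming a hypothetical, heavily trimmed playlist-items response shape (real Spotify API responses contain many more fields, and the values here are invented):

```python
# Trimmed, invented stand-in for a playlist-items API response.
response = {
    "items": [
        {"track": {"name": "Song A", "popularity": 95,
                   "artists": [{"name": "Artist 1"}],
                   "album": {"release_date": "2023-03-10"}}},
        {"track": {"name": "Song B", "popularity": 88,
                   "artists": [{"name": "Artist 2"}, {"name": "Artist 3"}],
                   "album": {"release_date": "2023-07-01"}}},
    ]
}

def flatten(response):
    """Flatten nested track objects into flat rows, one per track."""
    rows = []
    for item in response["items"]:
        t = item["track"]
        rows.append({
            "name": t["name"],
            "artists": ", ".join(a["name"] for a in t["artists"]),
            "popularity": t["popularity"],
            "release_date": t["album"]["release_date"],
        })
    return rows

rows = flatten(response)
```

Rows in this flat form can then be loaded into a pandas DataFrame for analysis.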
I encourage users who discover new insights, propose dataset enhancements, or craft analytics that illuminate aspects of the dataset's focus to share their findings with the community. - Kaggle Notebooks: To facilitate sharing and collaboration, users are encouraged to create and share their analyses through Kaggle notebooks. For ease of use, start your notebook by clicking "New Notebook" atop this dataset’s page on K...
As the price of installing solar has fallen, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive mapping and computing resources to use, helping calculate the best solar plan for you.

How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don't worry: Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof's site. Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. You can see more details about how solar viability is determined by checking out the methodology.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The European Copernicus Coastal Flood Awareness System (ECFAS) project aimed to contribute to the evolution of the Copernicus Emergency Management Service (https://emergency.copernicus.eu/) by demonstrating the technical and operational feasibility of a European Coastal Flood Awareness System. Specifically, ECFAS provides a much-needed solution to bolster coastal resilience to climate risk and reduce population and infrastructure exposure by monitoring and supporting disaster preparedness, two factors that are fundamental to damage prevention and recovery if a storm hits.
The ECFAS Proof-of-Concept development ran from January 2021 to December 2022. The ECFAS project was a collaboration between Scuola Universitaria Superiore IUSS di Pavia (Italy, ECFAS Coordinator), Mercator Ocean International (France), Planetek Hellas (Greece), Collecte Localisation Satellites (France), Consorzio Futuro in Ricerca (Italy), Universitat Politecnica de Valencia (Spain), University of the Aegean (Greece), and EurOcean (Portugal), and was funded by the European Commission H2020 Framework Programme within the call LC-SPACE-18-EO-2020 - Copernicus evolution: research activities in support of the evolution of the Copernicus services.
Description of the files contained in the dataset.
The ECFAS Coastal Dataset represents a single access point to publicly available pan-European datasets that provide key information for studying coastal areas. The publicly available datasets listed below have been clipped to the coastal area extent, quality-checked, and assessed for completeness and usability in terms of coverage, accuracy, specifications, and access. The dataset was divided at the European country level, except for the Adriatic area, which was extracted as a region rather than at the country level due to the small size of the countries. Each dataset was buffered 10 km inland so that it could be correlated with the new Copernicus Coastal Zone LU/LC product.
Specifically, the dataset includes the new Coastal LU/LC product, which was implemented by the EEA and became available at the end of 2020. Additional information was collected on the location and characteristics of transport (road and railway) and utility networks (power plants), and on population density and its time variability. Furthermore, some of the publicly available datasets used in CEMS for the above-mentioned assets were gathered, such as OpenStreetMap (building footprints, road and railway network infrastructure), GeoNames (populated places, but also names of administrative units, rivers and lakes, forests, hills and mountains, parks and recreational areas, etc.), and the Global Human Settlement Layer (GHS) and Global Human Settlement Population Grid (GHS-POP) generated by the JRC. The dataset also contains two layers with statistics on the population of Europe by sex and age, divided into administrative units at NUTS level 3: the first layer covers the whole of Europe, and the second only the coastal area. Finally, the dataset includes the global database of flood protection standards. The tables below present the dataset.
* Adriatic folder contains the countries: Slovenia, Croatia, Montenegro, Albania, Bosnia and Herzegovina
* Malta was added to the dataset
Copernicus Land Monitoring Service:
Coastal LU/LC
Scale 1:10.000; A Copernicus hotspot product to monitor landscape dynamics in coastal zones
EU-Hydro - Coastline
Scale 1:30.000; EU-Hydro is a dataset for all European countries providing the coastline
Natura 2000
Scale 1:100.000; A Copernicus hotspot product to monitor important areas for nature conservation
European Settlement Map
Resolution 10m; A spatial raster dataset that is mapping human settlements in Europe
Imperviousness Density
Resolution 10m; The percentage of sealed area
Impervious Built-up
Resolution 10m; The part of the sealed surfaces where buildings can be found
Grassland 2018
Resolution 10m; A binary grassland/non-grassland product
Tree Cover Density 2018
Resolution 10m; Level of tree cover density in a range from 0-100%
Joint Research Center:
Global Human Settlement Population Grid
(GHS-POP)
Resolution 250m; Residential population estimates for target year 2015
GHS settlement model layer
(GHS-SMOD)
Resolution 1km; The GHS Settlement Model grid delineates and classifies settlement typologies via a logic of population size, population density, and built-up area density
GHS-BUILT
Resolution 10m; Built-up grid derived from Sentinel-2 global image composite for reference year 2018
ENACT 2011 Population Grid
(ENACT-POP R2020A)
Resolution 1km; ENACT provides population density grids for the European Union that take into account major daily and monthly population variations
JRC Open Power Plants Database (JRC-PPDB-OPEN)
Europe's open power plant database
GHS functional urban areas
(GHS-FUA R2019A)
Resolution 1km; City and its commuting zone (area of influence of the city in terms of labour market flows)
GHS Urban Centre Database
(GHS-UCDB R2019A)
Resolution 1km; Urban Centres defined by specific cut-off values on resident population and built-up surface
Additional Data:
Open Street Map (OSM)
BF, Transportation Network, Utilities Network, Places of Interest
CEMS
Data from Rapid Mapping activations in Europe
GeoNames
Populated places, Adm. units, Hydrography, Forests, Hills/Mountains, Parks, etc.
Global Administrative Areas
Administrative areas of all countries, at all levels of sub-division
NUTS3 Population Age/Sex Group
Eurostat population-by-age-and-sex statistics intersected with the NUTS3 units
FLOPROS
A global database of FLOod PROtection Standards, which comprises information in the form of the flood return period associated with protection measures, at different spatial scales
Disclaimer:
ECFAS partners provide the data "as is" and "as available" without warranty of any kind. The ECFAS partners shall not be held liable for any consequences resulting from the use of the information and data provided.
This project has received funding from the Horizon 2020 research and innovation programme under grant agreement No. 101004211
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset originally created 03/01/2019. UPDATE: Packaged on 04/18/2019. UPDATE: Edited README on 04/18/2019.
I. About this Data Set This data set is a snapshot of ongoing work carried out as a collaboration between Kluge Fellow in Digital Studies Patrick Egan and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled “Connections In Sound”, invites you to use and re-use this data.
The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.
The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by the scholar's own searches of the collections, both online at https://catalog.loc.gov and in card catalogs.
Updates to these datasets will be announced and published as the project progresses.
II. What’s included? This data set includes:
III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improved access to, and connections among, sounds in the American Folklife Center collections, focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.
IV. Data Set Field Descriptions
a) Collections dataset field descriptions
b) Items dataset field descriptions
V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under Creative Commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to its creators.
VI. Creator and Contributor Information
Creator: Connections In Sound
Contributors: Library of Congress Labs
VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Sports Analysis: Coaches and analysts can use this computer vision model to track the performance of players during a game or practice session. They can get insights about precise ball movements, successful hits, and goal rates, leading to better training and strategic decisions.
Highlight Generation: Sports media companies can implement the "basketball" model to automatically detect exciting moments like successful goals or impressive hits during a game. This can enable them to create instant highlights for social media, web portals, or live broadcasts, enhancing user engagement.
Virtual Coaching: This model can be integrated into mobile applications or websites that offer virtual basketball coaching. Users would be able to upload their videos, and the model would provide them with feedback based on their technique, ball handling, and shooting accuracy.
Smart Camera Systems: The "basketball" model can be embedded in smart cameras for sports facilities or courts. This would allow the cameras to follow the action as it happens, automatically zooming in on goals or exciting plays, thus enhancing the overall viewing experience for spectators.
Basketball Simulation Games: Game developers can utilize the model's capability to recognize various aspects of a basketball game to create more realistic and engaging basketball simulation games. The AI-driven virtual players would exhibit authentic in-game actions and responses, providing a closer-to-real gaming experience to the users.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Spotify Million Song Dataset
Dataset Summary
This is the Spotify Million Song Dataset. It contains song names, artist names, links to the songs, and lyrics. It can be used for recommending, classifying, or clustering songs.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data… See the full description on the dataset page: https://huggingface.co/datasets/vishnupriyavr/spotify-million-song-dataset.
UMD-350MB
The Universal MIDI Dataset 350MB (UMD-350MB) is a proprietary collection of 85,618 MIDI files curated for research and development within our organization. This collection is a subset sampled from a larger dataset developed for pretraining symbolic music models.
The field of symbolic music generation is constrained by limited data compared to language models. Publicly available datasets, such as the Lakh MIDI Dataset, offer large collections of MIDI files sourced from the web. While the sheer volume of musical data might appear beneficial, the actual amount of valuable data is less than anticipated, as many songs contain less desirable melodies with erratic and repetitive events.
The UMD-350MB employs an attention-based approach to achieve more desirable output generations by focusing on human-reviewed training examples of single-track melodies, chord progressions, leads and arpeggios with an average duration of 8 bars. This was achieved by refining the dataset over 24 months, ensuring consistent quality and tempo alignment. Moreover, the dataset is normalized by setting the timing information to 120 BPM with a tick resolution (PPQ) of 96 and transposing the musical scales to C major and A minor (natural scales).
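The scale normalization described above can be illustrated with a small transposition sketch. The note representation (pitch, start tick, duration) and the key table below are assumptions for illustration only, not the dataset's actual internal format:

```python
# Semitone offset of each major key's tonic from C (illustrative helper).
MAJOR_KEYS = {"C": 0, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
              "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

def transpose_to_c(notes, key):
    """Shift every pitch so the melody lands in C major, moving in the
    nearest direction (down for keys up to a tritone away, else up)."""
    offset = MAJOR_KEYS[key]
    if offset > 6:
        offset -= 12  # e.g. B major moves up one semitone, not down eleven
    return [(pitch - offset, start, dur) for pitch, start, dur in notes]

# A D-major fragment (D, F#, A) at PPQ 96, one note per quarter.
melody_in_d = [(62, 0, 96), (66, 96, 96), (69, 192, 96)]
normalized = transpose_to_c(melody_in_d, "D")  # C, E, G
```

Timing normalization is simpler still: at 120 BPM the MIDI tempo value is a constant 500,000 microseconds per quarter note, and all tick values are rescaled to a 96 PPQ grid.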
Melody Styles
A major portion of the dataset is composed of newly produced private data to represent modern musical styles.
Actual MIDI files are unlabeled for unsupervised training.
Dataset Access
Please note that this is a closed-source dataset with very limited access. Considerations for access include proposals for data augmentation, chord extraction and other enhancement methods, whether through scripts, algorithmic techniques, manual editing in a DAW or additional processing methods.
For inquiries about this dataset, please email us.
https://academictorrents.com/nolicensespecified
To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random. It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field. The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, howeve
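Since the "additional files" are SQLite databases, they can be queried with Python's built-in sqlite3 module. The table and column names below are illustrative stand-ins populated with invented rows, not the exact MSD schema:

```python
import sqlite3

# Build a tiny in-memory database mimicking the idea of the metadata files;
# the schema and rows here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (track_id TEXT, title TEXT, artist_name TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [("TR0001", "Example Song", "Example Artist", 1999),
     ("TR0002", "Another Song", "Another Artist", 2004)],
)

# Count post-2000 tracks per artist, the kind of query you might develop
# on the 10K subset before porting it to the full dataset.
rows = conn.execute(
    "SELECT artist_name, COUNT(*) FROM songs "
    "WHERE year >= 2000 GROUP BY artist_name"
).fetchall()
```

Pointing `sqlite3.connect` at one of the downloaded database files instead of `:memory:` would run the same query against the real subset.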
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We present Da-TACOS: a dataset for cover song identification and understanding. It contains two subsets, namely the benchmark subset (for benchmarking cover song identification systems) and the cover analysis subset (for analyzing the links among cover songs), with pre-extracted features and metadata for 15,000 and 10,000 songs, respectively. The annotations included in the metadata are obtained with the API of SecondHandSongs.com. All audio files we use to extract features are encoded in MP3 format and their sample rate is 44.1 kHz. Da-TACOS does not contain any audio files. For the results of our analyses on modifiable musical characteristics using the cover analysis subset and our initial benchmarking of 7 state-of-the-art cover song identification algorithms on the benchmark subset, you can look at our publication.
For organizing the data, we use the structure of SecondHandSongs, where each song is called a ‘performance’ and each clique (cover group) is called a ‘work’. Based on this, the file names of the songs are their unique performance IDs (PID, e.g. P_22), and their labels with respect to their cliques are their work IDs (WID, e.g. W_14).
Metadata for each song includes
In addition, we matched the original metadata with MusicBrainz to obtain the MusicBrainz ID (MBID), song length, and genre/style tags. We would like to note that MusicBrainz-related information is not available for all the songs in Da-TACOS, and since we used just our metadata for matching, we include all possible MBIDs for a particular song.
For facilitating reproducibility in cover song identification (CSI) research, we propose a framework for feature extraction and benchmarking in our supplementary repository: acoss. The feature extraction component is designed to help CSI researchers find the most commonly used features for CSI in a single place. The parameter values we used to extract the features in Da-TACOS are shared in the same repository. Moreover, the benchmarking component includes our implementations of 7 state-of-the-art CSI systems. We provide the performance results of an initial benchmarking of those 7 systems on the benchmark subset of Da-TACOS. We encourage other CSI researchers to contribute to acoss by implementing their favorite feature extraction algorithms and CSI systems, to build up a knowledge base where CSI research can reach larger audiences.
The instructions for how to download and use the dataset are shared below. Please contact us if you have any questions or requests.
1. Structure
1.1. Metadata
We provide two metadata files that contain information about the benchmark subset and the cover analysis subset. Both metadata files are stored as python dictionaries in .json format, and have the same hierarchical structure.
An example to load the metadata files in python:
import json
with open('./da-tacos_metadata/da-tacos_benchmark_subset_metadata.json') as f:
benchmark_metadata = json.load(f)
The python dictionary obtained with the code above will have the respective WIDs as keys. Each key will provide the song dictionaries that contain the metadata regarding the songs that belong to their WIDs. An example can be seen below:
"W_163992": { # work id
"P_547131": { # performance id of the first song belonging to the clique 'W_163992'
"work_title": "Trade Winds, Trade Winds",
"work_artist": "Aki Aleong",
"perf_title": "Trade Winds, Trade Winds",
"perf_artist": "Aki Aleong",
"release_year": "1961",
"work_id": "W_163992",
"perf_id": "P_547131",
"instrumental": "No",
"perf_artist_mbid": "9bfa011f-8331-4c9a-b49b-d05bc7916605",
"mb_performances": {
"4ce274b3-0979-4b39-b8a3-5ae1de388c4a": {
"length": "175000"
},
"7c10ba3b-6f1d-41ab-8b20-14b2567d384a": {
"length": "177653"
}
}
},
"P_547140": { # performance id of the second song belonging to the clique 'W_163992'
"work_title": "Trade Winds, Trade Winds",
"work_artist": "Aki Aleong",
"perf_title": "Trade Winds, Trade Winds",
"perf_artist": "Dodie Stevens",
"release_year": "1961",
"work_id": "W_163992",
"perf_id": "P_547140",
"instrumental": "No"
}
}
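The WID → PID hierarchy shown above can be walked straightforwardly once loaded. In this sketch the inline dictionary is a trimmed stand-in for the dictionary that `json.load` returns:

```python
# Trimmed stand-in for the dictionary loaded from the metadata JSON.
benchmark_metadata = {
    "W_163992": {
        "P_547131": {"perf_artist": "Aki Aleong", "release_year": "1961"},
        "P_547140": {"perf_artist": "Dodie Stevens", "release_year": "1961"},
    }
}

def clique_sizes(metadata):
    """Number of performances (covers) in each clique, keyed by WID."""
    return {wid: len(clique) for wid, clique in metadata.items()}

for wid, clique in benchmark_metadata.items():
    for pid, song in clique.items():
        print(f"{wid}/{pid}: {song['perf_artist']} ({song['release_year']})")
```

The same loop works unchanged on the full metadata dictionary loaded with the `json.load` snippet above.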
1.2. Pre-extracted features
The list of features included in Da-TACOS can be seen below. All the features are extracted with the acoss repository, which uses open-source feature extraction libraries such as Essentia, LibROSA, and Madmom.
To facilitate the use of the dataset, we provide two options regarding the file structure.
1- In the da-tacos_benchmark_subset_single_files and da-tacos_coveranalysis_subset_single_files folders, we organize the data based on their respective cliques, and one file contains all the features for that particular song.
{
"chroma_cens": numpy.ndarray,
"crema": numpy.ndarray,
"hpcp": numpy.ndarray,
"key_extractor": {
"key": numpy.str_,
"scale": numpy.str_,
"strength": numpy.float64
},
"madmom_features": {
"novfn": numpy.ndarray,
"onsets": numpy.ndarray,
"snovfn": numpy.ndarray,
"tempos": numpy.ndarray
},
"mfcc_htk": numpy.ndarray,
"tags": list of (numpy.str_, numpy.str_),
"label": numpy.str_,
"track_id": numpy.str_
}
2- In the da-tacos_benchmark_subset_FEATURE and da-tacos_coveranalysis_subset_FEATURE folders, the data is organized based on cliques as well, but each of these folders contains only one feature per song. For instance, if you want to test a system that uses HPCP features, you can download da-tacos_benchmark_subset_hpcp to access the pre-computed HPCP features. An example of the contents of those files can be seen below:
{
"hpcp": numpy.ndarray,
"label": numpy.str_,
"track_id": numpy.str_
}
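As a toy illustration of consuming the per-feature files (this is not one of the benchmarked CSI systems, which use alignment-based methods), one could compare two songs by the cosine similarity of their time-averaged HPCP matrices. The random arrays below stand in for real (n_frames, 12) HPCP features:

```python
import numpy as np

# Random stand-ins for two songs' (n_frames, 12) HPCP matrices.
rng = np.random.default_rng(0)
hpcp_a = rng.random((500, 12))
hpcp_b = rng.random((450, 12))

def global_hpcp_similarity(a, b):
    """Cosine similarity between the time-averaged chroma profiles
    of two songs; a crude global-tonality comparison."""
    va, vb = a.mean(axis=0), b.mean(axis=0)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

sim = global_hpcp_similarity(hpcp_a, hpcp_b)
```

Averaging away the time axis discards the melodic structure that real CSI systems rely on, which is why this serves only as a baseline-style sketch.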
2. Using the dataset
2.1. Requirements
git clone https://github.com/MTG/da-tacos.git
cd da-tacos
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
2.2. Downloading the data
The dataset is currently stored only in Google Drive (it will be uploaded to Zenodo soon) and can be downloaded from this link. We also provide a Python script that automatically downloads the folders you specify. Basic usage of this script is shown below:
python download_da-tacos.py -h
usage: download_da-tacos.py [-h]
[--dataset {benchmark,coveranalysis,da-tacos}]
[--type {single_files,cens,crema,hpcp,key,madmom,mfcc,tags}]
[--source {gdrive,zenodo}]
[--outputdir OUTPUTDIR]
[--unpack]
[--remove]
Download script for Da-TACOS
optional arguments:
-h, --help show this help message and exit
--dataset {metadata,benchmark,coveranalysis,da-tacos}
which subset to download. 'da-tacos' option downloads
both subsets. the options other than 'metadata' will
download the metadata as well. (default: metadata)
--type {single_files,cens,crema,hpcp,key,madmom,mfcc,tags} [{single_files,cens,crema,hpcp,key,madmom,mfcc,tags} ...]
which folder to download. for downloading multiple
folders, you can enter multiple arguments (e.g. '--
type cens crema'). for detailed explanation, please
check https://mtg.github.io/da-tacos/ (default:
single_files)
--source {gdrive,zenodo}
from which source to download the files. you can
either download from Google Drive (gdrive) or from
Zenodo (zenodo) (default: gdrive)
--outputdir OUTPUTDIR
directory to store the dataset (default: ./)
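For illustration, the documented interface can be mirrored with argparse. This sketch only reproduces the option names and defaults from the help text above; it is not the download script itself:

```python
import argparse

# Minimal argparse sketch mirroring the documented options of
# download_da-tacos.py (names and defaults taken from the help text above).
def build_parser():
    feature_choices = ["single_files", "cens", "crema", "hpcp",
                       "key", "madmom", "mfcc", "tags"]
    p = argparse.ArgumentParser(description="Download script for Da-TACOS")
    p.add_argument("--dataset", default="metadata",
                   choices=["metadata", "benchmark", "coveranalysis", "da-tacos"])
    p.add_argument("--type", nargs="+", default=["single_files"],
                   choices=feature_choices)
    p.add_argument("--source", default="gdrive", choices=["gdrive", "zenodo"])
    p.add_argument("--outputdir", default="./")
    p.add_argument("--unpack", action="store_true")
    p.add_argument("--remove", action="store_true")
    return p

# e.g. request the CENS and CREMA folders for the benchmark subset:
args = build_parser().parse_args(
    ["--dataset", "benchmark", "--type", "cens", "crema", "--unpack"])
print(args.dataset, args.type, args.unpack)  # benchmark ['cens', 'crema'] True
```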
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes 118 recordings of sung melodies. The recordings were made as part of the experiments on Query-by-Humming (QBH) reported in the following article:
J. Salamon, J. Serrà and E. Gómez, "Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming", International Journal of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval, In Press (accepted Nov. 2012).
The recordings were made by 17 different subjects, 9 female and 8 male, whose musical experience ranged from none at all to amateur musicians. Subjects were presented with a list of songs, out of which they were asked to select the ones they knew and sing part of the melody. The subjects were aware that the recordings would be used as queries in an experiment on QBH. There was no restriction as to how much of the melody should be sung nor which part of the melody should be sung, and the subjects were allowed to sing the melody with or without lyrics. The subjects did not listen to the original songs before recording the queries, and the recordings were all sung a cappella, without any accompaniment or reference tone. To simulate a realistic QBH scenario, all recordings were made using a basic laptop microphone and no post-processing was applied. The duration of the recordings ranges from 11 to 98 seconds, with an average recording length of 26.8 seconds.
In addition to the query recordings, three meta-data files are included, one describing the queries and two describing the music collections against which the queries were tested in the experiments described in the aforementioned article. Whilst the query recordings are included in this dataset, audio files for the music collections listed in the meta-data files are NOT included in this dataset, as they are protected by copyright law. If you wish to reproduce the experiments reported in the aforementioned paper, it is up to you to obtain the original audio files of these songs.
All subjects have given their explicit approval for this dataset to be made public.
Please Acknowledge MTG-QBH in Academic Research
Using this dataset
When the MTG-QBH dataset is used for academic research, we would highly appreciate it if scientific publications of works partly based on the MTG-QBH dataset cited the above publication.
We are interested in knowing if you find our datasets useful! If you use our dataset please email us at mtg-info@upf.edu and tell us about your research.
A crystallographic fragment screening (CFS) campaign has been performed on a spliceosomal yeast protein-protein complex of Aar2 and the RNaseH-like domain of Prp8 (AR). The F2X-Universal Library is a fragment library representing the commercially available chemical space of fragments. 917 fragments were individually screened via crystal soaking. The datasets that could be successfully auto-processed and auto-refined were subjected to a Pan-Dataset Density Analysis (PanDDA) (Pearce et al., 2017), which allows low-occupancy binders to be found. The data has been analyzed in two ways: once with all datasets given as input to PanDDA, and once with the data clustered via cluster4x (Ginn, 2020) and the individual clusters given as input to PanDDA. After the analysis, a total of 269 hits could be identified in the PanDDA event maps.
Most fragments cluster in certain regions on the protein surface, which are termed binding sites; ten binding sites were identified. Some of these binding sites overlap with known protein-protein interaction sites, while others have no known function. These novel binding sites could be potential interaction sites too. Furthermore, due to the repeated binding of individual fragment hits in the same binding site, certain structural overlaps between fragments could be observed. These confirm binding modes, which offers additional information for further compound development.
The data provided here includes all folders of the different PanDDA runs in individual tar.gz files. For each run, the input (auto-refined structures, fragment structure files) and output data (PanDDA models and event and Z-maps) are given. Additionally, a directory overview is provided as a PDF to help navigate the data. With this data, every identified fragment hit can be inspected individually.
Spotify Million Playlist Dataset Challenge
Summary
The Spotify Million Playlist Dataset Challenge consists of a dataset and an evaluation to enable research in music recommendations. It is a continuation of the RecSys Challenge 2018, which ran from January to July 2018. The dataset contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017. The evaluation task is automatic playlist continuation: given a seed playlist title and/or an initial set of tracks in a playlist, predict the subsequent tracks in that playlist. This is an open-ended challenge intended to encourage research in music recommendations, and no prizes will be awarded (other than bragging rights).
Background
Playlists like Today’s Top Hits and RapCaviar have millions of loyal followers, while Discover Weekly and Daily Mix are just a couple of our personalized playlists made especially to match your unique musical tastes. Our users love playlists too. In fact, the Digital Music Alliance, in their 2018 Annual Music Report, states that 54% of consumers say that playlists are replacing albums in their listening habits. But our users don’t just love listening to playlists, they also love creating them. To date, over 4 billion playlists have been created and shared by Spotify users. People create playlists for all sorts of reasons: some playlists group together music categorically (e.g., by genre, artist, year, or city), by mood, theme, or occasion (e.g., romantic, sad, holiday), or for a particular purpose (e.g., focus, workout). Some playlists are even made to land a dream job, or to send a message to someone special. The other thing we love here at Spotify is playlist research. By learning from the playlists that people create, we can learn all sorts of things about the deep relationship between people and music. Why do certain songs go together? What is the difference between “Beach Vibes” and “Forest Vibes”?
And what words do people use to describe which playlists? By learning more about the nature of playlists, we may also be able to suggest other tracks that a listener would enjoy in the context of a given playlist. This can make playlist creation easier, and ultimately help people find more of the music they love.
Dataset
To enable this type of research at scale, in 2018 we sponsored the RecSys Challenge 2018, which introduced the Million Playlist Dataset (MPD) to the research community. Sampled from the over 4 billion public playlists on Spotify, this dataset of 1 million playlists consists of over 2 million unique tracks by nearly 300,000 artists, and represents the largest public dataset of music playlists in the world. The dataset includes public playlists created by US Spotify users between January 2010 and November 2017. The challenge ran from January to July 2018, and received 1,467 submissions from 410 teams. A summary of the challenge and the top scoring submissions was published in the ACM Transactions on Intelligent Systems and Technology. In September 2020, we re-released the dataset as an open-ended challenge on AIcrowd.com. The dataset can now be downloaded by registered participants from the Resources page. Each playlist in the MPD contains a playlist title, the track list (including track IDs and metadata), and other metadata fields (last edit time, number of playlist edits, and more). All data is anonymized to protect user privacy. Playlists are sampled with some randomization, are manually filtered for playlist quality and to remove offensive content, and have some dithering and fictitious tracks added to them. As such, the dataset is not representative of the true distribution of playlists on the Spotify platform, and must not be interpreted as such in any research or analysis performed on the dataset.
The challenge dataset contains 1,000 examples of each scenario:
- Title only (no tracks)
- Title and first track
- Title and first 5 tracks
- First 5 tracks only
- Title and first 10 tracks
- First 10 tracks only
- Title and first 25 tracks
- Title and 25 random tracks
- Title and first 100 tracks
- Title and 100 random tracks
Full details: https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge
Download link: https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files
Reference: C.W. Chen, P. Lamere, M. Schedl, and H. Zamani. RecSys Challenge 2018: Automatic Music Playlist Continuation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18), 2018.
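The seed scenarios above can be reproduced from a complete playlist by truncation. A small sketch with an illustrative record layout (the field names here are hypothetical, not the exact MPD schema):

```python
# Sketch: derive e.g. the "title and first 5 tracks" scenario from a
# complete playlist. The record layout is illustrative only.
def make_scenario(playlist, n_seed, keep_title=True):
    """Split a playlist into a seed (given to the system) and a holdout
    (the tracks the system must predict)."""
    seed = {
        "name": playlist["name"] if keep_title else None,
        "tracks": playlist["tracks"][:n_seed],
    }
    holdout = playlist["tracks"][n_seed:]
    return seed, holdout

playlist = {"name": "road trip", "tracks": [f"track_{i}" for i in range(10)]}
seed, holdout = make_scenario(playlist, n_seed=5)
print(len(seed["tracks"]), len(holdout))  # 5 5
```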
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
César E. Corona-González, Claudia Rebeca De Stefano-Ramos, Juan Pablo Rosado-Aíza, Fabiola R Gómez-Velázquez, David I. Ibarra-Zarate, Luz María Alonso-Valerdi
César E. Corona-González
https://orcid.org/0000-0002-7680-2953
a00833959@tec.mx
Psychophysiological data from Mexican children with learning difficulties who strengthen reading and math skills by assistive technology
2023
The current dataset consists of psychometric and electrophysiological data from children with reading or math learning difficulties. These data were collected to evaluate improvements in reading or math skills resulting from using an online learning method called Smartick.
The psychometric evaluations for children with reading difficulties encompassed: spelling tests, where 1) orthographic and 2) phonological errors were considered, 3) reading speed, expressed in words read per minute, and 4) reading comprehension, where multiple-choice questions were given to the children. The last two parameters were determined according to the standards from the Ministry of Public Education (Secretaría de Educación Pública in Spanish) in Mexico. On the other hand, assessments for the math difficulties group comprised: 1) an assessment of general mathematical knowledge, as well as 2) the percentage of hits and 3) the reaction time in an arithmetic task. Additionally, selective attention and intelligence quotient (IQ) were also evaluated.
Then, individuals underwent an EEG experimental paradigm where two conditions were recorded: 1) a 3-minute eyes-open resting state and 2) performing either reading or mathematical activities. EEG recordings from the reading experiment consisted of reading a text aloud and then answering questions about the text. Alternatively, EEG recordings from the math experiment involved the solution of two blocks with 20 arithmetic operations (addition and subtraction). Subsequently, each child was randomly subcategorized as 1) the experimental group, who were asked to engage with Smartick for three months, and 2) the control group, who were not involved with the intervention. Once the 3-month period was over, every child was reassessed as described before.
The dataset contains a total of 76 subjects (sub-), where two study groups were assessed: 1) reading difficulties (R) and 2) math difficulties (M). Each individual was then subcategorized into an experimental subgroup (e), in which children committed to engaging with Smartick, or a control subgroup (c), which did not take part in any intervention.
Every subject was followed up on for three months. During this period, each subject underwent two EEG sessions, representing the PRE-intervention (ses-1) and the POST-intervention (ses-2).
The EEG recordings from the reading difficulties group consisted of a resting state condition (run-1) and while performing active reading and reading comprehension activities (run-2). On the other hand, EEG data from the math difficulties group was collected from a resting state condition (run-1) and when solving two blocks of 20 arithmetic operations (run-2 and run-3). All EEG files were stored in .set format. The nomenclature and description from filenames are shown below:
Nomenclature | Description |
---|---|
sub- | Subject |
M | Math group |
R | Reading group |
c | Control subgroup |
e | Experimental subgroup |
ses-1 | PRE-intervention |
ses-2 | POST-Intervention |
run-1 | EEG for baseline |
run-2 | EEG for reading activity, or the first block of math |
run-3 | EEG for the second block of math |
Example: the file sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set is related to: - The 11th subject from the reading difficulties group, control subgroup (sub-Rc11). - EEG recording from the PRE-intervention (ses-1) while performing the reading activity (run-2)
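The nomenclature above can be decoded programmatically. A small sketch, assuming filenames follow exactly the pattern of the example:

```python
import re

# Sketch: decode the BIDS-style filenames described above into their
# group / subgroup / session / run components.
PATTERN = re.compile(
    r"sub-(?P<group>[MR])(?P<subgroup>[ce])(?P<subject>\d+)"
    r"_ses-(?P<session>\d)_task-SmartickDataset_run-(?P<run>\d)_eeg\.set")

GROUPS = {"M": "math difficulties", "R": "reading difficulties"}
SUBGROUPS = {"c": "control", "e": "experimental"}
SESSIONS = {"1": "PRE-intervention", "2": "POST-intervention"}

def describe(filename):
    m = PATTERN.fullmatch(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    return (GROUPS[m["group"]], SUBGROUPS[m["subgroup"]],
            int(m["subject"]), SESSIONS[m["session"]], int(m["run"]))

print(describe("sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set"))
# ('reading difficulties', 'control', 11, 'PRE-intervention', 2)
```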
Psychometric data from the reading difficulties group:
Psychometric data from the math difficulties group:
Psychometric data can be found in the 01_Psychometric_Data.xlsx file
Engagement percentages can be found in the 05_SessionEngagement.xlsx file
Seventy-six Mexican children between 7 and 13 years old were enrolled in this study.
The sample was recruited through non-profit foundations that support learning and foster care programs.
g.USBamp RESEARCH amplifier
The stimuli nested folder contains all stimuli employed in the EEG experiments.
Level 1 - Math: Images used in the math experiment. - Reading: Images used in the reading experiment.
Level 2
- Math
* POST_Operations: arithmetic operations from the POST-intervention.
* PRE_Operations: arithmetic operations from the PRE-intervention.
- Reading
* POST_Reading1: text 1 and text-related comprehension questions from the POST-intervention.
* POST_Reading2: text 2 and text-related comprehension questions from the POST-intervention.
* POST_Reading3: text 3 and text-related comprehension questions from the POST-intervention.
* PRE_Reading1: text 1 and text-related comprehension questions from the PRE-intervention.
* PRE_Reading2: text 2 and text-related comprehension questions from the PRE-intervention.
* PRE_Reading3: text 3 and text-related comprehension questions from the PRE-intervention.
Level 3
- Math
* Operation01.jpg to Operation20.jpg: arithmetic operations solved during the first block of the math experiment
GPUZIP v2.0 Reproducibility Dataset
This dataset provides all the necessary materials to reproduce the results presented in the GPUZIP v2.0 article. It is organized into folders, each containing a README.md.txt file that describes its contents and explains how to interpret the files.
Note: This dataset is organized as a directory structure, so for better visualization change the "View type" to "Tree" before exploring the dataset through this web application.
Types of Files
The repository contains the following file types:
- .md.txt: Markdown-formatted README files. For optimal readability, use a Markdown viewer such as VSCode; as a straightforward alternative, any text reader (e.g., Notepad, cat, vi, nano) can also open them.
- .zipfile: Compressed files (usually called .zip). Files in the .extension.zipfile format (e.g., large-mod.su.zipfile) should be unzipped to access their original format (e.g., large-mod.su). Throughout the documentation, files are always referenced by their uncompressed extensions (e.g., .su). To ensure consistency and avoid confusion, it is recommended that all .zipfile files be unzipped before exploring the repository. Hint: see the scripts below for unzipping all files.
- .xlsx: Excel files. Compatible with LibreOffice, Google Sheets, and Numbers.
- .par: Configuration files for proprietary RTM runs. Readable with any text editor.
- .hdr: Header files for velocity models. Refer to Datasets/HowToReadDatasetFiles.md.txt for details.
- .bin: Raw binary data files containing velocity models in float format. See Datasets/HowToReadDatasetFiles.md.txt for parsing instructions.
- .data: Binary data files, similar to .bin.
- .su: Seismic Unix files containing seismic traces. Refer to Datasets/HowToReadDatasetFiles.md.txt for details.
- .png, .jpg, .jpeg, .gif: Rendered visuals of velocity models or diagrams.
- .qdrep: Nsight Systems profiling files. Compatible with Nsight Systems 2024.01.1.
Root Directory Contents
- Datasets/: Input datasets, including velocity models, seismic traces, and configurations. Detailed information is provided in Datasets/HowToReadDatasetFiles.md.txt.
- DataWarmUp/: Results from compressor calibration experiments, including raw data, logs, and the compiled .xlsx summaries. Experiments were conducted with two shots. See DataWarmUp/README.md.txt for more information.
- GeometryScript/: Utility script for rendering shot distributions in the datasets. Helpful for visualizing experiment setups.
- NSight/: A subset of Nsight profiling files for the Marmousi3D dataset, covering all compressors and a cache size of two across all checkpointing algorithms. If needed, contact the authors for additional profiling data.
- Quality/: Results for all shots for the quality assessment (Section 7.6). See Quality/README.md.txt.
- TimeBreakdown/: Complete results for Section 7.4 of the GPUZIP v2.0 article, including detailed breakdowns of the two-shot experiments. See TimeBreakdown/README.md.txt for details.
- SpeedupAndMemory.xlsx: Comprehensive data used to generate the charts in Figure 6 and Table 4 (Sections 7.2 and 7.1) of the article.
Extra: Util for Unzipping All Files
We provide simple scripts to unzip all files so that data exploration can be more fluid. Feel free to use them.
Windows (.bat):
@echo off
setlocal enabledelayedexpansion
for /r %%f in (*.zipfile) do (
    echo Decompressing: %%f
    powershell -Command "Expand-Archive -Path '%%f' -DestinationPath '%%~dpf' -Force"
    if not errorlevel 1 (
        echo Decompressed successfully: %%f
        del "%%f"
    ) else (
        echo Failed to decompress: %%f
    )
)
echo All zip files processed.
pause
Shell script (macOS, Linux, Unix):
#!/bin/bash
find . -type f -name "*.zipfile" | while read -r zipfile; do
    echo "Decompressing: $zipfile"
    unzip -o "$zipfile" -d "$(dirname "$zipfile")"
    if [ $? -eq 0 ]; then
        echo "Successfully decompressed: $zipfile"
        rm "$zipfile"
    else
        echo "Failed to decompress: $zipfile"
    fi
done
echo "All zip files processed."
How Do I Read .bin, .data, and .su Files?
See: Datasets/HowToReadDatasetFiles.md.txt
How Do I Read .par and .hdr Files?
See: Datasets/HowToReadDatasetFiles.md.txt
How to Interpret Log Files?
To analyze cache hits, misses, and memory consumption, refer to the logs in the TimeBreakdown folder (decom-*.txt files). Key metrics can be extracted as follows:
- Cache Hits: search for RET_HIT.
- Cache Misses: search for RET_MIS.
- Prefetched Items: search for ===> Prefetching:.
- Prefetch Action Vector (PAV): search for PAV:.
- Memory Consumption: search for [MEM_TRACK].
- Checkpoint Pool Size: search for Checkpoint Pool Size.
Each log file concludes with a summary from Nsight.
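Following the marker list above, a log can be tallied with simple substring counts. The sample log below is fabricated for illustration; only the marker strings come from the description:

```python
# Sketch: tally the documented markers in a decom-*.txt log file.
MARKERS = {
    "cache_hits": "RET_HIT",
    "cache_misses": "RET_MIS",
    "prefetched": "===> Prefetching:",
    "memory_track": "[MEM_TRACK]",
}

def tally(log_text):
    """Count occurrences of each documented marker in the log text."""
    return {name: log_text.count(marker) for name, marker in MARKERS.items()}

# Fabricated sample log lines:
sample_log = "\n".join([
    "RET_HIT key=12", "RET_MIS key=13", "RET_HIT key=12",
    "===> Prefetching: key=14", "[MEM_TRACK] 1024 MB",
])
print(tally(sample_log))
# {'cache_hits': 2, 'cache_misses': 1, 'prefetched': 1, 'memory_track': 1}
```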
Data licence Germany – Attribution – Version 2.0https://www.govdata.de/dl-de/by-2-0
License information was derived automatically
The BMDV open data portal mCLOUD offers an export interface (REST-API) through which the data can be exported as RDF according to the DCAT-AP.de specification or as CSV.
The parameters in the requests are based on the parameters in the portal for a remote search (URL).
At the end of a hit page in the portal, the export is always offered. So one possibility is to search the portal as normal and then copy the export URL at the end of a page.
All data sets that have been added in the last 24 hours:
filter=newdatasets
https://mcloud.de/export/datasets?filter=newdatasets
All datasets that were changed in the last 24 hours (this also includes newly added datasets):
filter=modifieddatasets
https://mcloud.de/export/datasets?filter=modifieddatasets
pageSize=10 (number of datasets per page)
page=1 (display the first page)
https://mcloud.de/export/datasets?page=1&pageSize=10
The DCAT-AP.de export always includes navigation information at the beginning:
itemsPerPage (= pageSize parameter)
totalItems (total number)
firstPage (= first page for the page parameter)
lastPage (= last page for the page parameter)
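The export URLs above can be assembled programmatically; a small sketch using only the documented parameters (filter, page, pageSize):

```python
from urllib.parse import urlencode

# Sketch: build an mCLOUD export URL from the documented parameters.
BASE = "https://mcloud.de/export/datasets"

def export_url(filter_name=None, page=None, page_size=None):
    """Assemble the export URL; omitted parameters are left out."""
    params = {}
    if filter_name is not None:
        params["filter"] = filter_name
    if page is not None:
        params["page"] = page
    if page_size is not None:
        params["pageSize"] = page_size
    return f"{BASE}?{urlencode(params)}" if params else BASE

print(export_url(filter_name="newdatasets"))
# https://mcloud.de/export/datasets?filter=newdatasets
print(export_url(page=1, page_size=10))
# https://mcloud.de/export/datasets?page=1&pageSize=10
```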
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
In the Loving Memory 💐 of Liam Payne (1993-2024)
Dataset Description:
Acknowledgements Wikipedia, ChatGPT, genius.com
OPEN FOR COLLABORATION: I am open to collaborating with anyone who wants to add features to this dataset or knows how to collect data using APIs (for instance, the Spotify API for developers).