FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.
Citation
If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
About this dataset
Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.
The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.
All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.
The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:
"Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".
Some other relevant characteristics of FSDKaggle2018:
The dataset is split into a train set and a test set.
The train set is meant for system development and includes ~9.5k samples unequally distributed among the 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum is 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.
Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.
Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.
The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.
All audio samples in this dataset have a single label (i.e. they are annotated with only one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
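As a quick illustration of working with these splits, the sketch below loads the train metadata with pandas and separates verified from non-verified clips; the column names label and manually_verified are assumptions and should be checked against the CSV.

import pandas as pd

# Load the train metadata (train.csv from the competition, or train_post_competition.csv from this release).
train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")

# Assumed flag column: 1 = manually-verified annotation, 0 = non-verified.
verified = train[train["manually_verified"] == 1]
non_verified = train[train["manually_verified"] == 0]

print(len(verified), "verified clips;", len(non_verified), "non-verified clips")
print(train["label"].value_counts().head())  # per-category clip counts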
Data labeling process
The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.
Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.
Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.
The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within the set.
More details about the data labeling process can be found in [3].
License
FSDKaggle2018 has licenses at two different levels, as explained next.
All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of the audio clips included in FSDKaggle2018 and their corresponding licenses. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.
In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.
Files
FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2018.audio_train/                      Audio clips in the train set
│
└───FSDKaggle2018.audio_test/                       Audio clips in the test set
│
└───FSDKaggle2018.meta/                             Files for evaluation setup
│   │
│   └───train_post_competition.csv                  Data split and ground truth for the train set
│   │
│   └───test_post_competition_scoring_clips.csv     Ground truth for the test set
│
└───FSDKaggle2018.doc/
    │
    └───README.md                                   The dataset description file you are reading
    │
    └───LICENSE-DATASET
Overview
This data set consists of links to social network items for 34 different forensic events that took place between August 14th, 2018 and January 6th, 2021. The majority of the text and images are from Twitter (a minor part is from Flickr, Facebook and Google+), and every video is from YouTube.
Data Collection
We used Social Tracker, along with the social media platforms' APIs, to gather most of the collections. For a minor part, we used Twint. In both cases, we provided keywords related to the event to receive the data. It is important to mention that, in procedures like this one, usually only a small fraction of the collected data is in fact related to the event and useful for a further forensic analysis.
Content
We have data from 34 events, and for each of them we provide the following files:
items_full.csv: Contains links to every social media post that was collected.
images.csv: Lists the images collected. In some files there is a field called "ItemUrl" that refers to the social network post (e.g., a tweet) that mentions that media.
video.csv: URLs of YouTube videos that were gathered about the event.
video_tweet.csv: Contains IDs of tweets and IDs of YouTube videos. A tweet whose ID is in this file has a video in its content. In turn, the link of a YouTube video whose ID is in this file was mentioned by at least one collected tweet. Only two collections have this file.
description.txt: Contains some standard information about the event, and possibly some comments about any specific issue related to it.
In fact, most of the collections do not have all the files above. This is due to changes in our collection procedure throughout the time of this work.
Events
We divided the events into six groups:
Fire: Devastating fire is the main issue of the event, so most of the informative pictures show flames or burned constructions. 14 events.
Collapse: Most of the relevant images depict collapsed buildings, bridges, etc. (not caused by fire). 5 events.
Shooting: Likely images of guns and police officers. Few or no destruction of the environment. 5 events.
Demonstration: Plethora of people on the streets. Possibly some problem took place there, but in most cases the demonstration is the actual event. 7 events.
Collision: Traffic collision. Pictures of damaged vehicles in an urban landscape. Possibly there are images with victims on the street. 1 event.
Flood: Events that range from fierce rain to a tsunami. Many pictures depict water. 2 events.
Media Content
Due to the terms of use of the social networks, we do not make the collected texts, images and videos publicly available. However, we can provide some additional media content related to one (or more) events; please contact the authors.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.
SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.
This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.
The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:
These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.
This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarize counts of enrollees who are dually eligible for both the Medicare and Medicaid programs, including those in Medicare Savings Programs. "Duals" represent 20 percent of all Medicare beneficiaries, yet they account for 34 percent of all spending by the program, according to the Commonwealth Fund. As a representation of this high-needs, high-cost population, these data offer a view of regions ripe for more intensive care coordination that can address complex social and clinical needs. In addition to the high cost-savings opportunity of delivering upstream clinical interventions, this population represents the county-by-county volume of patients who are eligible for both state-level (Medicaid) and federal-level (Medicare) reimbursements and potential funding streams to address unmet social needs across various programs, waivers, and other projects. The dataset includes eligibility type and enrollment by quarter, at both the state and county level. These data represent monthly snapshots submitted by states to the CMS, which are inherently lower than ever-enrolled counts (which include persons enrolled at any time during a calendar year). For more information on dually eligible beneficiaries
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.sdoh_cms_dual_eligible_enrollment.
In which counties in Michigan has the number of dual-eligible individuals increased the most from 2015 to 2018? Find the counties in Michigan that have experienced the largest increase in dual enrollment.
WITH duals_Jan_2018 AS (
  -- This CTE was missing from the listing; reconstructed to mirror duals_Jan_2015 below. The 2018-12-01 date is an assumption.
  SELECT Public_Total AS duals_2018, County_Name, FIPS
  FROM bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program
  WHERE State_Abbr = "MI" AND Date = '2018-12-01'
),
duals_Jan_2015 AS (
  SELECT Public_Total AS duals_2015, County_Name, FIPS
  FROM bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program
  WHERE State_Abbr = "MI" AND Date = '2015-12-01'
),
duals_increase AS (
  SELECT d18.FIPS, d18.County_Name, d15.duals_2015, d18.duals_2018, (d18.duals_2018 - d15.duals_2015) AS total_duals_diff
  FROM duals_Jan_2018 d18 JOIN duals_Jan_2015 d15 ON d18.FIPS = d15.FIPS
)
SELECT * FROM duals_increase WHERE total_duals_diff IS NOT NULL ORDER BY total_duals_diff DESC
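As a hedged sketch, a query against this table can be run from a Kernel with the BigQuery Python client library (to_dataframe() additionally requires pandas):

from google.cloud import bigquery

client = bigquery.Client()  # in a Kaggle Kernel, credentials are provided automatically

sql = """
SELECT County_Name, Public_Total AS duals_2015
FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
WHERE State_Abbr = 'MI' AND Date = '2015-12-01'
ORDER BY duals_2015 DESC
LIMIT 10
"""

df = client.query(sql).to_dataframe()
print(df)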
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset and the validation are fully described in a Nature Scientific Data Descriptor https://www.nature.com/articles/s41597-019-0265-5
If you want to use this dataset in an interactive environment, then use this link https://mybinder.org/v2/gh/GeographerAtLarge/TravelTime/HEAD
The following text is a summary of the information in the above Data Descriptor.
The dataset is a suite of global travel-time accessibility indicators for the year 2015, at approximately one-kilometre spatial resolution for the entire globe. The indicators show an estimated (and validated) land-based travel time to the nearest city and nearest port for a range of city and port sizes.
The datasets are in GeoTIFF format and are suitable for use in Geographic Information Systems and statistical packages for mapping access to cities and ports and for spatial and statistical analysis of the inequalities in access by different segments of the population.
These maps represent a unique global representation of physical access to essential services offered by cities and ports.
The datasets
travel_time_to_cities_x.tif (where x has values from 1 to 12): The value of each pixel is the estimated travel time in minutes to the nearest urban area in 2015. There are 12 data layers based on different sets of urban areas, defined by their population in year 2015 (see PDF report).
travel_time_to_ports_x (where x ranges from 1 to 5): The value of each pixel is the estimated travel time to the nearest port in 2015. There are 5 data layers based on different port sizes.
Format: Raster Dataset, GeoTIFF, LZW compressed
Unit: Minutes
Data type: Unsigned 16-bit integer
No data value: 65535
Flags: None
Spatial resolution: 30 arc seconds
Spatial extent: Upper left -180, 85; Lower left -180, -60; Upper right 180, 85; Lower right 180, -60
Spatial Reference System (SRS): EPSG:4326 - WGS84 - Geographic Coordinate System (lat/long)
Temporal resolution: 2015
Temporal extent: Updates may follow for future years, but these are dependent on the availability of updated inputs on travel times and city locations and populations.
Methodology Travel time to the nearest city or port was estimated using an accumulated cost function (accCost) in the gdistance R package (van Etten, 2018). This function requires two input datasets: (i) a set of locations to estimate travel time to and (ii) a transition matrix that represents the cost or time to travel across a surface.
The set of locations were based on populated urban areas in the 2016 version of the Joint Research Centre’s Global Human Settlement Layers (GHSL) datasets (Pesaresi and Freire, 2016) that represent low density (LDC) urban clusters and high density (HDC) urban areas (https://ghsl.jrc.ec.europa.eu/datasets.php). These urban areas were represented by points, spaced at 1km distance around the perimeter of each urban area.
Marine ports were extracted from the 26th edition of the World Port Index (NGA, 2017), which contains the location and physical characteristics of approximately 3,700 major ports and terminals. Ports are represented as single points.
The transition matrix was based on the friction surface (https://map.ox.ac.uk/research-project/accessibility_to_cities) from the 2015 global accessibility map (Weiss et al, 2018).
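The authors used the accCost function from the R gdistance package; purely to illustrate the accumulated-cost idea (this is not their workflow), a least-cost travel-time surface can be sketched in Python with scikit-image, assuming a friction raster expressed in minutes per pixel and a list of target pixel coordinates:

import numpy as np
from skimage.graph import MCP_Geometric

# Hypothetical friction surface: minutes needed to cross each pixel.
friction = np.full((200, 200), 1.0)
friction[50:150, 100] = 30.0  # a slow strip, e.g. rough terrain

# Target locations (row, col), e.g. points around city perimeters or ports.
targets = [(10, 10), (180, 170)]

# Accumulate the cheapest cost from any target to every pixel.
mcp = MCP_Geometric(friction)
travel_time, _ = mcp.find_costs(starts=targets)

print(travel_time[100, 150])  # estimated minutes from this pixel to the nearest target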
Code The R code used to generate the 12 travel time maps is included in the zip file that can be downloaded with these data layers. The processing zones are also available.
Validation The underlying friction surface was validated by comparing travel times between 47,893 pairs of locations against journey times from a Google API. Our estimated journey times were generally shorter than those from the Google API. Across the tiles, the median journey time from our estimates was 88 minutes within an interquartile range of 48 to 143 minutes while the median journey time estimated by the Google API was 106 minutes within an interquartile range of 61 to 167 minutes. Across all tiles, the differences were skewed to the left and our travel time estimates were shorter than those reported by the Google API in 72% of the tiles. The median difference was −13.7 minutes within an interquartile range of −35.5 to 2.0 minutes while the absolute difference was 30 minutes or less for 60% of the tiles and 60 minutes or less for 80% of the tiles. The median percentage difference was −16.9% within an interquartile range of −30.6% to 2.7% while the absolute percentage difference was 20% or less in 43% of the tiles and 40% or less in 80% of the tiles.
This process and results are included in the validation zip file.
Usage Notes The accessibility layers can be visualised and analysed in many Geographic Information Systems or remote sensing software such as QGIS, GRASS, ENVI, ERDAS or ArcMap, and also by statistical and modelling packages such as R or MATLAB. They can also be used in cloud-based tools for geospatial analysis such as Google Earth Engine.
The nine layers represent travel times to human settlements of different population ranges. Two or more layers can be combined into one layer by recording the minimum pixel value across the layers. For example, a map of travel time to the nearest settlement of 5,000 to 50,000 people could be generated by taking the minimum of the three layers that represent the travel time to settlements with populations between 5,000 and 10,000; 10,000 and 20,000; and 20,000 and 50,000 people.
The accessibility layers also permit user-defined hierarchies that go beyond computing the minimum pixel value across layers. A user-defined complete hierarchy can be generated when the union of all categories adds up to the global population, and the intersection of any two categories is empty. Everything else is up to the user in terms of logical consistency with the problem at hand.
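A minimal sketch of the minimum-value combination described above, using rasterio and numpy (which layer index corresponds to which population band is documented in the PDF report, so the three file names below are assumptions):

import numpy as np
import rasterio

# Assumed layer choice: the three population bands between 5,000 and 50,000 people.
paths = ["travel_time_to_cities_9.tif",
         "travel_time_to_cities_10.tif",
         "travel_time_to_cities_11.tif"]

with rasterio.open(paths[0]) as src:
    profile = src.profile          # reuse georeferencing, dtype and compression settings
    combined = src.read(1)

for path in paths[1:]:
    with rasterio.open(path) as src:
        # The 65535 no-data value only survives where all layers are no-data.
        combined = np.minimum(combined, src.read(1))

with rasterio.open("travel_time_to_cities_5k_50k.tif", "w", **profile) as dst:
    dst.write(combined, 1)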
The accessibility layers are relative measures of the ease of access from a given location to the nearest target. While the validation demonstrates that they do correspond to typical journey times, they cannot be taken to represent actual travel times. Errors in the friction surface will be accumulated as part of the accumulative cost function and it is likely that locations that are further away from targets will have a greater divergence from a plausible travel time than those that are closer to the targets. Care should be taken when referring to travel time to the larger cities when the locations of interest are extremely remote, although they will still be plausible representations of relative accessibility. Furthermore, a key assumption of the model is that all journeys will use the fastest mode of transport and take the shortest path.
https://creativecommons.org/publicdomain/zero/1.0/
Taxicabs in Chicago, Illinois, are operated by private companies and licensed by the city. There are about seven thousand licensed cabs operating within the city limits. Licenses are obtained through the purchase or lease of a taxi medallion which is then affixed to the top right hood of the car. Source: https://en.wikipedia.org/wiki/Taxicabs_of_the_United_States#Chicago
This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. See http://digital.cityofchicago.org/index.php/chicago-taxi-data-released for more information about this dataset and how it was created.
Fork this kernel to get started.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips
https://cloud.google.com/bigquery/public-data/chicago-taxi
Dataset Source: City of Chicago
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source —https://data.cityofchicago.org — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by Ferdinand Stohr from Unsplash.
What are the maximum, minimum and average fares for rides lasting 10 minutes or more? Which drop-off areas have the highest average tip? How does trip duration affect fare rates for trips lasting less than 90 minutes?
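As a hedged example for the first question, using the BigQuery Python client (the fare and trip_seconds column names are assumed from the public table's schema):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  MAX(fare) AS max_fare,
  MIN(fare) AS min_fare,
  AVG(fare) AS avg_fare
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_seconds >= 10 * 60
"""

print(client.query(sql).to_dataframe())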
Image (Chicago taxi fares by duration): https://cloud.google.com/bigquery/images/chicago-taxi-fares-by-duration.png
The Area Deprivation Index (ADI) can show where areas of deprivation and affluence exist within a community. The ADI is calculated with 17 indicators from the American Community Survey (ACS); it has been well studied in the peer-reviewed literature since 2003 and used for 20 years by the Health Resources and Services Administration (HRSA). High levels of deprivation have been linked to health outcomes such as 30-day hospital readmission rates, cardiovascular disease deaths, cervical cancer incidence, cancer deaths, and all-cause mortality. The 17 indicators of the ADI encompass income, education, employment, and housing conditions at the Census Block Group level.
The ADI is available on BigQuery for release years 2018-2020 and is reported as a percentile from 0-100%, with 50% indicating a "middle of the nation" percentile. Data is provided at the county, ZIP, and Census Block Group levels. Neighborhood and racial disparities occur when some neighborhoods have high ADI scores and others have low scores. A low ADI score indicates affluence or prosperity; a high ADI score is indicative of high levels of deprivation. Raw ADI scores and additional statistics and data visualizations can be seen in this ADI story with a free BroadStreet account. Much of the ADI research and popularity would not be possible without the excellent work of Dr. Amy Kind and colleagues at HIPxChange and at the University of Wisconsin-Madison.
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery? Learn more.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the dataset that accompanies the paper titled "A Dual-Frequency Radar Retrieval of Snowfall Properties Using a Neural Network", submitted for peer review in August 2020. Please see the GitHub repository for the most up-to-date data after the revision process: https://github.com/dopplerchase/Chase_et_al_2021_NN
Authors: Randy J. Chase, Stephen W. Nesbitt and Greg M. McFarquhar
Corresponding author: Randy J. Chase (randyjc2@illinois.edu)
Here we have the data used in the manuscript. Please email me if you have specific questions about units etc.
1) DDA/GMM database of scattering properties: base_df_DDA.csv. This is the combined dataset from the following papers: Leinonen & Moisseev, 2015; Leinonen & Szyrmer, 2015; Lu et al., 2016; Kuo et al., 2016; Eriksson et al., 2018. The column names are D: maximum dimension in meters, M: particle mass in grams kg, sigma_ku: backscatter cross-section at Ku in m^2, sigma_ka: backscatter cross-section at Ka in m^2, sigma_w: backscatter cross-section at W in m^2. The first column is just an index column.
2) Synthetic data used to train and test the neural network: Unrimed_simulation_wholespecturm_train_V2.nc, Unrimed_simulation_wholespecturm_test_V2.nc. These were the result of combining the PSDs and DDA/GMM particles randomly to build the training and test datasets.
3) Notebook for training the network using the synthetic database and Google Colab (TensorFlow): Train_Neural_Network_Chase2020.ipynb. This is the notebook used to train the neural network.
4) Trained TensorFlow neural network: NN_6by8.h5. This is the hdf5 TensorFlow model that resulted from the training. You will need this to run the retrieval.
5) Scalers needed to apply the neural network: scaler_X_V2.pkl, scaler_y_V2.pkl. These are the sklearn scalers used in training the neural network. You will need these to scale your data if you wish to run the retrieval.
6) New in this version: an example notebook showing how to run the trained neural network on Ku- and Ka-band observations. We showed this with the 3rd case in the paper: Run_Chase2021_NN.ipynb
7) New in this version: APR data used to show how to run the neural network retrieval: Chase_2021_NN_APR03Dec2015.nc
The data for the analysis on the observations are not provided here because of the size of the radar data. Please see the GHRC website (https://ghrc.nsstc.nasa.gov/home/) if you wish to download the radar and in-situ data, or contact me and we can coordinate transferring the exact data files used. The GPM-DPR data are available here: http://dx.doi.org/10.5067/GPM/DPR/GPM/2A/05
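A minimal sketch of applying the released model and scalers (the file names come from the list above, but the number, ordering and units of the input features are assumptions; see Run_Chase2021_NN.ipynb for the authors' exact procedure):

import pickle
import numpy as np
import tensorflow as tf

# Load the trained network and the sklearn scalers released with the dataset.
model = tf.keras.models.load_model("NN_6by8.h5", compile=False)
with open("scaler_X_V2.pkl", "rb") as f:
    scaler_X = pickle.load(f)
with open("scaler_y_V2.pkl", "rb") as f:
    scaler_y = pickle.load(f)

# One row per radar sample; the three columns here (e.g. Z_Ku, Z_Ka, DWR) are hypothetical.
X = np.array([[10.0, 5.0, 5.0]])

X_scaled = scaler_X.transform(X)
y_scaled = model.predict(X_scaled)
y = scaler_y.inverse_transform(y_scaled)  # back to physical units
print(y)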
https://creativecommons.org/publicdomain/zero/1.0/
Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.
Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.
In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset, which is updated every 10 minutes. The data can be joined with historical prices in kernels. See similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
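For instance, a hedged sketch counting daily transactions in January 2018 (the transactions table and its block_timestamp column are assumed from the public dataset's schema):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT DATE(block_timestamp) AS day, COUNT(*) AS n_transactions
FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp >= TIMESTAMP '2018-01-01'
  AND block_timestamp < TIMESTAMP '2018-02-01'
GROUP BY day
ORDER BY day
"""

print(client.query(sql).to_dataframe())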
Allen Day (Twitter | Medium), Google Cloud Developer Advocate, and Colin Bookman, Google Cloud Customer Engineer, retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk to two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".
Photo by Andre Francois on Unsplash.
A quick overview of how the brand performed in Slovakia during 2018.
Our first contact with Google Trends led us to carry out quick brand-recognition research with this tool. We looked for brand searches connected with the "cewe fotokniha" topic. Google Trends provided us with broad insight into search behavior across web, image, and YouTube search.
These data are public so that anyone can quickly research how a brand is performing within a certain area or region.
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
Metadata includes
product IDs
bounding boxes
Basic Statistics:
Scenes: 47,739
Products: 38,111
Scene-Product Pairs: 93,274
An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation set, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments in the "background_noise" folder. Here we split this background noise into 1-second clips, and also keep one of the files for the validation set.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('speech_commands', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Travel Review Rating Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/wirachleelakiatiwong/travel-review-rating-dataset on 30 September 2021.
--- Dataset description provided by original source is as follows ---
This data set has been sourced from the Machine Learning Repository of University of California, Irvine (UC Irvine) : Travel Review Ratings Data Set. This data set is populated by capturing user ratings from Google reviews. Reviews on attractions from 24 categories across Europe are considered. Google user rating ranges from 1 to 5 and average user rating per category is calculated.
Attribute 1: Unique user id
Attribute 2: Average ratings on churches
Attribute 3: Average ratings on resorts
Attribute 4: Average ratings on beaches
Attribute 5: Average ratings on parks
Attribute 6: Average ratings on theatres
Attribute 7: Average ratings on museums
Attribute 8: Average ratings on malls
Attribute 9: Average ratings on zoo
Attribute 10: Average ratings on restaurants
Attribute 11: Average ratings on pubs/bars
Attribute 12: Average ratings on local services
Attribute 13: Average ratings on burger/pizza shops
Attribute 14: Average ratings on hotels/other lodgings
Attribute 15: Average ratings on juice bars
Attribute 16: Average ratings on art galleries
Attribute 17: Average ratings on dance clubs
Attribute 18: Average ratings on swimming pools
Attribute 19: Average ratings on gyms
Attribute 20: Average ratings on bakeries
Attribute 21: Average ratings on beauty & spas
Attribute 22: Average ratings on cafes
Attribute 23: Average ratings on view points
Attribute 24: Average ratings on monuments
Attribute 25: Average ratings on gardens
This data set has been sourced from the Machine Learning Repository of University of California, Irvine (UC Irvine) : Travel Review Ratings Data Set
The UCI page mentions the following publication as the original source of the data set: Renjith, Shini, A. Sreekumar, and M. Jathavedan. 2018. "Evaluation of Partitioning Clustering Algorithms for Processing Social Media Data in Tourism Domain." In 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 127-31. IEEE.
I'm the kind of person who loves traveling. But sometimes I have problems like: where should I visit? Are there interesting places that match my lifestyle? I often spend hours searching for an interesting place to go. Such a waste of time.
What if we could build a recommender system that recommends several interesting venues based on your preferences? Using information from Google reviews, I'll try to divide Google review users into clusters of similar interests, as groundwork for building a recommender system based on their preferences.
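A brief sketch of that clustering idea with scikit-learn, assuming the ratings sit in a CSV whose first column is the user id and whose remaining columns are the 24 average category ratings (the file name is hypothetical):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; first column = user id, remaining columns = category ratings.
ratings = pd.read_csv("google_review_ratings.csv")
X = ratings.iloc[:, 1:].apply(pd.to_numeric, errors="coerce").fillna(0).to_numpy()

X_std = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_std)

ratings["cluster"] = kmeans.labels_
print(ratings.groupby("cluster").size())  # number of users per interest cluster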
--- Original source retains full ownership of the source dataset ---
https://spdx.org/licenses/CC0-1.0.html
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "Methods" section below.
Methods
RAMP Data Documentation – January 1, 2017 through August 18, 2018
Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page-level data.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.
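The flagging rule can be pictured with a small sketch like the one below (the extension list is illustrative, not RAMP's exact rule set):

from urllib.parse import urlparse

# Extensions treated as non-HTML content files; illustrative only.
CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xls", ".xlsx", ".zip", ".txt"}

def citable_content(url: str) -> str:
    """Return 'Yes' if the URL appears to point to a content file, else 'No'."""
    path = urlparse(url).path.lower()
    return "Yes" if any(path.endswith(ext) for ext in CONTENT_EXTENSIONS) else "No"

print(citable_content("https://repo.example.edu/bitstream/123/thesis.pdf"))  # Yes
print(citable_content("https://repo.example.edu/handle/123"))                # No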
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
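In pandas terms, the two steps look roughly like this for one of the published monthly CSV files (file naming is described in the Output to CSV section below):

import pandas as pd

# One month of published page-level RAMP data.
ramp = pd.read_csv("2017-01_RAMP_all.csv")

# Step 1: keep only rows that point to citable content.
citable = ramp[ramp["citableContent"] == "Yes"]

# Step 2: sum clicks to obtain Citable Content Downloads (CCD) for this period.
ccd = citable["clicks"].sum()
print(ccd)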
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.
The data in these CSV files include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Leaf Area Index (LAI) is a fundamental vegetation structural variable that drives energy and mass exchanges between the plant and the atmosphere. Moderate-resolution (300m – 7km) global LAI data products have been widely applied to track global vegetation changes, drive Earth system models, monitor crop growth and productivity, etc. Yet, cutting-edge applications in climate adaptation, hydrology, and sustainable agriculture require LAI information at higher spatial resolution (< 100m) to model and understand heterogeneous landscapes.
This dataset was built to assist a machine-learning-based approach for mapping LAI from 30m-resolution Landsat images across the contiguous US (CONUS). The data were derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) Version 6 LAI/FPAR, Landsat Collection 1 surface reflectance, and NLCD Land Cover datasets over 2006-2018 using Google Earth Engine. Each record/sample/row includes a MODIS LAI value, corresponding Landsat surface reflectance in the green, red, NIR, and SWIR1 bands, a land cover (biome) type, geographic location, and other auxiliary information. Each sample represents a MODIS LAI pixel (500m) within which a single biome type dominates 90% of the area. The spatial homogeneity of the samples was further controlled by a screening process based on the coefficient of variation of the Landsat surface reflectance. In total, there are approximately 1.6 million samples, stratified by biome, Landsat sensor, and saturation status from the MODIS LAI algorithm. This dataset can be used to train machine learning models and generate LAI maps for Landsat 5, 7, 8 surface reflectance images within CONUS. Detailed information on the sample generation and quality control can be found in the related journal article.
Resources in this dataset:
Resource Title: README. File Name: LAI_train_samples_CONUS_README.txt. Resource Description: Description and metadata of the main dataset. Resource Software Recommended: Notepad, url: https://www.microsoft.com/en-us/p/windows-notepad/9msmlrh6lzf3?activetab=pivot:overviewtab
Resource Title: LAI_training_samples_CONUS. File Name: LAI_train_samples_CONUS_v0.1.1.csv. Resource Description: This CSV file consists of the training samples for estimating Leaf Area Index based on Landsat surface reflectance images (Collection 1 Tier 1). Each sample has a MODIS LAI value and corresponding surface reflectance derived from Landsat pixels within the MODIS pixel.
Contact: Yanghui Kang (kangyanghui@gmail.com)
Column description
UID: Unique identifier. Format: LATITUDE_LONGITUDE_SENSOR_PATHROW_DATE
Landsat_ID: Landsat image ID
Date: Landsat image date in "YYYYMMDD"
Latitude: Latitude (WGS84) of the MODIS LAI pixel center
Longitude: Longitude (WGS84) of the MODIS LAI pixel center
MODIS_LAI: MODIS LAI value in "m2/m2"
MODIS_LAI_std: MODIS LAI standard deviation in "m2/m2"
MODIS_LAI_sat: 0 - MODIS Main (RT) method used no saturation; 1 - MODIS Main (RT) method with saturation
NLCD_class: Majority class code from the National Land Cover Dataset (NLCD)
NLCD_frequency: Percentage of the area cover by the majority class from NLCD
Biome: Biome type code mapped from NLCD (see below for more information)
Blue: Landsat surface reflectance in the blue band
Green: Landsat surface reflectance in the green band
Red: Landsat surface reflectance in the red band
Nir: Landsat surface reflectance in the near infrared band
Swir1: Landsat surface reflectance in the shortwave infrared 1 band
Swir2: Landsat surface reflectance in the shortwave infrared 2 band
Sun_zenith: Solar zenith angle from the Landsat image metadata. This is a scene-level value.
Sun_azimuth: Solar azimuth angle from the Landsat image metadata. This is a scene-level value.
NDVI: Normalized Difference Vegetation Index computed from Landsat surface reflectance
EVI: Enhanced Vegetation Index computed from Landsat surface reflectance
NDWI: Normalized Difference Water Index computed from Landsat surface reflectance
GCI: Green Chlorophyll Index = Nir/Green - 1
Biome code
1 - Deciduous Forest
2 - Evergreen Forest
3 - Mixed Forest
4 - Shrubland
5 - Grassland/Pasture
6 - Cropland
7 - Woody Wetland
8 - Herbaceous Wetland
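As a hedged sketch of the intended use of these samples (the regressor choice and feature subset below are illustrative, not the approach described in the related journal article):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

samples = pd.read_csv("LAI_train_samples_CONUS_v0.1.1.csv")

# Illustrative feature subset: four reflectance bands plus the biome code.
features = ["Green", "Red", "Nir", "Swir1", "Biome"]
X = samples[features]
y = samples["MODIS_LAI"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out samples:", model.score(X_test, y_test))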
Reference datasets (all data was accessed through Google Earth Engine):
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment.
MODIS Version 6 Leaf Area Index/FPAR 4-day L5 Global 500m: Myneni, R., Y. Knyazikhin, T. Park. MOD15A2H MODIS/Terra Leaf Area Index/FPAR 8-Day L4 Global 500m SIN Grid V006. 2015, distributed by NASA EOSDIS Land Processes DAAC, https://doi.org/10.5067/MODIS/MOD15A2H.006
Landsat 5/7/8 Collection 1 Surface Reflectance: Landsat Level-2 Surface Reflectance Science Product courtesy of the U.S. Geological Survey. Masek, J.G., Vermote, E.F., Saleous N.E., Wolfe, R., Hall, F.G., Huemmrich, K.F., Gao, F., Kutler, J., and Lim, T-K. (2006). A Landsat surface reflectance dataset for North America, 1990–2000. IEEE Geoscience and Remote Sensing Letters 3(1):68-72. http://dx.doi.org/10.1109/LGRS.2005.857030. Vermote, E., Justice, C., Claverie, M., & Franch, B. (2016). Preliminary analysis of the performance of the Landsat 8/OLI land surface reflectance product. Remote Sensing of Environment. http://dx.doi.org/10.1016/j.rse.2016.04.008.
National Land Cover Dataset (NLCD): Yang, Limin, Jin, Suming, Danielson, Patrick, Homer, Collin G., Gass, L., Bender, S.M., Case, Adam, Costello, C., Dewitz, Jon A., Fry, Joyce A., Funk, M., Granneman, Brian J., Liknes, G.C., Rigge, Matthew B., Xian, George. A new generation of the United States National Land Cover Database—Requirements, research priorities, design, and implementation strategies: ISPRS Journal of Photogrammetry and Remote Sensing, v. 146, p. 108–123, at https://doi.org/10.1016/j.isprsjprs.2018.09.006
Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.
The datasets follow the RLDS format to represent steps and episodes.
We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions Machado et al., 2018. As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.
Atari is a standard RL benchmark. We recommend trying offline RL methods on Atari if you are interested in comparing your approach to other state-of-the-art offline RL methods with discrete actions.
The reward of each step is clipped to [-1, 1], and each episode includes the sum of the clipped rewards over that episode.
Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).
Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.
This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Open Images is a dataset of ~9M images that have been annotated with image-level labels and object bounding boxes.
The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). Moreover, the dataset is annotated with image-level labels spanning thousands of classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('open_images_v4', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/open_images_v4-original-2.0.0.png
The dataset of this paper is collected based on Google, Blockchain, and the Bitcoin market. In total there are 26 features; however, features whose correlation rate between the variations of price and the variations of the feature is lower than 0.3 have been eliminated. Hence, a total of 21 practical features, including Market capitalization, Trade-volume, Transaction-fees USD, Average confirmation time, Difficulty, High price, Low price, Total hash rate, Block-size, Miners-revenue, N-transactions-total, Google searches, Open price, N-payments-per Block, Total circulating Bitcoin, Cost-per-transaction percent, Fees-USD-per transaction, N-unique-addresses, N-transactions-per block, and Output-volume, have been selected. In addition to the values of these features, for each feature a new one is created that contains the difference between the previous day and the day before the previous day, as a supportive feature. In terms of the size and time span of the dataset, a total of 1,275 training samples were used in the proposed model to extract patterns of the Bitcoin price; they were collected from 12 Nov 2018 to 4 Jun 2021.
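A small pandas sketch of the supportive difference features described above (the DataFrame and its column names are hypothetical, since the source does not give exact column names):

import pandas as pd

# Hypothetical daily feature table indexed by date; column names are illustrative.
df = pd.DataFrame({
    "High_price": [60000.0, 60500.0, 59800.0, 61000.0],
    "Trade_volume": [1.2e9, 1.4e9, 1.1e9, 1.5e9],
}, index=pd.date_range("2021-05-01", periods=4, freq="D"))

# For each feature, add the difference between the previous day and the day before it.
for col in list(df.columns):
    df[f"{col}_diff_prev"] = df[col].shift(1) - df[col].shift(2)

print(df)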
Data collected from the Twitter social media platform (6 May 2018 - 16 May 2018) to explore the potential role of social media data in responding to new and emerging forms of food fraud reported on social media, from posts originating in the UK. The dataset contains Tweet IDs and the keywords used to search for Tweets via programmatic access to the public Twitter API. The keywords used in this search were generated using a machine learning tool and consisted of combinations of keywords describing terms related to food and outrage.
Social media and other forms of online content have enormous potential as a way to understand people's opinions and attitudes, and as a means to observe emerging phenomena, such as disease outbreaks. How might policy makers use such new forms of data to better assess existing policies and help formulate new ones? This one-year demonstrator project is a partnership between computer science academics at the University of Aberdeen and officers from Food Standards Scotland which aims to answer this question. Food Standards Scotland is the public-sector food body for Scotland created by the Food (Scotland) Act 2015. It regularly provides policy guidance to ministers in areas such as food hygiene monitoring and reporting, food-related health risks, and food fraud. The project will develop a software tool (the Food Sentiment Observatory) that will be used to explore the role of data from sources such as Twitter, Facebook, and TripAdvisor in three policy areas selected by Food Standards Scotland: attitudes to the differing food hygiene information systems used in Scotland and the other UK nations; study of an historical E.coli outbreak to understand the effectiveness of monitoring and decision-making protocols; and understanding the potential role of social media data in responding to new and emerging forms of food fraud. The Observatory will integrate a number of existing software tools (developed in our recent research) to allow us to mine large volumes of data to identify important textual signals, extract opinions held by individuals or groups, and crucially, to document these data processing operations, to aid transparency of policy decision-making. Given the amount of noise appearing in user-generated online content (such as fake restaurant reviews), it is our intention to investigate methods to extract meaningful and reliable knowledge, to better support policy making.
The search for relevant data content was performed using a custom-built data collection module within the Observatory platform (https://sites.google.com/view/foobs/the-observatory). A public API provided by Twitter was utilised to gather all social media messages (Tweets) matching a specific set of keywords. Each line in the food-keywords.txt file (group 1) and in the outrage-keywords.txt file (group 2) contains a search keyword/phrase. A list of search keywords was then created from all possible combinations of individual keywords/phrases from group 1 and group 2. A matching Tweet, returned by the search, had to include at least one combination of such search keywords/phrases. Therefore, the search string used by the API was constructed as follows: ( ) OR ( ) OR ... *Note: the space between represents a logical AND in terms of the Twitter API service. The Twitter API allows historical searches to be restricted to Tweets associated with a specific location; however, this can only be specified as a specific radius from a given latitude and longitude geo-point.
We used Twitter's geo-restricted search by defining a Lat/Long point and radius (in kilometres). In order to cover major areas in the UK, we used the following four geo-restrictions:
Latitude = 57.334942, Longitude = -4.395858, Radius = 253 km;
Latitude = 55.288000, Longitude = -2.374374, Radius = 282 km;
Latitude = 52.250808, Longitude = -0.660507, Radius = 198 km;
Latitude = 51.953880, Longitude = -2.989608, Radius = 198 km.
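A sketch of how the combined search string could be assembled from the two keyword files named above (the exact construction used by the Observatory may differ):

from itertools import product

# One keyword/phrase per line in each file, as described above.
with open("food-keywords.txt") as f:
    food_terms = [line.strip() for line in f if line.strip()]
with open("outrage-keywords.txt") as f:
    outrage_terms = [line.strip() for line in f if line.strip()]

# A matching Tweet must contain at least one (food AND outrage) combination,
# so the combinations are OR-ed together; a space means logical AND for the Twitter API.
clauses = [f"({food} {outrage})" for food, outrage in product(food_terms, outrage_terms)]
query = " OR ".join(clauses)
print(query[:200])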
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The CCIHP dataset is devoted to fine-grained description of people in the wild with localized & characterized semantic attributes. It contains 20 attribute classes and 20 characteristic classes split into 3 categories (size, pattern and color). The annotations were made with Pixano, an open-source, smart annotation tool for computer vision applications: https://pixano.cea.fr/
CCIHP dataset provides pixelwise image annotations for:
Images:
The image data are the same as in the CIHP dataset (see the Related work section) proposed at the LIP (Look Into Person) challenge. They are available on Google Drive and Baidu Drive. (The Baidu link does not require access rights.)
Annotations:
Please download and unzip the CCIHP_icip.zip file. The CCIHP annotations can be found in the Training and Validation sub-folders of the CCIHP_icip2021/dataset/ folder. They correspond to, respectively, 28,280 training images and 5,000 validation images. Annotations consist of:
Label meaning for semantic attribute/body parts:
Label meaning for size characterization:
Label meaning for pattern characterization:
Label meaning for color characterization:
Our work is based on CIHP image dataset from: Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang and Liang Lin, "Instance-level Human Parsing via Part Grouping Network", ECCV 2018.
To evaluate the predictions given by a Human Parsing with Characteristics model, you can run the python scripts in CCIHP_icip2021/evaluation/
folder.
- generate_characteristic_instance_part_ccihp.py
- eval_test_characteristic_inst_part_ap_ccihp.py, for mean Average Precision based on characterized regions (AP^(cr)_(vol)). It evaluates the prediction of characteristics (class & score) relative to each instance-level, characterized attribute mask, independently of the attribute class prediction.
- metric_ccihp_miou_evaluation.py, for a mIoU performance evaluation of semantic predictions (attributes or characteristics); a minimal illustrative mIoU sketch is given below.
Data annotations are under the Creative Commons Attribution Non Commercial 4.0 license (see the LICENSE file).
The evaluation code is released under the MIT license.
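For reference, the mIoU measure mentioned above can be illustrated with a minimal sketch. This is a generic mean-IoU computation over integer label maps, not the metric_ccihp_miou_evaluation.py script shipped with the dataset.

    # Generic mean-IoU over label maps (illustration only, not the dataset's
    # evaluation script): average IoU over classes present in either map.
    import numpy as np

    def mean_iou(prediction, ground_truth, num_classes):
        ious = []
        for c in range(num_classes):
            pred_c = prediction == c
            gt_c = ground_truth == c
            union = np.logical_or(pred_c, gt_c).sum()
            if union == 0:  # class absent from both maps: skip it
                continue
            inter = np.logical_and(pred_c, gt_c).sum()
            ious.append(inter / union)
        return float(np.mean(ious)) if ious else 0.0

    # Toy 2x2 label maps with 3 classes.
    pred = np.array([[0, 1], [2, 2]])
    gt = np.array([[0, 1], [1, 2]])
    print(mean_iou(pred, gt, num_classes=3))  # averages the per-class IoUs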
A. Loesch and R. Audigier, "Describe Me If You Can! Characterized Instance-Level Human Parsing," 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 2528-2532, doi: 10.1109/ICIP42928.2021.9506509.
@INPROCEEDINGS{ccihp_dataset_2021,
author={Loesch, Angelique and Audigier, Romaric},
booktitle={2021 IEEE International Conference on Image Processing (ICIP)},
title={Describe Me If You Can! Characterized Instance-Level Human Parsing},
year={2021},
volume={},
number={},
pages={2528-2532},
doi={10.1109/ICIP42928.2021.9506509}}
If you have any questions about this dataset, you can contact us by email at ccihp-dataset@cea.fr.
FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.
Citation
If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
About this dataset
Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.
The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.
All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.
The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:
"Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".
Some other relevant characteristics of FSDKaggle2018:
The dataset is split into a train set and a test set.
The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.
Out of the ~9.5k samples in the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.
Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.
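As a rough illustration, that flag could be used as follows; the column names ("label", "manually_verified") are assumed from this description rather than guaranteed.

    # Sketch only: split the train set using the verification flag described
    # above. Column names ("label", "manually_verified") are assumed here.
    import pandas as pd

    train = pd.read_csv("train.csv")

    verified = train[train["manually_verified"] == 1]      # ~3.7k manually-verified clips
    non_verified = train[train["manually_verified"] == 0]  # ~5.8k non-verified clips

    print(len(verified), len(non_verified))
    print(train["label"].value_counts().head())  # clips per category (uneven)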
The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.
All audio samples in this dataset have a single label (i.e., they are annotated with only one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
Data labeling process
The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.
Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of the automatically assigned sound category, according to the AudioSet category description.
Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.
The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. Some of these non-verified audio samples may contain several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of the 41 categories, but in a few cases they could fall within it.
More details about the data labeling process can be found in [3].
License
FSDKaggle2018 has licenses at two different levels, as explained next.
All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of the audio clips included in FSDKaggle2018 and their corresponding licenses. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.
In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.
Files
FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2018.audio_train/                      Audio clips in the train set
│
└───FSDKaggle2018.audio_test/                       Audio clips in the test set
│
└───FSDKaggle2018.meta/                             Files for evaluation setup
│   │
│   └───train_post_competition.csv                  Data split and ground truth for the train set
│   │
│   └───test_post_competition_scoring_clips.csv     Ground truth for the test set
│
└───FSDKaggle2018.doc/
    │
    └───README.md                                   The dataset description file you are reading
    │
    └───LICENSE-DATASET
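As an end-to-end illustration, the sketch below reads one training clip from the layout above and pairs it with its ground-truth row; the "fname" and "label" column names in train_post_competition.csv are assumed from the description of the meta files.

    # Sketch, assuming the zip files above have been extracted in place and that
    # train_post_competition.csv contains "fname" and "label" columns (assumed).
    from pathlib import Path

    import pandas as pd
    from scipy.io import wavfile

    root = Path(".")
    meta = pd.read_csv(root / "FSDKaggle2018.meta" / "train_post_competition.csv")

    row = meta.iloc[0]
    wav_path = root / "FSDKaggle2018.audio_train" / row["fname"]
    rate, samples = wavfile.read(str(wav_path))

    # Clips are uncompressed PCM 16-bit, 44.1 kHz, mono.
    print(row["label"], rate, samples.dtype, samples.shape)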