Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This English (United States) real-world casual conversation and monologue speech dataset covers self-media, conversation, live streams, lectures, variety shows, and similar sources, mirroring real-world interactions. Recordings are transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which enhances model performance on real, complex tasks, and has been quality-tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, maintaining user privacy and legal rights throughout data collection, storage, and usage; the datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle
16kHz, 16 bit, wav, mono channel;
Including self-media, conversation, live, lecture, variety-show, etc;
Low background noise;
America(USA);
en-US;
English;
Transcription text, timestamp, speaker ID, gender.
Sentence Accuracy Rate (SAR) 95%
Commercial License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Noise is an unwanted component of audio recordings, yet noise data plays an important role in machine learning on audio.
The dataset can be used for noise filtering, noise generation, and noise recognition in audio classification, audio recognition, audio generation, and other audio-related machine learning tasks. I, Min Si Thu, have used this dataset in open-source projects.
I collected ten types of noise in this dataset.
Location - Myanmar, Mandalay, Amarapura Township
This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.
Can machine learning be used to detect when speech is AI-generated?
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation; thus, there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.
To address these emerging issues, we introduce the DEEP-VOICE dataset. DEEP-VOICE comprises real human speech from eight well-known figures and their speech converted to one another's voices using Retrieval-based Voice Conversion.
For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:
(Figure: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech, with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)
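To illustrate the remixing step described above, here is a minimal sketch (using pydub) of layering the original accompaniment back under the RVC-converted vocals. The file names are hypothetical, and the source separation and RVC conversion themselves are assumed to have already been done.

```python
# Add the original accompaniment back under the converted vocals (sketch).
from pydub import AudioSegment

converted_vocals = AudioSegment.from_wav("obama-to-biden_vocals_rvc.wav")   # hypothetical file
accompaniment = AudioSegment.from_wav("obama_original_accompaniment.wav")   # hypothetical file

# Layer the original background ambience under the DeepFake vocals.
deepfake_mix = converted_vocals.overlay(accompaniment)
deepfake_mix.export("Obama-to-Biden.wav", format="wav")
```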
The dataset is made available in two forms.
First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.
Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the classes are balanced through random sampling.
**Note:** All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory contains cropped and compressed demos for playback in notebooks, due to Kaggle's limitations on file size.
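As a starting point, here is a minimal sketch of loading the balanced feature CSV and fitting an off-the-shelf classifier. The label column name ("LABEL") and its REAL/FAKE values are assumptions about the CSV layout, and the model choice is illustrative rather than the one used in the study cited below.

```python
# Minimal REAL/FAKE classification sketch on the extracted-feature CSV.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("DATASET-balanced.csv")
X = df.drop(columns=["LABEL"])   # extracted audio features (column name assumed)
y = df["LABEL"]                  # REAL or FAKE (values assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```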
A successful system could potentially be used as follows:
(Figure: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)
The dataset and all studies using it are linked on Papers with Code.
This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"
Bird, J.J. and Lotfi, A., 2023. Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. arXiv preprint arXiv:2308.12734.
The preprint is available on arXiv: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion (arXiv:2308.12734).
This dataset is provided under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
This UK English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.
The dataset contains 30 hours of dual-channel call center recordings between native UK English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes, including positive, negative, and neutral, ensuring broad scenario coverage for telecom AI development.
This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
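For orientation, here is a minimal sketch of reading one of the time-coded JSON transcriptions. The file name and every field name used here ("segments", "start", "end", "speaker", "text") are illustrative assumptions, so check them against the actual schema shipped with the dataset.

```python
# Read one time-coded transcription file (field names assumed for illustration).
import json

with open("call_0001_transcription.json", encoding="utf-8") as f:
    transcript = json.load(f)

for segment in transcript.get("segments", []):
    start, end = segment["start"], segment["end"]
    print(f"[{start:8.2f}-{end:8.2f}] {segment['speaker']}: {segment['text']}")
```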
Rich metadata is available for each participant and conversation:
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Description
• This dataset contains over 7,500 fart recordings collected over a period of 37 months. I created this dataset for educational purposes, to allow others to study and experiment with audio and, more broadly, signal processing.
• The files are in .wav format. I recorded all of the farts using a voice recording app in whatever environment I was in at the time, so there may be background noise, people talking, and variations in volume and clarity (note: I generally step far away to fart if others are present). I validate every single recording and delete those with low volume, too much background noise, or sounds that merely resemble farts (e.g. phone vibrations). Over time, I have lowered my threshold for acceptable background noise and strive for minimal or no extraneous noise. There are numerous types of farts as well. I did not record every fart I produced during the period; rather, I recorded when I was able to do so. Thus, this data is not inclusive of all emitted farts.
• I did not perform any preprocessing on this data, to maintain its versatility. For most audio tasks, you may consider trimming files to a consistent duration, along with other common audio preprocessing techniques; see the sketch after this list.
• The files are named as integers, starting from 1. The order of files bears no significance.
• If you are using these files specifically for fart classification or fart recognition, please bear in mind that this data is biased toward my farts. Consequently, a model trained on it may recognize someone else's farts, such as your own, with different results.
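As referenced in the list above, here is a minimal preprocessing sketch (using librosa) that loads each recording and trims or zero-pads it to a fixed duration. The folder name, sample rate, and 3-second target are arbitrary assumptions.

```python
# Trim or zero-pad each clip to a fixed length for downstream modeling.
import glob
import librosa
import numpy as np

TARGET_SR = 22050
TARGET_LEN = 3 * TARGET_SR  # 3 seconds (arbitrary choice)

def load_fixed_length(path):
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]
    return np.pad(y, (0, TARGET_LEN - len(y)))

clips = [load_fixed_length(p) for p in glob.glob("farts/*.wav")]
```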
Suggested Uses
• Unsupervised signal classification - You can experiment with categorizing farts without any preexisting knowledge of their defining characteristics and potentially apply these learnings to other signal types - speech, radar, TV, radio, light, EEG (see the clustering sketch after this list).
• Supervised signal recognition - This dataset could be used to experiment with developing deep learning models capable of recognizing whether a sound is a fart. An interesting property of farts is their variable frequencies and inconsistent durations.
• Sound effects creation - This dataset could be used by sound designers or audio engineers as a basis to create new sound effects for movies, video games, or other media. You could also simply use it as a publicly available and free source of farts.
• Education and outreach - Educators and scientists can use this dataset as an approach to better engage their audiences in signal processing and deep learning.
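For the unsupervised-classification idea above, here is a hedged sketch that summarizes each clip with mean MFCCs and clusters the summaries. The folder name, feature choice, and cluster count are assumptions, not properties of the dataset.

```python
# Cluster clips by mean MFCC vectors (unsupervised categorization sketch).
import glob
import librosa
import numpy as np
from sklearn.cluster import KMeans

def mfcc_summary(path, sr=22050, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)   # one fixed-length vector per clip

features = np.array([mfcc_summary(p) for p in glob.glob("farts/*.wav")])
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
```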
License
• This data is publicly and freely available to use and modify however you would like. There is no license and there are no limitations on use. I would appreciate being notified when this data is used publicly, purely for my own entertainment.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name; use this as the target variable if you are doing audio classification (string) |
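A minimal sketch of loading the table described above with pandas; the CSV file name is an assumption.

```python
# Load the metadata table and inspect the speaker classification target.
import pandas as pd

df = pd.read_csv("train.csv")              # file name assumed
print(df["speaker"].value_counts())        # class balance of the target variable
first_clip = df.loc[0, "file_path"]        # path to a .wav file to load/featurize
```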
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the database, the recording environment, etc., is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1132 sentence prompt list is available from cmuarctic.data
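A small sketch for reading the prompt list, assuming cmuarctic.data uses the Festival-style prompt format (one parenthesized utterance ID and quoted sentence per line); treat that format as an assumption and adjust the pattern if the file differs.

```python
# Parse Festival-style prompt entries such as:
#   ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
import re

PROMPT_RE = re.compile(r'\(\s*(\S+)\s+"(.*)"\s*\)')

prompts = {}
with open("cmuarctic.data", encoding="utf-8") as f:
    for line in f:
        match = PROMPT_RE.match(line.strip())
        if match:
            utt_id, text = match.groups()
            prompts[utt_id] = text

print(len(prompts), "prompts loaded")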
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be built by improving the labeling, etc.
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The objective is to remove or suppress indoor and outdoor noise present during a conversation so the other person on the call hears clean speech, either by improving the recording system or by filtering out environmental noise.
Real-time noise reduction using DNNs:
- https://devblogs.nvidia.com/nvidia-real-time-noise-suppression-deep-learning/
- http://staff.ustc.edu.cn/~jundu/Publications/publications/Trans2015_Xu.pdf
Design a network that works for both stationary and non-stationary noise.
Prepared data: 1-second speech clips mixed with indoor and outdoor environmental noise.
- To prepare longer speech segments with noise mixed in, use the audio and marcas Python libraries.
- To denoise longer speech, train a model on 1-second clips and apply it to longer wav files by processing them in 1-second chunks.
Data preparation notebooks: (a) ProcessWav.ipynb, (b) Processdata.ipynb.
Additional reference repository for preparing datasets
The architecture and summary of the noise-reduction model are attached along with the dataset.
Train and test code is available for reference (https://github.com/vijay033/Noise-Suppression-Auto-Encoder): (a) trainnoise.ipynb trains the model and generates model.h5; (b) testprediction uses model.h5, chunks the audio into 1-second segments, passes each chunk through the model, and rejoins all the chunks.
Sampling rate: 16 kHz; file format: WAV; NPY dimension: 257 x 62 x 3.
Chunking wav files into small segments:
- https://www.programcreek.com/python/example/89506/pydub.AudioSegment.from_file
- https://readthedocs.org/projects/audiosegment/downloads/pdf/latest/
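A sketch of the 1-second chunk-process-rejoin workflow described above, using pydub; the file names are placeholders and denoise() stands in for the trained model.

```python
# Chunk a long wav into 1-second segments, process each, and rejoin them.
from pydub import AudioSegment

CHUNK_MS = 1000

def denoise(chunk):
    # Placeholder: run the 1-second chunk through the trained model here.
    return chunk

audio = AudioSegment.from_file("noisy_long.wav", format="wav")
chunks = [audio[i:i + CHUNK_MS] for i in range(0, len(audio), CHUNK_MS)]

cleaned = AudioSegment.empty()
for chunk in chunks:
    cleaned += denoise(chunk)
cleaned.export("cleaned_long.wav", format="wav")
```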
Free sound noise: https://www.freesoundeffects.com/free-sounds/
Speech datasets: http://www.openslr.org/12 (not all samples used; train-clean-100.tar.gz, 6.3 GB) and https://openslr.org/83/
Expected outcome: a model, such as a CNN or autoencoder, that removes or suppresses noise from the original speech.
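For illustration only, here is a hedged Keras sketch of a convolutional autoencoder that accepts the 257 x 62 x 3 arrays noted above; this is not the architecture from the linked repository, just a minimal stand-in.

```python
# Minimal convolutional autoencoder for 257 x 62 x 3 spectrogram-like inputs.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(257, 62, 3))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
outputs = layers.Conv2D(3, 3, padding="same", activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(noisy_npy, clean_npy, epochs=..., batch_size=...)
```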
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Spotify Recommendation’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bricevergnou/spotify-recommendation on 28 January 2022.
--- Dataset description provided by original source is as follows ---
(You can check how I used this dataset on my GitHub repository.)
I am basically a HUGE fan of music (mostly French rap, though with some exceptions). One day, while browsing the Internet, I found Spotify's API. I knew I had to use it when I found out you could get information like danceability about your favorite songs just from their IDs.
Once I saw that, my machine learning instincts forced me to work on this project.
I collected 100 liked songs and 95 disliked songs.
For those I like, I made a playlist of my favorite 100 songs. It is mainly French rap, with some American rap, rock, and electro music.
For those I dislike, I collected songs from various kinds of music so the model would have a broader view of what I don't like.
There are:
- 25 metal songs (Cannibal Corpse)
- 20 "I don't like" rap songs (PNL)
- 25 classical songs
- 25 disco songs
I didn't include any pop songs because I'm kind of neutral about them.
Using Spotify's "Get a Playlist's Items" API, I turned the playlists into JSON-formatted data that contains the ID and the name of each track (ids/yes.py and ids/no.py). NB: on the website, specify "items(track(id,name))" in the fields parameter to avoid being overwhelmed by useless data.
With a script (ids/ids_to_data.py), I turned the JSON data into a long string with each ID separated by a comma.
Then I just had to pass those strings to the Spotify "Get Audio Features for Several Tracks" API to get my data files (data/good.json and data/dislike.json).
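The same workflow can be scripted with the Spotipy client instead of the web console; in this sketch the playlist ID and credentials are placeholders.

```python
# Fetch a playlist's track IDs, then request their audio features via Spotipy.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

items = sp.playlist_items("PLAYLIST_ID", fields="items(track(id,name))")
track_ids = [it["track"]["id"] for it in items["items"]]

features = sp.audio_features(track_ids[:100])  # the endpoint accepts up to 100 IDs
print(features[0]["danceability"])
```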
From Spotify's API documentation:
And the variable that has to be predicted:
--- Original source retains full ownership of the source dataset ---
Speech is the most natural way of expressing ourselves as humans. It is only natural, then, to extend this communication medium to computer applications. We define speech emotion recognition (SER) systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions. SER is not a new field; it has been around for over two decades and has regained attention thanks to recent advancements. These novel studies make use of the advances in all fields of computing and technology, making it necessary to have an update on the current methodologies and techniques that make SER possible. We have identified and discussed distinct areas of SER, provided a detailed survey of the current literature of each, and listed the current challenges.
Here are the four most popular English datasets: Crema, Ravdess, Savee, and Tess. Each contains audio in .wav format with the main labels.
Ravdess:
Here are the filename identifiers as per the official RAVDESS website:
So, here's an example of an audio filename: 02-01-06-01-02-01-12.wav. The metadata for this audio file is:
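A small sketch that decodes the example filename, based on the identifier convention published on the official RAVDESS website (the mappings below are a convenience copy of that convention, not something shipped with this dataset).

```python
# Decode a RAVDESS filename of the form modality-channel-emotion-intensity-statement-repetition-actor.
FIELDS = ["modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor"]
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def parse_ravdess(filename):
    parts = filename.removesuffix(".wav").split("-")
    info = dict(zip(FIELDS, parts))
    info["emotion"] = EMOTIONS.get(info["emotion"], info["emotion"])
    info["actor_gender"] = "female" if int(info["actor"]) % 2 == 0 else "male"
    return info

print(parse_ravdess("02-01-06-01-02-01-12.wav"))
# -> emotion 'fearful', actor 12 (female), speech vocal channel, normal intensity
```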
Crema:
The third component of the filename carries the emotion label:
* SAD - sadness
* ANG - anger
* DIS - disgust
* FEA - fear
* HAP - happiness
* NEU - neutral
Tess:
Very similar to Crema: the emotion label is contained in the file name.
Savee:
The audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows:
It is my pleasure to credit the notebook that inspired me to publish this dataset publicly.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Emotion expression is an essential part of human interaction. The same text can hold different meanings when expressed with different emotions. Thus understanding the text alone is not enough for getting the meaning of an utterance. Acted and natural corpora have been used to detect emotions from speech. Many speech databases for different languages including English, German, Chinese, Japanese, Russian, Italian, Swedish and Spanish exist for modeling emotion recognition. Since there is no reported reference of an available Arabic corpus, we decided to collect the first Arabic Natural Audio Dataset (ANAD) to recognize discrete emotions.
Embedding an effective emotion detection feature in speech recognition system seems a promising solution for decreasing the obstacles faced by the deaf when communicating with the outside world. There exist several applications that allow the deaf to make and receive phone calls normally, as the hearing-impaired individual can type a message and the person on the other side hears the words spoken, and as they speak, the words are received as text by the deaf individual. However, missing the emotion part still makes these systems not hundred percent reliable. Having an effective speech to text and text to speech system installed in their everyday life starting from a very young age will hopefully replace the human ear. Such systems will aid deaf people to enroll in normal schools at very young age and will help them to adapt better in classrooms and with their classmates. It will help them experience a normal childhood and hence grow up to be able to integrate within the society without external help.
Eight videos of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows. Each video was then divided into turns: callers and receivers. To label each video, 18 listeners were asked to listen to each video and select whether they perceive a happy, angry or surprised emotion. Silence, laughs and noisy chunks were removed. Every chunk was then automatically divided into 1 sec speech units forming our final corpus composed of 1384 records.
Twenty-five acoustic features, also known as low-level descriptors (LLDs), were extracted: intensity, zero-crossing rate, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (fundamental frequency) and F0 envelope, probability of voicing, and LSP frequencies 0-7. Nineteen statistical functionals were applied to every feature: maximum, minimum, range, absolute position of maximum, absolute position of minimum, arithmetic mean, linear regression 1, linear regression 2, linear regression A, linear regression Q, standard deviation, kurtosis, skewness, quartiles 1, 2, 3, and inter-quartile ranges 1-2, 2-3, 1-3. The delta coefficient of every LLD was also computed as an estimate of its first derivative, giving 25 LLDs plus 25 deltas, each with 19 functionals, for a total of 50 x 19 = 950 features.
I would have never reached that far without the help of my supervisors. I warmly thank and appreciate Dr. Rached Zantout, Dr. Lama Hamandi, and Dr. Ziad Osman for their guidance, support and constant supervision.
The EMODB database is a freely available German emotional speech database, created by the Institute of Communication Science, Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances and comprises seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.
Every utterance is named according to the same scheme:
Example: 03a01Fa.wav is the audio file from Speaker 03 speaking text a01 with the emotion "Freude" (Happiness).
| letter | emotion (English) | letter | emotion (German) |
|---|---|---|---|
| A | anger | W | Ärger (Wut) |
| B | boredom | L | Langeweile |
| D | disgust | E | Ekel |
| F | anxiety/fear | A | Angst |
| H | happiness | F | Freude |
| S | sadness | T | Trauer |

N = neutral version
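A small sketch that decodes EMODB filenames using the naming scheme and the letter table above; note that the filenames use the German letter codes (e.g. F for "Freude", as in the example 03a01Fa.wav).

```python
# Decode an EMODB filename: speaker (2 chars) + text code (3 chars) + emotion letter + version.
EMOTION_LETTERS = {"W": "anger", "L": "boredom", "E": "disgust",
                   "A": "anxiety/fear", "F": "happiness", "T": "sadness",
                   "N": "neutral"}

def parse_emodb(filename):
    stem = filename.removesuffix(".wav")
    return {
        "speaker": stem[:2],        # e.g. "03"
        "text_code": stem[2:5],     # e.g. "a01"
        "emotion": EMOTION_LETTERS[stem[5]],
        "version": stem[6:],        # e.g. "a"
    }

print(parse_emodb("03a01Fa.wav"))
```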
Emotion classification from speech has attracted increasing interest in the speech processing area. The objective of emotion classification is to classify different emotions from the speech signal. A person's emotional state affects the speech production mechanism, and because of this, breathing rate and muscle tension change from the neutral condition. Therefore, the resulting speech signal may have different characteristics from that of neutral speech.
The performance of speech recognition or speaker recognition decreases significantly if the model is trained on neutral speech and tested on emotional speech. So, as machine learning enthusiasts, we can start working on speech emotion recognition problems and come up with good, robust models.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is a dataset of Spotify tracks spanning 125 different genres. Each track has audio features associated with it. The data is in CSV format, which is tabular and can be loaded quickly.
The dataset can be used for:
Key: integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on; if no key was detected, the value is -1. Time signature: an estimated meter, ranging from "3/4" to "7/4".
Image credits: BPR World
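A tiny helper for decoding the key field under standard Pitch Class notation (the full twelve-name list below is standard notation, not something taken from this dataset's files).

```python
# Map Spotify's integer key field to a pitch-class name; -1 means no key detected.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def key_name(key):
    return "unknown" if key == -1 else PITCH_CLASSES[key]

print(key_name(1))   # "C♯/D♭"
print(key_name(-1))  # "unknown"
```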
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a detailed regional overview of digital music consumption in Brazil on Spotify between 2021 and 2023. It includes acoustic features and all genres/artists that were listened to at least once in those years. The data is provided by the Spotify API for Developers and SpotifyCharts, which are used to collect the acoustic features and the summarized most-listened songs per city, respectively.
It covers 17 cities across 16 different Brazilian states, comprising 5190 unique tracks, 487 different genres, and 2056 artists. The covered cities are: Belém, Belo Horizonte, Brasília, Campinas, Campo Grande, Cuiabá, Curitiba, Florianópolis, Fortaleza, Goiânia, Manaus, Porto Alegre, Recife, Rio de Janeiro, Salvador, São Paulo, and Uberlândia. Each city has 119 different weekly charts, and the week covered by each chart is indicated in the file name.
The covered acoustic features are provided by Spotify and are described as follows:
- Acousticness: a measure from 0.0 to 1.0 of whether the track is acoustic; 1.0 indicates a totally acoustic song and 0.0 a song without any acoustic element.
- Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Energy: a measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- Instrumentalness: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- Key: the key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- Liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- Loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- Mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
- Speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- Tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- Time Signature: an estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4".
- Valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
The dataset includes all the songs that have been on Spotify's Top 200 Weekly (Global) charts in 2020 & 2021. It includes the following features:
- Highest Charting Position: the highest position the song has reached on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
- Number of Times Charted: the number of times the song has appeared on the charts in 2020 & 2021.
- Week of Highest Charting: the week when the song reached its highest position on the charts.
- Song Name: name of the song.
- Song ID: the song ID provided by Spotify (unique to each song).
- Streams: approximate number of streams the song has.
- Artist: the main artist(s) involved in making the song.
- Artist Followers: the number of followers the main artist has on Spotify.
- Genre: the genres the song belongs to.
- Release Date: the initial date the song was released.
- Weeks Charted: the weeks the song has been on the charts in 2020 & 2021.
- Popularity: the popularity of the track. The value is between 0 and 100, with 100 being the most popular.
- Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Acousticness: a measure from 0.0 to 1.0 of whether the track is acoustic.
- Energy: a measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- Instrumentalness: predicts whether a track contains no vocals. The closer the value is to 1.0, the greater the likelihood the track contains no vocal content.
- Liveness: detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live.
- Loudness: the overall loudness of a track in decibels (dB), averaged across the entire track. Values typically range between -60 and 0 dB.
- Speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the value.
- Tempo: the overall estimated tempo of a track in beats per minute (BPM).
- Valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- Chord: the main chord of the song's instrumental.
Acknowledgements: this dataset would not be possible without the help of spotifycharts.com and the Spotipy Python library.
The dataset is composed of different musical genres, and for each genre there are features that characterize it. Reference: https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features