Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This English (United States) real-world casual conversation and monologue speech dataset covers self-media, conversation, live streams, lectures, variety shows, and similar sources, mirroring real-world interactions. Recordings are transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which enhances model performance on real, complex tasks, and has been quality-tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, maintaining user privacy and legal rights throughout data collection, storage, and usage; the datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle
16kHz, 16 bit, wav, mono channel;
Including self-media, conversation, live, lecture, variety-show, etc;
Low background noise;
America(USA);
en-US;
English;
Transcription text, timestamp, speaker ID, gender.
Sentence Accuracy Rate (SAR) 95%
Commercial License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Noise is an unwanted component of audio recordings, yet noise data plays an important role in machine learning on audio.
The dataset can be used for noise filtering, noise generation, and noise recognition in audio classification, audio recognition, audio generation, and other audio-related machine learning tasks. I, Min Si Thu, have used this dataset in open-source projects.
I collected ten types of noise in this dataset.
Location - Myanmar, Mandalay, Amarapura Township
This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.
Can machine learning be used to detect when speech is AI-generated?
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation; thus, there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.
To address these emerging issues, we introduce the DEEP-VOICE dataset. DEEP-VOICE comprises real human speech from eight well-known figures and their speech converted to one another's voices using Retrieval-based Voice Conversion.
For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:
(Figure: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech, with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)
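To illustrate the remixing step described above, here is a minimal sketch (using pydub) of layering the original accompaniment back under the RVC-converted vocals. The file names are hypothetical, and the source separation and RVC conversion themselves are assumed to have already been done.

```python
# Add the original accompaniment back under the converted vocals (sketch).
from pydub import AudioSegment

converted_vocals = AudioSegment.from_wav("obama-to-biden_vocals_rvc.wav")   # hypothetical file
accompaniment = AudioSegment.from_wav("obama_original_accompaniment.wav")   # hypothetical file

# Layer the original background ambience under the DeepFake vocals.
deepfake_mix = converted_vocals.overlay(accompaniment)
deepfake_mix.export("Obama-to-Biden.wav", format="wav")
```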
The dataset is made available in two forms.
First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.
Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the classes are balanced through random sampling.
**Note:** All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory contains cropped and compressed demos for playback in notebooks, due to Kaggle's limitations on file size.
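As a starting point, here is a minimal sketch of loading the balanced feature CSV and fitting an off-the-shelf classifier. The label column name ("LABEL") and its REAL/FAKE values are assumptions about the CSV layout, and the model choice is illustrative rather than the one used in the study cited below.

```python
# Minimal REAL/FAKE classification sketch on the extracted-feature CSV.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("DATASET-balanced.csv")
X = df.drop(columns=["LABEL"])   # extracted audio features (column name assumed)
y = df["LABEL"]                  # REAL or FAKE (values assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```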
A successful system could potentially be used as follows:
(Figure: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)
The dataset and all studies using it are linked on Papers with Code.
This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion"
Bird, J.J. and Lotfi, A., 2023. Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion. arXiv preprint arXiv:2308.12734.
The preprint is available on arXiv: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion (arXiv:2308.12734).
This dataset is provided under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
This UK English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.
Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.
The dataset contains 30 hours of dual-channel call center recordings between native UK English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
This speech corpus includes both inbound and outbound calls with varied conversational outcomes, including positive, negative, and neutral, ensuring broad scenario coverage for telecom AI development.
This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.
All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.
These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
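For orientation, here is a minimal sketch of reading one of the time-coded JSON transcriptions. The file name and every field name used here ("segments", "start", "end", "speaker", "text") are illustrative assumptions, so check them against the actual schema shipped with the dataset.

```python
# Read one time-coded transcription file (field names assumed for illustration).
import json

with open("call_0001_transcription.json", encoding="utf-8") as f:
    transcript = json.load(f)

for segment in transcript.get("segments", []):
    start, end = segment["start"], segment["end"]
    print(f"[{start:8.2f}-{end:8.2f}] {segment['speaker']}: {segment['text']}")
```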
Rich metadata is available for each participant and conversation:
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Description
• This dataset contains over 7,500 fart recordings collected over a period of 37 months. I created this dataset for educational purposes, to allow others to study and experiment with audio and, more broadly, signal processing.
• The files are in .wav format. I recorded all of the farts using a voice recording app in whatever environment I was in at the time, so there may be background noise, people talking, and variations in volume and clarity (note: I generally step far away to fart if others are present). I validate every single recording and delete those with low volume, too much background noise, or sounds that merely resemble farts (e.g. phone vibrations). Over time, I have lowered my threshold for acceptable background noise and strive for minimal or no extraneous noise. There are numerous types of farts as well. I did not record every fart I produced during the period; rather, I recorded when I was able to do so. Thus, this data is not inclusive of all emitted farts.
• I did not perform any preprocessing on this data, to maintain its versatility. For most audio tasks, you may consider trimming files to a consistent duration, along with other common audio preprocessing techniques; see the sketch after this list.
• The files are named as integers, starting from 1. The order of files bears no significance.
• If you are using these files specifically for fart classification or fart recognition, please bear in mind that this data is biased toward my farts. Consequently, a model trained on it may recognize someone else's farts, such as your own, with different results.
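As referenced in the list above, here is a minimal preprocessing sketch (using librosa) that loads each recording and trims or zero-pads it to a fixed duration. The folder name, sample rate, and 3-second target are arbitrary assumptions.

```python
# Trim or zero-pad each clip to a fixed length for downstream modeling.
import glob
import librosa
import numpy as np

TARGET_SR = 22050
TARGET_LEN = 3 * TARGET_SR  # 3 seconds (arbitrary choice)

def load_fixed_length(path):
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]
    return np.pad(y, (0, TARGET_LEN - len(y)))

clips = [load_fixed_length(p) for p in glob.glob("farts/*.wav")]
```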
Suggested Uses
• Unsupervised signal classification - You can experiment with categorizing farts without any preexisting knowledge of their defining characteristics and potentially apply these learnings to other signal types - speech, radar, TV, radio, light, EEG (see the clustering sketch after this list).
• Supervised signal recognition - This dataset could be used to experiment with developing deep learning models capable of recognizing whether a sound is a fart. An interesting property of farts is their variable frequencies and inconsistent durations.
• Sound effects creation - This dataset could be used by sound designers or audio engineers as a basis to create new sound effects for movies, video games, or other media. You could also simply use it as a publicly available and free source of farts.
• Education and outreach - Educators and scientists can use this dataset as an approach to better engage their audiences in signal processing and deep learning.
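For the unsupervised-classification idea above, here is a hedged sketch that summarizes each clip with mean MFCCs and clusters the summaries. The folder name, feature choice, and cluster count are assumptions, not properties of the dataset.

```python
# Cluster clips by mean MFCC vectors (unsupervised categorization sketch).
import glob
import librosa
import numpy as np
from sklearn.cluster import KMeans

def mfcc_summary(path, sr=22050, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)   # one fixed-length vector per clip

features = np.array([mfcc_summary(p) for p in glob.glob("farts/*.wav")])
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
```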
License
• This data is publicly and freely available to use and modify however you would like. There is no license and there are no limitations on use. I would appreciate being notified when this data is used publicly, purely for my own entertainment.
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
| Column | Description |
|---|---|
| id | file id (string) |
| file_path | file path to .wav file (string) |
| speech | transcription of the audio file (string) |
| speaker | speaker name; use this as the target variable if you are doing audio classification (string) |
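A minimal sketch of loading the table described above with pandas; the CSV file name is an assumption.

```python
# Load the metadata table and inspect the speaker classification target.
import pandas as pd

df = pd.read_csv("train.csv")              # file name assumed
print(df["speaker"].value_counts())        # class balance of the target variable
first_clip = df.loc[0, "file_path"]        # path to a .wav file to load/featurize
```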
The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the database, the recording environment, etc., is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177.
The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.
The 1132 sentence prompt list is available from cmuarctic.data
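A small sketch for reading the prompt list, assuming cmuarctic.data uses the Festival-style prompt format (one parenthesized utterance ID and quoted sentence per line); treat that format as an assumption and adjust the pattern if the file differs.

```python
# Parse Festival-style prompt entries such as:
#   ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
import re

PROMPT_RE = re.compile(r'\(\s*(\S+)\s+"(.*)"\s*\)')

prompts = {}
with open("cmuarctic.data", encoding="utf-8") as f:
    for line in f:
        match = PROMPT_RE.match(line.strip())
        if match:
            utt_id, text = match.groups()
            prompts[utt_id] = text

print(len(prompts), "prompts loaded")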
The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be built by improving the labeling, etc.
This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The objective is to remove or suppress indoor and outdoor noise present during a conversation so the other person on the call hears clean speech, either by improving the recording system or by filtering out environmental noise.
Real-time noise reduction using DNNs:
- https://devblogs.nvidia.com/nvidia-real-time-noise-suppression-deep-learning/
- http://staff.ustc.edu.cn/~jundu/Publications/publications/Trans2015_Xu.pdf
Design a network that works for both stationary and non-stationary noise.
Prepared data: 1-second speech clips mixed with indoor and outdoor environmental noise.
- To prepare longer speech segments with noise mixed in, use the audio and marcas Python libraries.
- To denoise longer speech, train a model on 1-second clips and apply it to longer wav files by processing them in 1-second chunks.
Data preparation notebooks: (a) ProcessWav.ipynb, (b) Processdata.ipynb.
Additional reference repository for preparing datasets
The architecture and summary of the noise-reduction model are attached along with the dataset.
Train and test code is available for reference (https://github.com/vijay033/Noise-Suppression-Auto-Encoder): (a) trainnoise.ipynb trains the model and generates model.h5; (b) testprediction uses model.h5, chunks the audio into 1-second segments, passes each chunk through the model, and rejoins all the chunks.
Sampling rate: 16 kHz; file format: WAV; NPY dimension: 257 x 62 x 3.
Chunking wav files into small segments:
- https://www.programcreek.com/python/example/89506/pydub.AudioSegment.from_file
- https://readthedocs.org/projects/audiosegment/downloads/pdf/latest/
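A sketch of the 1-second chunk-process-rejoin workflow described above, using pydub; the file names are placeholders and denoise() stands in for the trained model.

```python
# Chunk a long wav into 1-second segments, process each, and rejoin them.
from pydub import AudioSegment

CHUNK_MS = 1000

def denoise(chunk):
    # Placeholder: run the 1-second chunk through the trained model here.
    return chunk

audio = AudioSegment.from_file("noisy_long.wav", format="wav")
chunks = [audio[i:i + CHUNK_MS] for i in range(0, len(audio), CHUNK_MS)]

cleaned = AudioSegment.empty()
for chunk in chunks:
    cleaned += denoise(chunk)
cleaned.export("cleaned_long.wav", format="wav")
```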
Free sound noise: https://www.freesoundeffects.com/free-sounds/
Speech datasets: http://www.openslr.org/12 (not all samples used; train-clean-100.tar.gz, 6.3 GB) and https://openslr.org/83/
Expected outcome: a model, such as a CNN or autoencoder, that removes or suppresses noise from the original speech.
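For illustration only, here is a hedged Keras sketch of a convolutional autoencoder that accepts the 257 x 62 x 3 arrays noted above; this is not the architecture from the linked repository, just a minimal stand-in.

```python
# Minimal convolutional autoencoder for 257 x 62 x 3 spectrogram-like inputs.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(257, 62, 3))
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
outputs = layers.Conv2D(3, 3, padding="same", activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(noisy_npy, clean_npy, epochs=..., batch_size=...)
```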
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Spotify Recommendation’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bricevergnou/spotify-recommendation on 28 January 2022.
--- Dataset description provided by original source is as follows ---
(You can check how I used this dataset on my GitHub repository.)
I am basically a HUGE fan of music (mostly French rap, though with some exceptions). One day, while browsing the Internet, I found Spotify's API. I knew I had to use it when I found out you could get information like danceability about your favorite songs just from their IDs.
Once I saw that, my machine learning instincts forced me to work on this project.
I collected 100 liked songs and 95 disliked songs.
For those I like, I made a playlist of my favorite 100 songs. It is mainly French rap, with some American rap, rock, and electro music.
For those I dislike, I collected songs from various kinds of music so the model would have a broader view of what I don't like.
There are:
- 25 metal songs (Cannibal Corpse)
- 20 "I don't like" rap songs (PNL)
- 25 classical songs
- 25 disco songs
I didn't include any pop songs because I'm kind of neutral about them.
Using Spotify's "Get a Playlist's Items" API, I turned the playlists into JSON-formatted data that contains the ID and the name of each track (ids/yes.py and ids/no.py). NB: on the website, specify "items(track(id,name))" in the fields parameter to avoid being overwhelmed by useless data.
With a script (ids/ids_to_data.py), I turned the JSON data into a long string with each ID separated by a comma.
Then I just had to pass those strings to the Spotify "Get Audio Features for Several Tracks" API to get my data files (data/good.json and data/dislike.json).
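The same workflow can be scripted with the Spotipy client instead of the web console; in this sketch the playlist ID and credentials are placeholders.

```python
# Fetch a playlist's track IDs, then request their audio features via Spotipy.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

items = sp.playlist_items("PLAYLIST_ID", fields="items(track(id,name))")
track_ids = [it["track"]["id"] for it in items["items"]]

features = sp.audio_features(track_ids[:100])  # the endpoint accepts up to 100 IDs
print(features[0]["danceability"])
```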
From Spotify's API documentation:
And the variable that has to be predicted:
--- Original source retains full ownership of the source dataset ---
Speech is the most natural way of expressing ourselves as humans. It is only natural, then, to extend this communication medium to computer applications. We define speech emotion recognition (SER) systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions. SER is not a new field; it has been around for over two decades and has regained attention thanks to recent advancements. These novel studies make use of the advances in all fields of computing and technology, making it necessary to have an update on the current methodologies and techniques that make SER possible. We have identified and discussed distinct areas of SER, provided a detailed survey of the current literature of each, and listed the current challenges.
Here are the four most popular English datasets: Crema, Ravdess, Savee, and Tess. Each contains audio in .wav format with the main labels.
Ravdess:
Here are the filename identifiers as per the official RAVDESS website:
So, here's an example of an audio filename: 02-01-06-01-02-01-12.wav. The metadata for this audio file is:
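A small sketch that decodes the example filename, based on the identifier convention published on the official RAVDESS website (the mappings below are a convenience copy of that convention, not something shipped with this dataset).

```python
# Decode a RAVDESS filename of the form modality-channel-emotion-intensity-statement-repetition-actor.
FIELDS = ["modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor"]
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def parse_ravdess(filename):
    parts = filename.removesuffix(".wav").split("-")
    info = dict(zip(FIELDS, parts))
    info["emotion"] = EMOTIONS.get(info["emotion"], info["emotion"])
    info["actor_gender"] = "female" if int(info["actor"]) % 2 == 0 else "male"
    return info

print(parse_ravdess("02-01-06-01-02-01-12.wav"))
# -> emotion 'fearful', actor 12 (female), speech vocal channel, normal intensity
```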
Crema:
The third component of the filename carries the emotion label:
* SAD - sadness
* ANG - anger
* DIS - disgust
* FEA - fear
* HAP - happiness
* NEU - neutral
Tess:
Very similar to Crema: the emotion label is contained in the file name.
Savee:
The audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows:
It is my pleasure to credit the notebook that inspired me to publish this dataset publicly.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Emotion expression is an essential part of human interaction. The same text can hold different meanings when expressed with different emotions. Thus understanding the text alone is not enough for getting the meaning of an utterance. Acted and natural corpora have been used to detect emotions from speech. Many speech databases for different languages including English, German, Chinese, Japanese, Russian, Italian, Swedish and Spanish exist for modeling emotion recognition. Since there is no reported reference of an available Arabic corpus, we decided to collect the first Arabic Natural Audio Dataset (ANAD) to recognize discrete emotions.
Embedding an effective emotion detection feature in speech recognition system seems a promising solution for decreasing the obstacles faced by the deaf when communicating with the outside world. There exist several applications that allow the deaf to make and receive phone calls normally, as the hearing-impaired individual can type a message and the person on the other side hears the words spoken, and as they speak, the words are received as text by the deaf individual. However, missing the emotion part still makes these systems not hundred percent reliable. Having an effective speech to text and text to speech system installed in their everyday life starting from a very young age will hopefully replace the human ear. Such systems will aid deaf people to enroll in normal schools at very young age and will help them to adapt better in classrooms and with their classmates. It will help them experience a normal childhood and hence grow up to be able to integrate within the society without external help.
Eight videos of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows. Each video was then divided into turns: callers and receivers. To label each video, 18 listeners were asked to listen to each video and select whether they perceive a happy, angry or surprised emotion. Silence, laughs and noisy chunks were removed. Every chunk was then automatically divided into 1 sec speech units forming our final corpus composed of 1384 records.
Twenty-five acoustic features, also known as low-level descriptors (LLDs), were extracted: intensity, zero-crossing rate, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (fundamental frequency) and F0 envelope, probability of voicing, and LSP frequencies 0-7. Nineteen statistical functionals were applied to every feature: maximum, minimum, range, absolute position of maximum, absolute position of minimum, arithmetic mean, linear regression 1, linear regression 2, linear regression A, linear regression Q, standard deviation, kurtosis, skewness, quartiles 1, 2, 3, and inter-quartile ranges 1-2, 2-3, 1-3. The delta coefficient of every LLD was also computed as an estimate of its first derivative, giving 25 LLDs plus 25 deltas, each with 19 functionals, for a total of 50 x 19 = 950 features.
I would have never reached that far without the help of my supervisors. I warmly thank and appreciate Dr. Rached Zantout, Dr. Lama Hamandi, and Dr. Ziad Osman for their guidance, support and constant supervision.
The EMODB database is a freely available German emotional speech database, created by the Institute of Communication Science, Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances and comprises seven emotions: 1) anger; 2) boredom; 3) anxiety; 4) happiness; 5) sadness; 6) disgust; and 7) neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.
Every utterance is named according to the same scheme:
Example: 03a01Fa.wav is the audio file from Speaker 03 speaking text a01 with the emotion "Freude" (Happiness).
| letter | emotion (English) | letter | emotion (German) |
|---|---|---|---|
| A | anger | W | Ärger (Wut) |
| B | boredom | L | Langeweile |
| D | disgust | E | Ekel |
| F | anxiety/fear | A | Angst |
| H | happiness | F | Freude |
| S | sadness | T | Trauer |

N = neutral version
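A small sketch that decodes EMODB filenames using the naming scheme and the letter table above; note that the filenames use the German letter codes (e.g. F for "Freude", as in the example 03a01Fa.wav).

```python
# Decode an EMODB filename: speaker (2 chars) + text code (3 chars) + emotion letter + version.
EMOTION_LETTERS = {"W": "anger", "L": "boredom", "E": "disgust",
                   "A": "anxiety/fear", "F": "happiness", "T": "sadness",
                   "N": "neutral"}

def parse_emodb(filename):
    stem = filename.removesuffix(".wav")
    return {
        "speaker": stem[:2],        # e.g. "03"
        "text_code": stem[2:5],     # e.g. "a01"
        "emotion": EMOTION_LETTERS[stem[5]],
        "version": stem[6:],        # e.g. "a"
    }

print(parse_emodb("03a01Fa.wav"))
```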
Emotion classification from speech has attracted increasing interest in the speech processing area. The objective of emotion classification is to classify different emotions from the speech signal. A person's emotional state affects the speech production mechanism, and because of this, breathing rate and muscle tension change from the neutral condition. Therefore, the resulting speech signal may have different characteristics from that of neutral speech.
The performance of speech recognition or speaker recognition decreases significantly if the model is trained on neutral speech and tested on emotional speech. So, as machine learning enthusiasts, we can start working on speech emotion recognition problems and come up with good, robust models.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is a dataset of Spotify tracks spanning 125 different genres. Each track has audio features associated with it. The data is in CSV format, which is tabular and can be loaded quickly.
The dataset can be used for:
Key: integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on; if no key was detected, the value is -1. Time signature: an estimated meter, ranging from "3/4" to "7/4".
Image credits: BPR World
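A tiny helper for decoding the key field under standard Pitch Class notation (the full twelve-name list below is standard notation, not something taken from this dataset's files).

```python
# Map Spotify's integer key field to a pitch-class name; -1 means no key detected.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def key_name(key):
    return "unknown" if key == -1 else PITCH_CLASSES[key]

print(key_name(1))   # "C♯/D♭"
print(key_name(-1))  # "unknown"
```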
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a detailed regional overview of digital music consumption in Brazil on Spotify between 2021 and 2023. It includes acoustic features and all genres/artists that were listened to at least once in those years. The data is provided by the Spotify API for Developers and SpotifyCharts, which are used to collect the acoustic features and the summarized most-listened songs per city, respectively.
It covers 17 cities across 16 different Brazilian states, comprising 5190 unique tracks, 487 different genres, and 2056 artists. The covered cities are: Belém, Belo Horizonte, Brasília, Campinas, Campo Grande, Cuiabá, Curitiba, Florianópolis, Fortaleza, Goiânia, Manaus, Porto Alegre, Recife, Rio de Janeiro, Salvador, São Paulo, and Uberlândia. Each city has 119 different weekly charts, and the week covered by each chart is indicated in the file name.
The covered acoustic features are provided by Spotify and are described as follows:
- Acousticness: a measure from 0.0 to 1.0 of whether the track is acoustic; 1.0 indicates a totally acoustic song and 0.0 a song without any acoustic element.
- Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Energy: a measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- Instrumentalness: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- Key: the key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- Liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- Loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- Mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
- Speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- Tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- Time Signature: an estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4".
- Valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
The dataset includes all the songs that have been on Spotify's Top 200 Weekly (Global) charts in 2020 & 2021. It includes the following features:
- Highest Charting Position: the highest position the song has reached on the Spotify Top 200 Weekly Global Charts in 2020 & 2021.
- Number of Times Charted: the number of times the song has appeared on the charts in 2020 & 2021.
- Week of Highest Charting: the week when the song reached its highest position on the charts.
- Song Name: name of the song.
- Song ID: the song ID provided by Spotify (unique to each song).
- Streams: approximate number of streams the song has.
- Artist: the main artist(s) involved in making the song.
- Artist Followers: the number of followers the main artist has on Spotify.
- Genre: the genres the song belongs to.
- Release Date: the initial date the song was released.
- Weeks Charted: the weeks the song has been on the charts in 2020 & 2021.
- Popularity: the popularity of the track. The value is between 0 and 100, with 100 being the most popular.
- Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Acousticness: a measure from 0.0 to 1.0 of whether the track is acoustic.
- Energy: a measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- Instrumentalness: predicts whether a track contains no vocals. The closer the value is to 1.0, the greater the likelihood the track contains no vocal content.
- Liveness: detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live.
- Loudness: the overall loudness of a track in decibels (dB), averaged across the entire track. Values typically range between -60 and 0 dB.
- Speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the value.
- Tempo: the overall estimated tempo of a track in beats per minute (BPM).
- Valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- Chord: the main chord of the song's instrumental.
Acknowledgements: this dataset would not be possible without the help of spotifycharts.com and the Spotipy Python library.
The dataset is composed of different musical genres, and for each genre there are features that characterize it. Reference: https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features