Results of a survey of 403 Discord users. The selection was random and the servers were random; many people refused to take part, but some agreed. Only Russian-speaking users were surveyed. When creating the survey, I notified users that after completion I would analyze the data and post the results publicly. No personal user data was collected.
In general, you can see that I like Discord, as well as the psychological focus of some of the questions. I have no experience doing something like this, but I still tried to do everything as correctly as possible.
This version is translated into English. The data has also been cleaned, with anything unnecessary removed or changed.
The original extracted versions (in .srt and .ass format) are also included in this release (which, for some reason, Kaggle decompressed >:U).
This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, of which 65,672 were removed during cleaning.
This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in TSV format.
V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaning algorithms to make sentences read as natural language rather than formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files).
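To illustrate the multi-core idea only, here is a minimal sketch of parallelizing per-file cleaning with Python's multiprocessing module. The clean_line function and file names are hypothetical stand-ins; the actual cleaning rules live in the linked clean-discord script.

import multiprocessing as mp

def clean_line(line):
    # Hypothetical stand-in: the real script applies the clean-discord rules.
    return line.strip()

def clean_file(path):
    # Clean one subtitle text file and drop lines that end up empty.
    with open(path, encoding="utf-8") as f:
        return [cleaned for cleaned in (clean_line(line) for line in f) if cleaned]

if __name__ == "__main__":
    files = ["episode1.txt", "episode2.txt"]  # hypothetical input files
    with mp.Pool() as pool:  # one worker process per CPU core by default
        cleaned_files = pool.map(clean_file, files)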
The files are now all compressed to save space, and are compatible with TensorFlow datasets. You can initialize a dataset function as follows:
import functools
import os
import tensorflow as tf

def dataset_fn_local(split, shuffle_files=False):
    global nq_tsv_path
    del shuffle_files
    # Load lines from the text files as examples.
    files_to_read = [os.path.join(nq_tsv_path[split], filename)
                     for filename in os.listdir(nq_tsv_path[split])
                     if filename.startswith(split)]
    print(f"Split {split} contains {len(files_to_read)} files.\n"
          f"First 10: {files_to_read[0:10]}")
    # Stream the gzip-compressed TSV files and drop empty lines.
    ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(
        lambda line: tf.not_equal(tf.strings.length(line), 0))
    ds = ds.shuffle(buffer_size=600000)
    # Split each tab-delimited line into a (question, answer) pair.
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
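For example, assuming nq_tsv_path maps each split name to the directory holding its compressed files (the paths below are hypothetical), the dataset can be inspected like this:

# Hypothetical paths; point each split at the directory holding its files.
nq_tsv_path = {"train": "/path/to/dataset/train",
               "validation": "/path/to/dataset/validation"}
ds = dataset_fn_local("train")
for example in ds.take(3):
    print(example["question"].numpy(), example["answer"].numpy())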
A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.
This dataset is far from complete! I hope there are people out there willing to find, add, and clean more data, and that they will help out in the effort to grow this dataset.
Dataset Card for Unsupervised Peoples Speech
Dataset Description
Dataset Summary
The Unsupervised Peoples Speech Dataset is a compilation of audio files extracted from Archive.org, licensed for academic and commercial usage under CC-BY and CC-BY-SA licenses. It includes more than one million hours of audio from a diverse set of speakers.
Point of Contact: MLCommons Datasets Discord
Dataset Structure
This dataset is a collection of audio… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/unsupervised_peoples_speech.