3 datasets found
  1. Discord Survey

    • kaggle.com
    Updated Feb 13, 2022
    Cite
    Yonko (Czeslaw Meyer) (2022). Discord Survey [Dataset]. https://www.kaggle.com/datasets/yonkotoshiro/discord-survey/code
    Dataset updated
    Feb 13, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yonko (Czeslaw Meyer)
    Description

    Results of a survey of 403 Discord users. Respondents and servers were chosen at random; many people declined to take part, but some agreed. Only Russian-speaking users were surveyed. When creating the survey, I notified users that after completion I was going to analyze the data and post the results in the public domain. No personal user data was collected.

    In general, you can see that I like Discord, and that some of the questions have a psychological focus. I have no experience with this kind of work, but I tried to do everything as correctly as possible.

    This version is translated into English. I also cleaned the data and removed or changed anything that wasn't needed.

  2. Anime Subtitles

    • kaggle.com
    Updated Aug 19, 2021
    Cite
    Jess Fan (2021). Anime Subtitles [Dataset]. https://www.kaggle.com/datasets/jef1056/anime-subtitles/discussion
    Dataset updated
    Aug 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jess Fan
    Description

    Content

    The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but kaggle decompressed >:U)

    This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.
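
    As a quick sanity check on the figures above, the short sketch below works out how many messages were dropped during cleaning and the average number of messages per episode. The variable names are illustrative; only the three counts come from the description.

```python
# Message counts quoted in the dataset description.
raw_messages = 1_563_442
clean_messages = 1_497_770
episodes = 3_836

# Messages dropped by the cleaning step, absolute and as a share of the raw set.
removed = raw_messages - clean_messages
removed_pct = 100 * removed / raw_messages

# Average number of cleaned messages per episode.
per_episode = clean_messages / episodes

print(f"removed during cleaning: {removed:,} ({removed_pct:.1f}% of raw)")
print(f"average messages per episode: {per_episode:.0f}")
```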

    This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in the TSV format.

    V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences read more like natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files).

    Format

    The files are now all compressed to save space and are compatible with TensorFlow datasets. You can initialize a dataset function as such:

        import functools
        import os

        import tensorflow as tf

        def dataset_fn_local(split, shuffle_files=False):
            # nq_tsv_path is expected to map each split name to its directory.
            global nq_tsv_path
            del shuffle_files
            # Load lines from the text file as examples.
            files_to_read = [
                os.path.join(nq_tsv_path[split], filename)
                for filename in os.listdir(nq_tsv_path[split])
                if filename.startswith(split)
            ]
            print(f"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}")
            # Read the gzip-compressed files line by line and drop empty lines.
            ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(
                lambda line: tf.not_equal(tf.strings.length(line), 0))
            ds = ds.shuffle(buffer_size=600000)
            # Split each tab-separated line into two string fields.
            ds = ds.map(
                functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
            ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
            return ds
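
    For a quick look at the data without TensorFlow, the same layout can be read with the standard library alone. This is a minimal sketch, assuming what the loader above implies: gzip-compressed text, one tab-separated question/answer pair per line, empty lines ignored. The helper name read_pairs and the demo file name are hypothetical, and unlike decode_csv this version splits on the first tab only.

```python
import gzip
import os
import tempfile


def read_pairs(path):
    """Yield {"question", "answer"} dicts from one gzip-compressed TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines, like the filter() in the TensorFlow loader
            # Assumes every non-empty line has a tab; quotes are ordinary characters.
            question, answer = line.split("\t", 1)
            yield {"question": question, "answer": answer}


# Build a tiny hypothetical file so the sketch runs without the real dataset.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "train-demo.txt.gz")  # hypothetical file name
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("Are you ready?\tAlways.\n\nWhere are we going?\tTo the station.\n")

pairs = list(read_pairs(demo_path))
print(pairs[0])
```

    To try it on the real files, point read_pairs at one of the compressed files in the Dataset folder instead of the demo file.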

    Acknowledgements

    A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.

    This dataset is far from complete! I hope that people who are willing to find, add, and clean data are out there and can help in the effort to grow this dataset.

  3. unsupervised_peoples_speech

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    MLCommons (2025). unsupervised_peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/unsupervised_peoples_speech
    Dataset updated
    Feb 20, 2025
    Dataset authored and provided by
    MLCommons
    Description

    Dataset Card for Unsupervised Peoples Speech

    Dataset Description

    Dataset Summary

    The Unsupervised Peoples Speech Dataset is a compilation of audio files extracted from Archive.org that are licensed for academic and commercial usage under CC-BY and CC-BY-SA licenses. It includes more than one million hours of audio with a diverse set of speakers.

    Point of Contact: MLCommons Datasets Discord

    Dataset Structure

    This dataset is a collection of audio… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/unsupervised_peoples_speech.

Discord Survey

Survey of 400 Russian-speaking Discord users

13 scholarly articles cite this dataset (View in Google Scholar)