CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Japanese Dataset (日本語データセット): High-Quality Japanese TTS Dataset for AI & Speech Models. Overview: Title: Japanese Language Dataset. Dataset Type: TTS. Description: Single-utterance recordings, which tend to fall in the 5 to 30 second range. Use Case: ASR,…
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Japanese Call Center Speech Dataset for the Telecom domain, designed to support the development of call center speech recognition models for the Telecom industry. This dataset is meticulously curated to support advanced speech recognition, natural language processing, conversational AI, and generative voice AI algorithms.
This training dataset comprises 40 hours of call center audio recordings covering various topics and scenarios related to the Telecom domain, designed to build robust and accurate customer service speech technology.
This dataset offers a diverse range of conversation topics, call types, and outcomes, including both inbound and outbound calls with positive, neutral, and negative outcomes.
This extensive coverage ensures the dataset includes realistic call center scenarios, which is essential for developing effective customer support speech recognition models.
To facilitate your workflow, the dataset includes manual verbatim transcriptions of each call center audio file in JSON format. These transcriptions feature:
These ready-to-use transcriptions accelerate the development of Telecom-domain call center conversational AI and ASR models for the Japanese language.
The dataset provides comprehensive metadata for each conversation and participant:
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Japanese Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of Japanese language speech recognition models, with a particular focus on the accents and dialects of Japan.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the Japanese language spoken in Japan.
Speech Data: This training dataset comprises 50 hours of audio recordings covering a wide range of topics and scenarios, supporting robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 70 native Japanese speakers from different prefectures of Japan. This collaborative effort aims for a balanced representation of Japanese accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
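The stated audio format (WAV, stereo, 16-bit, 8 kHz) can be checked on a received file before training. The following is a minimal sketch using Python's standard `wave` module; the function names `describe_wav` and `matches_stated_format` are illustrative and not part of any FutureBeeAI tooling, and the sketch assumes uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Read basic header properties of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "bit_depth": wf.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def matches_stated_format(info):
    """True if the file is stereo, 16-bit, 8 kHz, as the description states."""
    return (info["channels"], info["bit_depth"], info["sample_rate"]) == (2, 16, 8000)
```

Calling `matches_stated_format(describe_wav("some_call.wav"))` on each file (the file name here is hypothetical) gives a quick way to flag recordings that were delivered at a customized sample rate.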
Metadata: In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Additional metadata such as recording device details, topic of recording, bit depth, and sample rate is also provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Japanese language speech recognition models.
Transcription: This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format and are segmented by speaker with time codes, along with non-speech labels and tags.
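As an illustration of how speaker-wise, time-coded JSON transcriptions might be consumed, here is a minimal sketch that sums talk time per speaker. The actual schema is not given in this description, so the keys `speaker`, `start`, `end`, and `text`, and the `[noise]` tag, are assumptions made for the example only.

```python
import json
from collections import defaultdict

# Hypothetical transcript; the real field names may differ.
SAMPLE = """[
  {"speaker": "A", "start": 0.0, "end": 2.5, "text": "もしもし"},
  {"speaker": "B", "start": 2.5, "end": 4.0, "text": "[noise]"},
  {"speaker": "A", "start": 4.0, "end": 7.0, "text": "こんにちは"}
]"""


def talk_time_by_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


segments = json.loads(SAMPLE)
totals = talk_time_by_speaker(segments)
print(totals)
```

The same loop can be adapted to filter out non-speech segments once the real tag set is known.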
Our goal is to expedite the deployment of Japanese language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization: We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8 kHz to 48 kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
License: This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
The gravity station data (4,381 records) were compiled by the Japanese Oceanographic Data Center. This data base was received in July 1988. The data are in the 'MGD77' exchange format. Principal gravity parameters include Free-air Anomalies and Observed gravity corrected for Eotvos, drift, and tares. The observed gravity values are referenced to the International Gravity Standardization Net 1971 (IGSN 71). The gravity anomaly computation uses the Geodetic Reference System 1967 (GRS 67) theoretical gravity formula. The data are randomly distributed within the boundaries of Japan.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English-Japanese Bilingual Parallel Corpora dataset for the Environment domain! This comprehensive dataset contains a vast collection of bilingual text data, carefully translated between English and Japanese, to support the development of environment-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Environment domain.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Japanese Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content: This open-ended classification dataset comprises a diverse set of prompts and responses. The prompt contains the input text to be classified and may also contain a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both prompts and completions are in Japanese. As this is an open-ended dataset, the prompt does not list candidate categories to choose from.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity: To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Japanese Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
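A quick completeness check on the annotation fields listed above might look like the following sketch. The column names in `EXPECTED_FIELDS` and the sample row are hypothetical; the description names the annotation details but not the exact keys used in the JSON or CSV files.

```python
import csv
import io

# Assumed column names, one per annotation detail named in the description.
EXPECTED_FIELDS = [
    "id", "prompt", "prompt_type", "prompt_length", "prompt_complexity",
    "domain", "response", "response_type", "rich_text",
]

# Hypothetical one-row CSV export used only to exercise the check.
SAMPLE_CSV = (
    "id,prompt,prompt_type,prompt_length,prompt_complexity,"
    "domain,response,response_type,rich_text\n"
    "1,この文を分類してください,instruction,short,easy,science,物理学,single-word,no\n"
)


def missing_fields(record):
    """Return the expected fields that are absent or empty in a record."""
    return [f for f in EXPECTED_FIELDS if record.get(f) in (None, "")]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(row["id"], missing_fields(row))
```

Because `missing_fields` only needs a mapping, the same function works on records loaded from the JSON version with `json.load`.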
Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Japanese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Japanese Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
japanese-asr/ja_asr.common_voice_8_0 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A corpus database managed by the MedNLP Laboratory, Kyoto University, Japan. This corpus was compiled using data from 22 people aged 74 to 86 years (mean age: 78.32 years; standard deviation [SD]: 3.36) who agreed to provide data for research purposes. The corpus also includes data from people under 74 years of age (30 data items in total).
Japanese-Novels-23M
This dataset contains Japanese web novels that I collected personally. Machine-Learning Use Only: Access is restricted to bona fide machine-learning-related purposes. To request access, please provide a detailed explanation of the specific tasks or applications for which you intend to use the dataset.
Total records: 23,212,809; Total characters: 80,846,120,027; Total tokens (Llama 4 tokenizer): 55,406,468,406 (55.4 B)
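The three totals above can be turned into per-record and per-token averages, which are handy when estimating sequence lengths or training budgets. This is a small arithmetic sketch using only the figures stated in the listing; the variable names are illustrative.

```python
# Figures copied from the dataset statistics above.
TOTAL_RECORDS = 23_212_809
TOTAL_CHARACTERS = 80_846_120_027
TOTAL_TOKENS = 55_406_468_406  # counted with the Llama 4 tokenizer, as stated

chars_per_token = TOTAL_CHARACTERS / TOTAL_TOKENS
chars_per_record = TOTAL_CHARACTERS / TOTAL_RECORDS
tokens_per_record = TOTAL_TOKENS / TOTAL_RECORDS

print(f"characters per token:  {chars_per_token:.2f}")
print(f"characters per record: {chars_per_record:.0f}")
print(f"tokens per record:     {tokens_per_record:.0f}")
```

With these figures the ratio comes to roughly 1.46 characters per token, and the average record is a little under 3,500 characters long.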
https://data.macgence.com/terms-and-conditions
The audio dataset includes speech corpora featuring Japanese speakers from Japan, with detailed metadata.
As of October 2024, approximately ****** Japanese residents lived in Hong Kong. Over the past decade, the Japanese population in the city has shown a general downward trend from around ****** residents ten years earlier.
The data contains 101,702 entries. All words and pronunciations are produced by Japanese linguists. It can be used in the research and development of Japanese ASR technology.
Japanese (Japan) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics. Transcribed with text content, timestamp, speaker's ID, gender, and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (around 1000 native speakers), which helps model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, maintaining user privacy and legal rights throughout the data collection, storage, and usage processes. Our datasets are all GDPR, CCPA, and PIPL compliant.
https://cdla.io/sharing-1-0/
This dataset was created by Keith Lê
Released under Community Data License Agreement - Sharing - Version 1.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vital Statistics: Japanese Only: Natural Increase data was reported at -394,373.000 Person in 2017. This records a decrease from the previous number of -330,770.000 Person for 2016. Vital Statistics: Japanese Only: Natural Increase data is updated yearly, averaging 768,649.000 Person from Dec 1947 (Median) to 2017, with 71 observations. The data reached an all-time high of 1,751,194.000 Person in 1949 and a record low of -394,373.000 Person in 2017. Vital Statistics: Japanese Only: Natural Increase data remains active status in CEIC and is reported by Ministry of Health, Labour and Welfare. The data is categorized under Global Database’s Japan – Table JP.G005: Vital Statistics.
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in Japanese Investment in the United States: Superior Performance, Increasing Integration, PIIE Policy Brief 15-3. If you use the data, please cite as: Oldenski, Lindsay, and Theodore H. Moran. (2015). Japanese Investment in the United States: Superior Performance, Increasing Integration. PIIE Policy Brief 15-3. Peterson Institute for International Economics.
japanese-asr/whisper_transcriptions.reazonspeech.all.wer_10.0.vectorized dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Industrial Machinery: Overseas Sub: No of Employees data was reported at 471,725.000 Person in Mar 2018. This records a decrease from the previous number of 472,111.000 Person for Dec 2017. Industrial Machinery: Overseas Sub: No of Employees data is updated quarterly, averaging 193,903.500 Person from Dec 1996 (Median) to Mar 2018, with 86 observations. The data reached an all-time high of 472,111.000 Person in Dec 2017 and a record low of 78,489.000 Person in Dec 1996. Industrial Machinery: Overseas Sub: No of Employees data remains active status in CEIC and is reported by Ministry of Economy, Trade and Industry. The data is categorized under Global Database’s Japan – Table JP.S059: Japanese Business Activities Survey: Overseas Sub: Major Indicators.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Japan DI Index: Other Mfg: OS: Number of Employees data was reported at 6.700 % in Mar 2018. This records an increase from the previous number of 2.000 % for Dec 2017. Japan DI Index: Other Mfg: OS: Number of Employees data is updated quarterly, averaging 5.550 % from Dec 1996 (Median) to Mar 2018, with 86 observations. The data reached an all-time high of 17.400 % in Mar 2010 and a record low of -29.800 % in Dec 2008. Japan DI Index: Other Mfg: OS: Number of Employees data remains active status in CEIC and is reported by Ministry of Economy, Trade and Industry. The data is categorized under Global Database’s Japan – Table JP.S060: Japanese Business Activities Survey: Overseas Sub: Diffusion Index.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unemployment Rate in Japan remained unchanged at 2.50 percent in April. This dataset provides the latest reported value for - Japan Unemployment Rate - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.