Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages
The dataset contains more than 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models and is designed for language-model and instruction fine-tuning to improve performance on various NLP tasks.
Models used for text generation:
GPT-3.5, GPT-4, Uncensored GPT Version (not included in the sample)
Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
https://cdla.io/sharing-1-0/
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247
LLM texts: 3004
See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts
Version 3: "The RDizzl3 Seven"
File: train_essays_RDizzl3_seven_v1.csv
"Car-free cities"
"Does the electoral college work?"
"Exploring Venus"
"The Face on Mars"
"Facial action coding system"
"A Cowboy Who Rode the Waves"
"Driverless cars"
How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"
train_essays_7_prompts_v2.csv
This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts. Namely:
"Car-free cities"
"Does the electoral college work?"
"Exploring Venus"
"The Face on Mars"
"Facial action coding system"
"Seeking multiple opinions"
"Phones and driving"
This dataset is a derivative of the datasets
as well as the original competition training dataset
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
stas1k/llm-bootcamp-train-samples dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission "Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text". Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
"id": "ont_k_music_test_n",
"sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
"triples": [
{
"sub": "The Loco-Motion",
"rel": "publication date",
"obj": "01 January 1962"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Gerry Goffin"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Carole King"
}]
}
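The ontology-compliance requirement described above can be illustrated with a minimal sketch. It assumes a small, hypothetical set of allowed relations (the real Music Ontology relation list is not reproduced here) and only checks that each triple's "rel" value belongs to that set; it does not check domain/range constraints or faithfulness to the sentence.

```python
# Hypothetical subset of relations permitted by the ontology (assumed for illustration).
ALLOWED_RELATIONS = {"publication date", "lyrics by", "performer"}

# Record in the same shape as the expected output shown above.
record = {
    "id": "ont_k_music_test_n",
    "sent": '"The Loco-Motion" is a 1962 pop song written by American '
            "songwriters Gerry Goffin and Carole King.",
    "triples": [
        {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"},
    ],
}


def check_triples(rec, allowed_relations):
    """Split a record's triples into ontology-conformant and non-conformant lists."""
    valid, invalid = [], []
    for triple in rec["triples"]:
        if triple["rel"] in allowed_relations:
            valid.append(triple)
        else:
            invalid.append(triple)
    return valid, invalid


valid, invalid = check_triples(record, ALLOWED_RELATIONS)
print(f"{len(valid)} conformant, {len(invalid)} non-conformant triples")
```

A triple whose relation is outside the allowed set (for example "composer" in this sketch) would land in the second list.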
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows.
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results
This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Silencio’s interpolation dataset delivers spatially continuous noise data combining: • 10M+ hours of real dBA measurements • AI-generated interpolations
Applications: • AI-based acoustic mapping • Digital twin and simulation models • Ground-truth data for AI validation
Delivered via CSV or S3. GDPR-compliant.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective
This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods
In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results
In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion
Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
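The 95% CI overlap criterion and the reported percentages can be sketched as follows. This is a minimal illustration, not the study's code: it assumes a normal-approximation interval (mean ± 1.96 · SD / √n), and the means, standard deviations, and sample sizes below are made-up placeholders rather than VitalDB or GPT-4o values.

```python
import math


def ci95(mean, sd, n):
    """Normal-approximation 95% confidence interval for a sample mean."""
    half_width = 1.96 * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def percent_true(flags):
    """Percentage of parameters that met a criterion."""
    return 100.0 * sum(flags) / len(flags)


# Placeholder summary statistics for one continuous parameter (hypothetical values).
real_ci = ci95(mean=57.0, sd=15.0, n=6000)
synthetic_ci = ci95(mean=57.2, sd=14.5, n=6166)
print(intervals_overlap(real_ci, synthetic_ci))

# Reproducing the arithmetic behind the reported shares.
print(round(percent_true([True] * 6 + [False]), 2))    # 6 of 7 continuous parameters
print(round(percent_true([True] * 12 + [False]), 2))   # 12 of 13 parameters overall
```

Overlapping intervals are a coarser criterion than a two-sample t-test, which is presumably why the study reports both.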
-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.
-Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red-team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.
-RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or in providing multi-factor scoring. By training annotators to align with values and using a multi-person fitting approach, the quality of feedback can be improved.
-Compliance: All Large Language Model (LLM) data is collected with proper authorization.
-Quality: Multiple rounds of quality inspection ensure high-quality data output.
-Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
-Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has been successfully applied to nearly 5,000 projects.
3. About Nexdata
Nexdata is equipped with professional data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet Large Language Model (LLM) data collection requirements in various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services for speech, image, video, point cloud and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.88M CoT rationales extracted across 1,060 tasks - https://arxiv.org/abs/2305.14045
From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning; how can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The NewsMediaBias-Plus dataset is designed for the analysis of media bias and disinformation by combining textual and visual data from news articles. It aims to support research in detecting, categorizing, and understanding biased reporting in media outlets.
NewsMediaBias-Plus pairs news articles with relevant images and annotations indicating perceived biases and the reliability of the content. It adds a multimodal dimension for bias detection in news media.
unique_id: Unique identifier for each news item. Each unique_id matches an image for the same article.
outlet: The publisher of the article.
headline: The headline of the article.
article_text: The full content of the news article.
image_description: Description of the paired image.
image: The file path of the associated image.
date_published: The date the article was published.
source_url: The original URL of the article.
canonical_link: The canonical URL of the article.
new_categories: Categories assigned to the article.
news_categories_confidence_scores: Confidence scores for each category.
text_label: Indicates the likelihood of the article being disinformation:
    Likely: Likely to be disinformation.
    Unlikely: Unlikely to be disinformation.
multimodal_label: Indicates the likelihood of disinformation from the combination of the text snippet and image content:
    Likely: Likely to be disinformation.
    Unlikely: Unlikely to be disinformation.
Load the dataset into Python:
from datasets import load_dataset
ds = load_dataset("vector-institute/newsmediabias-plus")
print(ds) # View structure and splits
print(ds['train'][0]) # Access the first record of the train split
print(ds['train'][:5]) # Access the first five records
from datasets import load_dataset
# Load the dataset in streaming mode
streamed_dataset = load_dataset("vector-institute/newsmediabias-plus", streaming=True)
# Get an iterable dataset
dataset_iterable = streamed_dataset['train'].take(5)
# Print the records
for record in dataset_iterable:
    print(record)
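Once records are loaded, the two label fields described above can be compared directly. The sketch below is illustrative only: it uses a few made-up in-memory records that carry just unique_id, text_label, and multimodal_label (the identifiers are hypothetical), so it runs without downloading the dataset.

```python
from collections import Counter

# Made-up records; real records carry the full set of fields listed above.
records = [
    {"unique_id": "a1", "text_label": "Likely", "multimodal_label": "Likely"},
    {"unique_id": "a2", "text_label": "Unlikely", "multimodal_label": "Likely"},
    {"unique_id": "a3", "text_label": "Unlikely", "multimodal_label": "Unlikely"},
]

# Distribution of the text-only label.
text_counts = Counter(r["text_label"] for r in records)

# Items where adding the image changes the assessment.
disagreements = [
    r["unique_id"] for r in records if r["text_label"] != r["multimodal_label"]
]

print(text_counts)
print(disagreements)
```

The same two expressions should work on an iterable of streamed records, since they only read dictionary keys.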
Contributions are welcome! To contribute, fork the repository and create a pull request with your changes.
This dataset is released under a non-commercial license. See the LICENSE file for more details.
Please cite the dataset using this BibTeX entry:
@misc{vector_institute_2024_newsmediabias_plus,
title={NewsMediaBias-Plus: A Multimodal Dataset for Analyzing Media Bias},
author={Vector Institute Research Team},
year={2024},
url={https://huggingface.co/datasets/vector-institute/newsmediabias-plus}
}
For questions or support, contact Shaina Raza at: shaina.raza@vectorinstitute.ai
Disclaimer: The labels Likely and Unlikely are based on LLM annotations and expert assessments, intended for informational use only. They should not be considered final judgments.
Guidance: This dataset is for research purposes. Cross-reference findings with other reliable sources before drawing conclusions. The dataset aims to encourage critical thinking, not provide definitive classifications.
GPTFuzzer is a project that explores red teaming of large language models (LLMs) using auto-generated jailbreak prompts.
Project Overview: GPTFuzzer aims to assess the security and robustness of LLMs by crafting prompts that can potentially lead to harmful or unintended behavior.
The project focuses on GPT-3 and similar models.
Datasets:
The datasets used in GPTFuzzer include:
Harmful Questions: Sampled from public datasets like llm-jailbreak-study and hh-rlhf.
Human-Written Templates: Collected from llm-jailbreak-study.
Responses: Gathered by querying models like Vicuna-7B, ChatGPT, and Llama-2-7B-chat.
Models:
The judgment model is a fine-tuned RoBERTa-large model. The training code and data are available in the repository.
During fuzzing experiments, the model is automatically downloaded and cached.
Updates:
The project has received recognition and awards at conferences like Geekcon 2023. The team continues to improve the codebase and aims to build a general black-box fuzzing framework for LLMs.
Source: Conversation with Bing, 3/17/2024 (1) sherdencooper/GPTFuzz: Official repo for GPTFUZZER - GitHub. https://github.com/sherdencooper/GPTFuzz. (2) GPTFUZZER : Red Teaming Large Language Models with Auto ... - GitHub. https://github.com/sherdencooper/GPTFuzz/blob/master/README.md.
LLM Dataset - Prompts and Generated Texts
The dataset contains prompts and texts generated by Large Language Models (LLMs) in 32 different languages. The prompts are short sentences or phrases for the model to generate text. The texts generated by the LLM are responses to these prompts and can vary in length and complexity.
Researchers and developers can use this dataset to train and fine-tune their own language models for multilingual applications. The dataset provides a rich and diverse collection of outputs from the model, demonstrating its ability to generate coherent and contextually relevant text in multiple languages.
💴 For Commercial Usage: The full version of the dataset includes 4,000,000 logs generated in 32 languages with different types of LLM, including Uncensored GPT. Leave a request on TrainingData to buy the dataset.
Models used for text generation: GPT-3.5, GPT-4
Languages in the dataset: Arabic, Azerbaijani, Catalan, Chinese, Czech, Danish, German, Greek, English, Esperanto, Spanish, Persian, Finnish, French, Irish, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malayalam, Marathi, Dutch, Polish, Portuguese, Portuguese (Brazil), Slovak, Swedish, Thai, Turkish, Ukrainian
Content
The CSV file includes the following data:
from_language: language the prompt is written in
model: type of the model (GPT-3.5, GPT-4 and Uncensored GPT Version)
time: time when the answer was generated
text: user prompt
response: response generated by the model
💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset.
TrainingData provides high-quality data annotation tailored to your needs.
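A CSV with the columns listed above (from_language, model, time, text, response) could be turned into prompt/response pairs along these lines. This is a sketch under assumptions: the sample rows, the language codes, and the timestamp format are invented for illustration, and the data is read from an in-memory string rather than the real file.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only the column names come from the dataset description.
sample_csv = io.StringIO(
    "from_language,model,time,text,response\n"
    "en,GPT-4,2023-05-01 10:00:00,What is a log?,A log is a record of events.\n"
    "de,GPT-3.5,2023-05-01 10:05:00,Hallo,Hallo! Wie kann ich helfen?\n"
    "en,GPT-3.5,2023-05-02 09:30:00,Name a color,Blue\n"
)

rows = list(csv.DictReader(sample_csv))

# Instruction-tuning style pairs: the user prompt and the model's response.
pairs = [{"prompt": row["text"], "completion": row["response"]} for row in rows]

# How many logs each model contributed.
per_model = Counter(row["model"] for row in rows)

print(pairs[0])
print(per_model)
```

For the real file, replacing the `io.StringIO` object with `open(path, newline="", encoding="utf-8")` should be enough, since `csv.DictReader` handles quoted fields that contain commas.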
CC-BY-NC
Original Data Source: LLM Question-Answer Dataset
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
A comprehensive dataset covering over 1 million stores in the US and Canada, designed for training and optimizing retrieval-augmented generation (RAG) models and other AI/ML systems. This dataset includes highly detailed, structured information such as:
Menus: Restaurant menus with item descriptions, categories, and modifiers.
Inventory: Grocery and retail product availability, SKUs, and detailed attributes like sizes, flavors, and variations.
Pricing: Real-time and historical pricing data for dynamic pricing strategies and recommendations.
Availability: Real-time stock status and fulfillment details for grocery, restaurant, and retail items.
Applications:
Retrieval-Augmented Generation (RAG): Train AI models to retrieve and generate contextually relevant information.
Search Optimization: Build advanced, accurate search and recommendation engines.
Personalization: Enable personalized shopping, ordering, and discovery experiences in apps.
Data-Driven Insights: Develop AI systems for pricing analysis, consumer behavior studies, and logistics optimization.
This dataset empowers businesses in marketplaces, grocery apps, delivery services, and retail platforms to scale their AI solutions with precision and reliability.
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.
Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.
We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.
CNCF QA Dataset for LLM Tuning
Description
This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed into the LLM model. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.
Dataset Summary
Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset in 15 categories, covering a wide range of topics from fiction to religion. This dataset consists of over 55,000 sentence-level samples.
To train the CLASP model, we created this dataset based on the Brown Corpus. The synthetic speech was generated using the NVIDIA Tacotron 2 text-to-speech model.
For more information about our proposed model, please refer to this paper. The dataset generation pipeline, along with code and usage instructions, is available on this GitHub page.
Dataset Statistics
Total size: Approximately 30 GB.
Number of samples: 55,173 pairs of speech and text.
Average tokens per sample: 19.00.
Maximum tokens in a sample: 48.
Average characters per sample: 96.72.
Number of unique tokens: 50,667
Categories: 15 categories consist of adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction.
Dataset Structure
To ensure ease of use, the dataset is partitioned into 10 parts. Each part can be used independently if it meets the requirements of your task and model.
Metadata Files
global_metadata: A JSON file containing metadata for all 55,173 samples.
localized_metadata: A JSON file containing metadata for all samples, categorized into the 10 dataset partitions.
Metadata Fields
id: The unique identifier for the sample.
audio_file_path: The file path for the audio in the dataset.
category: The category of the sample's text.
text: The corresponding text of the audio file.
Usage Instructions
To use this dataset, download the parts and metadata files as follows:
Option 1: Manual Download
Visit the dataset repository and download all dataset_partX.zip files and the global_metadata.json file.
Option 2: Programmatic Download
Use the huggingface_hub library to download the files programmatically:
from huggingface_hub import hf_hub_download
from zipfile import ZipFile
import json

REPO_ID = "llm-lab/SpeechBrown"

# Download each of the 10 dataset parts and extract it next to the script.
# hf_hub_download returns the local path of the downloaded (cached) file.
for i in range(1, 11):
    zip_file_path = hf_hub_download(repo_id=REPO_ID, filename=f"dataset_part{i}.zip", repo_type="dataset")
    with ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(f'dataset_part{i}')

# Download the metadata and load it from the returned path
metadata_file_path = hf_hub_download(repo_id=REPO_ID, filename="global_metadata.json", repo_type="dataset")
with open(metadata_file_path, 'r') as f:
    metadata = json.load(f)
print(metadata.keys())
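The metadata fields listed earlier (id, audio_file_path, category, text) lend themselves to simple per-category statistics. The sketch below assumes the metadata has already been flattened into a list of dictionaries, which may not match the actual layout of global_metadata.json; the sample records, file paths, and whitespace tokenization are all illustrative assumptions.

```python
from collections import defaultdict

# Made-up records using the documented field names (paths and ids are hypothetical).
samples = [
    {"id": "s1", "audio_file_path": "dataset_part1/s1.wav", "category": "news",
     "text": "The council met on Monday"},
    {"id": "s2", "audio_file_path": "dataset_part1/s2.wav", "category": "news",
     "text": "Prices rose again"},
    {"id": "s3", "audio_file_path": "dataset_part2/s3.wav", "category": "fiction",
     "text": "She closed the door quietly"},
]


def average_tokens_by_category(items):
    """Average whitespace-token count of the text field, per category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [token_count, sample_count]
    for item in items:
        entry = totals[item["category"]]
        entry[0] += len(item["text"].split())
        entry[1] += 1
    return {cat: tokens / count for cat, (tokens, count) in totals.items()}


averages = average_tokens_by_category(samples)
print(averages)
```

Whitespace splitting is only a rough proxy; the token statistics reported for the dataset may have been computed with a different tokenizer.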
Citations
If you find our paper, code, data, or models useful, please cite the paper:
@misc{abootorabi2024claspcontrastivelanguagespeechpretraining,
title={CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval},
author={Mohammad Mahdi Abootorabi and Ehsaneddin Asgari},
year={2024},
eprint={2412.13071},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.13071}
}
Contact
If you have questions, please email mahdi.abootorabi2@gmail.com or asgari@berkeley.edu.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This synthetic customer purchase dataset has been created as an educational resource for data science, machine learning, and retail analytics applications. The data focuses on key consumer purchase behaviours, including demographic information, product details, purchase history, and payment methods. It is designed to help users practice data manipulation, analysis, and predictive modelling in the context of retail and e-commerce.
This dataset is useful for a variety of applications, including:
This dataset is synthetic and anonymized, making it a safe tool for experimentation and learning without compromising any real customer data.
CC0 (Public Domain)
Data science enthusiasts: For learning and practising retail data analysis, customer segmentation, and predictive modelling.
Researchers and educators: For academic studies or teaching purposes in retail analytics and consumer behaviour.
Marketing professionals: For analyzing purchasing patterns and designing targeted promotional campaigns.
Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over **********-megawatt hours of energy simply for training. As this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of *** Germans in 2022. While not a staggering amount, it is a considerable use of energy.
Energy savings through AI
While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by minute numbers might save hours on shipment, liters of fuel, or dozens of computations. Each of these uses energy as well, and the sum of energy saved through an LLM might vastly exceed its energy cost. A good example is mobile phone operators, of which a ***** expect that AI might reduce power consumption by *** to ******* percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.
Emissions are considerable
The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly *** tonnes of CO2. This could change radically depending on the type of energy production behind the emissions. Most data center operators, for instance, would prefer to have nuclear energy, a notably low-emission energy source, play a key role.