https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for ReactiveAI/TinyStories-MRL
Synthetic Memory Reinforcement Learning dataset for Proof-of-Concept Reactive Transformer models. Dataset is divided into subsets, used in different Curriculum Stage of MRL training - each subset have different number of follow-up interactions, could use different strategy, and have train and validation splits.
After first experiments with MRL, we decided to abandon single step and two steps stages. That's because with single step… See the full description on the dataset page: https://huggingface.co/datasets/ReactiveAI/TinyStories-MRL.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
TinyStories Dataset README
Overview
This dataset is based on TinyStories and includes structured JSON data with corresponding annotations, designed for research in controllable story generation and related tasks.
Dataset Structure
Each data item contains the following fields:
1. conversations
Type: List Purpose: Contains the JSON of the story from: Always set to "human". value: Structured data containing entities, events, story structures and… See the full description on the dataset page: https://huggingface.co/datasets/guodaosun/tale-frame.
https://choosealicense.com/licenses/llama3/https://choosealicense.com/licenses/llama3/
GradedStories
GradedStories is a synthetically-augmented dataset, created by evaluating the quality of 2.7M children stories from the TinyStories dataset. The evaluation process was done using Llama-3-8B-Instruct. The dataset includes the original stories, the generated evaluations & an assigned grade (out of 10) for each story’s structure as well as a grade for the story's adherence to common sense reasoning. The added information about each sample’s potential quality (evaluations… See the full description on the dataset page: https://huggingface.co/datasets/AB057/GradedStories.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.