jacobmorrison/sciriff-license-filtered-final-commercial dataset hosted on Hugging Face and contributed by the HF Datasets community
jacobmorrison/sciriff-filtered-licenses dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Data Provenance Initiative - Commercial-Licenses
Legal Disclaimer / Notice
Collected License Information is NOT Legal Advice. It is important to note that we collect self-reported licenses from the papers and repositories that released these datasets, and categorize them according to our best efforts, as a volunteer research and transparency initiative. The information provided by any of our works and any outputs of the Data Provenance Initiative do not, and… See the full description on the dataset page: https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
License: CC-BY-NC-4.0 · Tasks: Text Generation, Reasoning · Language: English (en) · Size: 1M < n < 10M · Hosted on Hugging Face

You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset:
- 18.29% of the questions do not have a reference answer.
- 9.71% of the questions have a single-word answer.
- 21.58% of the questions have a short answer.
- 50.42% of the questions have a long-form reference answer.
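A minimal sketch of reproducing this breakdown, assuming the subset exposes a reference_answer field (the field name follows the card's description but should be checked against the actual schema):

```python
from datasets import load_dataset

# Field name "reference_answer" is assumed from the card's description; adjust if the schema differs.
ds = load_dataset("facebook/natural_reasoning", split="train")

no_reference = ds.filter(lambda x: not x["reference_answer"])
single_word = ds.filter(lambda x: x["reference_answer"] and len(x["reference_answer"].split()) == 1)

print(f"No reference answer: {len(no_reference) / len(ds):.2%}")
print(f"Single-word answers: {len(single_word) / len(ds):.2%}")
```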
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When used to train Llama3.1-8B-Instruct, it achieved better average performance across three key benchmarks: MATH, GPQA, and MMLU-Pro.
![Scaling Curve](https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg)
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
Stack V2 Edu
Description
We filter the Stack V2 to only include code from openly licensed repositories, based on the license detection performed by the creators of Stack V2. When multiple licenses are detected in a single repository, we ensure that all of the licenses are on the Blue Oak Council certified license list. Per-document license information is available in the license entry of the metadata field of each example. Code for collecting, processing, and preparing… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/stackv2_edu_filtered.
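A minimal sketch of reading the per-document license entry; streaming avoids downloading the full corpus, and the exact layout of the metadata field is an assumption based on the card's description:

```python
import json
from datasets import load_dataset

# Stream so we can peek at examples without downloading everything.
ds = load_dataset("common-pile/stackv2_edu_filtered", split="train", streaming=True)

for example in ds:
    meta = example["metadata"]
    if isinstance(meta, str):  # metadata may be stored as a JSON string; parse if so
        meta = json.loads(meta)
    print(meta.get("license"))  # per-document license entry described in the card
    break
```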
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Megalith-CC0
A CC0-filtered version of the Megalith-10m dataset. The images have also been persisted to an independent public S3 bucket, supported by the AWS Open Data Registry program, for durability.
Why filter by CC0?
The images in Megalith-10m, having been gathered from Flickr, carry CC0 or public-domain licenses. However, it is not always clear whether users who assign a public-domain license to their works understand its implications.… See the full description on the dataset page: https://huggingface.co/datasets/Spawning/megalith-cc0.
https://huggingface.co/datasets/JeanKaddour/minipile
The MiniPile Challenge for Data-Efficient Language Models
MiniPile is a 6GB subset of the deduplicated Pile corpus. To curate MiniPile, we perform a simple three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters.
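A toy illustration of this cluster-and-filter idea (not the actual MiniPile pipeline; the embedding model and the clusters dropped here are placeholders, with the real choices described in the paper):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["a high-quality document", "another useful document", "spammy boilerplate text", "more boilerplate"]

# (1) Infer embeddings for all documents (model choice is illustrative only).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# (2) Cluster the embedding space with k-means.
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)

# (3) Filter out clusters judged low quality (hand-picked here for illustration).
low_quality_clusters = {1}
kept = [doc for doc, cluster in zip(docs, kmeans.labels_) if cluster not in low_quality_clusters]
print(f"Kept {len(kept)} of {len(docs)} documents")
```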
The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of the data used to train contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.
More details on the MiniPile curation procedure and some pre-training results can be found in the MiniPile paper.
For more details on the Pile corpus, we refer the reader to the Pile datasheet.
English (EN)
MiniPile is a subset of the Pile, curated by Jean Kaddour. The Pile was created by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy.
Since MiniPile is a subset of the Pile, the same MIT License holds.
@article{kaddour2023minipile,
title={The MiniPile Challenge for Data-Efficient Language Models},
author={Kaddour, Jean},
journal={arXiv preprint arXiv:2304.08442},
year={2023}
}
@article{gao2020pile,
title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
~10,000 professional retail scene photographs from UK grocery stores for computer vision research
| Attribute | Details |
|---|---|
| Total Images | ~10,000 high-resolution photos |
| Markets | United Kingdom |
| Collections | 2014 archive, Full store surveys, Halloween 2024 |
| Privacy | All faces automatically blurred |
| License | Evaluation & Research Only |
| Format | JPEG with comprehensive metadata |
This dataset is perfect for the applications listed later in this card.
👉 Request access: https://huggingface.co/datasets/dresserman/kanops-open-access-imagery
This dataset is gated - request access on HuggingFace. By requesting access, you agree to the evaluation-only license terms.
from datasets import load_dataset
# Load the dataset (after getting HuggingFace access)
ds = load_dataset(
"imagefolder",
data_dir="hf://datasets/dresserman/kanops-open-access-imagery/train",
split="train",
)
# Access first image
img = ds[0]["image"] # PIL.Image
img.show()
import pandas as pd
meta = pd.read_csv(
"hf://datasets/dresserman/kanops-open-access-imagery/metadata.csv"
)
print(meta.head())
train/
├── 2014/
│ ├── Aldi/
│ ├── Tesco/
│ ├── Sainsburys/
│ └── ... (22 UK retailers)
├── FullStores/
│ ├── Tesco_Lincoln_2014/
│ ├── Tesco_Express_2015/
│ └── Asda_Leeds_2016/
└── Halloween2024/
└── Various_Retailers/
Root files:
├── MANIFEST.csv # File listing + basic attributes
├── metadata.csv # Enriched metadata (retailer, dims, collection)
├── checksums.sha256 # Integrity verification
├── blur_log.csv # Face-blur verification log
└── LICENSE # Evaluation-only terms
Each image includes comprehensive metadata in metadata.csv:
| Field | Description |
|---|---|
| file_name | Path relative to dataset root |
| bytes | File size in bytes |
| width, height | Image dimensions |
| sha256 | Content hash for integrity verification |
| collection | One of: 2014, FullStores, Halloween2024 |
| retailer | Inferred from file path |
| year | Inferred from file path |
License: Evaluation & Research Only
For commercial licensing: Contact happytohelp@groceryinsight.com
This free sample is part of Kanops Archive - a much larger commercial dataset used by AI companies and research institutions.
Applications:
- Training production computer vision models
- Autonomous checkout systems
- Retail robotics and automation
- Seasonal demand forecasting
- Market research and competitive intelligence
Learn more: [groceryinsight.com/retail-image-dataset](...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
trpfrog-icons Dataset
This is a dataset of TrpFrog's icons. By the way, what do you use this for? 🤔
How to use
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("TrpFrog/trpfrog-icons")

# Iterate over the training split
for data in dataset["train"]:
    print(data)

# Keep only the examples with label 0
dataset = dataset.filter(lambda x: x["label"] == 0)
License
MIT License
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset, TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding, provides a comprehensive collection of 122K Alpaca-style instructions with their associated input, text and output for word-level classification. It enables natural language understanding research to be done conveniently, as it contains entries from diverse areas such as programming code instructions and gaming instructions written at varying levels of complexity. With the help of this dataset, developers aiming to apply natural language processing techniques for machines may gain insight into how to improve the accuracy and facilitate the comprehension of human language commands. Using this dataset, one may develop advanced algorithms such as neural networks or decision trees that can quickly understand commands in foreign languages and bridge the gap between machines and humans for different practical purposes.
This dataset contains 122k Alpaca-Style Instructions with their corresponding input, text, and output for word-level classification. It is a valuable resource to those who wish to gain insight into natural language understanding through data science approaches. This guide will provide some tips on how to use this dataset in order to maximize accuracy and gain a better understanding of natural language.
Preprocessing: Cleaning the data is an essential step when dealing with any sort of text data which includes the Alpaca instructions dataset. This involves removing stopwords like articles, pronouns, etc., normalizing words such as capitalization or lemmatization, filtering for relevant terms based on context or key problems you are trying to solve; and finally tokenizing the remaining text into appropriate individual pieces that can be provided as input features for different models – SentencePiece is perfect for this sort of task.
Feature extraction: After preprocessing your text data it’s time to extract insightful features from it using techniques like Bag-of-Words (BoW) or a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which can help you better understand the context behind each instruction sentence/word within the corpus. Additionally, embedding techniques such as word2vec/GloVe can prove useful for extracting semantic information from these instructions and for building classifiers that predict word-level categories (semantic segmentation). A small sketch follows these steps.
Model selection: Depending on your problem setup, architectures like Support Vector Machines (SVMs), Conditional Random Fields (CRFs), or attention-based models should work well for these kinds of NLP tasks at both the sentence and shallow-representation level (e.g., part-of-speech tagging). If learning which words are used together matters more than anything else, an RNN model such as an LSTM or GRU might do wonders; its recurrent structure lets it store context information more effectively than the separate BoW or TF-IDF vector spaces built up during feature engineering for each individual supervised training task.
Evaluating results: After choosing the best-fitting model, performance measures such as F1 scores make it easier to track results against the end goal and to make adjustments if precision/recall drops significantly below acceptable thresholds. Holding out an uncategorized sample of documents, alongside the larger train/test splits, helps confirm that results generalize beyond the data the model was trained on.
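A small sketch of the feature-extraction-to-evaluation flow described above, using scikit-learn on toy data that stands in for the dataset's instruction text and word-level labels (all names and labels here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy instructions and labels standing in for the real columns.
texts = ["sort this list of numbers", "translate the sentence into French",
         "write a python function to reverse a string", "summarise the following paragraph"]
labels = ["code", "language", "code", "language"]

# Feature extraction: TF-IDF vectors over the instruction text.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

# A simple baseline; swap in SVMs, CRFs, or attention-based models as discussed above.
clf = LogisticRegression().fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```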
- Developing an AI-based algorithm capable of accurately understanding the meaning of natural language instructions.
- Using this dataset for training and testing machine learning models to classify specific words and phrases within natural language instructions.
- Training a deep learning model to generate visual components based on the given input, text, and output values from this dataset
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Univer...
https://choosealicense.com/licenses/cc/
FineVideo
Contents:
- FineVideo Description
- Dataset Explorer
- Revisions
- Dataset Distribution
- How to download and use FineVideo (using datasets, using huggingface_hub, loading a subset of the dataset)
- Dataset Structure (data instances, data fields)
- Dataset Creation, License (CC-By), Considerations for Using the Data (social impact of dataset, discussion of biases)
- Additional Information (credits, future work, opting out of FineVideo, citation information)
Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.
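A minimal sketch of loading FineVideo with streaming; the dataset is gated, so the terms on its page may need to be accepted first, and the split name train is assumed:

```python
from datasets import load_dataset

# Streaming avoids downloading the full video dataset up front.
ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # inspect the available fields rather than assuming their names
```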
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This Synthia-v1.3 dataset provides insight into the complexities of human-machine communication through its collection of dialogue interactions between humans and machines. Contained within this dataset are details on how conversations develop between the two, detailing behavioural changes in both humans and machines towards one another over time. With information provided on both user instructions to machines, as well as the system, machine responses and other related data points, this dataset offers a detailed overview of machine learning concepts, examining how systems utilise dialogue to interact with people in various scenarios. This can offer valuable insight into how predictive intelligence is applied by these systems in conversational settings, better informing developers seeking to build their own human-machine interfaces for effective two-way communication. Looking at this dataset as a whole builds an understanding of the way connections form between humans and machines, providing a deeper appreciation of the ongoing challenges faced when working on projects with these technological components at play.
The dataset consists of a collection of dialogue interactions between humans and machines, providing insight into human-machine communication. It includes information about the system being used, instructions given by humans to machines and responses from machines.
To start using this data set:
- Download the CSV file containing all of the dialogue interactions from the Kaggle datasets page.
- Open up your favourite spreadsheet software like Excel or Google Sheets and load up the CSV file.
- Take a look at each of the columns listed in order to familiarize yourself with what they contain: the ‘system’ column contains details about what system was used for role play between human and machine; the ‘instruction’ column contains instructions given by humans to machines; the ‘response’ column contains responses from machines back to humans based on their instructions.
- Start exploring how conversations progress between humans and machines over time by examining information in each of these columns separately or together as required. You can also filter for specific conditions within your data set, such as searching for conversations driven entirely by particular systems or involving certain instruction types. In addition, you can conduct various kinds of analysis, such as statistical analysis (e.g., descriptive statistics or correlation analysis). With so many possibilities for exploration, you are sure to find something interesting!
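A small pandas sketch of this kind of exploration, assuming the CSV has been saved locally as train.csv with the system/instruction/response columns described above:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# How often does each system prompt appear?
print(df["system"].value_counts().head())

# Filter to conversations driven by a particular kind of system (the keyword is illustrative).
subset = df[df["system"].str.contains("assistant", case=False, na=False)]
print(f"{len(subset)} rows match")
```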
- Utilizing the dataset to understand how various types of instruction styles can influence conversation order and flow between humans and machines.
- Using the data to predict potential responses in a given dialogue interaction from varying sources, such as robots or virtual assistants.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| system | The type of system used in the dialogue interaction. (String) |
| instruction | The instruction given by the human to the machine. (String) |
| response | The response given by the machine to the human. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
We introduce Social IQa: Social Interaction QA, a new question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like "Jesse saw a concert" and a question like "Why did Jesse do this?", humans can easily infer that Jesse wanted "to see their favorite performer" or "to enjoy the music", and not "to see what's happening inside" or "to see if it works". The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations.
This dataset can be used to train and test models for social inquiry question answering. The questions and answers in the dataset have been annotated by experts, and the dataset has been verified for accuracy.
- The dataset can be used to train a model to answer questions about social topics.
- The dataset can be used to improve question-answering systems for social inquiry.
- The dataset can be used to generate new questions about social topics
Huggingface Hub: link
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:-------------------------------------------------------|
| context | The context of the question. (String) |
| answerA | One of the possible answers to the question. (String) |
| answerB | One of the possible answers to the question. (String) |
| answerC | One of the possible answers to the question. (String) |
| label | The correct answer to the question. (String) |
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------------------------|
| context | The context of the question. (String) |
| answerA | One of the possible answers to the question. (String) |
| answerB | One of the possible answers to the question. (String) |
| answerC | One of the possible answers to the question. (String) |
| label | The correct answer to the question. (String) |
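A minimal sketch of working with the validation split, assuming it has been downloaded as validation.csv; the label encoding (1/2/3 for answerA/B/C) is an assumption and should be verified against the data:

```python
import pandas as pd

val = pd.read_csv("validation.csv")

# Trivial baseline: always predict answerA and score it against the label column.
preds = pd.Series(["1"] * len(val))
accuracy = (preds == val["label"].astype(str)).mean()
print(f"Always-answerA baseline accuracy: {accuracy:.2%}")
```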
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With columns labeled article and abstract (each potentially repeated to allow detailed analysis or multiple variations, e.g., different proposed summaries, if required by users), this dataset provides significant flexibility for developing robust models that summarize complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: Each file consists of two columns, article and abstract, where article holds the full text of a scientific paper and abstract contains its summary.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
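A minimal sketch of computing ROUGE with the Hugging Face evaluate package (the rouge_score backend must be installed; the prediction and reference strings are placeholders):

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the paper proposes a new attention mechanism for long documents"]
references = ["this paper introduces a novel attention mechanism for summarizing long scientific documents"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum overlap scores
```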
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Amazon Reviews Polarity Dataset discloses eighteen years of customers' ratings and reviews from Amazon.com, offering an unparalleled trove of insight and knowledge. Drawing from the immense pool of over 35 million customer reviews, this dataset presents a broad spectrum of customer opinions on products they have bought or used. This invaluable data is a gold mine for improving products and services as it contains comprehensive information regarding customers' experiences with a product including ratings, titles, and plaintext content. At the same time, this dataset contains both customer-specific data along with product information which encourages deep analytics that could lead to great advances in providing tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want - explore now!
1. Analyze customer ratings to identify trends: Take a look at how many customers have rated the same product or service with the same score (e.g., 4 stars). You can use this information to identify what customers like or don’t like by examining common sentiment throughout the reviews. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates.
2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product/service. Natural language processing tools such as Word2Vec, Latent Dirichlet Allocation (LDA), or even simple keyword-search algorithms can quickly reveal the general topics discussed in relation to your product/service across multiple reviews, letting you quickly pinpoint areas that may need improvement for particular items within your lines of business.
3. Track associated scores over time: By tracking customer ratings over time, you may be able to better understand when there has been an issue with something specific related to your product/service, such as a negative response toward a feature that was introduced, proved unpopular, and was removed shortly after introduction. This can save time and money by identifying issues before they become widespread concerns among the larger set of consumers who spend money on your company's items.
4. Visualize sentiment data over time: Visualizations such as bar graphs can help identify trends across categories more quickly than raw numbers alone; combining numeric values with colour coding for different scores makes it easier to spot anomalies and to work out why certain spikes occurred while other data points stayed stable (or vice versa) when comparing similar data points in time-series visualizations.
- Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
- Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
- Training a machine learning model to accurately predict customers’ ratings on new products they have not yet tried and leverage this for further product development optimization initiatives
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------|
| label | The sentiment of the review, either positive or negative. (String) |
| title | The title of the review. (String) ...
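A small pandas sketch of a first look at train.csv; the label and title columns follow the table above, while the review-body column name (text) is an assumption because the column listing is truncated:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Distribution of positive vs. negative reviews.
print(df["label"].value_counts(normalize=True))

# Average review length per sentiment, as a quick content check (column name "text" is assumed).
df["review_length"] = df["text"].astype(str).str.split().str.len()
print(df.groupby("label")["review_length"].mean())
```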
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!
Data Dictionary
The MetaMathQA dataset contains three columns: response, type, and query.
- Response: the response to the query given by the question answering system. (String)
- Type: the type of query provided as input to the system. (String)
- Query: the question posed to the system for which a response is required. (String)
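A small pandas sketch of exploring these columns, assuming the data has been exported locally as train.csv:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# How many examples of each query type are there?
print(df["type"].value_counts())

# Peek at one query/response pair.
print(df.loc[0, "query"])
print(df.loc[0, "response"])
```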
Preparing data for analysis
Before you dive into analysis, it’s important to first familiarize yourself with the kind of data values present in each column and to check whether any preprocessing needs to be done on them, such as removing unwanted characters or filling in missing values, so that the data can be used without any issue while training or testing your model further down in your process flow.
##### Training Models using Mistral 7B
Mistral 7B is an open-source framework designed for building machine learning models quickly and easily from tabular (CSV) datasets such as this 'MetaMathQA' dataset. After collecting and preprocessing your dataset accordingly, Mistral 7B provides support for various machine learning algorithms such as Support Vector Machines (SVM), logistic regression, and decision trees, allowing you to select these algorithms from various popular libraries along with powerful hyperparameter optimization techniques. After selecting an algorithm configuration, it is good practice to use GridSearchCV and RandomSearchCV to tune it further during the model-building stage. Once models are selected, you can validate their performance through metrics such as accuracy, F1, precision, and recall.
##### Testing phase:
After successfully completing the building phase, robustly test the selected models on the evaluation metrics mentioned above. At this stage you make predictions with the trained model on new test cases, including ones presented by domain experts, run quality-assurance checks against the baseline metrics, and assess confidence in the results, updating baseline scores and running further experiments as needed. The core advantage of this workflow is keeping the impact of relevance and inexactness errors low.
- Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
- Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
- Optimizing search algorithms that surface relevant answer results based on types of queries
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------|
| response | The response to the query. (String) |
| type | The type of query. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
https://choosealicense.com/licenses/odc-by/
📀 Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in-line or better than models trained on curated datasets, while only relying on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
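A minimal sketch of sampling RefinedWeb with streaming, since the full dataset is far too large to download casually; the text field name content is taken from the dataset card and should be verified:

```python
from datasets import load_dataset

# Stream rather than download the multi-terabyte dataset.
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for example in ds:
    print(example["content"][:200])  # "content" is assumed to be the text field
    break
```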
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains 100,000 feedback responses from GPT-4 AI models along with rubrics designed to evaluate both absolute and ranking scores. Each response is collected through a comprehensive evaluation process that takes into account the model's feedback, instruction, criteria for scoring, referenced answer and input given. This data provides researchers and developers with valuable insights into the performance of their AI models on various tasks as well as the ability to compare them against one another using precise and accurate measures. Each response is accompanied by five descriptive scores that give a detailed overview of its quality in terms of relevance to the input given, accuracy in reference to the reference answer provided, coherence between different parts of the output such as grammar and organization, fluency in expression of ideas without errors or unnecessary repetitions, and overall productivity accounting for all other factors combined. With this dataset at your disposal, you will be able to evaluate each output qualitatively without having to manually inspect every single response
This dataset contains feedback from GPT-4 models, along with associated rubrics for absolute and ranking scoring. It can be used to evaluate the performance of GPT-4 models on different challenging tasks.
In order to use this dataset effectively, it is important to understand the data provided in each column:
- orig_feedback – Feedback given by the original GPT-4 model
- orig_score2_description – Description of the second score given to the original GPT-4 model
- orig_reference_answer – Reference answer used to evaluate the original GPT-4 model
- output – Output from the fine-grained evaluation
- orig_response – Response from the original GPT-4 model
- orig_criteria – Criteria used to evaluate the original GPT-4 model
- orig_instruction – Instruction given to the original GPT-4 model
- orig_score3_description – Description of the third score given to
- Data-driven evaluation of GPT-4 models using the absolute and ranking scores collected from this dataset.
- Training a deep learning model to automate the assessment of GPT-4 responses based on the rubrics provided in this dataset.
- Building a semantic search engine using GPT-4 that is able to identify relevant responses more accurately with the help of this dataset's data collection metrics and rubrics for scoring
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:----------------------------|:-----------------------------------------------------------------|
| orig_feedback | Feedback from the evaluator. (Text) |
| orig_score2_description | Description of the second score given by the evaluator. (Text) |
| orig_reference_answer | Reference answer used to evaluate the model response. (Text) |
| output | Output from the GPT-4 model. (Text) |
| orig_response | Original response from the GPT-4 model. (Text) |
| orig_criteria | Criteria used by the evaluator to rate the response. (Text) |
| orig_instruction | Instructions provided by the evaluator. (Text) |
| orig_score3_description | Description of the third score given by the evaluator. (Text) |
| orig_score5_description | Description of the fifth score given by the evaluator. (Text) |
| orig_score1_description | Description of the first score given by the evaluator. (Text) |
| input | Input given to the evaluation. (Text) |
| orig_score4_description | Description of the fourth score given by the evalua...
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
| Release | Description |
|---|---|
| v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size. |
| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup. |
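A minimal sketch of loading a single-language slice of the Stack with streaming; the data_dir="data/python" layout follows the card's convention, and the dataset's terms of use may need to be accepted on its page first:

```python
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",   # one language subdirectory; adjust as needed
    split="train",
    streaming=True,           # avoid downloading the multi-terabyte dataset
)

for example in ds:
    print(example["content"][:200])
    break
```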
Llama 3.1 Tulu 3 Ultrafeedback (Cleaned) (on-policy 8B)
Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture. It contains prompts from Ai2's cleaned version of Ultrafeedback which removes instances of TruthfulQA. We further filtered this dataset to remove… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-ultrafeedback-cleaned-on-policy-8b.
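A minimal sketch of loading this preference mixture; the split name train is assumed, and field names should be inspected rather than assumed:

```python
from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-ultrafeedback-cleaned-on-policy-8b", split="train")
print(ds)
print(ds[0].keys())  # typically a prompt plus chosen/rejected responses for preference tuning
```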