Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
philschmid/finanical-rag-embedding-dataset
philschmid/finanical-rag-embedding-dataset is a modified fork of virattt/llama-3-8b-financialQA for fine-tuning embedding models using positive text pairs (question, context). The dataset includes 7,000 (question, context) pairs from NVIDIA's 2023 SEC filing report.
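As a rough illustration of how positive (question, context) pairs can drive embedding fine-tuning, the sketch below computes an in-batch-negatives loss: each question's own context is the positive, and the other contexts in the batch act as negatives. The pairs are invented for illustration and are not taken from the dataset, and `embed` is a hypothetical bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

# Toy (question, context) pairs, invented for illustration only.
pairs = [
    ("What was total revenue?",
     "Total revenue for the fiscal year was reported in the income statement."),
    ("Who audited the financial statements?",
     "The financial statements were audited by an independent accounting firm."),
    ("Where is the company headquarters located?",
     "The company headquarters is located in California."),
]

def embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def in_batch_loss(batch, scale=20.0):
    """Mean cross-entropy over the batch: the paired context is the target,
    every other context in the batch serves as a negative."""
    questions = [embed(q) for q, _ in batch]
    contexts = [embed(c) for _, c in batch]
    total = 0.0
    for i, q in enumerate(questions):
        logits = [scale * cosine(q, c) for c in contexts]
        log_sum = math.log(sum(math.exp(x) for x in logits))
        total += log_sum - logits[i]
    return total / len(batch)

loss = in_batch_loss(pairs)
print(f"in-batch loss: {loss:.4f}")
```

A trained embedding model would replace `embed`; the loss shrinks as each question moves closer to its own context than to the other contexts in the batch.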
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Retrieval-Augmented Generation (RAG) Full 20000
Retrieval-Augmented Generation (RAG) Full 20000 is an English dataset designed for RAG-optimized models, built by Neural Bridge AI, and released under Apache license 2.0.
Dataset Description
Dataset Summary
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by allowing them to consult an external authoritative knowledge base before generating responses. This approach significantly boosts… See the full description on the dataset page: https://huggingface.co/datasets/neural-bridge/rag-full-20000.
A comprehensive dataset covering over 1 million stores in the US and Canada, designed for training and optimizing retrieval-augmented generation (RAG) models and other AI/ML systems. This dataset includes highly detailed, structured information such as:
Menus: Restaurant menus with item descriptions, categories, and modifiers.
Inventory: Grocery and retail product availability, SKUs, and detailed attributes like sizes, flavors, and variations.
Pricing: Real-time and historical pricing data for dynamic pricing strategies and recommendations.
Availability: Real-time stock status and fulfillment details for grocery, restaurant, and retail items.
Applications:
Retrieval-Augmented Generation (RAG): Train AI models to retrieve and generate contextually relevant information.
Search Optimization: Build advanced, accurate search and recommendation engines.
Personalization: Enable personalized shopping, ordering, and discovery experiences in apps.
Data-Driven Insights: Develop AI systems for pricing analysis, consumer behavior studies, and logistics optimization.
This dataset empowers businesses in marketplaces, grocery apps, delivery services, and retail platforms to scale their AI solutions with precision and reliability.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Introduction
RAG-Instruct is a RAG dataset designed to comprehensively enhance LLM RAG capabilities, synthesized using GPT-4o. This dataset is based on the Wikipedia corpus and offers the advantages of query-document scenario diversity and task diversity. The RAG-Instruct dataset can significantly enhance the RAG ability of LLMs and make remarkable improvements in RAG performance across various tasks.
Model WQA (acc) PQA (acc)… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct.
This dataset is derived from the Global News Dataset. Please refer to the original source (also cited below) and ensure that your use complies with its terms and conditions.
Webz.io News Dataset Repository
Introduction
Welcome to the Webz.io News Dataset Repository! This repository is created by Webz.io and is dedicated to providing free datasets of publicly available news articles. We release new datasets weekly, each containing around 1,000 news articles focused on… See the full description on the dataset page: https://huggingface.co/datasets/Jerry999/sds-news-rag.
https://choosealicense.com/licenses/odc-by/
Dataset Card for Dataset Name
A Dataset for Evaluating Retrieval-Augmented Generation Across Documents
Dataset Description
MultiHop-RAG is a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
Dataset Sources… See the full description on the dataset page: https://huggingface.co/datasets/yixuantt/MultiHopRAG.
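Because the evidence for each query is spread over several documents, a retriever has to surface all of them, not just one. The sketch below shows one way this could be scored, as the fraction of a query's evidence documents found in the top-k results. The document IDs and the `evidence_ids` field name are hypothetical and do not reflect the dataset's actual schema.

```python
def evidence_recall(retrieved_ids, evidence_ids, k):
    """Fraction of the gold evidence documents that appear in the top-k results."""
    if not evidence_ids:
        raise ValueError("a multi-hop query needs at least one evidence document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(evidence_ids)) / len(evidence_ids)

# Hypothetical query whose evidence is spread over three documents.
query = {"evidence_ids": ["doc_a", "doc_b", "doc_c"]}

# Hypothetical ranked output of a retriever, best match first.
retrieved = ["doc_b", "doc_x", "doc_a", "doc_y", "doc_c"]

recall_at_3 = evidence_recall(retrieved, query["evidence_ids"], k=3)
recall_at_5 = evidence_recall(retrieved, query["evidence_ids"], k=5)
print(recall_at_3, recall_at_5)
```

With a cutoff of 3 only two of the three evidence documents are retrieved, so a reader model would be missing one hop; widening the cutoff to 5 recovers all of them.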
https://www.kappasignal.com/p/legal-disclaimer.html
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.
Historical daily stock prices (open, high, low, close, volume)
Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)
Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)
Feature engineering based on financial data and technical indicators
Sentiment analysis data from social media and news articles
Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
Stock price prediction
Portfolio optimization
Algorithmic trading
Market sentiment analysis
Risk management
Researchers investigating the effectiveness of machine learning in stock market prediction
Analysts developing quantitative trading Buy/Sell strategies
Individuals interested in building their own stock market prediction models
Students learning about machine learning and financial applications
The dataset may include different levels of granularity (e.g., daily, hourly)
Data cleaning and preprocessing are essential before model training
Regular updates are recommended to maintain the accuracy and relevance of the data
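Two of the technical indicators listed above, moving averages and RSI, can be derived from daily closing prices. The following is a minimal sketch on made-up prices; the RSI here uses plain averages of gains and losses rather than Wilder's smoothing, which is a simplifying assumption and may differ from how the dataset's own indicator columns were computed.

```python
def simple_moving_average(values, window):
    """Simple moving average; one output per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def rsi(closes, period):
    """Relative Strength Index over the last `period` price changes,
    using plain averages of gains and losses (no Wilder smoothing)."""
    if period <= 0:
        raise ValueError("period must be positive")
    changes = [later - earlier for earlier, later in zip(closes, closes[1:])]
    if len(changes) < period:
        raise ValueError("not enough prices for the requested period")
    recent = changes[-period:]
    avg_gain = sum(max(c, 0) for c in recent) / period
    avg_loss = sum(max(-c, 0) for c in recent) / period
    if avg_loss == 0:
        return 100.0
    relative_strength = avg_gain / avg_loss
    return 100 - 100 / (1 + relative_strength)

# Made-up closing prices for illustration.
closes = [10, 11, 12, 11, 13]

sma_3 = simple_moving_average(closes, window=3)
rsi_4 = rsi(closes, period=4)
print(sma_3, rsi_4)
```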
The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, the dataset is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD 2.0, the latest version, combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
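Since every answer is a span of the passage, a record can be checked by slicing the context at the stated offset. The sketch below assumes the field layout used by common SQuAD distributions (`context`, `question`, and `answers` with parallel `text` and `answer_start` lists, left empty for unanswerable questions); the example passage and questions are invented.

```python
def answer_spans_match(record):
    """True if every answer text equals the context slice at its start offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start : start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )

def is_answerable(record):
    """Unanswerable questions (SQuAD 2.0 style) carry no answer spans."""
    return len(record["answers"]["text"]) > 0

# Invented passage and questions, for illustration only.
context = "The Eiffel Tower is located in Paris."

answerable = {
    "question": "Where is the Eiffel Tower located?",
    "context": context,
    "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
}

unanswerable = {
    "question": "When was the Louvre built?",
    "context": context,
    "answers": {"text": [], "answer_start": []},
}

print(is_answerable(answerable), answer_spans_match(answerable))
print(is_answerable(unanswerable))
```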
The PubMed Corpus in MedRAG
This HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).
News
(02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.
Dataset Details
Dataset Descriptions
PubMed is the most widely used literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.
Dataset Card for Dataset Name
RAG FOR DIETS
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains four synthetic Engineering Change Order (ECO) documents, authored by me as part of the Google GenAI Capstone Challenge (Q1 2025).
These documents simulate realistic engineering change processes at a fictional wearable technology company named HappyMatrix, which is developing a conceptual product called MatrixSync X100.
Each .txt file captures a different engineering change, ranging from hardware updates and battery improvements to algorithm tuning and sustainable packaging, written in technical yet human-readable language.
File Name | Description |
---|---|
ECO-100001.txt | Enclosure update — added ventilation slots to improve thermal performance |
ECO-100002.txt | Battery replacement — switch from lithium-polymer to solid-state for safety and longevity |
ECO-100003.txt | Algorithm tuning — improved step detection accuracy in signal processing logic |
ECO-100004.txt | Packaging redesign — introduced eco-friendly materials and minimized waste |
The dataset consists of .txt files resembling real ECOs.
This dataset supports experimentation and learning in areas such as:
Ideal for projects simulating GenAI applications in product lifecycle management, documentation review, and engineering operations.
These documents were authored entirely by me to support my GenAI Capstone notebook.
They do not represent any real company or proprietary information.
Any resemblance to existing products or organizations is purely coincidental.
This dataset is used in the following notebook:
🧠HappyMatrix ECO Assistant
A GenAI-powered tool for analyzing engineering change orders with LangChain, Gemini, and ChromaDB.
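The retrieval step of such an assistant can be approximated without any external services. The sketch below ranks the four ECO files by word overlap with a query, using the one-line descriptions from the table above as stand-ins for the file contents, which are not shown here. It is a simplified substitute for the embedding search that a ChromaDB-based pipeline would perform, not the notebook's actual code.

```python
import re

# One-line descriptions from the file table, used as stand-ins for the
# actual file contents.
eco_summaries = {
    "ECO-100001.txt": "Enclosure update: added ventilation slots to improve thermal performance",
    "ECO-100002.txt": "Battery replacement: switch from lithium-polymer to solid-state for safety and longevity",
    "ECO-100003.txt": "Algorithm tuning: improved step detection accuracy in signal processing logic",
    "ECO-100004.txt": "Packaging redesign: introduced eco-friendly materials and minimized waste",
}

def tokenize(text):
    """Lowercase word tokens; punctuation and hyphens act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by the number of distinct query words they contain."""
    query_tokens = tokenize(query)
    scored = [
        (len(query_tokens & tokenize(text)), name)
        for name, text in documents.items()
    ]
    # Highest overlap first; ties fall back to file name order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]

battery_hit = retrieve("battery safety", eco_summaries)
packaging_hit = retrieve("packaging waste", eco_summaries)
print(battery_hit, packaging_hit)
```

In a full pipeline the retrieved file would be passed to the language model as context for answering the question.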
This dataset is shared under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to:
- Share: copy and redistribute the material
- Adapt: remix, transform, and build upon it
Under the following terms:
- Attribution: You must give appropriate credit.
- NonCommercial: You may not use the material for commercial purposes.
These ECO documents were created for educational demonstration purposes as part of the Google GenAI Capstone 2025.
Dataset Card for Dataset Name
RAG FOR RECOVERY
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]
Dataset Card for Dataset Name
RAG FOR DIFFERENT WORKOUTS
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]
WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric images categorized into 4 categories: grey cloth, grid cloth, yellow cloth, and pink flower. The first three classes are collected from the industrial production sites of WEIQIAO Textile, while the 'pink flower' class is gathered from the publicly available Cloth Flaw Dataset. Each category contains block-shape, point-like, and line-type defects with pixel-level annotations.
Dataset Card for Dataset Name
RAG FOR MEALS
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]
Dataset Card for Dataset Name
RAG FOR SUPPLEMENTS
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]
Dataset Card for Dataset Name
RAG FOR STRENGTH WORKOUTS
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]
Dataset Card for Dataset Name
RAG FOR CARDIO WORKOUTS
Dataset Details
Dataset Description
THIS IS A DATASET CREATED BY SELECTIVELY CHOOSING AND MERGING MULTIPLE DATASETS FROM VARIOUS SOURCES, INCLUDING OTHER DATASETS AND GENERATED DATASETS. FEEL FREE TO USE THESE ANYWHERE AND MAKE SURE TO CREDIT THE APPROPRIATE DATA SOURCES WHEREVER NECESSARY!! 😀
Curated by: [Navaneeth. K]