InDyne/Revenue dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lamini Earning Calls QA Dataset
Description
This dataset contains transcripts of earnings calls for various companies, along with questions and answers about the companies' financial performance and other relevant topics.
Format
The transcripts, questions, and answers are stored as JSON Lines files; each JSON object in a file contains the transcript of an earnings call for a single company.
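A minimal sketch of reading such a file, assuming one JSON object per line. The field names `transcript`, `question`, and `answer` are hypothetical; this excerpt of the card does not list the actual keys, so check the dataset page before relying on them.

```python
import io
import json


def read_jsonlines(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical in-memory sample; the keys are assumptions, not the real schema.
sample = io.StringIO(
    '{"transcript": "Q1 call text", "question": "What was revenue?", "answer": "Not stated"}\n'
    "\n"
    '{"transcript": "Q2 call text", "question": "Who spoke?", "answer": "The CFO"}\n'
)

records = list(read_jsonlines(sample))
print(len(records))
```

Reading line by line keeps memory use flat for long transcripts, since only one object is parsed at a time.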
Data Pipeline Code
The entire data pipeline… See the full description on the dataset page: https://huggingface.co/datasets/lamini/earnings-calls-qa.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
finRAG Datasets
This is the official Hugging Face repo of the finRAG datasets published by parsee.ai. More detailed information about the 3 datasets and the methodology can be found in the sub-directories for the individual datasets. We wanted to investigate how well current state-of-the-art (M)LLMs solve the relatively simple problem of extracting revenue figures from publicly available financial reports. To test this, we created 3 different datasets, all based on the same… See the full description on the dataset page: https://huggingface.co/datasets/parsee-ai/finRAG.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NFCorpus: 20 generated queries (BEIR Benchmark)
This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.
DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
Questions generated: 20
Code used for generation: evaluate_anserini_docT5query_parallel.py
Below is the old dataset card for the BEIR benchmark.
Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/cqadupstack-gaming-top-20-gen-queries.
https://choosealicense.com/licenses/cc0-1.0/
Adult Census Income Dataset
The following was retrieved from the UCI Machine Learning Repository. This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year. Description of fnlwgt (final weight)… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/adult-census-income.
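The four extraction conditions quoted above can be sketched as a record filter. This is an illustration only: the sample records and their values are hypothetical, and the dictionary keys simply reuse the variable names from the quoted condition rather than the column names of the published dataset.

```python
def is_clean_record(rec):
    """Return True if a record passes all four quoted extraction conditions."""
    return (
        rec["AAGE"] > 16
        and rec["AGI"] > 100
        and rec["AFNLWGT"] > 1
        and rec["HRSWK"] > 0
    )


# Hypothetical raw records, made up for illustration.
raw = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},  # passes
    {"AAGE": 15, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},  # fails AAGE > 16
    {"AAGE": 52, "AGI": 90, "AFNLWGT": 83311, "HRSWK": 13},     # fails AGI > 100
    {"AAGE": 28, "AGI": 25000, "AFNLWGT": 33800, "HRSWK": 0},   # fails HRSWK > 0
]

clean = [rec for rec in raw if is_clean_record(rec)]
print(len(clean))
```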
Dataset Card for "revenue-estimate-stocks"
More Information needed
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Earnings 22 dataset (also referred to as earnings22) is a 119-hour corpus of English-language earnings calls collected from global companies. The primary purpose is to serve as a benchmark for industrial and academic automatic speech recognition (ASR) models on real-world accented speech.
Dataset Card for Earnings 22
Dataset Summary
Earnings-22 provides a free-to-use benchmark of real-world, accented audio to bridge academic and industrial research. This dataset contains 125 files totalling roughly 119 hours of English-language earnings calls from global companies. It provides the full audio, transcripts, and accompanying metadata such as ticker symbol, headquarters country, and our defined "Language Region".
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/distil-whisper/earnings22.
https://choosealicense.com/licenses/cc0-1.0/
The dataset reports a collection of earnings call transcripts, the related stock prices, and the sector index. In terms of volume, there is a total of 188 transcripts, 11970 stock prices, and 1196 sector index values. All of these data originated in the period 2016-2020 and are related to the NASDAQ stock market. The data collection was made possible by Yahoo Finance and Thomson Reuters Eikon: Yahoo Finance enabled the search for stock values, and Thomson Reuters Eikon provided the earnings call transcripts. The dataset can be used as a benchmark for evaluating NLP techniques and their potential for financial applications. It can also be expanded by extending the period covered, following a similar procedure.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NFCorpus: 20 generated queries (BEIR Benchmark)
This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.
DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
Questions generated: 20
Code used for generation: evaluate_anserini_docT5query_parallel.py
Below is the old dataset card for the BEIR benchmark.
Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/trec-news-top-20-gen-queries.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CUADRevenueProfitSharingLegalBenchClassification An MTEB dataset Massive Text Embedding Benchmark
This task was constructed from the CUAD dataset. It consists of determining whether the clause requires a party to share revenue or profit with the counterparty for any technology, goods, or services.
Task category: t2c
Domains: Legal, Written
Reference: https://huggingface.co/datasets/nguha/legalbench
How to evaluate on this task
You can evaluate an embedding… See the full description on the dataset page: https://huggingface.co/datasets/mteb/CUADRevenueProfitSharingLegalBenchClassification.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
rahulisivaram5/income dataset hosted on Hugging Face and contributed by the HF Datasets community
language: en
Generate a clean Excel dataset with the following columns: Date (from 01-01-2023 to 31-12-2025), Region (North, South, East, West), Branch (Branch A to Branch E), Business Type (B2B & B2C), Partner ID (should be unique), Client ID (should be unique), Total Investment, Total Revenue, Revenue generated by B2B, Revenue generated by B2C, Revenue generated by Partner, Partner share of 40% from total revenue, Admin Expenses, Employee & HR Expenses, Marketing Expense, Technology… See the full description on the dataset page: https://huggingface.co/datasets/mitadhamdhere13/BusinessData.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NFCorpus: 20 generated queries (BEIR Benchmark)
This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.
DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
Questions generated: 20
Code used for generation: evaluate_anserini_docT5query_parallel.py
Below is the old dataset card for the BEIR benchmark.
Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/scidocs-top-20-gen-queries.
https://choosealicense.com/licenses/other/
Dataset Information
This dataset includes quarterly earnings reports for various US stocks.
Instruments Included
7000+ US Stocks
Dataset Columns
symbol: The stock ticker or financial instrument identifier associated with the data.
date: The end date of the fiscal period for which the financial data is reported.
reported_date: The actual date on which the company reported its earnings or financial results.
reported_eps: The earnings per share (EPS) that the… See the full description on the dataset page: https://huggingface.co/datasets/paperswithbacktest/Stocks-Quarterly-Earnings.
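As a small illustration of how the `date` and `reported_date` columns relate, the sketch below computes the number of days between the fiscal period end and the reporting date. The row values are hypothetical, and the ISO `YYYY-MM-DD` string format is an assumption; the excerpt does not state how dates are stored.

```python
from datetime import date


def reporting_lag_days(row):
    """Days between the fiscal period end (date) and the report (reported_date)."""
    period_end = date.fromisoformat(row["date"])
    reported = date.fromisoformat(row["reported_date"])
    return (reported - period_end).days


# Hypothetical row; the ticker and all values are made up for illustration.
row = {
    "symbol": "XYZ",
    "date": "2023-03-31",
    "reported_date": "2023-05-04",
    "reported_eps": 1.23,
}

lag = reporting_lag_days(row)
print(lag)
```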
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NFCorpus: 20 generated queries (BEIR Benchmark)
This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.
DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
Questions generated: 20
Code used for generation: evaluate_anserini_docT5query_parallel.py
Below is the old dataset card for the BEIR benchmark.
Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/climate-fever-top-20-gen-queries.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
lamini/earnings-raw dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc/
Adult
The Adult dataset from the UCI ML repository. A census dataset including the personal characteristics of each person and their income threshold.
Configurations and tasks
Configuration | Task | Description
encoding | | Encoding dictionary showing original values of encoded features.
income | Binary classification | Classify the person's income as over or under the threshold.
income-no race | Binary classification | As income, but the race feature is removed.
race Multiclass… See the full description on the dataset page: https://huggingface.co/datasets/mstz/adult.
7wolf/cr-y2024-summer-556-profit-points-17k dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NFCorpus: 20 generated queries (BEIR Benchmark)
This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.
DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
Questions generated: 20
Code used for generation: evaluate_anserini_docT5query_parallel.py
Below is the old dataset card for the BEIR benchmark.
Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/arguana-top-20-gen-queries.