100+ datasets found

CCDV Arxiv Summarization Dataset
kaggle.com
zip
Updated Dec 5, 2023
Cite
The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
Explore at:
Available download formats: zip (2219742528 bytes)
Dataset updated
Dec 5, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
CCDV Arxiv Summarization Dataset

Arxiv Summarization Dataset for CCDV

By ccdv (From Huggingface) [source]

About this dataset

The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

Each file has two columns, article and abstract: the article column holds the full text of a paper and the abstract column holds its reference summary. This pairing supports developing and evaluating models that summarize complex scientific documents.

How to use the dataset

File Description:

validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

Dataset Structure: Each file consists of two columns, article and abstract.

Usage Examples: This dataset can be utilized in various ways:

a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
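As a minimal sketch of the usage steps above, the following reads (article, abstract) pairs from one of the CSV files using only the standard library. It assumes each file has a header row naming the article and abstract columns; the helper name iter_pairs and the in-memory sample are hypothetical, for illustration only.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def iter_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (article, abstract) pairs from an open CSV file."""
    # Full-text articles can exceed the csv module's default field size limit,
    # so raise it to a large fixed value.
    csv.field_size_limit(2**31 - 1)
    reader = csv.DictReader(handle)
    for row in reader:
        # Assumption: the header row names these two columns exactly.
        yield row["article"], row["abstract"]


# Toy in-memory stand-in for train.csv (not real data from the dataset).
sample = io.StringIO(
    'article,abstract\n"Full text of a paper.","Short summary."\n'
)
pairs = list(iter_pairs(sample))
print(pairs[0][1])
```

For the real files, open them with open("train.csv", newline="", encoding="utf-8") and pass the handle to the same function; streaming row by row avoids loading the whole file into memory.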

Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
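To make the n-gram overlap idea concrete, here is a simplified ROUGE-N sketch. It is not the official ROUGE implementation: it assumes lowercased whitespace tokenization and omits stemming and other preprocessing, so its scores will differ from those of standard tooling.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N: clipped n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat", "the cat sat on the mat", n=1)
print(scores)
```

Recall is the share of reference n-grams recovered by the generated summary, which is the quantity the "recall-oriented" part of the name refers to.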

Research Ideas

Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.

Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.

Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...
Combined dataset from Arxiv and Wikipedia
kaggle.com
Updated Feb 1, 2025
Cite
Monica Avagyan (2025). Combined dataset from Arxiv and Wikipedia [Dataset]. https://www.kaggle.com/datasets/monikaavagyan/combined-dataset-from-arxiv-and-wikipedia
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 1, 2025
Dataset provided by
Kaggle
Authors
Monica Avagyan
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset is a combination of arXiv and Wikipedia data, containing metadata and textual content related to academic publications and encyclopedic knowledge. It includes key attributes such as authors, titles, digital object identifiers (DOI), categories, abstracts, update dates, URLs, and full text content of documents.

Key Features

Authors: Names of the researchers or contributors to each document.

Title: The title of the publication or article.

DOI: A unique identifier for academic papers.

Categories: Classification labels indicating the subject area (e.g., astrophysics, mathematics, physics).

Abstract: A summary of the document’s content.

Update Date: The most recent modification date of the document.

URL: A link to the full document.

Text: The full textual content of the document.

The dataset contains over 2.6 million unique values for text-based fields.

Use Cases

Academic research analysis: Studying trends in scientific publications over time.

Natural Language Processing (NLP): Developing models for summarization, classification, and text generation.

Knowledge extraction: Identifying key themes and topics in scientific and encyclopedic data.

Citation and impact studies: Analyzing author influence and research impact based on citations.

This dataset is a valuable resource for text mining, AI training, and scientific knowledge analysis, providing a rich blend of structured metadata and unstructured text.
unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured...
data.niaid.nih.gov
nde-dev.biothings.io
Updated Nov 3, 2023
+ more versions
Cite
Saier, Tarek; Krause, Johan; Färber, Michael (2023). unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (open subset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7752614
Explore at:
Dataset updated
Nov 3, 2023
Dataset provided by
Karlsruhe Institute of Technology
Authors
Saier, Tarek; Krause, Johan; Färber, Michael
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network. The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files. Typical uses are

Training of ML models (citation recommendation, summarization, LLMs)

Citation context analysis

Bibliographic analyses

Access

Download sample

Regarding the full data set, please note the following:

Note: this Zenodo record is the "open subset" of unarXive, which contains all permissively licensed papers from arXiv.org. You can find the full version here. The code used for generating the data set is publicly available.
ArXiv CS Papers Multi-Label Classification (200K)
kaggle.com
zip
Updated Jun 7, 2023
Cite
Sharukh Rahman (2023). ArXiv CS Papers Multi-Label Classification (200K) [Dataset]. https://www.kaggle.com/datasets/devintheai/arxiv-cs-papers-multi-label-classification-200k-v1
Explore at:
Available download formats: zip (83841332 bytes)
Dataset updated
Jun 7, 2023
Authors
Sharukh Rahman
Description
The ArXiv CS Papers Multi-Label Classification dataset is a comprehensive collection of research papers from the computer science domain. This dataset is intended for multi-label classification tasks and contains a diverse range of research papers spanning various topics within computer science.

The dataset consists of more than 200,000 research papers and includes the following columns:

Paper ID: A unique identifier for each research paper in the dataset.

Title: The title of the research paper.

Abstract: A brief summary or abstract of the research paper.

Year: The publication year of the research paper.

Primary Category: The primary category of the research paper, representing the main topic or area of focus.

Categories: Additional categories or subtopics associated with the research paper.

This dataset is well-suited for tasks related to text classification, topic modeling, information retrieval, and other natural language processing (NLP) tasks. Researchers and practitioners can leverage this dataset to develop and evaluate machine learning models for multi-label classification on a wide range of computer science topics.
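For multi-label classification, the Categories column typically has to be turned into a multi-hot target vector per paper. The sketch below shows one way to do that. It assumes each Categories value is a whitespace-separated string of arXiv category codes, which the description does not confirm; the helper names and sample values are hypothetical.

```python
from typing import Dict, List


def build_label_index(rows: List[str]) -> Dict[str, int]:
    """Map every distinct label to a stable column position."""
    labels = sorted({label for row in rows for label in row.split()})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(row: str, index: Dict[str, int]) -> List[int]:
    """Encode one Categories value as a 0/1 vector over all known labels."""
    vec = [0] * len(index)
    for label in row.split():
        vec[index[label]] = 1
    return vec


# Toy values, assuming whitespace-separated category codes.
categories = ["cs.LG cs.AI", "cs.CV", "cs.AI cs.CL"]
index = build_label_index(categories)
vectors = [multi_hot(row, index) for row in categories]
print(index)
print(vectors)
```

If the real column uses a different delimiter (for example commas or a list literal), only the split step needs to change.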

Note: Please refer to the original ArXiv repository for access to the full-text content of the papers and proper citation guidelines. This dataset contains metadata and should be used for research and educational purposes only.

We hope that the ArXiv CS Papers Multi-Label Classification dataset serves as a valuable resource for researchers, data scientists, and machine learning enthusiasts in their quest to advance knowledge and understanding in the field of computer science.
Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...
springernature.figshare.com
application/csv
Updated Sep 24, 2024
Cite
Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
Explore at:
Available download formats: application/csv
Unique identifier
https://doi.org/10.6084/m9.figshare.26133973.v1
Dataset updated
Sep 24, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].

Mixed MMQA-OMAQ is composed of the 140 question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifiers that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).
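The "MedicationQA_N" convention can be resolved back to a row position with a small parser. This is an illustrative sketch only: the function name is hypothetical, and it assumes the identifier has exactly the form MedicationQA_ followed by a non-negative integer.

```python
import re
from typing import Optional

# Assumed identifier shape: "MedicationQA_" followed by digits only.
_MEDICATIONQA_ID = re.compile(r"^MedicationQA_(\d+)$")


def medicationqa_row(identifier: str) -> Optional[int]:
    """Return the 0-indexed MedicationQA row, or None for other identifiers."""
    match = _MEDICATIONQA_ID.match(identifier)
    return int(match.group(1)) if match else None


print(medicationqa_row("MedicationQA_17"))
```

Because N is 0-indexed, the returned value can be used directly as a list or array index into the rows of the original MedicationQA data once those are loaded separately.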

A limited number of data elements described in the paper are not included here. The following elements are excluded:

The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

The free-text comments written by raters during the ratings process.

Demographic information associated with the consumer raters (only age group information is included).

References

Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

Description of files and sheets

Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.
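As a rough sketch of how the bias_presence column could be summarized, the snippet below tallies its three values and computes the share of ratings that report any bias. The in-memory sample is toy data with only that one column; the real file contains additional columns, and this is not the analysis performed in the paper.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def bias_presence_counts(handle: TextIO) -> Counter:
    """Count occurrences of each bias_presence value in a ratings CSV."""
    reader = csv.DictReader(handle)
    return Counter(row["bias_presence"] for row in reader)


# Toy stand-in for ratings_independent.csv (not real ratings).
sample = io.StringIO(
    "bias_presence\n"
    "No bias\n"
    "Minor bias\n"
    "No bias\n"
    "Severe bias\n"
)
counts = bias_presence_counts(sample)
# Share of ratings reporting minor or severe bias.
share_with_bias = 1 - counts["No bias"] / sum(counts.values())
print(counts, share_with_bias)
```

Since Mixed MMQA-OMAQ instances are triple-rated, a per-question summary would need to group ratings by question before aggregating; the sketch above counts individual ratings only.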

Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated, instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

Omiye et al. [other datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].

Version history

Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique ids associated with each question.

Version 1: Contained datasets of questions without ratings. Consistent with v1 available as a preprint on arXiv (https://arxiv.org/abs/2403.12025).

WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.
Data from: Beyond designer’s knowledge-Generating materials design...
figshare.com
bin
Updated Sep 13, 2024
Cite
quanliang liu (2024). Beyond designer’s knowledge-Generating materials design hypotheses via a large language model [Dataset]. http://doi.org/10.6084/m9.figshare.26322460.v4
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.26322460.v4
Dataset updated
Sep 13, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
quanliang liu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data supporting the paper entitled "Beyond designer’s knowledge: Generating materials design hypotheses via a large language model". For detailed information about the file contents, see 'Folder Contents Description.pdf'. Provider is willing to license its rights in the Prompts and Codes (“Provider’s Rights”) to academic researchers to use free of charge solely for academic, non-commercial research purposes subject to the terms and conditions outlined herein. The Prompts and Codes were created at the University of Wisconsin ("UW") by Quanliang Liu and Hyunseok Oh. Please note Provider's Rights may include, but are not limited to, certain patents or patent applications owned by the Wisconsin Alumni Research Foundation (“WARF”). https://arxiv.org/abs/2409.06756
arxiv-image-text
huggingface.co
Updated Sep 18, 2023
Cite
nopperl (2023). arxiv-image-text [Dataset]. https://huggingface.co/datasets/nopperl/arxiv-image-text
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 18, 2023
Authors
nopperl
License
https://choosealicense.com/licenses/pddl/
Description
arXiv Figures Dataset

This dataset contains image-text pairs extracted from figures from papers published until the end of 2020 in the arXiv repository. The dataset can be used to train CLIP models. This repo contains a Parquet file containing the metadata of a WebDataset in img2dataset format. The images themselves are not distributed and need to be retrieved. Note that the images cannot be retrieved by an HTTP URL, so img2dataset cannot be used as is to retrieve the data. Instead… See the full description on the dataset page: https://huggingface.co/datasets/nopperl/arxiv-image-text.
Supplementary Data: Status of the scalar singlet dark matter model...
data-staging.niaid.nih.gov
explore.openaire.eu
+1more
Updated Jan 24, 2020
Cite
The GAMBIT Collaboration (2020). Supplementary Data: Status of the scalar singlet dark matter model (arXiv:1705.07931) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_801510
Explore at:
Dataset updated
Jan 24, 2020
Authors
The GAMBIT Collaboration
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary Data

Status of the scalar singlet dark matter model arXiv:1705.07931

The files in this record contain data for the scalar singlet dark matter model considered in the GAMBIT "Round 1" scalar singlet paper.

The files consist of

Three YAML files, each corresponding to a different parameter range

StandardModel_SLHA2_SingletDM_scan_15.yaml, a universal YAML fragment included from the other three YAML files

Three hdf5 files. SingletDM.hdf5 contains the combined results of all sampling runs, and is the basis for the profile likelihood plots in the paper. SingletDM_TW_full.hdf5 and SingletDM_TW_lowmass.hdf5 contain the results from T-Walk scans over the full and low-mass parameter ranges, respectively. These are the bases for the marginalised posterior plots in the paper.

An example pip file corresponding to each hdf5 file, for producing plots using pippi

A tarball best_fits_yaml.tar.gz containing YAML files of the best-fit point in each subregion of the fit.

The YAML files corresponding to different parameter ranges follow the naming scheme SingletDM_[slice].yaml, where slice may be full, lowmass or neck. Each of these YAML files contains entries in the Scanners node for running Diver, MultiNest, TWalk and GreAT.
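The naming scheme can be spelled out programmatically, which is handy when scripting over all three parameter ranges. The helper below is a hypothetical illustration built only from the slice names stated above; it is not part of GAMBIT.

```python
# Slice names as given in the naming scheme above.
SLICES = ("full", "lowmass", "neck")


def yaml_name(slice_name: str) -> str:
    """Return the YAML file name for a given parameter-range slice."""
    if slice_name not in SLICES:
        raise ValueError(f"unknown slice: {slice_name!r}")
    return f"SingletDM_{slice_name}.yaml"


names = [yaml_name(s) for s in SLICES]
print(names)
```

Rejecting unknown slice names early avoids silently constructing a file name that does not exist in the record.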

A few caveats to keep in mind:

The YAML files that we give here are updated compared to the ones that we used when generating the hdf5 file, in order to match the set of available options in the release version of GAMBIT 1.0.0. The included physics and numerics are however identical.

The YAML files are designed to work with the tagged release of GAMBIT 1.0.0, and the pip file is tested with pippi 2.0, commit 2ab061a8. They may or may not work with later versions of either software (but you can of course always obtain the version that they do work with via the git history).

The pip file is an example only. Users wishing to reproduce the more advanced plots in any of the GAMBIT papers should contact us for tips or scripts, or experiment for themselves. Many of these scripts are in multiple parts and require undocumented manual interventions and steps in order to implement various plot-specific customisations, so please don't expect the same level of polish as for files provided here or in the GAMBIT repo.
Data from: Extracting Accurate Materials Data from Research Papers with...
figshare.com
zip
Updated Dec 4, 2023
Cite
Dane Morgan; Maciej Polak (2023). Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering [Dataset]. http://doi.org/10.6084/m9.figshare.22213747.v5
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.22213747.v5
Dataset updated
Dec 4, 2023
Dataset provided by
Figshare http://figshare.com/
Authors
Dane Morgan; Maciej Polak
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data supporting the paper entitled "Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering" by Maciej P. Polak and Dane Morgan. https://arxiv.org/abs/2303.05352

BulkModulus_test_database_MPPolak_DMorgan.xlsx - dataset of bulk modulus text passages and sentences used for methods assessment.

CriticalCoolingRates_MGs_database_MPPolak_DMorgan.xlsx - a database of critical cooling rates of metallic glasses. The data is presented in three versions (described in detail in the paper), i.e. "raw", "cleaned", and "standardized". The critical cooling rate additionally includes manually extracted data serving as ground truth for tests, in sheets labeled as "manual". In addition, a "standardized_MG" database is included, which limits the results to metallic glasses only, together with "standardized_tables_MG" for values extracted from tables, and "Figure_Classification" which contains Figure numbers, captions, and DOIs of their source documents.

YieldStrength_HEAs_database_MPPolak_DMorgan.xlsx - a database of yield strengths in the context of high entropy alloys. The data is presented in three versions (described in detail in the paper), i.e. "raw", "cleaned", and "standardized". In addition, a "standardized_HEA" database is included, which limits the results to HEAs only, together with "standardized_tables_HEA" for values extracted from tables, and "Figure_Classification" which contains Figure numbers, captions, and DOIs of their source documents.

ChatExtract_Code_MPPolak_DMorgan.zip - These files contain the ChatExtract code with a short example and instructions.
Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities...
figshare.com
zip
Updated Oct 21, 2025
Cite
Dane Morgan; Maciej P. Polak (2025). Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" [Dataset]. http://doi.org/10.6084/m9.figshare.28559639.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.28559639.v2
Dataset updated
Oct 21, 2025
Dataset provided by
Figshare http://figshare.com/
Authors
Dane Morgan; Maciej P. Polak
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" https://arxiv.org/abs/2503.12326

This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. Below is a description of the key components:

Dataset Output Files

*.out_data - Results of LLM-based visual data extraction from plot images. These files contain the extracted data points in CSV-like format.

*.out_code - Python code generated by the LLM to recreate the source plot using the extracted data.

*.out_conversation - Full conversations with the LLM conducted by PlotExtract, including prompts and responses.

interpolated_* - Visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground-truth. These correspond to the interpolation accuracy assessments described in the paper.

pointwise_* - Visual and statistical comparisons on a point-by-point basis between extracted and ground-truth data. These correspond to pointwise accuracy evaluations from the main text.

*.stats - Numerical summaries of extraction accuracy, referenced in the associated visual comparisons.

*.csv - Manually extracted ground truth data used as reference for evaluating extraction accuracy.

All of the above files are generated automatically during PlotExtract execution.

Published, Synthetic, and chartQA Dataset

The Published Dataset does not include original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists:

DOI of the source publication

Figure number

Filename used in this dataset

The Synthetic Dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.

The chartQA Dataset (https://doi.org/10.48550/arXiv.2203.10244) includes chartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.

There are two equivalent datasets: FULL and CROPPED, the first one containing original images and the second one containing images cropped as much as possible to preserve the plot only and remove additional text.

Codes

All source code, including PlotExtract and supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip.

Each script contains usage instructions in-line and is intended to be self-explanatory for users familiar with Python-based data processing workflows.
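To illustrate what an interpolation-based comparison between extracted and ground-truth points can look like, here is a generic sketch. It is not the PlotExtract evaluation code and the metric is an assumption chosen for illustration: the extracted points are linearly interpolated at each ground-truth x value (clamped at the ends) and the mean absolute error in y is reported. It assumes the extracted x values are distinct.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

Point = Tuple[float, float]


def interpolate(points: Sequence[Point], x: float) -> float:
    """Linearly interpolate y at x from points sorted by distinct x values."""
    xs = [p[0] for p in points]
    # Clamp to the end values outside the covered x range.
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def mean_abs_error(extracted: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean absolute y error of interpolated extracted data at ground-truth x."""
    ordered = sorted(extracted)
    errors = [abs(interpolate(ordered, x) - y) for x, y in truth]
    return sum(errors) / len(errors)


# Toy points, not taken from the dataset.
extracted = [(0.0, 0.0), (2.0, 2.0)]
truth = [(0.0, 0.0), (1.0, 1.5), (2.0, 2.0)]
mae = mean_abs_error(extracted, truth)
print(mae)
```

Interpolating first lets the two point sets be compared even when the extraction did not sample the curve at the same x positions as the ground truth.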
Data from: Large Language Model
zenodo.org
application/gzip
Updated Jan 24, 2020
Cite
Gregory Diamos; Mostofa (2020). Large Language Model [Dataset]. http://doi.org/10.5281/zenodo.1492880
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.1492880
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo http://zenodo.org/
Authors
Gregory Diamos; Mostofa
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
One of the large language models trained in this paper: https://arxiv.org/abs/1810.10045
Summary of citation networks arXiv-HepTh and arXiv-HepPh.
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Yuichiro Yasui; Junji Nakano (2023). Summary of citation networks arXiv-HepTh and arXiv-HepPh. [Dataset]. http://doi.org/10.1371/journal.pone.0269845.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0269845.t002
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS http://plos.org/
Authors
Yuichiro Yasui; Junji Nakano
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary of citation networks arXiv-HepTh and arXiv-HepPh.
subset_arxiv_papers_with_embeddings
huggingface.co
Updated Jun 30, 2015
Cite
MongoDB (2015). subset_arxiv_papers_with_embeddings [Dataset]. https://huggingface.co/datasets/MongoDB/subset_arxiv_papers_with_embeddings
Available as Croissant, a format for machine-learning datasets (see mlcommons.org/croissant).
Dataset updated
Jun 30, 2015
Dataset authored and provided by
MongoDB
Description
This dataset is a curated subset of the original arXiv dataset, each entry enriched with a 256-dimensional embedding vector. The embeddings are generated using OpenAI's "text-embedding-3-small" model. For each data point, the embedding is created by concatenating the text of the title, author(s), and abstract into a single string, which is then processed by the embedding model. This approach captures the semantic essence of each document, facilitating tasks such as similarity search… See the full description on the dataset page: https://huggingface.co/datasets/MongoDB/subset_arxiv_papers_with_embeddings.
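The similarity search mentioned in this description can be illustrated with a minimal sketch. It assumes each record is available as a (title, embedding) pair, which is a hypothetical layout and not the dataset's documented schema, and it uses toy 3-dimensional vectors in place of the 256-dimensional embeddings. Cosine similarity is one common choice of metric; the description does not say which metric is intended.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, records, k=3):
    """Return the k (score, title) pairs most similar to the query.

    records is a list of (title, embedding) pairs; this layout is an
    assumption made for illustration only.
    """
    scored = [(cosine_similarity(query, emb), title) for title, emb in records]
    scored.sort(reverse=True)
    return scored[:k]


# Toy vectors standing in for real 256-dimensional embeddings.
papers = [
    ("paper A", [1.0, 0.0, 0.0]),
    ("paper B", [0.9, 0.1, 0.0]),
    ("paper C", [0.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], papers, k=2)
for score, title in results:
    print(f"{title}: {score:.3f}")
```

For a collection of realistic size, a vector index would normally replace the linear scan shown here.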
3M+ Academic Papers: Titles & Abstracts
kaggle.com
zip
Updated Sep 18, 2025
+ more versions
Cite
David Arias (2025). 3M+ Academic Papers: Titles & Abstracts [Dataset]. https://www.kaggle.com/datasets/beta3logic/3m-academic-papers-titles-and-abstracts
Available download formats: zip (1478156333 bytes)
Dataset updated
Sep 18, 2025
Authors
David Arias
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Comprehensive Academic Papers Dataset: 3M+ Research Paper Titles and Abstracts

📋 Overview

This dataset is a comprehensive collection of over 3 million research paper titles and abstracts, curated and consolidated from multiple high-quality academic sources. The dataset provides a unified, clean, and standardized format for researchers, data scientists, and machine learning practitioners working on natural language processing, academic research analysis, and knowledge discovery tasks.

🎯 Key Features

3.6+ million scientific papers with titles and abstracts

Multi-domain coverage: Physics, Mathematics, Computer Science, Biology, Medicine, and more

Standardized format: Consistent title and abstract columns

Quality assured: Validated using Pydantic models and cleaned of duplicates/null values

Ready-to-use: Pre-processed and formatted for immediate analysis

Format: CSV

Language: English

📊 Dataset Statistics

Metric	Value
Total Records	~3,000,000+
Columns	2 (title, abstract)
File Size	4.15 GB
Format	CSV
Duplicates	Removed
Missing Values	Removed

🗂️ Dataset Structure

cleaned_papers.csv
├── title (string): Scientific paper title
└── abstract (string): Scientific paper abstract

🔄 Data Processing Pipeline

The dataset underwent a rigorous cleaning and standardization process:

Data Import: Automated import from multiple sources (Kaggle API, Hugging Face)

Column Standardization: Mapping various column names to consistent title and abstract format

Data Validation: Pydantic model validation ensuring data quality

Duplicate Removal: Advanced deduplication based on title and abstract similarity

Null Value Handling: Removal of records with missing titles or abstracts

Quality Assurance: Final validation and statistics generation
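The column standardization, null handling, and duplicate removal steps listed above can be sketched as follows. This is a simplified illustration, not the author's pipeline: the alias table is hypothetical, deduplication here is exact matching after case and whitespace normalization (the description mentions similarity-based deduplication, which is not reproduced), and the Pydantic validation step is omitted so that the sketch depends only on the standard library.

```python
import csv
import io

# Hypothetical mapping from source-specific column names to the two
# target columns; the real pipeline's mapping is not given in the source.
COLUMN_ALIASES = {
    "title": "title", "Title": "title", "paper_title": "title",
    "abstract": "abstract", "Abstract": "abstract", "summary": "abstract",
}


def normalize_key(text):
    """Lowercase and collapse whitespace so near-identical strings match."""
    return " ".join(text.lower().split())


def clean_records(rows):
    """Standardize columns, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {}
        for name, value in row.items():
            target = COLUMN_ALIASES.get(name)
            if target and value is not None:
                record[target] = value.strip()
        title = record.get("title", "")
        abstract = record.get("abstract", "")
        if not title or not abstract:
            continue  # null value handling: skip rows missing either field
        key = (normalize_key(title), normalize_key(abstract))
        if key in seen:
            continue  # duplicate removal (exact match after normalization)
        seen.add(key)
        cleaned.append({"title": title, "abstract": abstract})
    return cleaned


# Made-up input with one duplicate and one row lacking an abstract.
raw_csv = io.StringIO(
    "Title,summary\n"
    "Deep Nets,We study deep nets.\n"
    "deep  nets,We study deep nets.\n"
    "No Abstract,\n"
)
cleaned = clean_records(csv.DictReader(raw_csv))
print(cleaned)
```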

💡 Use Cases

This dataset is ideal for:

Natural Language Processing: Text classification, sentiment analysis, topic modeling

Scientific Literature Analysis: Trend analysis, domain classification, citation prediction

Machine Learning Research: Training language models, text summarization, information extraction

Academic Research: Bibliometric analysis, research trend identification

Educational Applications: Building search engines, recommendation systems

🔗 Data Sources and Attribution

This dataset consolidates academic papers from the following sources:

Kaggle Datasets:

ArXiv Scientific Research Papers Dataset by @sumitm004

Cornell University ArXiv Dataset by @Cornell-University

Hugging Face Datasets:

ML-ArXiv-Papers by @CShorten

ArXiv Biology by @zeroshot

ArXiv Data Extended by @wrapper228

Stroke PubMed Abstracts by @Gaborandi

PubMed ArXiv Abstracts Data by @brainchalov

Abstracts Cleaned by @Eitanli

🔄 Update Schedule

This dataset represents a point-in-time consolidation. Future versions may include:

Additional academic sources

Extended fields (authors, publication dates, venues)

Domain-specific subsets

Enhanced metadata

📄 License and Usage

Please respect the individual licenses of the source datasets. This consolidated version is provided for research and educational purposes. When using this dataset:

Citation: Please cite this dataset and acknowledge the original data sources

Attribution: Credit the original dataset creators listed above

Compliance: Ensure compliance with individual dataset licenses

Academic Use: Primarily intended for non-commercial, academic, and research purposes

🙏 Acknowledgments

Special thanks to all the original dataset creators and the academic communities that make their research data publicly available. This work builds upon their valuable contributions to open science and knowledge sharing.

Keywords: academic papers, research abstracts, NLP, machine learning, text mining, scientific literature, ArXiv, PubMed, natural language processing, research dataset
arxiv-hard-negatives-cross-encoder
huggingface.co
Updated Apr 20, 2025
Cite
Aarush (2025). arxiv-hard-negatives-cross-encoder [Dataset]. https://huggingface.co/datasets/chungimungi/arxiv-hard-negatives-cross-encoder
Dataset updated
Apr 20, 2025
Authors
Aarush
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset contains hard negative examples generated using cross-encoders for training dense retrieval models. The data was used in the paper Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval. This dataset is part of the Hugging Face collection: arxiv-hard-negatives-68027bbc601ff6cc8eb1f449 Please cite if you found this dataset useful 🤗 @misc{sinha2025dontretrievegenerateprompting, title={Don't Retrieve, Generate: Prompting LLMs for Synthetic… See the full description on the dataset page: https://huggingface.co/datasets/chungimungi/arxiv-hard-negatives-cross-encoder.
Replication Data for: The economics of lost knowledge: modeling the...
dataverse.csuc.cat
portalrecerca.udl.cat
+1more
application/gzip +3
Updated Jun 20, 2025
Cite
Jorge Chamorro-Padial; Francisco-Javier Rodrigo-Ginés; Rosa María Rodríguez Sánchez; Rosa Maria Gil Iranzo; Roberto García González (2025). Replication Data for: The economics of lost knowledge: modeling the knowledge cost due to non-FAIR data practices [Dataset]. http://doi.org/10.34810/data2382
Available download formats: bin (142832628), application/vnd.sqlite3 (81457152), txt (1917), bin (161598299), txt (6698), application/gzip (3776271875)
Unique identifier
https://doi.org/10.34810/data2382
Dataset updated
Jun 20, 2025
Dataset provided by
CORA.Repositori de Dades de Recerca
Authors
Jorge Chamorro-Padial; Francisco-Javier Rodrigo-Ginés; Rosa María Rodríguez Sánchez; Rosa Maria Gil Iranzo; Roberto García González
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset funded by
Agencia Estatal de Investigación
Description
This file contains the replication data for the paper "The Economics of Lost Knowledge: Modeling the Knowledge Cost Due to Non-FAIR Data Practices." It includes the two networks used in the paper, arXiv and OpenAlex, a SQLite database to check whether a link is available on the internet, and the raw network as extracted from arXiv.
Replication Data for: Order Book Queue Hawkes-Markovian Modeling
dataverse.harvard.edu
Updated Jul 21, 2021
Cite
Shihao Yang (2021). Replication Data for: Order Book Queue Hawkes-Markovian Modeling [Dataset]. http://doi.org/10.7910/DVN/ZIRH84
Available as Croissant, a format for machine-learning datasets (see mlcommons.org/croissant).
Unique identifier
https://doi.org/10.7910/DVN/ZIRH84
Dataset updated
Jul 21, 2021
Dataset provided by
Harvard Dataverse
Authors
Shihao Yang
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Replication Data for: Order Book Queue Hawkes-Markovian Modeling. Manuscript available at https://arxiv.org/abs/2107.09629.
Data from: Accelerometer-Based Multivariate Time-Series Dataset for Calf...
data.niaid.nih.gov
Updated Aug 13, 2024
Cite
Dissanayake, Oshana; McPherson, Sarah E.; Allyndrée, Joseph; Kennedy, Emer; Cunningham, Padraig; Riaboff, Lucile (2024). Accelerometer-Based Multivariate Time-Series Dataset for Calf Behavior Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13259481
Dataset updated
Aug 13, 2024
Dataset provided by
University College Dublin
VistaMilk SFI Research Centre, Ireland
Authors
Dissanayake, Oshana; McPherson, Sarah E.; Allyndrée, Joseph; Kennedy, Emer; Cunningham, Padraig; Riaboff, Lucile
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
AcTBeCalf Dataset Description

The AcTBeCalf dataset is designed to support the classification of pre-weaned calf behaviors from accelerometer data. It contains accelerometer readings aligned with annotated behaviors, providing a resource for research in multivariate time-series classification and animal behavior analysis. The dataset includes accelerometer data collected from 30 pre-weaned Holstein Friesian and Jersey calves, housed in group pens at the Teagasc Moorepark Research Farm, Ireland. Each calf was equipped with a 3D accelerometer sensor (AX3, Axivity Ltd, Newcastle, UK) sampling at 25 Hz and attached to a neck collar from the first week after birth, over a period of 13 weeks.

This dataset encompasses 27.4 hours of accelerometer data aligned with calf behaviors, including both prominent behaviors like lying, standing, and running, as well as less frequent behaviors such as grooming, social interaction, and abnormal behaviors.

The dataset consists of a single CSV file with the following columns:

dateTime: Timestamp of the accelerometer reading, sampled at 25 Hz.

calfid: Identification number of the calf (1-30).

accX: Accelerometer reading for the X axis (top-bottom direction)*.

accY: Accelerometer reading for the Y axis (backward-forward direction)*.

accZ: Accelerometer reading for the Z axis (left-right direction)*.

behavior: Annotated behavior based on an ethogram of 23 behaviors.

segId: Segment identification number associated with each accelerometer reading/row, representing all readings of the same behavior segment.

* The directions are given relative to the position of the accelerometer sensor on the calf.
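As a rough illustration of how the columns described above fit together, the sketch below groups rows by segId and derives each segment's behavior label and duration from the 25 Hz sampling rate. The sample rows, including the timestamp format and all values, are made up for illustration; only the column names and the sampling rate come from the description.

```python
import csv
import io
from collections import defaultdict

SAMPLING_HZ = 25  # sampling rate stated in the dataset description


def segment_summary(rows):
    """Map each segId to (behavior, duration in seconds)."""
    counts = defaultdict(int)
    labels = {}
    for row in rows:
        seg = row["segId"]
        counts[seg] += 1
        # All rows of a segment share one behavior; keep the first seen.
        labels.setdefault(seg, row["behavior"])
    return {seg: (labels[seg], counts[seg] / SAMPLING_HZ) for seg in counts}


# Made-up rows using the documented column names.
sample = io.StringIO(
    "dateTime,calfid,accX,accY,accZ,behavior,segId\n"
    "t0,1,0.10,0.90,0.00,lying,1\n"
    "t1,1,0.10,0.90,0.10,lying,1\n"
    "t2,1,0.50,0.40,0.30,running,2\n"
)
summary = segment_summary(csv.DictReader(sample))
print(summary)
```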

Code Files Description

The dataset is accompanied by several code files to facilitate the preprocessing and analysis of the accelerometer data and to support the development and evaluation of machine learning models. The main code files included in the dataset repository are:

accelerometer_time_correction.ipynb: This notebook corrects the accelerometer time drift, ensuring the alignment of the accelerometer data with the reference time.

shake_pattern_detector.py: This script includes an algorithm to detect shake patterns in the accelerometer signal for aligning the accelerometer time series with reference times.

aligning_accelerometer_data_with_annotations.ipynb: This notebook aligns the accelerometer time series with the annotated behaviors based on timestamps.

manual_inspection_ts_validation.ipynb: This notebook provides a manual inspection process for ensuring the accurate alignment of the accelerometer data with the annotated behaviors.

additional_ts_generation.ipynb: This notebook generates additional time-series data from the original X, Y, and Z accelerometer readings, including Magnitude, ODBA (Overall Dynamic Body Acceleration), VeDBA (Vectorial Dynamic Body Acceleration), pitch, and roll.

genSplit.py: This script provides the logic used for the generalized subject separation for machine learning model training, validation and testing.

active_inactive_classification.ipynb: This notebook details the process of classifying behaviors into active and inactive categories using a RandomForest model, achieving a balanced accuracy of 92%.

four_behv_classification.ipynb: This notebook employs the mini-ROCKET feature derivation mechanism and a RidgeClassifierCV to classify behaviors into four categories: drinking milk, lying, running, and other, achieving a balanced accuracy of 84%.
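Three of the derived series named for additional_ts_generation.ipynb (Magnitude, ODBA, and VeDBA) can be sketched using their commonly used definitions. This is not the notebook's code: the running-mean estimate of the static component and the window length are assumptions, and pitch and roll are left out because their sign and axis conventions depend on sensor orientation.

```python
import math


def running_mean(values, window):
    """Centered running mean; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def derived_signals(acc_x, acc_y, acc_z, window=3):
    """Return (magnitude, odba, vedba) lists for three axis series.

    The static (gravity) component of each axis is estimated with a
    running mean; the window length is an assumed parameter.
    """
    static = [running_mean(axis, window) for axis in (acc_x, acc_y, acc_z)]
    magnitude, odba, vedba = [], [], []
    for i, (x, y, z) in enumerate(zip(acc_x, acc_y, acc_z)):
        # Dynamic component = raw reading minus static estimate.
        dx = x - static[0][i]
        dy = y - static[1][i]
        dz = z - static[2][i]
        magnitude.append(math.sqrt(x * x + y * y + z * z))
        odba.append(abs(dx) + abs(dy) + abs(dz))
        vedba.append(math.sqrt(dx * dx + dy * dy + dz * dz))
    return magnitude, odba, vedba


# A constant signal has no dynamic component, so ODBA and VeDBA are zero.
mag, odba, vedba = derived_signals([3.0] * 3, [4.0] * 3, [0.0] * 3)
print(mag, odba, vedba)
```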

Kindly cite one of the following papers when using this data:

Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Evaluating ROCKET and Catch22 features for calf behaviour classification from accelerometer data using Machine Learning models. arXiv preprint arXiv:2404.18159.

Dissanayake, O., McPherson, S. E., Allyndrée, J., Kennedy, E., Cunningham, P., & Riaboff, L. (2024). Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars. arXiv preprint arXiv:2406.17352
Expressive Gaussian mixture models for high-dimensional statistical...
figshare.manchester.ac.uk
opendatalab.com
txt
Updated May 31, 2023
Cite
Darren Price; Stephen Menary (2023). Expressive Gaussian mixture models for high-dimensional statistical modelling: simulated data and neural network model files [Dataset]. http://doi.org/10.48420/17136839.v1
Available download formats: txt
Unique identifier
https://doi.org/10.48420/17136839.v1
Dataset updated
May 31, 2023
Dataset provided by
University of Manchester
Authors
Darren Price; Stephen Menary
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Neural network model files and Madgraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences" (arXiv:2108.11481).

Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models
Training CNNs with Low-Rank Filters for Efficient Image Classification:...
zenodo.org
data.niaid.nih.gov
+1more
application/gzip, csv
Updated Jan 24, 2020
Cite
Yani Ioannou (2020). Training CNNs with Low-Rank Filters for Efficient Image Classification: Trained Models [Dataset]. http://doi.org/10.5281/zenodo.53189
Available download formats: application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.53189
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yani Ioannou
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Models from experiments referenced in the paper "Training CNNs with Low-Rank Filters for Efficient Image Classification", https://arxiv.org/abs/1511.06744

Model names differ from those in the paper, but the CSV file for each set of experiments maps the paper's name for each model to the model's actual name here:

cifarma.csv: Network-in-Network CIFAR10 Models

mitma.csv: MIT Places Models

googlenetma.csv: GoogLeNet ILSVRC2012 Models

vggma.csv: VGG-11 ILSVRC2012 Models

Cite

The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset

CCDV Arxiv Summarization Dataset

Arxiv Summarization Dataset for CCDV

Explore at:

10 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: zip (2219742528 bytes)

Dataset updated

Dec 5, 2023

Authors

The Devastator

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

By ccdv (From Huggingface) [source]

About this dataset

The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

With columns labeled article and abstract, this dataset provides flexibility in developing robust models for summarizing complex scientific documents.

How to use the dataset

Introduction:

File Description:

validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

Dataset Structure: Each file consists of two columns, article and abstract.

Usage Examples: This dataset can be utilized in various ways:

a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
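The n-gram overlap idea behind ROUGE can be sketched in a few lines. This is a simplified ROUGE-N (lowercased whitespace tokenization, no stemming, single reference) written for illustration; it is not a replacement for an established ROUGE implementation, and its scores will not match one exactly. The function name rouge_n and the example sentences are made up.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for clipped n-gram overlap."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference_summary = "the cat sat on the mat"
generated_summary = "the cat lay on the mat"
scores = rouge_n(generated_summary, reference_summary, n=1)
print("precision=%.3f recall=%.3f f1=%.3f" % scores)
```

In practice, a generated summary for each article in test.csv would be scored against the abstract column and the scores averaged over the file.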

Research Ideas

Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.

Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.

Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
